Find images that match a text plus image query, and retrieve matching text descriptions for images
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
arXiv paper abstract https://arxiv.org/abs/2102.05918
arXiv PDF paper https://arxiv.org/pdf/2102.05918.pdf
... In this paper, we leverage a noisy dataset of over one billion image alt-text pairs
... A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
... The aligned visual and language representations also set new state-of-the-art results on Flickr30K and MSCOCO benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
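To make the excerpts above more concrete, here is a minimal sketch in plain Python of the two ideas they mention: a contrastive loss that pulls matching image and text embeddings together, and cross-modal search over the resulting shared embedding space. This is not the paper's code. The function names, the fixed temperature of 0.1, and the rule of summing the image and text embeddings to form a text + image query are all assumptions made for illustration, and the embeddings are plain lists of floats standing in for the outputs of the two encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric softmax contrastive loss over a batch of pairs.

    Assumes image_embs[i] and text_embs[i] are a matching pair; every
    other combination in the batch is treated as a negative.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
              for i in range(n)]

    def cross_entropy(rows):
        # Mean negative log-softmax of the diagonal (matching) entry.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    image_to_text = cross_entropy(logits)
    text_to_image = cross_entropy([list(col) for col in zip(*logits)])
    return image_to_text + text_to_image


def search(query_emb, index_embs):
    """Return indices of index_embs ranked by cosine similarity, best first."""
    q = normalize(query_emb)
    scores = [dot(q, normalize(v)) for v in index_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def combine_query(image_emb, text_emb):
    """Build a text + image query by summing unit-norm embeddings (assumed rule)."""
    return [a + b for a, b in zip(normalize(image_emb), normalize(text_emb))]
```

Because both encoders map into the same space, the same search function covers every direction: pass a text embedding as the query over an index of image embeddings to find images, pass an image embedding over an index of text embeddings to find descriptions, or pass the output of combine_query to search with text plus an image.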