Better image captioning and question answering using weakly supervised training
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
arXiv paper abstract https://arxiv.org/abs/2108.10904
arXiv PDF paper https://arxiv.org/pdf/2108.10904.pdf
... Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks.
However, the requirement for expensive annotations ... limits the scalability of existing approaches.
... relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
... by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective.
Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score).
... demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
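The "single prefix language modeling objective" mentioned above can be pictured as an attention mask: the prefix (image patches plus any leading text) is attended to bidirectionally, while the rest of the sequence is decoded causally. A minimal sketch in NumPy, with illustrative names not taken from the paper:

```python
# Hypothetical sketch of a PrefixLM-style attention mask. The first
# `prefix_len` tokens (e.g. image patches plus a text prefix) attend to
# each other bidirectionally; the remaining tokens are autoregressive.
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """mask[i, j] == True means token i may attend to token j."""
    # Start from a standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Let every token attend to the full prefix, making the prefix
    # bidirectional; prefix tokens still cannot see the suffix.
    mask[:, :prefix_len] = True
    return mask

mask = prefix_lm_mask(seq_len=6, prefix_len=3)
# Tokens 0-2 (the prefix) see each other fully; tokens 3-5 see the
# prefix plus only earlier generated tokens.
```

This is only an illustration of the masking idea, not the paper's implementation, which trains a full encoder-decoder on large-scale weakly supervised image-text data.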
Please like and share this post if you enjoyed it using the buttons at the bottom!
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website