Better image captioning and question answering using weakly supervised training
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
arXiv paper abstract https://arxiv.org/abs/2108.10904
arXiv PDF paper https://arxiv.org/pdf/2108.10904.pdf
... Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks.
However, the requirement for expensive annotations ... limits the scalability of existing approaches.
... relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
... by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective.
Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score).
... demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
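The "single prefix language modeling objective" mentioned above can be pictured as an attention mask: the prefix (image patches plus any leading text) is attended to bidirectionally, while the rest of the sequence is decoded causally. A minimal sketch in NumPy, with illustrative names not taken from the paper:

```python
# Hypothetical sketch of a PrefixLM-style attention mask. The first
# `prefix_len` tokens (e.g. image patches plus a text prefix) attend to
# each other bidirectionally; the remaining tokens are autoregressive.
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """mask[i, j] == True means token i may attend to token j."""
    # Start from a standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Let every token attend to the full prefix, making the prefix
    # bidirectional; prefix tokens still cannot see the suffix.
    mask[:, :prefix_len] = True
    return mask

mask = prefix_lm_mask(seq_len=6, prefix_len=3)
# Tokens 0-2 (the prefix) see each other fully; tokens 3-5 see the
# prefix plus only earlier generated tokens.
```

This is only an illustration of the masking idea, not the paper's implementation, which trains a full encoder-decoder on large-scale weakly supervised image-text data.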
Please like and share this post if you enjoyed it using the buttons at the bottom!
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website