Answer questions about an image using a structured information graph with SA-VQA
SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering
arXiv paper abstract https://arxiv.org/abs/2201.10654v1
arXiv PDF paper https://arxiv.org/pdf/2201.10654v1.pdf
Visual Question Answering (VQA) ... is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality representations.
Previous approaches ... employ entity-level alignments, such as the correlations between the visual regions and their semantic labels, or the interactions across question words and object features.
These attempts aim to improve the cross-modality representations, while ignoring their internal relations.
... propose to apply structured alignments, which work with graph representation of visual and textual content
... solve ... by first converting different modality entities into sequential nodes and the adjacency graph, then incorporating them for structured alignments.
... model, without any pretraining, outperforms the state-of-the-art methods on GQA dataset, and beats the non-pretrained state-of-the-art methods on VQA-v2 dataset.
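To make the "sequential nodes and the adjacency graph" idea above more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `graph_to_sequence_and_adjacency`, the (subject, relation, object) triple format, and the choice of a symmetric adjacency matrix with self-loops are all assumptions made for illustration. It only shows how a small scene graph could be flattened into an ordered node list plus a matrix recording which nodes are connected.

```python
def graph_to_sequence_and_adjacency(triples):
    """Flatten (subject, relation, object) triples into a node list and an adjacency matrix.

    Hypothetical helper for illustration only; the triple format and the
    symmetric, self-looped adjacency matrix are assumptions, not the paper's design.
    """
    nodes = []   # sequential nodes, in order of first appearance
    index = {}   # node name -> position in `nodes`
    for subj, _relation, obj in triples:
        for name in (subj, obj):
            if name not in index:
                index[name] = len(nodes)
                nodes.append(name)

    n = len(nodes)
    # Start with self-loops so every node is connected to itself.
    adjacency = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    # Mark each relation as an undirected edge between its two entities.
    for subj, _relation, obj in triples:
        i, j = index[subj], index[obj]
        adjacency[i][j] = 1
        adjacency[j][i] = 1
    return nodes, adjacency


# Toy scene graph (made-up example data, not taken from GQA or VQA-v2).
triples = [("man", "holding", "umbrella"), ("man", "wearing", "hat")]
nodes, adjacency = graph_to_sequence_and_adjacency(triples)
print(nodes)
for row in adjacency:
    print(row)
```

In a model along these lines, the node list would be embedded as a sequence, and the adjacency matrix could serve as a mask that restricts which nodes interact, so the structure of the graph is kept rather than treating entities independently.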