Answer questions about a video using an image-language model with Tem-adapter
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
arXiv paper abstract https://arxiv.org/abs/2308.08414
arXiv PDF paper https://arxiv.org/pdf/2308.08414.pdf
Video-language pre-trained models have shown remarkable success in guiding video question-answering (VideoQA) tasks. However ... training ... video-based models ... higher costs than ... image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
To bridge these gaps ... propose Tem-Adapter, which enables the learning of temporal dynamics and complex semantics by a visual Temporal Aligner and a textual Semantic Aligner.
... the Temporal Aligner introduces an extra language-guided autoregressive task aimed at facilitating the learning of temporal dependencies, with the objective of predicting future states based on historical clues and language guidance that describes event progression.
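To make the autoregressive idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the function names, the averaging predictor, the `alpha` blend weight, and the squared-error loss are all assumptions chosen for illustration. It only shows the shape of the task, where each future state is predicted from the states before it plus a language guidance vector.

```python
from typing import List, Tuple

Vector = List[float]


def autoregressive_pairs(states: List[Vector]) -> List[Tuple[List[Vector], Vector]]:
    """Pair every state after the first with the history that precedes it."""
    return [(states[:t], states[t]) for t in range(1, len(states))]


def predict_next(history: List[Vector], guidance: Vector, alpha: float = 0.5) -> Vector:
    """Toy stand-in for a learned predictor (an assumption, not the paper's model):
    blend the mean of the historical states with the language guidance vector."""
    dim = len(guidance)
    mean = [sum(h[i] for h in history) / len(history) for i in range(dim)]
    return [(1 - alpha) * m + alpha * g for m, g in zip(mean, guidance)]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def autoregressive_loss(states: List[Vector], guidance: Vector) -> float:
    """Average prediction error over all (history, future state) pairs."""
    pairs = autoregressive_pairs(states)
    return sum(mse(predict_next(h, guidance), target) for h, target in pairs) / len(pairs)


if __name__ == "__main__":
    frames = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # made-up per-frame features
    text = [2.0, 2.0]                              # made-up language guidance
    print(autoregressive_loss(frames, text))
```

In a real model the predictor would be a trained network and the features would come from an image-language encoder; the point here is only that the history never includes the state being predicted.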
Besides, to reduce the semantic gap and adapt the textual representation for better event description, ... introduce a Semantic Aligner that first designs a template to fuse question and answer pairs as event descriptions and then learns a Transformer decoder with the whole video sequence as guidance for refinement.
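The template step can be sketched in a few lines of Python. The abstract does not give the template wording, so the `TEMPLATE` string and the `fuse_qa` helper below are hypothetical; the sketch only shows how one question and several candidate answers could become one event description per candidate. The Transformer decoder refinement that follows is not shown.

```python
from typing import List

# Hypothetical template: the source does not state the exact wording used.
TEMPLATE = "Question: {question} Answer: {answer}"


def fuse_qa(question: str, candidates: List[str], template: str = TEMPLATE) -> List[str]:
    """Fuse a question with each candidate answer into one event description."""
    return [
        template.format(question=question.strip(), answer=candidate.strip())
        for candidate in candidates
    ]


if __name__ == "__main__":
    descriptions = fuse_qa(
        "What does the car do next?",
        ["It turns left", "It stops"],
    )
    for text in descriptions:
        print(text)
```

Each fused string would then be encoded by the text side of the image-language model, giving one textual representation per candidate answer to compare against the video.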
... significant performance improvement demonstrates the effectiveness of ... method.
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website