
News to help your R&D in artificial intelligence, machine learning, robotics, computer vision, smart hardware

morrislee

Answer questions about a video using an image-language model with Tem-adapter


Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer

arXiv paper abstract https://arxiv.org/abs/2308.08414



Video-language pre-trained models have shown remarkable success in guiding video question-answering (VideoQA) tasks. However ... training ... video-based models ... higher costs than ... image-based ones.


This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.


To bridge these gaps ... propose Tem-Adapter, which enables the learning of temporal dynamics and complex semantics by a visual Temporal Aligner and a textual Semantic Aligner.


... the Temporal Aligner introduces an extra language-guided autoregressive task aimed at facilitating the learning of temporal dependencies, with the objective of predicting future states based on historical clues and language guidance that describes event progression.
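
To make the idea of predicting future states from historical clues and language guidance more concrete, here is a toy sketch in plain Python. It is not the paper's Temporal Aligner: the real predictor is learned, while this one is a fixed blend of the mean of past frame features and a text embedding. The names `predict_next`, `autoregressive_loss`, and the `alpha` weight are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def predict_next(history: Sequence[Vector], text: Vector, alpha: float = 0.5) -> Vector:
    """Hypothetical stand-in for a learned predictor.

    Blends the mean of the past frame features (historical clues)
    with the text embedding (language guidance).
    """
    past = mean_vector(history)
    return [(1 - alpha) * p + alpha * t for p, t in zip(past, text)]


def autoregressive_loss(frames: Sequence[Vector], text: Vector, alpha: float = 0.5) -> float:
    """Mean squared error of predicting frame t from frames [0, t) and the text.

    Only earlier frames are visible at each step, which is what makes
    the objective autoregressive.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to predict a future state")
    total = 0.0
    for t in range(1, len(frames)):
        prediction = predict_next(frames[:t], text, alpha)
        target = frames[t]
        total += sum((p - x) ** 2 for p, x in zip(prediction, target)) / len(target)
    return total / (len(frames) - 1)


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    # Text that matches the frames gives a lower loss than text that does not.
    print(autoregressive_loss(frames, [1.0, 0.0]))
    print(autoregressive_loss(frames, [0.0, 0.0]))
```

In a trained model this loss would be minimized by updating the predictor's parameters, so that the text describing the event progression helps anticipate the next frame.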


Besides, to reduce the semantic gap and adapt the textual representation for better event description, ... introduce a Semantic Aligner that first designs a template to fuse question and answer pairs as event descriptions and then learns a Transformer decoder with the whole video sequence as guidance for refinement.
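
The template step can be pictured with a short sketch. The abstract does not give the actual template, so the "Question: ... Answer: ..." wording below is an assumption, and `fuse_question_answer` and `build_event_descriptions` are hypothetical helper names. The later refinement with a Transformer decoder guided by the video sequence is not shown here.

```python
from typing import List


def fuse_question_answer(question: str, answer: str) -> str:
    """Fuse one question and one candidate answer into a single description.

    The template wording is an assumption made for illustration only.
    """
    return f"Question: {question.strip()} Answer: {answer.strip()}"


def build_event_descriptions(question: str, candidates: List[str]) -> List[str]:
    """Create one fused description per candidate answer."""
    return [fuse_question_answer(question, candidate) for candidate in candidates]


if __name__ == "__main__":
    descriptions = build_event_descriptions(
        "What does the car do next?",
        ["It turns left", "It stops"],
    )
    for description in descriptions:
        print(description)
```

Each fused string would then be encoded by the text encoder and compared against the video features, one score per candidate answer.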


... significant performance improvement demonstrates the effectiveness of ... method.





Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact

Web site with my other posts by category https://morrislee1234.wixsite.com/website


