Improve video retrieval with text by comparing coarse and fine features with X-CLIP

morrislee
Sep 15, 2022

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

arXiv paper abstract https://arxiv.org/abs/2207.07285

arXiv PDF paper https://arxiv.org/pdf/2207.07285.pdf

Application of X-CLIP for zero-shot video classification

Twitter video https://twitter.com/fcakyon/status/1569294816428556289

Online demo https://huggingface.co/spaces/fcakyon/zero-shot-video-classification

Video-text retrieval ... a crucial ... task ... However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored

... cross-grained contrast calculates the correlation between coarse-grained features and each fine-grained feature, and is able to filter out the unnecessary fine-grained features guided by the coarse-grained feature during similarity calculation
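To make the idea of cross-grained contrast concrete, here is a minimal sketch, assuming cosine similarity between one coarse feature (for example a sentence-level embedding) and each fine-grained feature (for example per-frame embeddings). The function names and the toy vectors are hypothetical and are not taken from the paper or its code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def cross_grained_similarity(coarse, fine_features):
    """Cosine similarity between one coarse feature and each fine feature.

    Returns one score per fine-grained feature, so low-scoring
    (less relevant) fine features can later be down-weighted.
    """
    c = l2_normalize(coarse)
    scores = []
    for fine in fine_features:
        f = l2_normalize(fine)
        scores.append(sum(a * b for a, b in zip(c, f)))
    return scores


# Toy example (assumed data): one sentence embedding against three frame embeddings.
sentence_feature = [1.0, 0.0]
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = cross_grained_similarity(sentence_feature, frame_features)
```

In this toy case the first frame matches the sentence exactly and the second is orthogonal to it, so the second frame is the one a downstream weighting step would suppress.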

... presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval.

... another challenge lies in the similarity aggregation problem, which aims to aggregate fine-grained and cross-grained similarity matrices to instance-level similarity.

... propose the Attention Over Similarity Matrix (AOSM) module to make the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results.
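As a rough illustration of attention over a similarity matrix, the sketch below replaces plain averaging with a softmax-weighted sum, so that high-similarity frame-word pairs dominate the instance-level score. This is an assumption-laden reading of the abstract, not the authors' implementation: the function names, the temperature value of 0.01, and the averaging of the two aggregation orders are illustrative choices, and the exact formulation in the paper may differ.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attend(similarities, temperature=0.01):
    """Softmax-weighted sum: larger similarities receive larger weights."""
    weights = softmax(similarities, temperature)
    return sum(w * s for w, s in zip(weights, similarities))


def aggregate_matrix(sim_matrix, temperature=0.01):
    """Collapse a frames-by-words similarity matrix into one score.

    Attention is applied over words and then frames, and also over
    frames and then words; the two results are averaged (an assumed choice).
    """
    per_frame = [attend(row, temperature) for row in sim_matrix]
    video_side = attend(per_frame, temperature)
    per_word = [attend(list(col), temperature) for col in zip(*sim_matrix)]
    sentence_side = attend(per_word, temperature)
    return (video_side + sentence_side) / 2


# Toy matrix (assumed data): 2 frames x 2 words, with one strongly matching pair.
toy_matrix = [[0.9, 0.1], [0.2, 0.1]]
score = aggregate_matrix(toy_matrix)
plain_mean = sum(sum(row) for row in toy_matrix) / 4
```

With a small temperature the score stays close to the strongest frame-word match, while a very large temperature makes the weights nearly uniform and the result approaches the plain mean.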

... outperforms the previous state-of-the-art by +6.3%, +6.6%, +11.1%, +6.7%, +3.8% relative improvements on these benchmarks ...

Please like and share this post if you enjoyed it using the buttons at the bottom!

Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact

Web site with my other posts by category https://morrislee1234.wixsite.com/website

LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b

#ComputerVision #VideoRetrieval #AINewsClips #AI #ML #ArtificialIntelligence #MachineLearning

News to help your R&D in artificial intelligence, machine learning, robotics, computer vision, smart hardware
