Improve video retrieval with text by comparing coarse and fine features with X-CLIP
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
arXiv paper abstract https://arxiv.org/abs/2207.07285
arXiv PDF paper https://arxiv.org/pdf/2207.07285.pdf
Application of X-CLIP for zero-shot video classification
Twitter video https://twitter.com/fcakyon/status/1569294816428556289
Video-text retrieval ... a crucial ... task ... However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored
... cross-grained contrast calculates the correlation between coarse-grained features and each fine-grained feature, and is able to filter out the unnecessary fine-grained features guided by the coarse-grained feature during similarity calculation
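To make the cross-grained idea concrete, here is a minimal NumPy sketch of comparing one coarse feature against each fine feature. It is an illustration under stated assumptions, not the authors' implementation: the function names, the random features, and the use of mean pooling to build the coarse video and sentence features are all hypothetical choices made for this example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_grained_similarity(coarse, fine):
    """Cosine similarity between one coarse vector (d,) and each fine vector (n, d).

    Returns a vector of n scores, one per fine-grained feature.
    """
    coarse = l2_normalize(coarse)
    fine = l2_normalize(fine)
    return fine @ coarse

# Toy features (random, for illustration only).
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))  # 4 frame features: fine-grained, video side
words = rng.normal(size=(5, 8))   # 5 word features: fine-grained, text side

# ASSUMPTION: coarse features are obtained by mean pooling the fine ones.
video = frames.mean(axis=0)
sentence = words.mean(axis=0)

video_word = cross_grained_similarity(video, words)          # shape (5,)
sentence_frame = cross_grained_similarity(sentence, frames)  # shape (4,)
```

Each entry of `video_word` scores how well one word matches the video as a whole, and each entry of `sentence_frame` scores how well one frame matches the sentence as a whole. Low scores mark the fine-grained features that a later aggregation step can down-weight.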
... presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval.
... another challenge lies in the similarity aggregation problem, which aims to aggregate fine-grained and cross-grained similarity matrices to instance-level similarity.
... propose the Attention Over Similarity Matrix (AOSM) module to make the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results.
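The aggregation idea behind AOSM can be sketched as a softmax-weighted sum over similarity scores, so that high-scoring frames and words dominate the final instance-level score. The code below is a rough sketch under assumptions, not the paper's exact module: the function names, the temperature `tau` and its default value, and the choice to average the two directions at the end are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(scores, tau=0.01, axis=-1):
    """Softmax-weighted sum of scores along one axis.

    ASSUMPTION: tau is a temperature. A small tau approaches max pooling,
    a large tau approaches mean pooling.
    """
    weights = softmax(scores / tau, axis=axis)
    return (weights * scores).sum(axis=axis)

def aggregate_matrix(sim, tau=0.01):
    """Reduce a frame-word similarity matrix (n_frames, n_words) to one score."""
    per_frame = attention_pool(sim, tau, axis=1)  # best-matching words per frame
    per_word = attention_pool(sim, tau, axis=0)   # best-matching frames per word
    video_score = attention_pool(per_frame, tau)  # attend over frames
    text_score = attention_pool(per_word, tau)    # attend over words
    # ASSUMPTION: the two directions are combined by a plain average.
    return 0.5 * (video_score + text_score)

# Toy frame-word similarity matrix: 2 frames, 2 words.
sim = np.array([[0.1, 0.9],
                [0.2, 0.3]])
score = aggregate_matrix(sim)
```

The same `attention_pool` function also applies to a similarity vector, such as the per-word or per-frame scores from a cross-grained comparison. Compared with a plain mean, the weighting lets a few strongly matching frames or words outweigh many irrelevant ones.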
... outperforms the previous state-of-the-art by +6.3%, +6.6%, +11.1%, +6.7%, +3.8% relative improvements on these benchmarks ...