
News to help your R&D in artificial intelligence, machine learning, robotics, computer vision, smart hardware


Improve video retrieval with text by comparing coarse and fine features with X-CLIP



X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

arXiv paper abstract https://arxiv.org/abs/2207.07285

Application of X-CLIP for zero-shot video classification



Video-text retrieval ... a crucial ... task ... However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored


... cross-grained contrast calculates the correlation between coarse-grained features and each fine-grained feature, and is able to filter out the unnecessary fine-grained features guided by the coarse-grained feature during similarity calculation
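To make the idea of cross-grained contrast concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the coarse-grained feature is a single video-level vector and the fine-grained features are per-word vectors, and it uses cosine similarity as the correlation measure. The function names, the feature dimension, and the random inputs are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_grained_similarity(coarse, fine):
    """Cosine similarity between one coarse vector (d,) and each of n fine vectors (n, d).

    Returns a vector of shape (n,): one score per fine-grained feature.
    """
    coarse = l2_normalize(coarse)
    fine = l2_normalize(fine)
    return fine @ coarse

# Hypothetical inputs: one video-level feature and five word-level features.
rng = np.random.default_rng(0)
video_feature = rng.normal(size=8)
word_features = rng.normal(size=(5, 8))

sims = cross_grained_similarity(video_feature, word_features)
print(sims.shape)
```

Words whose score is low against the video-level feature are the "unnecessary" fine-grained features that a later aggregation step can down-weight.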


... presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval.


... another challenge lies in the similarity aggregation problem, which aims to aggregate fine-grained and cross-grained similarity matrices to instance-level similarity.


... propose the Attention Over Similarity Matrix (AOSM) module to make the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results.
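The aggregation step can be sketched as softmax-weighted pooling over a frame-by-word similarity matrix. This is an illustrative sketch under stated assumptions, not the paper's exact AOSM module: the temperature value, the order of pooling, and the averaging of the two directions are all choices made here for illustration, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_pool(sim, axis=-1, temperature=0.01):
    """Weight each similarity by softmax(sim / temperature) and sum along axis.

    High similarities receive most of the weight, so weak matches
    contribute little to the pooled score.
    """
    weights = softmax(sim / temperature, axis=axis)
    return np.sum(weights * sim, axis=axis)

def aggregate_similarity_matrix(sim_matrix, temperature=0.01):
    """Reduce a (frames x words) similarity matrix to one instance-level score."""
    # Video side: pool over words for each frame, then pool over frames.
    per_frame = attention_pool(sim_matrix, axis=1, temperature=temperature)
    video_side = attention_pool(per_frame, axis=0, temperature=temperature)
    # Text side: pool over frames for each word, then pool over words.
    per_word = attention_pool(sim_matrix, axis=0, temperature=temperature)
    text_side = attention_pool(per_word, axis=0, temperature=temperature)
    # Averaging the two directions is an assumption of this sketch.
    return 0.5 * (video_side + text_side)

# Hypothetical similarity matrix: 4 frames by 3 words.
sim_matrix = np.array([
    [0.10, 0.05, 0.02],
    [0.80, 0.10, 0.05],
    [0.05, 0.70, 0.10],
    [0.02, 0.03, 0.01],
])
score = aggregate_similarity_matrix(sim_matrix)
print(score)
```

Because the weights grow with the similarity values, the pooled score lies between the plain mean and the maximum of the matrix, which is how the uninformative frames and words end up with little influence.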


... outperforms the previous state-of-the-art by +6.3%, +6.6%, +11.1%, +6.7%, +3.8% relative improvements on these benchmarks ...



Please like and share this post if you enjoyed it!
Stay up to date. Subscribe to my posts: https://morrislee1234.wixsite.com/website/contact
Website with my other posts by category: https://morrislee1234.wixsite.com/website


