Find location in video matching a sentence with TAN
Find location in video matching a sentence with TAN
Temporal Alignment Networks for Long-term Video
arXiv paper abstract https://arxiv.org/abs/2204.02968
arXiv PDF paper https://arxiv.org/pdf/2204.02968.pdf
The objective ... is a temporal alignment network that ingests long term video sequences, and associated text sentences, in order to:
(1) determine if a sentence is alignable with the video; and (2) if it is alignable, then determine its alignment.
The challenge is to train such networks from large-scale datasets ... where the associated text sentences have significant noise, and are only weakly aligned when relevant.
... describe a novel co-training method that enables to denoise and train on raw instructional videos without using manual annotation, despite the considerable noise
... proposed model ... outperforms strong baselines (CLIP, MIL-NCE) on this alignment dataset by a significant margin
... apply the trained model in the zero-shot settings to multiple downstream video understanding tasks and achieve state-of-the-art results ...
Please like and share this post if you enjoyed it using the buttons at the bottom!
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website
Comments