Classification of long videos using state-space with ViS4mer
Long Movie Clip Classification with State-Space Video Models
arXiv paper abstract: https://arxiv.org/abs/2204.01692v1
arXiv paper PDF: https://arxiv.org/pdf/2204.01692v1.pdf
Most modern video recognition ... operate on short video clips (e.g., 5-10s in length).
... challenging to ... long movie understanding tasks, which typically require sophisticated long-range temporal reasoning capabilities.
... video transformers ... address this ... by ... long-range temporal self-attention. However, ... quadratic cost of self-attention ...
... propose ViS4mer ... uses a standard Transformer encoder for short-range spatiotemporal feature extraction, and a multi-scale temporal S4 decoder for subsequent long-range temporal reasoning.
... ViS4mer is 2.63x faster and requires 8x less GPU memory than the corresponding pure self-attention-based model.
... ViS4mer achieves state-of-the-art results in 6 out of 9 long-form movie video classification tasks on the LVU benchmark. ...
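To make the encoder-decoder design above concrete, here is a minimal PyTorch sketch of the overall pipeline. This is my own simplification and not the authors' code: the SimpleSSM class is a toy diagonal state-space stand-in for a real S4 layer, the encoder here runs over precomputed clip features purely for illustration (in the paper it extracts spatiotemporal features within each short clip), and all class names, dimensions, and the depth-3 decoder are illustrative assumptions. The key point it shows is that each decoder block mixes information over time with a state-space scan that costs O(T) instead of the O(T^2) of self-attention, and that temporal pooling between blocks gives the multi-scale structure.

import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """Toy diagonal state-space layer, a stand-in for a real S4 layer."""
    def __init__(self, dim, state_size=16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(dim, state_size))        # log-decay rates <= 0
        self.B = nn.Parameter(0.1 * torch.randn(dim, state_size))
        self.C = nn.Parameter(0.1 * torch.randn(dim, state_size))
        self.D = nn.Parameter(torch.ones(dim))                      # skip connection

    def forward(self, x):                        # x: (batch, time, dim)
        b, t, d = x.shape
        state = x.new_zeros(b, d, self.A.shape[1])
        decay = torch.exp(self.A)                # per-channel decay in (0, 1]
        outputs = []
        for i in range(t):                       # linear-time scan over the sequence
            state = state * decay + x[:, i, :, None] * self.B
            outputs.append((state * self.C).sum(-1) + self.D * x[:, i])
        return torch.stack(outputs, dim=1)

class MultiScaleSSMDecoder(nn.Module):
    """Stack of state-space blocks; each block halves the temporal length."""
    def __init__(self, dim, depth=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "norm": nn.LayerNorm(dim),
                "ssm": SimpleSSM(dim),
                "mlp": nn.Sequential(nn.Linear(dim, dim), nn.GELU()),
            })
            for _ in range(depth)
        ])

    def forward(self, x):                        # x: (batch, time, dim)
        for blk in self.blocks:
            x = x + blk["ssm"](blk["norm"](x))   # residual temporal mixing
            x = nn.functional.avg_pool1d(x.transpose(1, 2), 2).transpose(1, 2)  # temporal pooling
            x = blk["mlp"](x)
        return x

class ViS4merSketch(nn.Module):
    """Short-range Transformer encoder + multi-scale state-space decoder + classifier head."""
    def __init__(self, feat_dim=768, num_classes=10):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=1,
        )
        self.decoder = MultiScaleSSMDecoder(feat_dim, depth=3)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_tokens):              # (batch, num_clips, feat_dim)
        x = self.encoder(clip_tokens)            # short-range feature extraction
        x = self.decoder(x)                      # long-range temporal reasoning
        return self.head(x.mean(dim=1))          # pooled logits per video

if __name__ == "__main__":
    model = ViS4merSketch()
    tokens = torch.randn(2, 64, 768)             # 2 videos, 64 clip features each
    print(model(tokens).shape)                   # torch.Size([2, 10])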
If you enjoyed this post, please like and share it using the buttons at the bottom!
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website