Classification of long videos using state-space with ViS4mer
Long Movie Clip Classification with State-Space Video Models
arXiv paper abstract: https://arxiv.org/abs/2204.01692v1
arXiv paper PDF: https://arxiv.org/pdf/2204.01692v1.pdf
Most modern video recognition ... operate on short video clips (e.g., 5-10s in length).
... challenging to ... long movie understanding tasks, which typically require sophisticated long-range temporal reasoning capabilities.
... video transformers ... address this ... by ... long-range temporal self-attention. However, ... quadratic cost of self-attention ...
... propose ViS4mer ... uses a standard Transformer encoder for short-range spatiotemporal feature extraction, and a multi-scale temporal S4 decoder for subsequent long-range temporal reasoning.
... ViS4mer is 2.63x faster and requires 8x less GPU memory than the corresponding pure self-attention-based model.
... ViS4mer achieves state-of-the-art results in 6 out of 9 long-form movie video classification tasks on the LVU benchmark. ...
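To make the encoder-decoder design above concrete, here is a minimal PyTorch sketch of the overall pipeline. This is my own simplification and not the authors' code: the SimpleSSM class is a toy diagonal state-space stand-in for a real S4 layer, the encoder here runs over precomputed clip features purely for illustration (in the paper it extracts spatiotemporal features within each short clip), and all class names, dimensions, and the depth-3 decoder are illustrative assumptions. The key point it shows is that each decoder block mixes information over time with a state-space scan that costs O(T) instead of the O(T^2) of self-attention, and that temporal pooling between blocks gives the multi-scale structure.

import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """Toy diagonal state-space layer, a stand-in for a real S4 layer."""
    def __init__(self, dim, state_size=16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(dim, state_size))        # log-decay rates <= 0
        self.B = nn.Parameter(0.1 * torch.randn(dim, state_size))
        self.C = nn.Parameter(0.1 * torch.randn(dim, state_size))
        self.D = nn.Parameter(torch.ones(dim))                      # skip connection

    def forward(self, x):                        # x: (batch, time, dim)
        b, t, d = x.shape
        state = x.new_zeros(b, d, self.A.shape[1])
        decay = torch.exp(self.A)                # per-channel decay in (0, 1]
        outputs = []
        for i in range(t):                       # linear-time scan over the sequence
            state = state * decay + x[:, i, :, None] * self.B
            outputs.append((state * self.C).sum(-1) + self.D * x[:, i])
        return torch.stack(outputs, dim=1)

class MultiScaleSSMDecoder(nn.Module):
    """Stack of state-space blocks; each block halves the temporal length."""
    def __init__(self, dim, depth=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "norm": nn.LayerNorm(dim),
                "ssm": SimpleSSM(dim),
                "mlp": nn.Sequential(nn.Linear(dim, dim), nn.GELU()),
            })
            for _ in range(depth)
        ])

    def forward(self, x):                        # x: (batch, time, dim)
        for blk in self.blocks:
            x = x + blk["ssm"](blk["norm"](x))   # residual temporal mixing
            x = nn.functional.avg_pool1d(x.transpose(1, 2), 2).transpose(1, 2)  # temporal pooling
            x = blk["mlp"](x)
        return x

class ViS4merSketch(nn.Module):
    """Short-range Transformer encoder + multi-scale state-space decoder + classifier head."""
    def __init__(self, feat_dim=768, num_classes=10):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=1,
        )
        self.decoder = MultiScaleSSMDecoder(feat_dim, depth=3)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_tokens):              # (batch, num_clips, feat_dim)
        x = self.encoder(clip_tokens)            # short-range feature extraction
        x = self.decoder(x)                      # long-range temporal reasoning
        return self.head(x.mean(dim=1))          # pooled logits per video

if __name__ == "__main__":
    model = ViS4merSketch()
    tokens = torch.randn(2, 64, 768)             # 2 videos, 64 clip features each
    print(model(tokens).shape)                   # torch.Size([2, 10])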
If you enjoyed this post, please like and share it using the buttons at the bottom!
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website