Transformers for action recognition 40 times faster by focusing attention in time

morrislee
Apr 19, 2021
An Image is Worth 16x16 Words, What is a Video Worth?

arXiv paper abstract https://arxiv.org/abs/2103.13915

arXiv PDF paper https://arxiv.org/pdf/2103.13915.pdf

GitHub https://github.com/Alibaba-MIIL/STAM

... significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with x30 less frames per video, and x40 faster inference than the current leading method.
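
To make the idea of "global attention over video frames" concrete, here is a minimal sketch rather than the STAM implementation. It assumes each frame has already been reduced to one embedding vector, and it uses single-head self-attention with identity projections (no learned weights) plus mean pooling. The function names and the toy embeddings are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frames):
    """Single-head self-attention across frames.

    Assumption: queries, keys and values are the frame embeddings
    themselves (identity projections), which a trained model would
    replace with learned linear layers.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    attended, weights = [], []
    for query in frames:
        # Every frame scores every other frame: attention is global in time.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in frames]
        row = softmax(scores)
        weights.append(row)
        attended.append(
            [sum(row[t] * frames[t][j] for t in range(len(frames))) for j in range(dim)]
        )
    return attended, weights


# Toy per-frame embeddings (hypothetical values): 3 frames, 2 dimensions.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = temporal_attention(frames)

# Mean pooling over time stands in for whatever video-level summary a real model uses.
video_embedding = [sum(f[j] for f in attended) / len(attended) for j in range(len(frames[0]))]
```

Because the attention runs over one vector per frame, its cost grows with the square of the number of frames rather than with the number of pixels or patches, so sampling fewer frames cuts the work directly.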

#ActionRecognition #ComputerVision #AINewsClips #AI #ML #ArtificialIntelligence #MachineLearning
