Transformers for action recognition run 40 times faster by focusing attention in time
An Image is Worth 16x16 Words, What is a Video Worth?
arXiv paper abstract https://arxiv.org/abs/2103.13915
arXiv PDF paper https://arxiv.org/pdf/2103.13915.pdf
... significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with x30 less frames per video, and x40 faster inference than the current leading method.
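To make the idea of "global attention over video frames" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: it assumes each sampled frame has already been reduced to one embedding vector, uses a single attention head with no learned projections, and averages the attended frames into one video-level vector. The function names (`temporal_attention`, `video_embedding`) and the choice of 16 frames with 8-dimensional embeddings are illustrative only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_embeddings):
    """Single-head global self-attention across frames.

    Assumption: queries, keys and values are the raw frame embeddings
    (a real model would apply learned linear projections first).
    """
    dim = len(frame_embeddings[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in frame_embeddings:
        # Every frame attends to every other frame (global attention in time).
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in frame_embeddings]
        weights = softmax(scores)
        attended.append([
            sum(w * value[j] for w, value in zip(weights, frame_embeddings))
            for j in range(dim)
        ])
    return attended


def video_embedding(frame_embeddings):
    """Average the attended frames into one video-level vector (illustrative pooling)."""
    attended = temporal_attention(frame_embeddings)
    count = len(attended)
    return [sum(frame[j] for frame in attended) / count
            for j in range(len(attended[0]))]


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical input: 16 sampled frames, each an 8-dimensional embedding.
    frames = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
    print(len(video_embedding(frames)))
```

The cost of this attention step grows with the square of the number of frames, so sampling only a few frames per video keeps the temporal stage cheap, which is in line with the input-efficiency argument quoted above.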
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website