Vision Mamba is more accurate than transformers, is 2.8x faster, and uses 86.8% less GPU memory
Survey of image classification with vision transformers
Survey of vision transformer efficiency
Survey of transformer inference optimization techniques
Survey of transformers for vision language
Survey of transformers for 2D object detection
Survey of vision transformers and hybrid CNN-transformer variants
1.5x faster vision transformers by using activation sparsity with SparseViT
Vision transformer beats CNN on mobile devices for accuracy and speed with ElasticViT
Survey of computer vision using transformers by Jamil
Survey of transformers for video
Survey of computer vision using transformers
Improve vision transformers by using anti-aliasing
MobileViT: an accurate, light-weight, mobile-friendly vision transformer
Transformer handles images, video, point clouds, and audio to understand the world
Training with modified data is like using 10 times more data
Get 3D pose and shape of people from monocular images
Transformer for super-resolution video
Survey of transformers for vision, text, and audio
Advantages of nested transformers for computer vision