MobileViT: an accurate, light-weight, mobile-friendly vision transformer
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
arXiv paper abstract https://arxiv.org/abs/2110.02178
arXiv PDF paper https://arxiv.org/pdf/2110.02178.pdf
Light-weight convolutional neural networks (CNNs) are the de-facto standard for mobile vision tasks. ... However, these networks are spatially local.
To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight.
... introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. ... presents ... transformers as convolutions.
... MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets.
On the ImageNet-1k dataset, MobileViT achieves a top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeiT (ViT-based), respectively, for a similar number of parameters.
On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.
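The core idea behind "transformers as convolutions" can be illustrated with a short sketch: convolutions first encode local spatial features, the feature map is unfolded into patches, a small transformer mixes information globally across patches, and the result is folded back and fused with the input feature map. The code below is a simplified, hypothetical PyTorch sketch for illustration only; the layer sizes, patch size, and the class name MobileViTBlockSketch are assumptions, not the authors' exact architecture.

# Hypothetical sketch of a MobileViT-style block (not the official implementation).
# Assumes input height and width are divisible by the patch size.
import torch
import torch.nn as nn


class MobileViTBlockSketch(nn.Module):
    def __init__(self, channels=64, dim=96, patch=2, depth=2, heads=4):
        super().__init__()
        # Local representation: convolutions model spatially local features.
        self.local_rep = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, dim, 1),
        )
        self.patch = patch
        # Global representation: a small transformer applied across patches.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=2 * dim,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # Project back and fuse the global features with the original input.
        self.proj = nn.Conv2d(dim, channels, 1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x):
        B, C, H, W = x.shape
        y = self.local_rep(x)                        # (B, dim, H, W)
        p, d = self.patch, y.shape[1]
        # Unfold into non-overlapping p x p patches so attention is computed
        # among pixels at the same relative position across all patches.
        y = y.reshape(B, d, H // p, p, W // p, p)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(B * p * p, (H // p) * (W // p), d)
        y = self.transformer(y)                      # global mixing across patches
        # Fold the tokens back into a feature map.
        y = y.reshape(B, p, p, H // p, W // p, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(B, d, H, W)
        y = self.proj(y)
        # Concatenate with the input and fuse with a convolution.
        return self.fuse(torch.cat([x, y], dim=1))


if __name__ == "__main__":
    block = MobileViTBlockSketch()
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])

Because the block returns a tensor with the same shape as its input, it can be dropped into a standard convolutional backbone in place of an ordinary convolutional stage, which is what makes the design light-weight and mobile-friendly.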
If you enjoyed this post, please like and share it using the buttons at the bottom!
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website
#ComputerVision #Transformers #AINewsClips #AI #ML #ArtificialIntelligence #MachineLearning