Vision transformer beats CNNs on mobile devices in accuracy and speed with ElasticViT
ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices
arXiv paper abstract https://arxiv.org/abs/2303.09730
arXiv paper PDF https://arxiv.org/pdf/2303.09730.pdf
... designing lightweight and low-latency ViT models for diverse mobile devices remains a significant challenge.
... propose ElasticViT, a two-stage NAS approach that trains a high-quality ViT supernet over a very large search space supporting a wide range of mobile devices, and then searches for an optimal sub-network (subnet) for direct deployment.
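The abstract only names the two stages, so here is a minimal runnable sketch of that control flow under toy assumptions; SPACE, latency_ms, accuracy_proxy, and the training stub are all hypothetical placeholders, not the paper's implementation.

```python
import random

# Toy stand-ins: a "subnet" is (depth, width); latency and accuracy are
# simple monotone proxies so the two-stage control flow can actually run.
# All names here are illustrative assumptions, not the authors' code.
SPACE = [(d, w) for d in range(2, 13) for w in (64, 128, 192, 256)]

def latency_ms(s):
    return 0.01 * s[0] * s[1]                   # toy on-device latency proxy

def accuracy_proxy(s):
    return 0.5 + 0.02 * s[0] + 0.0005 * s[1]    # toy accuracy proxy

def stage1_train_supernet(steps, sample_subnet):
    """Stage 1: weight-sharing training; each step updates one sampled subnet."""
    prev = None
    for _ in range(steps):
        prev = sample_subnet(prev)   # conflict-aware sampling (sketched below)
        # ...forward/backward through the shared supernet weights for `prev`...
    return prev

def stage2_search(budget_ms):
    """Stage 2: pick the most accurate subnet under a device latency budget."""
    feasible = [s for s in SPACE if latency_ms(s) <= budget_ms]
    return max(feasible, key=accuracy_proxy)

stage1_train_supernet(steps=10, sample_subnet=lambda prev: random.choice(SPACE))
print(stage2_search(budget_ms=15.0))   # deployed directly, no retraining
```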
... Complexity-aware sampling limits the FLOPs difference among the subnets sampled across adjacent training steps, while covering different-sized subnets in the search space.
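My reading of that constraint as a runnable sketch (not the authors' sampler): restrict each step's draw to subnets whose FLOPs sit near the previous step's. The flops_of proxy, MAX_STEP_DELTA cap, and (depth, width) encoding are all assumptions.

```python
import random

# Same toy encoding as the sketch above: a subnet is (depth, width).
SEARCH_SPACE = [(d, w) for d in range(2, 13) for w in (64, 128, 192, 256)]

def flops_of(subnet):
    depth, width = subnet
    return depth * width * width      # toy FLOPs proxy

MAX_STEP_DELTA = 100_000              # assumed cap on the step-to-step FLOPs gap

def sample_complexity_aware(prev_subnet):
    """Draw a subnet whose FLOPs stay close to the previously trained one."""
    if prev_subnet is None:
        return random.choice(SEARCH_SPACE)
    prev_flops = flops_of(prev_subnet)
    nearby = [s for s in SEARCH_SPACE
              if abs(flops_of(s) - prev_flops) <= MAX_STEP_DELTA]
    return random.choice(nearby or SEARCH_SPACE)

# FLOPs now drift smoothly across steps instead of jumping between the
# smallest and largest subnets, while the random walk still covers the space.
prev = None
for step in range(5):
    prev = sample_complexity_aware(prev)
    print(step, prev, flops_of(prev))
```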
Performance-aware sampling further selects subnets that have good accuracy, which can reduce gradient conflicts and improve supernet quality.
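A companion sketch for this step, again under assumed names: among several candidates that pass the complexity check, train only the one an accuracy proxy scores highest. predicted_accuracy and k are illustrative; the paper's actual selection criterion may differ.

```python
import random

CANDIDATES = [(d, w) for d in range(2, 13) for w in (64, 128, 192, 256)]

def predicted_accuracy(subnet):
    """Stand-in for an accuracy proxy such as a trained predictor."""
    depth, width = subnet
    return 0.5 + 0.02 * depth + 0.0005 * width  # toy monotone proxy

def sample_performance_aware(candidates, k=3):
    """Draw k complexity-feasible candidates, train only the best-scoring one."""
    pool = random.sample(candidates, k)
    return max(pool, key=predicted_accuracy)

print(sample_performance_aware(CANDIDATES))
```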
... discovered models, ElasticViT models, achieve top-1 accuracy ... without extra retraining, outperforming all prior CNNs and ViTs in terms of accuracy and latency.
... the first ViT models that surpass state-of-the-art CNNs with significantly lower latency on mobile devices.