Many types of computer vision tasks possible with new customizable vision foundation model, Florence
Florence: A New Foundation Model for Computer Vision
arXiv paper abstract https://arxiv.org/abs/2111.11432
arXiv PDF paper https://arxiv.org/pdf/2111.11432.pdf
... understanding ... diverse ... world demands computer vision models to generalize well with minimal customization for specific tasks
... vision foundation models ... such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation,
... new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth).
... incorporating ... Web-scale image-text data ... model ... easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image caption, video retrieval and action recognition.
... outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects.
... achieves new state-of-the-art results in majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and the top-5 accuracy of 97.18, 62.4 mAP on COCO fine tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
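To make the zero-shot numbers above concrete, here is a minimal sketch of how zero-shot classification and top-k accuracy are commonly computed once images and label text share one embedding space. This is not Florence's code: the function names (`cosine`, `rank_labels`, `top_k_accuracy`) and the tiny 3-dimensional vectors are hypothetical stand-ins for the embeddings that real image and text encoders would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_labels(image_vec, label_vecs):
    """Return label names sorted from most to least similar to the image."""
    return sorted(
        label_vecs,
        key=lambda name: cosine(image_vec, label_vecs[name]),
        reverse=True,
    )

def top_k_accuracy(image_vecs, true_labels, label_vecs, k):
    """Fraction of images whose true label is among the k best-ranked labels."""
    hits = 0
    for vec, truth in zip(image_vecs, true_labels):
        if truth in rank_labels(vec, label_vecs)[:k]:
            hits += 1
    return hits / len(true_labels)

# Hypothetical toy embeddings (assumed values, for illustration only).
LABEL_VECS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
IMAGE_VECS = [
    [0.9, 0.1, 0.0],  # clearly closest to "cat"
    [0.2, 0.8, 0.1],  # clearly closest to "dog"
    [0.5, 0.6, 0.1],  # a cat image that lands slightly nearer "dog"
    [0.1, 0.0, 0.9],  # clearly closest to "car"
]
TRUE_LABELS = ["cat", "dog", "cat", "car"]

top1 = top_k_accuracy(IMAGE_VECS, TRUE_LABELS, LABEL_VECS, k=1)
top2 = top_k_accuracy(IMAGE_VECS, TRUE_LABELS, LABEL_VECS, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

No label-specific training happens here: a new class is added just by embedding its name, which is why a shared image-text space supports zero-shot transfer. The third toy image shows why top-5 scores run higher than top-1 scores: the correct label can miss first place and still be counted within the top k.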