Zero-shot image recognition using over 100 times less data by flexible captioning with OTTER
Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation
arXiv paper abstract: https://arxiv.org/abs/2112.09445v2
arXiv paper PDF: https://arxiv.org/pdf/2112.09445v2.pdf
... Previous works, such as CLIP, use InfoNCE loss to train a model to predict the pairing between images and text captions.
CLIP, however, is data-hungry and requires more than 400M image-text pairs for training.
The inefficiency can be partially attributed to the fact that the image-text pairs are noisy.
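For reference, here is a minimal sketch of a CLIP-style symmetric InfoNCE loss with hard one-to-one pairing targets, which is the setup OTTER improves on. The function and variable names are illustrative and not taken from any released code.

```python
# Minimal sketch of InfoNCE with hard one-to-one pairing targets (CLIP-style).
# Names and defaults are illustrative, not from the paper's code.
import torch
import torch.nn.functional as F

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_emb @ text_emb.t() / temperature

    # Hard targets: the i-th image is assumed to match only the i-th caption.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image -> text and text -> image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

With noisy web data, these one-hot targets penalize captions that partially describe several images in the batch, which is part of the inefficiency noted above.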
... propose OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition), which uses online entropic optimal transport to find a soft image-text match as labels for contrastive learning.
Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs.
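As a rough illustration of the idea (not OTTER's actual implementation), the sketch below runs a few Sinkhorn iterations of entropic optimal transport over the in-batch similarity matrix to produce soft matching targets, which then replace the one-hot labels in the contrastive loss. All hyperparameters and identifiers here are assumptions.

```python
# Rough sketch: entropic optimal transport (Sinkhorn) over in-batch similarities
# yields a soft image-text matching used as targets for contrastive learning.
# Illustrative only; details differ from OTTER's implementation.
import torch
import torch.nn.functional as F

def sinkhorn_soft_targets(image_emb, text_emb, epsilon=0.05, n_iters=3):
    """Return an (approximately) doubly normalized soft matching matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = image_emb @ text_emb.t()              # batch x batch similarities

    # Entropic OT: start from a Gibbs kernel, then alternate row/column scaling.
    Q = torch.exp(sim / epsilon)
    Q = Q / Q.sum()
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=1, keepdim=True)      # normalize rows
        Q = Q / Q.sum(dim=0, keepdim=True)      # normalize columns
    return Q / Q.sum(dim=1, keepdim=True)       # each row is a target distribution

def soft_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Cross-entropy against soft OT targets instead of one-hot pairings."""
    logits = (F.normalize(image_emb, dim=-1)
              @ F.normalize(text_emb, dim=-1).t()) / temperature
    with torch.no_grad():                       # targets are treated as labels
        targets = sinkhorn_soft_targets(image_emb, text_emb)
    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()
```

Computing the transport plan online on each batch keeps the extra cost small while letting a caption that loosely matches several images share probability mass across them instead of being forced onto a single pair.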
... Across 42 evaluations (7 dataset/architecture settings x 6 metrics), OTTER outperforms all baselines on 32 and ties on 2, i.e., 34 of 42.
Please like and share this post if you enjoyed it using the buttons at the bottom!
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website