Segment scenes using vision-language models, without any neural network training, with PnP-OVSS
Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models
arXiv paper abstract https://arxiv.org/abs/2311.17095
arXiv PDF paper https://arxiv.org/pdf/2311.17095.pdf
From an enormous amount of image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words.
... propose ... Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) ... leverages a VLM with direct text-to-image cross-attention and an image-text matching loss to produce semantic segmentation.
However, cross-attention alone tends to over-segment, whereas cross-attention plus GradCAM tend to under-segment.
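To make the cross-attention plus GradCAM idea concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the paper's code: it assumes we already have, for each class name, the text-to-image cross-attention over image patches and the gradient of the image-text matching loss with respect to that attention. The function names (gradcam_weighted_attention, label_patches) and the array shapes are made up for this example.

```python
import numpy as np

def gradcam_weighted_attention(attn, grad):
    """Weight cross-attention by its positive gradients, GradCAM style.

    attn, grad: arrays of shape (num_classes, num_heads, num_patches).
    Returns per-class patch saliency of shape (num_classes, num_patches).
    """
    # Keep only gradients that push the matching score up, then average over heads.
    weighted = attn * np.maximum(grad, 0.0)
    return weighted.mean(axis=1)

def label_patches(saliency, grid_hw):
    """Give each patch the class with the highest saliency (hypothetical helper)."""
    h, w = grid_hw
    return saliency.argmax(axis=0).reshape(h, w)

# Toy input: 2 classes, 1 attention head, a 2x2 grid of patches.
attn = np.array([[[0.4, 0.3, 0.2, 0.1]],
                 [[0.1, 0.2, 0.3, 0.4]]])
grad = np.array([[[1.0, 1.0, -1.0, -1.0]],
                 [[-1.0, -1.0, 1.0, 1.0]]])

saliency = gradcam_weighted_attention(attn, grad)
labels = label_patches(saliency, (2, 2))
```

In this toy input the top row of patches goes to class 0 and the bottom row to class 1. Because negative gradients are clipped to zero, patches with weak evidence lose their score entirely, which gives some intuition for why this combination can under-segment.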
To alleviate this issue, ... introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, ... are able to better resolve the entire extent of the segmentation mask.
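The iterative dropping described above can be sketched roughly as follows. This is a toy version under stated assumptions: the number of rounds, the fraction dropped per round, and summing the maps across rounds are choices made for this illustration, not values confirmed by the abstract. The names salience_dropout and toy_saliency are hypothetical, and toy_saliency stands in for a real VLM forward pass.

```python
import numpy as np

def salience_dropout(saliency_fn, num_patches, num_rounds=3, drop_fraction=0.5):
    """Repeatedly hide the most salient visible patches and accumulate saliency.

    saliency_fn(keep) takes a boolean mask of visible patches and returns one
    saliency value per patch. All parameters here are illustrative assumptions.
    """
    keep = np.ones(num_patches, dtype=bool)
    total = np.zeros(num_patches)
    for _ in range(num_rounds):
        if not keep.any():
            break
        # Hidden patches contribute nothing in this round.
        sal = np.where(keep, saliency_fn(keep), 0.0)
        total += sal
        # Drop the top fraction of the patches that are still visible.
        visible = np.flatnonzero(keep)
        n_drop = max(1, int(drop_fraction * visible.size))
        top = visible[np.argsort(sal[visible])[-n_drop:]]
        keep[top] = False
    return total

# Stand-in for a model: attention is shared among visible patches in
# proportion to fixed evidence, so strong patches overshadow weak ones.
EVIDENCE = np.array([8.0, 4.0, 2.0, 1.0, 0.0, 0.0])

def toy_saliency(keep):
    masked = EVIDENCE * keep
    s = masked.sum()
    return masked / s if s > 0 else np.zeros_like(masked)

single_pass = toy_saliency(np.ones(6, dtype=bool))
accumulated = salience_dropout(toy_saliency, num_patches=6)
```

In the toy model, patch 3 belongs to the object but gets little attention in a single pass because stronger patches dominate. Once those are dropped, its score rises in the accumulated map, while the two background patches stay at zero.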
... method does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set.
PnP-OVSS ... substantial improvements over a comparable baseline ... and even outperforms most baselines that conduct additional network training on top of pretrained VLMs.
Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact
Web site with my other posts by category https://morrislee1234.wixsite.com/website