Segment an object in an image described by text more simply using SeqTR

morrislee
Apr 1, 2022

SeqTR: A Simple yet Universal Network for Visual Grounding

arXiv paper abstract https://arxiv.org/abs/2203.16265v1

arXiv PDF paper https://arxiv.org/pdf/2203.16265v1.pdf

GitHub https://github.com/sean-zhuh/seqtr

... propose ... network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES).

... visual grounding often require substantial expertise in designing network architectures and loss functions, making them hard to generalize across tasks.

To simplify ... cast visual grounding as a point prediction problem conditioned on image and text inputs, where either the bounding box or binary mask is represented as a sequence of discrete coordinate tokens.
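To make the "sequence of discrete coordinate tokens" idea concrete, here is a minimal sketch of how points could be quantized into integer tokens and mapped back. It is not taken from the SeqTR code: the bin count `NUM_BINS`, the function names, and the bin-center decoding are all assumptions chosen for illustration. A bounding box is just the two-point case (top-left and bottom-right corners); a mask contour would pass more points through the same functions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

# Assumed size of the coordinate vocabulary (illustrative, not from the paper).
NUM_BINS = 1000


def points_to_tokens(points: Sequence[Point], width: int, height: int,
                     num_bins: int = NUM_BINS) -> List[int]:
    """Quantize (x, y) pixel coordinates into a flat list of integer tokens."""
    tokens: List[int] = []
    for x, y in points:
        # Scale each coordinate to a bin index and clamp to the valid range.
        tx = min(num_bins - 1, max(0, int(x * num_bins / width)))
        ty = min(num_bins - 1, max(0, int(y * num_bins / height)))
        tokens.extend([tx, ty])
    return tokens


def tokens_to_points(tokens: Sequence[int], width: int, height: int,
                     num_bins: int = NUM_BINS) -> List[Point]:
    """Map tokens back to pixel coordinates, using the center of each bin."""
    points: List[Point] = []
    for i in range(0, len(tokens), 2):
        x = (tokens[i] + 0.5) / num_bins * width
        y = (tokens[i + 1] + 0.5) / num_bins * height
        points.append((x, y))
    return points


# A box given by its top-left and bottom-right corners in a 640x480 image.
box_corners = [(48.0, 120.0), (400.0, 360.0)]
box_tokens = points_to_tokens(box_corners, width=640, height=480)
recovered = tokens_to_points(box_tokens, width=640, height=480)
```

The round trip loses at most one bin width per coordinate, so a finer vocabulary trades a larger output space for tighter localization.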

... visual grounding ... unified in ... SeqTR network without task-specific branches or heads, e.g., the convolutional mask decoder for RES, which greatly reduces the complexity of multi-task modeling.
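One way to see why a convolutional mask decoder might become unnecessary: if the network outputs contour points, a binary mask can be recovered by filling the polygon they describe. The sketch below does this with a plain even-odd ray-casting test at each pixel center. The excerpt above does not say how SeqTR assembles its masks, so the polygon fill and the function name `polygon_to_mask` are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def polygon_to_mask(points: Sequence[Point], width: int,
                    height: int) -> List[List[int]]:
    """Rasterize a closed polygon into a binary mask (even-odd rule)."""
    mask = [[0] * width for _ in range(height)]
    n = len(points)
    for row in range(height):
        py = row + 0.5  # test the center of each pixel
        for col in range(width):
            px = col + 0.5
            inside = False
            for i in range(n):
                x1, y1 = points[i]
                x2, y2 = points[(i + 1) % n]
                # Count edges that a rightward horizontal ray from the pixel crosses.
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            mask[row][col] = int(inside)
    return mask


# Hypothetical predicted contour: a square inside a 6x6 image.
contour = [(1.0, 1.0), (5.0, 1.0), (5.0, 5.0), (1.0, 5.0)]
mask = polygon_to_mask(contour, width=6, height=6)
```

This loop is written for clarity rather than speed; a practical pipeline would more likely use a library rasterizer.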

... SeqTR outperforms (or is on par with) the existing state-of-the-arts, proving that a simple yet universal approach for visual grounding is indeed feasible.

If you enjoyed this post, please like and share it using the buttons at the bottom!

Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact

Web site with my other posts by category https://morrislee1234.wixsite.com/website

LinkedIn https://www.linkedin.com/in/morris-lee-47877b7b

#ComputerVision #Segmentation #AINewsClips #AI #ML #ArtificialIntelligence #MachineLearning

News to help your R&D in artificial intelligence, machine learning, robotics, computer vision, smart hardware
