Transformer handles images, video, point clouds, and audio to understand the world

morrislee
Jul 7, 2021

Perceiver: General Perception with Iterative Attention

arXiv paper abstract https://arxiv.org/abs/2103.03206

arXiv PDF paper https://arxiv.org/pdf/2103.03206.pdf

Papers With Code https://paperswithcode.com/paper/perceiver-general-perception-with-iterative

Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc.

... introduce the Perceiver ... builds upon Transformers ... makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.
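
The paper's abstract goes on to describe an asymmetric attention mechanism that iteratively distills the inputs into a small latent bottleneck. Below is a minimal sketch of that idea in plain Python, not the authors' implementation: a handful of latent vectors repeatedly cross-attend to a much larger input array. The names `cross_attend`, `latents`, and `inputs` and the toy sizes are assumptions made for illustration. The sketch leaves out learned projections, residual connections, and the Transformer layers that the real model applies to the latents.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, inputs):
    """Each latent vector queries every input vector (single head, no learned weights).

    latents: N vectors of size d; inputs: M vectors of size d.
    Returns N updated latent vectors, so the output size does not depend on M.
    """
    d = len(latents[0])
    updated = []
    for q in latents:
        # Scaled dot-product score between this latent and every input element.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in inputs]
        weights = softmax(scores)
        # The attention-weighted average of the inputs becomes the new latent.
        updated.append([sum(w * v[j] for w, v in zip(weights, inputs)) for j in range(d)])
    return updated


random.seed(0)
d, num_inputs, num_latents = 4, 1000, 8  # toy sizes, chosen only for illustration
inputs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_inputs)]
latents = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_latents)]

# "Iterative": the same inputs are attended to repeatedly, refining the latents.
for _ in range(3):
    latents = cross_attend(latents, inputs)

print(len(latents), len(latents[0]))  # prints "8 4"
```

Each pass costs work proportional to (number of latents) x (number of inputs) rather than (number of inputs) squared, which is what lets this kind of model take in very large inputs.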

... this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio.

... performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels.
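
To put "50,000 pixels" in perspective, here is some rough arithmetic, assuming 224 x 224 ImageNet crops and a latent array of 512 vectors (both values are assumptions, not stated in this post). It compares the size of one attention map when every pixel attends to every pixel with the size when only the latents attend to the pixels. The helper `attention_entries` is a hypothetical name used only here.

```python
def attention_entries(num_queries, num_keys):
    """Number of query-key scores in a single attention map."""
    return num_queries * num_keys


height = width = 224         # assumed ImageNet crop size
num_pixels = height * width  # 50,176, i.e. roughly 50,000 pixels
num_latents = 512            # assumed latent array size

full = attention_entries(num_pixels, num_pixels)    # pixels attend to pixels
cross = attention_entries(num_latents, num_pixels)  # latents attend to pixels

print(num_pixels)     # 50176
print(full)           # 2517630976
print(cross)          # 25690112
print(full // cross)  # 98
```

Under these assumptions the cross-attention map is about 98 times smaller than full self-attention over pixels, and the gap widens as the input grows.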

It is also competitive in all modalities in AudioSet.

#ComputerVision #Transformers #AINewsClips #AI #ML #ArtificialIntelligence #MachineLearning
