
News to help your R&D in artificial intelligence, machine learning, robotics, computer vision, smart hardware


morrislee

Transformer handles images, video, point clouds, and audio to understand world



Perceiver: General Perception with Iterative Attention

arXiv paper abstract https://arxiv.org/abs/2103.03206



Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc.


... introduce the Perceiver ... builds upon Transformers ... makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.


... this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio.


... performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels.
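To make "directly attending to 50,000 pixels" concrete, here is a minimal NumPy sketch of the cross-attention idea the paper describes: a small latent array queries a very large input array, so the cost grows with (latents x inputs) rather than (inputs x inputs). The latent bottleneck comes from the paper itself and is not spelled out in the excerpts above. All sizes, weight matrices, and the function name `cross_attend` are hypothetical and chosen only for illustration. This is not the authors' implementation, and it leaves out multi-head attention, layer normalization, MLP blocks, positional encodings, and the latent self-attention layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs, w_q, w_k, w_v):
    """Single-head cross-attention: latents are queries, inputs are keys/values."""
    q = latents @ w_q                          # (N, D)
    k = inputs @ w_k                           # (M, D)
    v = inputs @ w_v                           # (M, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, M), not (M, M)
    weights = softmax(scores, axis=-1)         # each latent distributes attention over all inputs
    return weights @ v, weights                # (N, D), (N, M)

rng = np.random.default_rng(0)

M, C = 50_000, 3   # assumed: 50,000 pixels with 3 color channels each
N, D = 64, 32      # assumed: small latent array (hypothetical sizes)

inputs = rng.normal(size=(M, C))
latents = rng.normal(size=(N, D))

# Random projection weights stand in for learned parameters (assumption).
w_q = rng.normal(size=(D, D)) / np.sqrt(D)
w_k = rng.normal(size=(C, D)) / np.sqrt(C)
w_v = rng.normal(size=(C, D)) / np.sqrt(C)

# "Iterative attention": the latents revisit the same inputs more than once.
for _ in range(2):
    update, weights = cross_attend(latents, inputs, w_q, w_k, w_v)
    latents = latents + update                 # residual update of the latent array

print(latents.shape, weights.shape)
```

With these assumed sizes, each pass computes 64 x 50,000 = 3,200,000 attention scores. Full self-attention over the same pixels would need 50,000 x 50,000 = 2.5 billion, which is why a small latent array makes such large inputs practical.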


It is also competitive in all modalities in AudioSet.



If you enjoyed this post, please like and share it!


Stay up to date. Subscribe to my posts https://morrislee1234.wixsite.com/website/contact

Web site with my other posts by category https://morrislee1234.wixsite.com/website

