Transformer handles images, video, point clouds, and audio to understand the world
Perceiver: General Perception with Iterative Attention
arXiv paper abstract https://arxiv.org/abs/2103.03206
arXiv PDF paper https://arxiv.org/pdf/2103.03206.pdf
Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc.
... introduce the Perceiver ... builds upon Transformers ... makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.
... this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio.
... performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels.
It is also competitive in all modalities in AudioSet.
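To make the scaling claim concrete, here is a minimal sketch (not the paper's implementation) of the idea behind attending to a very large input from a much smaller set of latent vectors. It assumes a single attention head, omits the learned query/key/value projections, residual connections, and the self-attention layers the full model applies to its latents, and uses hypothetical toy sizes. The latent count of 512 in the final comparison is an illustrative assumption, not a figure taken from the excerpts above.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, inputs):
    """Each latent vector queries every input vector (single head).

    Learned projections are omitted for brevity, so this is plain
    scaled dot-product attention, not a full Perceiver layer.
    """
    dim = len(inputs[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in latents:
        scores = [scale * sum(a * b for a, b in zip(q, x)) for x in inputs]
        weights = softmax(scores)
        # Weighted average of the inputs: one output row per latent.
        updated.append([
            sum(w * x[d] for w, x in zip(weights, inputs))
            for d in range(dim)
        ])
    return updated


def attention_pairs(num_queries, num_keys):
    """Number of query-key scores one attention layer computes."""
    return num_queries * num_keys


rng = random.Random(0)
num_inputs, num_latents, dim = 200, 8, 4  # toy sizes (hypothetical)
inputs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_inputs)]
latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_latents)]

# "Iterative attention": the latents read the same inputs more than once.
for _ in range(2):
    latents = cross_attend(latents, inputs)

print(len(latents), len(latents[0]))    # 8 4
print(attention_pairs(50_000, 50_000))  # 2500000000 (pixel-to-pixel self-attention)
print(attention_pairs(512, 50_000))     # 25600000 (512 latents reading 50,000 pixels)
```

The output stays at the size of the latent array no matter how many inputs are read, and the number of attention scores grows linearly with the input length rather than quadratically. That is one way to see why attending directly to 50,000 pixels can be tractable.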