Vision Transformers vs. Convolutional Neural Networks

This blog post is inspired by the paper [AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE](https://arxiv.org/pdf/2010.11929.pdf) from Google's research team. The paper proposes applying a pure Transformer directly to sequences of image patches for image classification. When pre-trained on **large amounts of data**, the Vision Transformer (ViT) outperforms state-of-the-art convolutional networks on multiple benchmarks while requiring substantially fewer computational resources to train.

Transformers have become the model of choice in NLP thanks to their computational efficiency and scalability. In computer vision, convolutional neural network (CNN) architectures remain dominant, though some researchers have tried combining CNNs with self-attention. The authors experimented with applying a standard Transformer directly to images and found that, when trained on mid-sized datasets, the models reached only modest accuracy compared to ResNet-like architectures. When trained on **larger datasets**, however, ViT achieved excellent results, approaching or surpassing the state of the art on multiple image recognition benchmarks.
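To make the "image patches as words" idea concrete, here is a minimal sketch of the ViT input pipeline: the image is cut into 16x16 patches, each patch is linearly embedded, a learnable [class] token is prepended, and position embeddings are added before the sequence enters a standard Transformer encoder. This assumes PyTorch; the class name `PatchEmbedding` and the default sizes are illustrative, not taken from the paper's official code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly embed each one.

    Hypothetical sketch of the ViT input pipeline; names and defaults
    are illustrative, not taken from the paper's reference code.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is equivalent to slicing
        # the image into patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable [class] token and position embeddings, as in ViT.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768): one row per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the [class] token
        return x + self.pos_embed            # add position information

# The resulting token sequence feeds into a standard Transformer encoder:
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2)
print(encoder(tokens).shape)                 # torch.Size([2, 197, 768])
```

Note the design choice: because the only image-specific step is this patch embedding, everything downstream is an unmodified Transformer, which is what lets ViT reuse the scaling behavior Transformers show in NLP.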
Tags: Networks