Vision transformer

A Vision Transformer (ViT) is a transformer designed for computer vision. Transformers were introduced in 2017,[1] and have found widespread use in natural language processing. In 2020, they were adapted for computer vision, yielding ViT.[2] The basic structure is to break an input image into a series of patches, turn each patch into a token, and feed the tokens to a standard Transformer architecture.

The attention mechanism in a ViT repeatedly transforms the representation vectors of image patches, incorporating more and more semantic relations between the patches of an image. This is analogous to natural language processing, where representation vectors flowing through a Transformer incorporate more and more relations between words, from syntax to semantics.

ViT has found applications in image recognition, image segmentation, and autonomous driving.[3]

Architecture

The basic architecture, used by the original 2020 paper,[2] is as follows. In summary, it is a BERT-like encoder-only Transformer.

The input image is of type $\mathbb{R}^{H \times W \times C}$, where $H, W, C$ are height, width, channel (RGB). It is then split into square-shaped patches of type $\mathbb{R}^{P \times P \times C}$.
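The patch-splitting step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `split_into_patches`, the parameter `patch_size`, and the toy 4 x 4 image are assumptions made for the example, and plain Python lists are used only to keep the sketch dependency-free.

```python
def split_into_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Each patch covers patch_size x patch_size pixels and is returned as a
    flat list of patch_size * patch_size * C values. Patches are listed in
    row-major order (left to right, then top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append the C channel values
            patches.append(patch)
    return patches


# Toy 4 x 4 image with 3 channels; the pixel at (row, col) is [row, col, 0].
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
print(len(patches), len(patches[0]))  # number of patches, values per patch
```

With a 4 x 4 image and 2 x 2 patches, the sketch yields four patches of 2 * 2 * 3 = 12 values each.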

Each patch is pushed through a linear operator to obtain a vector (the "patch embedding"). The position of the patch is also transformed into a vector by "position encoding". The two vectors are added, and the resulting sequence of vectors is pushed through several Transformer encoder layers.
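The embedding step can be sketched as follows. The sizes, the helper names `linear` and `embed_patches`, and the random initial values are assumptions for illustration; in a trained model the weights and position vectors would be learned, and the Transformer encoder that consumes the tokens is omitted here.

```python
import random


def linear(vector, weight, bias):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def embed_patches(patches, weight, bias, position_embeddings):
    """Return one token per patch: patch embedding plus position embedding."""
    tokens = []
    for index, patch in enumerate(patches):
        patch_embedding = linear(patch, weight, bias)
        position = position_embeddings[index]
        tokens.append([p + q for p, q in zip(patch_embedding, position)])
    return tokens


rng = random.Random(0)
num_patches, patch_dim, model_dim = 4, 12, 8  # hypothetical sizes
patches = [[rng.random() for _ in range(patch_dim)] for _ in range(num_patches)]
# Random values stand in for parameters that would normally be trained.
weight = [[rng.gauss(0, 0.02) for _ in range(patch_dim)] for _ in range(model_dim)]
bias = [0.0] * model_dim
position_embeddings = [[rng.gauss(0, 0.02) for _ in range(model_dim)]
                       for _ in range(num_patches)]

tokens = embed_patches(patches, weight, bias, position_embeddings)
print(len(tokens), len(tokens[0]))  # one token of size model_dim per patch
```

Because the position vector is simply added to the patch embedding, two identical patches at different positions produce different tokens.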

Classification

The above architecture turns an image into a sequence of vector representations. To use the vector representation for downstream applications, one needs to add some network modules on top of it.

For example, to use it for classification, one can add a shallow MLP on top of it that outputs a probability distribution over classes. The original paper uses a linear-GeLU-linear-softmax network.[2]
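A linear-GeLU-linear-softmax head of this kind might look like the sketch below. The layer sizes, the random weights, and the input vector `representation` are hypothetical; the sketch only shows the order of operations, not the trained head of any particular model.

```python
import math
import random


def linear(vector, weight, bias):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def gelu(x):
    """Exact GELU: x multiplied by the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_head(vector, w1, b1, w2, b2):
    """linear -> GELU -> linear -> softmax."""
    hidden = [gelu(h) for h in linear(vector, w1, b1)]
    return softmax(linear(hidden, w2, b2))


rng = random.Random(0)
model_dim, hidden_dim, num_classes = 8, 16, 3  # hypothetical sizes
w1 = [[rng.gauss(0, 0.1) for _ in range(model_dim)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(num_classes)]
b2 = [0.0] * num_classes

representation = [rng.random() for _ in range(model_dim)]  # stand-in encoder output
probabilities = mlp_head(representation, w1, b1, w2, b2)
print(probabilities)  # one probability per class
```

The softmax at the end guarantees that the outputs are positive and sum to one, so they can be read as a distribution over classes.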

Vision Transformers

Vision Transformer Architecture for Image Classification

Transformers found their initial applications in natural language processing tasks, as demonstrated by language models such as BERT and GPT-3. By contrast, the typical image processing system uses a convolutional neural network (CNN). Well-known CNN architectures include Xception, ResNet, EfficientNet,[4] DenseNet,[5] and Inception.[3]

Transformers measure the relationships between pairs of input tokens (words in the case of text strings), termed attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel. However, computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT divides the image into small sections (e.g., 16x16 pixels) and computes relationships among these sections, at a drastically reduced cost. Each section is flattened into a vector and multiplied by the embedding matrix. A learnable position embedding is added to the result, and the resulting vectors are placed in a sequence that is fed to the transformer.[3]
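The saving can be made concrete with a little arithmetic. The sketch below assumes a 224 x 224 input, a resolution chosen here only for illustration, and counts the number of token pairs that attention has to consider when the tokens are single pixels versus 16 x 16 sections.

```python
def attention_pairs(height, width, unit):
    """Return (number of tokens, number of token pairs) for square units."""
    tokens = (height // unit) * (width // unit)
    return tokens, tokens * tokens


pixel_tokens, pixel_pairs = attention_pairs(224, 224, unit=1)
patch_tokens, patch_pairs = attention_pairs(224, 224, unit=16)

print(pixel_tokens, pixel_pairs)   # 50176 2517630976
print(patch_tokens, patch_pairs)   # 196 38416
print(pixel_pairs // patch_pairs)  # 65536
```

Grouping pixels into 16 x 16 sections cuts the token count by a factor of 256 and, because the cost is quadratic, the number of pairs by a factor of 256 squared.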

As in the case of BERT, a fundamental role in classification tasks is played by the class token, a special token whose output is used as the only input of the final MLP head, since it has been influenced by all the other tokens.
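The way the class token gathers information from every patch can be sketched with a single self-attention step. This is a simplified illustration under stated assumptions: one attention head, identity query/key/value projections instead of learned ones, and hand-picked toy vectors. In a real model the class token and the projections are learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    Returns the output vectors and the attention weights for every token.
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens))
                        for i in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights


class_token = [0.1, 0.1, 0.1, 0.1]  # toy values; learnable in practice
patch_tokens = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]]
sequence = [class_token] + patch_tokens  # class token goes first

outputs, weights = self_attention(sequence)
class_output = outputs[0]   # the vector that would be passed to the MLP head
class_weights = weights[0]  # how much the class token attends to each token
print(class_weights)
```

Every entry of `class_weights` is positive, so the class token's output is a mixture of all tokens in the sequence, which is why it can summarize the whole image for the classifier.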

The architecture for image classification is the most common and uses only the Transformer encoder to transform the input tokens. However, there are other applications in which the decoder part of the traditional Transformer architecture is also used.

History

In 2021 a pure transformer model demonstrated better performance and greater efficiency than CNNs on image classification.[3]

A study in June 2020 added a transformer backend to ResNet, which dramatically reduced costs and increased accuracy.[6][7][8]

In 2021, several important variants of the Vision Transformer were proposed. These variants are mainly intended to be more efficient, more accurate, or better suited to a specific domain. Among the most relevant is the Swin Transformer,[9] which, through modifications to the attention mechanism and a multi-stage approach, achieved state-of-the-art results on object detection datasets such as COCO. Another notable variant is the TimeSformer, designed for video understanding tasks, which captures spatial and temporal information through divided space-time attention.[10][11]

Comparison with Convolutional Neural Networks

Because ViT typically uses a comparatively large patch size, its performance depends more heavily than that of convolutional networks on choices such as the optimizer, dataset-specific hyperparameters, and network depth. Preprocessing with a layer of smaller, overlapping (stride < size) convolutional filters helps with performance and stability.[8]
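The difference between non-overlapping patches and overlapping filters (stride < size) can be illustrated along a single image dimension. The helper `window_starts` and the specific sizes below (a 224-pixel side, 16-pixel patches, a 3-pixel filter with stride 2, no padding) are illustrative assumptions, not values taken from the cited work.

```python
def window_starts(length, size, stride):
    """Start offsets of all windows of a given size that fit, without padding."""
    return list(range(0, length - size + 1, stride))


# Non-overlapping patches: stride equals size, so windows share no pixels.
patch_starts = window_starts(224, size=16, stride=16)
# Overlapping filters: stride is smaller than size, so neighbours share pixels.
filter_starts = window_starts(224, size=3, stride=2)

print(len(patch_starts))   # 14 window positions along one side
print(len(filter_starts))  # 111 window positions along one side
print(3 - 2)               # pixels shared by adjacent filter windows: 1
```

With stride equal to size, each pixel belongs to exactly one window; with a smaller stride, adjacent windows share `size - stride` pixels, so information at patch borders is not cut off abruptly.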

In the hybrid design that adds a transformer to a CNN, the CNN translates from the basic pixel level to a feature map. A tokenizer translates the feature map into a series of tokens that are then fed into the transformer, which applies the attention mechanism to produce a series of output tokens. Finally, a projector reconnects the output tokens to the feature map, which allows the analysis to exploit potentially significant pixel-level details. This design drastically reduces the number of tokens that need to be analyzed, reducing costs accordingly.[6]

CNNs and Vision Transformers differ in many ways, mainly as a consequence of their architectures.

CNNs achieve excellent results even when trained on data volumes much smaller than those required by Vision Transformers.

This difference in behaviour seems to derive from the different inductive biases of the two architectures. The filter-oriented architecture of CNNs lets these networks grasp the particularities of the analysed images more quickly, but it also limits them, making it harder to capture global relations.[12][13]

Vision Transformers, on the other hand, have a bias toward exploring topological relationships between patches, which lets them capture global and longer-range relations, but at the cost of training that is more demanding in terms of data.

Vision Transformers also proved to be much more robust to input image distortions such as adversarial patches or permutations.[14]

However, choosing one architecture over the other is not always the best option: excellent results have been obtained in several computer vision tasks with hybrid architectures that combine convolutional layers with Vision Transformers.[15][16][17]

The Role of Self-Supervised Learning

The considerable need for data during the training phase has made it essential to find alternative methods to train these models,[18] and self-supervised methods now play a central role. With these approaches, a neural network can be trained almost autonomously, deducing the peculiarities of a specific problem without the need to build a large dataset or provide accurately assigned labels. Being able to train a Vision Transformer without a huge vision dataset could be the key to the widespread adoption of this promising architecture.

Applications

Vision Transformers have been used in many computer vision tasks with excellent results, in some cases achieving state-of-the-art performance.

Among the most relevant areas of application are:

Implementations

Many open-source implementations of Vision Transformers and their variants are available online. The main versions of this architecture have been implemented in PyTorch,[19] and implementations are also available for TensorFlow.[20]

References

  1. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2017-12-05). "Attention Is All You Need". arXiv:1706.03762 [cs.CL].
  2. Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929 [cs.CV].
  3. Sarkar, Arjun (2021-05-20). "Are Transformers better than CNN's at Image Recognition?". Medium. Retrieved 2021-07-11.
  4. Tan, Mingxing; Le, Quoc V. (2021-06-23). "EfficientNet V2: Smaller Models and Faster Training". arXiv:2104.00298 [cs.CV].
  5. Huang, Gao; Liu, Zhuang; van der Maaten, Laurens; Weinberger, Kilian Q. (2018-01-28). "Densely Connected Convolutional Networks". arXiv:1608.06993 [cs.CV].
  6. Synced (2020-06-12). "Facebook and UC Berkeley Boost CV Performance and Lower Compute Cost With Visual Transformers". Medium. Retrieved 2021-07-11.
  7. Wu, Bichen; Xu, Chenfeng; Dai, Xiaoliang; Wan, Alvin; Zhang, Peizhao; Yan, Zhicheng; Tomizuka, Masayoshi; Gonzalez, Joseph; Keutzer, Kurt; Vajda, Peter (2020). "Visual Transformers: Token-based Image Representation and Processing for Computer Vision". arXiv:2006.03677 [cs.CV].
  8. Xiao, Tete; Singh, Mannat; Mintun, Eric; Darrell, Trevor; Dollár, Piotr; Girshick, Ross (2021-06-28). "Early Convolutions Help Transformers See Better". arXiv:2106.14881 [cs.CV].
  9. Liu, Ze; Lin, Yutong; Cao, Yue; Hu, Han; Wei, Yixuan; Zhang, Zheng; Lin, Stephen; Guo, Baining (2021-03-25). "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows". arXiv:2103.14030 [cs.CV].
  10. Bertasius, Gedas; Wang, Heng; Torresani, Lorenzo (2021-02-09). "Is Space-Time Attention All You Need for Video Understanding?". arXiv:2102.05095 [cs.CV].
  11. Coccomini, Davide (2021-03-31). "On Transformers, TimeSformers, and Attention. An exciting revolution from text to videos". Towards Data Science.
  12. Raghu, Maithra; Unterthiner, Thomas; Kornblith, Simon; Zhang, Chiyuan; Dosovitskiy, Alexey (2021-08-19). "Do Vision Transformers See Like Convolutional Neural Networks?". arXiv:2108.08810 [cs.CV].
  13. Coccomini, Davide (2021-07-24). "Vision Transformers or Convolutional Neural Networks? Both!". Towards Data Science.
  14. Naseer, Muzammal; Ranasinghe, Kanchana; Khan, Salman; Hayat, Munawar; Khan, Fahad Shahbaz; Yang, Ming-Hsuan (2021-05-21). "Intriguing Properties of Vision Transformers". arXiv:2105.10497 [cs.CV].
  15. Dai, Zihang; Liu, Hanxiao; Le, Quoc V.; Tan, Mingxing (2021-06-09). "CoAtNet: Marrying Convolution and Attention for All Data Sizes". arXiv:2106.04803 [cs.CV].
  16. Wu, Haiping; Xiao, Bin; Codella, Noel; Liu, Mengchen; Dai, Xiyang; Yuan, Lu; Zhang, Lei (2021-03-29). "CvT: Introducing Convolutions to Vision Transformers". arXiv:2103.15808 [cs.CV].
  17. Coccomini, Davide; Messina, Nicola; Gennaro, Claudio; Falchi, Fabrizio (2022). "Combining Efficient Net and Vision Transformers for Video Deepfake Detection". Image Analysis and Processing – ICIAP 2022. Lecture Notes in Computer Science. Vol. 13233. pp. 219–229. arXiv:2107.02612. doi:10.1007/978-3-031-06433-3_19. ISBN 978-3-031-06432-6. S2CID 235742764.
  18. Coccomini, Davide (2021-07-24). "Self-Supervised Learning in Vision Transformers". Towards Data Science.
  19. vit-pytorch on GitHub
  20. Salama, Khalid (2021-01-18). "Image classification with Vision Transformer". keras.io.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.