June 2, 2022 · Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren
snap-research/EfficientFormer
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design choices such as the attention mechanism, ViT-based models are generally several times slower than lightweight convolutional networks. Deploying ViT for real-time applications is therefore particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computational complexity of ViT through network architecture search or hybrid designs with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. We then introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to obtain a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on an iPhone 12 (compiled with CoreML), running as fast as MobileNetV2 (1.7 ms, 71.8% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.
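For readers who want to try the models, the sketch below shows one plausible usage path: loading a pretrained variant and exporting it to Core ML, which is the deployment format behind the iPhone 12 latency numbers quoted above. The model name efficientformer_l1 and the use of timm are assumptions for illustration; the snap-research/EfficientFormer repository is the authoritative source for the official checkpoints and export scripts.

# Minimal sketch, assuming the checkpoint is registered in timm as
# "efficientformer_l1" and that coremltools is installed; see the
# snap-research/EfficientFormer repo for the official pipeline.
import torch
import timm
import coremltools as ct

# Load the smallest variant (79.2% top-1 on ImageNet-1K per the abstract).
model = timm.create_model("efficientformer_l1", pretrained=True).eval()

# One 224x224 ImageNet-style input, batch size 1.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)  # shape (1, 1000): ImageNet-1K class scores

# Trace and convert to Core ML, mirroring the paper's deployment path;
# latency itself is then measured on-device (e.g. iPhone 12), not in Python.
traced = torch.jit.trace(model, x)
mlmodel = ct.convert(traced, inputs=[ct.TensorType(shape=x.shape)],
                     convert_to="mlprogram")
mlmodel.save("efficientformer_l1.mlpackage")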