Thesis defense of Juan Borrego Carazo
Thesis defense of Juan Borrego Carazo on 21 November at 12:00, Sala de Graus of the Escola d'Enginyeria – Edifici Q.
PhD candidate: Juan Borrego Carazo.
Title: Efficient Neural Network Inference for Resource Constrained Devices.
Supervisors: Jordi Carrabina Bordoll, David Castells Rufas.
Defense date and time: 21/11/2022, 12:00.
Defense venue: Sala de Graus of the Escola d'Enginyeria – Edifici Q.
PhD program: Electronic and Telecommunication Engineering.
Department where the thesis is registered: Department of Microelectronics and Electronic Systems.
Abstract
The last decade's advances in deep learning have produced a great leap in state-of-the-art results on tasks such as image classification, language translation, and many others. This success, however, has come with a corresponding increase in model complexity and size, raising the hardware requirements for both training and inference (which were, generally and initially, limited to GPUs). Moreover, hardware capabilities (OPS performance, memory, throughput, latency, energy) initially limited the deployment of such applications on resource-constrained platforms, such as mobile or embedded devices.
There have been many initiatives to reduce training time and energy costs and to improve data efficiency during the development phase. Equally, there has been extensive research into optimizing deep learning models with a focus on inference and deployment: decreasing model complexity, size, latency, and memory consumption. In this direction, five optimization methods have stood out: pruning, quantization, neural architecture search, efficient operations, and distillation.
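As a minimal illustration of the first of these methods, the following NumPy sketch performs one-shot magnitude pruning, zeroing the smallest-magnitude weights of a layer until a target sparsity is reached (the function name and setup are illustrative, not the thesis framework's API):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights so that roughly
    `sparsity` fraction of the tensor becomes zero."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_pruned = magnitude_prune(w, sparsity=0.9)
print(np.mean(w_pruned == 0))  # roughly 0.9 of the weights are now zero
```

Structured pruning, as used in the bronchoscopy case below, removes whole channels or filters instead of individual weights, which translates more directly into latency gains on real hardware.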
In parallel, to enable inference deployment on specialized hardware platforms, new frameworks have appeared (such as CMSIS-NN or uTensor for MCUs, and TF Lite for mobile platforms). These frameworks include several features for model deployment, but the crucial point is whether they support the specific model operations and optimizations, since that determines whether the final application can be deployed at all.
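When such frameworks lower a model to int8, they typically rely on per-tensor affine quantization, x ≈ scale · (q − zero_point). The following NumPy sketch shows that scheme (an illustrative simplification, not any framework's actual implementation):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Per-tensor asymmetric affine quantization: x ~= scale * (q - zero_point)."""
    x_min = min(float(x.min()), 0.0)  # range must include zero so that
    x_max = max(float(x.max()), 0.0)  # real 0.0 maps to an exact integer
    scale = (x_max - x_min) / 255.0 or 1.0
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return scale * (q.astype(np.float32) - zero_point)

x = np.linspace(-1.0, 1.0, 11, dtype=np.float32)
q, s, z = quantize_int8(x)
max_err = np.max(np.abs(dequantize(q, s, z) - x))  # bounded by the scale
```

Whether a given converter supports the quantized counterparts of every operation in the graph is exactly the compatibility question raised above: one unsupported op can block deployment of the whole model.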
All in all, from optimization procedures to conversion and deployment frameworks, the process of developing efficient NN-based models and deploying them on constrained hardware has certainly improved, albeit with some remaining limitations. This thesis is framed by those improvements and limitations: first, the development and improvement of NN optimization techniques, and second, the use and development of software for porting the optimized models. All with a special focus on three industrial and practical cases that are the main drivers of the developments: automotive human-machine interaction, ITM in mobile devices, and bronchoscopy guidance.
In the first case, we show the deployment and optimization of RNNs on MCUs, as well as the use and improvement of Bayesian optimization and NAS methods to deliver minimal but well-performing networks. Altogether, we deliver a framework for automatically converting and deploying networks on Cortex-M-based MCUs. In the second case, we employ quantization and efficient operations to bring an ITM network to mobile devices for efficient inference, providing latency improvements of up to 100x with only a 3% accuracy loss. Finally, we develop an efficient bronchoscopy guidance network with structured pruning and efficient operations, which provides a 4x reduction in NN size and an improvement of ≈14% in accuracy for position localization.