WPI - Computer Science Department, PhD Dissertation Defense, Samuel S. Ogden : Mobile-Oriented Deep Learning Inference: Fine- and Coarse-grained Approaches

Wednesday, June 29, 2022
10:00 am to 11:00 am


Location: Fuller Labs 320

Committee Members:

Prof. Tian Guo, WPI – Computer Science (Advisor)

Prof. Emmanuel Agu, WPI – Computer Science

Prof. Craig Shue, WPI – Computer Science

Prof. Xiangnan Kong, WPI – Computer Science

Prof. Yue Cheng, George Mason University – Computer Science (External Committee Member)

Abstract:

Deep learning is becoming a ubiquitous component of mobile applications. However, leveraging deep learning models faces several core challenges. Chief among these is that the accuracy of deep learning models comes from their high resource demands, which are inherently at odds with constrained mobile resources. While offloading computation is a common technique, access to remote resources is only possible across highly variable networks. Further, managing resources to execute these models effectively in the cloud is difficult. Taken together, these challenges make it difficult both to execute models on-device and to serve models using remote execution.

In this dissertation, I argue that addressing these challenges should be done from a mobile-oriented perspective. I approach the problem of serving deep learning models as a mobile-oriented task, enabling adaptations based on mobile resource constraints, network variation, and the demands of a large and disparate workload. I do this by focusing on individual requests, adapting their execution to enable timely responses, and considering the impact of each model's resource needs on the overall workload. Finally, I present an approach for selecting the execution location at runtime, enabling both low-latency and high-accuracy executions for a wide range of applications while reducing the usage of cloud-based resources.

To this end, my research has three core components. First, I address how to improve the response latency and accuracy of individual inference requests. Through characterization and modeling of input data processing and transfer, I reduce response latency. Further, I use this modeling to enable time budgets and improve accuracy for deep learning serving. Second, I address resource management constraints for deep learning serving by analyzing real-world traces, demonstrating the need for model-level caching to enable the scaling of inference serving systems, and proposing an initial system to demonstrate the validity of this approach.
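The model-level caching idea above can be illustrated with a minimal sketch: keep loaded models in memory under a fixed budget and evict the least recently used one on overflow. This is an assumption-laden illustration (the class name, sizes, and LRU policy are mine, not the dissertation's actual system):

```python
from collections import OrderedDict

class ModelCache:
    """Hypothetical LRU cache of loaded models under a memory budget."""

    def __init__(self, budget_mb, loader):
        self.budget_mb = budget_mb  # memory available for cached models
        self.loader = loader        # callable: model_id -> (model, size_mb)
        self.cache = OrderedDict()  # model_id -> (model, size_mb), LRU order
        self.used_mb = 0

    def get(self, model_id):
        if model_id in self.cache:
            # Cache hit: serve immediately and refresh recency.
            self.cache.move_to_end(model_id)
            return self.cache[model_id][0]
        # Cache miss: pay the model-load latency once.
        model, size_mb = self.loader(model_id)
        while self.cache and self.used_mb + size_mb > self.budget_mb:
            # Evict least recently used models until the new one fits.
            _, (_, evicted_mb) = self.cache.popitem(last=False)
            self.used_mb -= evicted_mb
        self.cache[model_id] = (model, size_mb)
        self.used_mb += size_mb
        return model
```

Under this sketch, popular models stay resident and only cold models pay load latency, which is the scaling property the trace analysis motivates.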

Finally, I present recent research that leverages both on-device and in-cloud resources to support execution across a wide range of latency and accuracy targets. By using both sets of resources, the system can take advantage of high-accuracy models in the cloud when time allows, while avoiding network transfer time when necessary to meet SLOs. Overall, this system achieves good latency-target attainment with high accuracy across a range of demands while decreasing reliance on cloud resources.
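The runtime location choice described above can be sketched as a simple latency-budget check: estimate the cloud path (upload plus cloud inference) and prefer it for its higher accuracy whenever it fits the SLO, otherwise run on-device. All names, estimates, and parameters here are illustrative assumptions, not the dissertation's actual policy:

```python
def choose_location(input_mb, slo_ms, bandwidth_mbps,
                    cloud_infer_ms, device_infer_ms):
    """Pick an execution location for one request (illustrative sketch)."""
    # Estimated upload time for the input over the current network.
    transfer_ms = input_mb * 8 / bandwidth_mbps * 1000
    cloud_total_ms = transfer_ms + cloud_infer_ms
    # Prefer the higher-accuracy cloud model whenever it can meet the SLO.
    if cloud_total_ms <= slo_ms:
        return "cloud"
    # Otherwise run on-device, avoiding the variable network entirely.
    return "device"
```

On a fast network the cloud path wins; as bandwidth drops or the SLO tightens, the same request falls back to the on-device model, which is the adaptation the abstract describes.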



Sam Ogden