Samuel S. Ogden
WPI – Computer Science Department
Thursday, July 24, 2021
Time: 10:00 AM – 12:00 PM
Prof. Tian Guo (Advisor), WPI – Computer Science
Prof. Emmanuel Agu, WPI – Computer Science
Prof. Craig Shue, WPI – Computer Science
Prof. Xiangnan Kong, WPI – Computer Science
Prof. Yue Cheng, George Mason University (External Committee Member)
Deep learning is becoming a ubiquitous component of mobile applications. However, leveraging it faces several core challenges. Chief among these is that the accuracy of deep learning models comes at the cost of high resource demand, which is inherently at odds with constrained mobile resources. While offloading computation is a common technique, access to remote resources is possible only across highly variable networks. Further, managing cloud resources to execute these models effectively is difficult. Taken together, these challenges make it difficult both to execute models on-device and to serve models via remote execution.
In this proposal, I argue that these challenges should be addressed from a mobile-oriented perspective. I approach serving deep learning models as a mobile-oriented task, enabling adaptation to mobile devices' resource constraints, their network variation, and the demands of a large and disparate workload. I do this by focusing on individual requests, adapting their execution to enable timely responses, and considering the impact of model resource needs on the overall workload. Finally, I propose research into a scheduler for deep learning inference pipelines that will schedule complex multi-model inference tasks across mobile- and cloud-based resources to improve response latency and resource management.
To this end, my research has three core components. First, I address how to improve the response latency and accuracy of individual inference requests: by characterizing and modeling input data processing and transfer, I reduce response latency, and I use these models to enable time budgets and improve accuracy for deep learning serving. Second, I address resource management constraints for deep learning serving by analyzing real-world traces, demonstrating the need for model-level caching to enable inference serving systems to scale, and proposing an initial system to validate this approach.
Finally, I propose research to leverage both on-device and in-cloud resources for complex inference jobs that consist of many individual inference tasks. Specifically, I propose a decentralized inference pipeline scheduler that assigns individual executions of deep learning models across mobile and cloud devices. By leveraging on-device resources alongside cloud-based resources, this scheduler will reduce response latency by avoiding unnecessary transfer of intermediate data and resource-related delays for in-cloud execution, while also improving resource management.