Computer Science Department, PhD Dissertation Defense, Guinevere Gilman: "Resource-Efficient Scheduling of Concurrent Deep Learning Workloads on General Purpose GPUs"
Guinevere Gilman
PhD Candidate
WPI – Computer Science Department
Tuesday, November 18, 2025
Time: 2:00 p.m. – 3:00 p.m.
Location: Unity Hall 243
Zoom link: https://wpi.zoom.us/j/7892087030
Committee Members:
Prof. Robert J. Walls, WPI - Computer Science (Advisor)
Prof. Tian Guo, WPI - Computer Science
Prof. Charles Davis Roberts, WPI - Computer Science
Dr. Neal Crago - NVIDIA (External Member)
Abstract:
As general purpose GPUs become more powerful, it becomes more difficult for any single GPU application to make use of the entirety of a GPU's computational resources. A common solution to this problem is concurrency, where multiple GPU applications are executed together on a single GPU so that more of its resources are utilized. However, this poses a novel challenge: a latency-sensitive task may experience lower responsiveness and higher turnaround times if the tasks it is co-executed with are occupying the resources it needs when it arrives at the GPU.
The main premise of this dissertation is that with fine-grained control of GPU resources and awareness of resource saturation points, we can leverage the colocation of thread blocks from different tasks on the same streaming multiprocessors to increase system utilization while protecting the quality-of-service requirements of latency-sensitive tasks. We first provide a characterization of the performance of currently available concurrency mechanisms on NVIDIA GPUs, analyzing how their lack of flexibility and coarse-grained resource control can lead to unpredictable performance and reduced system utilization. Next, we present a set of kernel profiles, hardware mechanisms, and scheduling components designed to reduce resource contention between tasks and effectively prioritize the execution of high-priority latency-sensitive tasks over best-effort low-priority tasks. Finally, we outline a thread block scheduling policy and register oversubscription mechanism that allow the GPU to maintain responsiveness for high-priority tasks while also creating opportunities to increase throughput through the use of spatial concurrency.