DS Ph.D. Dissertation Defense | Dongyu Zhang | Tuesday, April 23rd @ 1:00PM - Rec Center, RC61
1:00 pm to 2:00 pm
DATA SCIENCE
Dongyu Zhang
PhD Dissertation Defense
Time: Tuesday, April 23, 1:00 PM - 2:00 PM
Location: Rec Center, RC61
Instructions to access the room: Enter through the front doors of the Rec Center, then head right to the staircase next to the elevator and go up to the 4th floor. On the 4th floor, just outside the elevator, there is a small corridor. Follow the corridor and open the doors on the right leading to the RC61 meeting rooms.
Dissertation Committee:
Dr. Elke A. Rundensteiner, Worcester Polytechnic Institute, Advisor
Dr. Xiangnan Kong, Worcester Polytechnic Institute
Dr. Nima Kordzadeh, Worcester Polytechnic Institute
Dr. Liang Wang, Visa Research
Title: Harnessing Incomplete, Noisy, and Multi-level Labels for Classification and Annotation Tasks
Abstract:
Deep learning models excel in various tasks but require large amounts of accurate labels. Unfortunately, acquiring quality labels is costly and requires domain expertise. Hence, datasets tend to have missing or noisy labels. Additionally, data might be labeled on multiple levels. For instance, in detecting foodborne illness incidents from a tweet, the aim at the tweet level is to predict illness indication, while at the word level, it is to identify relevant slots like location or food group. However, due to the challenges and costs in acquiring labels for both levels, these levels may have incomplete or noisy labels. This dissertation explores three directions for handling incomplete, noisy, and multi-level labeled data. Direction 1 learns from two-level task datasets where one task has complete labels and the other has incomplete labels. We propose a novel deep learning solution that integrates joint learning of tasks at both levels and strikes a balance between the fully labeled and incompletely labeled tasks. Direction 2 focuses on learning from data with noisy labels. We propose a method that harnesses the Local Intrinsic Dimensionality (LID) score to detect and correct noisy labels. Direction 3 develops strategies for annotating two-level labeled data given mostly unlabeled instances. We develop a Large Language Model (LLM)-based solution that uniquely capitalizes on the relationship between the two levels and integrates multi-example retrieval methods. Our experimental studies on real-world domains demonstrate that our proposed methods outperform state-of-the-art methods for each of these difficult label-related challenges.