DS Ph.D. Dissertation Defense | Dongyu Zhang | Tuesday, April 23rd @ 1:00PM - Rec Center, RC61

Tuesday, April 23, 2024
1:00 pm to 2:00 pm
Floor/Room #
RC61

DATA SCIENCE

Dongyu Zhang

PhD Dissertation Defense

Time: Tuesday, April 23, 1:00 PM to 2:00 PM

Location: Rec Center, RC61

 

Instructions to access the room: Enter through the front doors of the Rec Center, then head right to the staircase next to the elevator and go up to the 4th floor. On the 4th floor, just outside the elevator, there is a small corridor. Follow the corridor and go through the doors on the right, which lead to the RC61 meeting rooms.

Dissertation Committee

Dr. Elke A. Rundensteiner, Worcester Polytechnic Institute, Advisor 

Dr. Xiangnan Kong, Worcester Polytechnic Institute 

Dr. Nima Kordzadeh, Worcester Polytechnic Institute 

Dr. Liang Wang, Visa Research 

 

Title: Harnessing Incomplete, Noisy, and Multi-level Labels for Classification and Annotation Tasks  

 

Abstract

Deep learning models excel at various tasks but require large amounts of accurate labels. Unfortunately, acquiring quality labels is costly and requires domain expertise. Hence, datasets tend to have missing or noisy labels. Additionally, data might be labeled on multiple levels. For instance, in detecting foodborne illness incidents from a tweet, the aim at the tweet level is to predict whether the tweet indicates illness, while at the word level, it is to identify relevant slots such as location or food group. However, due to the challenges and costs of acquiring labels for both levels, either level may have incomplete or noisy labels. This dissertation explores three directions for handling incomplete, noisy, and multi-level labeled data. Direction 1 learns from two-level task datasets where one task has complete labels and the other has incomplete labels. We propose a novel deep learning solution that learns the tasks at both levels jointly and strikes a balance between the fully labeled and incompletely labeled tasks. Direction 2 focuses on learning from data with noisy labels. We propose a method that harnesses the Local Intrinsic Dimensionality (LID) score to detect and correct noisy labels. Direction 3 develops strategies for annotating two-level labeled data given mostly unlabeled instances. We develop a Large Language Model (LLM)-based solution that uniquely capitalizes on the relationship between the two levels and integrates multi-example retrieval methods. Our experimental studies on real-world domains demonstrate that our proposed methods outperform state-of-the-art methods for each of these difficult label-related challenges.


 

DEPARTMENT(S):

Data Science
Contact Person
Kelsey Briggs
