DS Ph.D. Dissertation Proposal | Dongyu Zhang | Tuesday, Nov. 28th @ 10:00AM

Tuesday, November 28, 2023
10:00 am to 12:00 pm

MA
United States

Floor/Room: Mid-Century Room

DATA SCIENCE 

Ph.D. Dissertation Proposal

Dongyu Zhang, Ph.D. Candidate

Tuesday, Nov. 28th, 2023 | 10:00AM - 12:00PM EST 

Location: Campus Center, Mid-Century Room

Dissertation Committee: 

Dr. Elke A. Rundensteiner, Worcester Polytechnic Institute, Advisor 

Dr. Xiangnan Kong, Worcester Polytechnic Institute 

Dr. Nima Kordzadeh, Worcester Polytechnic Institute 

Dr. Liang Wang, Visa Research 

Title: Learning with Incomplete, Inaccurate, and Multi-level Labeled Data  

Abstract: 

Deep learning models excel at a wide range of tasks but require large amounts of accurately labeled data. Unfortunately, acquiring high-quality labels is costly and requires domain expertise, so datasets tend to have missing or noisy labels. Additionally, data may be labeled at multiple levels. For instance, when detecting foodborne illness incidents from tweets, the tweet-level task is to predict whether a tweet indicates an illness, while the word-level task is to identify relevant slots such as location or food group. Both levels may have missing and noisy labels, with label quality and completeness potentially varying across levels.

This dissertation explores three directions for handling incomplete, noisy, and multi-level labeled data. Direction 1 aims to learn from two-level datasets in which one task has complete labels and the other has incomplete labels. We propose a novel solution that learns the tasks at both levels jointly and balances the fully labeled task against the incompletely labeled one. Direction 2 focuses on learning with noisily labeled data. We propose a method that uses the Local Intrinsic Dimensionality (LID) score to detect and correct noisy labels. Direction 3 aims to learn from two-level labeled data that exhibits both incomplete and noisy labels. We plan to capitalize on the relationship between the tasks and to integrate weak labels obtained from Large Language Models (LLMs) to achieve better performance.
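For readers unfamiliar with the LID score mentioned under Direction 2, the following is a minimal sketch of one commonly used maximum-likelihood LID estimator based on k-nearest-neighbor distances. It is not the method proposed in the dissertation, which the abstract does not specify; the function name `lid_mle`, the parameters `k` and `exclude_self`, and the use of Euclidean distance on feature vectors are all assumptions made for illustration.

```python
import numpy as np


def lid_mle(query, reference, k=20, exclude_self=False):
    """Estimate LID for each row of `query` against `reference`.

    Uses the maximum-likelihood form
        LID(x) = -1 / mean_i( log(r_i / r_k) ),
    where r_1 <= ... <= r_k are distances from x to its k nearest
    reference points. Hypothetical helper, for illustration only.
    """
    query = np.asarray(query, dtype=float)
    reference = np.asarray(reference, dtype=float)

    # Pairwise Euclidean distances, shape (n_query, n_reference).
    diffs = query[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists.sort(axis=1)

    # When query points are also in the reference set, the nearest
    # "neighbor" is the point itself at distance 0, so skip it.
    start = 1 if exclude_self else 0
    knn = dists[:, start:start + k]

    # Clip to avoid log(0) for exact duplicates.
    knn = np.maximum(knn, 1e-12)
    log_ratios = np.log(knn / knn[:, -1:])

    # The mean is <= 0; cap it just below zero to avoid dividing by 0
    # when all k neighbors are equidistant.
    mean_log = np.minimum(log_ratios.mean(axis=1), -1e-12)
    return -1.0 / mean_log
```

In noisy-label settings, scores like this are typically computed on a model's learned representations within a mini-batch, with unusually high values treated as a warning sign; how the proposed method detects and corrects labels from the score is not described in the abstract.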

To validate the effectiveness of our proposed methods, we have conducted preliminary experimental studies in real-world domains, comparing them with state-of-the-art methods. The results demonstrate that our proposed methods outperform these state-of-the-art methods across the label-related challenges described above.


Department(s): Data Science

Contact Person: Kelsey Briggs