Data Science Ph.D. Defense | Walter Gerych | Leveraging Mislabeled Datasets And Improving Imperfect Pretrained Models
9:00 am to 10:00 am
Ph.D. Dissertation Defense
Location: Innovation Studio 105
Dr. Elke Rundensteiner, Professor, WPI. Advisor.
Dr. Emmanuel Agu, Professor, WPI. Advisor.
Dr. Oren Mangoubi, Assistant Professor, WPI.
Dr. Adam Kalai, Senior Principal Researcher, Microsoft Research New England. External member.
Working With What You've Got:
Leveraging Mislabeled Datasets And Improving Imperfect Pretrained Models
Resources such as OpenML and HuggingFace have made large datasets and powerful pretrained models more accessible than ever before. However, the large-scale datasets typically used to train deep learning systems are often plagued by noisy labels, where the labels associated with many data points are incorrect, as well as by sampling bias that favors data from certain demographics over others. Likewise, many pretrained models exhibit biased outputs and lack the full range of functionality desired by the end user.
In this dissertation, I study four tasks relating to data and model quality issues. The first two tasks concern data quality; specifically, I focus on the understudied Positive Unlabeled (PU) setting for noisy labels, in which the label noise is one-sided. In Task 1, I extend methods for learning from PU data - which typically support only binary classification - to multi-label data and classifiers. To do this, I formalize a novel loss function for training models that are unbiased on the distribution of clean data given only noisy PU data. In Task 2, I study PU learning under the more realistic scenario of a biased sampling strategy that yields unrepresentative labels, and propose two strategies for identifiable biased PU learning.

In Tasks 3 and 4, I move on to methods for improving existing pretrained generative models. Task 3 focuses on debiasing pretrained generative models: I propose a principled approach for re-sampling from the generator's latent space so that roughly equal numbers of samples are drawn from each semantic group. Lastly, in Task 4, I propose an approach for converting pretrained unconditional generative models into conditional models that can be made to sample from specific classes. I achieve this by identifying and removing regions of the latent space that correspond to low-density regions in the output space, and then clustering the remaining regions, each of which corresponds to a semantically meaningful sub-manifold in the output space (e.g., a particular class). I then sample adaptively from each sub-manifold.
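To give a flavor of the kind of objective Task 1 builds on, the sketch below shows the standard unbiased PU risk estimator for the binary case (in the style of du Plessis et al.), where the negative-class risk is recovered from unlabeled data by subtracting the positive contribution. This is only an illustrative sketch, not the dissertation's multi-label loss; the sigmoid surrogate loss, the function names, and a known class prior are all assumptions made for the example.

```python
import numpy as np

def sigmoid_loss(margin):
    # Smooth, bounded surrogate for the 0-1 loss: l(f(x), y) with margin = y * f(x).
    return 1.0 / (1.0 + np.exp(margin))

def unbiased_pu_risk(scores_p, scores_u, prior):
    """Unbiased binary risk estimate from PU data (illustrative sketch).

    scores_p: classifier scores f(x) on labeled-positive examples
    scores_u: classifier scores f(x) on unlabeled examples
    prior:    class prior pi = P(y = +1), assumed known or estimated
    """
    # Risk of positives labeled as positive: E_P[l(f(x), +1)].
    risk_p_pos = np.mean(sigmoid_loss(scores_p))
    # Unlabeled data treated as if all negative: E_U[l(f(x), -1)].
    risk_u_neg = np.mean(sigmoid_loss(-scores_u))
    # Correction term removing the positives hidden in the unlabeled set:
    # pi * E_P[l(f(x), -1)].
    risk_p_neg = np.mean(sigmoid_loss(-scores_p))
    return prior * risk_p_pos + risk_u_neg - prior * risk_p_neg
```

Because the unlabeled distribution is a mixture of positives and negatives, subtracting `prior * risk_p_neg` makes the estimator unbiased for the clean-data risk in expectation, which is the property the dissertation's multi-label extension generalizes.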