DS Ph.D. Qualifier Presentation | Cate Dunham | Tuesday, May 13th @ 11:00AM, UH471 | Oracle Embeddings for Chemical Data Generation and Chemical Detection
DATA SCIENCE
Ph.D. Qualifier Presentation
Cate Dunham
Tuesday, May 13th
Unity Hall, UH471
11:00AM - 12:00PM
Committee:
- Professor Randy Paffenroth, PhD Advisor, Mathematical Sciences, Computer Science, Data Science
- Professor Yanhua Li, Co-Advisor, Computer Science
- Professor Raha Moraffah, Co-Advisor, Computer Science
Title: Oracle Embeddings for Chemical Data Generation and Chemical Detection
Abstract: Real-time chemical detection is critical in many contexts, including national security and public safety. Machine learning (ML) has emerged as a valuable tool to support real-time chemical detection. However, procuring the large datasets necessary for training ML algorithms can be prohibitively time-consuming and costly. Data requirements and associated costs are compounded in the presence of multiple chemical sensors, each of which must be represented by sufficient training data for the ML algorithm to learn from. In such limited-data scenarios, a common approach is to augment an existing dataset of experimentally acquired data with synthetic data generated by machine learning models. Herein, we explore the generation of synthetic data that can enhance the performance of downstream chemical detection models. Our research focuses on synthetic data generation when multiple chemical sensors are in use. Specifically, we explore how data from one sensor modality can be used to support data generation for a different sensor modality. While each sensor captures distinct features, such as charge or fragmentation patterns, none provides the complete chemical structure. As a result, data captured by one sensor may not be sufficient to generate data for another. Our approach revolves around mapping data to a fixed, information-rich embedding that is based on chemical structure and is thus common to all sensor types. These fixed embeddings, derived from an external deep learning chemistry model, capture information about chemical structure that is advantageous for generating data for various sensor modalities. We build on our previous research, in which we successfully used our fixed-embedding architecture to generate synthetic mass spectrometry data. The synthetic data created by our previous model both enhanced classifier accuracy and was correctly identified by an external spectrum matching tool.
In this study, we extend our prior work to ion mobility data, laying the foundation for a larger data generation and sensor fusion model.