Data Science Ph.D. Dissertation Defense | Ricardo Flores | Tuesday, Sept. 26th @ 9:00am




Ricardo Flores, Ph.D. Candidate

Tuesday, September 26, 2023

Time: 9:00 a.m. – 10:00 a.m.

Location: Hagglund 301, Campus Center


Committee Members

Prof. Elke Rundensteiner (Advisor), Computer Science & Data Science, WPI

Prof. Xiaozhong Liu, Computer Science & Data Science, WPI

Prof. Nima Kordzadeh, Business School & Data Science, WPI

Prof. Farah Shamout, Computer Engineering, NYU-Abu Dhabi


Title: Multi-modal Models for Depression Screening



Depression is one of the most common mental health disorders. Because trained clinicians are in short supply, mental health screening is costly. With advances in audio-visual speech recognition, a virtual interviewer may offer an affordable alternative for depression screening. Prior research has achieved high classification performance using audio, text, or images as input. However, these modalities introduce new challenges for training a virtual interviewer with deep learning, including small sample sizes, the need to combine multiple modalities, and long audio-video recordings.

First, I study voice recordings to quantify the effect of including follow-up questions in clinical interviews for depression screening, training transfer learning models on the popular Distress Analysis Interview Corpus - Wizard-of-Oz (DAIC-WOZ). Second, I propose AudiFace, a multi-modal deep learning model that consumes temporal facial features, recorded audio, and transcripts to screen for depression. AudiFace combines pre-trained transfer learning models with bidirectional LSTMs and self-attention layers to capture long-range relationships within sequences. However, AudiFace fuses the three uni-modal embeddings into a single representation by simple concatenation. Hence, I also propose WavFace, a multi-modal transformer-based model that takes audio and temporal facial features as input. WavFace applies an explicit alignment method to both modalities and then applies sequential and spatial self-attention over the aligned features for depression screening. Finally, to tackle the above-mentioned small-dataset challenge, common in the mental health community, I leverage multi-task learning with an auxiliary task to boost mental health screening performance: the two main tasks are depression and post-traumatic stress disorder (PTSD) screening, while the auxiliary task is missing-value imputation. The results achieved across all 15 datasets from DAIC-WOZ suggest that these multi-modal and multi-task learning strategies improve the uni-modal representations and, consequently, the depression screening metrics. I believe these models provide valuable findings for the future of mental health screening applications as well as clinical screening interviews.
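To make the fusion idea concrete: the sketch below illustrates, in minimal numpy, the general pattern of concatenating uni-modal embeddings into one joint representation and passing it through scaled dot-product self-attention. All dimensions, function names, and details here are hypothetical illustrations of these standard techniques, not the actual AudiFace or WavFace implementation.

```python
import numpy as np

def fuse_by_concatenation(audio_emb, text_emb, face_emb):
    """Late fusion: concatenate uni-modal embeddings along the feature axis.
    (Hypothetical sketch; dimensions are not from the dissertation.)"""
    return np.concatenate([audio_emb, text_emb, face_emb], axis=-1)

def self_attention(x):
    """Minimal scaled dot-product self-attention with queries = keys = values = x."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                        # (T, T) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over time steps
    return weights @ x                                   # attended sequence (T, d)

# Toy example: a 4-step sequence with 8-dim audio, text, and facial embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
text = rng.normal(size=(4, 8))
face = rng.normal(size=(4, 8))

fused = fuse_by_concatenation(audio, text, face)
attended = self_attention(fused)
print(fused.shape, attended.shape)  # (4, 24) (4, 24)
```

A limitation of plain concatenation, which the abstract notes motivates WavFace, is that the modalities are only mixed after embedding; explicit alignment lets attention relate audio frames and facial frames directly.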




Data Science
Contact Person: Kelsey Briggs