Computer Science Department, Ermal Toto - PhD Candidate "Towards Instantaneous Mental Health Screening From Voice Using Machine and Deep Learning"

Wednesday, March 10, 2021
9:00 am to 10:00 am


Ermal Toto

PhD Candidate

WPI – Computer Science

Committee Members

Prof. Elke A. Rundensteiner, WPI, Advisor

Prof. Carolina Ruiz, WPI, Committee Member

Prof. Lane T. Harrison, WPI, Committee Member

Dr. Francis (Lee) Stevens, Reliant Medical Group, External Committee Member


The World Health Organization (WHO) has identified mental health disorders as a serious global epidemic. In the US, mental health disorders affect up to a quarter of the population and are the leading cause of disability, responsible for 18.7% of all years of life lost to disability and premature mortality. Although early detection is crucial to improving prognosis, mental illness remains largely undiagnosed. Given the recent prevalence of voice clips from digital assistants and smartphone technologies, there is now a tremendous opportunity to transform mental health screening: screening could become ubiquitously integrated into virtual assistants and smartphones. However, several challenges must be overcome to achieve accurate mental health screening from voice.

First, due to privacy concerns, audio datasets with mental health labels include only a small number of participants, which limits the performance of current classification models. To alleviate this data shortage, we have collected retroactive and voice data from smartphones and applied machine learning algorithms to detect depression and suicidal ideation.

Second, emotion cues are not evenly distributed within a voice clip, and a considerable part of the voice content might carry only a neutral signal. To tackle this issue, we have developed and evaluated a novel sub-clip-based algorithm that outperforms state-of-the-art results using only audio signals.

Finally, we must recognize the multimodal nature of voice; that is, both verbal and non-verbal content must be considered. This makes the problem particularly hard, as voice clips containing the same sentence can mean different things depending on the tone. To address this, we have developed and evaluated a multimodal deep learning model that takes advantage of both audio signals and transcripts and achieves F1 scores of up to 0.91, further advancing the state of the art.

This research has the potential for broad impact by expanding mental health screening to voice-enabled devices, helping to address a severe global need.