Dmitry Korkin stands before a depiction of a network of proteins that are affected by type 2 diabetes (in pink). The lines represent protein-protein interactions that are expected to be affected by the mutations linked to type 2 diabetes.

Computer Scientists at WPI Are Developing Tools to Probe the Molecular Basis of Complex Diseases

With an award from the National Institutes of Health, a team led by Dmitry Korkin will develop next-generation machine learning algorithms that could advance our understanding of the molecular biology of disease and the field of personalized medicine
September 27, 2018

A team of computer scientists at Worcester Polytechnic Institute (WPI) has received a two-year, $347,000 award from the National Institutes of Health to develop and evaluate new computational techniques that will provide a better understanding of the genetic and molecular interactions that underpin complex diseases. For example, the tools will help predict the likelihood that specific genetic mutations or patterns of mutations will lead to diabetes, neurological disorders, cancer, and other maladies; the likely outcome of those diseases; and how well those conditions will respond to treatment.

Led by Dmitry Korkin, associate professor of computer science and director of WPI’s Bioinformatics and Computational Biology Program, the team will develop tools for sifting through the vast amount of data that next-generation sequencing techniques now produce about genetic mutations linked to various diseases and about the alternative gene products that occur in diseased tissues. The goal is a deeper understanding of the complex interactions of genes, RNA molecules, and proteins within cells that ultimately shape the inception and progress of diseases.

“The more we learn about how complex diseases work at the molecular level,” Korkin said, “the more we come to appreciate the intricate web of molecular interactions that are key to why one person gets sick, while another with similar mutations does not, or why one person’s cancer responds to chemotherapy, while another is unaffected. Studying these complex interaction networks in the laboratory with high-throughput techniques is extremely time-consuming and expensive, which is why our understanding of these networks is very limited.”

Korkin says existing big data tools are limited in their ability to model complex biological networks and how they change in a disease state. And while the hope is that such tools could one day replace laboratory experiments, or at least help scientists determine which experiments are likely to yield the most useful results, they are not yet up to that task. With the NIH award, Korkin said he hopes to bridge that gap by developing new kinds of computational methods that draw on an area of artificial intelligence known as machine learning.

He said the goal is to better model the complex web of molecular interactions within cells that begins when genes, which contain the genetic code for making proteins, are transcribed into RNA molecules. RNA, in turn, transfers the genetic information to machinery within the cell that uses it to assemble specific proteins. Finally, the myriad proteins produced within the cell interact in an intricate molecular ballet. In particular, Korkin said his aim is to create algorithms that can predict how this dense web of interactions changes when the genes develop mutations. These insights could help lay the foundation for personalized medicine, in which physicians will have the tools to predict the likely course of a particular disease in individual patients and prescribe individualized treatments.

The tools Korkin and his team are developing reflect a new way of looking at the molecular machinery of diseases, he said. It is now known that many complex diseases are associated with dozens or even hundreds of mutations. Each of these mutations will produce a protein that is different from the one that the normal, non-mutated gene codes for. An emerging model in biology contends that focusing on the interactions between these altered proteins (biologists refer to these interactions as “edges” in the protein network) yields a far more accurate picture of how mutations translate into changes in cellular functions, and will, therefore, provide a better picture of the way in which diseases begin and progress than looking at the mutations alone.

“Rather than talking about genotype, we are now talking more about ‘edgotype,’” Korkin says. “We are learning that if we understand just the mutations associated with a disease, we really understand very little about the functions those mutations affect.”

In a recent paper in the Journal of Molecular Biology (“Multilayer View of Pathogenic SNVs in Human Interactome through In Silico Edgetic Profiling”), Korkin and his research team describe the first machine learning tools capable of modeling those protein interactions en masse and building a profile of a cell’s edgotype. As a case study, the team used the tools to trace the connections between the dozens of genes that are linked to type 2 diabetes and the actual cellular damage that produces the disease’s symptoms. They looked systematically at the interactions between the proteins produced by the normal versions of those genes, and then looked to see how those interactions change when the proteins come from mutated genes. The analysis provided a first-of-its-kind look at how the mutations ultimately translate to alterations in the way cells and tissues function.
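
The comparison described above, interactions among normal proteins versus interactions among their mutated counterparts, can be illustrated with a minimal sketch. It is not the team's tool: the protein names, the interaction lists, and the function names `to_edges`, `edgetic_profile`, and `rewiring_fraction` are all hypothetical, and a plain set comparison stands in for the machine learning predictions described in the paper.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# An undirected interaction ("edge") between two proteins.
Edge = FrozenSet[str]


def to_edges(pairs: Iterable[Tuple[str, str]]) -> Set[Edge]:
    """Turn (protein, protein) pairs into a set of undirected edges."""
    return {frozenset(pair) for pair in pairs}


def edgetic_profile(wild_type: Set[Edge], mutant: Set[Edge]) -> Dict[str, Set[Edge]]:
    """Sort edges into preserved, lost, and gained relative to the wild type."""
    return {
        "preserved": wild_type & mutant,
        "lost": wild_type - mutant,
        "gained": mutant - wild_type,
    }


def rewiring_fraction(profile: Dict[str, Set[Edge]]) -> float:
    """Share of all observed edges that differ between the two networks."""
    changed = len(profile["lost"]) + len(profile["gained"])
    total = changed + len(profile["preserved"])
    return changed / total if total else 0.0


# Hypothetical toy networks; A-D are placeholder protein names, not real data.
wild_type = to_edges([("A", "B"), ("B", "C"), ("C", "D")])
mutant = to_edges([("A", "B"), ("C", "D"), ("A", "D")])

profile = edgetic_profile(wild_type, mutant)
fraction = rewiring_fraction(profile)
```

In this toy case one interaction is lost (B with C) and one is gained (A with D), so half of the observed edges are rewired. Storing each edge as a `frozenset` keeps the comparison independent of the order in which the two proteins are listed.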

“For the first time,” Korkin said, “we provided a large-scale view of diabetes and the role of its mutations in protein interactions. We learned that the mutations associated with diabetes appear to act synergistically to alter the molecular interactions in the cell. In particular, if you look at the interactions of the key proteins known to be associated with diabetes, you see that their interactions are rewired quite dramatically.”

While modeling the interactions of protein networks is challenging enough, Korkin said another emerging concept in molecular biology adds a new layer of complexity. It’s called alternative splicing, and it expands the classic model of genetics, which held that the information coded into a single gene represents the blueprint for one and only one protein. It is now known that most genes are capable of producing multiple proteins, depending on which sections of the gene are retained in the final RNA molecules. Cells use a variety of regulatory mechanisms to determine which proteins will be produced at any one moment, but various diseases, including cancer, can also change the way a gene is “spliced.”
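
To make the idea of one gene yielding several products concrete, here is a small sketch under a deliberately simplified assumption: some exons are always kept and others ("cassette" exons) may be skipped. The exon names and the function `enumerate_isoforms` are hypothetical, and real splicing also involves other event types, such as intron retention and alternative splice sites, that this sketch ignores.

```python
from itertools import product
from typing import List, Tuple


def enumerate_isoforms(exons: List[Tuple[str, bool]]) -> List[Tuple[str, ...]]:
    """List every transcript obtainable by keeping or skipping optional exons.

    `exons` is an ordered list of (name, is_constitutive) pairs. Constitutive
    exons are always kept; each remaining exon is either kept or skipped.
    Exon order is preserved in every transcript.
    """
    optional = [name for name, constitutive in exons if not constitutive]
    isoforms = []
    for choice in product([True, False], repeat=len(optional)):
        keep = dict(zip(optional, choice))
        isoforms.append(
            tuple(name for name, constitutive in exons if constitutive or keep[name])
        )
    return isoforms


# Hypothetical gene with four exons; E2 and E3 can each be skipped.
gene = [("E1", True), ("E2", False), ("E3", False), ("E4", True)]
isoforms = enumerate_isoforms(gene)
```

With two optional exons the hypothetical gene has four possible transcripts, ranging from the full E1-E2-E3-E4 to the shortest E1-E4. Under this simplified model the count doubles with each additional optional exon.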

“Alternative splicing is now seen as one of the cell’s most important regulatory mechanisms,” Korkin said. “With each gene potentially able to produce five, six, or even dozens of different proteins, understanding how these variations affect cell function could be a very powerful tool for biology and medicine, since alternative splicing appears to be a more powerful mechanism for bringing about profound changes in the cell than mutations or alterations in gene expression.”

Korkin says the computational tools his group will develop with the new NIH award will also be able to account for the effects of alternative splicing, and the knowledge gained with those tools could advance our understanding of biology and improve healthcare. For example, he said it is believed that some genes produce different proteins in different tissues, at different times of day, or under different environmental stressors. It is also believed that the genes in tumors may express different proteins at different pathological stages. Having a tool that can predict these alterations could significantly enhance how diseases are diagnosed and how new treatments are developed and administered, he said.

In a recent paper in the journal RNA (“Biological classification with RNA-Seq data: Can alternative spliced transcript expression enhance machine learning classifier?”), Korkin and his colleagues tested whether machine learning tools that use data about alternative splicing perform better than tools that rely on data about gene expression (or data about how the genetic code is translated into proteins). In the paper, the challenge presented to the algorithms was to take molecular data about tissue samples and identify the tissue types, the age and gender of the individuals from whom the samples were taken, whether the tissues were healthy or cancerous, and the pathological stages of individual tumors.

They found that in virtually every case, the alternative splicing data was better at classifying the samples than gene expression data, and that in many cases it produced classifications with 100 percent accuracy. “Whether we looked at tissue-specific effects, developmental changes, or disease stage data, everything was classified with greater accuracy by using the alternative splicing data,” Korkin said.
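
One way to see why splicing-level data can carry more information is a toy comparison in which each sample's gene-level value is simply the sum of its transcript-level values. The sketch below does not reproduce the paper's classifiers, data, or results: the numbers are synthetic, the tissue labels are placeholders, and the nearest-centroid classifier with leave-one-out evaluation (`nearest_centroid_loo_accuracy`) is an assumption chosen only for brevity.

```python
from math import dist
from statistics import mean
from typing import List, Sequence


def nearest_centroid_loo_accuracy(
    samples: Sequence[Sequence[float]], labels: Sequence[str]
) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i, held_out in enumerate(samples):
        # Build one centroid per class from every sample except the held-out one.
        centroids = {}
        for label in set(labels):
            rows = [s for j, s in enumerate(samples) if j != i and labels[j] == label]
            if rows:
                centroids[label] = [mean(column) for column in zip(*rows)]
        # Ties are broken alphabetically so the result is deterministic.
        predicted = min(sorted(centroids), key=lambda lab: dist(held_out, centroids[lab]))
        correct += predicted == labels[i]
    return correct / len(samples)


# Synthetic data: two transcripts of one hypothetical gene in six samples.
transcript_level: List[List[int]] = [[9, 1], [8, 2], [7, 3], [1, 9], [2, 8], [3, 7]]
labels = ["tissue_X"] * 3 + ["tissue_Y"] * 3

# Gene-level expression here is the sum over that gene's transcripts.
gene_level = [[sum(sample)] for sample in transcript_level]

transcript_accuracy = nearest_centroid_loo_accuracy(transcript_level, labels)
gene_accuracy = nearest_centroid_loo_accuracy(gene_level, labels)
```

By construction, every sample has the same gene-level total, so the gene-level classifier can do no better than chance, while the transcript-level features separate the two groups perfectly. The gap is built into the synthetic numbers and says nothing about how large the effect is in real RNA-Seq data.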

“With the new NIH award, we will have the resources to take our machine learning tools to the next level and help contribute to the most exciting emerging areas of biology and personalized medicine.”