The Ties That Bind: NSF Funds Research at WPI Aimed at Cracking the Hidden Genetic Code Across Species

A team led by Dmitry Korkin will use advanced math and computing power to sift through the genomes of animals, plants, fungi, and other organisms to find shared genetic sequences that may point to fundamental cellular functions.
October 22, 2015

Dmitry Korkin, WPI associate professor of computer science

If a human being, a worm, a broccoli plant, and a yeast cell share common genetic elements, those snippets of DNA, having remained unchanged over millions of years of evolution, are likely to perform fundamental biological functions.

The National Science Foundation (NSF) has awarded Worcester Polytechnic Institute (WPI) a $768,000 research grant to identify such elements across all known genomes of plants, animals, fungi, and other complex organisms to gain insight into the roles they play in our cells. Dmitry Korkin, PhD, associate professor of computer science and principal investigator for the new project, will use mathematical algorithms and advanced computing technology to analyze vast amounts of genomic data to identify common genetic elements.

"We call these sequences long identical multispecies elements, or LIMEs," said Korkin. "To be conserved across species that diverged hundreds of millions of years ago, these elements must carry out some very basic and vital functions in the cells."

Korkin is a member of WPI’s Bioinformatics and Computational Biology Program, which uses advanced mathematics and computer science to shed light on basic biology. In the new project, Korkin's team will analyze all the available genomes of eukaryotes, which are organisms whose genetic material is contained within a nucleus. (Bacteria and other simple single-celled organisms do not have nuclei and are called prokaryotes.) Currently, the genomes of some 925 eukaryotic species are sufficiently sequenced for Korkin’s analysis; they include many plants and animals, as well as the human genome.

"Just a few years ago, we could not even approach this question, because there was too much data to deal with," Korkin said. "With the technology we had then, the algorithms would have to run, literally, for a thousand years to get a result."

Korkin and his team have made technical leaps, developing new "cache-oblivious" algorithms that are designed not only to answer genetic questions, but also to maximize the efficiency of available computer processing power. "You have to understand the hardware you’re running on to optimize the algorithms," Korkin said. "What we're seeing in early results is a thousand-fold improvement. What we were doing on big servers that took weeks, we can now do on a laptop in a couple of hours."

A genome is the complete set of DNA molecules that carry the genetic information needed for the development and function of an organism. Famously described as a "double helix," a DNA molecule looks like a twisted ladder whose two side rails are joined by rungs, each rung a pair of nucleotides drawn from just four types: adenine (A), cytosine (C), guanine (G), and thymine (T). Those four letters are the entire genetic alphabet. The microscopic worm C. elegans has about 100 million base pairs in its genome, while the human genome runs to about 3 billion.
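The pairing in the ladder follows a fixed rule: A always pairs with T, and C always pairs with G, so the letters on one rail determine the letters on the other. A minimal Python sketch of that rule follows. The function names and the sample strand are hypothetical, chosen only for illustration.

```python
# Watson-Crick base pairing: A with T, C with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the letters that sit opposite each base of `strand`."""
    return "".join(PAIRING[base] for base in strand)


def reverse_complement(strand):
    """Return the opposite strand, read in its own direction.

    The two rails of the ladder run in opposite directions, so the
    partner strand is the complement read back to front.
    """
    return complement(strand)[::-1]


if __name__ == "__main__":
    sample = "GATTACA"  # made-up strand, not real genomic data
    print(sample, "pairs with", complement(sample))
    print("opposite strand:", reverse_complement(sample))
```

Because of this rule, a sequence search over a genome usually needs to store only one strand; the other can always be derived.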

Genes are large sequences of base pairs that provide specific instructions for the production of proteins in cells. Genes that code for proteins, however, account for less than two percent of the DNA in human cells. For many years, the remaining 98 percent was called "junk DNA" and thought to be inactive leftovers accumulated over millions of years of evolution. "We now know that it's really not junk at all," Korkin said. "Those non-coding regions of the genome are emerging as very important for basic development and regulatory functions."

Over the next three years, Korkin’s team will work to identify identical (or nearly identical) patterns of base pairs that exist across species, and to trace the evolutionary history of those genetic elements and their roles in normal development or the onset of disease. Korkin expects that most LIMEs will fall in non-coding regions, given that those areas dominate the genome, but the project may also identify some common genes.
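To make the idea of an identical multispecies element concrete, here is a deliberately naive Python sketch that finds the longest stretch of letters present, letter for letter, in every one of several sequences. It is not the team's cache-oblivious method, which the article does not describe in detail. The function names and the three toy sequences are invented for illustration, and the sketch ignores near-identical matches and the opposite DNA strand.

```python
def substrings_of_length(seq, k):
    """Return the set of all length-k substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_elements(genomes, k):
    """Return the length-k substrings found identically in every sequence."""
    sequences = list(genomes.values())
    common = substrings_of_length(sequences[0], k)
    for seq in sequences[1:]:
        common &= substrings_of_length(seq, k)
    return common


def longest_shared_elements(genomes):
    """Return the longest substrings common to all sequences (may be empty)."""
    shortest = min(len(seq) for seq in genomes.values())
    # Try the longest possible length first and shrink until something matches.
    for k in range(shortest, 0, -1):
        found = shared_elements(genomes, k)
        if found:
            return found
    return set()


# Toy sequences, made up for this example (not real genomic data).
TOY_GENOMES = {
    "species_a": "TTGACCGTAGGCTAACGT",
    "species_b": "CCGACCGTAGGCTTTAAG",
    "species_c": "AGACCGTAGGCAGTCCAT",
}

if __name__ == "__main__":
    print(longest_shared_elements(TOY_GENOMES))
```

This brute-force approach holds every substring in memory and rescans each sequence for every candidate length, which is workable for a few dozen letters but not for hundreds of genomes with billions of base pairs. That gap is the reason the project depends on specialized algorithms.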