Title: Assessing RNA-Seq on biological classification: Can alternative splicing enhance machine learning?
Which genes are expressed in a cell can be described, simplistically, as a function of one or more environmental, lifestyle, and genetic factors. Gene expression has been an essential measure for assessing biological phenotypes, such as cell types or disease states, and for exploring novel phenomena. Gene expression is commonly estimated using microarrays and, to a lesser extent, RNA-Sequencing (RNA-Seq), a recent technique that has been rapidly replacing microarrays. RNA-Seq data are expected to provide better insights into a number of biological and biomedical questions than microarray data; however, leveraging these data requires the development of new data mining and analytics methods. Supervised learning methods, one of the commonly used approaches for biological data analysis, have recently gained attention for use with RNA-Seq data. In this work, we present the first large-scale assessment of supervised learning classification methods for RNA-Seq data, covering multiple datasets, organisms, lab groups, and RNA-Seq analysis pipelines. We hypothesized that alternative splicing expression data are more suitable than gene expression data for biological classification tasks when using machine learning. Overall, we performed and assessed 75 biological classification tasks using 3 normalization techniques on each of 4 mRNA RNA-Seq datasets that represent over 2,500 samples, multiple organisms, lab groups, and RNA-Seq analysis techniques. The 75 tasks include predicting the tissue type, gender, or age of the source of a healthy sample, whether a sample comes from cancerous tissue, and the stage of a breast cancer sample. Remarkably, for more than 95% of the classification tasks, the alternative splicing based methods outperform or are comparable with the gene expression based methods. In some cases, the difference in accuracy was as high as 15%.
Furthermore, for some supervised learning techniques, classification reached 100% accuracy, demonstrating the great promise of supervised machine learning methods for analyzing RNA-Seq data.