Towards faster gene expression prediction via dimensionality reduction and feature selection


The majority of genes have a genetic component to their expression. Elastic nets have been shown to be effective at predicting tissue-specific, individual-level gene expression from genotype data. We apply principal component analysis (PCA), linkage disequilibrium pruning, or a combination of the two, either to reduce the set of genetic variants used as inputs to the elastic net models or to generate a lower-dimensional representation of those variants. Our results show that, in general, elastic nets attain their best performance when all genetic variants are included as inputs; however, a relatively small number of principal components can summarize the majority of genetic variation while reducing the overall computation time. Specifically, 100 principal components reduce the computational time of the models by over 80% with only an 8% loss in R-squared. In contrast, linkage disequilibrium pruning does not effectively reduce the genetic variants for predicting gene expression. As predictive models are commonly built for over 27,000 genes across more than 50 tissues, PCA may provide an effective method for reducing the computational burden of gene expression analysis.
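The comparison described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn and a small synthetic genotype matrix. Everything here is an assumption made for illustration: the simulated data, the block structure standing in for linkage disequilibrium, the variable names, the `fit_and_score` helper, and the hyperparameters are not taken from the paper. Only the choice of 100 principal components mirrors the figure quoted above.

```python
import time

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for genotype data (assumption, not the paper's data):
# 400 individuals x 1000 variants, coded as allele counts 0, 1 or 2.
# Variants are grouped into 50 correlated blocks to mimic linkage disequilibrium.
n_individuals, n_variants, n_blocks = 400, 1000, 50
block_of_variant = np.repeat(np.arange(n_blocks), n_variants // n_blocks)
latent = rng.normal(size=(n_individuals, n_blocks))
noisy = latent[:, block_of_variant] + 0.5 * rng.normal(size=(n_individuals, n_variants))
genotypes = np.digitize(noisy, bins=[-0.5, 0.5]).astype(float)  # values in {0, 1, 2}

# Simulated expression: sparse linear effect of 20 variants plus noise.
effects = np.zeros(n_variants)
causal = rng.choice(n_variants, size=20, replace=False)
effects[causal] = rng.normal(size=20)
expression = genotypes @ effects + rng.normal(size=n_individuals)

X_train, X_test, y_train, y_test = train_test_split(
    genotypes, expression, test_size=0.25, random_state=0
)


def fit_and_score(model, X_tr, X_te, y_tr, y_te):
    """Fit a model, returning its test-set R-squared and fitting time in seconds."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    return r2_score(y_te, model.predict(X_te)), elapsed


# Baseline: elastic net on all variants.
full_model = ElasticNetCV(l1_ratio=0.5, cv=5)
r2_full, time_full = fit_and_score(full_model, X_train, X_test, y_train, y_test)

# Reduced: elastic net on the first 100 principal components of the variants.
pca_model = make_pipeline(PCA(n_components=100), ElasticNetCV(l1_ratio=0.5, cv=5))
r2_pca, time_pca = fit_and_score(pca_model, X_train, X_test, y_train, y_test)

print(f"all variants: R2={r2_full:.3f}, fit time={time_full:.2f}s")
print(f"100 PCs:      R2={r2_pca:.3f}, fit time={time_pca:.2f}s")
```

On simulated data like this, the printed numbers say nothing about the paper's reported trade-off; the sketch only shows how the two model configurations might be set up and timed side by side.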

In 45th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)
Theodore Papamarkou
Professor in maths of data science
