Barbara Engelhardt, a computer scientist at Princeton University, is using machine learning to understand disease. Credit: Sarah Blesener, Quanta Magazine

Machine Learning Works to Make Sense of Genes

When scientists search human genomes for the causes of disease, they face an astronomical number of mutations in the human gene pool. Machine learning can help break those genetic variations down into understandable pieces of information that may point toward treatments.

In a long and very detailed interview by Jordana Cepelewicz, which first appeared in Quanta Magazine and was republished by Wired.com, Barbara Engelhardt describes her search among genomes for cures. The Princeton University computer scientist gave careful explanations of her work, in which she uses AI-driven machine learning tools to analyze and understand genetic data.

“Engelhardt likens the effort to detective work, as it involves combing through constellations of genetic variation, and even discarded data, for hidden gems. In research published last October, for example, she used one of her models to determine how mutations relate to the regulation of genes on other chromosomes (referred to as distal genes) in 44 human tissues. Among other findings, the results pointed to a potential genetic target for thyroid cancer therapies. Her work has similarly linked mutations and gene expression to specific features found in pathology images.”

There is a short video below from the interview for the article. It’s a glimpse into the depth of the science of exploring the human genome. Engelhardt designs machine-learning models to make sense of the hidden information she is uncovering.

Engelhardt explains, “My group relies heavily on what we call sparse latent factor models, which can sound quite mathematically complicated. The fundamental idea is that these models partition all the variation we observed in the samples, with respect to only a very small number of features. One of these partitions might include 10 genes, for example, or 20 mutations. And then as a scientist, I can look at those 10 genes and figure out what they have in common, determine what this given partition represents in terms of a biological signal that affects sample variance.”
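To make the idea of a "partition" concrete, here is a minimal toy sketch in Python. It is not Engelhardt's method: her group uses sparse latent factor models, while this example swaps in a much simpler stand-in, a truncated power iteration that finds one sparse factor. The data, the gene indices, and the function names `sample_covariance` and `sparse_loading` are all hypothetical. One hidden signal drives 5 of 30 simulated genes, and the code tries to recover which genes belong together.

```python
import random

random.seed(0)  # fixed seed so the toy data are reproducible

N_SAMPLES, N_GENES, N_ACTIVE = 200, 30, 5

# Hypothetical toy data: one hidden factor z drives genes 0..4 only;
# every gene also receives independent noise.
samples = []
for _ in range(N_SAMPLES):
    z = random.gauss(0.0, 1.0)
    samples.append([
        (2.0 * z if g < N_ACTIVE else 0.0) + random.gauss(0.0, 0.5)
        for g in range(N_GENES)
    ])


def sample_covariance(rows):
    """Return the gene-by-gene sample covariance matrix as nested lists."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]


def sparse_loading(cov, k, iters=50):
    """Estimate one factor loading vector with at most k nonzero genes.

    Each step multiplies by the covariance matrix, keeps only the k
    largest entries (by absolute value), and rescales to unit length.
    """
    p = len(cov)
    v = [1.0 / p ** 0.5] * p  # start from a uniform unit vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        keep = set(sorted(range(p), key=lambda j: abs(w[j]), reverse=True)[:k])
        w = [w[j] if j in keep else 0.0 for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


cov = sample_covariance(samples)
loading = sparse_loading(cov, k=N_ACTIVE)
selected = [g for g, w in enumerate(loading) if w != 0.0]
print("genes in the recovered partition:", selected)
```

In this simulation the small set of genes with nonzero loadings plays the role of one partition. A scientist would then look at that short list and ask what the genes have in common biologically. Real analyses involve many factors, far more genes, and models that learn how sparse each factor should be rather than fixing `k` in advance.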

The Quanta/Wired.com article is linked below. A truly fascinating piece of writing on a very technical subject, it’s worth the time it takes to read. More at wired.com.