ML Algorithms Uncover New Knowledge Hidden in Old Studies

In virtually every scientific field, decades’ worth of research continues to pile up. Each new study brings something to the research table, and each contains many references to the studies that came before.

But no matter how much free time scientists have, they can’t sift through the entire scientific literature to glean the information they need for their new work.

Overwhelming as it is, this collective knowledge, made up of millions of published papers, is a treasure trove for Machine Learning (ML).

ML algorithms can make use of the mountains of research papers in ways we wouldn’t think possible. Just by scanning the abstract texts of published studies, ML algorithms can make discoveries.

And yes, there’s a new research paper about this!

Leave it to ML Algorithms to do the Paperwork

Can Machine Learning make sense of millions of materials science papers to predict discoveries of new materials and suggest others yet to be found?

That’s precisely what an ML algorithm did in an experiment conducted at the US DoE’s Lawrence Berkeley National Laboratory, in an unsupervised manner and based only on abstract texts of past studies.

A team led by Dr. Anubhav Jain of Berkeley Lab’s Energy Storage & Distributed Resources Division fed the abstracts of 3.3 million materials science papers into an algorithm called Word2vec.

“Without telling it anything about materials science, it learned concepts like the periodic table and the crystal structure of metals,” said Jain. “That hinted at the potential of the technique. But probably the most interesting thing we figured out is, you can use this algorithm to address gaps in materials research, things that people should study but haven’t studied so far.”

To assess the algorithm’s predictive ability, the researchers ran experiments “in the past”: they gave it only research abstracts published up to a specific year, then checked how its predictions fared against later studies.

With no prior training in materials science, just by analyzing how words in the abstracts are related to one another, the neural network was able to predict the discovery of some thermoelectric materials, years in advance.

ML Algorithms are no Scientists, Just Good Text Mining

Per the researchers, a “significant number” of the materials the algorithm predicted turned up in studies years later, “four times more than if materials had just been chosen at random.”
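To make the “better than random” comparison concrete, here is a minimal Python sketch of how such a look-back test could be scored. It is not the researchers’ code: the material names (M1 to M20), the list of later-confirmed materials, and the model’s top-five picks are all made-up toy data, and the function name hit_rate is hypothetical.

```python
def hit_rate(candidates, later_confirmed):
    """Return the fraction of candidates that show up in later studies."""
    if not candidates:
        return 0.0
    hits = sum(1 for material in candidates if material in later_confirmed)
    return hits / len(candidates)


# Hypothetical toy data, NOT taken from the study.
# 20 materials that had not been studied as thermoelectrics by the cutoff year.
all_materials = [f"M{i}" for i in range(1, 21)]
# Materials that later papers (after the cutoff) did report as thermoelectrics.
later_confirmed = {"M1", "M2", "M3", "M4", "M5"}
# The five materials a model ranked highest using only pre-cutoff text.
model_top5 = ["M1", "M2", "M3", "M4", "M9"]

model_rate = hit_rate(model_top5, later_confirmed)
# Picking at random succeeds, on average, at the overall confirmation rate.
baseline_rate = hit_rate(all_materials, later_confirmed)
lift = model_rate / baseline_rate

print(f"model: {model_rate:.2f}, random: {baseline_rate:.2f}, lift: {lift:.1f}x")
```

In this toy setup the model’s picks are confirmed more often than a random pick would be, and the ratio between the two rates is the kind of figure behind a statement like “four times more than random.”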

From those millions of abstracts, the ML algorithm found 500,000 distinct words, turned each word into an array of 200 numbers, then used the numbers to see the relationships between words.

“If you train the algorithm on nonscientific text sources and take the vector that results from ‘king minus queen,’ you get the same result as ‘man minus woman.’ It figures out the relationship without you telling it anything.”

That way, Word2vec learned the meaning of many materials science concepts and terms merely by analyzing the positions and co-occurrences of distinct words in the abstracts.
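For a feel of what “co-occurrence” means here, the Python sketch below simply counts how often two words appear near each other. This is a deliberate simplification: Word2vec itself trains a small neural network to predict neighboring words rather than tallying counts, and the two example sentences, the window size of 3, and the function name cooccurrence_counts are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=3):
    """Count how often each ordered word pair appears within `window` words."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Two made-up "abstract" sentences, for illustration only.
abstracts = [
    "Bi2Te3 is a thermoelectric material",
    "PbTe is a thermoelectric material",
]

counts = cooccurrence_counts(abstracts)
print(counts[("Bi2Te3", "thermoelectric")])
print(counts[("thermoelectric", "material")])
print(counts[("Bi2Te3", "PbTe")])
```

Even though the two compound names never appear together, both show up next to “thermoelectric,” which is the kind of shared context an embedding model can pick up on.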

Just as it solved the equation “king – queen + man,” the algorithm could solve equations like “ferromagnetic – NiFe + IrMn,” correctly answering “antiferromagnetic.”
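The word “equation” is meant literally: the words are vectors, so they can be added and subtracted, and the answer is whichever remaining word lies closest to the result. Here is a minimal Python sketch of that idea. The two-number vectors are hand-made so that the arithmetic works out (the study used 200 numbers per word, learned from text), and the helper names cosine and solve_analogy are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, positive, negative):
    """Add the `positive` words, subtract the `negative` ones, and return
    the closest remaining word by cosine similarity."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Skip the words used in the equation itself when picking the answer.
    used = set(positive) | set(negative)
    candidates = [word for word in vectors if word not in used]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


# Hand-made toy vectors, NOT learned from any text.
toy_vectors = {
    "NiFe": [0.0, 1.0],
    "IrMn": [0.0, -1.0],
    "ferromagnetic": [1.0, 1.0],
    "antiferromagnetic": [1.0, -1.0],
    "thermoelectric": [1.0, 0.1],
}

answer = solve_analogy(
    toy_vectors, positive=["ferromagnetic", "IrMn"], negative=["NiFe"]
)
print(answer)
```

With these toy numbers, “ferromagnetic – NiFe + IrMn” lands exactly on the vector for “antiferromagnetic.” Real learned vectors are far noisier, so the nearest word is only an approximate match.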

According to Vahe Tshitoyan, a Berkeley Lab postdoctoral fellow and the lead author of the study, “the paper establishes that text mining of scientific literature can uncover hidden knowledge and that pure text-based extraction can establish basic scientific knowledge.”

Berkeley Lab’s research paper on how ML algorithms can make use of research papers is published in Nature.