Citrine is a company that builds data infrastructure and predictive data analysis software for the materials industry. Machine learning is a key tool in our toolbox. I have had a few professors and students in materials departments ask me (1) how machine learning could help in their research; and (2) how to quickly come up to speed in machine learning without going back to school for a degree in computer science. While a variety of machine learning courses and how-tos exist on the web already (see here, here, or here), none are specific to the field of materials science.

Enter this blog post. I think the best way to master a new concept is by directly applying it, so this tutorial will show you how to build a machine learning-based model of a canonical solid-state materials property: band gap. In short, we will—with a healthy dose of humility—take a crack at an unsolved problem of solid-state physics: predicting the band gaps of compounds.

Software Prerequisites

  • A recent version of Python

  • The scikit-learn package for Python (NB: Weka is a very nice Java-based machine learning package that has both a GUI and a command-line interface)

  • The NumPy package for Python

  • The pymatgen package for Python, from the good folks behind Materials Project


Before we begin, because data-driven materials science is a relatively new, burgeoning field, I want to offer my own (very) concise definitions for some terms you may have heard bandied about:

Materials informatics / materials genomics. Using computational tools to automatically reveal useful patterns in large materials data sets.

Data mining. The process of extracting actionable insights from data. Data mining could be machine learning-based, statistical, or even graphical.

Machine learning. Using algorithms to identify and model patterns in data.

Features / descriptors. These are the “independent variables” of machine learning. They are the known characteristics of materials that we use to build models of unknown characteristics of interest to us.
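
To make the idea of a feature concrete, here is a minimal sketch that turns a chemical formula into a vector of atomic fractions. This is only one possible choice of descriptor, not necessarily the one used later in this series; the element list ELEMENTS and the helper names parse_formula and fraction_features are hypothetical, and the parser handles only simple formulas with integer counts.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny element vocabulary for illustration only.
ELEMENTS = ["Al", "As", "Ga", "N", "O", "Si"]


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'Al2O3'.

    Handles only element symbols followed by optional integer counts;
    parentheses and fractional stoichiometries are out of scope.
    """
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return dict(counts)


def fraction_features(formula, elements=ELEMENTS):
    """Map a formula to a vector of atomic fractions, one entry per element."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    return [counts.get(el, 0) / total for el in elements]


print(fraction_features("GaAs"))
print(fraction_features("Al2O3"))
```

Each position in the vector is one feature: a known characteristic of the material (here, how much of each element it contains) that a model could use to predict an unknown one.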

Hype vs. Reality

Unfortunately, machine learning and its subfields (especially deep learning), along with the general concept of Big Data, are so widely hyped in the popular press that materials scientists often ask me what machine learning really can and cannot do. Don’t be distracted by buzzwords and jargon: machine learning is not a silver bullet for scientific problems. For our purposes, machine learning gives us a (potentially black-box) nonlinear regression technique to find meaningful trends in materials data. These trends should, in turn, be based upon the physics of the problem at hand. We are not somehow conjuring signal from noise, nor are we simulating materials using physical models such as density functional theory. We are simply looking for patterns in datasets that are too large or complex to analyze manually.

Data Aggregation & Cleaning

The first step in applying machine learning to materials data is to have on hand some data you would like to model. Although this requirement sounds simple, a lack of centralized, structured, open materials data is an enormous challenge facing the entire materials community. Citrine is, with the materials research community’s help, engaging in an effort to create such a data resource in the form of our Citrination platform.

Instances of large, clean materials datasets are few and far between. However, one notable example is the Materials Project (MP) out of LBNL and MIT. For the sake of this tutorial, we will model the band gaps of binary compounds using the DFT-derived data in the MP database. Fortunately, MP provides a nice API with which we can download all of the binary compound band gaps they have calculated. The Python code I have written to do that is below:

# Import paths as in the pymatgen release used for this post; newer releases may organize these differently
from pymatgen import MPRester, periodic_table
import itertools

API_KEY = 'YOUR_KEY_HERE' # You have to register with Materials Project to receive an API key

# There are 103 elements in pymatgen's list, giving C(103, 2) = 5253 binary systems
allBinaries = itertools.combinations(periodic_table.all_symbols(), 2) # Lazily generate every pair of elements (all binary systems)

with MPRester(API_KEY) as m:
    for system in allBinaries:
        results = m.get_data(system[0] + '-' + system[1], data_type='vasp') # Download DFT data for each binary system
        for material in results: # We will receive many compounds within each binary system
            if material['e_above_hull'] < 1e-6: # Check if this compound is thermodynamically stable
                print(material['pretty_formula'] + ',' + str(material['band_gap'])) # Output one "formula,band_gap" csv row to the screen

Rather than running this code yourself, just familiarize yourself with the approach and download the resulting file here—let’s be good citizens and save MP’s API some bandwidth.
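
Once you have the file, reading it back into Python takes only a few lines. The sketch below assumes the file has the same two-column "formula,band_gap" layout that the script above prints, with band gaps in eV; the function name read_band_gaps and the filename in the trailing comment are my own placeholders, and the sample rows only mimic the layout rather than reproduce MP values.

```python
import csv
import io


def read_band_gaps(lines):
    """Parse 'formula,band_gap' rows into a list of (formula, gap_in_eV) tuples."""
    records = []
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        formula, gap = row
        records.append((formula.strip(), float(gap)))
    return records


# Placeholder rows that only mimic the file's layout; they are not MP values.
sample = io.StringIO("AB,0.0\nCD2,1.5\n")
print(read_band_gaps(sample))

# With the downloaded file (the filename below is an assumption):
# with open("bandgaps.csv", newline="") as f:
#     data = read_band_gaps(f)
```

Because the function accepts any iterable of lines, the same code works on an open file or on an in-memory string, which makes it easy to check before pointing it at the real data.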

It is worth pausing at this stage to reflect on how intensely painful the data aggregation and cleaning process usually is for machine learning applications. In this case, we were able to programmatically access several thousand training examples whose properties were calculated properly and consistently by a single research group. A more general problem, such as predicting the bulk modulus of materials using any available experimental and simulated training data, requires a massive data aggregation and cleaning effort. Questions we might encounter in such an effort include: “If the literature contains a range of values for a single material’s bulk modulus, which do we take?” “How do we standardize values that might have been measured or calculated with different approaches?” “Do we only use values measured at room temperature, or should we define an acceptable range of experimental temperatures?” “Do we only use measurements on bulk polycrystalline samples, or do we have a way of accommodating other cases such as single-crystal measurements?”
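
To give a flavor of what answering even two of those questions looks like in code, here is a minimal sketch that keeps only measurements taken near room temperature and then collapses multiple reported values for a material into their median. The record layout, the temperature window, the choice of median, and the function name aggregate are all assumptions made for illustration; the records are made-up placeholders, not real measurements.

```python
from collections import defaultdict
from statistics import median


def aggregate(records, t_min=288.0, t_max=308.0):
    """Keep measurements within [t_min, t_max] kelvin; return the median per material."""
    grouped = defaultdict(list)
    for rec in records:
        if t_min <= rec["temperature_K"] <= t_max:
            grouped[rec["material"]].append(rec["value"])
    return {material: median(values) for material, values in grouped.items()}


# Made-up placeholder records, not real measurements.
records = [
    {"material": "X", "value": 100.0, "temperature_K": 298.0},
    {"material": "X", "value": 110.0, "temperature_K": 300.0},
    {"material": "X", "value": 150.0, "temperature_K": 77.0},
    {"material": "Y", "value": 50.0, "temperature_K": 295.0},
]
print(aggregate(records))
```

Every default in this sketch encodes a judgment call (which temperatures count, which statistic to trust), and a real aggregation effort has to make and document many more of them.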

Now, we have a clean dataset downloaded and ready to go. Next time, we will put that data to use.

Edit: Part 2 of Machine Learning for the Materials Scientist is up.