Insights from the Intersection: Applying Data Science Thinking to Materials

I’ve spent the summer working at Citrine, fresh out of an undergraduate degree at Stanford where I studied both Materials Science and Computer Science. Though I thoroughly enjoyed studying both fields, I found limited opportunities to apply the two together until beginning work here. While companies in entertainment and shopping have reaped the benefits of massive data sets, many fields in the scientific community, notably materials science, have remained largely separate from data science even as they amass huge quantities of data. Working with materials data at Citrine has made me reflect on how differently data scientists and materials scientists can perceive data, and on how insights from data science can benefit materials research.

When we need accuracy:

A first insight from data science comes from reframing how information can be used in the research process. In academic materials science, the focus of study tends to be very narrow. Researchers write papers around a single breakthrough result, and can spend years studying a single material. In this frame of mind, the accuracy of a single data point is vital for making progress. However, maintaining this focus on individual data points at all times during the research process can slow the pace of development by constraining research to known areas. In the view of a data scientist, understanding larger patterns in the data is more important than the accuracy of any single point. Citrine’s technology creates value by finding hidden patterns in the data and applying those patterns to generate new insights. Because machine learning algorithms offer high throughput and require minimal computing resources, they can generate large amounts of information to direct research into valuable new areas that would not have been considered under a narrower view of the data.

Garbage in, garbage out:

In taking a data-driven approach to understanding problems, one of the most important challenges that data scientists face is ensuring data quality. Insights can only ever be as good as the data on which they are based. This is especially true for machine-learning-based models, which cannot fall back on physics if there are problems with the data. Here at Citrine, our solution to the issue of data reliability has been Citrination, a community-driven common repository for all different types of materials data. Having a trusted, comprehensive location for materials data would also be extremely useful for researchers, who could save valuable time by quickly validating data, either by surveying similar results or by checking results against models built on existing data.
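
As a rough illustration of checking a result against a model built on existing data, here is a minimal Python sketch. It is not Citrination's method: the function names, the straight-line model, the three-standard-deviation threshold, and the numbers are all assumptions made up for this example.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def is_suspect(xs, ys, new_x, new_y, k=3.0):
    """Flag a new measurement that sits far from the trend in existing data.

    The threshold (k times the sample standard deviation of the existing
    residuals) is a simple heuristic chosen for this sketch.
    """
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = stdev(residuals)
    return abs(new_y - (slope * new_x + intercept)) > k * spread


# Hypothetical existing measurements (illustrative values, not real data).
existing_x = [1, 2, 3, 4, 5, 6]
existing_y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

print(is_suspect(existing_x, existing_y, 7, 14.0))  # consistent with the trend
print(is_suspect(existing_x, existing_y, 7, 19.0))  # far from the trend
```

A flagged point is not necessarily wrong. It is simply one worth a second look before it feeds into further analysis.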

Follow the numbers:

The core concept of data science is potentially the most valuable for materials science: the belief that there is a wealth of important information hidden in patterns that can be uncovered given enough quality data. Reframing materials data analysis as pattern detection means that Citrine’s technology is not bound by the current limits of scientific knowledge. By finding complicated patterns and relationships through machine learning, we are able to tackle problems that scientific intuition does not yet have the means to explain or understand. Not only can these patterns help accelerate development, but they can also deepen scientific understanding by uncovering connections between things that do not immediately seem related.

Citrine brings together people and ideas from both data science and materials science and applies these insights to make research and development faster and smarter.