Materials Informatics and Data Mining for Materials Science
September 24, 2008
By Krishna Rajan
Broadly speaking, two primary functions of data mining, pattern recognition and prediction, together form the foundations for the study of material behavior and for material discovery. The search for new or alternative materials, for instance, whether through experiment or simulation, is a slow and arduous process, punctuated by infrequent and often unexpected discoveries. Each such finding prompts a flurry of studies in which we seek to better understand the science governing the behavior of the new material. While informatics is well established in some fields (such as biology, drug discovery, astronomy, and quantitative social sciences), materials informatics is still in its infancy. Results of the few systematic efforts undertaken to analyze trends in data as a basis for predictions have been in large part inconclusive, not least because of the lack of large amounts of organized data and, even more important, the challenge of sifting through the data in a timely and efficient manner.
It might seem natural to assume that large amounts of data are critical for any serious informatics study. In materials science applications, however, what constitutes "enough" data can vary significantly. In structural ceramics, for instance, it is difficult to obtain measurements of "fracture toughness," and, in fact, just a few careful measurements can be of great value for some of the more complex materials. Similarly, reliable measurements of fundamental constants or properties for a given material require the use of very detailed measurement and/or computational techniques. Unlike astronomers or biologists, who look at the world (environment) around them to gather data and then analyze it to find out what is important, materials scientists do a great deal of analysis (and/or experimentation) to get their data. The result is a number of challenges that are unique to materials science: lack of sufficient data, skewed datasets, and missing information, among others. On the other hand, the emergence of high-throughput data-acquisition techniques in materials science, such as combinatorial experimentation, offers unprecedented opportunities as well as challenges in data-driven discovery techniques. With such widely different issues in data characteristics, materials science offers an exciting domain for the application of the science of data mining.
Examples of Materials Informatics: Solving Materials Science Questions with a Data Mining Paradigm
Figure 1. Left, one-dimensional autocorrelation along the y-axis for Mg atoms in an Al–1.9Zn–1.7Mg alloy at t = 3600 sec. Right, a cross section showing the superimposition of Mg, Al, and Zn autocorrelation functions. The co-clustering of Mg and Zn is clearly visible.
Data mining for combinatorial catalysis experiments. We examined a dataset of 1001 catalyst chemistries that sampled the complete composition spread of a five-dimensional search space spanned by the elements Cr, Co, Mn, Mo, and Ni. We applied principal component analysis to the dataset to detect correlations between the constituent elements and either selectivity or activity. By using singular value decomposition to reduce the dimensionality of the data, combined with clustering analysis, we established correlations between the presence of a specific metal species and the selectivity for a given reaction product. In this manner, we were able to identify, from a large combinatorially generated dataset, which combinations of constituents of heterogeneous catalysts are associated with which final products.
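The kind of analysis described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the compositions and the selectivity values are synthetic stand-ins generated on the spot, the function names `pca_via_svd` and `kmeans` are hypothetical, the columns are standardized before decomposition, and a plain k-means stands in for whatever clustering method was actually applied.

```python
import numpy as np

ELEMENTS = ["Cr", "Co", "Mn", "Mo", "Ni"]


def pca_via_svd(X, n_components):
    """Principal component analysis computed from the SVD of standardized data.

    Assumes no column of X is constant (otherwise its standard deviation is zero).
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # center and scale each column
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # samples in PC coordinates
    loadings = Vt[:n_components]                     # weight of each variable per PC
    explained = (s**2 / np.sum(s**2))[:n_components]  # fraction of variance per PC
    return scores, loadings, explained


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd-style k-means; an assumed stand-in for the clustering step."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Synthetic stand-in data (NOT the catalyst dataset from the article):
# random five-element compositions that sum to 1, and a made-up selectivity
# that is constructed to increase with the Mo fraction.
rng = np.random.default_rng(0)
compositions = rng.dirichlet(np.ones(len(ELEMENTS)), size=200)
selectivity = 0.8 * compositions[:, ELEMENTS.index("Mo")] + 0.05 * rng.normal(size=200)

X = np.column_stack([compositions, selectivity])
scores, loadings, explained = pca_via_svd(X, n_components=2)
labels, centers = kmeans(scores, k=3)

# Variables with large loadings of the same sign on a component vary together.
for name, weight in zip(ELEMENTS + ["selectivity"], loadings[0]):
    print(f"{name:>11s}: {weight:+.2f}")
```

In a sketch like this, an element whose loading shares sign and magnitude with the selectivity column on a dominant component is a candidate for the kind of element-to-product correlation discussed in the text; the cluster labels then group catalysts with similar positions in the reduced space.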
Data Mining Challenges
1. How can data mining/machine learning be used most effectively to discover the attributes (or combinations of attributes) that govern specific properties in a material? Using information from different databases, we can compare and search for associations and patterns that could lead to ways of relating information among the different datasets.
2. What are the most interesting patterns that can be extracted from existing materials science data? Such a pattern search process can potentially yield associations between seemingly disparate datasets and could also establish possible correlations between parameters that are not easily studied experimentally in a coupled manner.
3. How can we use associations mined from large volumes of data to guide future experiments and simulations? How can we select from a materials library the compounds that are most likely to have desired properties? Incorporation of data mining methods into design and testing methodologies would increase the efficiency of optimizing materials processing techniques. For instance, a possible testbed for material discovery can involve the use of massive databases on crystal structure, electronic structure, and thermochemistry. Each of these databases by itself can provide information on hundreds of binary, ternary, and multicomponent systems. This library, coupled with electronic structure and thermochemical calculations, can be enlarged to permit a wide array of simulations for thousands of combinations of material chemistries. Such a massively parallel approach to the generation of new "virtual" data would be a daunting if not impossible task were it not for data mining tools.
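As a toy illustration of the first and third questions, the sketch below joins two small databases on a shared compound identifier and then ranks untested compounds by a predicted property. Everything in it is an assumption made for illustration: the compound IDs, descriptor names, and numerical values are invented, and the nearest-neighbor average is only one simple choice of predictor, not a method prescribed by the article.

```python
import math

# Hypothetical databases keyed by compound ID; all values are invented.
structure_db = {
    "cmpd_A": {"descriptor_1": 0.10},
    "cmpd_B": {"descriptor_1": 0.20},
    "cmpd_C": {"descriptor_1": 0.80},
    "cmpd_D": {"descriptor_1": 0.90},
    "cmpd_E": {"descriptor_1": 0.85},
    "cmpd_F": {"descriptor_1": 0.15},
}
thermo_db = {
    "cmpd_A": {"descriptor_2": 0.30},
    "cmpd_B": {"descriptor_2": 0.25},
    "cmpd_C": {"descriptor_2": 0.70},
    "cmpd_D": {"descriptor_2": 0.75},
    "cmpd_E": {"descriptor_2": 0.72},
    "cmpd_F": {"descriptor_2": 0.28},
    "cmpd_G": {"descriptor_2": 0.50},  # absent from structure_db, so the join drops it
}
# Property measured for only a subset of the library (invented values).
measured_property = {"cmpd_A": 1.0, "cmpd_B": 1.2, "cmpd_C": 3.0, "cmpd_D": 3.4}


def merge_records(*databases):
    """Inner join: keep only compounds that appear in every database."""
    shared = set(databases[0]).intersection(*databases[1:])
    merged = {}
    for cid in sorted(shared):
        record = {}
        for db in databases:
            record.update(db[cid])
        merged[cid] = record
    return merged


def predict_knn(library, measured, candidate, k=2):
    """Average the measured property of the k nearest measured compounds."""
    keys = sorted(library[candidate])

    def distance(other):
        return math.dist(
            [library[candidate][key] for key in keys],
            [library[other][key] for key in keys],
        )

    nearest = sorted(measured, key=distance)[:k]
    return sum(measured[cid] for cid in nearest) / len(nearest)


library = merge_records(structure_db, thermo_db)
untested = [cid for cid in library if cid not in measured_property]

# Rank untested compounds from highest to lowest predicted property.
ranking = sorted(
    untested,
    key=lambda cid: predict_knn(library, measured_property, cid),
    reverse=True,
)
print(ranking)
```

The ranking suggests which untested compounds to measure or simulate first. With real data the descriptors would need to be put on comparable scales before distances are computed, and the predictor would need to be validated against held-out measurements.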
In conclusion, data-intensive approaches to the discovery of behavioral models (as opposed to traditional mathematical modeling) can be powerful tools for accelerating progress in materials science.
Krishna Rajan holds the Stanley Chair of Interdisciplinary Engineering and is a professor of materials science and engineering at Iowa State University.