Overview
This article clarifies the distinction between “predictive models” and “search algorithms,” which are often confused in the field of Materials R&D DX, and explains the characteristics of the four main model families that are essential in practice.
It presents the purpose-based selection criteria used by professional data scientists, helping you choose the algorithm best suited to the characteristics of your data.
The result is a practical guide to a data-driven approach for accelerating research and development.
Introduction
When promoting AI-driven materials development (Materials R&D DX), the use of algorithms is unavoidable. However, even though they are all referred to as algorithms, did you know that those used in Materials Informatics (MI) can be broadly divided into two types?
One is machine learning algorithms (predictive models), which learn patterns from experimental data.
The other is search algorithms (optimization methods), which use those models to find optimal experimental conditions.
Although both are called algorithms, their roles are clearly different. If these are confused, it can lead to questions such as, “Which method should we use?” or “What is the difference between Random Forest and Bayesian Optimization?”
In this article, we focus on machine learning algorithms (predictive models), which form the foundation, and explain how data scientists in practice understand these models and on what basis they compare and select them.
1. The Role of Predictive Models in Data-Driven Development
First, let us clarify the terminology again. As mentioned above, in data-driven materials development, two types of algorithms are mainly used.
Machine Learning Algorithms (Predictive Models)
- Methods for forward analysis that predict results from conditions.
- Role: They learn patterns from experimental data and function as a calculation formula (engine) that predicts “material properties” when new conditions are input.
- Examples: Random Forest, Lasso regression, Gaussian Process Regression, etc.
Search Algorithms (Optimization Methods)
- Methods for inverse analysis that explore conditions from desired results.
- Role: In response to questions such as “How can we create a stronger material?”, they repeatedly use the predictive model (engine) thousands of times to search for optimal conditions, acting as a navigator.
- Examples: Bayesian Optimization, Genetic Algorithms, etc.
- Note: They are used as an approach in which AI sequentially proposes conditions, replacing conventional statistical experimental design methods (DoE).
The focus of this article is the first category: predictive models.
When you use search tools such as Bayesian Optimization, this component may not be very visible, but a predictive model is in fact always operating behind the scenes.
A predictive model is, so to speak, a virtual experimental system built inside a computer.
No matter how excellent the search algorithm (optimization method) is, if the accuracy of this system (predictive model), which serves as the basis of calculation, is low, it will never reach optimal conditions. Therefore, understanding the characteristics of predictive models is essential for successful exploration.
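The “virtual experimental system” idea can be sketched in a few lines. This is a minimal illustration using scikit-learn and entirely synthetic data: the column meanings (additive fraction, process temperature) and the property formula are made-up assumptions, not a real materials dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical past experiments: [additive fraction, process temperature (degC)]
X = rng.uniform([0.0, 100.0], [1.0, 300.0], size=(40, 2))
# Hypothetical measured property (e.g., strength) with measurement noise
y = 50 + 30 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 1.0, size=40)

# Train the predictive model (the "engine") on past experiments
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# "Virtual experiment": predict the property for an untried condition
candidate = np.array([[0.5, 200.0]])
predicted_property = model.predict(candidate)[0]
print(f"Predicted property at untried condition: {predicted_property:.1f}")
```

A search algorithm would call `model.predict` like this thousands of times over candidate conditions, which is why the model's accuracy caps the quality of the search.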
2. Defining the Prediction Target: Regression or Classification
Before looking at specific algorithms, the first thing to decide is what you want to predict.
① Regression
- Purpose: Predict numerical values
- Examples: Tensile strength, thermal conductivity, yield, bandgap, etc.
- Use cases: This is the most common case. It is used when the objective is to “achieve higher values.”
② Classification
- Purpose: Determine categories (labels)
- Examples: Success/failure of synthesis, crystal structure (type A/B), presence/absence of toxicity
- Use cases: Used when determining whether an experiment is feasible, for example, in screening stages
In this article, we focus on ① Regression (numerical prediction), which is in highest demand in materials development.
It should be noted that many algorithms, such as Random Forest and Support Vector Machines, can be applied to both regression and classification.
In this article, we describe their characteristics when used for regression (numerical prediction). The selection of algorithms for classification problems will be explained in a future article.
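The dual use of one algorithm family for both task types can be seen directly in scikit-learn, where Random Forest ships as both a regressor and a classifier. The data here is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(60, 3))               # hypothetical process conditions
y_value = X @ np.array([2.0, -1.0, 0.5])          # continuous property -> regression
y_label = (y_value > y_value.mean()).astype(int)  # success/failure -> classification

reg = RandomForestRegressor(random_state=0).fit(X, y_value)   # predicts a number
clf = RandomForestClassifier(random_state=0).fit(X, y_label)  # predicts a label

print(reg.predict(X[:1]))  # numerical prediction
print(clf.predict(X[:1]))  # class label (0 or 1)
```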
3. Four Main Models Used in Practice
Many people may associate AI with deep learning (neural networks). However, in materials development, where the number of data points is typically on the order of tens to thousands, the following four groups are mainly used because they can achieve good accuracy even with relatively small datasets.
① Linear Models
Methods that attempt to capture trends in data using a straight line (or plane).
- Examples: Linear regression, Lasso, Ridge, PLS
- Characteristics: They are effective for simple relationships such as “increasing additive A proportionally increases strength.”
- Advantages: Because the model can be expressed as a mathematical formula (y = ax + b), it is very easy for humans to understand why a particular prediction is made. It is standard practice in data analysis to first build a baseline model using this approach.
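The interpretability advantage is concrete: each fitted coefficient reads as “increase this factor by one unit, and the property changes by this much.” A minimal Lasso sketch on synthetic data (the four “additives” and their true effects are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(50, 4))  # 4 hypothetical additive fractions
# True relationship: only additives 1 and 2 matter
y = 10 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.1, 50)

model = Lasso(alpha=0.01).fit(X, y)
print("intercept:", round(model.intercept_, 2))
print("coefficients:", np.round(model.coef_, 2))  # irrelevant additives shrink toward 0
```

Lasso's L1 penalty drives the coefficients of uninformative inputs toward zero, which is why it doubles as a quick relevance check when building a baseline.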
② Tree-Based Models
Methods that make predictions by combining numerous conditional branches, such as “if the temperature is above a certain value, go right; otherwise, go left.”
- Examples: Random Forest, XGBoost, LightGBM, CatBoost
- Characteristics: They can learn complex interactions that linear models cannot capture (for example, when A and B together produce a stronger effect).
- Advantages: They are the primary choice (de facto standard) in current MI practice. By combining with techniques such as SHAP analysis, it is possible to visualize which factors are influential, achieving an excellent balance between accuracy and interpretability.
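The interaction-learning ability can be demonstrated with a target that depends only on the product of two factors, a pattern no purely linear model can represent. The sketch below uses Random Forest's built-in `feature_importances_` as a lightweight stand-in for a full SHAP analysis; the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(200, 3))
# Property driven by the A*B interaction; factor C is irrelevant noise input
y = 10 * X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 200)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
# Importances for A and B dominate; C stays near zero
print(np.round(model.feature_importances_, 2))
```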
③ Kernel & Probabilistic Models
Using kernel methods, data is mapped into a higher-dimensional space, and predictions are made based on similarity (distance) between data points.
- Examples: Gaussian Process Regression (GPR), Support Vector Regression (SVR), Kernel Ridge Regression (KRR), Relevance Vector Machine (RVM)
- Characteristics: They follow an approach close to the intuition of chemists: “materials with similar chemical structures should exhibit similar properties.”
- Advantages: A shared strength is the ability to capture complex nonlinear relationships even with small datasets.
- SVR / KRR: Suitable for building stable models by controlling computational cost and reducing the influence of outliers
- GPR / RVM: Can estimate uncertainty in addition to predictions, making them particularly suitable for exploring unknown regions (e.g., Bayesian Optimization)
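The uncertainty output of GPR, which Bayesian Optimization consumes, looks like this in scikit-learn. A minimal one-dimensional sketch on synthetic data; the kernel choice (RBF plus a noise term) is a common default, not a universal recommendation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(15, 1))  # sparse hypothetical experiments
y = np.sin(X).ravel() + rng.normal(0, 0.05, 15)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

# Prediction AND uncertainty at an untried condition
X_new = np.array([[5.0]])
mean, std = gpr.predict(X_new, return_std=True)
print(f"prediction: {mean[0]:.2f} +/- {std[0]:.2f}")
```

The standard deviation grows in regions far from any training point, which is exactly the signal a search algorithm uses to balance exploitation against exploration.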
④ Ensemble Models
Methods that integrate predictions from multiple different models (e.g., Lasso and XGBoost) using a “consensus” approach.
(Note: In a broad sense, Random Forest is also an ensemble of decision trees, but here we refer to methods that combine different types of models.)
- Examples: Simple averaging, weighted averaging, stacking (stacked regressor), blending
- Characteristics: They integrate the outputs of multiple models. This can involve simple averaging, weighting more reliable models, or using another AI model to combine predictions (stacking).
- Advantages: They reduce the risk of overfitting and provide more robust (stable) predictions. In practice, they are often used as a default choice when unsure.
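A stacking sketch in scikit-learn, combining a linear base model and a tree-based one under a simple meta-model. The data is synthetic, and the choice of Ridge as the final estimator is an illustrative default.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(100, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(0, 0.1, 100)

stack = StackingRegressor(
    estimators=[
        ("lasso", Lasso(alpha=0.01)),            # linear base model
        ("gbr", GradientBoostingRegressor(random_state=0)),  # tree-based base model
    ],
    final_estimator=Ridge(),  # meta-model learns how to weight the two
)
stack.fit(X, y)
print(round(stack.score(X, y), 2))  # R^2 on the training data
```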
4. Guidelines for Model Selection Based on Purpose
Unfortunately, there is no universal model. Professional data scientists identify strong initial candidates based on the purpose.
- When prioritizing understanding and interpretability of phenomena
- Recommended: Linear models
- Reason: Simple and easy to compare with chemical intuition
- When pursuing predictive accuracy (with sufficient data)
- Recommended: Tree-based models
- Reason: With more than ~100 data points, they can capture complex interactions and often achieve the highest accuracy
- When working with very limited data
- Recommended: Kernel / probabilistic models
- Reason: They rely on similarity, allowing them to capture trends even with small datasets
- When prioritizing prediction stability
- Recommended: Ensemble models
- Reason: They compensate for weaknesses of individual models and reduce the risk of large errors
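The workflow these guidelines feed into, testing several candidates under consistent conditions, can be sketched with cross-validation. Synthetic data again; in practice you would swap in your own features, target, and candidate list.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, size=(80, 3))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.2, 80)

# One representative per family, compared under identical 5-fold CV
candidates = {
    "linear (Ridge)": Ridge(),
    "tree (Random Forest)": RandomForestRegressor(random_state=0),
    "kernel (SVR)": SVR(),
}
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    results[name] = scores.mean()
    print(f"{name}: mean R^2 = {results[name]:.2f}")
```

Because the true relationship here is linear, Ridge should score well; on data with strong interactions the ranking would typically flip toward the tree model, which is the whole point of comparing under one protocol.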
5. Conclusion: Automating the “Professional Validation Process”
Although we have introduced various methods and selection criteria, implementing and comparing these individually in practice requires significant effort and expertise.
Even if the characteristics of algorithms are understood, manually conducting comprehensive validation each time can be a heavy burden in practice.
Even professional data scientists rarely decide on a single model from the beginning.
Instead, they test multiple models under consistent conditions and select the most suitable one based on objective numerical evaluation.
Our Materials R&D DX platform automates this comprehensive validation process.
Using properly structured data, it trains and compares major algorithms, allowing the system to handle the extensive trial-and-error process required for model selection.
Tasks that computers excel at—such as model selection and tuning—can be left to AI.
Researchers can instead focus their time on interpreting insights and making creative decisions about the next experiments.
Next Article
In the next article, before moving on to search algorithms, we will explain evaluation metrics (R², RMSE, etc.) used to determine whether predictive models are sufficiently accurate for practical use.
Even if you use a search algorithm, an inaccurate model will only lead to incorrect guidance.
Evaluating model reliability in advance is essential for successful exploration.