Abstract
We present CrysMTM, a large-scale, multimodal dataset designed to benchmark temperature- and phase-sensitive machine learning models for crystalline materials. The dataset comprises approximately 30 000 atomistic samples covering the three primary polymorphs of titanium dioxide (anatase, brookite, and rutile), each evaluated across a temperature spectrum ranging from cryogenic to ambient and elevated conditions. Each data entry integrates three complementary modalities: (1) three-dimensional atomic coordinates, (2) RGBA molecular visualizations, and (3) structured textual metadata encompassing geometric descriptors, local bonding environments, and phase transformation parameters. This multimodal structure enables both supervised and self-supervised learning across graph-based, image-based, and language-based architectures. CrysMTM supports rigorous evaluation of model robustness under thermal perturbations and crystallographic phase transitions. Baseline benchmarking across 18 models, including graph neural networks (GNNs), convolutional neural networks, and foundation models, reveals significant property-specific challenges. For example, bandgap predictions exhibit errors exceeding 25%, while volumetric expansion and atomic displacement estimations frequently deviate by more than 100%. Even state-of-the-art GNNs, which achieve an average in-distribution (ID) mean absolute percentage error of approximately 20%, show a threefold increase under out-of-distribution (OOD) thermal conditions. In contrast, a few-shot multimodal large language model reduces global prediction error from 96% to 23% and narrows the performance gap between ID and OOD cases to just four percentage points. These results highlight both the selective difficulty posed by temperature-sensitive geometric targets and the considerable room for innovation in model design. All dataset files, model implementations, and pretrained checkpoints are available at https://github.com/KurbanIntelligenceLab/CrysMTM.
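The error figures quoted in the abstract are mean absolute percentage errors, and the ID/OOD comparison is a difference between two such errors. A minimal sketch of how these quantities could be computed follows, assuming the standard MAPE definition. The function names `mape` and `id_ood_gap` and all numeric values are hypothetical illustrations; they are not taken from the CrysMTM repository or its results.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned in percent.

    Assumes every reference value in y_true is non-zero.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def id_ood_gap(id_true, id_pred, ood_true, ood_pred):
    """OOD MAPE minus ID MAPE, in percentage points."""
    return mape(ood_true, ood_pred) - mape(id_true, id_pred)


# Hypothetical reference and predicted values (made up for illustration).
id_true, id_pred = [3.2, 3.0], [3.04, 3.15]      # temperatures seen in training
ood_true, ood_pred = [3.2, 3.0], [2.88, 3.45]    # held-out temperatures

print(f"ID MAPE:  {mape(id_true, id_pred):.1f}%")
print(f"OOD MAPE: {mape(ood_true, ood_pred):.1f}%")
print(f"Gap:      {id_ood_gap(id_true, id_pred, ood_true, ood_pred):.1f} points")
```

Reporting the gap in percentage points, rather than as a ratio, matches how the abstract states the narrowing between ID and OOD cases.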
| Original language | English |
|---|---|
| Article number | 030603 |
| Journal | Machine Learning: Science and Technology |
| Volume | 6 |
| Issue number | 3 |
| DOIs | |
| Publication status | Published - 30 Sept 2025 |
Keywords
- Benchmark
- Dataset
- DFTB
- Explainability
- LLM
- Temperature dependence
Article title: CrysMTM: a multiphase, temperature-resolved, multimodal dataset for crystalline materials