CrysMTM: a multiphase, temperature-resolved, multimodal dataset for crystalline materials

  • Can Polat
  • Erchin Serpedin
  • Mustafa Kurban*
  • Hasan Kurban*

  * Corresponding author for this work

Research output: Contribution to journal (Article, peer-reviewed)

2 Citations (Scopus)

Abstract

We present CrysMTM, a large-scale, multimodal dataset designed to benchmark temperature- and phase-sensitive machine learning models for crystalline materials. The dataset comprises approximately 30 000 atomistic samples covering the three primary polymorphs of titanium dioxide (anatase, brookite, and rutile), each evaluated across a temperature spectrum ranging from cryogenic to ambient and elevated conditions. Each data entry integrates three complementary modalities: (1) three-dimensional atomic coordinates, (2) RGBA molecular visualizations, and (3) structured textual metadata encompassing geometric descriptors, local bonding environments, and phase transformation parameters. This multimodal structure enables both supervised and self-supervised learning across graph-based, image-based, and language-based architectures. CrysMTM supports rigorous evaluation of model robustness under thermal perturbations and crystallographic phase transitions. Baseline benchmarking across 18 models, including graph neural networks (GNNs), convolutional neural networks, and foundation models, reveals significant property-specific challenges. For example, bandgap predictions exhibit errors exceeding 25%, while volumetric expansion and atomic displacement estimations frequently deviate by more than 100%. Even state-of-the-art GNNs, which achieve an average in-distribution (ID) mean absolute percentage error of approximately 20%, show a threefold increase under out-of-distribution (OOD) thermal conditions. In contrast, a few-shot multimodal large language model reduces global prediction error from 96% to 23% and narrows the performance gap between ID and OOD cases to just four percentage points. These results highlight both the selective difficulty posed by temperature-sensitive geometric targets and the considerable room for innovation in model design. All dataset files, model implementations, and pretrained checkpoints are available at https://github.com/KurbanIntelligenceLab/CrysMTM.
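
The ID and OOD figures quoted above are mean absolute percentage errors (MAPE). As a rough illustration of how such a comparison can be computed, the following minimal sketch evaluates MAPE separately on samples inside and outside a training temperature window. The function names, the temperature window, and the toy values are all hypothetical and are not taken from the CrysMTM repository; the sketch also assumes that no target value is zero.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes nonzero targets)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_pred - y_true) / y_true)) * 100.0)


def id_ood_report(y_true, y_pred, temperatures, id_range):
    """Split samples by temperature and report MAPE for each split.

    id_range is a hypothetical (low, high) window, in kelvin, that stands in
    for the temperatures seen during training; everything outside is OOD.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    temperatures = np.asarray(temperatures, dtype=float)

    low, high = id_range
    is_id = (temperatures >= low) & (temperatures <= high)

    id_mape = mape(y_true[is_id], y_pred[is_id])
    ood_mape = mape(y_true[~is_id], y_pred[~is_id])
    return {
        "id_mape": id_mape,
        "ood_mape": ood_mape,
        # Gap expressed in percentage points, as in the abstract.
        "gap_points": ood_mape - id_mape,
    }


# Toy values for illustration only; these are not CrysMTM data.
toy_true = [2.0, 4.0, 5.0, 10.0]
toy_pred = [2.2, 3.6, 6.0, 7.0]
toy_temps = [300.0, 400.0, 50.0, 900.0]

report = id_ood_report(toy_true, toy_pred, toy_temps, id_range=(200.0, 600.0))
print(report)
```

Reporting the gap in percentage points rather than as a ratio keeps it directly comparable with the "four percentage points" figure in the abstract.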

Original language: English
Article number: 030603
Journal: Machine Learning: Science and Technology
Volume: 6
Issue number: 3
Publication status: Published - 30 Sept 2025

Keywords

  • Benchmark
  • Dataset
  • DFTB
  • Explainability
  • LLM
  • Temperature dependence
