Physicality

Locality

Summary

Performance in respecting locality, measured as deviations from the expected negligible interactions between acetone and distant atoms.

Metrics

  1. Maximum difference in force due to “ghost atoms”

Forces on an isolated acetone molecule are calculated, and the forces on the same atoms are calculated after 20 Ne atoms are placed in a 60 Å cubic box, at least 40 Å from the acetone’s centre of mass. The magnitude of the maximum difference in force is reported.

  2. Mean difference in force due to a distant hydrogen

Forces on an isolated acetone molecule are calculated, and the forces on the same atoms are calculated after a single hydrogen atom is placed between 20 and 50 Å from the acetone’s centre of mass. This is repeated for 30 different random placements of the hydrogen atom, and the mean force difference on the acetone atoms is calculated.

  3. Standard deviation in force due to a distant hydrogen

Same as (2), but the standard deviation of the force difference on the acetone atoms is calculated.
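The three locality metrics can be sketched as follows. This is a minimal illustration, not the benchmark's implementation. The function names are hypothetical, forces are assumed to be per-atom (x, y, z) tuples in eV/Å, the "difference in force" is assumed to be the magnitude of the per-atom force difference vector, and the mean and standard deviation are assumed to pool all acetone atoms over all placements.

```python
import math
import statistics


def force_difference_norms(forces_ref, forces_new):
    """Per-atom magnitude of the force difference vector (eV/Å)."""
    return [
        math.dist(f_ref, f_new)
        for f_ref, f_new in zip(forces_ref, forces_new)
    ]


def max_force_difference(forces_ref, forces_new):
    """Metric 1: largest per-atom force change caused by the added atoms."""
    return max(force_difference_norms(forces_ref, forces_new))


def mean_and_std_force_difference(forces_ref, perturbed_forces):
    """Metrics 2 and 3: statistics over all placements of the distant atom.

    Assumption: differences are pooled over atoms and placements, and the
    population standard deviation is used.
    """
    diffs = []
    for forces_new in perturbed_forces:
        diffs.extend(force_difference_norms(forces_ref, forces_new))
    return statistics.fmean(diffs), statistics.pstdev(diffs)
```

A perfectly local model would return zero for all three values, since atoms tens of ångströms away should not change the forces on the acetone.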

Computational cost

Low: tests are likely to take less than a minute to run on CPU.

Data availability

None required.

Extensivity

Summary

Performance in respecting extensivity, measured as the difference in energy between isolated systems and the same systems combined but widely separated.

Metrics

  1. Absolute energy difference between isolated and combined slabs

The energy of each of two isolated slabs is calculated, as is the energy of the combined system with the two slabs separated by 100 Å. The absolute difference between the sum of the isolated slab energies and the energy of the combined system is reported.
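As a compact statement of this metric, using symbols introduced here purely for illustration (\(E_A\) and \(E_B\) for the isolated slab energies, \(E_{AB}\) for the energy of the combined, separated system):

\[
\Delta E = \left| \left( E_A + E_B \right) - E_{AB} \right|
\]

For a perfectly extensive model, \(\Delta E = 0\) whenever the slabs are far enough apart that they do not interact.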

Computational cost

Low: tests are likely to take less than a minute to run on CPU.

Data availability

None required.

Diatomics

Summary

This benchmark probes the short- to medium-range behaviour of every homonuclear and heteronuclear diatomic pair in the periodic table. Each MLIP is evaluated on a 100-point linear distance grid spanning 0.18-6.0 Å and the resulting energies and projected forces are analysed for unphysical oscillations.
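The sampling described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, both endpoints of the grid are assumed to be included, and the element list passed in is a placeholder rather than the full periodic table.

```python
from itertools import combinations_with_replacement


def distance_grid(start=0.18, stop=6.0, n_points=100):
    """Linearly spaced separations in Å (endpoints assumed included)."""
    step = (stop - start) / (n_points - 1)
    return [start + i * step for i in range(n_points)]


def diatomic_pairs(elements):
    """All unordered pairs: homonuclear (A, A) and heteronuclear (A, B)."""
    return list(combinations_with_replacement(elements, 2))


def diatomic_positions(distance):
    """Cartesian positions (Å) of two atoms separated along the z axis."""
    return [(0.0, 0.0, 0.0), (0.0, 0.0, distance)]
```

For n elements this yields n(n+1)/2 pairs, each evaluated at every grid point, which is why the benchmark is comparatively expensive.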

Metrics

  1. Force flips

    Average number of times the projected bond force changes sign. Forces are projected onto the bond axis and values below \(10^{-2}\) eV/Å are rounded to zero to avoid counting noise-induced flips. A smooth curve should switch from attraction to repulsion only once at the minimum.

  2. Energy minima

    Mean count of distinct minima in the energy-distance profile. Local minima are identified using the second derivative; a physical diatomic curve should show a single minimum.

  3. Energy inflections

    Mean number of inflection points obtained from the second derivative of the energy curve. Inflections are flagged when the second derivative changes sign with a tolerance of 0.5 eV/Ų to avoid counting noise-induced inflections. A physical diatomic curve should show one inflection point.

  4. \(\rho(E, \text{repulsion})\)

    Spearman correlation between atomic separation and energy on the repulsive side of the well (bond lengths shorter than the equilibrium spacing). A perfect diatomic curve should show a strong negative correlation, so a value of -1, indicating that as atoms get further apart, the energy decreases.

  5. \(\rho(E, \text{attraction})\)

    Spearman correlation between distance and energy on the attractive side (bond lengths ≥ the equilibrium spacing). A perfect diatomic curve should show a strong positive correlation, so a value of +1, indicating that as atoms get further apart, the energy increases.
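The force-flip, inflection, and Spearman metrics can be illustrated on a single curve. The sketch below is not the benchmark's code: all function names are hypothetical, the second derivative is taken by central finite differences on a uniform grid, entries below the tolerance are skipped rather than treated as a sign of their own, the rank correlation assumes no tied values, and the repulsive side is taken as the separations below the energy minimum. A Morse potential with arbitrary parameters stands in for an MLIP.

```python
import math


def count_sign_changes(values, tol):
    """Count sign changes, skipping entries with magnitude below tol."""
    signs = [1 if v > 0 else -1 for v in values if abs(v) >= tol]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def second_derivative(energies, spacing):
    """Central finite-difference second derivative on a uniform grid."""
    return [
        (energies[i + 1] - 2.0 * energies[i] + energies[i - 1]) / spacing**2
        for i in range(1, len(energies) - 1)
    ]


def spearman(x, y):
    """Spearman rank correlation (assumes no ties in x or y)."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        out = [0.0] * len(values)
        for rank, idx in enumerate(order):
            out[idx] = float(rank)
        return out

    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Stand-in curve: Morse potential with arbitrary illustrative parameters.
D, A, R0 = 4.0, 2.0, 1.2  # well depth (eV), width (1/Å), minimum (Å)
spacing = (6.0 - 0.18) / 99
grid = [0.18 + i * spacing for i in range(100)]
u = [math.exp(-A * (r - R0)) for r in grid]
energies = [D * (1.0 - ui) ** 2 - D for ui in u]
forces = [-2.0 * D * A * (ui - ui**2) for ui in u]  # -dE/dr along the bond

i_min = energies.index(min(energies))
force_flips = count_sign_changes(forces, 1e-2)
inflections = count_sign_changes(second_derivative(energies, spacing), 0.5)
rho_repulsion = spearman(grid[:i_min], energies[:i_min])
rho_attraction = spearman(grid[i_min:], energies[i_min:])
```

For this smooth curve the sketch gives one force flip, one inflection, and correlations of -1 and +1. Spurious oscillations in a model's curve would raise the counts and pull the correlations toward zero.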

Computational cost

High: Expected to take hours to run on GPU, or around one day for slower MLIPs.

Data availability

None required; diatomics are generated in ASE.

Oxidation States

Summary

Examines the model’s ability to capture different oxidation states of Fe in aqueous solution [1, 2]. Two systems containing Fe 2Cl (Fe+2 state) and Fe 3Cl (Fe+3 state) in water are simulated at 300 K for 20 ps with NVT MD. The solvation shell is expected to be tighter for iron in the Fe+3 state. If an MLIP correctly captures the Fe oxidation state, this effect appears as a split between the Fe-O RDF peaks of the two systems. This test examines whether such a split appears. Additionally, the benchmark examines whether the peaks fall within the expected experimental range [1].

[1] Kocer, Emir, et al. “Machine learning potentials for redox chemistry in solution.” arXiv preprint arXiv:2410.03299 (2024).

[2] Batatia, Ilyes, et al. “MACE-POLAR-1: A Polarisable Electrostatic Foundation Model for Molecular Chemistry.” arXiv preprint arXiv:2602.19411 (2026).

Metrics

  1. Fe-O RDF Peak Split

The similarity of the RDFs of the aqueous Fe 2Cl and Fe 3Cl systems is examined. If a split is present, the score is +1; if there is no clear split, the score is 0. This metric determines whether a model can capture the different oxidation states of Fe and is therefore weighted 5x more heavily than the two following metrics.

  2. Fe +2 Peak Experimental Ref Deviation

Deviation of the Fe 2Cl system’s RDF peak position from the experimental range.

  3. Fe +3 Peak Experimental Ref Deviation

Deviation of the Fe 3Cl system’s RDF peak position from the experimental range.
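One plausible way to compute the two deviation metrics is sketched below. The details are assumptions rather than the benchmark's definition: the function names are hypothetical, the peak is taken as the position of the global maximum of g(r), and the deviation is taken as zero inside the experimental range and as the distance to the nearest bound outside it. The range bounds are arguments and should be taken from [1].

```python
def peak_position(r_values, g_values):
    """Position (Å) of the largest value of the Fe-O RDF.

    Assumption: the first-shell peak is also the global maximum.
    """
    i_max = max(range(len(g_values)), key=g_values.__getitem__)
    return r_values[i_max]


def range_deviation(peak, lower, upper):
    """Distance (Å) from peak to the [lower, upper] range, 0 if inside."""
    if peak < lower:
        return lower - peak
    if peak > upper:
        return peak - upper
    return 0.0
```

With this definition, a model whose peak lies anywhere inside the experimental range scores a deviation of zero, so the metric only penalises peaks that fall outside the reported uncertainty.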

Computational cost

High: Expected to take hours to run on GPU, or around one day for slower MLIPs.

Data availability

Starting configurations for the MD are available in an S3 bucket. Experimental reference ranges for the RDF peaks were taken from [1].

[1] Kocer, Emir, et al. “Machine learning potentials for redox chemistry in solution.” arXiv preprint arXiv:2410.03299 (2024).

Water Slab Dipoles

Summary

Examines the distribution of the total dipole of a water slab, checking the width of the distribution and the number of structures showing dielectric breakdown.

Metrics

  1. Standard Deviation of Dipole Distribution

For a number of samples from an MD simulation, the total dipole is calculated, and the standard deviation of the resulting distribution is compared to a reference from an LR model trained on revPBE-D3.

  2. Number of structures with dielectric breakdown

The band gap is estimated from the dipole, and structures where the band gap vanishes are counted.

Computational cost

High: Requires around 500 ps of MD on a 40 Å slab to obtain a converged distribution, around 1 day on one GPU.

Data availability

https://arxiv.org/html/2603.04228v1