Adding benchmarks¶
This guide will break down the process of adding a new benchmark into several steps:
Identifying metrics¶
Having selected an application/system/property of interest to test, the first step is to identify quantifiable metrics of performance.
Often, this will be a comparison to a reference value, such as the error with respect to DFT predictions of energies.
Note
In future, we expect to support user-selection of error metrics, e.g. MAE, RMSE, or higher-order errors.
Reference data may also include experimental data, or higher-accuracy theoretical predictions, e.g. CCSD(T).
In some cases, metrics may also encode correct behaviour without a specific reference, such as by quantifying features of a known distribution (curvature, minima, etc.), or quantifying the stability of a simulation.
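For example, an energy metric might compare model predictions against DFT reference energies via the mean absolute error (MAE). The following is a minimal, self-contained sketch; the function and values are hypothetical and only illustrate the kind of quantity a benchmark might report:

import numpy as np


def energy_mae(ref_energies: list[float], pred_energies: list[float]) -> float:
    """Mean absolute error between reference (e.g. DFT) and predicted energies."""
    return float(np.mean(np.abs(np.asarray(ref_energies) - np.asarray(pred_energies))))


# e.g. energy_mae([0.00, 0.27, 2.50], [0.05, 0.31, 2.38]) gives ~0.07 eV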
Running Calculations¶
Create a new directory in mlip_testing/calcs with a short, unique benchmark name. Write a script that will run the MLIP calculations of interest for each model being tested. The file should be named calc_[benchmark_name].py, and placed in mlip_testing/calcs/[benchmark_name].
While not a requirement, we recommend placing input files in mlip_testing/calcs/[benchmark_name]/data, and output files in mlip_testing/calcs/[benchmark_name]/outputs, for consistency.
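Following these conventions, the directory for a new benchmark would be laid out roughly as follows (an illustrative sketch, not a strict requirement):

mlip_testing/calcs/[benchmark_name]/
├── calc_[benchmark_name].py
├── data/       # input files, e.g. struct.xyz
└── outputs/    # files written by the calculations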
The test contained in this file may be runnable as a standalone script, but it should also be possible to run with pytest, e.g.:

pytest -v -s mlip_testing/calcs/[benchmark_name]/calc_[benchmark_name].py
Note
The -s flag stops pytest from intercepting stdout, so print statements will be printed to the console.
pytest will run any functions beginning with test_, enabling multiple types of calculation to be defined, discovered, and run within the same benchmark. Future examples will also demonstrate the use of fixtures, which allow calculations such as relaxation to be reused by multiple tests within the module.
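As a rough illustration of this pattern (a sketch only, using ASE's EMT calculator as a stand-in for an MLIP calculator), a relaxation can be performed once per module and reused by several test_ functions:

import pytest
from ase.build import bulk
from ase.calculators.emt import EMT
from ase.optimize import BFGS


@pytest.fixture(scope="module")
def relaxed_struct():
    """Relax a structure once, for reuse by multiple tests in this module."""
    struct = bulk("Cu")
    struct.calc = EMT()  # stand-in; a benchmark would attach an MLIP calculator
    BFGS(struct).run(fmax=0.01)
    return struct


def test_energy(relaxed_struct) -> None:
    """Use the relaxed structure to compute an energy."""
    relaxed_struct.get_potential_energy()


def test_forces(relaxed_struct) -> None:
    """Reuse the same relaxed structure to compute forces."""
    relaxed_struct.get_forces()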
Current examples are implemented with two alternative approaches:
Defining a script that iterates over models:

For consistency, we use the same model inputs as expected by mlipx, imported as MODELS, which can then be used to obtain the model name and associated calculator. Using pytest parametrisation, the same calculation is run for each model name-model pair:
Note
Some imports are not included in the following example for simplicity.
from mlip_testing.calcs.models.models import MODELS

DATA_PATH = Path(__file__).parent / "data"
OUT_PATH = Path(__file__).parent / "outputs"


@pytest.mark.parametrize("mlip", MODELS.items())
def test_benchmark(mlip: tuple[str, GenericASECalculator]) -> None:
    """
    Run calculations required for the new benchmark.

    Parameters
    ----------
    mlip
        Name of the model, and the model from which to get the calculator.
    """
    model_name, model = mlip

    # Read in data and attach calculator
    struct = read(DATA_PATH / "struct.xyz")
    struct.calc = model.get_calculator()

    # Run calculation
    struct.get_potential_energy()

    # Write output structure for analysis
    write_dir = OUT_PATH / model_name
    write_dir.mkdir(parents=True, exist_ok=True)
    write(write_dir / "struct.xyz", struct)
Defining a ZnTrack node to run via mlipx:

The process of running these is largely as described by mlipx, including running dvc init in mlip_testing/calcs/[benchmark_name].
Note
In general, this would also require running git init, but the repository should already be tracked by git.
In this example, we create the NewBenchmark node, which defines a run function to perform the calculation using each model; the model is set automatically by mlipx via zntrack.deps(). mlipx also sets model_name via zntrack.params(), which we use to differentiate the output files. We also define test_new_benchmark, which enables this benchmark to be automatically identified and run using pytest.
Note
Some imports are not included in the following example for simplicity.
# Local directory to store input data
DATA_PATH = Path(__file__).parent / "data"
# Local directory to store output data
OUT_PATH = Path(__file__).parent / "outputs"


# New benchmark node
class NewBenchmark(zntrack.Node):
    """New benchmark."""

    model: NodeWithCalculator = zntrack.deps()
    model_name: str = zntrack.params()

    def run(self):
        """Run new benchmark."""
        # Read in data and attach calculator
        calc = self.model.get_calculator()
        struct = read(DATA_PATH / "struct.xyz")
        struct.calc = calc

        # Run calculation
        struct.get_potential_energy()

        write_dir = OUT_PATH / self.model_name
        write_dir.mkdir(parents=True, exist_ok=True)
        write(write_dir / "struct.xyz", struct)


def build_project(repro: bool = False) -> None:
    """
    Build mlipx project.

    Parameters
    ----------
    repro
        Whether to call dvc repro -f after building.
    """
    project = mlipx.Project()

    benchmark_node_dict = {}
    for model_name, model in MODELS.items():
        with project.group(model_name):
            benchmark = NewBenchmark(
                model=model,
                model_name=model_name,
            )
            benchmark_node_dict[model_name] = benchmark

    if repro:
        with chdir(Path(__file__).parent):
            project.repro(build=True, force=True)
    else:
        project.build()


def test_new_benchmark():
    """Run new benchmark via pytest."""
    build_project(repro=True)


if __name__ == "__main__":
    build_project()
Analysing Calculations¶
The output files created by Running Calculations must then be analysed to calculate the metrics as planned in Identifying metrics.
In principle, the exact form of this is flexible, as long as the outputs can be assembled as required in Build Dash components to build the new application tab.
However, we strongly recommend following the template described below, which enables automated creation of tables and scatter plots, as well as placing structures to be visualised in an appropriate directory to be accessed by the app.
As with the script created in Running Calculations, we create a new file to be run by pytest, containing a function beginning with test_ to launch the analysis. In this case, we name the file mlip_testing/analysis/[benchmark_name]/analyse_[benchmark_name].py, such that it can be run using:

pytest -v -s mlip_testing/analysis/[benchmark_name]/analyse_[benchmark_name].py
In order to automatically generate the components for our application, we will make use of decorators, such as @build_table and @plot_parity, which use the value returned by the function, in combination with any parameters set for the decorator. This therefore requires the values returned by decorated functions to be of a particular form.
For @build_table, the value returned should be of the form:
{
    "metric_1": {"model_1": value_1, "model_2": value_2, ...},
    "metric_2": {"model_1": value_3, "model_2": value_4, ...},
    ...
}
This will generate a table with columns for each metric, as well as “MLIP”, “Score”, and “Rank” columns. Tooltips for each column header can also be set by the decorator, as well as the location to save the JSON file to be loaded when building the app, which typically would be placed in mlip_testing/app/data/[benchmark_name].

Every benchmark should have at least one of these tables, which includes the score for each metric, allowing the table to calculate an overall score for the benchmark; this decorated function is therefore often called as a fixture by the test_ function. Benchmarks may also include other tables, which can be built similarly, although currently the scores from these cannot be straightforwardly combined into an overall table.
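As a minimal sketch of this (hypothetical metric and model names, with OUT_PATH and the decorator arguments as in the full example further below), a decorated function returning the dictionary above could look like:

@build_table(
    filename=OUT_PATH / "example_table.json",
    metric_tooltips={"Metric 1": "Description for metric 1 (units)"},
)
def example_metrics() -> dict[str, dict]:
    """Return one value per model for each metric."""
    return {"Metric 1": {"model_1": 0.1, "model_2": 0.2}}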
For @plot_parity, the value returned should be of the form:
{
    "ref": ref_values_list,
    "model_1": model_1_values_list,
    "model_2": model_2_values_list,
    ...
}
This will generate a scatter plot of reference value against model value for each model, as well as a dashed line representing y=x. Additional options can be set to specify the plot title, axes labels, and hover data. Hover data will always include x and y values, but additional labels for each point are set via a dictionary of label names and lists of labels, corresponding to the same data points as ref_values_list etc. (a brief sketch combining these follows the dictionary below):
{
    "label_1": label_1_list,
    "label_2": label_2_list,
    ...
}
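Combining the two dictionaries above, a minimal @plot_parity sketch (hypothetical values; OUT_PATH and the decorator arguments as in the full example further below) could look like:

@plot_parity(
    filename=OUT_PATH / "example_figure.json",
    title="Example parity plot",
    x_label="Predicted value",
    y_label="Reference value",
    hoverdata={"Labels": ["struct_1", "struct_2", "struct_3"]},
)
def example_values() -> dict[str, list]:
    """Return reference values and one list of predictions per model."""
    return {
        "ref": [1.0, 2.0, 3.0],
        "model_1": [1.1, 1.9, 3.2],
    }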
Typically, functions like this that generate plots are also fixtures, passed to another function that performs the aggregation needed to pass each metric's value on to the function that generates the table of all metrics. Further decorators will be added as required for common figures, including bar charts and non-parity scatter plots.
While not essential, we can also make use of the @pytest.fixture decorator, which allows the value returned by a function to be used directly as a parameter for other functions.
If your benchmark contains structures to be visualised, or images to be loaded, these should be saved to mlip_testing/app/data/[benchmark_name], as they must be added as assets to be loaded into the app.
Note
Some imports are not included in the following example for simplicity.
from mlip_testing.analysis.utils.decorators import build_table, plot_parity
from mlip_testing.analysis.utils.utils import mae
from mlip_testing.calcs.models.models import MODELS

CALC_PATH = Path(__file__).parent.parent.parent / "calcs" / "[benchmark_name]" / "outputs"
OUT_PATH = Path(__file__).parent.parent.parent / "app" / "data" / "[benchmark_name]"

REF_VALUES = {"path_b": 0.27, "path_c": 2.5}
def labels() -> list:
    """
    Get list of labels.

    Returns
    -------
    list
        List of all energy labels.
    """
    structs = read(CALC_PATH / "structs.xyz", index=":")
    return [struct.info["label"] for struct in structs]
@pytest.fixture
@plot_parity(
    filename=OUT_PATH / "figure_energies.json",
    title="Relative energies",
    x_label="Predicted energy / eV",
    y_label="Reference energy / eV",
    hoverdata={
        "Labels": labels(),
    },
)
def energies() -> dict[str, list]:
    """
    Get energies for all structures.

    Returns
    -------
    dict[str, list]
        Dictionary of all reference and predicted relative energies.
    """
    results = {"ref": []} | {mlip: [] for mlip in MODELS}
    ref_stored = False

    for model_name in MODELS:
        structs = read(CALC_PATH / model_name / "structs.xyz", index=":")
        results[model_name] = [struct.get_potential_energy() for struct in structs]
        if not ref_stored:
            results["ref"] = [struct.info["ref_energy"] for struct in structs]

        # Write structures for app
        structs_dir = OUT_PATH / model_name
        structs_dir.mkdir(parents=True, exist_ok=True)
        write(structs_dir / "structs.xyz", structs)

        ref_stored = True

    return results
@pytest.fixture
def metric_1(energies: dict[str, list]) -> dict[str, float]:
    """
    Get metric 1.

    Parameters
    ----------
    energies
        Reference and predicted energies for all structures.

    Returns
    -------
    dict[str, float]
        Dictionary of metric 1 values for each model.
    """
    results = {}
    for model_name in MODELS:
        results[model_name] = mae(energies["ref"], energies[model_name])
    return results
@pytest.fixture
def metric_2() -> dict[str, float]:
    """
    Get metric 2.

    Returns
    -------
    dict[str, float]
        Dictionary of metric 2 values for each model.
    """
    results = {}
    for model_name in MODELS:
        structs = read(CALC_PATH / model_name / "structs.xyz", index=":")
        # pred_properties: the model's predicted values for this property,
        # computed as required for the benchmark
        results[model_name] = mae(
            pred_properties, [struct.info["property"] for struct in structs]
        )
    return results
@pytest.fixture
@build_table(
    filename=OUT_PATH / "new_benchmark_metrics_table.json",
    metric_tooltips={
        "Model": "Name of the model",
        "Metric 1": "Description for metric 1 (units)",
        "Metric 2": "Description for metric 2 (units)",
    },
)
def metrics(
    metric_1: dict[str, float], metric_2: dict[str, float]
) -> dict[str, dict]:
    """
    Get all new benchmark metrics.

    Parameters
    ----------
    metric_1
        Metric 1 value for all models.
    metric_2
        Metric 2 value for all models.

    Returns
    -------
    dict[str, dict]
        Metric names and values for all models.
    """
    return {
        "Metric 1": metric_1,
        "Metric 2": metric_2,
    }
def test_new_benchmark(metrics: dict[str, dict]) -> None:
    """
    Run new benchmark analysis.

    Parameters
    ----------
    metrics
        All new benchmark metric names and dictionary of values for each model.
    """
    return
Build Dash components¶
Any tables and figures to be added to the app should have been created and saved by running the test defined in Analysing Calculations.
The final step is to assemble these for the Dash application by defining a layout, and to set up any required interactivity by defining callback functions. Building these components and their interactivity should become increasingly automated, but less standard plots and interactions will need to be set up.
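As a rough, hypothetical sketch only (not the app's actual structure, and assuming the figure JSON saved by @plot_parity is standard Plotly JSON), a saved figure could be loaded into a simple Dash layout as follows, with callback functions added similarly for any interactivity:

import plotly.io as pio
from dash import Dash, dcc, html

# Hypothetical path; figure_energies.json is the file saved by @plot_parity above
figure = pio.read_json("mlip_testing/app/data/example_benchmark/figure_energies.json")

app = Dash(__name__)
app.layout = html.Div(
    [
        html.H2("New benchmark"),
        dcc.Graph(id="energies-parity", figure=figure),
    ]
)

if __name__ == "__main__":
    app.run(debug=True)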
For now, please contact us to help with this process.