class light_pfp_data.utils.dataset.H5DatasetWriter(h5_file: Union[str, Path, File], mode: str = 'copy')

Bases: object

A class for saving training structures and their corresponding potential energies,
forces, and stresses to HDF5 files.

add_item(model_version: str, calc_mode: str, atoms: Atoms, potential_energy: float, forces: ndarray, stress: Optional[ndarray] = None, **kwargs) → None

Add one item into the dataset.

Parameters
  • model_version (str) – The PFP version used to compute the potential energy.

  • calc_mode (str) – The PFP calculation mode used to compute the potential energy.

  • atoms (Atoms) – The input ASE atoms.

  • potential_energy (float) – The potential energy.

  • forces (np.ndarray) – The forces.

  • stress (np.ndarray, optional) – The virial stress. Defaults to None.

get_atoms(key: str, add_calc: bool = False) → Atoms

Get the ASE atoms object of one item.

Parameters
  • key (str) – The key of the item.

  • add_calc (bool, optional) – Whether to attach a calculator that holds the stored
    potential energy, forces, etc. Defaults to False.

Returns

The ASE atoms object.

Return type

Atoms

property n_items: int

recalculate(model_version: str, calc_mode: str, show_progress_bar: bool = False, executor: Optional[ThreadPoolExecutor] = None, num_threads: int = 8, max_retries: int = 0) → List[Future]

Recalculate the potential energy, forces, and stress of all items in the dataset
with the given model version and calculation mode.

Parameters
  • model_version (str) – The PFP version.

  • calc_mode (str) – The PFP calculation mode.

  • show_progress_bar (bool, optional) – Whether to show a progress bar. Defaults to False.

  • executor (ThreadPoolExecutor, optional) – Thread pool executor used for parallel calculation. Defaults to None.

  • num_threads (int, optional) – Max number of threads to use for executor if no executor is passed. Defaults to 8.

  • max_retries (int, optional) – Max retries for PFP calculation. Defaults to 0.

update_item(key: str, **kwargs)

Update one item in the dataset by overwriting the old values with new ones.

Parameters

key (str) – The key of the item.

light_pfp_data.utils.dataset.check_quality(datasets: List[Union[H5DatasetWriter, File]], max_energy: float = 0.0, max_forces: float = 20.0, delete_invalid_keys: bool = False, print_info: bool = True) → bool

Checks the quality of a group of datasets.

Specifically, this function checks whether the calculation parameters
(model_version and calc_mode) are consistent both within each dataset and across datasets.
If these values are inconsistent, the group of datasets is unsuitable for training.

This function also checks that all items in each dataset are below the specified maximum
energy and force thresholds. An item that exceeds either threshold is likely unsuitable for training.

Parameters
  • datasets (list[H5DatasetWriter or File]) – The list of datasets to check.

  • max_energy (float, optional) – The threshold for acceptable maximum energy per atom (eV/atom).
    Defaults to 0.0.

  • max_forces (float, optional) – The threshold for acceptable maximum force (eV/angstrom).
    Defaults to 20.0.

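  • delete_invalid_keys (bool, optional) – Whether to delete invalid keys. Defaults to False.

  • print_info (bool, optional) – Whether to print information about the check. Defaults to True.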

Returns

Whether the quality of the given group of datasets is suitable for training.

Return type

bool
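
The two checks described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: items are represented as plain dicts rather than HDF5 groups, `check_quality_sketch` is a hypothetical name, "maximum force" is taken to mean the largest per-atom force norm, and an item counts as invalid only when it strictly exceeds a threshold. The version and mode strings are placeholders.

```python
import math
from typing import Dict, List, Tuple

Item = Dict[str, object]
Dataset = Dict[str, Item]


def check_quality_sketch(
    datasets: List[Dataset],
    max_energy: float = 0.0,
    max_forces: float = 20.0,
) -> Tuple[bool, List[Tuple[int, str]]]:
    """Return (suitable, invalid_keys) for a group of dict-based datasets."""
    params = set()   # distinct (model_version, calc_mode) pairs seen so far
    invalid = []     # (dataset index, key) of items exceeding a threshold
    for index, dataset in enumerate(datasets):
        for key, item in dataset.items():
            params.add((item["model_version"], item["calc_mode"]))
            forces = item["forces"]  # one [fx, fy, fz] row per atom
            energy_per_atom = item["potential_energy"] / len(forces)
            largest_force = max(math.hypot(*row) for row in forces)
            if energy_per_atom > max_energy or largest_force > max_forces:
                invalid.append((index, key))
    # Parameters are consistent when exactly one pair occurs in all datasets.
    consistent = len(params) <= 1
    return consistent and not invalid, invalid


good = {
    "model_version": "version_a",  # placeholder, not a real PFP version
    "calc_mode": "mode_a",         # placeholder, not a real calculation mode
    "potential_energy": -8.0,
    "forces": [[0.1, 0.0, 0.0], [-0.1, 0.0, 0.0]],
}
bad = dict(good, forces=[[30.0, 0.0, 0.0], [-30.0, 0.0, 0.0]])
other_version = dict(good, model_version="version_b")

print(check_quality_sketch([{"a": good}]))
print(check_quality_sketch([{"a": good, "b": bad}]))
print(check_quality_sketch([{"a": good}, {"c": other_version}]))
```

The first call reports a suitable group, the second flags item "b" for its large forces, and the third fails only because the two datasets were computed with different model versions.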

light_pfp_data.utils.dataset.dataset_dist_analysis(h5_file: File, filename: str) → None

Analyze a dataset and draw the distribution of energy and forces.

Parameters
  • h5_file (File) – The dataset in h5py.File format.

  • filename (str) – The path to save the output figure.