Data schema and HDF5 data support #322

Merged · 115 commits · Dec 13, 2024
Conversation

laserkelvin (Collaborator):

This PR introduces classes in preparation for a big refactor, emphasizing reproducibility and, generally, "explicit is better than implicit". The eventual goal is to replace the currently defined LMDB datasets with HDF5 plus a fully specified schema: HDF5 is a better-known and less problematic binary format, and a shared schema means all datasets can use the same Python implementation while being fully documented and validated at runtime.

  • Leans heavily on pydantic schema definition and validation, up to and including array shape validation in DataSampleSchema, and fully documented datasets with DatasetSchema (see the sketch after this list).
  • The schemas provide consistent field names, which should phase out DataDict and any ambiguity in the pipelines about which key maps to what.
  • As seen in __getitem__ of the new dataset class, the data loading logic is significantly simpler.
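To make the first bullet concrete, here is a minimal sketch of pydantic-based shape validation; the field names and validator below are illustrative stand-ins, not the actual DataSampleSchema definition from this PR.

```python
import numpy as np
from pydantic import BaseModel, ConfigDict, model_validator


class DataSampleSketch(BaseModel):
    """Hypothetical stand-in for DataSampleSchema (illustration only)."""

    model_config = ConfigDict(arbitrary_types_allowed=True)

    atomic_numbers: np.ndarray  # shape (num_atoms,)
    positions: np.ndarray       # Cartesian coordinates, shape (num_atoms, 3)

    @model_validator(mode="after")
    def check_shapes(self) -> "DataSampleSketch":
        # Consistency check after model creation: every per-atom field
        # must agree on the number of atoms.
        num_atoms = self.atomic_numbers.shape[0]
        if self.positions.shape != (num_atoms, 3):
            raise ValueError(
                f"positions has shape {self.positions.shape}, "
                f"expected ({num_atoms}, 3)"
            )
        return self


# Validates on construction; a (2, 3) positions array here would raise.
sample = DataSampleSketch(
    atomic_numbers=np.array([8, 1, 1]),
    positions=np.zeros((3, 3)),
)
```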

laserkelvin added the documentation, enhancement, data, dependencies, and code maintenance labels on Nov 25, 2024
melo-gonzo (Collaborator) left a comment:

Just a few minor comments for now. It would be nice to have some concrete documentation of how datasets should be created, but the tests cover a good amount of it. The docs can come as everything is migrated to this schema. Looks good overall!

a package or algorithm is used to compute this neighborhood
function.

The validation of this schema includes checking to ensure
melo-gonzo (Collaborator):

That the what?

laserkelvin (Author):

Finished in a0726c1


scf = "SCFCycle"
opt_trajectory = "OptimizationCycle"
property = "Property"
melo-gonzo (Collaborator):

Can this be something more specific, like CalculatedProperty?

laserkelvin (Author):

Addressed in 1fd2d88 - also added more categories so that it's specific
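
For illustration, a sketch of what more specific categories could look like; the member names below are hypothetical, not necessarily the ones added in 1fd2d88.

```python
from enum import Enum


class DataTypeSketch(Enum):
    """Hypothetical category enum; member names are illustrative only."""

    scf = "SCFCycle"
    opt_trajectory = "OptimizationCycle"
    calculated_property = "CalculatedProperty"  # replaces the generic "Property"
    measured_property = "MeasuredProperty"
```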


algorithm: Literal["pymatgen", "ase", "custom"]
allow_mismatch: bool
adaptive_cutoff: bool
algo_version: str | None = None
melo-gonzo (Collaborator):

This attribute could be confusing if you don't know that it's referring to the package version; algo_package_version would be more explicit.

laserkelvin (Author):

The thought was to have it map to either a hash or a package version if and when custom is supported, hence why it's a little ambiguous.
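
As a sketch of that intent (a hypothetical helper, not code from this PR): for the known packages the field can be filled from the installed version, falling back to a user-supplied hash for "custom".

```python
from importlib.metadata import version


def resolve_algo_version(algorithm: str, custom_hash: str | None = None) -> str | None:
    # Known packages: record the installed package version.
    if algorithm in ("pymatgen", "ase"):
        return version(algorithm)
    # "custom": fall back to a commit hash supplied by the user, if any.
    return custom_hash
```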

a consistency check after model creation to validate that per-atom
fields have the right number of atoms.

Parameters
melo-gonzo (Collaborator):

Does this need a frac_coords attribute also?

laserkelvin (Author):

I've done a bunch of refactoring here, and wrangled it so that fractional coordinates are computed if lattice parameters or a lattice matrix are available.
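
For reference, the underlying computation is the standard one (a sketch assuming a row-vector lattice convention, not the PR's exact code):

```python
import numpy as np


def to_fractional(cart_coords: np.ndarray, lattice: np.ndarray) -> np.ndarray:
    """Convert (N, 3) Cartesian coordinates to fractional coordinates,
    assuming the (3, 3) lattice matrix stores lattice vectors as rows."""
    return cart_coords @ np.linalg.inv(lattice)
```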

Instance of an ``Atoms`` object constructed with
the current data sample.
"""
return Atoms(
melo-gonzo (Collaborator):

pbc is commonly used in creating Atoms objects as well; should it be included?

laserkelvin (Author):

Addressed in ba79a34
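
For context, a sketch of an Atoms construction that includes pbc; the values here are illustrative, and the actual field mapping lives in the PR.

```python
import numpy as np
from ase import Atoms

atoms = Atoms(
    numbers=np.array([8, 1, 1]),  # atomic numbers per site
    positions=np.zeros((3, 3)),   # Cartesian coordinates, (N, 3)
    cell=np.eye(3) * 10.0,        # lattice matrix
    pbc=True,                     # periodic in all three directions
)
```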

If a sample of data already exists at the specified index while
overwrite is set to False.
"""
assert h5_file.mode != "r"
melo-gonzo (Collaborator):

Just a nitpick, but I feel like h5_file.mode == "w" would read better, unless you also expect "a" as well.

laserkelvin (Author):

Changed it to any writeable modes in 16c7bdc
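
One detail worth noting (a sketch, not the diff from 16c7bdc): once a file is open, h5py reports File.mode as only "r" or "r+" regardless of the mode string passed at open time, so read-only is the single case to reject.

```python
import h5py


def assert_writeable(h5_file: h5py.File) -> None:
    """Hypothetical helper: reject files that cannot be written to."""
    # h5py normalizes the reported mode: files opened with "w", "a",
    # "x", or "r+" all report "r+"; only read-only files report "r".
    if h5_file.mode == "r":
        raise ValueError("HDF5 file is opened read-only; cannot write samples.")
```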

Pydantic v2 doesn't support external JSON libraries
laserkelvin merged commit dc6a125 into IntelLabs:main on Dec 13, 2024
2 of 3 checks passed
laserkelvin deleted the hdf5-datasets branch on December 13, 2024 at 16:44