Data schema and HDF5 data support #322
Conversation
Just a few minor comments for now. It would be nice to have some concrete documentation of how datasets should be created, but the tests cover a good amount of it. The docs can come as everything is migrated to this schema. Looks good overall!
a package or algorithm is used to compute this neighborhood
function.

The validation of this schema includes checking to ensure
That the what?
Finished in a0726c1
matsciml/datasets/schema.py
scf = "SCFCycle"
opt_trajectory = "OptimizationCycle"
property = "Property"
Can this be something more specific, like `CalculatedProperty`?
Addressed in 1fd2d88 - also added more categories so that it's more specific
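For readers following along, a minimal sketch of what the more specific category enum might look like after that change; the member names here are assumptions for illustration, not the merged code:

```python
from enum import Enum


class DataCategory(Enum):  # hypothetical name for illustration
    scf = "SCFCycle"
    opt_trajectory = "OptimizationCycle"
    calculated_property = "CalculatedProperty"
```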
algorithm: Literal["pymatgen", "ase", "custom"]
allow_mismatch: bool
adaptive_cutoff: bool
algo_version: str | None = None
This attribute could be confusing if you don't know that it's referring to the package version. `algo_package_version` would be more explicit.
The thought was to have it map to a hash or package version if and when `custom` is supported, hence why it's a little ambiguous.
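One way to keep the shorter name while making that dual intent explicit would be a pydantic `Field` description. A sketch under that assumption (the model name is hypothetical, not the actual schema):

```python
from typing import Literal

from pydantic import BaseModel, Field


class NeighborListConfig(BaseModel):  # hypothetical model name
    algorithm: Literal["pymatgen", "ase", "custom"]
    allow_mismatch: bool
    adaptive_cutoff: bool
    # package version for known backends; a code hash if "custom" is ever supported
    algo_version: str | None = Field(
        default=None,
        description="Package version, or a code hash when algorithm='custom'.",
    )
```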
a consistency check after model creation to validate that per-atom
fields have the right number of atoms.

Parameters
Does this need a `frac_coords` attribute also?
I've done a bunch of refactoring here, and wrangled it so that fractional coordinates are computed if the lattice parameters/matrix are available.
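A rough sketch of how that computation could be wired into a pydantic v2 model; the field names are assumptions and the real `DataSampleSchema` may differ:

```python
import numpy as np
from pydantic import BaseModel, model_validator


class SampleSketch(BaseModel):  # simplified stand-in for DataSampleSchema
    model_config = {"arbitrary_types_allowed": True}

    cart_coords: np.ndarray
    lattice_matrix: np.ndarray | None = None
    frac_coords: np.ndarray | None = None

    @model_validator(mode="after")
    def compute_frac_coords(self):
        # derive fractional coordinates only when a lattice is available;
        # assumes rows of lattice_matrix are the lattice vectors
        if self.frac_coords is None and self.lattice_matrix is not None:
            self.frac_coords = self.cart_coords @ np.linalg.inv(self.lattice_matrix)
        return self
```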
Instance of an ``Atoms`` object constructed with
the current data sample.
"""
return Atoms(
`pbc` is commonly used in creating the `Atoms` objects as well - should it be included?
Addressed in ba79a34
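For context, this is roughly how `pbc` enters `ase.Atoms` construction; the values below are made up for illustration, and real code would pull them from the data sample:

```python
from ase import Atoms

atoms = Atoms(
    numbers=[14, 14],  # two Si atoms
    positions=[[0.0, 0.0, 0.0], [1.3575, 1.3575, 1.3575]],
    cell=[[5.43, 0.0, 0.0], [0.0, 5.43, 0.0], [0.0, 0.0, 5.43]],
    pbc=True,  # the flag discussed in this thread
)
```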
matsciml/datasets/generic.py
If a sample of data already exists at the specified index while
overwrite is set to False.
"""
assert h5_file.mode != "r"
Just a nitpick, but I feel like `h5_file.mode == "w"` would read better, unless you also expect `"a"` as well.
Changed it to any writeable modes in 16c7bdc
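Worth noting for anyone reading this thread: h5py normalizes file modes on open, so `File.mode` only ever reports `"r"` or `"r+"` (files opened with `"w"` or `"a"` report `"r+"`). A check covering every writable mode can therefore look like this sketch (the helper name is made up):

```python
import h5py


def ensure_writable(h5_file: h5py.File) -> None:
    # h5py reports "r" for read-only and "r+" for every writable mode,
    # so comparing against "w" would never match
    if h5_file.mode == "r":
        raise ValueError("HDF5 file must be opened in a writable mode.")
```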
Pydantic v2 doesn't support external JSON libraries
Force-pushed from 13aeadd to ac8593f
This PR introduces classes in preparation for a big refactor, emphasizing reproducibility and, generally, "explicit is better than implicit". The eventual goal is to replace the currently defined LMDB datasets with HDF5 plus a fully specified schema: HDF5 is a better known and less problematic binary data format, and a shared schema means all datasets can use the same Python implementation while being fully documented and validated at runtime.
Uses `pydantic` schema definition and validation quite heavily, up to and including array shape validation in `DataSampleSchema`, and fully documented datasets with `DatasetSchema`. The new schema replaces `DataDict` and any ambiguity in the pipelines of what key maps to what. Because validation happens in `__getitem__` of the new dataset class, the code is significantly simpler for the data loading logic.
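As a rough illustration of the design described above, a minimal sketch in which a pydantic model validates each sample and the dataset's `__getitem__` performs that validation at access time. The class names, field names, and HDF5 layout here are assumptions, not the actual matsciml API:

```python
import h5py
import numpy as np
from pydantic import BaseModel, model_validator


class SampleSketch(BaseModel):
    """Simplified stand-in for DataSampleSchema."""

    model_config = {"arbitrary_types_allowed": True}

    atomic_numbers: np.ndarray
    cart_coords: np.ndarray

    @model_validator(mode="after")
    def check_atom_counts(self):
        # per-atom fields must agree on the number of atoms
        if len(self.atomic_numbers) != len(self.cart_coords):
            raise ValueError("per-atom fields disagree on the number of atoms")
        return self


class HDF5DatasetSketch:
    """Assumes one HDF5 group per sample; validation happens on access."""

    def __init__(self, path: str) -> None:
        self.h5 = h5py.File(path, "r")
        self.keys = sorted(self.h5.keys())

    def __len__(self) -> int:
        return len(self.keys)

    def __getitem__(self, index: int) -> SampleSketch:
        group = self.h5[self.keys[index]]
        # downstream code can rely on the returned sample being schema-valid
        return SampleSketch(
            atomic_numbers=group["atomic_numbers"][...],
            cart_coords=group["cart_coords"][...],
        )
```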