Data schema and HDF5 data support #322

Merged · 115 commits · Dec 13, 2024
Conversation

laserkelvin (Collaborator):

This PR introduces classes in preparation for a big refactor, emphasizing reproducibility and, generally, "explicit is better than implicit". The eventual goal is to replace the currently defined LMDB datasets with HDF5 plus a fully specified schema: HDF5 is a better-known and less problematic binary format, and a shared schema means all datasets can use the same Python implementation while being fully documented and validated at runtime.

  • Leans heavily on pydantic schema definition and validation, up to and including array shape validation in DataSampleSchema, and fully documented datasets with DatasetSchema (see the sketch after this list).
  • The schemas provide consistent field names, which should phase out DataDict and any ambiguity in the pipelines about which key maps to what.
  • As seen in __getitem__ of the new dataset class, the data loading logic is significantly simpler.
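To make the first bullet concrete, here is a minimal sketch of pydantic-based shape validation; the field names and validator below are illustrative stand-ins, not the actual DataSampleSchema definition from this PR.

```python
import numpy as np
from pydantic import BaseModel, ConfigDict, model_validator


class DataSampleSketch(BaseModel):
    """Hypothetical stand-in for DataSampleSchema (illustration only)."""

    model_config = ConfigDict(arbitrary_types_allowed=True)

    atomic_numbers: np.ndarray  # shape (num_atoms,)
    positions: np.ndarray       # Cartesian coordinates, shape (num_atoms, 3)

    @model_validator(mode="after")
    def check_shapes(self) -> "DataSampleSketch":
        # Consistency check after model creation: every per-atom field
        # must agree on the number of atoms.
        num_atoms = self.atomic_numbers.shape[0]
        if self.positions.shape != (num_atoms, 3):
            raise ValueError(
                f"positions has shape {self.positions.shape}, "
                f"expected ({num_atoms}, 3)"
            )
        return self


# Validates on construction; a (2, 3) positions array here would raise.
sample = DataSampleSketch(
    atomic_numbers=np.array([8, 1, 1]),
    positions=np.zeros((3, 3)),
)
```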

laserkelvin added the documentation, enhancement, data, dependencies, and code maintenance labels on Nov 25, 2024
melo-gonzo (Collaborator) left a comment:

Just a few minor comments for now. It would be nice to have some concrete documentation of how datasets should be created, but the tests cover a good amount of it. The docs can come as everything is migrated to this schema. Looks good overall!

a package or algorithm is used to compute this neighborhood
function.

The validation of this schema includes checking to ensure
melo-gonzo (Collaborator):

That the what?

laserkelvin (Author):

Finished in a0726c1


scf = "SCFCycle"
opt_trajectory = "OptimizationCycle"
property = "Property"
melo-gonzo (Collaborator):

Can this be something more specific, like CalculatedProperty?

laserkelvin (Author):

Addressed in 1fd2d88 - also added more categories so that it's specific
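
For illustration, a sketch of what more specific categories could look like; the member names below are hypothetical, not necessarily the ones added in 1fd2d88.

```python
from enum import Enum


class DataTypeSketch(Enum):
    """Hypothetical category enum; member names are illustrative only."""

    scf = "SCFCycle"
    opt_trajectory = "OptimizationCycle"
    calculated_property = "CalculatedProperty"  # replaces the generic "Property"
    measured_property = "MeasuredProperty"
```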


algorithm: Literal["pymatgen", "ase", "custom"]
allow_mismatch: bool
adaptive_cutoff: bool
algo_version: str | None = None
melo-gonzo (Collaborator):

This attribute could be confusing if you don't know that it's referring to the package version; algo_package_version would be more explicit.

laserkelvin (Author):

The thought was to have it map to either a hash or a package version if and when custom is supported, hence why it's a little ambiguous.
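
As a sketch of that intent (a hypothetical helper, not code from this PR): for the known packages the field can be filled from the installed version, falling back to a user-supplied hash for "custom".

```python
from importlib.metadata import version


def resolve_algo_version(algorithm: str, custom_hash: str | None = None) -> str | None:
    # Known packages: record the installed package version.
    if algorithm in ("pymatgen", "ase"):
        return version(algorithm)
    # "custom": fall back to a commit hash supplied by the user, if any.
    return custom_hash
```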

a consistency check after model creation to validate that per-atom
fields have the right number of atoms.

Parameters
melo-gonzo (Collaborator):

Does this need a frac_coords attribute also?

laserkelvin (Author):

I've done a bunch of refactoring here, and wrangled it so that fractional coordinates are computed if lattice parameters or a lattice matrix are available.
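
For reference, the underlying computation is the standard one (a sketch assuming a row-vector lattice convention, not the PR's exact code):

```python
import numpy as np


def to_fractional(cart_coords: np.ndarray, lattice: np.ndarray) -> np.ndarray:
    """Convert (N, 3) Cartesian coordinates to fractional coordinates,
    assuming the (3, 3) lattice matrix stores lattice vectors as rows."""
    return cart_coords @ np.linalg.inv(lattice)
```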

Instance of an ``Atoms`` object constructed with
the current data sample.
"""
return Atoms(
melo-gonzo (Collaborator):

pbc is commonly used in creating Atoms objects as well; should it be included?

laserkelvin (Author):

Addressed in ba79a34
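
For context, a sketch of an Atoms construction that includes pbc; the values here are illustrative, and the actual field mapping lives in the PR.

```python
import numpy as np
from ase import Atoms

atoms = Atoms(
    numbers=np.array([8, 1, 1]),  # atomic numbers per site
    positions=np.zeros((3, 3)),   # Cartesian coordinates, (N, 3)
    cell=np.eye(3) * 10.0,        # lattice matrix
    pbc=True,                     # periodic in all three directions
)
```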

If a sample of data already exists at the specified index while
overwrite is set to False.
"""
assert h5_file.mode != "r"
melo-gonzo (Collaborator):

Just a nitpick, but I feel like h5_file.mode == "w" would read better, unless you also expect "a" as well.

laserkelvin (Author):

Changed it to any writeable modes in 16c7bdc
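
One detail worth noting (a sketch, not the diff from 16c7bdc): once a file is open, h5py reports File.mode as only "r" or "r+" regardless of the mode string passed at open time, so read-only is the single case to reject.

```python
import h5py


def assert_writeable(h5_file: h5py.File) -> None:
    """Hypothetical helper: reject files that cannot be written to."""
    # h5py normalizes the reported mode: files opened with "w", "a",
    # "x", or "r+" all report "r+"; only read-only files report "r".
    if h5_file.mode == "r":
        raise ValueError("HDF5 file is opened read-only; cannot write samples.")
```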

Pydantic v2 doesn't support external JSON libraries
laserkelvin merged commit dc6a125 into IntelLabs:main on Dec 13, 2024
2 of 3 checks passed
laserkelvin deleted the hdf5-datasets branch on December 13, 2024 at 16:44