LMDB traversal cli #301

laserkelvin · 2024-10-03T17:28:40Z

This PR adds a quality-of-life (QoL) CLI that provides high-level functionality for inspecting LMDB datasets.

Adds a matsciml.datasets.lmdb_cli module, which houses a click-based interface with multiple commands that perform various LMDB inspection tasks.
Updates pyproject.toml to install lmdb_cli as a script, so that after installing matsciml the CLI is available by running lmdb_cli on the command line.
Adds accompanying documentation.

Signed-off-by: Kin Long Kelvin Lee <[email protected]>

melo-gonzo

Looks good! Not entirely sure why we would want to use a window size instead of just setting num_samples to something smaller. Made one comment for potential clean-up, but feel free to merge when ready.

melo-gonzo · 2024-10-03T17:57:35Z

matsciml/datasets/lmdb_cli.py

+    transforms = []
+    if periodic:
+        transforms.append(PeriodicPropertiesTransform(radius, adaptive_cutoff))
+    if graph_backend:
+        transforms.append(PointCloudToGraphTransform(graph_backend))
+    target_class = (
+        BaseLMDBDataset
+        if not dataset_type
+        else registry.get_dataset_class(dataset_type)
+    )


Could consolidate this common code block into its own function

Addressed with 16da9d8
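
For readers following the review thread, the suggested consolidation could look roughly like the sketch below. It is only an illustration: the helper name `resolve_dataset_setup`, its signature, and the stub classes and stub registry are assumptions made so the example runs without matsciml installed. The actual `_make_dataset` helper introduced in the follow-up commits may be shaped differently.

```python
from dataclasses import dataclass


# Stand-ins so the sketch runs without matsciml installed. They only mirror
# the positional arguments visible in the reviewed diff; the real classes
# live in matsciml and do much more.
@dataclass
class PeriodicPropertiesTransform:
    radius: float
    adaptive_cutoff: bool


@dataclass
class PointCloudToGraphTransform:
    backend: str


class BaseLMDBDataset:
    """Stub for the generic LMDB dataset class."""


class _StubRegistry:
    """Hypothetical stand-in for the registry object used in the diff."""

    def __init__(self):
        self._datasets = {}

    def get_dataset_class(self, name):
        return self._datasets[name]


registry = _StubRegistry()


def resolve_dataset_setup(dataset_type, periodic, radius, adaptive_cutoff, graph_backend):
    """Hypothetical helper: the shared block from the diff, written once.

    Returns the dataset class to instantiate and the list of transforms,
    so each CLI command can call this instead of repeating the logic.
    """
    transforms = []
    if periodic:
        transforms.append(PeriodicPropertiesTransform(radius, adaptive_cutoff))
    if graph_backend:
        transforms.append(PointCloudToGraphTransform(graph_backend))
    target_class = (
        BaseLMDBDataset
        if not dataset_type
        else registry.get_dataset_class(dataset_type)
    )
    return target_class, transforms


# Example call with made-up argument values.
target_class, transforms = resolve_dataset_setup(None, True, 6.0, True, "pyg")
```

Returning the class and the transforms, rather than a constructed dataset, keeps the sketch from guessing at the dataset constructor's arguments.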

laserkelvin · 2024-10-03T18:40:55Z

The window size is used by the running average: as you iterate through the dataset, it computes (by default) a running average of properties over the last 10 samples. That is different from capping the number of samples to go through, because you might want to sweep through the data and look for outliers.
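
To make the distinction concrete, here is a minimal sketch of a windowed running average next to a capped sample count. It is not the code from this PR: the function name `windowed_running_average` and the toy `energies` list are assumptions for illustration, and only the default window of 10 comes from the comment above.

```python
from collections import deque
from statistics import fmean
from typing import Iterable, Iterator


def windowed_running_average(
    values: Iterable[float], window_size: int = 10
) -> Iterator[float]:
    """Yield the mean of the most recent `window_size` values at each step."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    # A bounded deque silently drops the oldest value once it is full.
    window = deque(maxlen=window_size)
    for value in values:
        window.append(value)
        yield fmean(window)


# Toy data (made up): a flat property with a single outlier at index 20.
energies = [1.0] * 20 + [50.0] + [1.0] * 20

# Sweeping the whole dataset with a window makes the outlier show up as a
# local bump in the running average.
averages = list(windowed_running_average(energies, window_size=10))
peak_index = max(range(len(averages)), key=averages.__getitem__)

# Capping the number of samples at 10 never reaches the outlier at all.
capped_mean = fmean(energies[:10])
```

In this toy case the running average first peaks at the outlier's position, while the capped mean only reflects the first ten samples.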

laserkelvin added 9 commits October 3, 2024 08:39

feat: added initial lmdb cli module

2b52468

Signed-off-by: Kin Long Kelvin Lee <[email protected]>

feat: added interactive CLI mode

db37359

Signed-off-by: Kin Long Kelvin Lee <[email protected]>

feat: added lmdb_cli to installed scripts

7667d64

Signed-off-by: Kin Long Kelvin Lee <[email protected]>

feat: added single sample checking function

2357149

Signed-off-by: Kin Long Kelvin Lee <[email protected]>

feat: added running average statistic cli

b90ffaa

Signed-off-by: Kin Long Kelvin Lee <[email protected]>

docs: correcting docstring for running average

04d9125

Signed-off-by: Kin Long Kelvin Lee <[email protected]>

refactor: adding graph property accumulators for pyg

bcef226

Signed-off-by: Kin Long Kelvin Lee <[email protected]>

docs: added LMDB CLI description in datasets

0b9c5c7

Signed-off-by: Kin Long Kelvin Lee <[email protected]>

docs: mentioning lmdb cli in best practices

3def614

Signed-off-by: Kin Long Kelvin Lee <[email protected]>

laserkelvin added labels Oct 3, 2024: documentation (Improvements or additions to documentation), ux (User experience, quality of life changes), data (Issues related to data loading, pipelining, etc.)

laserkelvin requested a review from melo-gonzo October 3, 2024 17:28

melo-gonzo approved these changes Oct 3, 2024


laserkelvin added 2 commits October 3, 2024 11:44

refactor: moving common dataset creation into its own funcftion

577dd00

refactor: using _make_dataset instead now

16da9d8

laserkelvin merged commit 25969cc into IntelLabs:main Oct 3, 2024
2 of 3 checks passed

laserkelvin deleted the lmdb-traversal-cli branch October 3, 2024 18:52
