Cataloguing of individual experiments #274

Open
charles-turner-1 opened this issue Nov 27, 2024 · 16 comments · May be fixed by #317
Labels
enhancement New feature or request

Comments

@charles-turner-1
Collaborator

Is your feature request related to a problem? Please describe.

Follows from Building intake-esm datastores in the Payu repository.

To minimise friction & get users used to the intake catalogue system, it would be good to provide a utility to generate (or open) a catalogue for an experiment just by pointing to the path that contains the outputs.

Describe the feature you'd like

Functionality to:

  1. Point to the already known directory where an experiment output is saved & automatically build an ESM-Datastore for this experiment, if no catalogue is found there.
  2. Open the catalog found there, if there already exists a catalog in the directory.
  3. If outputs are moved off scratch, the catalog will break (the catalog.json file contains an internal reference to its location, so moving a catalog breaks it). In such cases, it would be nice to rebuild the catalog.
  4. Hooks to link this functionality to. I think this would in practice just look like an entrypoint that allows us to call this functionality from a bash script.

This might look something like the following:

# Generate a datastore for an experiment which hasn't been catalogued yet
$ generate-esm-datastore --builder AccessOm2Builder --experiment-dir $DIR 

Generating esm-datastore...
Datastore successfully written to $DIR/catalog.json!

# Try to generate a datastore for the experiment which we just catalogued
$ generate-esm-datastore --builder AccessOm2Builder --experiment-dir $DIR 

esm-datastore found in $DIR, verifying datastore integrity...
Datastore integrity verified, aborting build

# Move the experiment datastore & catalogue (ie. off scratch)
$ cp -r $DIR $NEWDIR && cd $NEWDIR
$ generate-esm-datastore --builder AccessOm2Builder --experiment-dir $NEWDIR 

esm-datastore found in $NEWDIR, verifying datastore integrity...
Datastore broken due to path inconsistency, regenerating datastore...
Datastore successfully written to $NEWDIR/catalog.json!
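
A minimal sketch of how such an entry point might be wired up with `argparse`. Only the command name and the `--builder` / `--experiment-dir` flags come from the example above; `build_parser`, `main`, the return codes, and the decision to leave the integrity check and build step as comments are assumptions for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags proposed in the example above."""
    parser = argparse.ArgumentParser(
        prog="generate-esm-datastore",
        description="Build or verify an ESM datastore for a single experiment.",
    )
    parser.add_argument(
        "--builder",
        default=None,
        help="Builder class name, e.g. AccessOm2Builder",
    )
    parser.add_argument(
        "--experiment-dir",
        type=Path,
        required=True,
        help="Directory containing the experiment outputs",
    )
    return parser


def main(argv=None) -> int:
    """Hypothetical entry point: returns 0 on success, 1 if no builder was given."""
    args = build_parser().parse_args(argv)
    catalog = args.experiment_dir / "catalog.json"
    if catalog.exists():
        print(f"esm-datastore found in {args.experiment_dir}, verifying datastore integrity...")
        # The integrity check and any rebuild are out of scope for this sketch.
        return 0
    if args.builder is None:
        print("No builder supplied")
        return 1
    print("Generating esm-datastore...")
    # A real implementation would invoke the named builder here.
    return 0
```

Exposing `main` as a console-script entry point would make it callable from a bash script, which is what point 4 above asks for.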

From within python, we could then have a convenience function that looks something like:

>>> from access_nri_intake import use_esm_datastore
>>> my_datastore_path = "/scratch/abc/xyz/etc/experiment_dir/"
# Generate a datastore for an experiment which hasn't been catalogued yet
>>> use_esm_datastore(experiment_dir = my_datastore_path)
Generating esm-datastore...
No builder supplied - please supply one of `AccessOm2Builder`,...

>>> esm_ds = use_esm_datastore(experiment_dir = my_datastore_path, builder = AccessOm2Builder)
Generating esm-datastore...
Datastore successfully written to /scratch/abc/xyz/etc/experiment_dir/catalog.json!

>>> esm_ds
$EXPERIMENT_NAME datastore with $X dataset(s) from $Y asset(s):

# Run it again on the same dir:
>>> esm_ds = use_esm_datastore(experiment_dir = my_datastore_path, builder = AccessOm2Builder)
esm-datastore found in /scratch/abc/xyz/etc/experiment_dir/, verifying datastore integrity...
Datastore integrity verified, aborting build

# Move it, run again
!cp -r $DIR $NEWDIR
>>> my_new_datastore_path = "/home/abc/xyz/etc/experiment_dir/"
>>> esm_ds = use_esm_datastore(experiment_dir = my_new_datastore_path, builder = AccessOm2Builder)
esm-datastore found in /home/abc/xyz/etc/experiment_dir/, verifying datastore integrity...
Datastore broken due to path inconsistency, regenerating datastore...
Datastore successfully written to /home/abc/xyz/etc/experiment_dir/catalog.json!

>>> esm_ds
$EXPERIMENT_NAME datastore with $X dataset(s) from $Y asset(s):
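
The build / open / rebuild decision illustrated above could be sketched roughly as follows. This assumes the catalog's internal reference to its location is stored under a `catalog_file` key in catalog.json, as either a plain path or a `file://` URL; `catalog_paths_consistent` and `decide_action` are hypothetical names, not part of access_nri_intake.

```python
import json
from pathlib import Path


def catalog_paths_consistent(catalog_json: Path) -> bool:
    """Return True if the CSV referenced by catalog.json sits next to it on disk."""
    spec = json.loads(catalog_json.read_text())
    referenced = spec.get("catalog_file")  # assumed key name
    if referenced is None:
        return False
    if referenced.startswith("file://"):
        referenced = referenced[len("file://"):]
    csv_path = Path(referenced)
    if not csv_path.is_absolute():
        csv_path = catalog_json.parent / csv_path
    # A copied catalog still points at the old directory, so the parents differ.
    same_dir = csv_path.resolve().parent == catalog_json.resolve().parent
    return csv_path.exists() and same_dir


def decide_action(experiment_dir: Path) -> str:
    """Return 'build', 'open' or 'rebuild' for the given experiment directory."""
    catalog_json = experiment_dir / "catalog.json"
    if not catalog_json.exists():
        return "build"
    if not catalog_paths_consistent(catalog_json):
        return "rebuild"
    return "open"
```

This only covers the path inconsistency case; it says nothing about whether the catalog indexes every output file.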

@chrisb13 @anton-seaice are you able to confirm this is the sort of functionality we're after?

@marc-white
Collaborator

I like this idea, with a few things related to the wider access-nri-intake-catalog ecosystem:

  • We need to be super-explicit to users that running generate-esm-datastore won't add the experiment to access-nri-intake-catalog
  • On the flip side, it would be cool if part of the output of generate-esm-datastore is the instructions for getting the experiment added to access-nri-intake-catalog, including auto-generating the lines that would need to be added to the config YAML
  • Users may also appreciate it if the output of generate-esm-datastore gives them the basic Python command they need to open the datastore
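
As a rough illustration of the three points above, the tool could print a message like the one assembled below. `post_build_message` is a hypothetical helper, and the YAML keys shown are placeholders rather than the real access-nri-intake-catalog config schema; `intake.open_esm_datastore` is the function intake-esm provides for opening a datastore from its JSON file.

```python
from pathlib import Path


def post_build_message(experiment_dir: str, builder_name: str) -> str:
    """Assemble the text shown to the user after a successful build."""
    catalog = Path(experiment_dir) / "catalog.json"
    lines = [
        f"Datastore successfully written to {catalog}!",
        "",
        "NOTE: this experiment has NOT been added to access-nri-intake-catalog.",
        "",
        "To open the datastore in Python:",
        "    import intake",
        f'    esm_ds = intake.open_esm_datastore("{catalog}")',
        "",
        "To request inclusion in access-nri-intake-catalog, lines along these",
        "lines would be needed in the config YAML (illustrative keys only):",
        f"    builder: {builder_name}",
        f"    path: {experiment_dir}",
    ]
    return "\n".join(lines)


print(post_build_message("/scratch/abc/xyz/etc/experiment_dir", "AccessOm2Builder"))
```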

@anton-seaice
Collaborator

Thanks @charles-turner-1

I think the builder can be determined automatically from the model field in metadata.yaml
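
A minimal sketch of that lookup, assuming metadata.yaml has already been parsed into a dict by a YAML library. The mapping table and `builder_name_from_metadata` are hypothetical; only `AccessOm2Builder` appears in this thread, and the `access-om3` entry is an assumed example of how the table would be extended.

```python
# Hypothetical mapping from the metadata "model" value to a builder class name.
MODEL_TO_BUILDER = {
    "access-om2": "AccessOm2Builder",
    "access-om3": "AccessOm3Builder",  # assumed name, shown for illustration
}


def builder_name_from_metadata(metadata: dict) -> str:
    """Pick a builder name from the 'model' field of parsed metadata."""
    model = metadata.get("model")
    if model is None:
        raise KeyError("metadata has no 'model' field")
    # Accept either a single model name or a list of names.
    candidates = [model] if isinstance(model, str) else list(model)
    for name in candidates:
        key = name.strip().lower()
        if key in MODEL_TO_BUILDER:
            return MODEL_TO_BUILDER[key]
    raise ValueError(f"no builder known for model(s): {candidates}")
```

If the lookup fails, the tool could fall back to asking the user for an explicit `--builder`.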

The logic about determining if you need a new intake-esm datastore or not may be as complicated as building a new datastore. For example, if the datastore is generated every time the model is run, then every time a model run is extended, the datastore ends up out of date.

@charles-turner-1
Collaborator Author

charles-turner-1 commented Dec 1, 2024

Cheers for the feedback @anton-seaice. Couple of things just to clarify:

I think the builder can be determined automatically from the model field in metadata.yaml

This would only be in the case of regenerating a datastore right - I'm assuming the metadata.yaml in this instance is the one that gets created as part of the catalog? I think it should be straightforward to implement regeneration without specifying a builder.

The logic about determining if you need a new intake-esm datastore or not may be as complicated as building a new datastore. For example, if the datastore is generated every time the model is run, then every time a model run is extended, the datastore ends up out of date.

I'm not quite sure I understand the logic here - if we regenerate the datastore every time the run is extended, then surely the datastore will stay up to date? Or is extending a model run different from the initial run in a way that makes this nontrivial?

@anton-seaice
Collaborator

This would only be in the case of regenerating a datastore right - I'm assuming the metadata.yaml in this instance is the one that gets created as part of the catalog?

It's made by payu as long as the option to make it is on.

https://payu.readthedocs.io/en/stable/usage.html#metadata-files

The logic about determining if you need a new intake-esm datastore or not may be as complicated as building a new datastore. For example, if the datastore is generated every time the model is run, then every time a model run is extended, the datastore ends up out of date.

I'm not quite sure I understand the logic here - if we regenerate the datastore every time the run is extended, then surely the datastore will stay up to date? Or is extending a model run different from the initial run in a way that makes this nontrivial?

I guess that may not be totally robust though - folks will use old payu versions or configurations which don't update the datastore?

i.e. in this case:

# Run it again on the same dir:
>>> esm_ds = use_esm_datastore(experiment_dir = my_datastore_path, builder = AccessOm2Builder)
esm-datastore found in /scratch/abc/xyz/etc/experiment_dir/, verifying datastore integrity...
Datastore integrity verified, aborting build

Are there feasible ways to confirm the integrity without re-making the whole datastore?

@charles-turner-1
Collaborator Author

Ahh, gotcha.

Since we're passing the outputs path to this utility function, I think it should be possible to run a subset of the datastore building pipeline in order to work out the expected time bounds. We should be able to work out if the catalog is only indexing a subset of the model outputs from this.

I suspect that this might be a bit slow to run if we naively index the whole thing, but we can probably make it fairly efficient if we put in some relevant information about how the outputs are structured.

Do you think that would address the issue?

@aidanheerdegen
Member

Since we're passing the outputs path to this utility function, I think it should be possible to run a subset of the datastore building pipeline in order to work out the expected time bounds. We should be able to work out if the catalog is only indexing a subset of the model outputs from this.

Would it be possible to do the step where it figures out what files would be indexed and compare against what is already indexed to know if it needs to be updated? If that was a pain then just using modification times, i.e. is the catalogue older than the newest outputs, might be a useful first step?
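
The first option could look roughly like this: list the files that would be indexed and compare them with the paths already recorded in the catalog. This sketch assumes the catalog is backed by an uncompressed CSV with a `path` column and that outputs match `*.nc`; both, along with the function names, are assumptions.

```python
from __future__ import annotations

import csv
from pathlib import Path


def indexed_paths(catalog_csv: Path, column: str = "path") -> set[str]:
    """Read the set of file paths already recorded in the catalog CSV."""
    with catalog_csv.open(newline="") as fh:
        return {row[column] for row in csv.DictReader(fh)}


def files_on_disk(experiment_dir: Path, pattern: str = "*.nc") -> set[str]:
    """Recursively list output files that a builder would be expected to index."""
    return {str(p) for p in experiment_dir.rglob(pattern)}


def catalog_diff(experiment_dir: Path, catalog_csv: Path) -> tuple[set[str], set[str]]:
    """Return (files not yet indexed, indexed files no longer on disk)."""
    indexed = indexed_paths(catalog_csv)
    on_disk = files_on_disk(experiment_dir)
    return on_disk - indexed, indexed - on_disk
```

If either returned set is non-empty, the datastore needs updating. Note that this only compares file names, so it would not notice a file whose contents changed in place.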

@charles-turner-1
Collaborator Author

Would it be possible to do the step where it figures out what files would be indexed and compare against what is already indexed to know if it needs to be updated?

Yeah, this is what I had in mind - I think it should be possible & relatively straightforward to implement.

If that was a pain then just using modification times, i.e. is the catalogue older than the newest outputs, might be a useful first step?

This strikes me as a better solution - it will certainly be faster for the user. I don't think there is any potential for false negatives here - the catalog scans & indexes files on disk, so it can only index files created prior to the catalog.

Unless there are complicated things going on behind the scenes with Payu, I think this is the way to go?
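
A minimal sketch of the modification-time check, assuming the outputs match `*.nc` and the catalog lives in catalog.json; `catalog_is_stale` is a hypothetical name.

```python
from pathlib import Path


def catalog_is_stale(experiment_dir: Path, catalog_json: Path, pattern: str = "*.nc") -> bool:
    """Return True if any output file is newer than the catalog file."""
    catalog_mtime = catalog_json.stat().st_mtime
    newest_output = max(
        (p.stat().st_mtime for p in experiment_dir.rglob(pattern)),
        default=None,  # no output files found
    )
    return newest_output is not None and newest_output > catalog_mtime
```

This only needs a `stat` call per file, so it should be much cheaper than re-running any part of the build.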

@chrisb13

chrisb13 commented Dec 2, 2024

@charles-turner-1, thanks for following this up. Looks good, I like that it has both a cli and python module interface. If it's not too hard, @marc-white's additions sound great too.

Based on our chats I think we are all thinking this, but it's not quite clear from the early post: we want the ability to get the archived path from payu if the user decides not to keep their data on scratch. I think there's still some clarity needed from our payu experts on how we choose to run this for some models and not others?

If that was a pain then just using modification times, i.e. is the catalogue older than the newest outputs, might be a useful first step?

This was my naive thinking of a solution but that's due to my naivety of how these catalogs are built!

@charles-turner-1
Collaborator Author

charles-turner-1 commented Dec 2, 2024

Yup, I think Marc's suggestions are great & would be pretty straightforward to implement.

Re. paths, I'm not sure I understand Payu well enough to say anything specific, but I think we should be able to make this feature path-agnostic internally, and just let the user/Payu pass a path to it. That way, Payu can build a default catalog (if desired?), and then if the user moves the data elsewhere, they can rebuild it straightforwardly using essentially the same tools, syntax, config, etc.

@chrisb13

chrisb13 commented Dec 2, 2024

Sounds great.

Also, this morning we were thinking it would be great if you could add the 'template for ACCESS-OM3 evaluation metrics' to this repo; discussed here. Actually, @anton-seaice , it's already been done!

Any feedback @anton-seaice? I think once the above is done, it would be good to update it and the cosima recipes.

@anton-seaice
Collaborator

Also, this morning we were thinking it would be great if you could add the 'template for ACCESS-OM3 evaluation metrics' to this repo; discussed here. Actually, @anton-seaice , it's already been done!

Any feedback @anton-seaice?

replied in ACCESS-NRI/access-eval-recipes#5

@anton-seaice
Collaborator

Would it be possible to do the step where it figures out what files would be indexed and compare against what is already indexed to know if it needs to be updated?

Yeah, this is what I had in mind - I think it should be possible & relatively straightforward to implement.

Sometimes people would re-run the model to add extra diagnostics. How slow would doing a checksum on every file be to catch this case?

If that was a pain then just using modification times, i.e. is the catalogue older than the newest outputs, might be a useful first step?

This strikes me as a better solution - it will certainly be faster for the user. I don't think there is any potential for false negatives here - the catalog scans & indexes files on disk, so it can only index files created prior to the catalog.

Unless there are complicated things going on behind the scenes with Payu, I think this is the way to go?

The question I have about this one is that sometimes people will touch every file in a folder on scratch to prevent it getting deleted, which might lead to a rebuild when it's not needed? This is probably fine really ...

@aidanheerdegen
Member

Sometimes people would re-run the model to add extra diagnostics. How slow would doing a checksum on every file be to catch this case?

Not awful, but even faster if something like binhash was used (the change detection hash payu uses in the manifest files).

However I don't think we want to encourage "re-running" a model. To my mind this would just be another experiment with its own catalogue. Re-running and overwriting an existing experiment seems kinda fraught and definitely isn't a supported mode of operation.
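
For reference, a cheap change-detection fingerprint in the same spirit could be sketched as below. This is not payu's binhash algorithm: it simply hashes the file size together with the first `max_bytes` of content, and `quick_fingerprint` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def quick_fingerprint(path: Path, max_bytes: int = 1 << 20) -> str:
    """Hash the file size plus the first max_bytes of the file's content."""
    digest = hashlib.sha256()
    digest.update(str(path.stat().st_size).encode())
    with path.open("rb") as fh:
        digest.update(fh.read(max_bytes))
    return digest.hexdigest()
```

Because only a bounded prefix of each file is read, the cost per file stays roughly constant however large the outputs are, at the price of missing changes that alter neither the size nor the leading bytes.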

@aidanheerdegen
Member

The question I have about this one is that sometimes people will touch every file in a folder on scratch to prevent it getting deleted, which might lead to a rebuild when it's not needed? This is probably fine really ...

Apart from this being against NCI guidelines, we really want to encourage users to utilise the excellent sync capabilities @jo-basevi has written for payu to automatically synchronise outputs to /g/data or similar. It is pretty much the only safe way of operating on gadi with the current /scratch purge policy.

@chrisb13

chrisb13 commented Dec 3, 2024

Side note, once this has been rolled out, would be worth mentioning on the hive docs: https://access-hive.org.au/models/run-a-model/run-access-om/#access-om2-outputs

@anton-seaice
Collaborator

To summarise my thoughts - just using file paths to determine intake-esm datastore correctness is probably ok. Doing something more robust would be better, and if binhash is fast, then this would protect against the source data files changing (even if that's not the intended / trained use case).

Projects
Status: Backlog

5 participants