Expanding submissions #133
Comments
One thing I'm not sure I like about this is that handling compute in this way introduces an orthogonal axis along which our sense of dataset state can change, yet one that is not coupled to a particular version bump. Said another way, we can have a dataset submitted with one spec that completes without issue, but then another spec expanding compute on the same dataset that fails in many cases. Is this version of the dataset "complete"? Is it "end-of-life"? It's no longer really clear to me, which makes it harder to reason about.
This is a tricky subject, but there are some constraints which may help resolve the proper solution. The first one that comes to mind is that we cannot modify a compute spec after it has been added. Therefore a new dataset will need to be created if compute specs are changed and the same name is wanted. This means that modifying a compute spec requires a dataset version bump. This is a general QCA design constraint, and something we cannot change. On our end of things, we should also ideally be stating that dataset versions are an indicator of meeting the promises defined by our standards. Version bumping should only be done if it either (a) fixes a bug in the molecules, or (b) is a required change to meet our standards.

For example (one may skip this paragraph if time is short), I would say adding a …

My takeaway here is that we should be versioning as a means of quality assurance rather than versioning for the sake of raw data deltas. It is important to keep track of provenance and to give OpenFF users the freedom to submit exploratory datasets/specs, but it is also nonproductive to try to maintain lists and track everything that goes into QCA, since that requires a human in the loop to fix issues. We should draw the line somewhere, and the standards should be that line. There are pros/cons to both ways of versioning, so this is my 2 cents.
I would agree that once the initial dataset specifications are complete (or maybe even only the default one), the dataset can be considered complete, and we should not expect all extra compute specifications to finish without error, as the dataset was not constructed with them in mind. This is something we are going to see with the benchmark ligand dataset when we start to run the ANI2x scans: some molecules contain Br and I, which are not included in the model, so those jobs will never complete.
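As a rough illustration of the kind of pre-flight check that could flag such cases, here is a minimal sketch using RDKit with a hypothetical SMILES list; the supported-element set reflects ANI-2x's published coverage (H, C, N, O, F, S, Cl), and everything else here is an assumption, not part of the submission tooling:

```python
# Minimal sketch: flag molecules whose elements fall outside ANI-2x coverage.
# The SMILES list is a hypothetical placeholder; in practice it would come
# from the dataset being expanded.
from rdkit import Chem

ANI2X_ELEMENTS = {"H", "C", "N", "O", "F", "S", "Cl"}

smiles_list = ["CCO", "CCBr", "c1ccccc1I"]  # hypothetical examples

for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue  # skip unparseable entries in this sketch
    elements = {atom.GetSymbol() for atom in mol.GetAtoms()}
    unsupported = elements - ANI2X_ELEMENTS
    if unsupported:
        print(f"{smi}: contains {sorted(unsupported)}, outside ANI-2x coverage")
```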
Excellent, thank you both for laying this out. @trevorgokey, your approach is clear to me. We should take advantage of our STANDARDS as the minimum foundation for what constitutes a dataset's versioning LIFECYCLE. Components beyond that minimum foundation, such as exploratory compute stacked after the fact, fall outside of that supported lifecycle. We provide a path for enabling this exploratory work, but the path isn't tightly coupled to our core provenance needs. @jthorton, thank you for this; ANI is a good example of where this permissive approach to compute stacking will help us maintain operational sanity while giving researchers the freedom to quickly explore a new compute method. I'll put together a PR with proposed additions to STANDARDS.md and LIFECYCLE.md making this stance clear.
This is related to #123, and is inspired by @jthorton's proposal in #131.
There are two axes (currently) on which submissions are likely to be expanded:
- Molecules (and conformers)
- Compute (specs: bases, methods)
How do we want to handle these cases? By their nature, they happen after a submission. We need a protocol for each, and these protocols should not conflict or cause a cascade of other problems downstream.
Expanding Molecules
Expanding molecules should be done with a new submission, with an incremented dataset version. A new, perhaps more descriptive name than that of the original submission is permissible.
Because this includes fundamentally new objects on which compute is performed, the submission should be done in a new directory in `qca-dataset-submission`. This allows us to track the point in time more accessibly than the git history alone.

The procedure for submission and lifecycle is identical to that for the original submission.
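For concreteness, a minimal sketch of what a version-bumped resubmission might look like on disk, assuming the repository's one-directory-per-submission convention; the directory names, file names, and JSON keys below are hypothetical illustrations, not a fixed scheme:

```python
# Minimal sketch: scaffold a new submission directory for a version-bumped
# dataset. All paths and keys are hypothetical illustrations.
import json
from pathlib import Path

old = Path("submissions/2020-03-01-Example-Torsions")           # original submission (hypothetical)
new = Path("submissions/2020-06-01-Example-Torsions-Expanded")  # new directory for the new molecules

new.mkdir(parents=True, exist_ok=True)

# Start from the original dataset description and bump the version in its name.
with (old / "dataset.json").open() as f:
    dataset = json.load(f)

dataset["dataset_name"] = "OpenFF Example Torsions v2.0"  # assumed key; incremented version in the name
# ... add the new molecules/conformers here before writing out ...

with (new / "dataset.json").open("w") as f:
    json.dump(dataset, f, indent=2)
```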
Expanding Compute
Expanding compute should be done with `compute*.json` files added to the same directory as the original submission.

No version bump should be made for expanding compute. Version bumps are reserved for changes in the dataset's objects on which compute is performed (e.g. molecules).
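As a rough illustration, an added compute spec might be drafted as below; the file name and spec fields are assumptions for the sketch, not a schema defined by this repository:

```python
# Minimal sketch: write an additional compute spec alongside the original
# submission. Field names and values are illustrative assumptions only.
import json
from pathlib import Path

submission_dir = Path("submissions/2020-03-01-Example-Torsions")  # hypothetical path

compute_spec = {
    "spec_name": "ani2x",      # assumed key; label for the new specification
    "method": "ani2x",
    "basis": None,             # ML potentials carry no basis set
    "program": "torchani",
    "spec_description": "ANI-2x compute stacked onto the existing dataset.",
}

with (submission_dir / "compute-ani2x.json").open("w") as f:
    json.dump(compute_spec, f, indent=2)
```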
The procedure for submission and lifecycle is identical to that for the original submission. Validation will be performed on any `compute*.json` files found in the PR. On merge, these will be submitted. Error cycling will be performed only on the compute artifacts corresponding to the specs represented in the `compute*.json` files included in the PR; their lifecycle will not be coupled to the dataset they correspond to.
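A sketch of how that per-PR validation step might look; the paths, function name, and required keys here are assumptions about a hypothetical check, not the repository's actual CI code:

```python
# Minimal sketch of a hypothetical validation pass over compute*.json files
# touched by a PR. The required keys are illustrative assumptions.
import json
import sys
from pathlib import Path

REQUIRED_KEYS = {"spec_name", "method", "program"}  # assumed minimal schema

def validate_compute_files(changed_dir: Path) -> list[str]:
    """Return human-readable problems found in compute*.json files."""
    problems = []
    for path in sorted(changed_dir.glob("compute*.json")):
        try:
            spec = json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            problems.append(f"{path}: not valid JSON ({exc})")
            continue
        missing = REQUIRED_KEYS - spec.keys()
        if missing:
            problems.append(f"{path}: missing keys {sorted(missing)}")
    return problems

if __name__ == "__main__":
    issues = validate_compute_files(Path(sys.argv[1]))
    for line in issues:
        print(line)
    sys.exit(1 if issues else 0)
```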