Expanding submissions #133

Open
dotsdl opened this issue Aug 17, 2020 · 4 comments

@dotsdl (Member) commented Aug 17, 2020

This is related to #123, and is inspired by @jthorton's proposal in #131.


There are currently two axes along which submissions are likely to be expanded:

  1. Molecules (and conformers)

  2. Compute specs (methods, bases)

How do we want to handle these cases? By their nature, they happen after a submission. We need a protocol for each, and these protocols should not conflict or cause a cascade of other problems downstream.

Expanding Molecules

  1. Expanding molecules should be done with a new submission, with an incremented dataset version. A new, perhaps more descriptive name than that of the original submission is permissible.

  2. Because this includes fundamentally new objects on which compute is performed, the submission should be done in a new directory in qca-dataset-submission. This lets us track the point in time of the expansion more accessibly than the git history alone (see the sketch after this list).

  3. The procedure for submission and lifecycle is identical to that for the original submission.
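As a rough illustration of this protocol, a molecule expansion might look like the following. This is a minimal sketch only: the dataset name, version field, and directory layout are hypothetical, not this repository's actual schema.

```python
# Hypothetical sketch of a molecule expansion: new submission, incremented
# dataset version, new directory. Names and schema are illustrative only.

import json
from pathlib import Path

original = {
    "dataset_name": "OpenFF Example Optimization Dataset",
    "dataset_version": 1,
}

# Expanding molecules => new submission with an incremented version; a more
# descriptive name than the original's is permissible.
expanded = dict(original)
expanded["dataset_version"] += 1
expanded["dataset_name"] += " - Expanded Conformers"

# A new directory keeps the point in time of the expansion visible without
# digging through git history.
new_dir = Path("submissions") / f"{expanded['dataset_name']} v{expanded['dataset_version']}"
new_dir.mkdir(parents=True, exist_ok=True)
(new_dir / "dataset.json").write_text(json.dumps(expanded, indent=2))
```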

Expanding Compute

  1. Expanding compute should be done with compute*.json files added to the same directory as the original submission.

  2. No version bump should be made for expanding compute. Version bumps are reserved for changes in the dataset's objects on which compute is performed (e.g. molecules).

  3. The procedure for submission and lifecycle is identical to that for the original submission. Validation will be performed on any compute*.json files found in the PR; on merge, these will be submitted. Error cycling will be performed only on the compute artifacts corresponding to the specs represented in those compute*.json files, and their lifecycle will not be coupled to that of the dataset they expand (a validation sketch follows below).
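To make the validation step concrete, here is a minimal sketch of how a CI job might discover and sanity-check compute*.json files in a submission directory. The required keys and the check itself are assumptions for illustration, not the repository's actual validation logic.

```python
# Hypothetical sketch of a CI validation step over compute*.json files in a
# submission directory; REQUIRED_KEYS and the check itself are assumptions.

import json
from pathlib import Path

REQUIRED_KEYS = {"dataset_name", "specifications"}  # assumed schema

def validate_compute_files(submission_dir: str) -> list[Path]:
    """Return the compute*.json files that pass a basic schema check."""
    valid = []
    for path in sorted(Path(submission_dir).glob("compute*.json")):
        payload = json.loads(path.read_text())
        missing = REQUIRED_KEYS - payload.keys()
        if missing:
            raise ValueError(f"{path}: missing keys {missing}")
        valid.append(path)
    return valid

# On merge, each validated file would be submitted and its specs error-cycled
# independently of the parent dataset's lifecycle.
```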

@dotsdl (Member, Author) commented Aug 17, 2020

One thing I'm not sure I like about this is that handling compute this way introduces an orthogonal axis along which our sense of dataset state can change, but one that is not coupled to a particular version bump.

Said another way, we can have a dataset submitted with one spec that completes without issue, and then another spec expanding compute on the same dataset that fails in many cases. Is this version of the dataset "complete"? Is it "end-of-life"? It's no longer really clear to me, which makes it harder to reason about.
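To illustrate the ambiguity, with entirely made-up spec names and states:

```python
# Made-up spec states for one dataset version; illustrative only.
SPEC_STATES = {
    "default":      "COMPLETE",  # original submission finished cleanly
    "stacked-spec": "ERRORED",   # compute expanded later, failing in many cases
}

# If dataset state is the conjunction of all spec states, stacking compute can
# silently flip a "complete" dataset back to incomplete after the fact.
dataset_complete = all(state == "COMPLETE" for state in SPEC_STATES.values())
print(dataset_complete)  # False, even though the original submission succeeded
```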

@trevorgokey (Collaborator):

This is a tricky subject, but there are some constraints that may help us converge on the proper solution. The first that comes to mind is that we cannot modify a compute spec after it has been added: if compute specs are changed and the same name is wanted, a new dataset will need to be created. This means that modifying a compute spec requires a dataset version bump. This is a general QCA design constraint, and not something we can change.

On our end of things, we should ideally also be treating dataset versions as an indicator of whether the promises defined by our standards are met. Version bumping should only be done if the change a) fixes a bug in the molecules, or b) is required to meet our standards (see the sketch below).
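A minimal sketch of this rule, assuming a hypothetical Change type just to make the decision concrete:

```python
# Hypothetical decision rule for version bumps; the Change type and its
# fields are illustrative, not part of any existing tooling.

from dataclasses import dataclass

@dataclass
class Change:
    fixes_molecule_bug: bool     # (a) fixes a bug in the dataset's molecules
    required_by_standards: bool  # (b) required to meet our standards

def needs_version_bump(change: Change) -> bool:
    return change.fixes_molecule_bug or change.required_by_standards

# Adding an optional spec (e.g. a gaff spec not named in the standards)
# would not trigger a bump:
print(needs_version_bump(Change(False, False)))  # False
```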

For example (one may skip this paragraph if time is short), I would say a gaff spec is optional because it is not outlined in the standards, so no version bump is needed to add one. We may choose to track it in the lifecycle, but we should consider the dataset "complete" only if our standard specs are complete. If a user finds that the gaff spec has a bug and needs to be changed, they are free to make a new compute spec with a new name; no version bump needed. On the other hand, as we work out the discrepancy between constrained and unconstrained openff compute specs, we may indicate in our standards that only unconstrained offxmls are to be accepted, and therefore that a spec named openff-1.0.0 should indicate the unconstrained FF. In that case a version bump would be needed wherever this standard was not met. This specific case will be triggered if/once we require all datasets to have an openff spec added by default. Right now I would argue that all MM specs are optional, but it would definitely behoove us to lay some groundwork to make this required in the near future.

My takeaway here is that we should be versioning as a means of quality assurance rather than versioning for the sake of raw data deltas. It is important to keep track of provenance and to give OpenFF users the freedom to submit exploratory datasets/specs, but it is also unproductive to try to maintain lists and track everything that goes into QCA, since that requires a human in the loop to fix issues. We should draw the line somewhere, and the standards should be that line. There are pros/cons to both ways of versioning, so this is my 2 cents.

@jthorton (Collaborator):

I would agree that once the initial dataset specifications are complete (or maybe even only the default spec), the dataset can be considered complete; we should not expect all extra compute specifications to finish without error, as the dataset was not constructed with them in mind. This is something we are going to see with the benchmark ligand dataset when we start to run the ANI2x scans: some molecules contain Br and I, which are not included in the model, so those jobs will never complete.
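As a rough sketch of the screen we'd want here (the molecule records below are made up; the element coverage of ANI2x, namely H, C, N, O, F, Cl, and S, is the real constraint):

```python
# Screening for molecules outside ANI2x's element coverage; the molecule
# records below are made up for illustration.

ANI2X_ELEMENTS = {"H", "C", "N", "O", "F", "Cl", "S"}

molecules = [
    ("ligand-a", {"C", "H", "N", "O"}),
    ("ligand-b", {"C", "H", "Br"}),  # Br is outside the model
    ("ligand-c", {"C", "H", "I"}),   # so is I
]

unsupported = [name for name, elements in molecules
               if not elements <= ANI2X_ELEMENTS]
print(unsupported)  # ['ligand-b', 'ligand-c']: these jobs would never complete
```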

@dotsdl (Member, Author) commented Aug 19, 2020

Excellent, thank you both for laying this out. @trevorgokey, your approach is clear to me. We should take advantage of our STANDARDS as the minimum foundation for what constitutes a dataset's versioning LIFECYCLE. Components beyond that minimum foundation, such as exploratory compute stacked after the fact, fall outside of that supported lifecycle. We provide a path for enabling this exploratory work, but the path isn't tightly coupled to our core provenance needs.

@jthorton thank you for this; ANI is a good example of where this permissive approach to compute stacking will help us maintain operational sanity while giving researchers the freedom to quickly explore a new compute method.

I'll put together a PR with proposed additions to STANDARDS.md and LIFECYCLE.md making this stance clear.
