Expanding submissions #133

Open
dotsdl opened this issue Aug 17, 2020 · 4 comments

@dotsdl (Member) commented Aug 17, 2020

This is related to #123, and is inspired by @jthorton's proposal in #131.


There are currently two axes along which submissions are likely to be expanded:

  1. Molecules (and conformers)

  2. Compute specs (methods, bases)

How do we want to handle these cases? By their nature, they happen after a submission. We need a protocol for each, and these protocols should not conflict or cause a cascade of other problems downstream.

Expanding Molecules

  1. Expanding molecules should be done with a new submission, with an incremented dataset version. A new, perhaps more descriptive name than that of the original submission is permissible.

  2. Because this includes fundamentally new objects on which compute is performed, the submission should be done in a new directory in qca-dataset-submission. This lets us track the point in time of the expansion more accessibly than the git history alone (see the sketch after this list).

  3. The procedure for submission and lifecycle is identical to that for the original submission.
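As a rough illustration of this protocol, a molecule expansion might look like the following. This is a minimal sketch only: the dataset name, version field, and directory layout are hypothetical, not this repository's actual schema.

```python
# Hypothetical sketch of a molecule expansion: new submission, incremented
# dataset version, new directory. Names and schema are illustrative only.

import json
from pathlib import Path

original = {
    "dataset_name": "OpenFF Example Optimization Dataset",
    "dataset_version": 1,
}

# Expanding molecules => new submission with an incremented version; a more
# descriptive name than the original's is permissible.
expanded = dict(original)
expanded["dataset_version"] += 1
expanded["dataset_name"] += " - Expanded Conformers"

# A new directory keeps the point in time of the expansion visible without
# digging through git history.
new_dir = Path("submissions") / f"{expanded['dataset_name']} v{expanded['dataset_version']}"
new_dir.mkdir(parents=True, exist_ok=True)
(new_dir / "dataset.json").write_text(json.dumps(expanded, indent=2))
```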

Expanding Compute

  1. Expanding compute should be done with compute*.json files added to the same directory as the original submission.

  2. No version bump should be made for expanding compute. Version bumps are reserved for changes in the dataset's objects on which compute is performed (e.g. molecules).

  3. The procedure for submission and lifecycle is identical to that for the original submission. Validation will be performed on any compute*.json files found in the PR; on merge, these will be submitted. Error cycling will be performed only on the compute artifacts corresponding to the specs represented in those compute*.json files, and their lifecycle will not be coupled to that of the dataset they expand (a validation sketch follows below).
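To make the validation step concrete, here is a minimal sketch of how a CI job might discover and sanity-check compute*.json files in a submission directory. The required keys and the check itself are assumptions for illustration, not the repository's actual validation logic.

```python
# Hypothetical sketch of a CI validation step over compute*.json files in a
# submission directory; REQUIRED_KEYS and the check itself are assumptions.

import json
from pathlib import Path

REQUIRED_KEYS = {"dataset_name", "specifications"}  # assumed schema

def validate_compute_files(submission_dir: str) -> list[Path]:
    """Return the compute*.json files that pass a basic schema check."""
    valid = []
    for path in sorted(Path(submission_dir).glob("compute*.json")):
        payload = json.loads(path.read_text())
        missing = REQUIRED_KEYS - payload.keys()
        if missing:
            raise ValueError(f"{path}: missing keys {missing}")
        valid.append(path)
    return valid

# On merge, each validated file would be submitted and its specs error-cycled
# independently of the parent dataset's lifecycle.
```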

@dotsdl (Member, Author) commented Aug 17, 2020

One thing I'm not sure I like about this is that handling compute this way introduces an orthogonal axis along which our sense of dataset state can change, but one that is not coupled to a particular version bump.

Said another way, we can have a dataset submitted with one spec that completes without issue, and then another spec expanding compute on the same dataset that fails in many cases. Is this version of the dataset "complete"? Is it "end-of-life"? It's no longer really clear to me, which makes it harder to reason about.
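To illustrate the ambiguity, with entirely made-up spec names and states:

```python
# Made-up spec states for one dataset version; illustrative only.
SPEC_STATES = {
    "default":      "COMPLETE",  # original submission finished cleanly
    "stacked-spec": "ERRORED",   # compute expanded later, failing in many cases
}

# If dataset state is the conjunction of all spec states, stacking compute can
# silently flip a "complete" dataset back to incomplete after the fact.
dataset_complete = all(state == "COMPLETE" for state in SPEC_STATES.values())
print(dataset_complete)  # False, even though the original submission succeeded
```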

@trevorgokey (Collaborator):

This is a tricky subject, but there are some constraints that may help us converge on the proper solution. The first that comes to mind is that we cannot modify a compute spec after it has been added: if compute specs are changed and the same name is wanted, a new dataset will need to be created. This means that modifying a compute spec requires a dataset version bump. This is a general QCA design constraint, and not something we can change.

On our end of things, we should ideally also be treating dataset versions as an indicator of whether the promises defined by our standards are met. Version bumping should only be done if the change a) fixes a bug in the molecules, or b) is required to meet our standards (see the sketch below).
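A minimal sketch of this rule, assuming a hypothetical Change type just to make the decision concrete:

```python
# Hypothetical decision rule for version bumps; the Change type and its
# fields are illustrative, not part of any existing tooling.

from dataclasses import dataclass

@dataclass
class Change:
    fixes_molecule_bug: bool     # (a) fixes a bug in the dataset's molecules
    required_by_standards: bool  # (b) required to meet our standards

def needs_version_bump(change: Change) -> bool:
    return change.fixes_molecule_bug or change.required_by_standards

# Adding an optional spec (e.g. a gaff spec not named in the standards)
# would not trigger a bump:
print(needs_version_bump(Change(False, False)))  # False
```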

For example (one may skip this paragraph if time is short), I would say a gaff spec is optional because it is not outlined in the standards, so no version bump is needed to add one. We may choose to track it in the lifecycle, but we should consider the dataset "complete" only if our standard specs are complete. If a user finds that the gaff spec has a bug and needs to be changed, they are free to make a new compute spec with a new name; no version bump needed. On the other hand, as we work out the discrepancy between constrained and unconstrained openff compute specs, we may indicate in our standards that only unconstrained offxmls are to be accepted, and therefore that a spec named openff-1.0.0 should indicate the unconstrained FF. In that case a version bump would be needed wherever this standard was not met. This specific case will be triggered if/once we require all datasets to have an openff spec added by default. Right now I would argue that all MM specs are optional, but it would definitely behoove us to lay some groundwork to make this required in the near future.

My takeaway here is that we should be versioning as a means of quality assurance rather than versioning for the sake of raw data deltas. It is important to keep track of provenance and to give OpenFF users the freedom to submit exploratory datasets/specs, but it is also unproductive to try to maintain lists and track everything that goes into QCA, since that requires a human in the loop to fix issues. We should draw the line somewhere, and the standards should be that line. There are pros/cons to both ways of versioning, so this is my 2 cents.

@jthorton (Collaborator):

I would agree that once the initial dataset specifications are complete (or maybe even only the default spec), the dataset can be considered complete; we should not expect all extra compute specifications to finish without error, as the dataset was not constructed with them in mind. This is something we are going to see with the benchmark ligand dataset when we start to run the ANI2x scans: some molecules contain Br and I, which are not included in the model, so those jobs will never complete.
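As a rough sketch of the screen we'd want here (the molecule records below are made up; the element coverage of ANI2x, namely H, C, N, O, F, Cl, and S, is the real constraint):

```python
# Screening for molecules outside ANI2x's element coverage; the molecule
# records below are made up for illustration.

ANI2X_ELEMENTS = {"H", "C", "N", "O", "F", "Cl", "S"}

molecules = [
    ("ligand-a", {"C", "H", "N", "O"}),
    ("ligand-b", {"C", "H", "Br"}),  # Br is outside the model
    ("ligand-c", {"C", "H", "I"}),   # so is I
]

unsupported = [name for name, elements in molecules
               if not elements <= ANI2X_ELEMENTS]
print(unsupported)  # ['ligand-b', 'ligand-c']: these jobs would never complete
```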

@dotsdl (Member, Author) commented Aug 19, 2020

Excellent, thank you both for laying this out. @trevorgokey, your approach is clear to me. We should take advantage of our STANDARDS as the minimum foundation for what constitutes a dataset's versioning LIFECYCLE. Components beyond that minimum foundation, such as exploratory compute stacked after the fact, fall outside of that supported lifecycle. We provide a path for enabling this exploratory work, but the path isn't tightly coupled to our core provenance needs.

@jthorton thank you for this; ANI is a good example of where this permissive approach to compute stacking will help us maintain operational sanity while giving researchers the freedom to quickly explore a new compute method.

I'll put together a PR with proposed additions to STANDARDS.md and LIFECYCLE.md making this stance clear.
