Schema versioning and deployment proposal #787

mferrera opened this issue Sep 19, 2024

Currently every change we make to the schema incurs a high risk of service interruption because we do not have a fully automated, consistent, and reproducible deployment regime.

From the fmu-dataio perspective, deployment should ideally look like this:

  • fmu-dataio 3.0.0 represents and produces version 3.0.0 of the schema
  • fmu-dataio 3.1.2 represents and produces version 3.0.0 of the schema
  • A new optional field is added
    • fmu-dataio 3.2.0 represents and produces version 3.1.0 of the schema

This means that the fmu-dataio version and the schema version should be decoupled, as sketched after the list below. A decoupled schema version has several advantages:

  • It tracks the evolution of how data is produced
  • It is more easily auditable if some version of dataio begins producing data differently
  • It is consistent
  • It is reproducible
  • It provides backward compatibility by always allowing validation against an existing schema, even if that schema is not the latest
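
As a sketch, this decoupling could look like the following inside fmu-dataio. The module layout, constant names, and URL are assumptions for illustration only, not the actual code:

```python
# Hypothetical fmu-dataio module; names and URL are illustrative only.
# The package version and the schema version evolve independently.

__version__ = "3.1.2"     # fmu-dataio package version
SCHEMA_VERSION = "3.0.0"  # fmu_results schema version this release produces

def schema_url(base: str = "https://example.radix.equinor.com/schemas") -> str:
    """Return the URL of the schema this release produces metadata for."""
    return f"{base}/{SCHEMA_VERSION}/fmu_results.json"
```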

Schema versioning

The schema is already versioned using semantic versioning. This gives every schema version a number of the form X.Y.Z, where X is the major version, Y is the minor version, and Z is the patch version.

Schema version numbers change when a schema update is made. When deciding what version a changed schema should become, the primary concern is whether or not the change is backward compatible. Backwards compatibility is broken if metadata generated for, and valid against, a previous version is invalid against the updated version.
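
This criterion can be checked mechanically. A minimal sketch of such a check using the jsonschema package; the file paths are placeholders:

```python
import json

import jsonschema

# Placeholder paths; one possible deployed layout is discussed under
# "Deployment" below.
with open("schemas/3.0.0/fmu_results.json") as f:
    previous_schema = json.load(f)
with open("schemas/3.1.0/fmu_results.json") as f:
    updated_schema = json.load(f)

# Metadata known to be valid against the previous schema version.
with open("some_existing_metadata.json") as f:
    metadata = json.load(f)

jsonschema.validate(metadata, previous_schema)  # passes by assumption
try:
    jsonschema.validate(metadata, updated_schema)
except jsonschema.ValidationError:
    print("Backwards compatibility is broken: major version change")
else:
    print("Still valid: minor or patch version change")
```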

Therefore, schema version numbers should change as follows:

Major

Any schema change that breaks backwards compatibility with metadata created using the previous version. These scenarios are candidates for a major version change:

  • Adding a required field
  • Removing a required or optional field
  • Moving an optional field to a required field
  • Changing the name of a field
  • Changing the type of a field (e.g. number to string)
  • Removing a value from a controlled vocabulary (e.g. 'OWC' is no longer a valid contact [unlikely, but an example!])
  • Adding a regular expression to a field
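
For instance, adding a required field invalidates every existing metadata document that lacks it. A minimal sketch with a toy schema (not the real fmu_results schema):

```python
import jsonschema

old_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}
# Major change: a new required field "depth" is added.
new_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "depth": {"type": "number"}},
    "required": ["name", "depth"],
}

old_metadata = {"name": "surface_a"}
jsonschema.validate(old_metadata, old_schema)  # passes
try:
    jsonschema.validate(old_metadata, new_schema)
except jsonschema.ValidationError as err:
    print(f"breaking change: {err.message}")  # 'depth' is a required property
```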

Minor

Any schema change that preserves backwards compatibility with metadata created using the previous version. These scenarios are candidates for a minor version change:

  • Adding an optional field
  • Making a required field optional
  • Changing a field from a controlled vocabulary to free text without changing the field type
  • Removing a regular expression from a field
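
By contrast, adding an optional field leaves all existing metadata valid. Continuing the same toy example:

```python
import jsonschema

old_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}
# Minor change: "depth" is added but stays optional.
new_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "depth": {"type": "number"}},
    "required": ["name"],
}

old_metadata = {"name": "surface_a"}
jsonschema.validate(old_metadata, old_schema)  # passes
jsonschema.validate(old_metadata, new_schema)  # still passes: no breakage
```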

Patch

Any change to auxiliary information that does not affect the structure or semantics of the schema itself, as well as any bug fixes to the schema. These scenarios are candidates for a patch version change:

  • Adding or updating the field description to improve readability
  • Adding or updating the field example, comment, or user-friendly name
  • Extending a controlled vocabulary enumeration
  • Fixing an incorrect regular expression

Initial impact

  • Sumo will need to reference the schema url from the metadata.

This should be the only initial impact. In practical terms, nothing else changes except that the schema version number will tick up according to the versioning conditions above. As long as we continue to make all changes backward compatible, i.e. we continue to work toward a version 1.0.0 of the schema, nothing changes from the consumer perspective except that they gain metadata about the metadata to tie the ongoing changes to.
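
For illustration, each metadata document would then carry a pointer to the exact schema it was produced against, roughly like this (the key names, URL, and host are placeholders, not the actual deployed layout):

```python
# Illustrative metadata fragment; "$schema" key, URL, and host are
# placeholders for whatever the deployment ends up using.
metadata = {
    "$schema": "https://example.radix.equinor.com/schemas/3.0.0/fmu_results.json",
    "version": "3.0.0",
    # ... the rest of the fmu_results metadata ...
}
```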

Deployment

  • fmu-dataio 3.2.0 is released
  • Its schema is deployed to radix as schemas/3.1.0/fmu_results.json, or schemas/fmu_results-3.1.0.json
  • This schema exists as a real file always committed to this repository (?)
    • We could start generating these for radix by checking out every version tag and writing it... but that is probably less ideal
  • All metadata produced with the schema is self-referential, i.e. it points to the schema that produced it and that can validate it
  • fmu-dataio is now staged for release to Komodo + RMS
    • Each Komodo version points to a distinct RMS version that contains the same fmu-dataio version (in progress!)
    • Metadata should now be consistent and reproducible between the RMS and Komodo versions, 1 to 1
  • When uploaded to Sumo, Sumo should validate metadata against the schema url referenced within the metadata
  • Consumers can also reference this as needed
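
The last two points could reduce to a straightforward fetch-and-validate on the Sumo side. A minimal sketch using the requests and jsonschema packages; the function name and the "$schema" key are assumptions:

```python
import jsonschema
import requests

def validate_against_referenced_schema(metadata: dict) -> None:
    """Fetch the schema a metadata document points to and validate against it."""
    response = requests.get(metadata["$schema"], timeout=30)
    response.raise_for_status()
    schema = response.json()
    jsonschema.validate(metadata, schema)  # raises ValidationError if invalid
```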

Or, fmu-schemas

Another, possibly better, solution is to host schema updates statically in their own repository, e.g. fmu-schemas, as it could become cumbersome to keep stacking them in this one.

Open questions

  • How does this affect consumers and their expectations about what exists in metadata?
    • Suppose fmu-dataio 3.0.0 adds spec.num_rows
    • ConsumerA wants to display this property
    • Is this pattern fine?

      ```python
      from packaging import version

      metadata = get_metadata()  # hypothetical consumer helper
      if version.parse(metadata.version) >= version.parse("3.0.0"):
          do_something_with(metadata.spec.num_rows)
      ```
  • These sorts of version expectations are burdensome for consumers, but they offer consistent, long-term guarantees. I.e. once version 3.0.0 is released, no version prior to it can possibly have spec.num_rows, so logic built to handle this can persist long-term.
    • However, if we are inconsiderate with our changes this can lead to a miasma of spaghetti conditionals for consumers to handle. We would therefore need a sensible strategy for attaching metadata changes to a version
    • A sensible strategy is bundling them into major versions. This makes sense from a semantic versioning perspective and also keeps version checking simple; it would become cumbersome if version 3.1.0 added spec.num_columns and version 3.2.1 added spec.num_awesome_columns
  • Despite these hurdles, even if some extra conditionals are added, it gives consumers predictive power: they can tie functionality to something concrete rather than trying to infer it or deal with optional patterns like

    ```python
    metadata = get_metadata()  # hypothetical consumer helper
    if hasattr(metadata.spec, "num_rows"):
        do_something_with(metadata.spec.num_rows)
    ```
  • There are of course a number of possible issues not yet contained here