Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(More) General spec comments #36

Open
jsgf opened this issue Jan 25, 2023 · 1 comment
Open

(More) General spec comments #36

jsgf opened this issue Jan 25, 2023 · 1 comment
Labels
s-needs-info Status: Further information is requested

Comments

@jsgf
Copy link
Contributor

jsgf commented Jan 25, 2023

Scope

manner of embedding identifiers for ADGs, and other related data structures in artifacts of various types
This is not covered here, other than a couple of mentions of its necessity. I think the specific details of this are out of scope for the main part of the spec. It should be covered in the annexes or (eventually) related specs.

guidance
guidance

I generally find it cleaner to have a spec which contains all the mandatory aspects, and a second related document describing how to use the spec in practice.

Definitions

In "Derived artifacts"

An artifact output from a build tool based on artifiacts input into the build tool. Such an artifact is said to be 'derived' from the input artifacts to the build tool.

The wording here strikes me as odd; it uses the noun "build tool" as a stand-in for the verb "build action" or "build process" which is performed by the build tool. We want to be very clear that the "inputs" here are consumed by the build tool as part of the action (verb), and are not inputs used to create the build tool (noun) itself. (Esp since in principle the build tool could itself be a derived artifact.)

In "Leaf artifacts", do we generally want to define these as any artifact which are created out of the scope of the omnibom-instrumented build process? Hand-written source files fall into this class. But what about things like prebuilt binaries, system libraries, etc? Source files/headers provided by the development environment (which could be machine generated)?

In "Artifact Identifiers" (definitions), should it clarify each identifier has a discrete type? So for "canonical" to work, the parties have to be generating identifiers of the same type. Are artifact identifiers required to be fixed size? Could you, as an extreme, use the literal artifact content as an identifier? It would meet the requirements.

In "Artifact Dependency Graph (ADG)" is "recursive" needed? What would a non-recursive DAG be?

It includes the direct input artifacts, and the recursive set of input artifacts to each input artifact, all the way down the graph.

I'm not sure what this sentence is saying. I wonder if it might be better to describe this inductively - each node in the graph represents an artifact and the reference to the inputs used to create it, and the whole graph is composed of such nodes to form the whole set of dependencies to produce some final artifact. A key property is that given a graph, you can extend it by adding new nodes, or conversely, you can remove nodes which are not the inputs to any other node (ie subgraphs represent the build dependencies of intermediate derived artifacts).

Artifact Dependency Graph (ADG) singularity

I wonder about the use of the word "singularity" here - I wonder if something like "convergence" might express this idea better.

An artifact should have precisely one Artifact Dependency Graph (ADG). All equivalent artifacts should have the same Artifact Dependency Graph (ADG).

Should we point out that this requires the artifacts to embed the omnibor document, otherwise there's no way to guarantee this? And as a consequence, all file formats involved in a build must have a way of embedded some kind of non-functional content within them.

"Build tools" - followup from comment above, the tool does nothing on its own, but implements a build action with a specific set of inputs to create a specific output. Is it required to be deterministic? Ie, the same tool with the same inputs will always generate the same artifact. Seems like it would be idea, but tools have a distressing habit of including things like timestamps and commit ids into their outputs.

Specifications

"Artifact ID" - is this the same as "Artifact Identifier"? If so, we should avoid the use of abbreviations. Also should we have the same entity both "defined" and "specified"? I would assume that "definitions" are just to give defined meanings to terms which are later used in the specification.

This doesn't ever specify how identifiers are represented in general. Do they include their type? Are they binary? Ascii? Hex encoded? uuencode?

In "Artifact Identifier Types" it talks about git oids, and has a list of oid prefixes which represent types. Does this mean that all "artifact identifier types" are gid oids? Could there be non-gid-oid identifier types? If there are mandatory identifier types, does mean they're the only ones allowed or it can be extended? Ie, is that an open or closed list?

This approach is anticipated to extend gracefully as git adopts new hash types in the future.

This seems very much like an open question.

OmniBOR Identifier

Perhaps "OmniBOR Document Identifier" would be clearer, since all identifiers within the OmniBOR spec could be interpreted as "OmniBOR Identifiers".

A OmniBOR Document describes the immediate children of an artifact in the Artifact Dependency Graph (ADG).

What does "children" mean in the context of a graph. Since the graph only represents inputs, it implies that children are the inputs, but typically children are the outputs of a process. I think something like "immediate inputs of the build process" would be clearer.

A OmniBOR Document utilizes precisely one Artifact Identifier Type.

Perhaps "All identifiers within a single omnibor document are of the same type. The document itself may only be referenced with identifiers of that same type". Ie, it's not valid to reference a sha1 document with a sha256 id (which I think is stronger than the constraint that "omnibor document identifier" covers).

OmniBOR Document Child Records

As above, should use "input" not "child".

blob⎵${artifact id of child}⎵bom⎵${OmniBOR identifier of child's OmniBOR Document}\n

Is this technically redundant, since the child artifact already has its omnibor document embedded within it? Is the bom field just there to save tooling from working out how to extract the document. Should this be bor rather than bom?

OmniBOR Document Character Encoding

I think this should come first, before we start talking about any specific layouts. Also mention that all whitespace is rigid (ie, literal space not tab, no runs of whitespace).

OmniBOR Identifier Embedding

Each build tool should embed into the output artifact a new line terminated, lexically ordered, list of OmniBOR identifiers for each mandatory Artifact Identifier Type in a manner:

Does this mean that just the document identifier is embedded, or the whole document? What is the "new line terminated, lexically ordered list" referring to? If its multiple document ids, then is that constraining whatever embedding mechanism to allow an arbitrarily sized list of document identifiers to be embedded contiguously? Doesn't this, for example, make the Annex B ELF note embedding in conflict with this?

What is the identifier representation being ordered? Does it include the identifier type, or is it just in the order of the raw hash? Are identifiers hex-encoded in ascii or raw binary?

Is embedding the entire document also OK?

Is it required that the embedded info also be extractable, or is it enough to embed just enough to meet the uniqueness requirements of "Artifact Dependency Graph (ADG) singularity". For example, could it be any other sufficiently strong hash of the bomdocs?

OmniBOR Document Construction by a Build Tool
Examine the input for an embedded OmniBOR Document Identifier - ${omnibor document identifier}

Implies that the build tools need to know how to extract identifiers from any input document.

@alilleybrinker
Copy link
Member

Hi Jeremy, if you're amenable, I'd like to identify which items in this issue remain open questions, and split them off into more focused discussions or issues accordingly. Based on my reading, here's the enumeration of everything raised above:

  • Splitting the specification into a core specification which describes the mandatory information, and annexes which describe its practical application / integration into other systems.
    • I believe we've done this! This is how the specification is now structured.
  • Clarification of the use of terms like "build tool," "build action," "build process," "inputs," and more.
    • I don't believe we've resolved this! It's probably the next big language discussion since we resolved the "artifact identifier" stuff. Should be made into its own issue!
  • Clarification of the definition of "leaf artifacts," and in particular whether they might include things like prebuilt binaries, system libraries, and more.
    • I don't believe we've resolved this. It should get its own issue!
  • Tightening the definition of "Artifact Identifier" to make clear how GitOID types should be selected, and what the length constraints and content constraints on artifact identifiers are.
    • Needs its own issue.
  • Clarification of the use of the term "recursive" when describing artifact dependency graphs
    • I believe we've done this! If you think there's more to do here, let me know!
  • Objection to the term "singularity" in the spec
    • We've removed the term!
  • Decision on whether we require build tools to be deterministic for a specific set of inputs
    • I don't think this is resolved, and obviously it ties into the convergence discussion you kickstarted
  • Clarifying that "Artifact ID" is a shorthand for "Artifact Identifier," or eliminating the shorthand altogether
    • Yes, we should resolve this. Needs its own issue.
  • Clarifying how identifiers are represented in general
    • Do you mean in text files or in ELF files? I think this would need to be specified on a per-embedding-context basis. At any rate, we should audit our spec to make sure we're clear about this, and probably update the main spec to explain that it's embedding-specific. Needs it own issue!
  • Describing more clearly how we'll handle Git introducing any new hash types in the future
    • Yes, needs its own issue
  • Fixing use of "OmniBOR Identifier"
    • I believe this is resolved by the recent terminology overhaul
  • Fixing the use of the word "children"
    • The spec no longer refers to children!
  • Clarifying that an Artifact Input Manifest must only have a single artifact identifier type within it (per the parallel trees discussion)
    • I believe the current spec is clear on this, but if you disagree let's open an issue!
  • Further refining how the artifact input manifests for your own inputs are recorded in your manifest
    • Yeah, I don't think we've resolved this. Needs it own issue!
  • Clarifying the construction of artifact input manifests, including what exactly is written in them, and how those elements are sorted. Also clarifying what is specific to an embedding (ELF, text file), and what is true generally
    • Needs its own issue!
  • Clarifying if/how build tools should consume artifact identifiers to produce artifact input manifests for the materials they themselves produce
    • Needs its own issue!

I think that's a thorough accounting of the issues presented above. With your agreement, I can go and open fresh distinct issues for each of the unchecked items in my list. If you have any others which ought to be made, we can open those too.

@alilleybrinker alilleybrinker added the s-needs-info Status: Further information is requested label Sep 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
s-needs-info Status: Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants