Identities of documents #453

ctron · 2024-06-25T10:17:05Z

ctron
Jun 25, 2024
Maintainer

Preface

The trigger for this was: #303 … Things have changed since then, but I think we still don't have a clear strategy for this.

I created two PRs (#451 and #452). Both add tests that re-ingest SBOMs, and both accept the status quo as "ok". This discussion is meant to find out what we actually want. My proposal is to merge those PRs anyway, since writing them uncovered additional issues, which they also fix.

While this discussion focuses on SBOMs, I think it might be valid for advisories too. But we should check.

The tests

They are all under `integration_tests::sbom::reingest`. I'll explain them in more detail in the following sub-sections. In general, the idea is to upload two versions of "the same" SBOM and see what happens. They all use actual data that is out there, not artificially altered (except for one case).

`quarkus`

There are two versions of the same SBOM, released at different points in history. Their structure changed massively. Neither the name nor the document namespace is the same. Only the describing PURL is.

This results in two different SBOMs, which can be located using the same PURL.

I think that behavior is actually ok, as the document namespace and the name did change. They don't really have much in common.

The background of those files is that, at some point, the tool generating the SBOMs was changed, resulting in fundamentally different SBOMs.

`nhc`

This is a variation of the `quarkus` test. However, both SBOMs have been created with the same (or a similar) tool version, producing the same structure. The name is the same, as is the document namespace. The latter is actually a violation of the spec. It's a known issue that will not be fixed in the foreseeable future.

The result again is two different SBOMs, as the digest of the SBOMs is different.

`nhc_same`

This is a variation of the `nhc` test, re-ingesting the same version of the SBOM twice.

This will result in a single SBOM, as the digest matches.

`nhc_same_content`

This is a variation of the `nhc_same` test: the exact same content and structure, but re-serialized without "pretty print".

This results in two different SBOMs, as the digest is different.
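To make the effect concrete, here is a minimal sketch of why re-serializing changes the digest even though the parsed content is identical. The document is a made-up, SPDX-like example (not the real `nhc` SBOM), and SHA-256 is only an assumption standing in for whatever digest is actually used.

```python
import hashlib
import json

# Hypothetical, minimal SPDX-like document; not taken from the real nhc SBOM.
doc = {
    "spdxVersion": "SPDX-2.3",
    "name": "example",
    "documentNamespace": "https://example.com/sbom/example-1",
}

# The same content, serialized once with "pretty print" and once without.
pretty = json.dumps(doc, indent=2).encode("utf-8")
compact = json.dumps(doc, separators=(",", ":")).encode("utf-8")


def digest(data: bytes) -> str:
    """SHA-256 over the raw bytes (assumed here as the byte-level digest)."""
    return hashlib.sha256(data).hexdigest()


# Parsed content is equal, but the byte-level digests differ.
same_content = json.loads(pretty) == json.loads(compact)
same_digest = digest(pretty) == digest(compact)
```

Here `same_content` is true while `same_digest` is false, which is exactly the situation this test exercises.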

`syft_rerun`

This uses the syft tool to generate an SBOM from the same container twice. The name of the SBOM is the same, but the document namespace is not. All according to the spec.

As the digest is different, we get two different SBOMs.

The inconsistencies

The outcome feels rather inconsistent to me. The specs (both SPDX and CycloneDX) say that the "document namespace" ("serialNumber" in CycloneDX) uniquely identifies an SBOM. However, we uniquely identify an SBOM by its digest.

This means that altering a single byte triggers a new SBOM instance. And this is not only about "someone" altering a single byte: the change could also come from the vendor, properly hashed and signed.

This also leads to us accepting multiple versions of the (claimed) same SBOM, with different content.
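For reference, the spec-level identifier sits in a different field depending on the format. A rough sketch of pulling it out of an already parsed JSON document could look like this. The function name `document_id` is made up for illustration, and only the JSON serializations of SPDX 2.x and CycloneDX are assumed.

```python
from typing import Any, Dict, Optional


def document_id(doc: Dict[str, Any]) -> Optional[str]:
    """Return the identifier the spec assigns to the document, if present.

    Hypothetical helper: handles only parsed JSON of SPDX 2.x
    ("documentNamespace") and CycloneDX ("serialNumber").
    """
    if doc.get("bomFormat") == "CycloneDX":
        return doc.get("serialNumber")
    if "spdxVersion" in doc:
        return doc.get("documentNamespace")
    # Unknown format: no identifier we can rely on.
    return None


# Made-up example documents.
spdx = {"spdxVersion": "SPDX-2.3", "documentNamespace": "https://example.com/sbom/a"}
cdx = {"bomFormat": "CycloneDX", "serialNumber": "urn:uuid:3e671687-395b-41f5-a30f-a58921a69b79"}
```

Two uploads with the same `document_id` but different bytes are the case where digest-based identity and spec-based identity disagree.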

Proposal

First of all, I think we should not design for spec non-compliance. If SBOMs are wrong (according to the spec), then we can reject them, or claim that things might explode. But we should not try to fix content.

Second, if the unique ID of a document (document namespace or serial number) says the document is the same, then we should accept that as a fact (like we accept all the other content of that file). This should either lead to a "duplicate ID" error, or replace the existing document. Maybe we even allow both and let some admin decide how to deal with this situation.
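As a rough illustration of the "reject or replace, let an admin decide" idea, here is a small in-memory sketch. All names (`SbomStore`, `OnDuplicate`, `DuplicateIdError`) are hypothetical and not part of the existing code base. Treating a byte-identical re-upload as a no-op is an additional assumption, mirroring the `nhc_same` behavior.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class OnDuplicate(Enum):
    """Admin-selectable policy for an already known document ID (hypothetical)."""
    REJECT = "reject"
    REPLACE = "replace"


class DuplicateIdError(Exception):
    """Raised when a document ID is already taken and the policy is REJECT."""


@dataclass
class SbomStore:
    policy: OnDuplicate = OnDuplicate.REJECT
    # Maps document ID (document namespace / serial number) to raw content.
    docs: Dict[str, bytes] = field(default_factory=dict)

    def ingest(self, doc_id: str, content: bytes) -> None:
        existing = self.docs.get(doc_id)
        if existing == content:
            # Assumption: a byte-identical re-upload is a no-op.
            return
        if existing is not None and self.policy is OnDuplicate.REJECT:
            raise DuplicateIdError(doc_id)
        # New ID, or REPLACE policy: store (or overwrite) the document.
        self.docs[doc_id] = content
```

With `OnDuplicate.REJECT`, a second upload with the same ID but different bytes fails. With `OnDuplicate.REPLACE`, it overwrites the stored document.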

Third, I think we need a way to get the "most recent" SBOM by some non-unique identifier. That could be the SBOM name (plus version? need to find something), and we could simply offer an endpoint which returns a list of SBOMs matching that name. Maybe we already have that. And maybe we add a convenience endpoint which only returns one (or none) sorted by SBOM date.
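A sketch of what the two lookups could do, assuming the SBOM's own creation timestamp is used for ordering. The type and function names are invented for illustration, and the timestamp format is the SPDX 2.x `created` form (`YYYY-MM-DDThh:mm:ssZ`); other formats would need their own parsing.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class SbomSummary:
    """Hypothetical summary record, for illustration only."""
    doc_id: str
    name: str
    created: datetime


def parse_created(value: str) -> datetime:
    """Parse a timestamp in the SPDX 2.x form, e.g. 2024-06-25T10:17:05Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def by_name(sboms: Iterable[SbomSummary], name: str) -> List[SbomSummary]:
    """All SBOMs with the given (non-unique) name, newest first."""
    matches = [s for s in sboms if s.name == name]
    return sorted(matches, key=lambda s: s.created, reverse=True)


def most_recent(sboms: Iterable[SbomSummary], name: str) -> Optional[SbomSummary]:
    """Convenience variant: the newest match, or None if nothing matches."""
    matches = by_name(sboms, name)
    return matches[0] if matches else None
```

The list variant corresponds to the "endpoint which returns a list of SBOMs matching that name", the second one to the convenience endpoint returning one or none.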

JimFuller-RedHat · 2024-06-25T10:52:20Z

JimFuller-RedHat
Jun 25, 2024
Collaborator

Could you clarify what you mean by 'latest'? Is it just datetime, or is it 'latest released'? This type of question affects the component level as well (though it is unclear if trustify is going down to that level), e.g. latest or latest released. If the latter, then 'latest released' is also bounded by product, e.g. latest released in product.

ctron · 2024-06-25T11:14:12Z

ctron
Jun 25, 2024
Maintainer Author

My idea of "sorted by SBOM date" was to use the "date created" which SBOMs have. Independent of the upload time.

JimFuller-RedHat · 2024-09-11T19:14:06Z

JimFuller-RedHat
Sep 11, 2024
Collaborator

What about the situation where an SBOM contains 2 versions of the same component (common example: libcurl), both being used (with presumably the same datetime)?

1 reply

ctron Sep 12, 2024
Maintainer Author

Good point. A 1.0.1 release could be later (by time) than a 1.1.0 release. If in doubt, let the user decide?
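A tiny sketch of that mismatch, with made-up release dates and a naive numeric version comparison (an assumption, real version schemes are messier):

```python
from datetime import date
from typing import Tuple

# Hypothetical releases: the 1.0.1 patch ships after 1.1.0.
releases = [
    ("1.1.0", date(2024, 1, 10)),
    ("1.0.1", date(2024, 3, 5)),
]


def version_key(version: str) -> Tuple[int, ...]:
    """Naive ordering key: split on dots and compare the parts numerically."""
    return tuple(int(part) for part in version.split("."))


# "Latest" by release time and "latest" by version number disagree.
latest_by_time = max(releases, key=lambda r: r[1])[0]
latest_by_version = max(releases, key=lambda r: version_key(r[0]))[0]
```

Here `latest_by_time` is `1.0.1` while `latest_by_version` is `1.1.0`, so neither ordering is obviously "the" right one.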
