Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generic fetcher: Official support with ADR #728

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

kosciCZ
Copy link
Contributor

@kosciCZ kosciCZ commented Nov 6, 2024

This change includes the generic fetcher ADR which is a necessary step for the feature to be officially supported. Split from #718

Maintainers will complete the following section

  • Commit messages are descriptive enough
  • Code coverage from testing does not decrease and new code is covered
  • Docs updated (if applicable)
  • Docs links in the code are still valid (if docs were updated)

Note: if the contribution is external (not from an organization member), the CI
pipeline will not run automatically. After verifying that the CI is safe to run:

docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
Copy link
Member

@eskultety eskultety left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kosciCZ minor suggestions I'd say but content wise not much really, I see the important stuff I would expect from an ADR mentioned there.

Copy link
Collaborator

@a-ovchinnikov a-ovchinnikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few trivial modifications to ADR to reflect the most recent changes, also some wording nitpicks, LGTM otherwise.

docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
docs/adr/0001-add-generic-fetcher.md Outdated Show resolved Hide resolved
docs/usage.md Show resolved Hide resolved
@@ -32,24 +32,29 @@ def fetch_generic_source(request: Request) -> RequestOutput:
components = []
for package in request.generic_packages:
path = request.source_dir.join_within_root(package.path)
components.extend(_resolve_generic_lockfile(path, request.output_dir))
lockfile = package.lockfile or path.join_within_root(DEFAULT_LOCKFILE_NAME).path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if requiring users to provide absolute paths for the lockfile via the CLI is the best approach here. It might be tricky in the context of a CI system where the user does not know exactly what the path to the lockfile will be. We could potentially allow for relative paths if we resolved them within request.source_dir.

Did we verify if this approach would solve any current use case? I'm getting the impression that fetching the lockfile from a remote location could be needed in some cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another concern is security implications of allowing users to specify a filepath that resolves outside of the request's root dir, as it could be used to exploit any larger CI system that Cachi2 is part of. For the current use cases, Cachi2 always verifies that all the files it is reading are within the request's root dir, and that's the reason we have the RootedPath class.

This is very important in context of using package managers built-in commands that are outside of our control. I reckon that in this case, we'd be safer since we have control over what we're doing with the file. Still, I think some extra consideration is needed since this sets a new precedent for the project.

Ideally, from a security perspective, we'd allow only relative paths based on request.source_dir. But on the other hand, we're trying to solve the use case of upstream projects that would not commit an artifacts.lock.yaml file to it, so this would not be sufficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I suggested before, we could postpone this to another moment in case we're not certain of the decision now. I don't think we should block this ADR because of this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI We as reviewers are contradicting each other in our suggestions: #728 (comment) which is what @kosciCZ followed so we need to get things straight rather quickly otherwise it's an awfully frustrating experience for the contributor.

Another concern is security implications of allowing users to specify a filepath that resolves outside of the request's root dir, as it could be used to exploit any larger CI system that Cachi2 is part of. For the current use cases, Cachi2 always verifies that all the files it is reading are within the request's root dir, and that's the reason we have the RootedPath class.

We do and that makes total sense for most scenarios, manifest files IMO not being one of them, but in general we should keep honoring this I agree.

This is very important in context of using package managers built-in commands that are outside of our control. I reckon that in this case, we'd be safer since we have control over what we're doing with the file. Still, I think some extra consideration is needed since this sets a new precedent for the project.

The manifest file is a declarative yaml file, so in order for this to be exploitable the absolute path would have to be a symlink to a valid YAML file containing some hidden parser instructions to exploit the YAML parser itself which would then need to gain full control over our own process; other than that I don't see how we are safer with a potentially malicious checked-in manifest file, IOW this manifest file only contains data pointing to online resources which we are not able to evaluate on our own anyway, so how do you see this manifest file exploiting the containing CI systems cachi2 is part of? I think any pipeline subsystem/service should be hardened on its own for this very reason that any component can essentially be exploited and so nothing can be implicitly trusted and attack vectors can only be mitigated, not avoided.

Ideally, from a security perspective, we'd allow only relative paths based on request.source_dir. But on the other hand, we're trying to solve the use case of upstream projects that would not commit an artifacts.lock.yaml file to it, so this would not be sufficient.

Relative paths solve nothing in general because the same file a user would provide via an absolute path they'd now purposfully copy over to the repository to pass our check, so from security perspective we didn't achieve anything, the file is either checked-in to the repo in which case we're putting our confidence into the community collective responsibility for the project or it isn't checked into the repository in which case we have absolutely no guarantee on anything regardless of where the file came from and since users can customize their pipeline automation (e.g. tekton) you can bet they'll use whatever power they have to workaround our trivial path checks (also demonstrated by an example posted by @kosciCZ here: https://github.com/red-hat-data-services/trustyai-explainability/pull/44/files)

As I suggested before, we could postpone this to another moment in case we're not certain of the decision now. I don't think we should block this ADR because of this.

I disagree , I outlined why in: #728 (comment) . If we make this a follow up work, then:

  1. it'll likely never happen and
  2. we'd be forcing our user base to check the manifest file to their repos and once they do so, they won't be willing to go back easily which is bad UX IMO solely because we're forcing people to commit random manifest files to their repos which don't really relate to their project in any way it's just that someone told the project to be built in a particular manner which requires them to use a non-standard configuration file which has no widely-accepted ecosystem around it - perfectly fine for downstream solutions (do whatever you want), not okay with upstream projects IMO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The potential exploit would be this: {"type": "generic", "lockfile": "/root/.docker/config.json"} - an attempt to make cachi2 raise a validation error which would print the content of a file that the CI task wasn't intended to access

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is more that until now, cachi2 promised not to read/write files outside the --source, --output and --config-file locations. This change breaks that promise, but if the error messages are safe it's not a huge deal

Do we have a promise like that anywhere? Also, that promise isn't broken with a default CLI, cachi2 would only ever look beyond if instructed with additional JSON input arguments on the CLI, so I don't think we're breaking any promises.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR: the error message could be a tiny bit better (avoid exposing the keys), but no further concerns from my side

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had discussion with Adam, so I see there 2 POW:

  1. INPUT is just considered to be regular CLI params, just in json form, then wrapper around CLI (user, OSBS) must ensure, that provide paths are right (any paths from nested files must not get out of CWD)

  2. INPUT is just inline config file and wrapper cannot do much about it, then cachito should ensure that it's not reading outside of CWD, as wrapper may not have the power to ensure paths are safe.

Given the complexity of JSON file, I'm probably more eager to agree with 2

*CWD or dir explicitly set by separate option

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idealistically my view is option 2, but practically making an exception for the generic fetcher is fine IMO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's go with 1) ; I think users may want to provide this file from a separate task, not via source

Copy link
Contributor

@brunoapimentel brunoapimentel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, the only thing I wanted is more feedback on this concern here.

My last comment is that I think we should use this PR to move generic out of dev-package-managers, effectively making it stable.

@brunoapimentel
Copy link
Contributor

/ok-to-test

@eskultety
Copy link
Member

This LGTM, the only thing I wanted is more feedback on this concern here.

My last comment is that I think we should use this PR to move generic out of dev-package-managers, effectively making it stable.

If you want to lift the experimental status, then I believe the ability to consume manifest files for the generic fetcher via a secondary channel needs to addressed, but most importantly all maintainers need to be in agreement with the approach.

@kosciCZ
Copy link
Contributor Author

kosciCZ commented Nov 13, 2024

If you want to lift the experimental status, then I believe the ability to consume manifest files for the generic fetcher via a secondary channel needs to addressed, but most importantly all maintainers need to be in agreement with the approach.

@eskultety So do I understand correctly that despite everything else being resolved and ready, this PR is now blocked for two weeks since Bruno is on vacation, despite the feature of being able to specify an absolute lockfile path being approved by two out of three reviewers?

@eskultety
Copy link
Member

If you want to lift the experimental status, then I believe the ability to consume manifest files for the generic fetcher via a secondary channel needs to addressed, but most importantly all maintainers need to be in agreement with the approach.

@eskultety So do I understand correctly that despite everything else being resolved and ready, this PR is now blocked for two weeks since Bruno is on vacation, despite the feature of being able to specify an absolute lockfile path being approved by two out of three reviewers?

First of all, I don't think this personal plan/situation was publicly announced anywhere and so I don't consider it appropriate doing it on his behalf this way in a public space, so please respect others' privacy and refrain from doing so in the future.

Now, to your question, I will summarize what I said in the relevant threads regarding this concern:

  • allowing external YAML config files is only as safe as the YAML parser itself we rely on
  • we already allow our own cachi2 config to be passed from an arbitrary location with an absolute path and it is a YAML, so if a manifest could be used to exploit the YAML parser so could be via our own cachi2 config
  • we don't check whether files are truly checked into a repo and so if we don't allow arbitrary paths for the artifact manifest people will deliberately circumvent that by injecting the file from a different location just before running cachi2
  • forcing the manifest to be read from the repo and repo only is IMO only a marginal improvement from security POV (that said, additional hardening should be employed wherever cachi2 is to be executed with elevated privileges), but a bigger convenience penalty for the user base
  • makes IMO no sense to force people checking a random YAML manifest file to their repos to satisfy a particular 3rd party build toolchain

With all that, no, I am not convinced this work needs to be blocked for any longer with the approvals in place.

Copy link
Member

@eskultety eskultety left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I cannot rule out the theoretical possibility of exploiting our input model validators in order to disclose some unintended data of varying severity, I believe the actual problem (pre-existing) is bound to the logging/error reporting system which may need to be further hardened which is something outside of scope of this PR.

@kosciCZ
Copy link
Contributor Author

kosciCZ commented Nov 14, 2024

I appreciate the approval. I will add one more commit that lifts the experimental status.

Target location for the individual fetched artifacts specified in
the lockfile is now called `filename`, to better signify its purpose.

Signed-off-by: Jan Koscielniak <[email protected]>
This change removes the option to specify multiple checksums for a
given artifact in the lockfile as a map and replaces it with a single
checksum as a string of "alg:hash" format

Signed-off-by: Jan Koscielniak <[email protected]>
For better UX, the lockfile for the generic fetcher is now
called `artifacts.lock.yaml`

Signed-off-by: Jan Koscielniak <[email protected]>
Users are now allowed to supply an absolute path of the generic
lockfile. This should help in cases where it does not make sense
to commit the file to the source repository.

Signed-off-by: Jan Koscielniak <[email protected]>
Previously, the lockfile validation passed, even if user provided
extra information in the lockfile. The lockfile format must now
match the schema exactly.

Signed-off-by: Jan Koscielniak <[email protected]>
Adds example usage of the `filename` option in the lockfile.

Signed-off-by: Jan Koscielniak <[email protected]>
This change includes the generic fetcher ADR which is a necessary
step for the feature to be officially supported.

Signed-off-by: Jan Koscielniak <[email protected]>
Generic fetcher is now an officially supported part of cachi2.
`--dev-package-managers` flag is no longer needed.

Signed-off-by: Jan Koscielniak <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants