Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Helm Chart: Move vulnerability processing to be a cronjob by default #25488

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

pboushy
Copy link
Contributor

@pboushy pboushy commented Jan 16, 2025

The existing helm chart is designed to run vulnerability processing on every container, which requires 4Gi/container.
However, the default for the helm chart is for each container to have a maximum of 1Gi.

This change switches the default so that vulnerability processing is disabled in the deployment, and moves vulnerability processing to a dedicated cronjob that runs 1/day at 1am. (I didn't make that configurable...)

A few items I think are important to call out:

  1. I have commented out alot of environment variables in the cronjob that existed in the migration and deployment because I don't think they're required, but I wanted one of you to review and actually say that they're not necessary.
  2. I did not include anything related to osquery or exposing the server to clients in this since it's not meant to handle clients, just vulnerability processing.
  3. I believe I did everything to make sure cloudSQL will work, but it should be tested.

Checklist for submitter

  • Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.
    See Changes files for more information.
  • Added/updated automated tests
  • A detailed QA plan exists on the associated ticket (if it isn't there, work with the product group's QA engineer to add it)
  • Manual QA for all new/changed functionality

- vuln processing requires 4Gi RAM. Fleet can run fine with less for
  most items
- Add "dedicated" flag and default to true
- Allow user to customize vulnProcessing resources independently from
  main resources var
- ensure that FLEET_VULNERABILITIES_DISABLE_SCHEDULE=true when using
  dedicated
@pboushy pboushy force-pushed the make-vuln-processing-a-cron branch from 7aa3ccf to 3b535d8 Compare January 16, 2025 01:10
@lukeheath lukeheath marked this pull request as draft January 16, 2025 16:53
@lukeheath
Copy link
Member

@pboushy Thank you for your contribution!

Because this changes the default product behavior, I'm adding this PR to our drafting board so that our product design team can review. Some users will need vulnerability processing to run more often than once a day, so that's likely something that needs to be configurable. But we'll review with our team first and let you know our thoughts. Thanks again!

cc @noahtalerman

@lukeheath lukeheath added the :product Product Design department (shows up on 🦢 Drafting board) label Jan 16, 2025
@pboushy
Copy link
Contributor Author

pboushy commented Jan 16, 2025

@pboushy Thank you for your contribution!

Because this changes the default product behavior, I'm adding this PR to our drafting board so that our product design team can review. Some users will need vulnerability processing to run more often than once a day, so that's likely something that needs to be configurable. But we'll review with our team first and let you know our thoughts. Thanks again!

cc @noahtalerman

I'm also open to switching this so it operates as it already does by default, and the admin has to opt-in to moving this to a dedicated cronjob for vulnerability processing.

It's a very simple change to charts/fleet/values.yaml

vulnProcessing:
-  dedicated: true
+  dedicated: false

I went with making this method the default because the default resources specified in the chart are insufficient to run vuln processing, and this makes the resource utilization more efficient overall.

@rfairburn
Copy link
Contributor

Thanks for your efforts here @pboushy!

The default the internal cron uses when not configured for vuln processing is once per hour. My guess is we will likely want to match that as the default in values.yaml.

We'll also need to ensure to have a concurrencyPolicy of Forbid as having multiple jobs run concurrently for any reason would cause unexpected behavior.

CC: @lukeheath

@noahtalerman noahtalerman removed the :product Product Design department (shows up on 🦢 Drafting board) label Jan 17, 2025
@noahtalerman noahtalerman removed their assignment Jan 17, 2025
@noahtalerman
Copy link
Member

noahtalerman commented Jan 17, 2025

Thanks @pboushy! I tracked a feature request for this here: #25566

From the feature request:

Can we bump the default resources instead? Sounds like we have to find a balance between frequency of the job and amount of resources.

@rfairburn can you please schedule 15 mins with me next week to discuss?

cc @lukeheath

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants