fix master CI #1772

Closed
jimmykarily opened this issue Aug 28, 2023 · 11 comments · Fixed by #1794

Comments

@jimmykarily
Contributor

jimmykarily commented Aug 28, 2023

We get 2 different types of errors:

  • timeouts (maybe we need to wait more?)
  • SSH errors

After trying various things for the SSH issue, I tend to believe that our new (paid) GitHub runners run on Azure and that we receive datasource data from the Azure provider.

I believe that because I see this right before SSH fails:

      +run-qemu-test *failed* | WARN[0000] Azure: Saving SSH keys failed: Getting SSH key failed: IMDS returned status code: 404 
      +run-qemu-test *failed* | INFO[2023-08-28T11:25:18Z] Processing stage step 'Run stages if userdata is found'. ( commands: 3, files: 0, ... ) 

(here: https://github.com/kairos-io/kairos/actions/runs/5998712300/job/16268535514)

We should find a way to disable at least the azure provider and see if this happens again.

@jimmykarily jimmykarily converted this from a draft issue Aug 28, 2023
@jimmykarily jimmykarily moved this from Todo 🖊 to In Progress 🏃 in 🧙Issue tracking board Aug 28, 2023
jimmykarily added a commit that referenced this issue Aug 29, 2023
trying to fix this:

#1772

Obviously this deserves a proper fix. This commit is just to see if it
makes any difference.

Signed-off-by: Dimitris Karakasilis <[email protected]>
@jimmykarily
Contributor Author

I'm trying a hack here: #1777

A proper fix would allow us to disable certain providers at runtime (not build time) so that we don't need custom builds just to disable a provider.

@jimmykarily
Contributor Author

The full list of providers is enabled by default. A solution could be to:

  • Create a constant "DEFAULT_PROVIDERS" in yip (somewhere here) which would be the full list which is currently set in 00_datasource.yaml
  • Stop setting the list in 00_datasource.yaml (remove that file completely?)
  • If datasource.providers is not defined in the cloud config, yip will use the default list from the constant; otherwise, it will use the user-defined one.

The end result is the same but it allows the user to override the default (hardcoded) list with her own, even an empty one.
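
The fallback rule proposed above could look roughly like the following. This is only a minimal sketch in Python, not yip's actual Go code: the function name `resolve_providers`, the dict-shaped config, and the contents of `DEFAULT_PROVIDERS` are all assumptions made for illustration.

```python
from __future__ import annotations

# Hypothetical stand-in for the full list currently set in 00_datasource.yaml.
DEFAULT_PROVIDERS = ["gcp", "azure", "cdrom"]


def resolve_providers(cloud_config: dict) -> list[str]:
    """Return the providers to run for an already parsed cloud config."""
    datasource = cloud_config.get("datasource") or {}
    if "providers" not in datasource:
        # Key absent: fall back to the hardcoded default list.
        return list(DEFAULT_PROVIDERS)
    # Key present: honor the user's list, even when it is empty.
    return list(datasource["providers"] or [])
```

The important detail is that "key missing" and "key set to an empty list" are treated differently; only the former falls back to the defaults.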

The only drawback is that the default (full) list will be hardcoded in the binary while now it's set in a yaml (though baked in the image with no way to override).

Another solution would be to use some environment variable (e.g. EXCLUDE_CLOUDINIT_PROVIDERS) with a list of providers to exclude but that's as ugly as it gets.

@jimmykarily
Contributor Author

Hmm, this is more complex than I thought. The yip code doesn't know which stage it's running. If we ran all providers whenever none is set, we would be doing that for every stage. Currently it only happens for the rootfs.after stage, where the provider list is not empty.
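
To illustrate the concern, here is a toy Python model (not yip code). The stage table and the `providers_with_default` helper are hypothetical; the stage names are taken from this thread and the provider names are placeholders.

```python
# Hypothetical stand-in for the full default provider list.
DEFAULT_PROVIDERS = ["gcp", "azure"]

# Hypothetical per-stage provider lists: only rootfs.after sets one today.
STAGES = {
    "rootfs.after": ["gcp", "azure"],
    "initramfs": [],
    "initramfs.after": [],
}


def providers_with_default(stage_providers):
    # Naive rule: "no providers set" means "use the defaults".
    return stage_providers or list(DEFAULT_PROVIDERS)


def stages_probing_today(stages):
    # Stages that run providers with the current behavior.
    return [name for name, providers in stages.items() if providers]


def stages_probing_with_default(stages):
    # Stages that would run providers if the naive default rule applied.
    return [
        name
        for name, providers in stages.items()
        if providers_with_default(providers)
    ]
```

With the naive rule, every stage ends up probing providers, because the code has no way to tell an empty list in rootfs.after from an empty list anywhere else.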

jimmykarily added a commit to mudler/yip that referenced this issue Aug 29, 2023
@jimmykarily
Contributor Author

A kairos-agent bumped to this version of yip ^, when installed with the following config, skips the providers listed in the variable:

#cloud-config

write_files:
- content: |
    EXCLUDE_CLOUD_INIT_PROVIDERS="gcp,azure"
  path: /etc/environment
  permissions: "0644"

users:
    - name: kairos
      passwd: kairos

So, unless we can find a better solution, and if it proves to fix the CI failures, we can go with this one.
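
For readers who want to see the idea in code, the exclusion could be sketched as below. This is a Python illustration under assumptions, not the yip implementation: the helper names, the case-insensitive matching, and the quote stripping are guesses; only the variable name `EXCLUDE_CLOUD_INIT_PROVIDERS` and the comma-separated format come from the config above.

```python
import os


def excluded_providers(environ=os.environ):
    """Parse the comma-separated exclusion list from the environment."""
    raw = environ.get("EXCLUDE_CLOUD_INIT_PROVIDERS", "")
    # Tolerate surrounding quotes, as written in /etc/environment.
    raw = raw.strip().strip("\"'")
    return {name.strip().lower() for name in raw.split(",") if name.strip()}


def filter_providers(providers, environ=os.environ):
    """Drop every provider whose name appears in the exclusion list."""
    excluded = excluded_providers(environ)
    return [p for p in providers if p.lower() not in excluded]
```

For example, `filter_providers(["gcp", "azure", "cdrom"], {"EXCLUDE_CLOUD_INIT_PROVIDERS": "gcp,azure"})` leaves only `"cdrom"`, and an unset variable leaves the list untouched.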

@jimmykarily
Contributor Author

The (paid) GitHub runners are currently not running. Until we get them back, this story is paused. We are temporarily back to the action-runner-controller based runners.

@Itxaka Itxaka moved this from In Progress 🏃 to Under review 🔍 in 🧙Issue tracking board Sep 4, 2023
@Itxaka
Member

Itxaka commented Sep 6, 2023

A kairos-agent bumped to this version of yip ^, when installed with the following config, it skips the set providers:

#cloud-config

write_files:
- content: |
    EXCLUDE_CLOUD_INIT_PROVIDERS="gcp,azure"
  path: /etc/environment
  permissions: "0644"

users:
    - name: kairos
      passwd: kairos

so unless we can find a better solution and if it proves to fix the CI failures, we can go with this one.

I like this one. It allows us to keep Kairos and others using yip the same way, while providing the flexibility to disable providers via a simple env var 👍

@jimmykarily
Contributor Author

We can keep this feature even if it doesn't fix our CI issues. Maybe other people find it useful. We'll need to document it (TODO).

@Itxaka
Member

Itxaka commented Sep 6, 2023

I'm just wondering why that issue would only affect those two jobs.
alpine-opensuse-leap has the same tests on the same machines and they work.
There are 2 other provider tests run with the same framework (datasources, same workers) and they work.

Very very weird.

@Itxaka
Member

Itxaka commented Sep 6, 2023

So it kind of looks like it found the userdata endpoint and, even though it can't get the SSH keys, it creates the userdata sentinel during the cos-setup-network service:

2023-09-06T07:57:35+0000 localhost kairos-agent[1262]: INFO[2023-09-06T07:57:35Z] Processing stage step 'Pull data from provider'. ( commands: 0, files: 0, ... )
2023-09-06T07:57:36+0000 localhost kairos-agent[1262]: time="2023-09-06T07:57:36Z" level=error msg="there were errors probing /dev/sr1: failed mounting /dev/sr1: no medium found no medium found"
2023-09-06T07:57:37+0000 localhost kairos-agent[1262]: time="2023-09-06T07:57:37Z" level=warning msg="Azure: Saving SSH keys failed: Getting SSH key failed: IMDS returned status code: 404"
2023-09-06T07:57:39+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T07:57:39Z] Processing stage step 'Run stages if userdata is found'. ( commands: 3, files: 0, ... )

The userdata doesn't seem to be valid:

2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: warning: skipping /oem/userdata because it has no valid header
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: warning: skipping /oem/userdata.yaml because it has no valid header
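
As a rough illustration of what "no valid header" means, a check along these lines would reject a file whose first line is not a recognized marker. This is a Python sketch under assumptions, not the agent's real logic: the function name is made up, and the accepted set is limited to the `#cloud-config` header seen in the config earlier in this thread, so the real list may be longer.

```python
# Assumption: only the header shown in this thread; the real set may differ.
VALID_HEADERS = ("#cloud-config",)


def has_valid_header(content: str) -> bool:
    """Return True when the first line of the file is a known header."""
    lines = content.splitlines()
    first_line = lines[0].strip() if lines else ""
    return first_line in VALID_HEADERS
```

Under this sketch, an empty file or a payload that starts with anything else (for example a JSON blob from a metadata endpoint) is skipped.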

But the sentinel gets created anyway, which in turn runs the initramfs stages again, which may impact the user/password:

2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:19Z] kairos-agent version v2.2.9
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:20Z] Processing stage step 'Setup sudo'. ( commands: 1, files: 1, ... )
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: WARN[2023-09-06T08:00:20Z] (conditional) Skip 'Skipping stage (if statement error: failed to run cat /proc/cmdline | grep -q "nodepair.enable"
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: : exit status 1)' stage name:
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:20Z] Command output: passwd: password expiry information changed.
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:21Z] Processing stage step 'Ensure runtime permission'. ( commands: 2, files: 0, ... )
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: WARN[2023-09-06T08:00:21Z] (conditional) Skip 'Skipping stage (if statement error: failed to run cat /proc/cmdline | grep -q "interactive-install"
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: : exit status 1)' stage name:
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:21Z] Command output:
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:21Z] Command output:
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: WARN[2023-09-06T08:00:21Z] (conditional) Skip 'Skipping stage (if statement error: failed to run [ -e /sbin/rc-service ]: exit status 1)' stage name: Enable serial login for alpine
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:21Z] Processing stage step 'Ensure runtime permission'. ( commands: 2, files: 0, ... )
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: WARN[2023-09-06T08:00:21Z] (conditional) Skip 'Skipping stage (if statement error: failed to run cat /proc/cmdline | grep -q "interactive-install"
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: : exit status 1)' stage name:
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:21Z] Command output:
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:22Z] Command output:
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:22Z] Processing stage step ''. ( commands: 0, files: 0, ... )
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: WARN[2023-09-06T08:00:22Z] (conditional) Skip 'Skipping stage (if statement error: failed to run [ -f "/sys/firmware/devicetree/base/model" ] && grep -i jetson "/sys/firmware/devicetree/base/model"
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: : exit status 1)' stage name: Create files
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: WARN[2023-09-06T08:00:22Z] (conditional) Skip 'Skipping stage (if statement error: failed to run grep -i alpine "/etc/os-release"
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: : exit status 1)' stage name: Create files
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:22Z] Processing stage step ''. ( commands: 0, files: 0, ... )
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:22Z] Processing stage step 'Set user and password'. ( commands: 0, files: 0, ... )
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:22Z] Done executing stage 'initramfs'
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:22Z] Running stage: initramfs.after
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: WARN[2023-09-06T08:00:23Z] (conditional) Skip 'Skipping stage (if statement error: failed to run [[ $(kairos-agent state get kairos.flavor) =~ ^ubuntu ]]: exit status 1)' stage name: setupcon initramfs.after ubuntu
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:23Z] Done executing stage 'initramfs.after'
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:23Z] Running stage: initramfs.before
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:23Z] Done executing stage 'initramfs.before'
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:23Z] Running stage: initramfs
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:23Z] Done executing stage 'initramfs'
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:23Z] Running stage: initramfs.after
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:23Z] Done executing stage 'initramfs.after'
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: INFO[2023-09-06T08:00:23Z] Some errors found but were ignored. Enable --strict mode to fail on those or --debug to see them in the log
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]: WARN[2023-09-06T08:00:23Z] 2 errors occurred:
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]:         * failed to run systemctl disable NetworkManager: exit status 1
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]:         * failed to run systemctl enable nohang-desktop: exit status 1
2023-09-06T08:00:23+0000 WUS3-GHEUS2UB22C34-0022 kairos-agent[1262]:   

@Itxaka
Member

Itxaka commented Sep 6, 2023

Also notice how the hostname is changed due to the userdata. What I'm not sure about is why this is affecting these 2 jobs specifically...
