OCPBUGS-45924: add a monitor test that detects concurrent installer pods #29382

Open: wants to merge 2 commits into base: master


@tkashem tkashem commented Dec 17, 2024

This PR examines the events associated with the installer pods and does the following:
a) constructs an end-to-end timeline of the installer pods
b) detects whether installer pods are running concurrently on two or more nodes and, if so, reports the test as a flake

We want to know how widespread (b) is.

installer pod timeline: (image not included)

The test flakes if it finds concurrent installer pods on two or more nodes. This is what the output would look like (simulated, not an actual occurrence):

: [sig-apimachinery] installer Pods should not run concurrently on two or more node
{  
A(2024-12-18T16:11:21Z -> 0001-01-01T00:00:00Z) B(2024-12-18T16:13:07Z -> 0001-01-01T00:00:00Z):

A: node(ci-op-54qd4d73-03fd1-cl265-master-0) name(installer-9-ci-op-54qd4d73-03fd1-cl265-master-0) namespace(openshift-etcd) reason() started(2024-12-18T16:11:21Z) duration: -2562047h47m16.854775808s
B: node(ci-op-54qd4d73-03fd1-cl265-master-1) name(installer-9-ci-op-54qd4d73-03fd1-cl265-master-1) namespace(openshift-etcd) reason() started(2024-12-18T16:13:07Z) duration: -2562047h47m16.854775808s
}

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 17, 2024

openshift-ci bot commented Dec 17, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tkashem
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tkashem tkashem force-pushed the monitor-installer-pod branch from 16844b1 to f6f032f Compare December 18, 2024 00:44

openshift-trt bot commented Dec 18, 2024

Job Failure Risk Analysis for sha: f6f032f

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn IncompleteTests
Tests for this run (20) are below the historical average (482): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-serial Medium
[sig-storage][Feature:Cluster-CSI-Snapshot-Controller-Operator][Serial][apigroup:operator.openshift.io] should restart webhook Pods if csi-snapshot-webhook-secret is deleted [Suite:openshift/conformance/serial]
This test has passed 97.30% of 74 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-serial'] in the last 14 days.
---
[sig-storage][Feature:Cluster-CSI-Snapshot-Controller-Operator][Serial][apigroup:operator.openshift.io] should restart webhook Pods if csi-snapshot-webhook-secret expiry annotation is changed [Suite:openshift/conformance/serial]
This test has passed 97.30% of 74 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-serial'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial Low
[sig-storage][Feature:Cluster-CSI-Snapshot-Controller-Operator][Serial][apigroup:operator.openshift.io] should restart webhook Pods if csi-snapshot-webhook-secret expiry annotation is changed [Suite:openshift/conformance/serial]
This test has passed 78.57% of 14 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:none] in the last week.
---
[sig-storage][Feature:Cluster-CSI-Snapshot-Controller-Operator][Serial][apigroup:operator.openshift.io] should restart webhook Pods if csi-snapshot-webhook-secret is deleted [Suite:openshift/conformance/serial]
This test has passed 78.57% of 14 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:none] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 57.14% of 7 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

@tkashem tkashem force-pushed the monitor-installer-pod branch from f6f032f to d770b9f Compare December 18, 2024 14:30

openshift-trt bot commented Dec 18, 2024

Job Failure Risk Analysis for sha: d770b9f

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn IncompleteTests
Tests for this run (20) are below the historical average (446): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node Low
[sig-node] static pods should start after being created
This test has passed 71.11% of 90 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node'] in the last 14 days.

@tkashem tkashem force-pushed the monitor-installer-pod branch from d770b9f to 36711ce Compare December 18, 2024 20:21

tkashem commented Dec 19, 2024

@tkashem tkashem changed the title from "[WIP] add a monitor test for installer pod timeline" to "add a monitor test for installer pod timeline" Dec 19, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 19, 2024

openshift-trt bot commented Dec 19, 2024

Job Failure Risk Analysis for sha: 36711ce

Job Name Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn IncompleteTests
Tests for this run (20) are below the historical average (194): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)


tkashem commented Dec 19, 2024

/payload


openshift-ci bot commented Dec 19, 2024

@tkashem: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.


tkashem commented Dec 19, 2024

/payload 4.18 nightly informing


openshift-ci bot commented Dec 19, 2024

@tkashem: trigger 68 job(s) of type informing for the nightly release of OCP 4.18

  • periodic-ci-openshift-release-master-nightly-4.18-e2e-agent-compact-fips
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-agent-ha-dualstack-conformance
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-agent-single-node-ipv6
  • periodic-ci-openshift-release-master-nightly-4.18-console-aws
  • periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.18-periodics-e2e-aws
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-csi
  • periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-cgroupsv2
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-fips
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node-csi
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node-serial
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node-techpreview
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node-techpreview-serial
  • periodic-ci-openshift-release-master-nightly-4.18-upgrade-from-stable-4.17-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-upgrade-out-of-change
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-upi
  • periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.18-periodics-e2e-azure
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-csi
  • periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn
  • periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade-out-of-change
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-driver-toolkit
  • periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.18-periodics-e2e-gcp
  • periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-csi
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-rt
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-bm-upgrade
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-dualstack
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-dualstack-techpreview
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-ipv6-techpreview
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-serial-ipv4
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-serial-virtualmedia
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.18-upgrade-from-stable-4.17-e2e-metal-ipi-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-serial-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-serial-ovn-dualstack
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-upgrade-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.18-upgrade-from-stable-4.17-e2e-metal-ipi-upgrade-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ovn-assisted
  • periodic-ci-openshift-release-master-nightly-4.18-metal-ovn-single-node-recert-cluster-rename
  • periodic-ci-openshift-osde2e-main-nightly-4.18-osd-aws
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-osd-ccs-gcp
  • periodic-ci-openshift-osde2e-main-nightly-4.18-osd-gcp
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-proxy
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ovn-single-node-live-iso
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-rosa-sts-ovn
  • periodic-ci-openshift-osde2e-main-nightly-4.18-rosa-classic-sts
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-rosa-sts-hypershift-ovn
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-telco5g
  • periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-csi
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-serial
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-ci-4.18-e2e-vsphere-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-vsphere-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-upi
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-upi-serial
  • periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-static-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dfdef9b0-bdbf-11ef-98ca-f9166c462206-0

@tkashem tkashem changed the title from "add a monitor test for installer pod timeline" to "OCPBUGS-45924: add a monitor test for installer pod timeline" Dec 19, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 19, 2024
@openshift-ci-robot

@tkashem: This pull request references Jira Issue OCPBUGS-45924, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@tkashem tkashem changed the title from "OCPBUGS-45924: add a monitor test for installer pod timeline" to "OCPBUGS-45924: add a monitor test that detects concurrent installer pods" Dec 19, 2024

tkashem commented Dec 20, 2024

/retest


tkashem commented Dec 22, 2024

/retest


openshift-ci bot commented Dec 22, 2024

@tkashem: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-upgrade | a051b38 | link | false | /test e2e-aws-ovn-upgrade |
| ci/prow/e2e-aws-ovn-single-node-upgrade | a051b38 | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-agnostic-ovn-cmd | a051b38 | link | false | /test e2e-agnostic-ovn-cmd |
| ci/prow/verify | a051b38 | link | true | /test verify |
| ci/prow/unit | a051b38 | link | true | /test unit |
| ci/prow/e2e-aws-ovn-single-node-serial | a051b38 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/lint | a051b38 | link | true | /test lint |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | a051b38 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-gcp-ovn-upgrade | a051b38 | link | true | /test e2e-gcp-ovn-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


openshift-trt bot commented Dec 22, 2024

Job Failure Risk Analysis for sha: a051b38

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade Low
[sig-network] pods should successfully create sandboxes by other
This test has passed 66.44% of 149 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:micro] in the last week.

Labels
jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
2 participants