e2e flake: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Failover fails #1557

Open
ShyamsundarR opened this issue Sep 13, 2024 · 2 comments
Labels: bug (Something isn't working), test (Testing related issue)

ShyamsundarR commented Sep 13, 2024

This fails very regularly, so it is not really a flake, but I am sticking to the keyword flake.

The failure log looks like this:

--- FAIL: TestSuites/Exhaustive/Deploy-cephfs (0.00s)
  --- FAIL: TestSuites/Exhaustive/Deploy-cephfs/Subscr (852.06s)
    --- PASS: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Deploy (35.59s)
    --- PASS: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Enable (210.49s)
    --- FAIL: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Failover (605.99s)

The failure seems to stem from the failover cluster not reporting DataReady, which is required before the workload can be deployed to that cluster. The logs usually show the following:

2024-09-12T13:14:56.851Z        INFO    controllers.VolumeReplicationGroup      volsync/vshandler.go:385        PVC is not in use by ready pod, not creating RS yet ...   {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "rsSpec": {"protectedPVC":{"namespace":"subscr-deploy-cephfs-busybox","name":"busybox-pvc","protectedByVolSync":true,"storageID":{"id":""},"replicationID":{"id":""},"storageClassName":"rook-cephfs","annotations":{"apps.open-cluster-management.io/hosting-subscription":"subscr-deploy-cephfs-busybox/subscr-deploy-cephfs-busybox","apps.open-cluster-management.io/reconcile-option":"merge"},"labels":{"app":"subscr-deploy-cephfs-busybox","app.kubernetes.io/part-of":"subscr-deploy-cephfs-busybox","appname":"busybox"},"accessModes":["ReadWriteMany"],"resources":{"requests":{"storage":"1Gi"}}}}, "runFinalSync": false}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:121   Not all ReplicationSources completed setup. We'll retry...        {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}
 
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:480   VolSync PVsRestored is not complete yet for PVC busybox-pvc       {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:480   VolSync ReplicationSourceSetup is not complete yet for PVC busybox-pvc    {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:418   VolSync RS hasn't been setup yet for PVC busybox-pvc      {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/volumereplicationgroup_controller.go:1354    Marking VRG ready with replicating reason {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary", "reason": "Unused"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/volumereplicationgroup_controller.go:1320    DataReady {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary", "subconditions": [{"type":"DataReady","status":"False","observedGeneration":3,"lastTransitionTime":null,"reason":"Progressing","message":"Not all VolSync PVCs are ready"},{"type":"DataReady","status":"True","observedGeneration":3,"lastTransitionTime":null,"reason":"Unused","message":"No PVCs are protected using VolumeReplication scheme"}]}

The RS not being set up until there is a ready pod is fine, but that should not get a vote in DataReady. We need to root-cause this in the code to understand the behavior and correct it.
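
For reference, a minimal sketch of the expected behavior (not the actual ramen code; the helper name and aggregation rules here are hypothetical): a subcondition with reason "Unused" should get no vote, so it cannot override a False subcondition from the scheme that is actually protecting PVCs.

package controllers

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// aggregateDataReady is a hypothetical sketch, not the ramen implementation:
// a subcondition whose reason is "Unused" (its replication scheme protects no
// PVCs) is skipped, so it cannot outvote a False subcondition from a scheme
// that is in use.
func aggregateDataReady(subconditions []metav1.Condition) metav1.Condition {
	inUse := []metav1.Condition{}

	for _, c := range subconditions {
		if c.Reason == "Unused" {
			continue // scheme protects no PVCs, it gets no vote
		}

		inUse = append(inUse, c)
	}

	for _, c := range inUse {
		if c.Status != metav1.ConditionTrue {
			return c // an in-use scheme that is not ready decides the result
		}
	}

	if len(inUse) > 0 {
		return inUse[0] // all in-use schemes report ready
	}

	// No scheme protects any PVCs; report ready with reason "Unused".
	return metav1.Condition{Type: "DataReady", Status: metav1.ConditionTrue, Reason: "Unused"}
}

With the two subconditions from the log above, this would report DataReady=False/Progressing instead of letting the Unused VolumeReplication subcondition mark the VRG ready.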

Instances:

nirs commented Oct 28, 2024

In this instance:
https://github.com/RamenDR/ramen/actions/runs/11558180380

We have a ceph core dump:

% ls -lh gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/core.12
-rw-------@ 1 nsoffer  staff   1.5G Oct 28 18:33 gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/core.12
cat gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/ceph-mgr.a.log
...
2024-10-28T16:33:13.231+0000 7ffb02ebc640 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos
9/MACHINE_SIZE/gigantic/release/19.2.0/rpm/el9/BUILD/ceph-19.2.0/src/common/RefCountedObj.cc: In function 'virtual ceph::common::RefCountedObject::~RefCountedOb
ject()' thread 7ffb02ebc640 time 2024-10-28T16:33:13.224069+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.0/rpm/el9/BUILD/ceph-19.2.0/src/common/RefCountedObj.cc: 14: FAILED ceph_assert(nref == 0)

 ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7ffb445d2d86]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x182f44) [0x7ffb445d2f44]
 3: /usr/lib64/ceph/libceph-common.so.2(+0x279259) [0x7ffb446c9259]
 4: /usr/lib64/ceph/libceph-common.so.2(+0x3e69cb) [0x7ffb448369cb]
 5: (ceph::common::RefCountedObject::put() const+0x1b2) [0x7ffb446cb182]
 6: ceph-mgr(+0x19da74) [0x559b6ab47a74]
 7: (OpHistoryServiceThread::entry()+0x124) [0x559b6abafa84]
 8: /lib64/libc.so.6(+0x89d22) [0x7ffb43fb0d22]
 9: /lib64/libc.so.6(+0x10ed40) [0x7ffb44035d40]

2024-10-28T16:33:13.235+0000 7ffb02ebc640 -1 *** Caught signal (Aborted) **
 in thread 7ffb02ebc640 thread_name:OpHistorySvc

 ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
 1: /lib64/libc.so.6(+0x3e730) [0x7ffb43f65730]
 2: /lib64/libc.so.6(+0x8ba6c) [0x7ffb43fb2a6c]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7ffb445d2de0]
 6: /usr/lib64/ceph/libceph-common.so.2(+0x182f44) [0x7ffb445d2f44]
 7: /usr/lib64/ceph/libceph-common.so.2(+0x279259) [0x7ffb446c9259]
 8: /usr/lib64/ceph/libceph-common.so.2(+0x3e69cb) [0x7ffb448369cb]
 9: (ceph::common::RefCountedObject::put() const+0x1b2) [0x7ffb446cb182]
 10: ceph-mgr(+0x19da74) [0x559b6ab47a74]
 11: (OpHistoryServiceThread::entry()+0x124) [0x559b6abafa84]
 12: /lib64/libc.so.6(+0x89d22) [0x7ffb43fb0d22]
 13: /lib64/libc.so.6(+0x10ed40) [0x7ffb44035d40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

nirs commented Oct 31, 2024

Another case where we had a ceph core dump:
https://github.com/RamenDR/ramen/actions/runs/11603228347
