e2e flake: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Failover fails #1557

Open
ShyamsundarR opened this issue Sep 13, 2024 · 2 comments
Labels: bug (Something isn't working), test (Testing related issue)

ShyamsundarR commented Sep 13, 2024

This fails very regularly, so it is not really a flake, but I am sticking to the keyword flake.

The failure log looks like this:

--- FAIL: TestSuites/Exhaustive/Deploy-cephfs (0.00s)
  --- FAIL: TestSuites/Exhaustive/Deploy-cephfs/Subscr (852.06s)
    --- PASS: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Deploy (35.59s)
    --- PASS: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Enable (210.49s)
    --- FAIL: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Failover (605.99s)

The failure seems to stem from the failover cluster not reporting DataReady, which is required before the workload can be deployed to that cluster. The logs usually show the following:

2024-09-12T13:14:56.851Z        INFO    controllers.VolumeReplicationGroup      volsync/vshandler.go:385        PVC is not in use by ready pod, not creating RS yet ...   {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "rsSpec": {"protectedPVC":{"namespace":"subscr-deploy-cephfs-busybox","name":"busybox-pvc","protectedByVolSync":true,"storageID":{"id":""},"replicationID":{"id":""},"storageClassName":"rook-cephfs","annotations":{"apps.open-cluster-management.io/hosting-subscription":"subscr-deploy-cephfs-busybox/subscr-deploy-cephfs-busybox","apps.open-cluster-management.io/reconcile-option":"merge"},"labels":{"app":"subscr-deploy-cephfs-busybox","app.kubernetes.io/part-of":"subscr-deploy-cephfs-busybox","appname":"busybox"},"accessModes":["ReadWriteMany"],"resources":{"requests":{"storage":"1Gi"}}}}, "runFinalSync": false}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:121   Not all ReplicationSources completed setup. We'll retry...        {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}
 
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:480   VolSync PVsRestored is not complete yet for PVC busybox-pvc       {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:480   VolSync ReplicationSourceSetup is not complete yet for PVC busybox-pvc    {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:418   VolSync RS hasn't been setup yet for PVC busybox-pvc      {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/volumereplicationgroup_controller.go:1354    Marking VRG ready with replicating reason {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary", "reason": "Unused"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/volumereplicationgroup_controller.go:1320    DataReady {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary", "subconditions": [{"type":"DataReady","status":"False","observedGeneration":3,"lastTransitionTime":null,"reason":"Progressing","message":"Not all VolSync PVCs are ready"},{"type":"DataReady","status":"True","observedGeneration":3,"lastTransitionTime":null,"reason":"Unused","message":"No PVCs are protected using VolumeReplication scheme"}]}

The RS not being set up until there is a ready pod is fine, but that should not get a vote in DataReady. We need to root-cause this in the code to understand the behavior and correct it.
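
For reference, a minimal sketch of the expected behavior (not the actual ramen code; the helper name and aggregation rules here are hypothetical): a subcondition with reason "Unused" should get no vote, so it cannot override a False subcondition from the scheme that is actually protecting PVCs.

package controllers

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// aggregateDataReady is a hypothetical sketch, not the ramen implementation:
// a subcondition whose reason is "Unused" (its replication scheme protects no
// PVCs) is skipped, so it cannot outvote a False subcondition from a scheme
// that is in use.
func aggregateDataReady(subconditions []metav1.Condition) metav1.Condition {
	inUse := []metav1.Condition{}

	for _, c := range subconditions {
		if c.Reason == "Unused" {
			continue // scheme protects no PVCs, it gets no vote
		}

		inUse = append(inUse, c)
	}

	for _, c := range inUse {
		if c.Status != metav1.ConditionTrue {
			return c // an in-use scheme that is not ready decides the result
		}
	}

	if len(inUse) > 0 {
		return inUse[0] // all in-use schemes report ready
	}

	// No scheme protects any PVCs; report ready with reason "Unused".
	return metav1.Condition{Type: "DataReady", Status: metav1.ConditionTrue, Reason: "Unused"}
}

With the two subconditions from the log above, this would report DataReady=False/Progressing instead of letting the Unused VolumeReplication subcondition mark the VRG ready.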

Instances:

nirs commented Oct 28, 2024

In this instance:
https://github.com/RamenDR/ramen/actions/runs/11558180380

We have a ceph core dump:

% ls -lh gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/core.12
-rw-------@ 1 nsoffer  staff   1.5G Oct 28 18:33 gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/core.12
cat gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/ceph-mgr.a.log
...
2024-10-28T16:33:13.231+0000 7ffb02ebc640 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos
9/MACHINE_SIZE/gigantic/release/19.2.0/rpm/el9/BUILD/ceph-19.2.0/src/common/RefCountedObj.cc: In function 'virtual ceph::common::RefCountedObject::~RefCountedOb
ject()' thread 7ffb02ebc640 time 2024-10-28T16:33:13.224069+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.0/rpm/el9/BUILD/ceph-19.2.0/src/common/RefCountedObj.cc: 14: FAILED ceph_assert(nref == 0)

 ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7ffb445d2d86]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x182f44) [0x7ffb445d2f44]
 3: /usr/lib64/ceph/libceph-common.so.2(+0x279259) [0x7ffb446c9259]
 4: /usr/lib64/ceph/libceph-common.so.2(+0x3e69cb) [0x7ffb448369cb]
 5: (ceph::common::RefCountedObject::put() const+0x1b2) [0x7ffb446cb182]
 6: ceph-mgr(+0x19da74) [0x559b6ab47a74]
 7: (OpHistoryServiceThread::entry()+0x124) [0x559b6abafa84]
 8: /lib64/libc.so.6(+0x89d22) [0x7ffb43fb0d22]
 9: /lib64/libc.so.6(+0x10ed40) [0x7ffb44035d40]

2024-10-28T16:33:13.235+0000 7ffb02ebc640 -1 *** Caught signal (Aborted) **
 in thread 7ffb02ebc640 thread_name:OpHistorySvc

 ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
 1: /lib64/libc.so.6(+0x3e730) [0x7ffb43f65730]
 2: /lib64/libc.so.6(+0x8ba6c) [0x7ffb43fb2a6c]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7ffb445d2de0]
 6: /usr/lib64/ceph/libceph-common.so.2(+0x182f44) [0x7ffb445d2f44]
 7: /usr/lib64/ceph/libceph-common.so.2(+0x279259) [0x7ffb446c9259]
 8: /usr/lib64/ceph/libceph-common.so.2(+0x3e69cb) [0x7ffb448369cb]
 9: (ceph::common::RefCountedObject::put() const+0x1b2) [0x7ffb446cb182]
 10: ceph-mgr(+0x19da74) [0x559b6ab47a74]
 11: (OpHistoryServiceThread::entry()+0x124) [0x559b6abafa84]
 12: /lib64/libc.so.6(+0x89d22) [0x7ffb43fb0d22]
 13: /lib64/libc.so.6(+0x10ed40) [0x7ffb44035d40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

nirs commented Oct 31, 2024

Another case where we had a ceph core dump:
https://github.com/RamenDR/ramen/actions/runs/11603228347
