
Unsafe recovery partially fills key range hole #6859

Open
overvenus opened this issue Jul 29, 2023 · 3 comments
@overvenus
Member

overvenus commented Jul 29, 2023

Bug Report

On a 4-node TiKV cluster, we stopped two nodes and then started unsafe recovery using pd-ctl.
After unsafe recovery, we found lots of PD server timeout errors, and it turned out that
a region had failed to be created.

Failed TiKV: tikv-0 and tikv-1
Alive TiKV: tikv-2 and tikv-3
Original region ID: 1965
New region ID: 2991

Timeline:

  1. 1965 on tikv-3 sends a snapshot to tikv-2.
  2. Starts unsafe recovery.
  3. Snapshot sent.
  4. 1965 on tikv-3 becomes tombstone.
  5. A peer of 1965 is created on tikv-2.
  6. PD sends a request to tikv-2 to create 2991 to cover the key range of 1965.
  7. 2991 fails to be created because 1965 has already been created on tikv-2.
  8. PD considers unsafe recovery is finished.

There are actually two questions:

  1. Why does PD finish unsafe recovery while there is a key range hole? (A coverage-check sketch follows this list.)
  2. Why does PD tombstone 1965 in the first place? Stopping two nodes out of a
    four-node cluster should not lose replica data completely.
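
For question 1, here is a minimal sketch (in Go, with a simplified `Region` type and made-up key ranges and IDs, not PD's actual data structures) of the kind of coverage check that would detect such a hole before recovery is declared finished:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Region is a simplified stand-in for a PD region descriptor.
type Region struct {
	ID       uint64
	StartKey []byte // inclusive
	EndKey   []byte // exclusive; empty means +inf
}

// findHoles returns the key ranges left uncovered by the given regions,
// assuming the key space runs from "" to +inf.
func findHoles(regions []Region) [][2][]byte {
	sort.Slice(regions, func(i, j int) bool {
		return bytes.Compare(regions[i].StartKey, regions[j].StartKey) < 0
	})
	var holes [][2][]byte
	cursor := []byte{} // how far the key space is covered so far
	for _, r := range regions {
		if bytes.Compare(r.StartKey, cursor) > 0 {
			holes = append(holes, [2][]byte{cursor, r.StartKey})
		}
		if len(r.EndKey) == 0 {
			return holes // a region reaches +inf, the key space is closed
		}
		if bytes.Compare(r.EndKey, cursor) > 0 {
			cursor = r.EndKey
		}
	}
	// No region reaches +inf, so the tail of the key space is uncovered.
	return append(holes, [2][]byte{cursor, nil})
}

func main() {
	// Made-up example: the range that used to belong to the old region is missing.
	regions := []Region{
		{ID: 2991, StartKey: []byte(""), EndKey: []byte("b")},
		{ID: 3000, StartKey: []byte("d"), EndKey: []byte("")},
	}
	for _, h := range findHoles(regions) {
		fmt.Printf("uncovered range: [%q, %q)\n", h[0], h[1])
	}
}
```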

Note: the issue was found on a multi-RocksDB cluster, but I think it may affect single-RocksDB clusters too.

Log:

What did you do?

See above.

What version of PD are you using (pd-server -V)?

v7.1.0

overvenus added the type/bug label (The issue is confirmed as a bug.) on Jul 29, 2023
@v01dstar
Contributor

Maybe not relevant, just for reference: region 1965 received one vote from the dead store 1.

[2023/07/28 07:54:59.375 +00:00] [INFO] [raft.rs:2230] ["received votes response"] [term=9] [type=MsgRequestVoteResponse] [approvals=2] [rejections=0] [from=1967] [vote=true] [raft_id=1968] [peer_id=1968] [region_id=1965]

Members:

region_epoch { conf_ver: 59 version: 109 } peers { id: 1967 store_id: 1 } peers { id: 1968 store_id: 216 } peers { id: 2783 store_id: 45 }"] [legacy=false] [changes="[change_type: AddLearnerNode peer { id: 2990 store_id: 4 role: Learner }]"] [peer_id=1968] [region_id=1965]

@v01dstar
Contributor

I can't find any clue from the log.

I think the snapshot-related handling was "ok" in this case. The key is to find out why PD decided to tombstone 1965 on store 216 (tikv-3); this only happens when another, newer region covers the range of 1965, but I could not find any such region in the log.

@overvenus I suggest we add an info log in PD that prints any overlapping regions while building the range tree, and then wait for this problem to occur again.
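
Something like the following could work, as a sketch only: it uses a plain sorted slice and the standard library logger instead of PD's actual range tree and structured logger, and the `Region` type and example ranges are simplified.

```go
package main

import (
	"bytes"
	"log"
	"sort"
)

// Region is a simplified stand-in for the region metadata PD collects
// from store reports during unsafe recovery.
type Region struct {
	ID       uint64
	StartKey []byte
	EndKey   []byte // empty means +inf
}

// overlaps reports whether two key ranges intersect.
func overlaps(a, b Region) bool {
	aEndsBeforeB := len(a.EndKey) != 0 && bytes.Compare(a.EndKey, b.StartKey) <= 0
	bEndsBeforeA := len(b.EndKey) != 0 && bytes.Compare(b.EndKey, a.StartKey) <= 0
	return !aEndsBeforeB && !bEndsBeforeA
}

// insertWithLog adds a region to a sorted slice standing in for the range
// tree, logging every existing region it overlaps. This would make the
// "newer region shadows older region, older one gets tombstoned" decisions visible.
func insertWithLog(tree []Region, r Region) []Region {
	for _, e := range tree {
		if overlaps(e, r) {
			log.Printf("unsafe recovery: region %d [%q, %q) overlaps region %d [%q, %q)",
				r.ID, r.StartKey, r.EndKey, e.ID, e.StartKey, e.EndKey)
		}
	}
	tree = append(tree, r)
	sort.Slice(tree, func(i, j int) bool {
		return bytes.Compare(tree[i].StartKey, tree[j].StartKey) < 0
	})
	return tree
}

func main() {
	// Made-up ranges: the second insert overlaps the first and would be logged.
	tree := insertWithLog(nil, Region{ID: 1965, StartKey: []byte("b"), EndKey: []byte("d")})
	insertWithLog(tree, Region{ID: 2991, StartKey: []byte("a"), EndKey: []byte("d")})
}
```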

@overvenus
Member Author

Besides adding logs, can we check whether all regions have a quorum of replicas alive before exiting unsafe recovery?
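
A rough sketch of that check, assuming simplified `Region`/`Peer` types and a plain set of failed store IDs rather than PD's real metadata (the store IDs in `main` are hypothetical example values):

```go
package main

import "fmt"

// Peer and Region are simplified stand-ins for PD's region metadata.
type Peer struct {
	ID      uint64
	StoreID uint64
}

type Region struct {
	ID    uint64
	Peers []Peer
}

// hasQuorum reports whether a region still has a majority of its peers
// on stores outside the failed set.
func hasQuorum(r Region, failedStores map[uint64]bool) bool {
	alive := 0
	for _, p := range r.Peers {
		if !failedStores[p.StoreID] {
			alive++
		}
	}
	return 2*alive > len(r.Peers)
}

// regionsWithoutQuorum lists regions that still lack a live quorum; unsafe
// recovery should arguably not be reported as finished while this is non-empty.
func regionsWithoutQuorum(regions []Region, failedStores map[uint64]bool) []uint64 {
	var bad []uint64
	for _, r := range regions {
		if !hasQuorum(r, failedStores) {
			bad = append(bad, r.ID)
		}
	}
	return bad
}

func main() {
	failed := map[uint64]bool{1: true, 2: true} // hypothetical failed store IDs
	regions := []Region{
		{ID: 1965, Peers: []Peer{{1967, 1}, {1968, 216}, {2783, 45}}},
	}
	fmt.Println(regionsWithoutQuorum(regions, failed)) // prints [] when all regions keep quorum
}
```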

ti-chi-bot pushed a commit that referenced this issue on Sep 8, 2023
…6959)

ref #6859

Add log for overlapping regions in unsafe recovery.

We were unable to find the root cause of #6859; adding this log may help us better identify the issue by printing out the regions that overlap with each other, which causes some of them to be marked as tombstone.

Signed-off-by: Yang Zhang <[email protected]>