Feature Request: vtgate handles replicas with stalled disk #17610

Open
timvaillancourt opened this issue Jan 22, 2025 · 5 comments · May be fixed by #17624

Assignees: timvaillancourt
Labels: Component: VTGate, Component: VTTablet, Type: Bug, Type: Enhancement

Comments

timvaillancourt (Contributor) commented Jan 22, 2025

Feature Description

This feature request hopes to mitigate the user impact caused by a REPLICA/RDONLY with a stalled MySQL datadir disk

In at least Vitess 19 (with MySQL 8.0.36), when the disk storing the MySQL datadir is stalled (this can be simulated with fsfreeze --freeze /mount/for/mysql/here), vtgate continues to believe the stalled replica is healthy, and the health stream updates sent to vtgate do not reflect that there is any problem. In reality, some/all application queries to the underlying mysqld essentially hang, so vtgate is sending traffic to a black hole

This sort of makes sense, because the health stream stats don't contain many metrics from which to infer the health of mysqld itself, and the updates continue to be sent while a disk is stalled, so the health stream never "times out". ReplicationLagSeconds is one metric that could indicate health, but it turns out that even if --enable-heartbeat is enabled, this lag value comes purely from SHOW REPLICA STATUS

Interestingly, on a totally-datadir-stalled mysqld (in a shard with live writes) Seconds_Behind_Source never increases, so in health stats we see ReplicationLagSeconds: 0 😱. SHOW REPLICA STATUS output:

Seconds_Behind_Source: 0

And Relay_Log_File and Relay_Log_Pos have no movement - which I believe is the reason Seconds_Behind_Source remains zero; the SQL thread is 0 seconds behind the relay logs, which have stopped receiving updates due to the stalled disk

And when you un-freeze the disk, suddenly mysqld realizes it's behind:

$ sudo fsfreeze --unfreeze /mount/for/mysql/here

SHOW REPLICA STATUS:

Seconds_Behind_Source: 2638

I consider vtgate seeing ReplicationLagSeconds: 0 here a bug. This bug could potentially be solved by considering the sidecar-based heartbeat (when --enable-heartbeat is set); that heartbeat timestamp grows stale when it stops receiving updates from the replication workers. If there is no objection, I would like to update the logic that gathers ReplicationLagSeconds to use the sidecar-based heartbeat when --enable-heartbeat is set
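
To illustrate the idea, here is a minimal standalone sketch (not the vttablet implementation) that derives lag from heartbeat staleness. It assumes the sidecar heartbeat table is _vt.heartbeat with a nanosecond-precision ts column written on every heartbeat interval; the table/column names, DSN and 2s query deadline are assumptions for illustration only:

// Minimal sketch (not Vitess code): derive replication lag from the sidecar
// heartbeat table instead of SHOW REPLICA STATUS. On a replica with a stalled
// datadir the applied ts stops advancing, so the computed lag keeps growing
// even while SHOW REPLICA STATUS claims 0 seconds behind.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// heartbeatLag returns "now minus the last replicated heartbeat timestamp".
func heartbeatLag(ctx context.Context, db *sql.DB) (time.Duration, error) {
	// A short deadline guards against the probe itself hanging on the stall.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	var tsNano int64 // assumed nanosecond epoch timestamp in _vt.heartbeat.ts
	if err := db.QueryRowContext(ctx, "SELECT ts FROM _vt.heartbeat LIMIT 1").Scan(&tsNano); err != nil {
		return 0, err
	}
	return time.Since(time.Unix(0, tsNano)), nil
}

func main() {
	db, err := sql.Open("mysql", "vt_app@tcp(127.0.0.1:3306)/") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	lag, err := heartbeatLag(context.Background(), db)
	if err != nil {
		panic(err)
	}
	fmt.Printf("heartbeat-derived replication lag: %v\n", lag)
}

In vttablet itself this would presumably plug into the existing health streamer's lag computation rather than live in a standalone program.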

Regardless of the accuracy of ReplicationLagSeconds, I feel vtgate should know sooner that a replica is unusable in this scenario. Waiting for the replication lag to grow extremely high before ignoring a stalled replica still causes impact. Ideally we'd include a metric in the health stream updates that tells vtgate more about the health of the underlying mysqld, or perhaps just set HealthError (an existing health stats field)

Some questions/braindump/ideas:

  • Does MySQL have internal metrics we can use? So far it seems somewhat unaware the world stopped, but it responds to trivial server-variable "reads" (not table data), hangs on "writes" and thinks it's "zero seconds behind" 👎
  • Code was recently added to test for stalled disks on primaries. This feature tests that a path is writable, out-of-band of mysqld. Should we repurpose that for replicas too?
    • A concern I have: the stalled disk feature is optional and requires some additional MySQL-internals context to configure correctly
    • If we made this feature seamless to users I would feel more comfortable. I.e., pretend I don't know exactly where/how MySQL stores data
    • What if there are more ways mysqld can stall?
  • If --enable-heartbeat is enabled and we know the heartbeat interval is, say, 1s, would no movement in the relay logs for that duration indicate a stall? 🤔 (rough sketch below)
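
To make that last idea concrete, here is a rough sketch of the relay-log-movement check. This is not Vitess code: it polls SHOW REPLICA STATUS with plain database/sql, and the heartbeat interval, stall threshold and DSN are illustrative assumptions:

// Rough sketch: flag a probable stall when Relay_Log_File/Relay_Log_Pos from
// SHOW REPLICA STATUS have not moved for several heartbeat intervals.
package main

import (
	"database/sql"
	"fmt"
	"strconv"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

type relayPos struct {
	file string
	pos  int64
}

// readRelayPos scans the single SHOW REPLICA STATUS row by column name,
// since the statement returns many columns we don't care about here.
func readRelayPos(db *sql.DB) (relayPos, error) {
	rows, err := db.Query("SHOW REPLICA STATUS")
	if err != nil {
		return relayPos{}, err
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		return relayPos{}, err
	}
	if !rows.Next() {
		return relayPos{}, fmt.Errorf("not a replica")
	}
	vals := make([]sql.RawBytes, len(cols))
	ptrs := make([]any, len(cols))
	for i := range vals {
		ptrs[i] = &vals[i]
	}
	if err := rows.Scan(ptrs...); err != nil {
		return relayPos{}, err
	}
	var rp relayPos
	for i, c := range cols {
		switch c {
		case "Relay_Log_File":
			rp.file = string(vals[i])
		case "Relay_Log_Pos":
			rp.pos, _ = strconv.ParseInt(string(vals[i]), 10, 64)
		}
	}
	return rp, nil
}

func main() {
	db, err := sql.Open("mysql", "vt_app@tcp(127.0.0.1:3306)/") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	const heartbeatInterval = time.Second // assumed heartbeat interval
	const stallAfter = 5                  // intervals with no relay log movement

	last, _ := readRelayPos(db) // initial position; error ignored for brevity
	stale := 0
	for range time.Tick(heartbeatInterval) {
		cur, err := readRelayPos(db)
		if err != nil {
			fmt.Println("probe error:", err)
			continue
		}
		if cur == last {
			stale++
		} else {
			stale = 0
		}
		last = cur
		if stale >= stallAfter {
			fmt.Printf("relay log has not moved for %d intervals: probable stall\n", stale)
		}
	}
}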

Deferred/shot-down ideas:

  • GTIDs would tell us we're behind, but that would require remote operations
  • Writing to the local mysqld is a no-go: REPLICAs have read_only = ON, and using sql_log_bin = 0 feels hacky

cc @GuptaManan100 as this may involve the stalled-disk PR

Use Case(s)

Users who would like vtgate to stop sending traffic (that will probably fail) to a REPLICA/RDONLY with a stalled MySQL datadir disk

timvaillancourt added the Needs Triage, Type: Bug, Type: Enhancement, Component: VTTablet and Component: VTGate labels and removed the Needs Triage label Jan 22, 2025
timvaillancourt self-assigned this Jan 22, 2025
timvaillancourt (Contributor, Author) commented Jan 22, 2025

Added Type: Bug label for the incorrect/inaccurate ReplicationLagSeconds: 0 being reported when in reality there is lag

Really, MySQL is the problem here; it should be smarter, but I'm guessing we'll have to be smarter instead 😄

timvaillancourt (Contributor, Author) commented

> Does MySQL have internal metrics we can use? So far it seems somewhat unaware the world stopped, but it responds to "reads", hangs on "writes" and thinks it's "zero seconds behind" 👎

I found some weak signals; two are global status variables, which can still be queried in this scenario:

mysql> show global status like 'Innodb_data_pending_writes'\G
*************************** 1. row ***************************
Variable_name: Innodb_data_pending_writes
        Value: 1
1 row in set (0.00 sec)

And

mysql> show global status like 'Innodb_os_log_pending_writes'\G
*************************** 1. row ***************************
Variable_name: Innodb_os_log_pending_writes
        Value: 1
1 row in set (0.00 sec)

But those being non-zero doesn't on its own mean a stall; it's just one of the few queryable signals I've found

Another is that queries to the replication applier worker tables hang, but that's not something to rely on, really:

mysql> select * from performance_schema.replication_applier_status;

(hangs forever)
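
A small probe could combine these weak signals: read the pending-write counters (which stay queryable) and run the hanging query under a short deadline, treating a timeout as the stronger signal. A sketch, with made-up DSN, thresholds and timeout, and not Vitess code:

// Sketch combining the weak signals above: global status counters that stay
// readable during a stall, plus a query known to hang, bounded by a context
// deadline.
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func probeStall(db *sql.DB) (bool, string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// These counters can still be read on a frozen datadir; a non-zero value
	// is only a hint, not proof of a stall.
	for _, v := range []string{"Innodb_data_pending_writes", "Innodb_os_log_pending_writes"} {
		var name string
		var pending int64
		err := db.QueryRowContext(ctx, "SHOW GLOBAL STATUS LIKE '"+v+"'").Scan(&name, &pending)
		if err != nil {
			return false, "", err
		}
		if pending > 0 {
			return true, name + " > 0", nil
		}
	}

	// This query hangs forever on a stalled replica, so hitting the deadline
	// is itself a signal (the exact error surfaced may depend on the driver).
	var workers int64
	err := db.QueryRowContext(ctx,
		"SELECT COUNT(*) FROM performance_schema.replication_applier_status").Scan(&workers)
	if errors.Is(err, context.DeadlineExceeded) {
		return true, "replication_applier_status query timed out", nil
	}
	return false, "", err
}

func main() {
	db, err := sql.Open("mysql", "vt_app@tcp(127.0.0.1:3306)/") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	stalled, reason, err := probeStall(db)
	if err != nil {
		panic(err)
	}
	if stalled {
		fmt.Println("possible disk stall:", reason)
	} else {
		fmt.Println("no stall signal")
	}
}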

timvaillancourt changed the title from "Feature Request: vtgate to understand replicas with stalled disk" to "Feature Request: vtgate handles replicas with stalled disk" Jan 22, 2025
timvaillancourt (Contributor, Author) commented Jan 22, 2025

Repro steps:

  1. Set up a shard with ongoing writes
    • --enable-heartbeat will produce periodic writes
  2. Run fsfreeze --freeze /path/to/mysql/datadir on a REPLICA; this prevents all progress in replication
  3. Notice on that REPLICA (even after waiting some time):
    • SHOW REPLICA STATUS shows Seconds_Behind_Source: 0
      • The relay log file/positions have no movement
    • Health stats for the tablet have ReplicationLagSeconds: 0
      • curl on vtgate's /debug/gateway endpoint will show you this (see the sketch after this list)
    • Everything else in the health stats looks ✅ (the bad REPLICA will keep getting queries 👎)
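
For step 3, a trivial way to watch what vtgate believes; the vtgate address/port and the "lag" substring filter are placeholders, and the exact field names on the debug page vary by Vitess version:

// Tiny repro helper: dump the lag-related lines from vtgate's /debug/gateway
// page so you can watch the (incorrect) zero lag the stalled replica keeps
// reporting.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:15001/debug/gateway") // placeholder address
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(strings.ToLower(line), "lag") {
			fmt.Println(line)
		}
	}
}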

GuptaManan100 (Member) commented

I think given that #17470 already added the ability to detect stalled disks, we should just use that information to mark the tablet as not serving. This would make vtgate stop routing queries to the vttablet until it has recovered from the stalled disk (at which point even the replication lag would correct its value, and we might still not send it any queries until the replication lag reduces).

GuptaManan100 (Member) commented

@timvaillancourt I've made the change I suggested ☝ in #17624. Please take a look when you have time and let me know if this would address your use case.
