Feature Request: vtgate handles replicas with stalled disk #17610

Open
timvaillancourt opened this issue Jan 22, 2025 · 5 comments · May be fixed by #17624

Assignees: timvaillancourt
Labels: Component: VTGate, Component: VTTablet, Type: Bug, Type: Enhancement

Comments

timvaillancourt (Contributor) commented Jan 22, 2025

Feature Description

This feature request hopes to mitigate the user impact caused by a REPLICA/RDONLY with a stalled MySQL datadir disk

In at least Vitess 19 (with MySQL 8.0.36), when the disk storing the MySQL datadir is stalled (this can be simulated with fsfreeze --freeze /mount/for/mysql/here), vtgate continues to believe the stalled replica is healthy, and the health stream updates sent to vtgate do not reflect that there is any problem. In reality, some/all application queries to the underlying mysqld essentially hang, so vtgate is sending traffic to a black hole

This sort of makes sense, because the health stream stats don't contain many metrics from which to infer the health of mysqld itself, and the updates continue to be sent while a disk is stalled, so the health stream never "times out". ReplicationLagSeconds is one metric that could indicate health, but it turns out that even if --enable-heartbeat is enabled, this lag value comes purely from SHOW REPLICA STATUS

Interestingly, on a totally-datadir-stalled mysqld (in a shard with live writes) Seconds_Behind_Source never increases, so in health stats we see ReplicationLagSeconds: 0 😱. SHOW REPLICA STATUS output:

Seconds_Behind_Source: 0

And Relay_Log_File and Relay_Log_Pos have no movement - which I believe is the reason Seconds_Behind_Source remains zero; the SQL thread is 0 seconds behind the relay logs, which have stopped receiving updates due to the stalled disk

And when you un-freeze the disk, suddenly mysqld realizes it's behind:

$ sudo fsfreeze --unfreeze /mount/for/mysql/here

SHOW REPLICA STATUS:

Seconds_Behind_Source: 2638

I consider vtgate seeing ReplicationLagSeconds: 0 here a bug. This bug could potentially be solved by considering the sidecar-based heartbeat (when --enable-heartbeat is set); that heartbeat timestamp grows stale when it stops receiving updates from the replication workers. If there is no objection, I would like to update the logic that gathers ReplicationLagSeconds to use the sidecar-based heartbeat when --enable-heartbeat is set
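
To illustrate the idea, here is a minimal standalone sketch (not the vttablet implementation) that derives lag from heartbeat staleness. It assumes the sidecar heartbeat table is _vt.heartbeat with a nanosecond-precision ts column written on every heartbeat interval; the table/column names, DSN and 2s query deadline are assumptions for illustration only:

// Minimal sketch (not Vitess code): derive replication lag from the sidecar
// heartbeat table instead of SHOW REPLICA STATUS. On a replica with a stalled
// datadir the applied ts stops advancing, so the computed lag keeps growing
// even while SHOW REPLICA STATUS claims 0 seconds behind.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// heartbeatLag returns "now minus the last replicated heartbeat timestamp".
func heartbeatLag(ctx context.Context, db *sql.DB) (time.Duration, error) {
	// A short deadline guards against the probe itself hanging on the stall.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	var tsNano int64 // assumed nanosecond epoch timestamp in _vt.heartbeat.ts
	if err := db.QueryRowContext(ctx, "SELECT ts FROM _vt.heartbeat LIMIT 1").Scan(&tsNano); err != nil {
		return 0, err
	}
	return time.Since(time.Unix(0, tsNano)), nil
}

func main() {
	db, err := sql.Open("mysql", "vt_app@tcp(127.0.0.1:3306)/") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	lag, err := heartbeatLag(context.Background(), db)
	if err != nil {
		panic(err)
	}
	fmt.Printf("heartbeat-derived replication lag: %v\n", lag)
}

In vttablet itself this would presumably plug into the existing health streamer's lag computation rather than live in a standalone program.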

Regardless of the accuracy of ReplicationLagSeconds, I feel vtgate should know sooner that a replica is unusable in this scenario. Waiting for the replication lag to grow extremely high before ignoring a stalled replica still causes impact. Ideally we'd include a metric in the health stream updates that tells vtgate more about the health of the underlying mysqld, or perhaps just set HealthError (an existing health stats field)

Some questions/braindump/ideas:

  • Does MySQL have internal metrics we can use? So far it seems somewhat unaware the world stopped, but it responds to trivial server-variable "reads" (not table data), hangs on "writes" and thinks it's "zero seconds behind" 👎
  • Code was recently added to test for stalled disks on primaries. This feature tests that a path is writable, out-of-band of mysqld. Should we repurpose that for replicas too?
    • A concern I have: the stalled disk feature is optional and requires some additional MySQL-internals context to configure correctly
    • If we made this feature seamless to users I would feel more comfortable. I.e., pretend I don't know exactly where/how MySQL stores data
    • What if there are more ways mysqld can stall?
  • If --enable-heartbeat is enabled and we know the heartbeat interval is, say, 1s, would no movement in the relay logs for that duration indicate a stall? 🤔 (rough sketch below)
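
To make that last idea concrete, here is a rough sketch of the relay-log-movement check. This is not Vitess code: it polls SHOW REPLICA STATUS with plain database/sql, and the heartbeat interval, stall threshold and DSN are illustrative assumptions:

// Rough sketch: flag a probable stall when Relay_Log_File/Relay_Log_Pos from
// SHOW REPLICA STATUS have not moved for several heartbeat intervals.
package main

import (
	"database/sql"
	"fmt"
	"strconv"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

type relayPos struct {
	file string
	pos  int64
}

// readRelayPos scans the single SHOW REPLICA STATUS row by column name,
// since the statement returns many columns we don't care about here.
func readRelayPos(db *sql.DB) (relayPos, error) {
	rows, err := db.Query("SHOW REPLICA STATUS")
	if err != nil {
		return relayPos{}, err
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		return relayPos{}, err
	}
	if !rows.Next() {
		return relayPos{}, fmt.Errorf("not a replica")
	}
	vals := make([]sql.RawBytes, len(cols))
	ptrs := make([]any, len(cols))
	for i := range vals {
		ptrs[i] = &vals[i]
	}
	if err := rows.Scan(ptrs...); err != nil {
		return relayPos{}, err
	}
	var rp relayPos
	for i, c := range cols {
		switch c {
		case "Relay_Log_File":
			rp.file = string(vals[i])
		case "Relay_Log_Pos":
			rp.pos, _ = strconv.ParseInt(string(vals[i]), 10, 64)
		}
	}
	return rp, nil
}

func main() {
	db, err := sql.Open("mysql", "vt_app@tcp(127.0.0.1:3306)/") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	const heartbeatInterval = time.Second // assumed heartbeat interval
	const stallAfter = 5                  // intervals with no relay log movement

	last, _ := readRelayPos(db) // initial position; error ignored for brevity
	stale := 0
	for range time.Tick(heartbeatInterval) {
		cur, err := readRelayPos(db)
		if err != nil {
			fmt.Println("probe error:", err)
			continue
		}
		if cur == last {
			stale++
		} else {
			stale = 0
		}
		last = cur
		if stale >= stallAfter {
			fmt.Printf("relay log has not moved for %d intervals: probable stall\n", stale)
		}
	}
}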

Deferred/shot-down ideas:

  • GTIDs would tell us we're behind, but that would require remote operations
  • Writing to the local mysqld is a no-go: REPLICAs have read_only = ON, and using sql_log_bin = 0 feels hacky

cc @GuptaManan100 as this may involve the stalled-disk PR

Use Case(s)

Users who would like vtgate to stop sending traffic (that will probably fail) to a REPLICA/RDONLY with a stalled MySQL datadir disk

timvaillancourt added the Needs Triage, Type: Bug, Type: Enhancement, Component: VTTablet and Component: VTGate labels and removed the Needs Triage label Jan 22, 2025
timvaillancourt self-assigned this Jan 22, 2025
timvaillancourt (Contributor, Author) commented Jan 22, 2025

Added Type: Bug label for the incorrect/inaccurate ReplicationLagSeconds: 0 being reported when in reality there is lag

Really, MySQL is the problem here; it should be smarter, but I'm guessing we'll have to be smarter instead 😄

timvaillancourt (Contributor, Author) commented

> Does MySQL have internal metrics we can use? So far it seems somewhat unaware the world stopped, but it responds to "reads", hangs on "writes" and thinks it's "zero seconds behind" 👎

I found some weak signals; two are global status variables, which can still be queried in this scenario:

mysql> show global status like 'Innodb_data_pending_writes'\G
*************************** 1. row ***************************
Variable_name: Innodb_data_pending_writes
        Value: 1
1 row in set (0.00 sec)

And

mysql> show global status like 'Innodb_os_log_pending_writes'\G
*************************** 1. row ***************************
Variable_name: Innodb_os_log_pending_writes
        Value: 1
1 row in set (0.00 sec)

But those being non-zero doesn't on its own mean a stall; it's just one of the few queryable signals I've found

Another is that queries to the replication applier worker tables hang, but that's not something to rely on, really:

mysql> select * from performance_schema.replication_applier_status;

(hangs forever)
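
A small probe could combine these weak signals: read the pending-write counters (which stay queryable) and run the hanging query under a short deadline, treating a timeout as the stronger signal. A sketch, with made-up DSN, thresholds and timeout, and not Vitess code:

// Sketch combining the weak signals above: global status counters that stay
// readable during a stall, plus a query known to hang, bounded by a context
// deadline.
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func probeStall(db *sql.DB) (bool, string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// These counters can still be read on a frozen datadir; a non-zero value
	// is only a hint, not proof of a stall.
	for _, v := range []string{"Innodb_data_pending_writes", "Innodb_os_log_pending_writes"} {
		var name string
		var pending int64
		err := db.QueryRowContext(ctx, "SHOW GLOBAL STATUS LIKE '"+v+"'").Scan(&name, &pending)
		if err != nil {
			return false, "", err
		}
		if pending > 0 {
			return true, name + " > 0", nil
		}
	}

	// This query hangs forever on a stalled replica, so hitting the deadline
	// is itself a signal (the exact error surfaced may depend on the driver).
	var workers int64
	err := db.QueryRowContext(ctx,
		"SELECT COUNT(*) FROM performance_schema.replication_applier_status").Scan(&workers)
	if errors.Is(err, context.DeadlineExceeded) {
		return true, "replication_applier_status query timed out", nil
	}
	return false, "", err
}

func main() {
	db, err := sql.Open("mysql", "vt_app@tcp(127.0.0.1:3306)/") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	stalled, reason, err := probeStall(db)
	if err != nil {
		panic(err)
	}
	if stalled {
		fmt.Println("possible disk stall:", reason)
	} else {
		fmt.Println("no stall signal")
	}
}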

timvaillancourt changed the title from "Feature Request: vtgate to understand replicas with stalled disk" to "Feature Request: vtgate handles replicas with stalled disk" Jan 22, 2025
timvaillancourt (Contributor, Author) commented Jan 22, 2025

Repro steps:

  1. Set up a shard with ongoing writes
    • --enable-heartbeat will produce periodic writes
  2. Run fsfreeze --freeze /path/to/mysql/datadir on a REPLICA; this prevents all progress in replication
  3. Notice on that REPLICA (even after waiting some time):
    • SHOW REPLICA STATUS shows Seconds_Behind_Source: 0
      • The relay log file/positions have no movement
    • Health stats for the tablet have ReplicationLagSeconds: 0
      • curl on vtgate's /debug/gateway endpoint will show you this (see the sketch after this list)
    • Everything else in the health stats looks ✅ (the bad REPLICA will keep getting queries 👎)
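
For step 3, a trivial way to watch what vtgate believes; the vtgate address/port and the "lag" substring filter are placeholders, and the exact field names on the debug page vary by Vitess version:

// Tiny repro helper: dump the lag-related lines from vtgate's /debug/gateway
// page so you can watch the (incorrect) zero lag the stalled replica keeps
// reporting.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:15001/debug/gateway") // placeholder address
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(strings.ToLower(line), "lag") {
			fmt.Println(line)
		}
	}
}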

GuptaManan100 (Member) commented

I think given that #17470 already added the ability to detect stalled disks, we should just use that information to mark the tablet as not serving. This would make vtgate stop routing queries to the vttablet until it has recovered from the stalled disk (at which point even the replication lag would correct its value, and we might still not send it any queries until the replication lag reduces).

GuptaManan100 (Member) commented

@timvaillancourt I've made the change I suggested ☝ in #17624. Please take a look when you have time and let me know if this would address your use case.
