
Vote txs per block are less than 1 vote per validator #1851

Open
AshwinSekar opened this issue Jun 25, 2024 · 11 comments

@AshwinSekar
Problem

Given a cluster of fully performant validators, we would expect 1 vote per validator in each block. This is not what we observe in practice. There is also a reduction in votes between epochs prior to 578 and current epochs. Similarly, there is a large discrepancy in landed vote transactions across the slots of a leader's window.

Analysis

A sample of 10k slots from epochs 577 and 628 shows that there are fewer vote txs per block:

|                       | Epoch 577 (249303455 to 249313455) | Epoch 628 (271289183 to 271299183) |
| --------------------- | ---------------------------------- | ---------------------------------- |
| Total vote txs        | 9325696                            | 8703855                            |
| Avg vote txs per slot | 932.5696                           | 870.3855                           |

Interestingly, we see that in 628 there are more vote transactions landing in the first leader slot, while there are fewer for the second and third leader slots. There is a sharp decline in the 4th leader slot that is consistent across both 577 and 628:
[figure: landed vote txs per leader slot, epochs 577 vs 628]

Breaking this down by latency, we see that there are fewer latency 1 votes (votes for the immediately previous slot) in general. The 4th leader slot has very few latency 1 votes in comparison:
[figures: latency breakdown of landed votes per leader slot, epochs 577 and 628]

One possible explanation for the 4th leader slot having so few latency 1 votes could be the vote tpu sending logic, which selects the leader at a 2 slot offset as the forwarding target:

pub const FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET: u64 = 2;

This implies that a vote for the 3rd leader slot can only land in the 4th leader slot through gossip.
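A minimal sketch of the arithmetic, assuming 4 consecutive slots per leader window and that the vote is sent to the leader of `vote_slot + 2`; the `leader_window` helper is illustrative, not the actual sending code:

```rust
// Illustrative sketch only: shows why a vote for the 3rd slot of a 4-slot
// leader window is targeted at the *next* leader, so the current leader's
// 4th slot can only pick the vote up via gossip.
const NUM_CONSECUTIVE_LEADER_SLOTS: u64 = 4;
const FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET: u64 = 2;

/// Hypothetical helper: index of the 4-slot leader window containing `slot`.
fn leader_window(slot: u64) -> u64 {
    slot / NUM_CONSECUTIVE_LEADER_SLOTS
}

fn main() {
    let vote_slot = 2; // 3rd slot of leader A's window (slots 0..=3)
    let tpu_target = vote_slot + FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET;
    // The vote is sent to the leader of slot 4, i.e. the next leader,
    // so it cannot land in slot 3 (latency 1) through the tpu path.
    assert_ne!(leader_window(vote_slot), leader_window(tpu_target));
}
```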

Another explanation for overall poor tx inclusion is related to forks. If the next leader chooses to build off a different parent, votes for the previous leader's slots will fail the slot hashes check. As we've modified banking stage to hold only the latest vote per validator, even if there was an earlier vote for a parent slot that has not yet been included on this fork, that vote has no chance to land. Also, since we only send the vote tx to the tpu of the slot + 2 leader, these votes can only land through gossip for the next leader that decides to build off of the main fork.
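Roughly, the slot hashes check behaves like the sketch below (an illustrative stand-in for the vote program's check, not the real implementation): a vote only lands if the slot and hash it references are in the ancestry of the bank the leader is building on.

```rust
use std::collections::HashMap;

type Slot = u64;
type Hash = u64; // placeholder hash type for this sketch

// Stand-in for the SlotHashes sysvar of the leader's working bank: the recent
// ancestors of the fork the leader chose to build on.
fn vote_can_land(slot_hashes: &HashMap<Slot, Hash>, vote_slot: Slot, vote_hash: Hash) -> bool {
    slot_hashes.get(&vote_slot) == Some(&vote_hash)
}

fn main() {
    // Leader builds off slot 57, skipping a minor fork containing slots 58-59.
    let slot_hashes: HashMap<Slot, Hash> = [(56, 0xaa), (57, 0xbb)].into_iter().collect();
    // A vote referencing the abandoned slot 58 fails the check and is dropped;
    // because banking stage keeps only the latest vote per validator, an older
    // vote it replaced cannot be retried either.
    assert!(!vote_can_land(&slot_hashes, 58, 0xcc));
}
```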

We can see this in action here: slots 271409560-63 were a minor fork, meaning any votes for 58-59 could only have landed through gossip on 64.
[figure: fork view around slots 271409558-64]

Solutions

  • Allow votes for the 3rd leader slot to be sent to both the current leader and next leader through tpu.
  • Rework the banking stage vote ingestion to be smarter about the slot hashes check and continue to hold onto votes that do not pass the check for the current working bank, as a leader could switch forks during their 4 leader slots.

Another possibility is that replay is not keeping up and vote transactions are not being sent in time for inclusion. I will follow up with some more individual vote timing metrics.

@AshwinSekar

Restricting the sample to only rooted blocks slightly improves vote numbers, but not by much:
[screenshot: vote tx counts restricted to rooted blocks]

@AshwinSekar

AshwinSekar commented Jun 25, 2024

To analyze replay, we can look at the ~2050 validators that report metrics to see when the vote tx for slot S was created. For the purpose of this example, we consider that the vote tx for slot S will land in S + 1 if it was created before the my_leader_slot metric for S + 2, minus 100 ms to account for latency.
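As a rough sketch of that heuristic (field names here are illustrative, not the actual metric schema):

```rust
// Sketch of the "expected to land in S + 1" heuristic described above.
const NETWORK_LATENCY_ALLOWANCE_MS: u64 = 100;

/// `vote_created_ms`: when the validator created its vote tx for slot S.
/// `leader_slot_start_ms`: the my_leader_slot timestamp reported for slot S + 2.
fn expected_to_land_in_next_slot(vote_created_ms: u64, leader_slot_start_ms: u64) -> bool {
    vote_created_ms + NETWORK_LATENCY_ALLOWANCE_MS <= leader_slot_start_ms
}
```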
Here are results from the previous small sample range in epoch 628:
[screenshot: votes expected to land per leader slot, derived from reported metrics]
This does not line up with the number of latency 1 votes scraped from the ledger for the same range:
[screenshot: latency 1 votes per leader slot, scraped from the ledger]

Some slot ranges, such as 53 - 55, show a drastic difference between the votes that were expected to land and what actually landed.

NOTE: the replay results for the 4th leader slot will be skewed due to the tpu issue mentioned previously.

@AshwinSekar

Not too much vote deduplication during this range
[screenshot: vote dedup metrics for the sample range]

@AshwinSekar

One possible explanation for the 4th leader slot having so few latency 1 votes could be the vote tpu sending logic, which selects the leader at a 2 slot offset as the forwarding target:

pub const FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET: u64 = 2;

This implies that a vote for the 3rd leader slot can only land in the 4th leader slot through gossip.

When testing this out in practice, it seems that the vote can be sent to the leader at a 3 slot offset or more. next_leader uses the poh_recorder, which is based on the last reset bank and is not necessarily in sync with the vote_bank. Adding tpu logging to local_cluster::test_spend_and_verify_all_nodes_3 (a 3 node cluster with no intentional forking), we see this is the case:

| Leader Slot | Total votes | # of votes sent to leader + 2 | # of votes sent to leader + 3 | # of votes sent to a leader that could not land it in latency 1 |
| ----------- | ----------- | ----------------------------- | ----------------------------- | ---------------------------------------------------------------- |
| 1           | 58          | 38                            | 20                            |                                                                  |
| 2           | 56          | 40                            | 16                            | 16                                                               |
| 3           | 58          | 41                            | 17                            | 58                                                               |
| 4           | 58          | 36                            | 22                            |                                                                  |

In the presence of forks, the reset bank could be on a completely different fork than the vote bank, causing even poorer inclusion.
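A simplified sketch of the desync (illustrative only; the real next_leader path consults the leader schedule via the poh_recorder rather than this toy arithmetic):

```rust
// The tpu target is derived from the poh_recorder's slot (which tracks the
// last reset bank), not from the slot actually being voted on, so the
// effective offset from the vote slot can exceed 2.
const FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET: u64 = 2;

fn tpu_target_slot(poh_slot: u64) -> u64 {
    poh_slot + FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET
}

fn main() {
    let vote_slot = 100;
    let poh_slot = 101; // PoH already reset onto a newer bank than the vote bank
    // Effective offset from the voted slot is 3, so the leader of
    // vote_slot + 1 never receives this vote over the tpu path.
    assert_eq!(tpu_target_slot(poh_slot) - vote_slot, 3);
}
```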

@AshwinSekar

AshwinSekar commented Jun 25, 2024

Ran a small patch against mainnet (on a non-voting validator) to log which slot's leader we would send a vote tx to; we see a larger range of desync between the slot selected through the poh_recorder and the vote_slot. The columns are (slot whose leader we sent the vote to) - vote_slot:

| Leader Slot | -6 | -5 | -2 | -1 | 0 | 1  | 2     | 3     | 4    | 5   | 6   | 7   | 8  | 9  | 10 | 11 | 12 | 13 | Total |
| ----------- | -- | -- | -- | -- | - | -- | ----- | ----- | ---- | --- | --- | --- | -- | -- | -- | -- | -- | -- | ----- |
| 1           | 1  |    | 12 | 17 | 2 | 3  | 698   | 8595  | 2276 | 711 | 581 | 191 | 82 | 21 | 5  | 3  | 2  | 2  | 13202 |
| 2           |    | 2  | 9  | 17 | 2 | 28 | 4481  | 8998  | 81   | 18  | 18  | 3   |    | 1  |    |    |    |    | 13658 |
| 3           |    |    | 1  | 2  |   | 2  | 5493  | 8153  | 32   | 3   |     |     |    |    |    |    |    |    | 13686 |
| 4           |    |    |    |    |   |    | 5423  | 8243  | 23   | 3   |     |     |    |    |    |    |    |    | 13692 |
| Total       | 1  | 2  | 22 | 36 | 4 | 33 | 16095 | 33989 | 2412 | 735 | 599 | 194 | 82 | 22 | 5  | 3  | 2  | 2  | 54238 |

Note that there is a larger amount of variability for the earlier leader slots, and it seems like the reset bank converges during the final leader slot.
This also gives us a smaller # of vote txs that could land with latency 1:

| Leader Slot | # vote txs not sent to the leader of the next slot | # vote txs sent to the leader of the next slot | % of votes sent to the wrong leader |
| ----------- | --------------------------------------------------- | ---------------------------------------------- | ----------------------------------- |
| 1           | 3904                                                 | 9298                                           | 29.57 %                              |
| 2           | 9130                                                 | 4528                                           | 66.84 %                              |
| 3           | 13681                                                | 5                                              | 99.96 %                              |
| 4           | 3                                                    | 13689                                          | 0.02 %                               |
| Total       | 26718                                                | 27520                                          | 49.26 %                              |

This means that of the ~55k slots voted on, 49% of them were sent to a leader which ensured that the vote could not land in the next slot without the assistance of forwarding or gossip.

This could just mean that replay is not able to keep up half of the time. Will follow up with more replay metrics.

@AshwinSekar

AshwinSekar commented Aug 15, 2024

Linking #2607 (send to poh_slot + 1 and poh_slot + 2)

Also #2605 fixes a bug in retryable vote packets which will improve inclusion

Edit: also efforts in here #2183 should slightly improve inclusion during forks

@StaRkeSolanaValidator

StaRkeSolanaValidator commented Aug 15, 2024

Hi @AshwinSekar. Is there any plan to backfill vote txs after we fail over to the heaviest fork? I understand that would increase vote inclusion as well, even though I'm not sure if that would help consensus. Thanks!

@AshwinSekar

It doesn't increase vote inclusion in this context, as you can't retroactively add votes to blocks that have already been produced.
It is risky to backfill, as you are artificially increasing your lockout on whatever fork you choose to backfill. If for whatever reason you need to switch off this fork, you will have to wait longer.
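As a toy illustration of the lockout concern (simplified Tower BFT style doubling; this is not the actual tower code):

```rust
// Each additional vote stacked on a fork doubles the lockout of the votes
// beneath it, so backfilled votes extend how long you must wait before you
// are allowed to vote on a different fork.
fn lockout(confirmation_count: u32) -> u64 {
    2u64.pow(confirmation_count)
}

fn main() {
    // With 3 stacked votes, the oldest vote is locked out for 8 slots;
    // backfilling two more votes on that fork pushes it to 32.
    assert_eq!(lockout(3), 8);
    assert_eq!(lockout(5), 32);
}
```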

@AshwinSekar

#2607 and #2605 are present in v2.0.7, which has 63% adoption on testnet in epoch 683. Here's a comparison with some prior epochs, 671 & 672 (2.0.4).
Note: these numbers are for testnet and should not be compared to the mainnet graphs above. They also include vote transactions from firedancer, which has approximately 27% of stake:
[screenshots: per-leader-slot vote counts and latency breakdowns for epochs 671, 672, and 683]

Epoch 671

| Leader Slot | Avg Votes | Avg Latency 1 | Avg Latency 2 | Avg Latency 3 | Avg > Latency 3 |
| ----------- | --------- | ------------- | ------------- | ------------- | --------------- |
| 1           | 1,248.66  | 601.20        | 543.42        | 4.51          | 99.52           |
| 2           | 632.89    | 550.98        | 42.31         | 16.32         | 23.29           |
| 3           | 682.71    | 519.40        | 147.90        | 2.91          | 12.50           |
| 4           | 209.01    | 31.33         | 145.77        | 8.77          | 23.14           |

Epoch 672

| Leader Slot | Avg Votes | Avg Latency 1 | Avg Latency 2 | Avg Latency 3 | Avg > Latency 3 |
| ----------- | --------- | ------------- | ------------- | ------------- | --------------- |
| 1           | 1,245.18  | 603.35        | 541.97        | 3.95          | 95.91           |
| 2           | 615.44    | 552.21        | 38.76         | 5.72          | 18.75           |
| 3           | 680.63    | 520.67        | 147.24        | 2.47          | 10.24           |
| 4           | 207.94    | 30.78         | 145.57        | 8.71          | 22.89           |

Epoch 683

| Leader Slot | Avg Votes | Avg Latency 1 | Avg Latency 2 | Avg Latency 3 | Avg > Latency 3 |
| ----------- | --------- | ------------- | ------------- | ------------- | --------------- |
| 1           | 959.76    | 596.19        | 260.96        | 3.68          | 98.92           |
| 2           | 556.30    | 525.29        | 18.80         | 3.13          | 9.08            |
| 3           | 679.25    | 503.99        | 165.15        | 2.90          | 7.21            |
| 4           | 596.80    | 382.95        | 191.43        | 6.75          | 15.67           |

We have a huge increase in votes (and latency 1 votes specifically) for the 4th leader slot. I believe this can be attributed to #2607 sending to poh_slot + 1 🎉 .

@bw-solana

CC @ilya-bobyr - You may find this interesting given your investigation into leader targeting

@bw-solana

Let's assume an ideal state with no network delays, forking, or skipped slots, and that we have the fix to send to slots +1 and +2.
I think the targeting ends up looking like so:

| Vote for Block | PoH slot | Leader Slot | Leader |
| -------------- | -------- | ----------- | ------ |
| 0              | 1        | 2,3         | A      |
| 1              | 2        | 3,4         | A,B    |
| 2              | 3        | 4,5         | B      |
| 3              | 4        | 5,6         | B      |
| 4              | 5        | 6,7         | B      |
| 5              | 6        | 7,8         | B,C    |
| 6              | 7        | 8,9         | C      |
| 7              | 8        | 9,10        | C      |

This seems overly pessimistic. It could be even better if we also included a slot offset of 0:

| Vote for Block | PoH slot | Leader Slot | Leader |
| -------------- | -------- | ----------- | ------ |
| 0              | 1        | 1,2,3       | A      |
| 1              | 2        | 2,3,4       | A,B    |
| 2              | 3        | 3,4,5       | A,B    |
| 3              | 4        | 4,5,6       | B      |
| 4              | 5        | 5,6,7       | B      |
| 5              | 6        | 6,7,8       | B,C    |
| 6              | 7        | 7,8,9       | B,C    |
| 7              | 8        | 8,9,10      | C      |

The only net difference is that we'll try to include votes for:

  1. block 2 into leader A (slot 3)
  2. block 6 into leader B (slot 7)

This is a slight increase from 5 vote txs to 6 vote txs sent across the network for every 4 blocks.
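A quick sketch that reproduces these tables under the same idealized assumptions (the vote for block N goes out during PoH slot N + 1, 4 consecutive slots per leader, leaders rotating A, B, C, ...); this is illustrative arithmetic, not validator code:

```rust
const NUM_CONSECUTIVE_LEADER_SLOTS: u64 = 4;

// Leaders rotate A, B, C, ... every 4-slot window in this idealized model.
fn leader_for(slot: u64) -> char {
    (b'A' + (slot / NUM_CONSECUTIVE_LEADER_SLOTS) as u8) as char
}

fn main() {
    for block in 0..8u64 {
        let poh_slot = block + 1; // vote for block N is sent during PoH slot N + 1
        // Offsets {0, 1, 2}; drop the 0 offset to reproduce the first table.
        let target_slots: Vec<u64> = (0..=2).map(|off| poh_slot + off).collect();
        let leaders: Vec<char> = target_slots.iter().map(|&s| leader_for(s)).collect();
        println!("block {block}: targets {target_slots:?} -> leaders {leaders:?}");
    }
}
```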
