
Vote txs per block are less than 1 vote per validator #1851

Open
AshwinSekar opened this issue Jun 25, 2024 · 11 comments

@AshwinSekar
Problem

Given a cluster of fully performant validators, we would expect 1 vote per validator in each block. This is not what we observe in practice. There is also a reduction in votes between epochs prior to 578 and current epochs. Similarly, there is a large discrepancy in landed vote transactions across the slots of a leader's window.

Analysis

A sample of 10k slots from epochs 577 and 628 shows that there are fewer vote txs per block:

|                       | Epoch 577 (249303455 to 249313455) | Epoch 628 (271289183 to 271299183) |
| --------------------- | ---------------------------------- | ---------------------------------- |
| Total vote txs        | 9325696                            | 8703855                            |
| Avg vote txs per slot | 932.5696                           | 870.3855                           |

Interestingly, we see that in 628 there are more vote transactions landing in the first leader slot, while there are fewer for the second and third leader slots. There is a sharp decline in the 4th leader slot that is consistent across both 577 and 628:
[figure: landed vote txs per leader slot, epochs 577 vs 628]

Breaking this down by latency, we see that there are fewer latency 1 votes (votes for the immediately previous slot) in general. The 4th leader slot has very few latency 1 votes in comparison:
[figures: latency breakdown of landed votes per leader slot, epochs 577 and 628]

One possible explanation for the 4th leader slot having so few latency 1 votes could be the vote tpu sending logic, which selects the leader at a 2 slot offset as the forwarding target:

pub const FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET: u64 = 2;

This implies that a vote for the 3rd leader slot can only land in the 4th leader slot through gossip.
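A minimal sketch of the arithmetic, assuming 4 consecutive slots per leader window and that the vote is sent to the leader of `vote_slot + 2`; the `leader_window` helper is illustrative, not the actual sending code:

```rust
// Illustrative sketch only: shows why a vote for the 3rd slot of a 4-slot
// leader window is targeted at the *next* leader, so the current leader's
// 4th slot can only pick the vote up via gossip.
const NUM_CONSECUTIVE_LEADER_SLOTS: u64 = 4;
const FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET: u64 = 2;

/// Hypothetical helper: index of the 4-slot leader window containing `slot`.
fn leader_window(slot: u64) -> u64 {
    slot / NUM_CONSECUTIVE_LEADER_SLOTS
}

fn main() {
    let vote_slot = 2; // 3rd slot of leader A's window (slots 0..=3)
    let tpu_target = vote_slot + FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET;
    // The vote is sent to the leader of slot 4, i.e. the next leader,
    // so it cannot land in slot 3 (latency 1) through the tpu path.
    assert_ne!(leader_window(vote_slot), leader_window(tpu_target));
}
```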

Another explanation for overall poor tx inclusion is related to forks. If the next leader chooses to build off a different parent, votes for the previous leader's slots will fail the slot hashes check. As we've modified banking stage to hold only the latest vote per validator, even if there was an earlier vote for a parent slot that has not yet been included on this fork, that vote has no chance to land. Also, since we only send the vote tx to the tpu of the slot + 2 leader, these votes can only land through gossip for the next leader that decides to build off of the main fork.
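Roughly, the slot hashes check behaves like the sketch below (an illustrative stand-in for the vote program's check, not the real implementation): a vote only lands if the slot and hash it references are in the ancestry of the bank the leader is building on.

```rust
use std::collections::HashMap;

type Slot = u64;
type Hash = u64; // placeholder hash type for this sketch

// Stand-in for the SlotHashes sysvar of the leader's working bank: the recent
// ancestors of the fork the leader chose to build on.
fn vote_can_land(slot_hashes: &HashMap<Slot, Hash>, vote_slot: Slot, vote_hash: Hash) -> bool {
    slot_hashes.get(&vote_slot) == Some(&vote_hash)
}

fn main() {
    // Leader builds off slot 57, skipping a minor fork containing slots 58-59.
    let slot_hashes: HashMap<Slot, Hash> = [(56, 0xaa), (57, 0xbb)].into_iter().collect();
    // A vote referencing the abandoned slot 58 fails the check and is dropped;
    // because banking stage keeps only the latest vote per validator, an older
    // vote it replaced cannot be retried either.
    assert!(!vote_can_land(&slot_hashes, 58, 0xcc));
}
```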

We can see this in action here: slots 271409560-63 were a minor fork, meaning any votes for 58-59 could only have landed through gossip on 64.
[figure: fork view around slots 271409558-64]

Solutions

  • Allow votes for the 3rd leader slot to be sent to both the current leader and next leader through tpu.
  • Rework the banking stage vote ingestion to be smarter about the slot hashes check and continue to hold onto votes that do not pass the check for the current working bank, as a leader could switch forks during their 4 leader slots.

Another possibility is that replay is not keeping up and vote transactions are not being sent in time for inclusion. I will follow up with some more individual vote timing metrics.

@AshwinSekar

Restricting the sample to only rooted blocks slightly improves vote numbers, but not by much:
[screenshot: vote tx counts restricted to rooted blocks]

@AshwinSekar

AshwinSekar commented Jun 25, 2024

To analyze replay, we can look at the ~2050 validators that report metrics to see when the vote tx for slot S was created. For the purpose of this example, we consider that the vote tx for slot S will land in S + 1 if it was created before the my_leader_slot metric for S + 2, minus 100 ms to account for latency.
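As a rough sketch of that heuristic (field names here are illustrative, not the actual metric schema):

```rust
// Sketch of the "expected to land in S + 1" heuristic described above.
const NETWORK_LATENCY_ALLOWANCE_MS: u64 = 100;

/// `vote_created_ms`: when the validator created its vote tx for slot S.
/// `leader_slot_start_ms`: the my_leader_slot timestamp reported for slot S + 2.
fn expected_to_land_in_next_slot(vote_created_ms: u64, leader_slot_start_ms: u64) -> bool {
    vote_created_ms + NETWORK_LATENCY_ALLOWANCE_MS <= leader_slot_start_ms
}
```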
Here are results from the previous small sample range in epoch 628:
[screenshot: votes expected to land per leader slot, derived from reported metrics]
This does not line up with the number of latency 1 votes scraped from the ledger for the same range:
[screenshot: latency 1 votes per leader slot, scraped from the ledger]

Some slot ranges, such as 53 - 55, show a drastic difference between the votes that were expected to land and what actually landed.

NOTE: the replay results for the 4th leader slot will be skewed due to the tpu issue mentioned previously.

@AshwinSekar

Not too much vote deduplication during this range
[screenshot: vote dedup metrics for the sample range]

@AshwinSekar

One possible explanation for the 4th leader slot having so few latency 1 votes could be the vote tpu sending logic, which selects the leader at a 2 slot offset as the forwarding target:

pub const FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET: u64 = 2;

This implies that a vote for the 3rd leader slot can only land in the 4th leader slot through gossip.

When testing this out in practice, it seems that the vote can be sent to the leader at a 3 slot offset or more. next_leader uses the poh_recorder, which is based on the last reset bank and is not necessarily in sync with the vote_bank. Adding tpu logging to local_cluster::test_spend_and_verify_all_nodes_3 (a 3 node cluster with no intentional forking), we see this is the case:

| Leader Slot | Total votes | # of votes sent to leader + 2 | # of votes sent to leader + 3 | # of votes sent to a leader that could not land it in latency 1 |
| ----------- | ----------- | ----------------------------- | ----------------------------- | ---------------------------------------------------------------- |
| 1           | 58          | 38                            | 20                            |                                                                  |
| 2           | 56          | 40                            | 16                            | 16                                                               |
| 3           | 58          | 41                            | 17                            | 58                                                               |
| 4           | 58          | 36                            | 22                            |                                                                  |

In the presence of forks, the reset bank could be on a completely different fork than the vote bank, causing even poorer inclusion.
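A simplified sketch of the desync (illustrative only; the real next_leader path consults the leader schedule via the poh_recorder rather than this toy arithmetic):

```rust
// The tpu target is derived from the poh_recorder's slot (which tracks the
// last reset bank), not from the slot actually being voted on, so the
// effective offset from the vote slot can exceed 2.
const FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET: u64 = 2;

fn tpu_target_slot(poh_slot: u64) -> u64 {
    poh_slot + FORWARD_TRANSACTIONS_TO_LEADER_AT_SLOT_OFFSET
}

fn main() {
    let vote_slot = 100;
    let poh_slot = 101; // PoH already reset onto a newer bank than the vote bank
    // Effective offset from the voted slot is 3, so the leader of
    // vote_slot + 1 never receives this vote over the tpu path.
    assert_eq!(tpu_target_slot(poh_slot) - vote_slot, 3);
}
```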

@AshwinSekar

AshwinSekar commented Jun 25, 2024

Ran a small patch against mainnet (on a non-voting validator) to log which slot's leader we would send a vote tx to; we see a larger range of desync between the slot selected through the poh_recorder and the vote_slot. The columns are (slot whose leader we sent the vote to) - vote_slot:

| Leader Slot | -6 | -5 | -2 | -1 | 0 | 1  | 2     | 3     | 4    | 5   | 6   | 7   | 8  | 9  | 10 | 11 | 12 | 13 | Total |
| ----------- | -- | -- | -- | -- | - | -- | ----- | ----- | ---- | --- | --- | --- | -- | -- | -- | -- | -- | -- | ----- |
| 1           | 1  |    | 12 | 17 | 2 | 3  | 698   | 8595  | 2276 | 711 | 581 | 191 | 82 | 21 | 5  | 3  | 2  | 2  | 13202 |
| 2           |    | 2  | 9  | 17 | 2 | 28 | 4481  | 8998  | 81   | 18  | 18  | 3   |    | 1  |    |    |    |    | 13658 |
| 3           |    |    | 1  | 2  |   | 2  | 5493  | 8153  | 32   | 3   |     |     |    |    |    |    |    |    | 13686 |
| 4           |    |    |    |    |   |    | 5423  | 8243  | 23   | 3   |     |     |    |    |    |    |    |    | 13692 |
| Total       | 1  | 2  | 22 | 36 | 4 | 33 | 16095 | 33989 | 2412 | 735 | 599 | 194 | 82 | 22 | 5  | 3  | 2  | 2  | 54238 |

Note that there is a larger amount of variability for the earlier leader slots, and it seems like the reset bank converges during the final leader slot.
This also gives us a smaller # of vote txs that could land with latency 1:

| Leader Slot | # vote txs not sent to the leader of the next slot | # vote txs sent to the leader of the next slot | % of votes sent to the wrong leader |
| ----------- | --------------------------------------------------- | ---------------------------------------------- | ----------------------------------- |
| 1           | 3904                                                 | 9298                                           | 29.57 %                              |
| 2           | 9130                                                 | 4528                                           | 66.84 %                              |
| 3           | 13681                                                | 5                                              | 99.96 %                              |
| 4           | 3                                                    | 13689                                          | 0.02 %                               |
| Total       | 26718                                                | 27520                                          | 49.26 %                              |

This means that of the ~55k slots voted on, 49% of them were sent to a leader which ensured that the vote could not land in the next slot without the assistance of forwarding or gossip.

This could just mean that replay is not able to keep up half of the time. Will follow up with more replay metrics.

@AshwinSekar

AshwinSekar commented Aug 15, 2024

Linking #2607 (send to poh_slot + 1 and poh_slot + 2)

Also #2605 fixes a bug in retryable vote packets which will improve inclusion

Edit: also efforts in here #2183 should slightly improve inclusion during forks

@StaRkeSolanaValidator

StaRkeSolanaValidator commented Aug 15, 2024

Hi @AshwinSekar. Is there any plan to backfill vote txs after we fail over to the heaviest fork? I understand that would increase vote inclusion as well, even though I'm not sure if that would help consensus. Thanks!

@AshwinSekar

It doesn't increase vote inclusion in this context, as you can't retroactively add votes to blocks that have already been produced.
It is risky to backfill, as you are artificially increasing your lockout on whatever fork you choose to backfill. If for whatever reason you need to switch off this fork, you will have to wait longer.
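As a toy illustration of the lockout concern (simplified Tower BFT style doubling; this is not the actual tower code):

```rust
// Each additional vote stacked on a fork doubles the lockout of the votes
// beneath it, so backfilled votes extend how long you must wait before you
// are allowed to vote on a different fork.
fn lockout(confirmation_count: u32) -> u64 {
    2u64.pow(confirmation_count)
}

fn main() {
    // With 3 stacked votes, the oldest vote is locked out for 8 slots;
    // backfilling two more votes on that fork pushes it to 32.
    assert_eq!(lockout(3), 8);
    assert_eq!(lockout(5), 32);
}
```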

@AshwinSekar

#2607 and #2605 are present in v2.0.7, which has 63% adoption on testnet in epoch 683. Here's a comparison with some prior epochs, 671 & 672 (2.0.4).
Note: these numbers are for testnet and should not be compared to the mainnet graphs above. They also include vote transactions from firedancer, which has approximately 27% of stake:
[screenshots: per-leader-slot vote counts and latency breakdowns for epochs 671, 672, and 683]

Epoch 671

| Leader Slot | Avg Votes | Avg Latency 1 | Avg Latency 2 | Avg Latency 3 | Avg > Latency 3 |
| ----------- | --------- | ------------- | ------------- | ------------- | --------------- |
| 1           | 1,248.66  | 601.20        | 543.42        | 4.51          | 99.52           |
| 2           | 632.89    | 550.98        | 42.31         | 16.32         | 23.29           |
| 3           | 682.71    | 519.40        | 147.90        | 2.91          | 12.50           |
| 4           | 209.01    | 31.33         | 145.77        | 8.77          | 23.14           |

Epoch 672

| Leader Slot | Avg Votes | Avg Latency 1 | Avg Latency 2 | Avg Latency 3 | Avg > Latency 3 |
| ----------- | --------- | ------------- | ------------- | ------------- | --------------- |
| 1           | 1,245.18  | 603.35        | 541.97        | 3.95          | 95.91           |
| 2           | 615.44    | 552.21        | 38.76         | 5.72          | 18.75           |
| 3           | 680.63    | 520.67        | 147.24        | 2.47          | 10.24           |
| 4           | 207.94    | 30.78         | 145.57        | 8.71          | 22.89           |

Epoch 683

| Leader Slot | Avg Votes | Avg Latency 1 | Avg Latency 2 | Avg Latency 3 | Avg > Latency 3 |
| ----------- | --------- | ------------- | ------------- | ------------- | --------------- |
| 1           | 959.76    | 596.19        | 260.96        | 3.68          | 98.92           |
| 2           | 556.30    | 525.29        | 18.80         | 3.13          | 9.08            |
| 3           | 679.25    | 503.99        | 165.15        | 2.90          | 7.21            |
| 4           | 596.80    | 382.95        | 191.43        | 6.75          | 15.67           |

We have a huge increase in votes (and latency 1 votes specifically) for the 4th leader slot. I believe this can be attributed to #2607 sending to poh_slot + 1 🎉 .

@bw-solana

CC @ilya-bobyr - You may find this interesting given your investigation into leader targeting

@bw-solana

Let's assume an ideal state with no network delays, forking, or skipped slots, and that we have the fix to send to slots +1 and +2.
I think the targeting ends up looking like so:

| Vote for Block | PoH slot | Leader Slot | Leader |
| -------------- | -------- | ----------- | ------ |
| 0              | 1        | 2,3         | A      |
| 1              | 2        | 3,4         | A,B    |
| 2              | 3        | 4,5         | B      |
| 3              | 4        | 5,6         | B      |
| 4              | 5        | 6,7         | B      |
| 5              | 6        | 7,8         | B,C    |
| 6              | 7        | 8,9         | C      |
| 7              | 8        | 9,10        | C      |

This seems overly pessimistic. It could be even better if we also included a slot offset of 0:

| Vote for Block | PoH slot | Leader Slot | Leader |
| -------------- | -------- | ----------- | ------ |
| 0              | 1        | 1,2,3       | A      |
| 1              | 2        | 2,3,4       | A,B    |
| 2              | 3        | 3,4,5       | A,B    |
| 3              | 4        | 4,5,6       | B      |
| 4              | 5        | 5,6,7       | B      |
| 5              | 6        | 6,7,8       | B,C    |
| 6              | 7        | 7,8,9       | B,C    |
| 7              | 8        | 8,9,10      | C      |

The only net difference is that we'll try to include votes for:

  1. block 2 into leader A (slot 3)
  2. block 6 into leader B (slot 7)

This is a slight increase from 5 vote txs to 6 vote txs sent across the network for every 4 blocks.
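A quick sketch that reproduces these tables under the same idealized assumptions (the vote for block N goes out during PoH slot N + 1, 4 consecutive slots per leader, leaders rotating A, B, C, ...); this is illustrative arithmetic, not validator code:

```rust
const NUM_CONSECUTIVE_LEADER_SLOTS: u64 = 4;

// Leaders rotate A, B, C, ... every 4-slot window in this idealized model.
fn leader_for(slot: u64) -> char {
    (b'A' + (slot / NUM_CONSECUTIVE_LEADER_SLOTS) as u8) as char
}

fn main() {
    for block in 0..8u64 {
        let poh_slot = block + 1; // vote for block N is sent during PoH slot N + 1
        // Offsets {0, 1, 2}; drop the 0 offset to reproduce the first table.
        let target_slots: Vec<u64> = (0..=2).map(|off| poh_slot + off).collect();
        let leaders: Vec<char> = target_slots.iter().map(|&s| leader_for(s)).collect();
        println!("block {block}: targets {target_slots:?} -> leaders {leaders:?}");
    }
}
```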
