- Nov 7, 2022: initial draft (@williambanfield)
- Nov 15, 2022: draft completed (@williambanfield)
Node operators and application developers complain that Tendermint nodes consume large amounts of network bandwidth. This RFC catalogues the major sources of bandwidth consumption within Tendermint and suggests modifications that may reduce bandwidth consumption for nodes.
Multiple teams running validators in production report that their validators consume a lot of bandwidth. They report that a validator running on a network with hundreds of validators consumes multiple terabytes of bandwidth per day. Prometheus data collected from a validator node running on the Osmosis chain shows that Tendermint sends and receives large amounts of data to and from peers. In the nearly three hours of observation, Tendermint sent nearly 42 gigabytes and received about 26 gigabytes, for an estimated 366 gigabytes sent and 208 gigabytes received daily. While this falls short of the reported multiple terabytes, operators running multiple nodes in a 'sentry' pattern could easily send and receive a terabyte of data per day.
Sending and receiving large amounts of data has a cost for node operators. Most cloud platforms charge for network traffic egress. Google Cloud charges between $0.05 and $0.12 per gigabyte of egress traffic, and ingress is free. Hetzner charges €1 per terabyte used over the 10-20 TB of base bandwidth included per month, a limit that is easily exceeded if multiple terabytes are sent and received per day. Using the values collected from the validator on Osmosis, a single node on Google Cloud may cost $18 to $44 per day in egress alone (366 GB × $0.05-$0.12 per GB). On Hetzner, the estimated 18 TB per month of combined sending and receiving may cost between €0 and €10 per month per node.
To determine which components of Tendermint consume the most bandwidth, I gathered Prometheus metrics from the Blockpane validator running on the Osmosis network for several hours. The data reveal that three message types account for 98% of the total bandwidth consumed. These message types are `BlockPart`, `Txs`, and `Vote`.
The image below, of p2p data collected from the Blockpane validator, illustrates the total bandwidth consumption of these three message types.
This section discusses the usage of each of the three highest-consumption message types.
Sending `BlockPart` messages consumes the most bandwidth of all p2p message types, as observed on the Blockpane Osmosis validator. During the nearly 3-hour observation, the validator sent about 20 gigabytes of `BlockPart` messages.
A block is proposed in each round of Tendermint consensus. The Tendermint consensus paper does not define a specific way for the block to be transmitted, only that all participants will receive it via a gossip network.
The Go implementation of Tendermint transmits the block in 'parts'. It serializes the block to its wire-format proto representation, splits the resulting bytes into a set of 4-kilobyte arrays, and sends each array to its peers in a separate message.
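Below is a minimal sketch of this splitting step, assuming the 4-kilobyte part size described above; the function name and shape are illustrative, not the actual Tendermint implementation.

```go
const blockPartSize = 4096 // 4 kilobytes per block part

// splitIntoParts chunks the wire-format block bytes into part-sized arrays,
// each of which would be sent to peers in its own BlockPart message.
func splitIntoParts(blockBytes []byte) [][]byte {
	parts := make([][]byte, 0, (len(blockBytes)+blockPartSize-1)/blockPartSize)
	for start := 0; start < len(blockBytes); start += blockPartSize {
		end := start + blockPartSize
		if end > len(blockBytes) {
			end = len(blockBytes)
		}
		parts = append(parts, blockBytes[start:end])
	}
	return parts
}
```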
The logic for sending `BlockPart` messages resides in the code for the `consensus.Reactor`. The consensus reactor starts a new `gossipDataRoutine` for each peer it connects to. This routine repeatedly picks a part of the block that Tendermint believes the peer does not yet have and gossips it to the peer. The set of `BlockPart`s that Tendermint considers its peer to have is updated in only one of four ways:
- Our peer tells us they have entered a new round via a `NewRoundStep` message. This message is only sent when a node moves to a new round or height, and it resets the data we collect about the peer's block part state.
- We receive a block part from the peer.
- We send the peer a block part.
- Our peer tells us which parts of the block they have via a `NewValidBlock` message. This message is only sent when the peer has a quorum of prevotes or precommits for a block.
Each node receives block parts from all of its peers. The particular block part to send at any given time is selected at random from the set of parts that the peer is not yet known to have. Given that the cases above are the only times Tendermint learns of its peers' block parts, a node very likely has an incomplete picture of which parts its peers hold and frequently transmits block parts that a peer has already received from another node.
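A minimal sketch of this selection step follows, assuming a simple bit-array view of the parts a peer is believed to have; the type and method names are illustrative rather than the actual reactor code.

```go
import "math/rand"

// peerPartStatus tracks which block parts a peer is believed to have.
type peerPartStatus struct {
	have []bool // have[i] reports whether the peer is known to have part i
}

// pickRandomMissing returns the index of a random part the peer is not yet
// known to have, or -1 if the peer appears to have every part.
func (s *peerPartStatus) pickRandomMissing() int {
	missing := make([]int, 0, len(s.have))
	for i, h := range s.have {
		if !h {
			missing = append(missing, i)
		}
	}
	if len(missing) == 0 {
		return -1
	}
	return missing[rand.Intn(len(missing))]
}
```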
Multiple potential mechanisms exist to reduce the number of duplicate block parts a node receives. One set of mechanisms relies on more frequently communicating the set of block parts a node needs to its peers. Another potential mechanism requires a larger overhaul to the way blocks are gossiped in the network.
The Tendermint mempool stages transactions that have yet to be committed to the blockchain and communicates these transactions to its peers. Each message contains one transaction. Data collected from the Blockpane node running on Osmosis indicates that the validator sent about 12 gigabytes of `Txs` messages during the nearly 3-hour observation period.
The Tendermint mempool starts a new `broadcastTxRoutine` for each peer that it is informed of. The routine sends every transaction that the mempool is aware of to every peer, with one exception: if the mempool received a transaction from a peer, it marks the transaction as such and will not resend it to that peer. Otherwise, it retains no information about which transactions it has already sent to a peer, so in some cases it may resend transactions the peer already has. This can occur if the mempool removes a transaction from the `CList` data structure used to store the list of transactions while that transaction is about to be sent, and the transaction was the tail of the `CList` at the time of removal. This is more likely to occur if a large number of transactions at the end of the list are removed during `RecheckTx`, since multiple transactions will each become the tail and then be deleted. It is unclear at the moment how frequently this occurs on production chains.
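The sender marking described above is the one piece of per-peer state the mempool does keep. A minimal sketch of that bookkeeping is shown below; the type and method names are assumptions for illustration, not the actual mempool code.

```go
import "sync"

// txSenders records, per transaction hash, which peers have already sent us
// that transaction, so the broadcast routine can skip echoing it back.
type txSenders struct {
	mtx     sync.Mutex
	senders map[[32]byte]map[string]struct{} // tx hash -> peer IDs that sent it
}

func newTxSenders() *txSenders {
	return &txSenders{senders: make(map[[32]byte]map[string]struct{})}
}

// markSender is called when a transaction arrives from peerID.
func (ts *txSenders) markSender(txHash [32]byte, peerID string) {
	ts.mtx.Lock()
	defer ts.mtx.Unlock()
	if ts.senders[txHash] == nil {
		ts.senders[txHash] = make(map[string]struct{})
	}
	ts.senders[txHash][peerID] = struct{}{}
}

// shouldSend reports whether the transaction should be gossiped to peerID.
func (ts *txSenders) shouldSend(txHash [32]byte, peerID string) bool {
	ts.mtx.Lock()
	defer ts.mtx.Unlock()
	_, sentByPeer := ts.senders[txHash][peerID]
	return !sentByPeer
}
```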
Beyond ensuring that transactions are rebroadcast to peers less frequently, there is no simple scheme for communicating fewer transactions to peers. Peers cannot request the transactions they need, since they do not know which transactions exist on the network.
Tendermint votes, both prevotes and precommits, are central to Tendermint consensus and are gossiped by all nodes to all peers during each consensus round. Data collected from the Blockpane node running on Osmosis indicates that about 9 gigabytes of `Vote` messages were sent during the nearly 3-hour period of observation. Examination of the `Vote` message indicates that it contains 184 bytes of data, with the proto encoding adding a few additional bytes when transmitting.
The Tendermint consensus reactor starts a new `gossipVotesRoutine` for each peer that it connects to. The reactor sends all votes to all peers unless it knows that the peer already has the vote, or it learns that the peer is in a different round and the vote therefore no longer applies. Tendermint learns that a peer has a vote in one of four ways (a sketch of this bookkeeping follows the list):
- Tendermint sent the peer the vote.
- Tendermint received the vote from the peer.
- The peer sent a `HasVote` message. This message is broadcast to all peers each time a validator receives a vote it hasn't seen before for its current height and round.
- The peer sent a `VoteSetBits` message. This message is sent in response to a peer that sends a `VoteSetMaj23` message.
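The following is a minimal sketch of the per-peer vote bookkeeping that the four cases above feed into, assuming a simple bit array indexed by validator position; the names are illustrative rather than the actual reactor code.

```go
// peerVoteStatus tracks which votes at the current height and round a peer
// is known to have, indexed by validator position.
type peerVoteStatus struct {
	has []bool
}

// markHasVote records that the peer has the vote from the validator at
// valIndex, whether we sent it, received it from them, or saw a HasVote
// or VoteSetBits announcement.
func (s *peerVoteStatus) markHasVote(valIndex int) {
	if valIndex >= 0 && valIndex < len(s.has) {
		s.has[valIndex] = true
	}
}

// pickVoteToSend returns the index of a vote the peer is not yet known to
// have, or -1 if the peer appears to have every vote.
func (s *peerVoteStatus) pickVoteToSend() int {
	for i, h := range s.has {
		if !h {
			return i
		}
	}
	return -1
}
```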
Given that Tendermint informs all peers of each vote message it receives, all nodes should be well informed of which votes their peers have. Given that `Vote` messages were nonetheless the third-largest consumer of bandwidth in the observation on Osmosis, it's possible that this system is not currently working correctly. Further analysis should examine where votes are being needlessly retransmitted.
`BlockPart` messages account for, by far, the majority of the data sent to each peer. At the moment, peers do not inform the node of which block parts they already have, so each block part is very likely to be transmitted many times to each node. This wasteful consumption is even worse in networks with large blocks.
The very simple solution to this issue is to copy the technique used in consensus for informing peers when the node receives a vote. The consensus reactor can be augmented with a `HasBlockPart` message that is broadcast to each peer every time the node receives a block part. Announcing each received block part in this way would drastically reduce the amount of duplicate data sent to each node. There would be no algorithmic way to enforce that a peer accurately reports its block parts, so providing this message would be a somewhat altruistic action on the part of the node. Such a system has been proposed in the past as well, so this is certainly not new ground.
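One possible shape for this message, modeled on the existing `HasVote` message, is sketched below; the name and fields are assumptions for illustration, not an implemented Tendermint message.

```go
// HasBlockPartMessage is a hypothetical announcement that a node has
// received a block part, broadcast to all peers on receipt.
type HasBlockPartMessage struct {
	Height int64 // consensus height the block part belongs to
	Round  int32 // consensus round the block part belongs to
	Index  int   // index of the block part the sender just received
}

// applyHasBlockPart marks the announced part as known for the announcing
// peer, so the gossip routine stops selecting that part for that peer.
func applyHasBlockPart(peerHasPart []bool, msg *HasBlockPartMessage) {
	if msg.Index >= 0 && msg.Index < len(peerHasPart) {
		peerHasPart[msg.Index] = true
	}
}
```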
Measuring the volume of duplicate block parts received before and after this change would help validate this approach.
Tendermint's data is sent uncompressed on the wire: messages are not compressed before sending, and the transport performs no compression either. Some of the information communicated by Tendermint is a poor candidate for compression. Data such as digital signatures and hashes have high entropy and therefore do not compress well. However, transactions may contain a lot of lower-entropy information. Compression could be added within Tendermint at several levels, for example at the Tendermint 'packet' level or at the level of each message send.
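As a rough illustration, the sketch below compresses a serialized message just before it would be written to the wire. gzip from the Go standard library is used purely for illustration; a lighter codec might be preferable in practice, and exactly where compression is hooked in remains an open design question.

```go
import (
	"bytes"
	"compress/gzip"
)

// compressMsg gzip-compresses serialized message bytes before they are
// handed to the transport for sending.
func compressMsg(msgBytes []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(msgBytes); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```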
Block, vote, and mempool gossiping transmit much of the same data. The mempool reactor gossips candidate transactions to each peer. The consensus reactor, when gossiping votes, sends vote metadata and the digital signature that signs over that metadata. Finally, when a block is proposed, the proposing node amalgamates the received votes and a set of transactions, and adds a header to produce the block. This block is then serialized and gossiped as a list of bytes. However, the data the block contains, namely the votes and the transactions, was most likely already transmitted to the nodes on the network via mempool transaction gossip and consensus vote gossip.
Therefore, block gossip can be updated to transmit a representation of the block that assumes peers already have most of its contents. Namely, block gossip can be updated to send only 1) a list of transaction hashes and 2) a bit array of the votes selected for the block, along with the header and other required block metadata.
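A hypothetical compact block representation along these lines is sketched below, assuming peers can resolve transaction hashes against their mempools and the vote bit array against their own vote sets; all names and fields are illustrative.

```go
// CompactBlock is a hypothetical replacement for full block-part gossip,
// carrying only references to data peers are assumed to already hold.
type CompactBlock struct {
	HeaderBytes []byte   // serialized block header and other required metadata
	TxHashes    [][]byte // hashes of the transactions included in the block
	VoteBitmap  []bool   // which validators' precommits are included
}
```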
This newly proposed method for gossiping block data could accompany a slight update to mempool transaction gossip and consensus vote gossip. Since the full contents of each block will no longer be gossiped together, it's possible that some nodes are missing a proposed transaction or the vote of a validator indicated in the new block gossip format. The mempool and consensus reactors may therefore be updated to provide `NeedTxs` and `NeedVotes` messages. Each of these messages would allow a node to request a set of data from its peers. When a node receives one of these messages, it transmits the Txs/Votes indicated in the message regardless of whether it believes it has transmitted them to the peer before. The gossip layer will ensure that each peer eventually receives all of the data in the block. However, if a peer needs a transaction immediately so that it can verify and execute a block during consensus, a mechanism such as the `NeedTxs` and `NeedVotes` messages is required to ensure it receives the data quickly.
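Possible shapes for these request messages are sketched below; the message names come from the proposal above, but the fields are assumptions for illustration.

```go
// NeedTxs is a hypothetical request for the full transactions matching a
// set of hashes from a compact block the node is trying to reconstruct.
type NeedTxs struct {
	Height   int64    // height of the block being reconstructed
	TxHashes [][]byte // hashes of the transactions the node is missing
}

// NeedVotes is a hypothetical request for the precommits a compact block
// references that the node has not yet received via vote gossip.
type NeedVotes struct {
	Height     int64  // height of the block being reconstructed
	Round      int32  // round the precommits were cast in
	VoteBitmap []bool // which validators' votes the node is missing
}
```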
The same logic may be applied to evidence transmission as well, since all nodes should already receive evidence and it therefore need not be retransmitted in a block part.
A similar idea has been proposed in the past as Compact Block Propagation.