BEP-56: Data compression extension #125
:BEP: 56
:Title: Data compression extension
:Version: $Revision$
:Last-Modified: $Date$
:Author: Alexander Ivanov <[email protected]>
:Status: Draft
:Type: Standards Track
:Created: 31-Sep-2021
:Post-History:

Abstract
========

This extension adds a capability for clients to negotiate and use
compression methods for data streams or torrent pieces, improving
effective bandwidth between supporting clients.

Rationale
=========

This extension would allow clients to download files faster without
using file archivers. Since large files are often pre-compressed before
torrent creation, downloaders need to keep both the archives (for
seeding) and the uncompressed files (for their own use).

Most users prefer to remove such torrents, which harms proper file
distribution. For example, organizations using BitTorrent for software
distribution need centralized storage for new customers, no matter how
many customers already have the same software.

Extension header
================

This extension uses the extension protocol (specified in `BEP 0010`_)
to advertise a client's capability of using chunk compression. It
defines the following item in the extension protocol handshake message:

+-------+-----------------------------------------------------------+
| name  | description                                               |
+=======+===========================================================+
| c     | Dictionary of supported compression algorithms, mapping   |
|       | identifiers to priorities (unsigned 8-bit integers).      |
|       | Clients can adjust priorities based on compression        |
|       | speed/ratio, hardware support, performance, power mode,   |
|       | et cetera. A priority of zero means that the compression  |
|       | algorithm is not supported or is disabled by the user;    |
|       | clients must ignore unknown algorithms.                   |
+-------+-----------------------------------------------------------+

The compression algorithm is selected by taking the dictionary item
with the highest priority from the intersection of the items supported
by both peers. If there is no suitable compression algorithm,
compression is disabled.

Example of an extension handshake message:

::

    {
        'c': {
            'p_zstd': 255,
            's_zstd': 153,
            'p_lz4': 106,
            'p_density': 70,
            's_lz4': 41,
            's_density': 37
        }
    }

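The selection rule above could be sketched as follows. Note that two
details are assumptions of this sketch, since the draft does not say
whose priority wins or how ties are resolved: here the two peers'
priorities are summed, and ties are broken by sorted identifier so that
both sides deterministically reach the same answer.

```python
def select_algorithm(local: dict, remote: dict):
    """Pick the shared compression algorithm with the best priority.

    `local` and `remote` are the 'c' dictionaries from the two peers'
    extension handshakes. A priority of 0 means disabled, and unknown
    identifiers are ignored by taking the key intersection.
    Summing priorities and tie-breaking by identifier are assumptions;
    the draft leaves both points open.
    """
    shared = {
        name: local[name] + remote[name]
        for name in local.keys() & remote.keys()
        if local[name] > 0 and remote[name] > 0
    }
    if not shared:
        return None  # empty intersection: compression stays disabled
    # Iterate in sorted order so ties resolve identically on both peers.
    return max(sorted(shared), key=lambda name: shared[name])
```

With the handshake example above on one side and a hypothetical peer on
the other, both ends would arrive at the same choice without any extra
messages.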
Compression methods
===================

The extension provides two approaches (methods) to compression, each
with its own trade-offs, so the choice between them should be made by
clients on a per-torrent basis, using the torrent's metadata
(properties like piece size).

With the **by-piece compression** method, a client must compress each
piece individually. This lowers the overall compression ratio, but the
result can be cached and reused, which is likely more efficient. If the
client caches compressed pieces in memory, they can be decompressed
when saving to disk or when sending to a peer that does not support
compression. To reduce piece re-compression, the client should raise
the current algorithm's priority during the handshake. This method is
inefficient with pieces smaller than 4 MB.
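A rough sketch of the by-piece method: compress one piece at a time and
fall back to sending it raw when compression does not help (the
raw-fallback flag follows the reviewers' suggestion and is not in the
draft). Python's standard-library ``zlib`` stands in for a negotiated
BEP-56 algorithm such as ``p_zstd``, whose bindings are not in the
standard library; the function name is illustrative.

```python
import zlib


def compress_piece(piece: bytes) -> tuple[bytes, bool]:
    """Compress a single piece independently of all other pieces.

    Returns (payload, compressed_flag). When compression would not
    shrink the piece, the original bytes are returned with the flag
    cleared, so the sender can transmit it uncompressed.
    zlib is a stand-in here for an actual BEP-56 algorithm.
    """
    compressed = zlib.compress(piece, level=3)
    if len(compressed) >= len(piece):
        return piece, False  # compression did not help: send raw
    return compressed, True
```

Because every piece is self-contained, the compressed form can be kept
in a cache and re-sent to any number of peers without recompressing.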

Review comment: There are a lot of details omitted here. This needs to
fit into the way blocks are requested and sent according to the
protocol; see http://bittorrent.org/beps/bep_0003.html. Crucially, when
you say the whole piece is compressed, do you mean that I have to
request all blocks for that piece from the same peer in order to
decompress any part of it? The offset and size specified in the request
message: is it referring to the uncompressed piece (as it does in the
current protocol), or does it refer to the compressed piece? The
requestor would need to know the compressed size of each piece in that
case, which there doesn't seem to be a mechanism to learn. It seems far
more practical to introduce a new

Reply: Thank you, I should have introduced

Clients using the **stream compression** method instead compress the
whole data stream, so the compression ratio should be higher. During
the handshake, clients should lower or raise an algorithm's priority
depending on factors expected to impact compression efficiency and
performance. This method can introduce performance issues when used on
thousands of simultaneous connections.
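A minimal sketch of the stream method, again with standard-library
``zlib`` standing in for a negotiated algorithm such as ``s_zstd``. The
class and its sync-flush framing are assumptions of this sketch, not
part of the draft; the point it illustrates is that one compression
context persists across every message on a connection, which is where
the higher ratio (and the per-connection memory cost) comes from.

```python
import zlib


class CompressedConnection:
    """One compression/decompression context per peer connection.

    Later messages can reference earlier ones through the shared
    history window, unlike the by-piece method where each piece is
    compressed from scratch. zlib is a stand-in for an actual
    BEP-56 stream algorithm.
    """

    def __init__(self) -> None:
        self._comp = zlib.compressobj(level=3)
        self._decomp = zlib.decompressobj()

    def send(self, message: bytes) -> bytes:
        # Z_SYNC_FLUSH emits all pending output so the receiver can
        # fully decode this message without waiting for the next one.
        return self._comp.compress(message) + self._comp.flush(zlib.Z_SYNC_FLUSH)

    def recv(self, payload: bytes) -> bytes:
        return self._decomp.decompress(payload)
```

Because the context is stateful, both sides must agree on exactly which
byte compression starts at, which is the synchronization problem raised
in the review comment below.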

Review comment: How do you synchronize which byte to start stream
compression at? I think you would need a message indicating that
everything past it is compressed, and you probably ought to include
which compression algorithm you picked in this message as well.

Reply: Done with

Allowed compression algorithms
------------------------------

Compression algorithms must satisfy the following requirements:

1. Decompression speed must not be lower than 500 MB/s.

Review comment: This doesn't really mean anything unless you specify
the hardware you run it on.

Reply: Totally agree. I used data from the Silesia compression corpus
and forgot to include the reference hardware.

2. It must not produce output more than 1% larger than the original piece.

Review comment: So there must be an option for the sending side to send
a block uncompressed, even if it was requested as compressed, right?

Reply: That was a short requirement list for compression algorithm
candidates of the specification.

Reply: Removed the requirement list altogether for now.

For consistency, identifiers are prefixed with ``p_`` or ``s_``
for the "piece" and "stream" compression methods respectively.

+-------------+-----------------------------+
| identifier  | compression algorithm       |
+=============+=============================+
| p_lz4       | LZ4                         |
+-------------+-----------------------------+
| s_lz4       | LZ4                         |
+-------------+-----------------------------+
| p_density   | Chameleon (DENSITY library) |
+-------------+-----------------------------+
| s_density   | Chameleon (DENSITY library) |
+-------------+-----------------------------+
| p_zstd      | Zstandard                   |
+-------------+-----------------------------+
| s_zstd      | Zstandard                   |
+-------------+-----------------------------+


This specification deliberately doesn't provide negotiation of
configuration options; the defaults must be used unless specified
otherwise.

**NOTE**: Currently, only the ``p_zstd`` and ``s_zstd`` algorithms
are required for implementation.

Review comment: What's the point of requiring this? Would it be a
problem if the negotiation resulted in an empty set of algorithms and
the normal protocol was used?

Reply: There were concerns about different clients supporting
non-overlapping sets of algorithms, so the specification should require
one that must be implemented universally. There wouldn't be a problem
if negotiation resulted in an empty set, as the compression feature
could be disabled by the user.

Review comment: This is an extension to begin with. There will be
clients not implementing it. I don't see a problem with that.

Reply: Currently, the algorithm list must be reworked, as there can be
additional options.


References
==========

.. _`BEP 0010`: http://www.bittorrent.org/beps/bep_0010.html


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:

Review comment: It seems like an unnecessary requirement that the same
algorithm is used in both directions. It also seems like it would
complicate things. The fact that there's no message to ensure the
clients agree on which algorithm is used seems risky. You don't specify
how to resolve ambiguities; there may be two algorithms that are
equally good options.

Reply: I assumed that the same algorithm is used in both directions to
simplify the negotiation process. Once both clients have shared their
dictionaries, no further messages are required. It's unlikely that two
algorithms would have the same priority on two different clients, but I
should have explained it more clearly.

Review comment: You're making it more complicated by introducing
negotiation in the first place. Unlikely things happen all the time,
especially when you have ~100 million peers.

Reply: Yeah, but that's necessary due to clients disabling or
implementing various algorithms. I tried to resolve this by taking the
TLS approach: now in ``crequest`` a client enumerates which algorithms
it is capable of, and the other client responds in ``cresponse`` with
the selected algorithms to send and receive.
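
The TLS-style offer/select exchange described in that last reply might
look roughly like this. The ``crequest``/``cresponse`` message bodies
below are purely hypothetical, since the revised draft text is not
shown here; as in TLS, one side lists what it supports in preference
order and the other side picks.

```python
def make_crequest(supported: dict) -> dict:
    """Hypothetical crequest body: list every enabled algorithm,
    most-preferred first. BEP 10 message framing is omitted."""
    enabled = [name for name, prio in supported.items() if prio > 0]
    return {'algos': sorted(enabled, key=lambda name: -supported[name])}


def make_cresponse(crequest: dict, supported: dict) -> dict:
    """Hypothetical cresponse body: the responder takes the first
    offered algorithm it also supports and applies it to both
    directions (separate per-direction choices would also fit the
    reply above; this sketch keeps them equal for simplicity)."""
    for name in crequest['algos']:
        if supported.get(name, 0) > 0:
            return {'send': name, 'recv': name}
    return {'send': None, 'recv': None}  # no overlap: no compression
```

Unlike the priority-sum scheme, this makes one side's choice explicit
and authoritative, so equal priorities can never leave the two peers
disagreeing about which algorithm is in use.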