All notable changes are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Update `maillog` flake input
- Use exact `maillog` version in `pyproject.toml`
- Send email warning when crawler runtime exceeds 12h
- Update DNS seed addresses
- Compare DNS seeds hardcoded into the crawler to those hardcoded into Bitcoin Core and output a warning if the crawler is missing seeds
- Use the new `maillog` package to send warnings about DNS seeds via email
- Add the `google-auth` Python package as a dependency
- Refactor the `Node` class: instead of passing static `NodeSettings` via the constructor, use a class variable and a method to initialize the settings once. This avoids unnecessarily piggybacking the settings throughout the code and, among other things, allows for some simplifications in the newly introduced `History` class.
- Support caching previously discovered nodes across runs and retrying them in successive runs if they are not discovered again. (This approach is necessary to generate consistent results for CJDNS nodes: since there are very few of these nodes, and since most of them are connected via other network types as well, CJDNS nodes mostly only advertise one other CJDNS node, leading to poor dissemination of CJDNS node addresses.)
- Add support for CJDNS. Timeouts can be set via the `--cjdns-{connect,message,getaddr}-timeout` command-line arguments (see the example below)
- Increase the age threshold for advertised nodes from one to two days to account for the addrman cache lifetime of around one day
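For illustration of the CJDNS timeout arguments above, a hypothetical invocation might look as follows (the `p2p-crawler` entry-point name and the timeout values are assumptions, not documented defaults):

```sh
# Hypothetical invocation; entry-point name and timeout values are assumptions
p2p-crawler \
  --cjdns-connect-timeout 30 \
  --cjdns-message-timeout 60 \
  --cjdns-getaddr-timeout 120
```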
- Remove `CrawlerSettings` from the (`repr()`-based) string representation of the `Node` class
- Add support to log addresses received in `addr` messages on a per-peer basis. This new feature is enabled via the `--record-addr-data` command-line argument (see the example below)
- Obsolete the `--record-addr-stats` option to collect timestamps for all advertised addresses, since this data can be extracted from the data collected using the newly introduced `--record-addr-data` option
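As a sketch of the two options above (entry-point name assumed), per-peer address recording now supersedes the obsoleted statistics flag:

```sh
# Hypothetical: --record-addr-data supersedes the obsoleted --record-addr-stats
p2p-crawler --record-addr-data
```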
- Log nodes to which a connection could be established but which did not complete the handshake, and introduce a `handshake_successful` field in the reachable nodes log to differentiate between nodes that successfully completed the handshake and those that did not
- Include node address network type in reachable nodes log
- Add command-line settings to configure collecting and storing address statistics. Address statistics are disabled by default.
- Fix bug during parsing of `addr` messages
- Remove logging of Tor proxy connect time (`time_connect_proxy`)
- Fix timeout issue when uploading large files to GCS
- Increase the default number of workers from 20 to 64
- Default timeouts have been changed to minimize crawler runtime while maintaining 99.9% coverage of nodes (the analysis on which the new timeouts are based is available here)
- Retry the node handshake in case it fails the first time(s). The number of handshake attempts can be specified using `--handshake-attempts` and defaults to three (see the example below)
- Increase handshake and `getaddr` timeouts for IP connections to sixty seconds
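A hypothetical invocation overriding the handshake retry behaviour described above (entry-point name assumed; three is the documented default):

```sh
# Hypothetical: retry the handshake up to five times instead of the default three
p2p-crawler --handshake-attempts 5
```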
- Provide flake support (`flake run/develop` and nix services)
- Update `README.md`
- Improve handling of storing to GCS
- Set the default GCS location to `sources/<hostname>`
- Allow specifying the GCS credentials file via the `--gcs-credentials` command-line argument (see the example below)
- Minor internal improvements (better default settings, dedicated dataclass for GCS settings, better sanity checks, etc.)
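A sketch of the GCS credentials argument above (entry-point name and file path are assumptions):

```sh
# Hypothetical: point the crawler at a GCS service-account credentials file
p2p-crawler --gcs-credentials /path/to/service-account.json
```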
- Fix socket leak where SAM sessions were created for each I2P stream instead of reusing a single one
- Fix socket file descriptor leak where connections were not properly closed when encountering connection problems
- Introduce command-line settings to parametrize the Tor proxy and I2P SAM router addresses and ports: `--tor-proxy-{host,port}` and `--i2p-sam-{host,port}`
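A hypothetical invocation using these settings (entry-point name assumed; 9050 and 7656 are the conventional Tor SOCKS and I2P SAM ports, shown here only as an example):

```sh
# Hypothetical: conventional local Tor SOCKS proxy and I2P SAM router
p2p-crawler \
  --tor-proxy-host 127.0.0.1 --tor-proxy-port 9050 \
  --i2p-sam-host 127.0.0.1 --i2p-sam-port 7656
```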
- Single-source version info (from `pyproject.toml`)
- Simplify version information (remove auto-detection of git info using the `gitpython` dependency; instead use the new `--extra-version-info` argument to specify additional build info, such as a git commit hash)
- Fix bug with result output directory breaking when using nested directories
- Replace build info (read from files) with version info (from source and git)
- Make Python `logging` timestamps use UTC
- Removed docker integration
- Added nix flake
- Removed `pandas` as a dependency (using `csv` instead)
- Refactored the P2P network crawler's source code. This includes several incompatible interface changes, as well as incompatible changes to the format of the result files written by the crawler.