Replies: 2 comments 1 reply
-
Atomix.io is another alternative that might help to get started. It includes APIs for member discovery and many other distributed data structures (including Raft). |
Beta Was this translation helpful? Give feedback.
-
Have decided to use Apache Ignite. It can be a substitute for RocksDB as an embedded key-value store but it can also run as a master-less cluster and take care of replication, rebalancing, discovery etc... |
Beta Was this translation helpful? Give feedback.
-
See Roadmap
We need to be able to run Frontier instances as a cluster, ideally with a discovery mechanism so that the whole set of nodes does not have to be known in advance. Here are the possible options I investigated so far.
DHT
Used in P2P networks
DHT store resource locations and closest node but not a complete view of the cluster
https://en.wikipedia.org/wiki/Distributed_hash_table
“any one node needs to coordinate with only a few other nodes in the system”
“A keyspace partitioning scheme splits ownership of this keyspace among the participating nodes. An overlay network then connects the nodes, allowing them to find the owner of any given key in the keyspace.”
https://en.wikipedia.org/wiki/Kademlia
https://github.com/ep2p/kademlia-api
Problems / limitations: uses a distance function between nodes in the network. The whole set of nodes is not necessarily known.
RAFT
Raft for full cluster but clients send request to master node only + replicate everything
Raft is a consensus algorithm for managing a replicated log.
https://raft.github.io/#implementations
https://ratis.apache.org/
Ratis has a pluggable transport protocol, including gRPC
https://github.com/sofastack/sofa-jraft
Other popular implementation
Problems / limitations: Do you need to know all the peers or can it discover nodes using seeds?
Can it handle dynamic changes in the membership? Needs reload?
Need to add a WAL?
ES why not Raft -> https://www.elastic.co/blog/a-new-era-for-cluster-coordination-in-elasticsearch
Akka
https://doc.akka.io/docs/akka/current/typed/cluster.html
Good video at https://akka.io/blog/news/2020/06/01/akka-cluster-motivation
https://developer.lightbend.com/guides/akka-sample-cluster-java/
Looks to have everything we need including sharding but means another protocol involved beside gRPC?
Gossip protocol
https://github.com/scalecube/scalecube-services
Solutions used by different OS projects
https://github.com/opensearch-project/OpenSearch/tree/main/server/src/main/java/org/opensearch/discovery
https://redis.io/docs/manual/scaling/
other approaches
gRPC level - found nothing (at least in Java) for a generic cluster discovery & sharding approach?
I am also leaning towards a different option, which would be to use a scalable storage backend like e.g. TiKV and let it deal with sharding & replication. The Frontier implementation then becomes largely stateless, like in the one for Opensearch. Each Frontier instance then gets assigned section of the crawl space, as done by Redis. This is very flexible as it allows instances to be added or removed without having to move data around, only the partitions of the crawl space get reassigned.
Beta Was this translation helpful? Give feedback.
All reactions