MDC4_MajorTasks
NOTE: in terms of "size" R > C > L.
Depends: None
The AAE-based process for determining differences between two clusters is known to be extremely slow due to the number of round trips necessary to perform an AAE tree exchange. The goal of this work is to improve the speed at which AAE tree exchanges can be performed between two clusters.
Two options were discussed:
- a new streaming compare mode for `hashtree.erl` (similar to the work previously done by Joe on `synctree.erl`)
- adding pipelining to `riak_repl_aae_fssource.erl`
In the end, while we all seemed to agree a true streaming interface (1) would be better, a pipelined interface (2) would be sufficient and would not require a change in the wire protocol (no need for a new repl protocol version).
The pipelining work is mostly around changing the synchronous semantics currently coded into `send_synchronous_request` in `riak_repl_aae_fssource.erl`.
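To make the intent concrete, here is a minimal sketch of the pipelined flow, keeping a bounded number of requests in flight and draining replies as they arrive. The module name and the `send_request/2` / `receive_reply/1` helpers are placeholders, not the existing `riak_repl_aae_fssource` code:

```erlang
-module(pipeline_sketch).
-export([pipelined_requests/3]).

%% Instead of send -> wait -> send -> wait (one round trip per request),
%% keep up to MaxInFlight requests outstanding on the socket and consume
%% replies as they arrive.
pipelined_requests(Socket, Requests, MaxInFlight) ->
    pipeline(Socket, Requests, 0, MaxInFlight, []).

pipeline(_Socket, [], 0, _Max, Acc) ->
    lists:reverse(Acc);                              %% all replies collected
pipeline(Socket, [Req | Rest], InFlight, Max, Acc) when InFlight < Max ->
    ok = send_request(Socket, Req),                  %% fire without waiting
    pipeline(Socket, Rest, InFlight + 1, Max, Acc);
pipeline(Socket, Requests, InFlight, _Max, Acc) ->
    Reply = receive_reply(Socket),                   %% window full: drain one reply
    pipeline(Socket, Requests, InFlight - 1, _Max, [Reply | Acc]).

%% Placeholders standing in for the real wire send/receive.
send_request(_Socket, _Req) -> ok.
receive_reply(_Socket) -> reply_placeholder.
```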
Depends: None
The current implementation, especially `riak_repl_aae_fssource.erl`, conflates the two stages of a fullsync: difference generation and repair. Not only is this an architectural blemish, it has severe performance consequences in the worst case. Due to the nature of AAE exchanges, a pair of local and remote partitions perform `sum(all_nvals)` exchanges (for a total of `sum(all_nvals) * ringsize` exchanges). Instead of taking the union of the diffs generated by the exchanges, `riak_repl_aae_fssource` performs a "bloom fold" (a full backend scan) for each exchange, resulting in a worst case of `sum(all_nvals) * ringsize` folds per fullsync. For example, with a ring size of 64 and a single n_val of 3 in use, that is up to 3 * 64 = 192 full backend scans for one fullsync.
This work will modify `riak_repl_aae_fssource` to perform the repair separately from the exchanges (effectively operating on the union of their diffs). We discussed implementing this as a "stream" where `riak_repl_aae_fssource` sends a message to the "repair process", which may act on it immediately or may decide to repair the key along with many others at a later time. The repair process should repair the differences it has been made aware of regardless of crashes in processes earlier in the pipeline (e.g. `riak_repl_aae_fssource` or the fullsync coordinator/scheduler).
In addition to reducing the potential number of folds, it is also known that folding over the data may be suboptimal. The goal of the "bloom fold" is to do a sequential scan over part of the disk instead of random reads. For large numbers of differences this is optimal; for smaller numbers of differences (and perhaps some other cases) it is not ideal. The new repair code will be responsible for determining the strategy to be used. Initially, we discussed simply doing random reads for the first 100 keys we are asked to repair and then switching to the bloom fold.
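A minimal sketch of how such a standalone repair process could look, assuming hypothetical `read_repair_key/1` and `start_bloom_fold/1` primitives and the 100-key threshold discussed above; this is illustrative, not the actual riak_repl code:

```erlang
-module(repair_sketch).
-export([start_link/0, repair/2, init/0]).

-define(DIRECT_READ_LIMIT, 100).

start_link() ->
    {ok, spawn_link(?MODULE, init, [])}.

%% The exchange (e.g. riak_repl_aae_fssource) streams each difference here
%% and keeps going; the repair strategy is decided by this process, not the
%% exchange, and survives the exchange crashing.
repair(RepairPid, Key) ->
    RepairPid ! {difference, Key},
    ok.

init() ->
    loop(0, []).

loop(Seen, Pending) ->
    receive
        {difference, Key} when Seen < ?DIRECT_READ_LIMIT ->
            %% few diffs so far: random reads are cheaper than a full scan
            read_repair_key(Key),
            loop(Seen + 1, Pending);
        {difference, Key} ->
            %% many diffs: accumulate them and repair with one sequential
            %% "bloom fold" over the backend later
            loop(Seen + 1, [Key | Pending]);
        repair_complete ->
            start_bloom_fold(Pending)
    end.

%% Placeholders for the real repair primitives.
read_repair_key(_Key) -> ok.
start_bloom_fold(_Keys) -> ok.
```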
After completing L1 and L2 we believe we will be in a much better place to address the ongoing customer issues. This work consists of qualifying that claim, as well as further testing (and the development of a repeatable "test harness" -- probably making riak_test better for testing repl, and possibly supporting large-scale testing).
At its simplest level, to qualify, AAE fullsync must be runnable by Trifork.
Depends: L2.5
Another optimization of the AAE-based difference detection and repair is to perform exchanges and repairs per keyspace rather than per partition. For the initial body of work here, a keyspace and a preflist are the same thing (we are assuming equal ring sizes and N values at the source and sink). This will require the creation of a new "Keyspace Scheduler", which will most likely involve rewriting or greatly refactoring the `fscoord`. The keyspace scheduler attempts to use a single partition to repair multiple keyspaces -- at best performing on the order of a coverage plan rather than on the order of the ring size. The best case is `ringsize * (1 / sum(n_vals))` exchanges, a big improvement over `ringsize` exchanges; for example, with a ring size of 64 and a single n_val of 3, that is roughly 21 exchanges instead of 64.
We discussed several types of full-sync that would be supported after these changes:
- `normal` - performs a keyspace-level exchange/repair w/o local AAE first
- `all` - like `normal` but requires local AAE first
- `clone` - like `all` but with no difference detection: skip the remote exchange and assume everything is different
Operators would like to know answers to questions like "should I perform a full-sync?" or "how up-to-date am I?". We decided the best answer we could give is "at least as up-to-date as an hour ago" or "all writes an hour ago or before are guaranteed to be in this cluster". We can do this by sharing the times at which we started the exchange. If done before or concurrently with L3, the timestamp we provide will instead need to be the time of the last local AAE exchange between the members of the keyspace. Effectively, we determine a matrix of cluster pairs each having an idea of how up-to-date they are with each other.
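A minimal sketch of that bookkeeping, assuming the matrix is simply a map keyed by cluster pair (the module, function names, and storage are assumptions, not an existing API):

```erlang
-module(freshness_sketch).
-export([record_exchange_start/3, up_to_date_as_of/3]).

%% Record the wall-clock time at which a fullsync exchange between Source
%% and Sink started; once that exchange completes, all writes made before
%% this time are guaranteed to be on the sink.
record_exchange_start(Matrix, Source, Sink) ->
    maps:put({Source, Sink}, erlang:system_time(second), Matrix).

%% Answer "how up-to-date is Sink with respect to Source?" with the start
%% time of the last exchange, or unknown if none was run.
up_to_date_as_of(Matrix, Source, Sink) ->
    maps:get({Source, Sink}, Matrix, unknown).
```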
There was also an additional level of precision we could provide by looking at the number of dropped writes in the rtq. However, to be truly accurate we would need to fix the fact that an acked rtq write can fail on the remote side w/o being counted as dropped (or so we think).
One other decision to be made is whether to store the information as yet another key in Riak or as metadata in a global metadata system (see Ri).
Depends: L2.5
In the case where intra-cluster AAE is not enabled, operators cannot use AAE fullsync. If we make all these improvements, we want operators to be able to take advantage of them without being forced to turn AAE on for local repairs. To do this, we will build AAE trees on demand before an exchange (performing a fold over all the data). This is a) no worse than the current key list strategy that operators must use when intra-cluster AAE is disabled, and b) a bit more similar to C*-style repair (not consequential but a nice frame of reference).
This can also be extended to support extremely paranoid cases where we don't want to initially trust the trees we have (we assume they can be corrupted on disk).
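A rough sketch of the on-demand build, with the backend fold and tree operations stubbed out as placeholders (a real implementation would use the backend's fold and hashtree.erl rather than a map):

```erlang
-module(on_demand_tree_sketch).
-export([build_tree/1]).

%% One full fold over the partition's data, hashing each object into a
%% freshly built tree; no worse than the keylist strategy (also a full fold)
%% and never trusts a possibly-corrupt on-disk tree.
build_tree(Backend) ->
    fold_objects(Backend,
                 fun(BKey, ObjBin, Tree) ->
                         insert_hash(BKey, erlang:phash2(ObjBin), Tree)
                 end,
                 new_tree()).

%% Placeholders standing in for the real backend fold and tree structure.
new_tree() -> #{}.
insert_hash(BKey, Hash, Tree) -> maps:put(BKey, Hash, Tree).
fold_objects(_Backend, _Fun, Acc) -> Acc.
```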
To perform a full-sync between two clusters it is a current requirement that they have the same ring size (Q) and number of replicas (N). This work would free us of that limitation. The details of this work will be documented in a separate write-up.
This will also require changes to the new Keyspace Scheduler.
In the case where we don't replicate a bucket, we will always detect differences using the current hashtree implementation. This work will address this by extending hashtree to do incremental hashing and to carry tagged hashes for specific purposes. More details will also be included in the write-up mentioned in L6.
Depends: L2.5
While necessary, the source/sink abstraction can lead to redundant work in the case where two clusters embody both roles. If differences are detected on the sink side, the source will not repair them. It is not until the roles switch and another exchange is performed that the difference is repaired. In these cases we should just repair during the initial exchange.
Depends: None
In v3 repl, realtime connections are chosen randomly, which can lead to large imbalances. Previous implementations did a better job of ensuring good balance. The goal of this work is to bring that back to v3 realtime replication.
Depends: None
In order to accurately track "how up to date" a cluster is we track the number of dropped real-time messages. However, we do not believe we count, at least uniquely, the number of real-time messages that reach a sink, are acked, but then have a put fail. This work will confirm and fix this, if necessary.
Depends: Riv
The APIs and general design of the lowest layers of "riak_net" (the connection and service management parts of riak_repl) are solid. What is not is the implementation. This work will bring the code and implementation in line with the architecture originally designed by Jon Meredith, including:
- properly advertising protocol vs. service: the current system says something like "give me a realtime connection" when in fact we want "give me a tcp connection for the purposes of realtime". The current conflation has a combinatorial effect (every service and transport combination must be handled separately).
- well-written Erlang using OTP
- modifying the existing codebase to use the new code
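As an illustration of the protocol-vs-service split, a hypothetical entry point might look like the sketch below; the module and function are made up for illustration, not the existing connection-manager API:

```erlang
-module(riak_net_sketch).
-export([connect/2]).

%% The caller names the transport and the service separately, so adding a
%% new service does not require a new per-service connection entry point,
%% and adding a transport does not require touching every service.
connect({_Host, _Port} = Peer, {Protocol, Service})
  when Protocol =:= tcp; Protocol =:= ssl ->
    {ok, {connection, Peer, Protocol, Service}}.
```

For example, what used to be "give me a realtime connection" would become `riak_net_sketch:connect({"10.0.0.1", 9080}, {tcp, realtime})`.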
Depends: CA
The message queue used by realtime replication can be used for other sub-systems, e.g. handoff or cluster metadata, where we want to continue to perform work in the face of temporary connection loss or want backpressure at the sender. This work will generalize the queue so that it can continue to be used by realtime and others. Some known tasks:
- the queue is a locally registered process
- the queue should have a mode that blocks instead of drops on enqueue (for handoff folds we can use the queue for backpressure); see the sketch below
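A minimal sketch of such a generalized queue with drop and block modes; the module and API here are assumptions for illustration, not the existing realtime queue code:

```erlang
-module(generalized_q_sketch).
-behaviour(gen_server).

-export([start_link/2, push/2, pop/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Mode is either drop (today's realtime behaviour) or block (backpressure,
%% e.g. for handoff folds); Max bounds the number of queued items.
start_link(Mode, Max) when Mode =:= drop; Mode =:= block ->
    gen_server:start_link(?MODULE, {Mode, Max}, []).

push(Q, Item) -> gen_server:call(Q, {push, Item}, infinity).
pop(Q)        -> gen_server:call(Q, pop).

init({Mode, Max}) ->
    {ok, #{mode => Mode, max => Max, q => queue:new(), len => 0,
           dropped => 0, waiting => []}}.

%% Queue full: drop mode discards and counts, block mode parks the caller
%% until a consumer makes room.
handle_call({push, _Item}, _From,
            S = #{mode := drop, len := L, max := M, dropped := D}) when L >= M ->
    {reply, dropped, S#{dropped := D + 1}};
handle_call({push, Item}, From,
            S = #{mode := block, len := L, max := M, waiting := W}) when L >= M ->
    {noreply, S#{waiting := [{From, Item} | W]}};
handle_call({push, Item}, _From, S = #{q := Q, len := L}) ->
    {reply, ok, S#{q := queue:in(Item, Q), len := L + 1}};
handle_call(pop, _From, S = #{q := Q0, len := L, waiting := W}) ->
    case {queue:out(Q0), W} of
        {{empty, _}, _} ->
            {reply, empty, S};
        {{{value, Item}, Q1}, []} ->
            {reply, {ok, Item}, S#{q := Q1, len := L - 1}};
        {{{value, Item}, Q1}, [{From, Pending} | Rest]} ->
            %% room was made: enqueue the parked item and unblock its producer
            gen_server:reply(From, ok),
            {reply, {ok, Item}, S#{q := queue:in(Pending, Q1), waiting := Rest}}
    end.

handle_cast(_Msg, S) -> {noreply, S}.
handle_info(_Msg, S) -> {noreply, S}.
```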
Depends: CB
The current "riak_net" implementation (what lives in riak_repl) is all about point-to-point connections and communication. For future work it is desirable to be able to address any member of a cluster topology without having a direct link to it. This work will facilitate "connections" between any two peers regardless of whether they have a direct link.
Depends: CC
Building on the ability to connect to a peer regardless of the existence of a direct link, this work will provide an interface with semantics similar to Erlang's message passing. This will allow code like riak_ensemble to operate globally regardless of cluster topology.
This research explores what is necessary to extend Cluster Metadata to be global. This would facilitate the replication of e.g. bucket types and security information in addition to allowing the storage of statistics globally for quick and easy access. Some challenges include:
- how to handle tiered configuration: global, local, node.
- how to handle things like security grants for a local network
- how to deal with pre-existing clusters joining an existing topology and conflicts
- does some metadata need to be globally strongly consistent? locally?
- what about data that should be replicated to parts of a topology but not others
It was determined that the riak_core_broadcast subsystem, which implements plumtree, could most likely be used to provide a new implementation of cascading realtime replication. This would allow the re-use of existing work instead of developing a parallel spanning tree library with its own failure detection and healing. riak_core_broadcast will require some changes to allow it to function more similarly to the original plumtree algorithm, specifically not requiring exchanges nor "ignored_i_have" messages -- items still go into the lazy queue but are sent once, without waiting for an ignore ack.
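A sketch of the "send once" lazy push, for illustration only; the registered process name and message format below are assumptions, not riak_core_broadcast internals:

```erlang
-module(lazy_push_sketch).
-export([lazy_push/3]).

%% Today, lazy messages are kept outstanding and re-sent until the peer
%% acknowledges (or ignores) them. The proposed mode sends the i_have to
%% each lazy peer exactly once and retains nothing, so no ignored_i_have
%% bookkeeping or exchange is needed; a peer that is missing the message
%% can still graft it on demand.
lazy_push(MsgId, Round, LazyPeers) ->
    [{lazy_peer_proc, Peer} ! {i_have, MsgId, Round, node()} || Peer <- LazyPeers],
    ok.
```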