
Replication Meetup 8/20-8/22 Tasks

NOTE: in terms of "size", the R-prefixed tasks are larger than the C-prefixed tasks, which are larger than the L-prefixed tasks (R > C > L).

L1 - Streaming AAE Exchange

Depends: None

The AAE-based process for determining differences between two clusters is known to be extremely slow due to the number of round trips necessary to perform an AAE tree exchange. The goal of this work is to improve the speed at which AAE tree exchanges can be performed between two clusters.

Two options were discussed:

  1. a new streaming compare mode for hashtree.erl (similar to the work previously done by Joe on synctree.erl)
  2. adding pipelining to riak_repl_aae_fssource.erl

In the end, while we all seemed to agree that a true streaming interface (1) would be better, we decided a pipelined interface (2) would be sufficient and would not require a change to the wire protocol (no need for a new repl protocol version).

The pipelining work mostly involves changing the synchronous semantics currently coded into send_synchronous_request in riak_repl_aae_fssource.erl.
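
A rough sketch of the pipelining idea, in heavily simplified form. The module name, the MaxInFlight window, send_request/2 and the {segment_reply, ...} message shape are all assumptions for illustration; the real riak_repl_aae_fssource messages and wire framing differ.

    %% Sketch only: keep up to MaxInFlight compare requests outstanding instead
    %% of waiting for each reply before sending the next request.
    -module(pipeline_sketch).
    -export([exchange/3]).

    exchange(Socket, Segments, MaxInFlight) ->
        pipeline(Socket, Segments, 0, MaxInFlight).

    pipeline(_Socket, [], 0, _Max) ->
        done;
    pipeline(Socket, [], InFlight, Max) ->
        ok = receive_reply(),                     %% no more requests: drain replies
        pipeline(Socket, [], InFlight - 1, Max);
    pipeline(Socket, [Seg | Rest], InFlight, Max) when InFlight < Max ->
        ok = send_request(Socket, Seg),           %% fire the next request immediately
        pipeline(Socket, Rest, InFlight + 1, Max);
    pipeline(Socket, Segments, InFlight, Max) ->
        ok = receive_reply(),                     %% window full: wait for one reply
        pipeline(Socket, Segments, InFlight - 1, Max).

    send_request(Socket, Seg) ->
        gen_tcp:send(Socket, term_to_binary({compare_segment, Seg})).  %% illustrative framing

    receive_reply() ->
        receive
            {segment_reply, _Seg, _Hashes} -> ok  %% assumed reply shape
        end.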

L2 - Separate Difference Detection from Repair and Improve Repair

Depends: None

The current implementation, especially riak_repl_aae_fssource.erl, conflates the two stages of a fullsync: difference generation and repair. Not only is this an architectural blemish, it has severe performance consequences in the worst case. Due to the nature of AAE exchanges, a pair of local and remote partitions perform sum(all_nvals) exchanges (for a total of sum(all_nvals) * ringsize exchanges). Instead of taking the union of the diffs generated by the exchanges, riak_repl_aae_fssource performs a "bloom fold" (a full backend scan) for each exchange, resulting in a worst case of sum(all_nvals) * ringsize folds per fullsync.

This work will modify riak_repl_aae_fssource to perform the repair separately from the exchanges (effectively operating on the union of their diffs). We discussed implementing this as a "stream" where riak_repl_aae_fssource sends a message to the "repair process", which may act on it immediately or may decide to repair the key along with many others at a later time. The repair process should repair the differences it has been made aware of regardless of crashes in processes earlier in the pipeline (e.g. riak_repl_aae_fssource or the fullsync coordinator/scheduler).

In addition to reducing the potential number of folds, it is also known that folding over the data may be suboptimal. The goal of the "bloom fold" is to do a sequential scan over a part of the disk instead of random reads. For a large number of differences this is optimal; for a smaller number of differences (and perhaps some other cases) it is not. The new repair code will be responsible for determining the strategy to use. Initially, we discussed simply doing random reads for the first 100 keys we are asked to repair and then switching to the bloom fold.
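
A minimal sketch of the separated repair process and the strategy switch. The direct_repair/1 and bloom_fold_repair/1 helpers are placeholders, and the message shapes are assumptions; the 100-key threshold is the one discussed above.

    %% Sketch only: repair runs in its own process, decoupled from the exchanges.
    %% The first 100 keys are repaired with direct (random) reads; anything past
    %% that is accumulated and repaired with a single sequential bloom fold.
    -module(repair_sketch).
    -export([start_link/0, init/0]).

    -define(DIRECT_READ_LIMIT, 100).

    start_link() ->
        {ok, spawn_link(?MODULE, init, [])}.

    init() ->
        loop(0, []).

    loop(Repaired, Deferred) ->
        receive
            {diff, BKey} when Repaired < ?DIRECT_READ_LIMIT ->
                direct_repair(BKey),                   %% small diff count: random reads are fine
                loop(Repaired + 1, Deferred);
            {diff, BKey} ->
                loop(Repaired + 1, [BKey | Deferred]); %% defer to one backend fold
            exchanges_complete ->
                bloom_fold_repair(Deferred)            %% one fold over the union of all diffs
        end.

    direct_repair(_BKey) -> ok.      %% placeholder: get the object and ship it to the sink
    bloom_fold_repair(_BKeys) -> ok. %% placeholder: the existing bloom-fold strategy, run once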

L2.5 - Qualify, Benchmark, and Resilience Testing

Depends: L1, L2

After completing L1 and L2 we believe we will be in a much better place to address the ongoing customer issues. This work covers qualifying that belief, as well as further testing and the development of a repeatable "test harness" -- most likely improving riak_test for testing repl, and possibly supporting large-scale testing.

At its simplest level, qualification means AAE fullsync must be runnable by Trifork.

L3 - Keyspace Scheduler

Depends: L2.5

Another optimization of the AAE-based difference detection and repair is to perform exchanges and repairs per keyspace rather than per partition. For the initial body of work here, a keyspace and a preflist are the same thing (we are assuming equal ring sizes and N values at the source and sink). This will require a new "Keyspace Scheduler", which will most likely involve rewriting or greatly refactoring the fscoord. The keyspace scheduler attempts to use a single partition to repair multiple keyspaces -- performing on the order of a coverage plan rather than on the order of the ring size. The best case is ringsize * (1 / sum(n_vals)) exchanges (for example, roughly 21 exchanges for a ring size of 64 and a single n_val of 3), which is a big improvement over ringsize exchanges.
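
A sketch of the scheduling idea only, not of the real fscoord: given which keyspaces (preflists) each partition participates in, greedily pick partitions so that each exchange covers as many not-yet-covered keyspaces as possible. The Owners input shape is an assumption.

    %% Sketch only: choose a small set of partitions whose exchanges cover every
    %% keyspace, instead of exchanging every partition.
    -module(keyspace_sched_sketch).
    -export([plan/1]).

    %% Owners :: [{Partition, [Keyspace]}] -- the keyspaces each partition can repair.
    plan(Owners) ->
        plan(Owners, ordsets:new(), []).

    plan(Owners, Covered, Plan) ->
        Remaining = [{P, ordsets:subtract(ordsets:from_list(Ks), Covered)}
                     || {P, Ks} <- Owners],
        Sorted = lists:sort(fun({_, A}, {_, B}) -> length(A) >= length(B) end, Remaining),
        case Sorted of
            [{P, Gain} | _] when Gain =/= [] ->
                %% exchange partition P; it repairs every keyspace in Gain at once
                plan(Owners, ordsets:union(Covered, Gain), [{P, Gain} | Plan]);
            _ ->
                lists:reverse(Plan)   %% every keyspace is covered by some planned exchange
        end.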

We discussed several types of full-sync that would be supported after these changes:

  • normal - performs a keyspace-level exchange/repair w/o local AAE first
  • all - like normal but requires local AAE first
  • clone - like all but no difference detection, skip remote exchange and assume everything is different

L4 - Improved Visibility into How Up-to-Date Clusters Are

Depends: L2.5, Ri

Operators would like answers to questions like "should I perform a full-sync?" or "how up-to-date am I?". We decided the best answer we can give is of the form "at least as up-to-date as an hour ago", or equivalently "all writes from an hour ago or earlier are guaranteed to be in this cluster". We can do this by sharing the times at which we started each exchange. If done before or concurrently with L3, the timestamp we provide will instead need to be the time of the last local AAE exchange between the members of the keyspace. Effectively, we maintain a matrix of cluster pairs, each pair having an idea of how up-to-date it is with the other.
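
A minimal sketch of the bookkeeping, assuming an ets table keyed by cluster pair; as noted below, the real store might instead be another key in Riak or global metadata (Ri).

    %% Sketch only: record when an exchange with a given sink started, and answer
    %% "all writes before this time are guaranteed to be on the sink".
    -module(uptodate_sketch).
    -export([new/0, record_exchange_start/3, up_to_date_as_of/3]).

    new() ->
        ets:new(exchange_starts, [set, public]).

    record_exchange_start(Tab, SourceCluster, SinkCluster) ->
        true = ets:insert(Tab, {{SourceCluster, SinkCluster}, os:timestamp()}),
        ok.

    up_to_date_as_of(Tab, SourceCluster, SinkCluster) ->
        case ets:lookup(Tab, {SourceCluster, SinkCluster}) of
            [{_, StartedAt}] -> {ok, StartedAt};   %% "at least as up-to-date as StartedAt"
            []               -> unknown
        end.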

There was also an additional level of precision we could provide by looking at the number of dropped writes in the rtq. However, to be truly accurate we would need to fix the fact that an acked rtq write can fail on the remote side without being counted as dropped (or so we think).

One other decision to be made is whether to store this information as yet another key in Riak or as metadata in a global metadata system (Ri).

L5 - On-Demand AAE Trees

Depends: L2.5

In the case where intra-cluster AAE is not enabled, operators cannot use AAE fullsync. If we make all of these improvements, we want operators to be able to take advantage of them without being forced to turn AAE on for local repairs. To do this, we will build AAE trees on demand before an exchange (performing a fold over all the data). This is a) no worse than the current keylist strategy that operators must use when intra-cluster AAE is disabled, and b) a bit more similar to C*-style repair (not consequential, but a nice frame of reference).

This can also be extended to support extremely paranoid cases where we don't want to initially trust the trees we have (we assume they can be corrupted on disk).
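
A sketch of the on-demand build under stated assumptions: fold_objects/3, tree_new/1 and tree_insert/3 are placeholders standing in for the real backend fold and the hashtree.erl operations.

    %% Sketch only: build the AAE tree just before the exchange by folding over
    %% the partition's data once and hashing every object into a fresh tree.
    -module(ondemand_tree_sketch).
    -export([build_tree/1]).

    build_tree(Partition) ->
        fold_objects(Partition,
                     fun(BKey, ObjBin, Tree) ->
                             Hash = erlang:phash2(ObjBin),   %% stand-in for the real object hash
                             tree_insert(BKey, Hash, Tree)
                     end,
                     tree_new(Partition)).

    %% Placeholders for the backend fold and the hashtree operations.
    tree_new(_Partition) -> [].
    tree_insert(BKey, Hash, Tree) -> [{BKey, Hash} | Tree].
    fold_objects(_Partition, _FoldFun, Acc) -> Acc.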

CE - AAE Tree Changes for Massive Full Sync Improvements

Depends: L3, Rv

L6 - Q/N-Independent Full Sync

Depends: L3. Grouped with: L8.

To perform a full-sync between two clusters it is a current requirement that they have the same ring size (Q) and number of replicas (N). This work would free us of that limitation. The details of this work will be documented in a separate write-up.

This will also require changes to the new Keyspace Scheduler.

L8 - Don't Do Useless Work When We Don't Replicate

Depends: L3. Grouped with: L6.

In the case where we don't replicate a bucket, the current hashtree implementation will always detect differences for its objects. This work will address that by extending hashtree to do incremental hashing and to have tagged hashes for specific purposes. More details will be included in the write-up mentioned in L6.
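
One possible reading of the tagged-hash idea, sketched with made-up names (the real design is deferred to the L6 write-up): keep a sub-hash per purpose so that objects in non-replicated buckets contribute the same constant to both clusters' trees and can never produce a diff.

    %% Sketch only: per-purpose ("tagged") hashes for a single object.
    -module(tagged_hash_sketch).
    -export([object_hashes/2, hash_for/2]).

    object_hashes(BKey, ObjBin) ->
        ReplHash = case replicated(BKey) of
                       true  -> erlang:phash2(ObjBin);
                       false -> 0          %% constant on both sides, so never a difference
                   end,
        [{local_aae, erlang:phash2(ObjBin)},  %% intra-cluster AAE still sees the object
         {repl, ReplHash}].                   %% fullsync exchanges compare only this tag

    hash_for(Tag, TaggedHashes) ->
        proplists:get_value(Tag, TaggedHashes).

    replicated({_Bucket, _Key}) -> true.      %% placeholder for the real repl-enabled check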

L7 - Bi-Directional Full Sync

Depends: L2.5

While necessary, the source/sink abstraction can lead to redundant work in the case where two clusters embody both roles. If differences are detected on the sink side, the source will not repair them; it is not until the roles switch and another exchange is performed that the difference is repaired. In these cases we should just repair during the initial exchange.
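
A sketch of acting on both directions of a single exchange; the diff tuple shapes and the push/pull helpers are assumptions for illustration.

    %% Sketch only: repair in both directions during the same exchange instead of
    %% waiting for the source/sink roles to swap.
    -module(bidir_sketch).
    -export([repair_diff/1]).

    repair_diff({remote_missing, BKey}) ->
        push_to_sink(BKey);                    %% sink lacks it: send our copy
    repair_diff({local_missing, BKey}) ->
        pull_from_sink(BKey);                  %% we lack it: fetch the sink's copy
    repair_diff({different, BKey}) ->
        push_to_sink(BKey),                    %% exchange both copies and let the
        pull_from_sink(BKey).                  %% normal object merge resolve them

    push_to_sink(_BKey) -> ok.   %% placeholder: ship the local object to the sink
    pull_from_sink(_BKey) -> ok. %% placeholder: request the sink's object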

L0 - Realtime Connection Balancing

Depends: None

In v3 repl, realtime connections are chosen randomly, which can lead to large imbalances. Previous implementations did a better job of ensuring good balance. The goal of this work is to bring that back to v3 realtime replication.
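
A minimal sketch of the balancing goal, assuming we track how many realtime connections we already hold to each candidate sink node (the Counts map is hypothetical).

    %% Sketch only: pick the least-connected candidate instead of a random one.
    -module(rt_balance_sketch).
    -export([choose_sink/2]).

    %% Counts :: #{Node => number of realtime connections we already have to Node}
    choose_sink(Candidates, Counts) ->
        %% assumes at least one candidate
        [{_Load, Node} | _] =
            lists:sort([{maps:get(N, Counts, 0), N} || N <- Candidates]),
        Node.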

L-1 - Realtime Failed Remote Put Stats

Depends: None

In order to accurately track "how up-to-date" a cluster is, we track the number of dropped realtime messages. However, we do not believe we count, at least not uniquely, the realtime messages that reach a sink and are acked but then have the put fail. This work will confirm and, if necessary, fix this.

CA - Fix Existing "riak_net" Work

Depends: Riv

The APIs and general design of the lowest layers of "riak_net" (the connection and service management parts of riak_repl) are solid; the implementation is not. This work will bring the code and implementation in line with the architecture originally designed by Jon Meredith, including:

  • properly advertising protocol vs. service: the current system says something like "give me a realtime connection" when what we really want is "give me a TCP connection for the purposes of realtime"; the current implementation conflates the two, which has a combinatorial effect (see the sketch after this list)
  • well-written Erlang using OTP
  • modifying the existing codebase to use the new code
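
A sketch of the intended shape of the API under stated assumptions: the caller asks for a transport connection for a named service, rather than for a service-specific connection type. connect/3, locate/2 and the hello frame are illustrative only, not the real riak_repl connection manager API.

    %% Sketch only: one generic "connect" that is parameterised by the service
    %% (realtime, fullsync, proxy_get, ...) it will be used for.
    -module(riak_net_api_sketch).
    -export([connect/3]).

    connect(Cluster, Service, Opts) ->
        {ok, {Host, Port}} = locate(Cluster, Service),   %% assumed service/locator lookup
        {ok, Socket} = gen_tcp:connect(Host, Port, [binary, {active, false} | Opts]),
        ok = gen_tcp:send(Socket, term_to_binary({hello, Service})),  %% advertise the purpose
        {ok, Socket}.

    locate(_Cluster, _Service) ->
        {ok, {"127.0.0.1", 9080}}.   %% placeholder resolver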

CB - Generalize RTQ

Depends: CA

The message queue used by realtime replication can be used for other sub-systems, e.g. handoff or cluster metadata, where we want to continue to perform work in the face of temporary connection loss or want backpressure at the sender. This work will generalize the queue so that it can continue to be used by realtime and others. Some known tasks:

  • the queue is a locally registered process
  • the queue should have a mode that blocks rather than drops on enqueue (for handoff folds we can use the queue for backpressure); see the sketch after this list
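
A sketch of the two enqueue behaviours described in the list above; the options map and return values are assumptions, not the rtq API.

    %% Sketch only: 'drop' sheds the oldest entry when full (what realtime does
    %% today), 'block' tells the producer to wait, giving backpressure
    %% (e.g. for handoff folds).
    -module(gen_queue_sketch).
    -export([enqueue/3]).

    %% enqueue(Item, Q, Opts) -> {ok, NewQ} | {wait, Q}
    enqueue(Item, Q, Opts = #{max := Max}) ->
        case queue:len(Q) < Max of
            true ->
                {ok, queue:in(Item, Q)};
            false ->
                case maps:get(mode, Opts, drop) of
                    drop ->
                        {_, Q1} = queue:out(Q),       %% drop the oldest item
                        {ok, queue:in(Item, Q1)};
                    block ->
                        {wait, Q}                     %% caller retries after consumers drain
                end
        end.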

CC - Peer (Multi-Hop) Communication

Depends: CB

The current "riak_net" implementation (what lives in riak_repl) is all about point-to-point connections and communication. For future work it is desirable to be able to address any member of a cluster topology without having a direct link to it. This work will facilitate "connections" between any two peers regardless of whether they have a direct link.
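
A sketch of the addressing idea, assuming a per-node routing table that either marks a destination as directly reachable or names a next hop; building and maintaining that table is exactly the work described here.

    %% Sketch only: deliver directly when we have a link, otherwise forward the
    %% message toward the destination via the next hop.
    -module(peer_route_sketch).
    -export([send/3]).

    %% Routes :: #{DestCluster => direct | {via, NextHopCluster}}
    send(DestCluster, Msg, Routes) ->
        case maps:get(DestCluster, Routes, undefined) of
            direct         -> deliver(DestCluster, Msg);
            {via, NextHop} -> deliver(NextHop, {forward, DestCluster, Msg});
            undefined      -> {error, no_route}
        end.

    deliver(_Cluster, _Msg) -> ok.   %% placeholder for an existing point-to-point send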

CD - Message Passing Communication

Depends: CC

Building on the ability to connect to a peer regardless of the existence of a direct link, this work will provide an interface with similar semantics to Erlang's message passing. This will allow code like riak_ensemble to operate globally, regardless of cluster topology.
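
A sketch of the desired calling convention only, with made-up names: send a term to a registered process in another cluster and let the routing layer (CC) worry about how it gets there.

    %% Sketch only: message-passing semantics in the spirit of erlang:send/2,
    %% addressed by {Cluster, RegisteredName} rather than by node.
    -module(peer_send_sketch).
    -export([send/3]).

    send({Cluster, Name}, Msg, Router) ->
        %% The router (see CC) moves the envelope to the destination cluster,
        %% possibly over several hops, then delivers it to the named process.
        Router ! {route, Cluster, {deliver_to, Name, Msg}},
        ok.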

Ri - Global Metadata

This research explores what is necessary to extend Cluster Metadata to be global. This would facilitate the replication of e.g. bucket types and security information in addition to allowing the storage of statistics globally for quick and easy access. Some challenges include:

  • how to handle tiered configuration: global, local, node (one possible resolution order is sketched after this list)
  • how to handle things like security grants for a local network
  • how to deal with pre-existing clusters joining an existing topology and conflicts
  • does some metadata need to be globally strongly consistent? locally?
  • what about data that should be replicated to parts of a topology but not others
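
A sketch of one way the tiered configuration in the first bullet could resolve (node overrides cluster-local overrides global); the lookup/2 store is a placeholder, and the precedence order itself is one of the open questions.

    %% Sketch only: most specific tier wins.
    -module(tiered_md_sketch).
    -export([get/3]).

    get(Key, Node, Cluster) ->
        first_defined([lookup({node, Node}, Key),
                       lookup({local, Cluster}, Key),
                       lookup(global, Key)]).

    first_defined([undefined | Rest]) -> first_defined(Rest);
    first_defined([Value | _])        -> {ok, Value};
    first_defined([])                 -> undefined.

    lookup(_Tier, _Key) -> undefined.   %% placeholder for a metadata store read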

Rii - riak_core_broadcast for cascading realtime

It was determined that the riak_core_broadcast subsystem, which implements Plumtree, could most likely be used to provide a new implementation of cascading realtime replication. This would allow the re-use of existing work instead of developing a parallel spanning-tree library, which has its own failure detection and healing. riak_core_broadcast will require some changes to allow it to function more like the original Plumtree algorithm: specifically, not requiring exchanges or "ignored_i_have" messages -- putting items in the lazy queue but sending them once without waiting for an ignore ack.
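
A sketch of the behaviour described above, with made-up names (not the riak_core_broadcast API): eager peers get the payload, lazy peers get a single i_have with no ack-driven resend.

    %% Sketch only: one-shot lazy notification, no "ignored_i_have" bookkeeping.
    -module(broadcast_sketch).
    -export([broadcast/4]).

    broadcast(MsgId, Msg, EagerPeers, LazyPeers) ->
        [send(P, {broadcast, MsgId, Msg}) || P <- EagerPeers],
        [send(P, {i_have, MsgId}) || P <- LazyPeers],   %% sent once; no retry scheduled
        ok.

    send(_Peer, _Msg) -> ok.   %% placeholder for the transport layer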

Riii - Global Strong Consistency

Riv - riak_net writeup

Rv - Poly Sync
