MDC4 Independent Fullsync RFC
W.I.P. has at least one more round of edits for readability - Jon.
The fullsync algorithm ensures that the union of all data eligible for sharing between two clusters is present on both after successful completion.
This is broken down into the coordination problem of which servers need to synchronize (in the face of retries/failures) and the synchronization method itself.
Fullsync scheduling in v3 is vnode based, which is why the ring sizes must be the same. The plan for MDC4 is to switch from a vnode-based system to a keyspace system.
To make sure we can still support similar properties to the existing fullsync, we are replacing it with two types of sync: normal and clone. A normal sync exchanges a single replica of each keyspace; a clone sync completes an exchange for all replicas of the keyspace. Users may wish to perform a clone sync when decommissioning a cluster, to be sure all data is transferred off.
To support provisioning empty clusters without computing differences, we are also adding an explicit 'all' sync variant to both, where the difference step is skipped, and all data is transmitted, using either normal or clone sync.
The ring determines how many partitions Riak's objects are distributed over, splitting the hash space into a number of evenly sized keyspaces. Each keyspace has a preference list calculated for every entry in the set of active N values (the Nset).
Because Riak requires ring sizes to be powers of two, and uses the high bits of the hash value to determine the partition number, a lower power-of-two ring can subdivide its keyspaces to cover the same data as the higher power-of-two ring. For this document, call each such subdivision a subspace.
Looking at it from the vnode perspective, each vnode participates in one or more keyspaces, overlapping with other vnodes for bucket/types where N > 1. Overlapped keyspaces also overlap subspaces.
If both clusters know each other's ring sizes, Qa and Qb, they can compute the common number of subpartitions, Qmax, and work out the factor by which they need to subdivide their local keyspaces to match the remote side, such that on cluster A (with Qa < Qb), Qmax = Qa x SFa.
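As a sketch of that calculation (hypothetical module and function names; since ring sizes are powers of two, the division is exact):

```erlang
%% Sketch (hypothetical names): given the two ring sizes, compute the common
%% subspace count and the per-cluster subdivision factors.
-module(fs_subspace).
-export([subdivision/2]).

%% Returns {Qmax, SFa, SFb} such that Qa * SFa == Qb * SFb == Qmax.
subdivision(Qa, Qb) ->
    Qmax = max(Qa, Qb),
    {Qmax, Qmax div Qa, Qmax div Qb}.
```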
Each vnode contains the object data and may also contain an AAE tree for each member of the Nset when AAE is enabled. The naming is a little confusing between keyspace, vnode and AAE tree. This is how they relate, using the shortname convention of referring to each keyspace/vnode by a small integer rather than its full hash index:
Objects in a keyspace are stored on the vnodes at the next N highest indexes around the ring. AAE trees are named after the first index in the preference list.
For <Hash:160> on ring size Q, we define the keyspace as Hash >> (160 - log2(Q)).
For placement, kX maps to vnodes (X+1) rem Q through (X+N) rem Q. On each vnode, kX maps to AAE tree {(X+1) rem Q, N}.
Example of {Hash, KeySpace, PrefList, Tree} for Q=8 on a ring owned a,b,c,d,a,b,c,d:
N=3 {182687704666362864775460604089535377456991567873, k1, [{v2,nc},{v3,nd},{v4,na}], {t2,3}}
N=5 {730750818665451459101842416358141509827966271487, k3, [{v4,na},{v5,nb},{v6,nc},{v7,nd},{v0,na}], {t4,5}}
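The relation can be expressed directly; the following is a sketch with hypothetical module and function names, not an existing riak_core API:

```erlang
%% Sketch of the keyspace/preflist/tree relation described above.
-module(fs_keymap).
-export([keyspace/2, preflist/3, tree_id/3]).

%% Keyspace index for a 160-bit hash on a ring of size Q (a power of two).
keyspace(Hash, Q) ->
    Hash bsr (160 - round(math:log2(Q))).

%% Vnode indexes holding keyspace KX at n-val N: (KX+1) rem Q .. (KX+N) rem Q.
preflist(KX, Q, N) ->
    [ (KX + I) rem Q || I <- lists:seq(1, N) ].

%% AAE tree identifier: named after the first index in the preference list.
tree_id(KX, Q, N) ->
    {(KX + 1) rem Q, N}.
```

For the Q=8 example above, preflist(1, 8, 3) gives [2,3,4] (v2, v3, v4) and tree_id(1, 8, 3) gives {2,3}, matching {t2,3}.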
From this relation, for any keyspace and n-val we can create a list of candidate subspaces to be exchanged and a list of vnodes that contain the subspace for each N. Without further changes, however, we cannot know how to efficiently schedule that across nodes, or which preference lists are being impacted by background work.
The independent fullsync algorithm for MDC4 takes something similar to the existing v3 approach and enables synchronization between two clusters (a, b) with independent ring sizes (Qset = [Qa, Qb]) and independent lists of active N-values, aka Nsets (e.g. NSa = [1,2,3], NSb = [3,5]).
Planning:
Stage 1: Build the mapping of keyspaces between the two clusters. The cluster with the smaller Q subdivides its keyspaces into subspaces the size of the larger Q cluster's keyspaces. There will be max(Qset) subspaces: P total subspaces on each side, with Qa*Fa == Qb*Fb. For efficiency, Fa == 1 or Fb == 1.
Stage 2: For each cluster, work out min(NSet). From that, derive the vnode ids that cover each subspace in each cluster - the result should be a tuple of {LocalVnodeIds}, {RemoteVnodeIds}.
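A minimal sketch of the planning output, assuming the keyspace/preflist relation described earlier and that subspace S on a cluster with subdivision factor F corresponds to local keyspace S div F (hypothetical names):

```erlang
%% Sketch (hypothetical): build the per-subspace plan of covering vnode ids
%% on both clusters.
-module(fs_plan).
-export([plan/2]).

%% Each cluster is described as {Q, MinN}; returns a list of
%% {SubspaceId, {LocalVnodeIds}, {RemoteVnodeIds}} entries.
plan({Qa, MinNa}, {Qb, MinNb}) ->
    Qmax = max(Qa, Qb),
    SFa = Qmax div Qa,
    SFb = Qmax div Qb,
    [ {S, covering(S div SFa, Qa, MinNa), covering(S div SFb, Qb, MinNb)}
      || S <- lists:seq(0, Qmax - 1) ].

%% Vnode indexes covering a keyspace at n-val N, as in the relation above.
covering(KX, Q, N) ->
    list_to_tuple([ (KX + I) rem Q || I <- lists:seq(1, N) ]).
```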
Execution:
Stage 3: For each subspace, synchronize pairs of local, remote vnodeids.
How many syncs are required per subspace depends on the type of sync:
* Simple bidirectional sync: any pair
* Clone sync: all, from each sending direction
This interacts with background task management, but that will be handled as a separate project; the algorithm assumes that some sync attempts will be denied.
Steps
a) Make a list of all subspaces with the keyspace syncs required for each.
b) Pick a next sync that covers as many required syncs as possible (say the minimum N gives 3 covering vnode ids locally and 5 remotely: with vnode ids [1,2,3] and [1,2,3,4,5], it is probably good to sync {1,1}, {2,2}, {3,3}, then any of {1,4},{2,4},{3,4}, then any of {1,5},{2,5},{3,5}, so that you cover them all as quickly as you can; see the sketch after this list). Repeat until none of the current sync requests succeed, or the max fs cluster limit is reached.
c) On successful completion of any subspace sync, update the completed syncs for the subspace. If the sync criteria are satisfied, move the subspace from pending to completed. If there are any subspaces left to sync, resume (b).
d) On failure, keep track of the pair and the reason, and check whether we have exceeded an abort condition (time with no progress, number of errors) so we can terminate and restart in case of deadlocks or other software errors; resume (b) if ok.
e) Until background management is unified, periodically tick (say every 5-10s) to see if any new locks are possible.
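A sketch of the pairing heuristic from step (b), under the assumption that the goal is to touch every covering vnode id on both sides in as few syncs as possible (hypothetical module name):

```erlang
%% Sketch (hypothetical): pair up covering vnode ids so that every id on both
%% sides is covered in as few syncs as possible, as in the example in step (b).
-module(fs_pairs).
-export([pairs/2]).

%% pairs([1,2,3], [1,2,3,4,5]) -> [{1,1},{2,2},{3,3},{1,4},{2,5}]
pairs(Local, Remote) when length(Local) =< length(Remote) ->
    lists:zip(cycle(Local, length(Remote)), Remote);
pairs(Local, Remote) ->
    [ {L, R} || {R, L} <- pairs(Remote, Local) ].

%% Repeat a list cyclically until it has N elements.
cycle(List, N) ->
    lists:sublist(lists:append(lists:duplicate(1 + N div length(List), List)), N).
```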
To decide if you can execute a sync, the existing node fscoord_serv/node_reserver processes are able to provide an IP/Port to talk to. In a future iteration it would be better to have that decision made on the remote cluster, which knows which node is best placed to satisfy that portion of the subspace at that moment in time based on all of the other background activity there, rather than guessing locally and being denied. Something like collaborative fscoords, each providing a list of keyspaces they could sync with minimum nodes, and picking from that.
Each bucket can be part of replication or not depending on the fullsync configuration (if fullsync is enabled between Ca and Cb), and the bucket/type configuration (repl:fullsync or repl:both).
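A minimal sketch of that eligibility check, assuming the repl setting is available as a bucket/type property in a proplist (hypothetical function name):

```erlang
%% Sketch (hypothetical; the property shape is an assumption): decide whether
%% a bucket takes part in fullsync between two clusters, based on the
%% cluster-level fullsync setting and the bucket/type repl property.
-module(fs_eligibility).
-export([fullsync_bucket/2]).

%% FullsyncEnabled: whether fullsync is enabled between Ca and Cb.
%% BucketProps: a proplist of bucket/type properties containing a repl entry.
fullsync_bucket(false, _BucketProps) ->
    false;
fullsync_bucket(true, BucketProps) ->
    case proplists:get_value(repl, BucketProps) of
        fullsync -> true;
        both     -> true;
        _        -> false
    end.
```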
Subspace synchronization takes a local vnode/treeid, a list of N values, a sync direction, and a remote IP/port pair to synchronize with. The sync direction is either unidirectional, sending from the fullsync process's side only, or bidirectional, where remote differences are also requested.
The process breaks down into three subcomponents: difference detection, difference sending and difference requesting.
The algorithm: a) start up the difference sender/requestor; b) for each n-value, compute the differences for {treeId,N}, streaming them to the sender/requestor; c) send eof to the sender/requestor and wait until all differences have been sent/requested.
For 'all' syncs, we skip the difference computation, and let the difference sender know to just send everything.
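A sketch of the per-subspace orchestration described above. The sender, requestor and difference-computation fun are assumed components passed in by the caller, and the message protocol (send_all, eof, done) is illustrative only:

```erlang
%% Sketch (hypothetical protocol): orchestration of a single subspace sync.
-module(fs_subspace_sync).
-export([run/5]).

%% Sender/Requestor are assumed already-started processes for difference
%% sending/requesting; ComputeDiffs is a fun performing the AAE exchange for
%% one {TreeId, N} pair and streaming differences to them as it goes.
run(TreeId, NVals, Mode, {Sender, Requestor}, ComputeDiffs) ->
    case Mode of
        all ->
            %% 'all' sync: skip difference detection, send everything.
            Sender ! send_all;
        _ ->
            %% Normal path: compute differences per {TreeId, N}.
            [ ComputeDiffs(TreeId, N, Sender, Requestor) || N <- NVals ]
    end,
    %% Signal end of stream and wait for both sides to drain.
    Sender ! eof,
    Requestor ! eof,
    wait_done([Sender, Requestor]).

%% Wait for a 'done' message from each component before returning.
wait_done([]) -> ok;
wait_done([Pid | Rest]) ->
    receive
        {done, Pid} -> wait_done(Rest)
    end.
```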
Difference detection is the AAE exchange. For the two sets of objects, A and B, it computes the list of objects to send from A to B and the list of objects to send from B to A. Objects to send from one side are either missing on the remote side, or the local version dominates the remote.
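A sketch of the per-key decision, using vclock:descends/2 from riak_core as the dominance check; handling of concurrent versions is not specified here, so they are left alone in the sketch:

```erlang
%% Sketch: classify one key after comparing the two sides' versions.
-module(fs_diff_classify).
-export([classify/2]).

%% ClockA/ClockB are the object vector clocks, or 'missing' when absent.
classify(missing, missing) -> none;
classify(_ClockA, missing) -> send_a_to_b;       %% missing on remote side
classify(missing, _ClockB) -> send_b_to_a;
classify(ClockA, ClockB) ->
    case {vclock:descends(ClockA, ClockB), vclock:descends(ClockB, ClockA)} of
        {true, false} -> send_a_to_b;            %% A dominates B
        {false, true} -> send_b_to_a;            %% B dominates A
        _             -> none                    %% equal or concurrent
    end.
```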
For the easiest v3 compatibility, the ring size must be the same, but the n-value can vary. If the bucket types are incompatible, or the bucket properties disable sending objects over fullsync, then every exchange will send objects/difference requests that are denied. Things will still work, though.
To improve performance, convert the AAE algorithm already in use for local anti-entropy/MDC3 to breadth-first traversals, and pipeline requests to the remote node to avoid waiting for the round-trip time. When differences are detected they are streamed to the sender or requestor accordingly. This does not require any changes to the fs_aae_sink process or protocol, so we should be able to upgrade.
For MDC4 we need to remove the ring size restriction (through supporting subspaces), respect the individual bucket configuration (the repl:full,both,off setting), and handle ineligible data (incompatible bucket types - like strongly consistent buckets).
The proposed solution is to tag the objects in the vnode, and to maintain a logical AAE tree for each tag. The tagging system provides a function that takes an object and returns the list of tags that should be applied.
Each object involved in fullsync should be tagged by {fullsync,<>,SubSpaceId}. There will be tags for other subsystems (all for local vnode exchange, perhaps search for buckets with Search2 enabled).
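A sketch of such a tagging function (hypothetical names; the middle element of the fullsync tag is left unspecified above, so FsId below is only a stand-in for whatever identifies the fullsync relationship):

```erlang
%% Sketch (hypothetical): map an object's bucket and hash to the list of
%% logical AAE tags that apply to it.
-module(fs_tags).
-export([tags/4]).

%% Qmax is the common subspace count; FullsyncBuckets is the set of
%% bucket/types currently eligible for fullsync (assumed to be precomputed
%% from the repl properties); FsId is a stand-in identifier.
tags(Bucket, Hash, {Qmax, FsId}, FullsyncBuckets) ->
    SubSpaceId = Hash bsr (160 - round(math:log2(Qmax))),
    Fullsync = case sets:is_element(Bucket, FullsyncBuckets) of
                   true  -> [{fullsync, FsId, SubSpaceId}];
                   false -> []
               end,
    %% 'all' is the tag used for local vnode exchange.
    [all | Fullsync].
```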
Logical AAE trees can be provided on top of the existing AAE tree system by keeping an entry in the upper level tree per tag. The lower level tree is kept as is, and filtered with the tagging function.
Whenever a change is made to the fullsync configuration, the bucket properties affecting fullsync or the ring size, all upper level trees are invalidated and recomputed next sync.
Until we have global metadata, there is a problem of knowing the configuration for remote buckets. If that is not available at implementation time, we may be able to extend the existing cluster manager protocol to request configuration for a bucket/type, cache it, and on fullsync changes send a message to invalidate the remote caches.
The difference sender is responsible for receiving a stream of bkeys and efficiently requesting them from the underlying vnode. On death of the difference detection system it should still complete sending all detected differences if possible, so that the system can make progress during failures.
Difference sending may request individual objects from the backend, or it may fold across all data. The simplest implementation is to respond to the first fixed number of requests individually, or to a portion of the data if known. The difference detection algorithm could provide an estimate of the expected difference count to help switch over earlier.
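A sketch of that switch-over decision, with assumed tuning knobs for the request limit and the fraction of the vnode above which a fold is cheaper:

```erlang
%% Sketch (hypothetical thresholds): decide whether the difference sender
%% should fetch objects individually or fold over the whole vnode.
-module(fs_send_strategy).
-export([strategy/2]).

-define(MAX_INDIVIDUAL_GETS, 1000).   %% assumed tuning knob
-define(MAX_FRACTION, 0.05).          %% assumed: fold above 5% of the vnode

%% EstimatedDiffs comes from the difference detection estimate; VnodeSize may
%% be 'undefined' if the backend cannot report an object count.
strategy(EstimatedDiffs, undefined) when EstimatedDiffs =< ?MAX_INDIVIDUAL_GETS ->
    individual_gets;
strategy(_EstimatedDiffs, undefined) ->
    fold;
strategy(EstimatedDiffs, VnodeSize)
  when EstimatedDiffs =< ?MAX_INDIVIDUAL_GETS,
       EstimatedDiffs =< ?MAX_FRACTION * VnodeSize ->
    individual_gets;
strategy(_EstimatedDiffs, _VnodeSize) ->
    fold.
```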
For v3 compatibility, either the v3 fs_aae_source socket or realtime could be used to send the data unidirectionally. The realtime and fullsync sockets were demultiplexed from the v2 design so that fullsync traffic would not burst and head-of-line block realtime.
If we continue to use the fs_aae_source connection, the differences die if the socket does. We could send differences there while alive, and fall back to the RTQ on death.
For MDC4, once the RTQ code is refactored, we should create a clean mechanism for at-least-once delivery of the differences.
Difference requestor receives a stream of bkeys, retrieves the objects from the remote cluster and stores them with the local N-value.
For v3 compatibility, there is no notion of requesting remote objects in the current fs_aae_source code. If proxy get is configured, we can issue requests against the remote cluster one object at a time; otherwise we can only sync unidirectionally.
We can reuse the fullsync worker pool for the storage part.
For MDC4, we can do something very similar to difference sending: request objects individually up to a threshold. After that we can compute a bloom filter (ideally sized based on an estimate of the remote vnode size for some acceptable error rate), ship the bloom to the remote side, and retrieve all the objects.
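For the sizing step, the standard Bloom filter formulas apply; a sketch, where N is the estimated remote vnode size and P the acceptable false-positive rate:

```erlang
%% Sketch: standard Bloom filter sizing, m = -n*ln(p)/(ln 2)^2 bits and
%% k = (m/n)*ln 2 hash functions.
-module(fs_bloom).
-export([params/2]).

%% Returns {Bits, HashCount} for N expected entries at false-positive rate P.
params(N, P) when N > 0, P > 0, P < 1 ->
    Ln2 = math:log(2),
    Bits = ceil(-(N * math:log(P)) / (Ln2 * Ln2)),
    Hashes = max(1, round((Bits / N) * Ln2)),
    {Bits, Hashes}.
```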