Replies: 7 comments 17 replies
-
Vlad, thanks for your work. I have several comments. I'd like to start my comments with some thoughts about vshard and its place in the ecosystem.
I agree here with you.
Do I correctly understand that
No. The second part is more correct. Currently people usually use an external config provider to configure vshard. Sorry, I won't stop repeating it :).
I believe cryptographic hashes are good enough. In the same way we can imagine uuid collisions: possible, but the probability is too small. BTW, people store hashed passwords and aren't afraid that someone will find another password with the same hash.
But pushes are asynchronous. You don't have guarantees that your message will be delivered.
It shouldn't even be an option. We can't store the configuration inside Tarantool. In case of some issues we would have problems changing or even reading the configuration. That's the reason cartridge uses a simple yaml file for it.
Do I understand correctly that the router periodically fetches the config from all masters, and if the master has changed by the time of some RW request, then we reload the configuration and repeat the request to the new master?
Cartridge already has its own connection pool. I think vshard could reuse some parts of it and provide a "general purpose" connection pool that could also be used by other modules. (I'll agree with you if you say that it's a huge separate task and should be discussed separately.) Having read up to the "Alternatives" paragraph, I realized that some of my thoughts are definitely about it. And it doesn't even contradict the thesis about "non-tarantool modules or services", because cartridge and tarantool/conf are Tarantool modules.

P.S. Sorry, Vlad, if such a format is not quite comfortable. I could split my message into several parts and each of them could be discussed in its own thread.
-
This approach fits well with the cartridge role concept. For such a component a dedicated extra replicaset can already be allocated today, with cartridge responsible for leader election in it. And in the future native Raft and synchronous replication could fully compete with products like etcd/consul. Below is a rough implementation. For the end router client (driver), which needs to obtain the cluster configuration, bucket locations and master nodes, it is enough to implement a small subset of etcd/consul-like functionality for fetching configs:
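(A minimal illustration of what such a subset could look like, assuming a hypothetical config-storage module living on that dedicated replicaset; every function name, section name and the cursor format here are assumptions, not an existing API.)

```lua
local fiber = require('fiber')

-- Hypothetical state of the config replicaset.
local M = {
    cursor = 0,          -- grows on every accepted write
    sections = {},       -- e.g. {topology = {...}, buckets = {...}, leaders = {...}}
    changed = fiber.cond(),
}

-- Write a section; reject stale writers (e.g. a node with an old Raft term).
function M.put(section, data, expected_cursor)
    if expected_cursor ~= nil and expected_cursor < M.cursor then
        return nil, 'stale cursor'
    end
    M.sections[section] = data
    M.cursor = M.cursor + 1
    M.changed:broadcast()
    return M.cursor
end

-- Long-poll: return the requested sections once the cursor moves past the
-- one the client already has (or the timeout expires).
function M.watch(section_names, cursor, timeout)
    if cursor ~= nil and cursor >= M.cursor then
        M.changed:wait(timeout)
    end
    local res = {cursor = M.cursor, sections = {}}
    for _, name in ipairs(section_names) do
        res.sections[name] = M.sections[name]
    end
    return res
end

return M
```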
With this approach any change is delivered as fast as possible, and both a simple implementation (periodic config polling) and a more advanced one (real-time change streaming) are available. Additionally, filtering by sections can be added, so that a subscription delivers only the explicitly requested sections. Synchronizing the config from a single source is somewhat similar to the tarantool WAL. It could even be shipped as a separate independent module for tarantool on top of netbox and used wherever such synchronization is needed (instead of complex and unreliable two-phase commits). The assumption is that each config section is written by the component responsible for it, for example:

- the topology is written by the cartridge vshard role (though it could be someone else; the format should be standardized);
- bucket routing is written by the master node receiving the data on its side;
- leader election is written by the component responsible for failover in the current configuration.

The write API should support versioning of config sections and consistency control. For example, with native Raft failover, if a node tries to declare itself the leader, the write API should reject the write if another node with a higher Raft term has already registered. Any successful write to the config returns a "cursor", which can later be attached to an error. For example: a router makes a request for a bucket to a replicaset where the bucket no longer exists; it may get an error plus a config "cursor", which tells the router that it needs to read the config at least up to this "cursor" to get the actual cluster state. A single source of truth considerably simplifies implementing vshard router extensions in tarantool drivers for other languages.
In the described approach the delivery time of an applied config is considerably reduced thanks to push capabilities, and it is also assumed that the final router config is built from data confirmed by the storages. P.S. As a fallback, when the central storage is completely unavailable, the config version with the highest "cursor" can be taken from the storages; however, rebalancing (and all other cluster actions that change the config) should be stopped while such a central storage is unavailable. Besides that, while the central storage is unavailable, synchronization of the most recent config between the storages can be done via SWIM.
-
Vlad, thanks for the detailed analysis. I managed to grasp about 60 percent of it. You pointed out that a Central Config Storage has a number of problems, for example with config propagation. From a product point of view, for the user it could look like this:

```lua
local topology = require('topology').new('my-sharded-cluster', {
    type = 'etcd',
    endpoints = { ... },
    ...
})

topology.subscribe('on_change', function(old, new)
    if type == 'router' then
        vshard.router.cfg(new)
    else
        vshard.storage.cfg(new)
    end
end)
```

P.S. We have committed to integrating the module
-
Here is what I think from the point of view of master change discovery. The fastest and most responsive way is to wait in a long-poll with
-
I see you mix up the router's business of dealing with replicasets and a replicaset's internal activity of leader election, no matter whether it is Raft or hand-made. I think we should separate those, so that the router only updates the leader info, not weighing in on how many replicas replied which node is the leader - it should be opaque. Although it means the replicaset itself should be aware of its participation in the cluster.
-
Yesterday (on Friday) I realized that something similar (subscriptions) may be useful in cartridge too. E.g. there's a WebUI, which polls lots of information from every instance in the cluster. On the main dashboard you can see the buckets count, and there's more in the server details dialog.
-
The related issues are #75 and #209. The discussion starts with a description of how the task looks in my understanding. Then I provide my vision of the API and behaviour, some insights into the internals, open and frequent questions, alternatives, and what I want to do in the future.
Problems with existing functionality
In order to change the master in a replicaset, too many actions are required due to the quite rigid configuration update process. The config must be updated on all the storage masters and replicas in all replicasets and on all routers, so that the switch is taken into account by all parties.
In big clusters with tens of replicasets and several replicas in each, with tons of routers, it becomes a weak link in the chain of failover steps.
Some instances might be unreachable and won't get the update. Others might fail to apply it for some reason. Yet others might not be accounted for at all and won't even attempt to update.
This is not only about master change, really. In the scope of #209 it was proposed to automate the entire config discovery. At least for the routers, which are supposed to be completely stateless - having to update their huge configs is an obstacle on the way to their stateless functioning.
Another major feature related to master auto-discovery is support of built-in Raft. In the future I assume vshard might support it in such a way that even the storages won't need to have a master specified in the config. Instead, they will use the automatic leader election. Then it is even more important for the routers to be able to discover the leader/master automatically. Although that is not the main issue with Raft here - the rebalancer would need some serious enhancements, I expect.
This RFC is not about config discovery between the storages, but might be applied there as well in the future. Everything below is about routers.
Possible solutions and behaviour, config API
Master = auto
For #75 (master discovery).
The possible look of the config:
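A sketch of how it might look, assuming the existing `sharding` section format; the UUIDs, URIs and names below are placeholders:

```lua
local cfg = {
    sharding = {
        ['cbf06940-0790-498b-948d-042b62cf3d29'] = {
            master = 'auto',
            replicas = {
                ['8a274925-a26d-47fc-9e1b-af88ce939412'] = {
                    uri = 'storage:storage@127.0.0.1:3301',
                    name = 'storage_1_a',
                },
                ['3de2e3e1-9ebe-4d0d-abb1-26d301b84633'] = {
                    uri = 'storage:storage@127.0.0.1:3302',
                    name = 'storage_1_b',
                },
            },
        },
    },
}
```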
In the replicaset settings a user can specify `master = 'auto'`. Then the router will go to the listed nodes and try to fetch from them who the current master is. When found, it will use that node for RW requests. When the router sees the master has resigned, it will restart the discovery process.
Discovery must scan all the instances in the given replicaset, not just one, because it might be that the old master has just died but wasn't deleted from the replicaset yet, while a new master was already elected externally and the other nodes know about it.
Pros: explicit, a user can see where the auto-discovery is enabled and where not.
Cons: a new option to keep in mind and support. It might not be obvious how it would work with #209, when entire replicasets are discovered. Is it assumed that if I didn't specify a replicaset at all, it would be discovered with an implicit `master = 'auto'` setting? Implicit is usually bad.
Master not specified = implicit auto
For #75 (master discovery).
The same as the previous option, but the `master` key simply is not specified anywhere in the replicaset settings. Then it is assumed to be auto.

Pros: solves the cons of the previous option. No new config key to support, and it should be compatible with #209.

Cons: firstly, currently it is fine not to have any master in a replicaset, and the router does not do anything, only raises a warning. Auto-discovery would change the behaviour - the router would start doing something. Although this may be good, I can't say for sure. Also this means master discovery is implicit - I don't like implicit things in general.
Master is being discovered always
For #75 (master discovery).
Regardless of the `is_master` setting, master discovery works for all replicasets specified in the config. If the master was specified in a replicaset, and discovery found it is not true, the router will use the real master, but will return a warning in `router.info()` that the config is outdated.

Pros: the same as in the previous 2 options.
Cons: I don't like `is_master` being ignored when it mismatches the real master.

Config = auto per replicaset
For #209 (full topology and config discovery).
A user must only specify at least one node from a replicaset from where the router will fetch the entire replicaset config. This is per-replicaset.
The idea is that in addition to master change discovery the router would also be able to find replica zone updates, find new replicas, purge deleted replicas. Not very common actions though I imagine.
Pros: solves a bigger task than just master discovery. The routers become much less dependent on a hardcoded config.
Cons: harder to implement than just master discovery. Not obvious what to do when different nodes say different configs. That might mean the full config can only be downloaded from some centric storage so as to be consistent.
Config = auto global
For #209 (full topology and config discovery).
This is the most global imaginable discovery.
The router will try to download the entire cluster config from all the available nodes + from the given anchor nodes. The entire config can just contain the anchor URIs + router-specific options.
The router will try to build its own map of the cluster. A user might just create a replicaset with 0 weight for the sake of config discovery, with a few nodes + Raft in it, store the config there in a space, and all routers would go there to download the config and update it when it is changed.
Still, the user can hardcode some of replicasets and their replicas.
The anchor URIs might even not be fully configured vshard storages. They would only need to implement some function like `vshard.storage.pull_config()`. They can also be omitted, and as anchor nodes a user could specify one of the replicasets.

Pros: can discover new and deleted replicasets.
Cons: notably harder to implement. And the same problem with config conflicts as in the previous option.
Config = auto, centric storage
For #209 (full topology and config discovery), but not for #75 (master discovery).
A simplification of the previous two options, although not so flexible. The config looks the same except that there is no `auto_config` option, and discovery works only on the nodes listed in `config_nodes`. It won't try to download anything from the nodes in `sharding`, although some of them might be the same as in `config_nodes`.

This works relatively well if the config is stored in a dedicated place from where all the storages and routers are supposed to download it.
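A sketch of such a router config, reusing the `config_nodes` and `sharding` names from above; the URIs are placeholders and the option itself is only a proposal here, not an existing vshard setting:

```lua
local cfg = {
    -- Dedicated nodes keeping the cluster config. Only they are polled.
    config_nodes = {
        'config:secret@10.0.0.1:3301',
        'config:secret@10.0.0.2:3301',
    },
    -- Optional hardcoded part; not used for discovery in this option.
    sharding = {},
    bucket_count = 30000,
}
```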
Pros: simpler than the previous option, but still works the same in the common case. No config conflicts. And it is proved to be working fine in real systems.
Cons: master discovery still needs to be implemented not via the centric config storage, because if built-in master election is ever used, the master is not a part of the config anymore. Also such discovery won't be able to work properly if the config is updated only on a subset of the storages. Routers, having the new config, will try to use it on the storages still having the old config. This might lead to issues. For instance, if the master is configured explicitly, the router might see it differently than it really is, and won't be able to find the truth until the config is applied on the entire cluster.
Summary about the config look
Have no idea what to choose so far really.
I tend to think the idea about full config discovery via all possible URLs is the most flexible, but it is the hardest one to implement. Meanwhile, the idea about a centric config storage seems to be the most widely used. Which does not prove it is good, though. This might be due to the lack of automatic master election in tarantool until recently, which is why `is_master` in the global config worked fine.

Currently I am considering implementing only #75, as the `master = 'auto'` option among those described above.

How to fetch the config
This has to be something implemented in vshard, but maybe also available as a public API, for the sake of implementing your own config storage not depending on vshard in the future.
It is fine, though, to implement it as internal now inside of `vshard.storage._call()`, and expose it to the public later when there is feedback on whether anybody really wants it.

One of the hard parts here is how to avoid downloading the full config when it does not change, as it might be quite big, I suppose: a lot of strings, arrays and maps in messagepack, big Lua tables when unpacked.
The idea about using a hash does not work - it can only tell for sure that the config has changed (different hashes). To check that it didn't change, the configs have to be compared, because a hash is not perfect - it might have collisions.

The "is the config the same" check should be 100% correct. Sometimes it is fine to re-download the config without need, but it should be rare.
I see several options.
Full config, config hash + timestamp
For #209 (full config discovery).
Gives an almost perfect way to see if anything has changed. The storage calculates the config hash and updates its timestamp when the config is changed. The router goes there, checks this pair, and if anything has changed - downloads the new version.
Pros: simple, understandable.
Cons: not reliable. I need to use UTC timestamps because the monotonic time won't survive a restart. UTC time might be corrected on the machine. Together with the possible hash collisions, there is a chance a config update might not be noticed by the routers. This could be solved by bumping the timestamp periodically even if nothing changes, but it looks like a crutch.
Also, irrelevant information needs to be filtered out of the config. For instance, the router does not care about `box.cfg` settings. This adds complexity.

Full config, bound to the TCP session
For #209 (full config discovery).
I like this solution more because it is 100% solid, although it might look a bit complicated. The idea is similar to what I did for storage refs in the scope of map-reduce.
The router goes to the storage and asks for the config + its version. They are returned the first time, and the version is remembered in the TCP session via which the request was done.
Next time the router asks for the config, it passes the version. If it matches the one stored in the session, no need to return anything.
Config is re-downloaded only when the connection is re-established (rare, regardless of the reason), and when it is really changed.
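A minimal storage-side sketch of that check, assuming hypothetical `M.config_version` and `M.router_config` fields maintained on every reconfigure; the function name is illustrative:

```lua
-- Runs on the storage inside the router's request.
local function fetch_config(router_version)
    local session = box.session.storage
    if router_version ~= nil and router_version == session.sent_version and
       session.sent_version == M.config_version then
        -- The router already has the latest config in this connection.
        return nil
    end
    session.sent_version = M.config_version
    return {version = M.config_version, config = M.router_config}
end
```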
Pros: reliable.
Cons: might look complicated.
Full config, subscription
For #209 (full config discovery). Can be done just for #75 (master discovery) as well, without a full config.
This looks relatively easy - send to the storage a long-polling request which would send `box.session.push()` when the config changes in any way.

Here are 2 problems:

- Such long-poll requests will eat from the `net_msg_max` limit for not doing almost anything, and each will occupy a fiber. The stateless router count might be bigger than now, especially if, as "Make routers completely stateless" #209 says, they would live on the clients, which means the waste can be notable.
- Reload: the push callbacks on the router must survive a reload of the module.

The problem of reload can be solved with a hack: store the function ref as a global, and call it via a non-reloadable function. The ref can be reloaded. For example:
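A sketch of that hack, assuming `M` is the module's state table that survives reload, `conn` is an established netbox connection, and `'subscribe_config'` is a hypothetical storage command:

```lua
-- Reloadable handler: lives in the module state table, so a reload can
-- simply replace M.on_message with a new implementation.
M.on_message = function(msg)
    -- Apply the pushed config/master update here.
end

-- Non-reloadable proxy: its reference is captured by already sent
-- long-poll requests, but it always forwards to the latest M.on_message.
local function on_message_proxy(msg)
    return M.on_message(msg)
end

-- Usage: subscribe once, receive pushes via the proxy.
conn:call('vshard.storage._call', {'subscribe_config'},
          {is_async = true, on_push = on_message_proxy})
```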
This way `M.on_message` can be reloaded without dropping the connections, and new pushes will use the new function. Already sent `on_message_proxy` references won't be reloaded, but they are trivial; their reload should never be needed.

The problem of long-poll requests wasting resources can't be solved on any of the current Tarantool versions, but it might be not as bad as it looks.
On the storage side the subscriptions need to attach to a global cond var which is signaled when something is changed on the storage. And they need to end the subscription when the connection breaks. This can be done via long periodic checks or via `box.session.on_disconnect()`.

Another good part here - it could be used for subscribing on `_bucket`
updates too which would make bucket discovery on the router much more reactive, and cheap when no changes. It would solve #257 and #238.Pros: get config update right when it happens; reliable; might be not so easy to implement.
Cons: inefficient, wastes resources - fibers and `net_msg_max` - on the storage.

Full config, persistent version
For #209 (full config discovery).
The config version is stored in WAL somehow. Maybe in a replica-local space. The storage bumps the version on each reconfig.
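A sketch of how that could look with a replica-local space; the space and function names are illustrative, not existing vshard objects:

```lua
-- Executed once on the storage, e.g. during vshard.storage.cfg().
box.schema.space.create('_vshard_config_version', {
    is_local = true,
    if_not_exists = true,
    format = {{'key', 'string'}, {'version', 'unsigned'}},
})
box.space._vshard_config_version:create_index('pk', {
    parts = {{'key', 'string'}},
    if_not_exists = true,
})

-- Called on every reconfiguration; note that it makes the reconfig yield.
local function bump_config_version()
    local tuple = box.space._vshard_config_version:get({'config'})
    local version = tuple ~= nil and tuple.version + 1 or 1
    box.space._vshard_config_version:replace({'config', version})
    return version
end
```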
Pros: reliable; understandable.
Cons: requires schema change; makes the config update yield; wastes space in WAL.
Only 'is_master' flag
For #75 (master discovery).
For #75 it is enough to get from each node a boolean saying whether it is a master. Very cheap. This makes it unnecessary to try to save time by not asking "are you a master?" too often.
Pros: stupid and simple.
Cons: works only for #75.
Individual config + list of nodes
For #209 (full config discovery).
It extends the previous option - let the storage instances send not only their `is_master` flag but also their parameters like zone, replicaset UUID, instance UUID, and a list of the nodes they know about.

For instance, assume there are instance1 and instance2; they have their configs and know each other's configs. Also there is an instance3, which does not have its own config, but knows about instance1 and instance2. The router knows about instance3.
Firstly, the router downloads from instance3 this: `{config = {}, nodes = {instance1, instance2}}`.

Now the router knows about instance1, instance2, instance3. It downloads from instance1 this: `{config = {is_master, replicaset_uuid, instance_uuid, zone}, nodes = {instance2}}`. Now it downloads the same from instance2.

The config is discovered in parts downloaded from the nodes responsible for their parts.
The config parts are small which means it is not necessary to protect them with a version, a timestamp, or anything else. Can download them each time.
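A router-side sketch of collecting such parts in parallel with netbox `is_async` futures; the `connections` table and the `'storage_info'` command are assumptions for the example:

```lua
local DISCOVERY_TIMEOUT = 10

-- 'connections' maps instance uuid -> established netbox connection.
local futures = {}
for uuid, conn in pairs(connections) do
    futures[uuid] = conn:call('vshard.storage._call', {'storage_info'},
                              {is_async = true})
end
local parts = {}
for uuid, future in pairs(futures) do
    -- res is an array of returned values; res[1] is the config part.
    local res, err = future:wait_result(DISCOVERY_TIMEOUT)
    if res ~= nil then
        parts[uuid] = res[1]
    end
end
```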
Pros: no config conflicts are possible; see only the real situation in the cluster, even if the target config wasn't applied everywhere yet.
Cons: need to download a config part from each instance. If their number is in the hundreds, it might take time. Not too much, though, especially if `is_async` from netbox is used to collect the parts in parallel.

Summary about the config fetch
The best option is the last one - download the config in parts and discover other nodes as URLs, not as records in a config. It solves both #75 and #209, and almost has no such thing as a discovery conflict.
But if we go for #75 only, then the fastest solution is to send just the `is_master` flag from the storages. There is not much to do wrong there, really, and it is quite simple to do compared to the other options.

It is very likely only #75 will be solved first. And in that case the best final solution is to either collect the update periodically or on an event like "no master is known", or via a subscription. See the sections "Full config, subscription" and "Only 'is_master' flag".
Conflict resolution
What if the config was found different on several instances? For instance, there are multiple replicas which say they are masters.
The option list is quite short: random choice; do nothing (no master); UTC timestamp the configs and choose the latest.
Random choice: select any config among the conflicted ones. Stupid, but works.
Do nothing: treat it like there is no master, and return a warning from `router.info()`.

UTC timestamp the configs and choose the latest: sounds good at first, but then you realize that due to time shifts this might lead to writing to multiple masters simultaneously from different routers. For instance, some of them didn't notice the multi-master situation yet and keep writing.
Summary about conflict resolution
I choose the second option - not doing anything on seeing a conflict + raising an alarm. This is the safest option. Normally there should be either 0 or 1 masters, and this option works well for that case.
Implementation
There are no clear and separate ways how to do it. I only list key ideas from which I will form a summary in the end.
Location of the implementation
Depends on its complexity.
For a simple master discovery it should be enough to implement it in one of the existing modules. If it needs to be usable both on the router and on the storage when auto-election is enabled, there is not much choice - the implementation must be in the `replicaset` module then.

For full config discovery it probably should live in its own internal module, similar to the storage refs and the scheduler on the storage. The logic of any of the config discovery solutions is not trivial enough to keep it in the main init.lua files. On the other hand, if it lives in a module, the module would need its own config. This adds complexity.
How would the discovery module connect to the remote peers?
Connections to the remote nodes are established by the router's main file and are stored in each router object. Establishing them all a second time just for the sake of the discovery module's independence does not look good.
That means either the module should reference the router it is used in, so that it could access its connections; or there should be a global connection pool from where the config discovery could take them; or it should reference the `replicasets` objects, which exist both on the router and on the storage. In the last case the implementation would work in both instance types.

Discovery lives in a fiber or not?
The discovery config object needs to do its work periodically. This implies there should be a fiber for that, because there is no way to postpone some work otherwise. I could make the discovery somehow a state machine without a fiber and put it into the existing bucket discovery fiber, but this would look notably complicated.
An alternative would be not to do periodic work - try to discover anything only when no master is seen in some of the replicasets. Then a fiber is started, finds the master, and is terminated. This works only for #75. For #209 we need to poll for the updates periodically to find changes not only about master switch.
Another alternative - make discovery a function. It takes the replicasets table, looks for the ones not having a master, and in a blocking way tries to find a master in them, and then returns regardless of whether it was a success. The function can be called periodically in one of the existing fibers. For instance, the router could call it in the failover fiber. The storage could call it before applying rebalancing routes.
When the router sees an RW request has failed on a replica, it can reset its `is_master` flag and wake up the failover fiber to hurry up the master discovery.

Polling or subscription?
Polling is about all the solutions which try to fetch updates with a period from all or some of the nodes. It is usually easy to implement, but might be not efficient enough.
Subscription is about using `box.session.push()` to get updates of the config or the node's personal info (depending on what is chosen in the sections above).

Having a subscription would also allow not to have a fiber, because it is simply an async request sent via netbox with a callback to invoke on each message.
Deal with multiple routers
Speaking of multiple routers - the config discovery module must not be a global singleton, like the storage refs and the storage scheduler. It must allow creating 'config discovery' objects stored in each router object individually.
Instance deletion
If the config is going to be discovered in pieces from each individual node, there is another question in addition to the above ones - how to discover that an instance or a replicaset has been deleted?
I see 2 signs:
When both conditions meet, it could be considered deleted. And when all instances of a replicaset are deleted like that, the replicaset is deleted too.
Summary about implementation
Since I tend to decide to implement #75 only, I think the best solution in terms of complexity vs future compatibility vs impact is to implement a function in the `replicaset` module. It would take the replicasets not having a master and send blocking requests to their nodes asking who the master is.

Either it finds the masters and all is fine, or it does not and returns an error, or it finds 2 or more masters for some replicasets and also returns an error.
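A rough sketch of such a function, under the assumptions that each replicaset object keeps netbox connections to its replicas (`rs.replicas`, `replica.conn`) and that the storage exposes an "are you a master?" style command; all names here are illustrative, not the final API:

```lua
-- Ask every replica of every master-less replicaset who the master is.
local function locate_masters(replicasets)
    local errors = {}
    for rs_uuid, rs in pairs(replicasets) do
        if rs.master == nil then
            local masters = {}
            for _, replica in pairs(rs.replicas) do
                -- Hypothetical storage command returning {is_master = <bool>}.
                local ok, res = pcall(replica.conn.call, replica.conn,
                                      'vshard.storage._call', {'info'},
                                      {timeout = 1})
                if ok and type(res) == 'table' and res.is_master then
                    table.insert(masters, replica)
                end
            end
            if #masters == 1 then
                rs.master = masters[1]
            elseif #masters == 0 then
                errors[rs_uuid] = 'no master found'
            else
                -- Conflict: do nothing and raise an alarm, as chosen above.
                errors[rs_uuid] = 'multiple masters found'
            end
        end
    end
    if next(errors) == nil then
        return true
    end
    return nil, errors
end
```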
It can also send requests to the known master to check if they are still masters.
The router will call this function periodically in the failover fiber. The fiber will be woken up when router fails an RW request because some node is not a master.
In addition, I can add a hint to the "not a master" error on the storage side which will tell who is probably the real master. The storages should know that (normally).
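A storage-side sketch of such a hint; the error format here is simplified and the `M.*` field names are assumptions, not the real vshard error structure:

```lua
-- Returned by the storage when an RW request hits a non-master node.
local function make_non_master_error()
    return {
        code = 'NON_MASTER',
        replica_uuid = M.this_replica_uuid,
        replicaset_uuid = M.this_replicaset_uuid,
        -- The hint: who this node currently believes the master is.
        -- The router can switch to it without a full rediscovery.
        master_uuid = M.known_master_uuid,
    }
end
```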
Summary about everything
Only master discovery will be implemented, via the `master = 'auto'` option in the router config, per replicaset.

It will fetch the "are you a master?" attribute from all the storages to locate the master and to discover the old master's demotion.
The implementation will live in the `replicaset` module as just a function doing some netbox calls (probably in parallel via `is_async`) in an attempt to find masters.

The router will call the master discovery function periodically in its failover fiber. It will also wake the failover fiber up to hurry things along in case it fails an RW request with the "not a master" error.
The storage, with the "not a master" error, will send a hint about who might be the current master. If the router gets the hint, it won't need to wake up the "big" master discovery.
Future work
It is important to think about where this is all going, so as not to shake the public API too much. As I see it, in the future even the storages might not need to know the config of all the other storages in advance. They might discover it just like the routers will - via a centric storage or in parts.
For that the user might create a special replicaset with no buckets, which stores only its own config. The routers and storages see only this special replicaset in their configs. When they start, they register themselves there and download the topology of the other replicasets. This way, via the special config-shard, all the instances will discover the cluster.
The API on the storage would look the same as on the routers. Moreover, a storage does not need to know its entire replicaset. It needs only a majority of nodes so that they would bootstrap into one replicaset UUID. After bootstrap it is enough to see just one node, and the others can be discovered.
Discovery can use many places as a source. In the scope of this RFC only the configs were considered. But in the future it is possible to use SWIM + broadcast, and to fetch the nodes from `box.cfg.replication` and `box.info.replication`.

The closest milestone is #209 - full config discovery on the routers.
FAQ
What if I have several instances configured as masters?
They might have different configs, each saying this node is a master. Then the router, according to the currently chosen solution, will not use any of them as a master, and will raise an alarm.
Open questions
None.