Replies: 7 comments 17 replies
-
Vlad, thanks for your work. I have several comments. I'd like to start my comments with some thoughts about vshard and its place in the ecosystem.
I agree here with you.
Do I correctly understand that
No. The second part is more correct. Currently people usually use an external config provider to configure vshard. Sorry, I won't stop repeating it :).
I believe cryptographic hashes are good enough. In the same way we can imagine uuid collisions: possible, but the probability is too small. BTW, people store hashed passwords and aren't afraid that someone will find another password with the same hash.
But pushes are asynchronous. You don't have guarantees that your message will be delivered.
It shouldn't even be an option. We can't store the configuration inside Tarantool. In case of some issues we would have problems changing or even reading the configuration. That's the reason cartridge uses a simple yaml file for it.
Do I understand correctly that the router periodically fetches the config from all masters, and if the master has changed by the time of some RW request, then we reload the configuration and repeat the request to the new master?
Cartridge already has its own connection pool. I think vshard could reuse some parts of it and provide a "general purpose" connection pool that could also be used by other modules. (I'll agree with you if you say that it's a huge separate task and should be discussed separately.) Having read up to the "Alternatives" paragraph, I realized that some of my thoughts are definitely about it. And it doesn't even contradict the thesis about "non-tarantool modules or services", because cartridge and tarantool/conf are Tarantool modules.

P.S. Sorry, Vlad, if such a format is not quite comfortable. I could split my message into several parts and each of them could be discussed in its own thread.
-
This approach fits well with the cartridge role concept. For such a component a dedicated extra replicaset can already be allocated today, with cartridge responsible for leader election in it. And in the future native Raft and synchronous replication could fully compete with products like etcd/consul. Below is a rough implementation. For the end router client (driver), which needs to obtain the cluster configuration, bucket locations and master nodes, it is enough to implement a small subset of etcd/consul-like functionality for fetching configs:
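(A minimal illustration of what such a subset could look like, assuming a hypothetical config-storage module living on that dedicated replicaset; every function name, section name and the cursor format here are assumptions, not an existing API.)

```lua
local fiber = require('fiber')

-- Hypothetical state of the config replicaset.
local M = {
    cursor = 0,          -- grows on every accepted write
    sections = {},       -- e.g. {topology = {...}, buckets = {...}, leaders = {...}}
    changed = fiber.cond(),
}

-- Write a section; reject stale writers (e.g. a node with an old Raft term).
function M.put(section, data, expected_cursor)
    if expected_cursor ~= nil and expected_cursor < M.cursor then
        return nil, 'stale cursor'
    end
    M.sections[section] = data
    M.cursor = M.cursor + 1
    M.changed:broadcast()
    return M.cursor
end

-- Long-poll: return the requested sections once the cursor moves past the
-- one the client already has (or the timeout expires).
function M.watch(section_names, cursor, timeout)
    if cursor ~= nil and cursor >= M.cursor then
        M.changed:wait(timeout)
    end
    local res = {cursor = M.cursor, sections = {}}
    for _, name in ipairs(section_names) do
        res.sections[name] = M.sections[name]
    end
    return res
end

return M
```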
With this approach any change is delivered as fast as possible, and both a simple implementation (periodic config polling) and a more advanced one (real-time change streaming) are available. Additionally, filtering by sections can be added, so that a subscription delivers only the explicitly requested sections. Synchronizing the config from a single source is somewhat similar to the tarantool WAL. It could even be shipped as a separate independent module for tarantool on top of netbox and used wherever such synchronization is needed (instead of complex and unreliable two-phase commits). The assumption is that each config section is written by the component responsible for it, for example:

- the topology is written by the cartridge vshard role (though it could be someone else; the format should be standardized);
- bucket routing is written by the master node receiving the data on its side;
- leader election is written by the component responsible for failover in the current configuration.

The write API should support versioning of config sections and consistency control. For example, with native Raft failover, if a node tries to declare itself the leader, the write API should reject the write if another node with a higher Raft term has already registered. Any successful write to the config returns a "cursor", which can later be attached to an error. For example: a router makes a request for a bucket to a replicaset where the bucket no longer exists; it may get an error plus a config "cursor", which tells the router that it needs to read the config at least up to this "cursor" to get the actual cluster state. A single source of truth considerably simplifies implementing vshard router extensions in tarantool drivers for other languages.
In the described approach the delivery time of an applied config is considerably reduced thanks to push capabilities, and it is also assumed that the final router config is built from data confirmed by the storages. P.S. As a fallback, when the central storage is completely unavailable, the config version with the highest "cursor" can be taken from the storages; however, rebalancing (and all other cluster actions that change the config) should be stopped while such a central storage is unavailable. Besides that, while the central storage is unavailable, synchronization of the most recent config between the storages can be done via SWIM.
-
Vlad, thanks for the detailed analysis. I managed to grasp about 60 percent of it. You pointed out that a Central Config Storage has a number of problems, for example with config propagation. From a product point of view, for the user it could look like this:

```lua
local topology = require('topology').new('my-sharded-cluster', {
    type = 'etcd',
    endpoints = { ... },
    ...
})

topology.subscribe('on_change', function(old, new)
    if type == 'router' then
        vshard.router.cfg(new)
    else
        vshard.storage.cfg(new)
    end
end)
```

P.S. We have committed to integrating the module
-
Here is what I think from the point of view of master change discovery. The fastest and most responsive way is to wait in a long-poll with
-
I see you mix up the router's business of dealing with replicasets and a replicaset's internal activity of leader election, no matter whether it is Raft or hand-made. I think we should separate those, so that the router only updates the leader info, not weighing in on how many replicas replied which node is the leader - it should be opaque. Although it means the replicaset itself should be aware of its participation in the cluster.
-
Yesterday (on Friday) I realized that something similar (subscriptions) may be useful in cartridge too. E.g. there's a WebUI, which polls lots of information from every instance in the cluster. On the main dashboard you can see the buckets count, and there's more in the server details dialog.
-
The related issues are #75 and #209. The discussion starts with a description of how the task looks in my understanding. Then I provide my vision of the API and behaviour, some insights into the internals, open and frequent questions, alternatives, and what I want to do in the future.
Problems with existing functionality
In order to change the master in a replicaset, too many actions are required due to the quite rigid configuration update process. The config must be updated on all the storage masters and replicas in all replicasets and on all routers, so that the switch is taken into account by all parties.
In big clusters with tens of replicasets and several replicas in each, with tons of routers, it becomes a weak link in the chain of failover steps.
Some instances might be unreachable and won't get the update. Others might fail to apply it for some reason. Yet others might not be accounted for at all and won't even attempt to update.
This is not only about master change, really. In the scope of #209 it was proposed to automate the entire config discovery. At least for the routers, which are supposed to be completely stateless - having to update their huge configs is an obstacle on the way to their stateless functioning.
Another major feature related to master auto-discovery is support of built-in Raft. In the future I assume vshard might support it in such a way that even the storages won't need to have a master specified in the config. Instead, they will use the automatic leader election. Then it is even more important for the routers to be able to discover the leader/master automatically. Although that is not the main issue with Raft here - the rebalancer would need some serious enhancements, I expect.
This RFC is not about config discovery between the storages, but might be applied there as well in the future. Everything below is about routers.
Possible solutions and behaviour, config API
Master = auto
For #75 (master discovery).
The possible look of the config:
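A sketch of how it might look, assuming the existing `sharding` section format; the UUIDs, URIs and names below are placeholders:

```lua
local cfg = {
    sharding = {
        ['cbf06940-0790-498b-948d-042b62cf3d29'] = {
            master = 'auto',
            replicas = {
                ['8a274925-a26d-47fc-9e1b-af88ce939412'] = {
                    uri = 'storage:storage@127.0.0.1:3301',
                    name = 'storage_1_a',
                },
                ['3de2e3e1-9ebe-4d0d-abb1-26d301b84633'] = {
                    uri = 'storage:storage@127.0.0.1:3302',
                    name = 'storage_1_b',
                },
            },
        },
    },
}
```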
In the replicaset settings a user can specify `master = 'auto'`. Then the router will go to the listed nodes and try to fetch from them who the current master is. When found, it will use that node for RW requests. When the router sees the master has resigned, it will restart the discovery process.
Discovery must scan all the instances in the given replicaset, not just one, because it might be that the old master has just died but wasn't deleted from the replicaset yet, while a new master was already elected externally and the other nodes know about it.
Pros: explicit, a user can see where the auto-discovery is enabled and where not.
Cons: a new option to keep in mind and support. It might not be obvious how it would work with #209, when entire replicasets are discovered. Is it assumed that if I didn't specify a replicaset at all, it would be discovered with an implicit `master = 'auto'` setting? Implicit is usually bad.
Master not specified = implicit auto
For #75 (master discovery).
The same as the previous option, but the `master` key simply is not specified anywhere in the replicaset settings. Then it is assumed to be auto.

Pros: solves the cons of the previous option. No new config key to support, and it should be compatible with #209.

Cons: firstly, currently it is fine not to have any master in a replicaset, and the router does not do anything, only raises a warning. Auto-discovery would change the behaviour - the router would start doing something. Although this may be good, I can't say for sure. Also this means master discovery is implicit - I don't like implicit things in general.
Master is being discovered always
For #75 (master discovery).
Regardless of the `is_master` setting, master discovery works for all replicasets specified in the config. If the master was specified in a replicaset, and discovery found it is not true, the router will use the real master, but will return a warning in `router.info()` that the config is outdated.

Pros: the same as in the previous 2 options.
Cons: I don't like `is_master` being ignored when it mismatches the real master.

Config = auto per replicaset
For #209 (full topology and config discovery).
A user must only specify at least one node from a replicaset from where the router will fetch the entire replicaset config. This is per-replicaset.
The idea is that in addition to master change discovery the router would also be able to find replica zone updates, find new replicas, purge deleted replicas. Not very common actions though I imagine.
Pros: solves a bigger task than just master discovery. The routers become much less dependent on a hardcoded config.
Cons: harder to implement than just master discovery. Not obvious what to do when different nodes say different configs. That might mean the full config can only be downloaded from some centric storage so as to be consistent.
Config = auto global
For #209 (full topology and config discovery).
This is the most global imaginable discovery.
The router will try to download the entire cluster config from all the available nodes + from the given anchor nodes. The entire config can just contain the anchor URIs + router-specific options.
The router will try to build its own map of the cluster. A user might just create a replicaset with 0 weight for the sake of config discovery, with a few nodes + Raft in it, store the config there in a space, and all routers would go there to download the config and update it when it is changed.
Still, the user can hardcode some of replicasets and their replicas.
The anchor URIs might even not be fully configured vshard storages. They would only need to implement some function like `vshard.storage.pull_config()`. They can also be omitted, and as anchor nodes a user could specify one of the replicasets.

Pros: can discover new and deleted replicasets.
Cons: notably harder to implement. And the same problem with config conflicts as in the previous option.
Config = auto, centric storage
For #209 (full topology and config discovery), but not for #75 (master discovery).
A simplification of the previous two options, although not so flexible. The config looks the same except that there is no `auto_config` option, and discovery works only on the nodes listed in `config_nodes`. It won't try to download anything from the nodes in `sharding`, although some of them might be the same as in `config_nodes`.

This works relatively well if the config is stored in a dedicated place from where all the storages and routers are supposed to download it.
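A sketch of such a router config, reusing the `config_nodes` and `sharding` names from above; the URIs are placeholders and the option itself is only a proposal here, not an existing vshard setting:

```lua
local cfg = {
    -- Dedicated nodes keeping the cluster config. Only they are polled.
    config_nodes = {
        'config:secret@10.0.0.1:3301',
        'config:secret@10.0.0.2:3301',
    },
    -- Optional hardcoded part; not used for discovery in this option.
    sharding = {},
    bucket_count = 30000,
}
```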
Pros: simpler than the previous option, but still works the same in the common case. No config conflicts. And it is proved to be working fine in real systems.
Cons: master discovery still needs to be implemented not via the centric config storage, because if built-in master election is ever used, the master is not a part of the config anymore. Also such discovery won't be able to work properly if the config is updated only on a subset of the storages. Routers, having the new config, will try to use it on the storages still having the old config. This might lead to issues. For instance, if the master is configured explicitly, the router might see it differently than it really is, and won't be able to find the truth until the config is applied on the entire cluster.
Summary about the config look
Have no idea what to choose so far really.
I tend to think the idea about full config discovery via all possible URLs is the most flexible, but it is the hardest one to implement. Meanwhile, the idea about a centric config storage seems to be the most widely used. Which does not prove it is good, though. This might be due to the lack of automatic master election in tarantool until recently, which is why `is_master` in the global config worked fine.

Currently I am considering implementing only #75, as the `master = 'auto'` option among those described above.

How to fetch the config
This has to be something implemented in vshard, but maybe also available as a public API, for the sake of implementing your own config storage not depending on vshard in the future.
It is fine, though, to implement it as internal now inside of `vshard.storage._call()`, and expose it to the public later when there is feedback on whether anybody really wants it.

One of the hard parts here is how to avoid downloading the full config when it does not change, as it might be quite big, I suppose: a lot of strings, arrays and maps in messagepack, big Lua tables when unpacked.
The idea about using a hash does not work - it can only tell for sure that the config has changed (different hashes). To check that it didn't change, the configs have to be compared, because a hash is not perfect - it might have collisions.

The "is the config the same" check should be 100% correct. Sometimes it is fine to re-download the config without need, but it should be rare.
I see several options.
Full config, config hash + timestamp
For #209 (full config discovery).
Gives an almost perfect way to see if anything has changed. The storage calculates the config hash and updates its timestamp when the config is changed. The router goes there, checks this pair, and if anything has changed - downloads the new version.
Pros: simple, understandable.
Cons: not reliable. I need to use UTC timestamps because the monotonic time won't survive a restart. UTC time might be corrected on the machine. Together with the possible hash collisions, there is a chance a config update might not be noticed by the routers. This could be solved by bumping the timestamp periodically even if nothing changes, but it looks like a crutch.
Also, irrelevant information needs to be filtered out of the config. For instance, the router does not care about `box.cfg` settings. This adds complexity.

Full config, bound to the TCP session
For #209 (full config discovery).
I like this solution more because it is 100% solid, although it might look a bit complicated. The idea is similar to what I did for storage refs in the scope of map-reduce.
The router goes to the storage and asks for the config + its version. They are returned the first time, and the version is remembered in the TCP session via which the request was done.
Next time the router asks for the config, it passes the version. If it matches the one stored in the session, no need to return anything.
Config is re-downloaded only when the connection is re-established (rare, regardless of the reason), and when it is really changed.
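A minimal storage-side sketch of that check, assuming hypothetical `M.config_version` and `M.router_config` fields maintained on every reconfigure; the function name is illustrative:

```lua
-- Runs on the storage inside the router's request.
local function fetch_config(router_version)
    local session = box.session.storage
    if router_version ~= nil and router_version == session.sent_version and
       session.sent_version == M.config_version then
        -- The router already has the latest config in this connection.
        return nil
    end
    session.sent_version = M.config_version
    return {version = M.config_version, config = M.router_config}
end
```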
Pros: reliable.
Cons: might look complicated.
Full config, subscription
For #209 (full config discovery). Can be done just for #75 (master discovery) as well, without a full config.
This looks relatively easy - send to the storage a long-polling request which would send `box.session.push()` when the config changes in any way.

Here are 2 problems:

- Such long-poll requests will eat from the `net_msg_max` limit for not doing almost anything, and each will occupy a fiber. The stateless router count might be bigger than now, especially if, as "Make routers completely stateless" #209 says, they would live on the clients, which means the waste can be notable.
- Reload: the push callbacks on the router must survive a reload of the module.

The problem of reload can be solved with a hack: store the function ref as a global, and call it via a non-reloadable function. The ref can be reloaded. For example:
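A sketch of that hack, assuming `M` is the module's state table that survives reload, `conn` is an established netbox connection, and `'subscribe_config'` is a hypothetical storage command:

```lua
-- Reloadable handler: lives in the module state table, so a reload can
-- simply replace M.on_message with a new implementation.
M.on_message = function(msg)
    -- Apply the pushed config/master update here.
end

-- Non-reloadable proxy: its reference is captured by already sent
-- long-poll requests, but it always forwards to the latest M.on_message.
local function on_message_proxy(msg)
    return M.on_message(msg)
end

-- Usage: subscribe once, receive pushes via the proxy.
conn:call('vshard.storage._call', {'subscribe_config'},
          {is_async = true, on_push = on_message_proxy})
```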
This way `M.on_message` can be reloaded without dropping the connections, and new pushes will use the new function. Already sent `on_message_proxy` references won't be reloaded, but they are trivial; their reload should never be needed.

The problem of long-poll requests wasting resources can't be solved on any of the current Tarantool versions, but it might be not as bad as it looks.
On the storage side the subscriptions need to attach to a global cond var which is signaled when something is changed on the storage. And they need to end the subscription when the connection breaks. This can be done via long periodic checks or via `box.session.on_disconnect()`.

Another good part here - it could be used for subscribing on `_bucket`
updates too which would make bucket discovery on the router much more reactive, and cheap when no changes. It would solve #257 and #238.Pros: get config update right when it happens; reliable; might be not so easy to implement.
Cons: inefficient, wastes resources - fibers and `net_msg_max` - on the storage.

Full config, persistent version
For #209 (full config discovery).
The config version is stored in WAL somehow. Maybe in a replica-local space. The storage bumps the version on each reconfig.
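A sketch of how that could look with a replica-local space; the space and function names are illustrative, not existing vshard objects:

```lua
-- Executed once on the storage, e.g. during vshard.storage.cfg().
box.schema.space.create('_vshard_config_version', {
    is_local = true,
    if_not_exists = true,
    format = {{'key', 'string'}, {'version', 'unsigned'}},
})
box.space._vshard_config_version:create_index('pk', {
    parts = {{'key', 'string'}},
    if_not_exists = true,
})

-- Called on every reconfiguration; note that it makes the reconfig yield.
local function bump_config_version()
    local tuple = box.space._vshard_config_version:get({'config'})
    local version = tuple ~= nil and tuple.version + 1 or 1
    box.space._vshard_config_version:replace({'config', version})
    return version
end
```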
Pros: reliable; understandable.
Cons: requires schema change; makes the config update yield; wastes space in WAL.
Only 'is_master' flag
For #75 (master discovery).
For #75 it is enough to get from each node a boolean saying whether it is a master. Very cheap. This makes it unnecessary to try to save time by not asking "are you a master?" too often.
Pros: stupid and simple.
Cons: works only for #75.
Individual config + list of nodes
For #209 (full config discovery).
It extends the previous option - let the storage instances send not only their `is_master` flag but also their parameters like zone, replicaset UUID, instance UUID, and a list of the nodes they know about.

For instance, assume there are instance1 and instance2; they have their configs and know each other's configs. Also there is an instance3, which does not have its own config, but knows about instance1 and instance2. The router knows about instance3.
Firstly, the router downloads from instance3 this: `{config = {}, nodes = {instance1, instance2}}`.

Now the router knows about instance1, instance2, instance3. It downloads from instance1 this: `{config = {is_master, replicaset_uuid, instance_uuid, zone}, nodes = {instance2}}`. Now it downloads the same from instance2.

The config is discovered in parts downloaded from the nodes responsible for their parts.
The config parts are small which means it is not necessary to protect them with a version, a timestamp, or anything else. Can download them each time.
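A router-side sketch of collecting such parts in parallel with netbox `is_async` futures; the `connections` table and the `'storage_info'` command are assumptions for the example:

```lua
local DISCOVERY_TIMEOUT = 10

-- 'connections' maps instance uuid -> established netbox connection.
local futures = {}
for uuid, conn in pairs(connections) do
    futures[uuid] = conn:call('vshard.storage._call', {'storage_info'},
                              {is_async = true})
end
local parts = {}
for uuid, future in pairs(futures) do
    -- res is an array of returned values; res[1] is the config part.
    local res, err = future:wait_result(DISCOVERY_TIMEOUT)
    if res ~= nil then
        parts[uuid] = res[1]
    end
end
```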
Pros: no config conflicts are possible; see only the real situation in the cluster, even if the target config wasn't applied everywhere yet.
Cons: need to download a config part from each instance. If their number is in the hundreds, it might take time. Not too much, though, especially if `is_async` from netbox is used to collect the parts in parallel.

Summary about the config fetch
The best option is the last one - download the config in parts and discover other nodes as URLs, not as records in a config. It solves both #75 and #209, and almost has no such thing as a discovery conflict.
But if we go for #75 only, then the fastest solution is to send just the `is_master` flag from the storages. There is not much to do wrong there, really, and it is quite simple to do compared to the other options.

It is very likely only #75 will be solved first. And in that case the best final solution is to either collect the update periodically or on an event like "no master is known", or via a subscription. See the sections "Full config, subscription" and "Only 'is_master' flag".
Conflict resolution
What if the config was found different on several instances? For instance, there are multiple replicas which say they are masters.
The option list is quite short: random choice; do nothing (no master); UTC timestamp the configs and choose the latest.
Random choice: select any config among the conflicted ones. Stupid, but works.
Do nothing: treat it like there is no master, and return a warning from `router.info()`.

UTC timestamp the configs and choose the latest: sounds good at first, but then you realize that due to time shifts this might lead to writing to multiple masters simultaneously from different routers. For instance, some of them didn't notice the multi-master situation yet and keep writing.
Summary about conflict resolution
I choose the second option - not doing anything on seeing a conflict + raising an alarm. This is the safest option. Normally there should be either 0 or 1 masters, and this option works well for that case.
Implementation
There are no clear and separate ways how to do it. I only list key ideas from which I will form a summary in the end.
Location of the implementation
Depends on its complexity.
For a simple master discovery it should be enough to implement it in one of the existing modules. If it needs to be usable both on the router and on the storage when auto-election is enabled, there is not much choice - the implementation must be in the `replicaset` module then.

For full config discovery it probably should live in its own internal module, similar to the storage refs and the scheduler on the storage. The logic of any of the config discovery solutions is not trivial enough to keep it in the main init.lua files. On the other hand, if it lives in a module, the module would need its own config. This adds complexity.
How would the discovery module connect to the remote peers?
Connections to the remote nodes are established by the router's main file and are stored in each router object. Establishing them all a second time just for the sake of the discovery module's independence does not look good.
That means either the module should reference the router it is used in, so that it could access its connections; or there should be a global connection pool from where the config discovery could take them; or it should reference the `replicasets` objects, which exist both on the router and on the storage. In the last case the implementation would work in both instance types.

Discovery lives in a fiber or not?
The discovery config object needs to do its work periodically. This implies there should be a fiber for that, because there is no way to postpone some work otherwise. I could make the discovery somehow a state machine without a fiber and put it into the existing bucket discovery fiber, but this would look notably complicated.
An alternative would be not to do periodic work - try to discover anything only when no master is seen in some of the replicasets. Then a fiber is started, finds the master, and is terminated. This works only for #75. For #209 we need to poll for the updates periodically to find changes not only about master switch.
Another alternative - make discovery a function. It takes the replicasets table, looks for the ones not having a master, and in a blocking way tries to find a master in them, and then returns regardless of whether it was a success. The function can be called periodically in one of the existing fibers. For instance, the router could call it in the failover fiber. The storage could call it before applying rebalancing routes.
When the router sees an RW request has failed on a replica, it can reset its `is_master` flag and wake up the failover fiber to hurry up the master discovery.

Polling or subscription?
Polling is about all the solutions which try to fetch updates with a period from all or some of the nodes. It is usually easy to implement, but might be not efficient enough.
Subscription is about using `box.session.push()` to get updates of the config or the node's personal info (depending on what is chosen in the sections above).

Having a subscription would also allow not to have a fiber, because it is simply an async request sent via netbox with a callback to invoke on each message.
Deal with multiple routers
Speaking of multiple routers - the config discovery module must not be a global singleton, like the storage refs and the storage scheduler. It must allow creating 'config discovery' objects stored in each router object individually.
Instance deletion
If the config is going to be discovered in pieces from each individual node, there is another question in addition to the above ones - how to discover that an instance or a replicaset has been deleted?
I see 2 signs:
When both conditions meet, it could be considered deleted. And when all instances of a replicaset are deleted like that, the replicaset is deleted too.
Summary about implementation
Since I tend to decide to implement #75 only, I think the best solution in terms of complexity vs future compatibility vs impact is to implement a function in the `replicaset` module. It would take the replicasets not having a master and send blocking requests to their nodes asking who the master is.

Either it finds the masters and all is fine, or it does not and returns an error, or it finds 2 or more masters for some replicasets and also returns an error.
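A rough sketch of such a function, under the assumptions that each replicaset object keeps netbox connections to its replicas (`rs.replicas`, `replica.conn`) and that the storage exposes an "are you a master?" style command; all names here are illustrative, not the final API:

```lua
-- Ask every replica of every master-less replicaset who the master is.
local function locate_masters(replicasets)
    local errors = {}
    for rs_uuid, rs in pairs(replicasets) do
        if rs.master == nil then
            local masters = {}
            for _, replica in pairs(rs.replicas) do
                -- Hypothetical storage command returning {is_master = <bool>}.
                local ok, res = pcall(replica.conn.call, replica.conn,
                                      'vshard.storage._call', {'info'},
                                      {timeout = 1})
                if ok and type(res) == 'table' and res.is_master then
                    table.insert(masters, replica)
                end
            end
            if #masters == 1 then
                rs.master = masters[1]
            elseif #masters == 0 then
                errors[rs_uuid] = 'no master found'
            else
                -- Conflict: do nothing and raise an alarm, as chosen above.
                errors[rs_uuid] = 'multiple masters found'
            end
        end
    end
    if next(errors) == nil then
        return true
    end
    return nil, errors
end
```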
It can also send requests to the known master to check if they are still masters.
The router will call this function periodically in the failover fiber. The fiber will be woken up when router fails an RW request because some node is not a master.
In addition, I can add a hint to the "not a master" error on the storage side which will tell who is probably the real master. The storages should know that (normally).
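A storage-side sketch of such a hint; the error format here is simplified and the `M.*` field names are assumptions, not the real vshard error structure:

```lua
-- Returned by the storage when an RW request hits a non-master node.
local function make_non_master_error()
    return {
        code = 'NON_MASTER',
        replica_uuid = M.this_replica_uuid,
        replicaset_uuid = M.this_replicaset_uuid,
        -- The hint: who this node currently believes the master is.
        -- The router can switch to it without a full rediscovery.
        master_uuid = M.known_master_uuid,
    }
end
```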
Summary about everything
Only master discovery will be implemented, via the `master = 'auto'` option in the router config, per replicaset.

It will fetch the "are you a master?" attribute from all the storages to locate the master and to discover the old master's demotion.
The implementation will live in the `replicaset` module as just a function doing some netbox calls (probably in parallel via `is_async`) in an attempt to find masters.

The router will call the master discovery function periodically in its failover fiber. It will also wake the failover fiber up to hurry things along in case it fails an RW request with the "not a master" error.
The storage, with the "not a master" error, will send a hint about who might be the current master. If the router gets the hint, it won't need to wake up the "big" master discovery.
Future work
It is important to think about where this is all going, so as not to shake the public API too much. As I see it, in the future even the storages might not need to know the config of all the other storages in advance. They might discover it just like the routers will - via a centric storage or in parts.
For that the user might create a special replicaset with no buckets, which stores only its own config. The routers and storages see only this special replicaset in their configs. When they start, they register themselves there and download the topology of the other replicasets. This way, via the special config-shard, all the instances will discover the cluster.
The API on the storage would look the same as on the routers. Moreover, a storage does not need to know its entire replicaset. It needs only a majority of nodes so that they would bootstrap into one replicaset UUID. After bootstrap it is enough to see just one node, and the others can be discovered.
Discovery can use many places as a source. In the scope of this RFC only the configs were considered. But in the future it is possible to use SWIM + broadcast, and to fetch the nodes from `box.cfg.replication` and `box.info.replication`.

The closest milestone is #209 - full config discovery on the routers.
FAQ
What if I have several instances configured as masters?
They might have different configs, each saying this node is a master. Then the router, according to the currently chosen solution, will not use any of them as a master, and will raise an alarm.
Open questions
None.