Implement high availability control plane #1

Open
gothub opened this issue Jun 2, 2021 · 4 comments

gothub commented Jun 2, 2021

Maintenance tasks such as k8s upgrades, OS upgrades, and reconfigurations (disk, etc.) can require k8s nodes to be taken offline and rebooted.

Minimize k8s service disruptions when these maintenance tasks are performed by:

  • configuring a multi-master k8s cluster
  • implementing high availability services where possible
    • currently only a single pod or service instance of each of the following runs at any time for metadig:
      • metadig-controller
      • rabbitmq
      • metadig-nginx-controller
      • metadig-scheduler
      • metadig Postgres server
      • metadig-scorer
  • using appropriate k8s management tools to aid this process, such as draining worker nodes to prepare them for maintenance (see the sketch after this list)
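
For the node-draining step, this is a minimal sketch of the usual kubectl workflow; the node name k8s-node-1 is hypothetical:

```sh
# Mark the node unschedulable and evict its pods so they get rescheduled onto
# other workers; --ignore-daemonsets is required because DaemonSet pods cannot
# be evicted, and --delete-emptydir-data allows evicting pods that use emptyDir.
kubectl drain k8s-node-1 --ignore-daemonsets --delete-emptydir-data

# ... perform the k8s/OS upgrade or disk reconfiguration and reboot the node ...

# Return the node to the scheduling pool once maintenance is complete.
kubectl uncordon k8s-node-1
```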

This issue supersedes NCEAS/metadig-engine#287

gothub self-assigned this on Jun 2, 2021
gothub changed the title from "Develop a process to reboot k8s nodes with no service downtime" to "Implement high availability control plane" on Jun 14, 2021

gothub commented Jun 14, 2021

Some approaches to implementing a high availability control plane are detailed in the kubeadm HA considerations document: https://github.com/kubernetes/kubeadm/blob/master/docs/ha-considerations.md

This document discusses both external load balancing (e.g. HAProxy on external nodes) and software load balancing. In the latter configuration, keepalived and HAProxy run on the control plane nodes themselves, so an external load balancer is not required to switch control to a new active control plane node if the current primary becomes unavailable.

With either configuration (external or internal load balancing), extra nodes would need to be added to the cluster to act as the standby control plane nodes.
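
As a rough illustration of the software load balancing option, here is a sketch loosely following that document; the interface name, VIP, and node IPs are hypothetical, while the k8s-ctrl-* hostnames match the control plane VMs listed in a later comment:

```
# /etc/keepalived/keepalived.conf -- manages a floating VIP for the API server
vrrp_instance VI_1 {
    state MASTER                # BACKUP on the standby control plane nodes
    interface ens160            # hypothetical NIC name
    virtual_router_id 51
    priority 101                # standbys use a lower priority
    authentication {
        auth_type PASS
        auth_pass k8s-vip
    }
    virtual_ipaddress {
        10.0.0.100              # hypothetical VIP clients use to reach the API
    }
}

# /etc/haproxy/haproxy.cfg (excerpt) -- runs alongside keepalived on each
# control plane node and spreads API traffic across all three of them
frontend kube-apiserver
    bind *:8443
    mode tcp
    default_backend kube-apiserver-backend

backend kube-apiserver-backend
    mode tcp
    option tcp-check
    balance roundrobin
    server k8s-ctrl-1 10.0.0.11:6443 check
    server k8s-ctrl-2 10.0.0.12:6443 check
    server k8s-ctrl-3 10.0.0.13:6443 check
```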


gothub commented Jul 8, 2021

BTW - the link shown above (https://github.com/kubernetes/kubeadm/blob/master/docs/ha-considerations.md) describes using kubeadm to implement a three-control-node HA k8s cluster, with either a 'stacked' etcd cluster or, optionally, etcd nodes external to the cluster.
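
For reference, the stacked-etcd path in that document amounts to initializing the first control plane node against the load balanced API endpoint and then joining the remaining nodes as control plane members; a sketch, assuming a hypothetical DNS name for the load balanced endpoint:

```sh
# On the first control plane node: point the cluster at the load balanced API
# endpoint and upload the shared certificates so other control plane nodes can
# fetch them during join.
sudo kubeadm init --control-plane-endpoint "k8s-api.example.org:8443" --upload-certs

# On each additional control plane node (e.g. k8s-ctrl-2, k8s-ctrl-3): join as a
# control plane member using the token, CA cert hash, and certificate key that
# the init command prints.
sudo kubeadm join k8s-api.example.org:8443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <key>
```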

nickatnceas (Contributor) commented:

Two VMs, k8s-ctrl-2 and k8s-ctrl-3, have been provisioned for K8s over in https://github.nceas.ucsb.edu/NCEAS/Computing/issues/98

The physical host to VM layout of the control plane is:

host-ucsb-6: k8s-ctrl-1
host-ucsb-7: k8s-ctrl-2
host-ucsb-8: k8s-ctrl-3

nickatnceas (Contributor) commented:

In a Slack discussion we decided to set up backups for K8s and K8s-dev before converting our install to HA.

We may need to upgrade K8s before the HA changes, which in turn may require an OS upgrade on the existing controllers.
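
For the backup piece, the usual minimum for a kubeadm cluster is an etcd snapshot plus a copy of the PKI and static pod manifests; a sketch run on the current control plane node, where the /backup paths are hypothetical and the certificate paths are the kubeadm defaults:

```sh
# Snapshot etcd (the full cluster state) using the certificates kubeadm places
# under /etc/kubernetes/pki/etcd on the control plane node.
sudo ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# The snapshot alone is not enough to rebuild a control plane node, so also
# keep a copy of the cluster PKI and the static pod manifests.
sudo tar czf /backup/k8s-pki-and-manifests.tar.gz /etc/kubernetes/pki /etc/kubernetes/manifests
```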
