Implement high availability control plane #1

Open
gothub opened this issue Jun 2, 2021 · 4 comments

gothub commented Jun 2, 2021

Maintenance tasks such as k8s upgrades, OS upgrades, and reconfigurations (disk, etc.) can require k8s nodes to be taken offline and rebooted.

Minimize k8s service disruptions when these maintenance tasks are performed by:

  • configuring a multi-master k8s cluster
  • implementing high availability services where possible
    • currently only a single pod or service instance of each of the following runs at any time for metadig:
      • metadig-controller
      • rabbitmq
      • metadig-nginx-controller
      • metadig-scheduler
      • metadig Postgres server
      • metadig-scorer
  • using appropriate k8s management tools to aid this process, such as draining worker nodes to prepare them for maintenance (see the sketch after this list)
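
For the node-draining step, this is a minimal sketch of the usual kubectl workflow; the node name k8s-node-1 is hypothetical:

```sh
# Mark the node unschedulable and evict its pods so they get rescheduled onto
# other workers; --ignore-daemonsets is required because DaemonSet pods cannot
# be evicted, and --delete-emptydir-data allows evicting pods that use emptyDir.
kubectl drain k8s-node-1 --ignore-daemonsets --delete-emptydir-data

# ... perform the k8s/OS upgrade or disk reconfiguration and reboot the node ...

# Return the node to the scheduling pool once maintenance is complete.
kubectl uncordon k8s-node-1
```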

This issue supersedes NCEAS/metadig-engine#287

gothub self-assigned this on Jun 2, 2021
gothub changed the title from "Develop a process to reboot k8s nodes with no service downtime" to "Implement high availability control plane" on Jun 14, 2021

gothub commented Jun 14, 2021

Some approaches to implementing a high availability control plane are detailed in the kubeadm HA considerations document: https://github.com/kubernetes/kubeadm/blob/master/docs/ha-considerations.md

This document discusses both external load balancing (e.g. HAProxy on external nodes) and software load balancing. In the latter configuration, keepalived and HAProxy run on the control plane nodes themselves, so an external load balancer is not required to switch control to a new active control plane node if the current primary becomes unavailable.

With either configuration (external or internal load balancing), extra nodes would need to be added to the cluster to act as the standby control plane nodes.
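
As a rough illustration of the software load balancing option, here is a sketch loosely following that document; the interface name, VIP, and node IPs are hypothetical, while the k8s-ctrl-* hostnames match the control plane VMs listed in a later comment:

```
# /etc/keepalived/keepalived.conf -- manages a floating VIP for the API server
vrrp_instance VI_1 {
    state MASTER                # BACKUP on the standby control plane nodes
    interface ens160            # hypothetical NIC name
    virtual_router_id 51
    priority 101                # standbys use a lower priority
    authentication {
        auth_type PASS
        auth_pass k8s-vip
    }
    virtual_ipaddress {
        10.0.0.100              # hypothetical VIP clients use to reach the API
    }
}

# /etc/haproxy/haproxy.cfg (excerpt) -- runs alongside keepalived on each
# control plane node and spreads API traffic across all three of them
frontend kube-apiserver
    bind *:8443
    mode tcp
    default_backend kube-apiserver-backend

backend kube-apiserver-backend
    mode tcp
    option tcp-check
    balance roundrobin
    server k8s-ctrl-1 10.0.0.11:6443 check
    server k8s-ctrl-2 10.0.0.12:6443 check
    server k8s-ctrl-3 10.0.0.13:6443 check
```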


gothub commented Jul 8, 2021

BTW - the link shown above (https://github.com/kubernetes/kubeadm/blob/master/docs/ha-considerations.md) describes using kubeadm to implement a three-control-node HA k8s cluster, with either a 'stacked' etcd cluster or, optionally, etcd nodes external to the cluster.
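
For reference, the stacked-etcd path in that document amounts to initializing the first control plane node against the load balanced API endpoint and then joining the remaining nodes as control plane members; a sketch, assuming a hypothetical DNS name for the load balanced endpoint:

```sh
# On the first control plane node: point the cluster at the load balanced API
# endpoint and upload the shared certificates so other control plane nodes can
# fetch them during join.
sudo kubeadm init --control-plane-endpoint "k8s-api.example.org:8443" --upload-certs

# On each additional control plane node (e.g. k8s-ctrl-2, k8s-ctrl-3): join as a
# control plane member using the token, CA cert hash, and certificate key that
# the init command prints.
sudo kubeadm join k8s-api.example.org:8443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <key>
```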

nickatnceas (Contributor) commented:

Two VMs, k8s-ctrl-2 and k8s-ctrl-3, have been provisioned for K8s over in https://github.nceas.ucsb.edu/NCEAS/Computing/issues/98

The physical host to VM layout of the control plane is:

host-ucsb-6: k8s-ctrl-1
host-ucsb-7: k8s-ctrl-2
host-ucsb-8: k8s-ctrl-3

nickatnceas (Contributor) commented:

In a Slack discussion we decided to set up backups for K8s and K8s-dev before converting our install to HA.

We may need to upgrade K8s before the HA changes, which in turn may require an OS upgrade on the existing controllers.
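
For the backup piece, the usual minimum for a kubeadm cluster is an etcd snapshot plus a copy of the PKI and static pod manifests; a sketch run on the current control plane node, where the /backup paths are hypothetical and the certificate paths are the kubeadm defaults:

```sh
# Snapshot etcd (the full cluster state) using the certificates kubeadm places
# under /etc/kubernetes/pki/etcd on the control plane node.
sudo ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

# The snapshot alone is not enough to rebuild a control plane node, so also
# keep a copy of the cluster PKI and the static pod manifests.
sudo tar czf /backup/k8s-pki-and-manifests.tar.gz /etc/kubernetes/pki /etc/kubernetes/manifests
```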
