This directory hosts information about and configuration files for our Kubernetes cluster. The cluster has three namespaces:
- production: contains all our production services.
- staging: a small mirror of production. It has less data and fewer compute resources, and may have its data reset at any time. It's designed for more open, loose access so people can test new code against it. We occasionally deploy new releases to staging before production in order to test them, but usually we deploy to both and control which features are turned on with environment variables instead.
- kube-system: contains infrastructural services that need to work across the cluster, like tools for logs and metrics.
Configuration files are stored in directories with the same name as the namespace.
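For orientation, the layout looks roughly like this. Only api-deployment.yaml is referenced later in this README, so treat the file name as an example rather than a complete listing:

```
production/    # configurations for the production namespace (e.g. api-deployment.yaml)
staging/       # configurations for the staging namespace (e.g. api-deployment.yaml)
kube-system/   # configurations for cluster-wide infrastructure
```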
See 0-setup-guide for a guide to setting up a new cluster from scratch.
Our deployment needs access to some sensitive data, like credentials, ARNs, etc. We keep configurations for those in a separate secure repository, but they are still Kubernetes configurations that can be deployed the same way the rest of the configurations here are. You can find example versions of these files named example.*.yaml
in this directory.
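As a rough illustration of what a Kubernetes configuration for sensitive data can look like (this is not one of our actual files, and ours may be structured differently), here is a standard Kubernetes Secret manifest with made-up names and values:

```yaml
# Hypothetical sketch only; real names, keys, and values live in the secure repository.
apiVersion: v1
kind: Secret
metadata:
  name: rails-secrets        # made-up name
  namespace: production
type: Opaque
stringData:
  DATABASE_URL: postgres://user:password@db.example.com:5432/app  # placeholder value
```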
We use the kubectl
CLI client to update the resources in the cluster with new configurations. You’ll need to configure it on your computer with the address and keys for our cluster. Common workflows:
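Before starting any of these, it can be worth a quick check that kubectl is pointed at the right cluster. These are standard kubectl commands, nothing specific to our setup:

```sh
> kubectl config current-context
> kubectl cluster-info
```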
Let’s say you want to deploy release c5a9e6de994392f23967cef3b2b916ac4d6ca87d
of web-monitoring-db:
- Find the deployment configuration files in production and staging and update them. In this example, you'd edit:

  …and update the spec.template.spec.containers.image property to the new correct tag:

  ```yaml
  spec:
    containers:
      - name: rails-server
        image: envirodgi/db-rails-server:c5a9e6de994392f23967cef3b2b916ac4d6ca87d
        imagePullPolicy: Always
  ```
- Use kubectl replace -f <config-file-path> or kubectl apply -f <config-file-path> to deploy to the cluster:

  ```sh
  > kubectl replace -f staging/api-deployment.yaml
  > kubectl replace -f production/api-deployment.yaml
  ```
- Keep an eye on deployment progress with kubectl get pods (there's also a one-command shortcut noted just after this list):

  ```sh
  > kubectl get pods --namespace production
  NAME                             READY     STATUS    RESTARTS   AGE
  api-74f8dcf857-j9cpc             1/1       Running   0          16d
  api-74f8dcf857-k29qh             1/1       Running   1          16d
  diffing-5b4f87c9d6-bhk25         1/1       Running   0          1d
  diffing-5b4f87c9d6-dqr4b         1/1       Running   0          1d
  import-worker-85578ccd65-c2bwr   1/1       Running   0          16d
  import-worker-85578ccd65-wvqnt   1/1       Running   3          16d
  redis-master-6c46f6cfb8-gh94g    1/1       Running   0          152d
  redis-slave-584c66c5b5-kqbtt     1/1       Running   1          187d
  redis-slave-584c66c5b5-psstd     1/1       Running   0          152d
  ui-7ff6d77fdb-rkmhr              1/1       Running   1          41d
  ui-7ff6d77fdb-s8ddp              1/1       Running   0          41d
  ```
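If you'd rather run a single command that waits until the rollout is complete, kubectl rollout status can do that. This is just an optional shortcut; it assumes the deployment is named api (inferred from the pod names above), so double-check the name in the config file:

```sh
> kubectl rollout status deployment/api --namespace production
```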
Kubernetes works by keeping an eye on everything in the cluster and constantly updating it to make sure it stays in spec with the configuration. Most of the time, Kubernetes will automatically restart a pod that is stuck in a bad state.
If you need to reset a pod manually for some reason, the easiest way is just to delete it! Kubernetes will then create a new one in its place so that the right number of pods is running:
- List the pods and find the one you want:

  ```sh
  > kubectl get pods --namespace production
  NAME                             READY     STATUS    RESTARTS   AGE
  api-74f8dcf857-j9cpc             1/1       Running   0          16d
  api-74f8dcf857-k29qh             1/1       Running   1          16d
  diffing-5b4f87c9d6-bhk25         1/1       Running   0          1d
  diffing-5b4f87c9d6-dqr4b         1/1       Running   0          1d
  import-worker-85578ccd65-c2bwr   1/1       Running   0          16d
  import-worker-85578ccd65-wvqnt   1/1       Running   3          16d
  redis-master-6c46f6cfb8-gh94g    1/1       Running   0          152d
  redis-slave-584c66c5b5-kqbtt     1/1       Running   1          187d
  redis-slave-584c66c5b5-psstd     1/1       Running   0          152d
  ui-7ff6d77fdb-rkmhr              1/1       Running   1          41d
  ui-7ff6d77fdb-s8ddp              1/1       Running   0          41d
  ```
- Delete the pod. Sometimes this command can take a while:

  ```sh
  > kubectl delete pod --namespace production diffing-5b4f87c9d6-dqr4b
  pod "diffing-5b4f87c9d6-dqr4b" deleted
  ```
- Keep an eye on the recreation progress with kubectl get pods, as shown below.
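To avoid re-running that command by hand, kubectl get also accepts a --watch flag, which keeps the listing open and prints updates as pod states change:

```sh
> kubectl get pods --namespace production --watch
```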
Kubernetes CronJobs work by creating Jobs on a set schedule. You can check the CronJobs with:

```sh
> kubectl get cronjobs
```

And check the concrete jobs they've created with:

```sh
> kubectl get jobs
```

You'll also see pods created for the different jobs when you run kubectl get pods.
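We don't normally need to run these jobs by hand, but if you ever want to kick one off outside its schedule, kubectl can create a one-off Job from a CronJob. The names here are placeholders; substitute the real CronJob name from kubectl get cronjobs:

```sh
> kubectl create job <one-off-job-name> --namespace production --from=cronjob/<cronjob-name>
```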
Get job run times. kubectl doesn't give durations (oddly), so this jq snippet will do that:

```sh
> kubectl get jobs --namespace production --output json | jq '.items[] | (.status.completionTime | fromdateiso8601) as $completion | (.status.startTime | fromdateiso8601) as $start | {name: .metadata.name, date: .status.startTime, duration: ($completion - $start)}'
```
Kubernetes has new releases every few months, and it's important to keep our cluster running a reasonably up-to-date version, both to address security issues and to make future upgrades straightforward. We use kOps to manage our Kubernetes install on AWS. kOps does not always update as frequently as Kubernetes itself, but the kOps team does a pretty good job of making things reliable, so we tend to stick with Kubernetes versions that production releases of kOps support.
First, get yourself a copy of kOps and keep it up to date. If you're on a Mac, Homebrew is the easiest way to get it:

```sh
$ brew update && brew install kops
```
Then, to update the cluster, follow the instructions at: https://kops.sigs.k8s.io/operations/updates_and_upgrades/#upgrading-kubernetes
Some things to keep in mind:
- You'll probably want to set up the KOPS_STATE_STORE environment variable. It points to the S3 bucket where our cluster state information is stored (contact another team member for the bucket name):

  ```sh
  $ export KOPS_STATE_STORE='s3://<name_of_s3_bucket>'
  ```
- NEVER update across multiple major releases (e.g. don't update from Kubernetes 1.21.x to 1.23.x). Doing this in the past has frequently caused issues. If there have been multiple major releases, follow the "manual update" instructions all the way through once for each major version. Verify the cluster is in good shape and everything is running correctly (can you reach the API? can you view the UI?) between each one.
- The kops rolling-update cluster command can take a really long time, anywhere from 10 to 40 minutes. That's OK! To see progress in slightly more detail, you can keep an eye on EC2 instances in the AWS web console (you can see them launching and shutting down there before they might be visible to kOps).
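For reference, the flow in the linked kOps docs boils down to roughly the sequence below. Treat it as a sketch rather than the authoritative steps (follow the docs for the current flags), and substitute our actual cluster name:

```sh
$ kops upgrade cluster <cluster-name> --yes         # bump the Kubernetes version in the cluster spec
$ kops update cluster <cluster-name> --yes          # push the updated configuration to AWS
$ kops rolling-update cluster <cluster-name> --yes  # replace nodes one at a time on the new version
$ kops validate cluster <cluster-name>              # confirm the cluster came back healthy
```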