- Kubernetes
- Kubernetes provides the following:
- Service discovery
- Horizontal scaling
- Load balancing
- Self-healing
- Leader election
- The Kubernetes API Server exposes the RESTful Kubernetes API. Engineers using the cluster and other Kubernetes components create objects via this API.
- The etcd distributed datastore persists the objects you create through the API, since the API Server itself is stateless. The API Server is the only component that talks to etcd.
- The Scheduler decides on which worker node each application instance should run.
- Controllers bring to life the objects you create through the API. Most of them simply create other objects, but some also communicate with external systems (for example, the cloud provider via its API).
Control loop
for {
  actual state = obtain object X's actual state in the cluster
  expected state = obtain object X's expected state in the cluster
  if actual state == expected state {
    do nothing
  } else {
    perform orchestration actions that move the actual state toward the expected state
  }
}
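The control loop above can be sketched in Python. This is only an illustration under simplifying assumptions: the "cluster" is a plain dictionary, the state is a replica count, and the names `reconcile` and `control_loop` are hypothetical. A real controller runs indefinitely and reads state through the API Server; the loop here is bounded so that it terminates.

```python
import time


def reconcile(actual: int, expected: int) -> int:
    """Return the replica delta: positive means create, negative means delete."""
    return expected - actual


def control_loop(cluster: dict, max_iterations: int = 10, interval: float = 0.0) -> int:
    """Move cluster['actual'] toward cluster['expected'] one step per iteration.

    Returns the number of iterations that performed an orchestration action.
    """
    for iteration in range(max_iterations):
        delta = reconcile(cluster["actual"], cluster["expected"])
        if delta == 0:
            # Actual state matches expected state: do nothing.
            return iteration
        # Orchestration action: create or delete exactly one replica.
        cluster["actual"] += 1 if delta > 0 else -1
        time.sleep(interval)
    return max_iterations


cluster = {"actual": 1, "expected": 3}
steps = control_loop(cluster)
print(steps, cluster)
```

Taking one step per iteration mirrors how a controller re-observes the cluster after each action instead of assuming its previous action succeeded.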
Types of controllers
$ cd kubernetes/pkg/controller/
$ ls -d */
deployment/ job/ podautoscaler/
cloud/ disruption/ namespace/
replicaset/ serviceaccount/ volume/
cronjob/ garbagecollector/ nodelifecycle/ replication/ statefulset/ daemon/
...
- The Kubelet, an agent that talks to the API server and manages the applications running on its node. It reports the status of these applications and the node via the API.
- The Container Runtime, which can be Docker or any other runtime compatible with Kubernetes. It runs your applications in containers as instructed by the Kubelet.
- The Kubernetes Service Proxy (Kube Proxy) load-balances network traffic between applications. Its name suggests that traffic flows through it, but that’s no longer the case.
- ReplicaSet: Consists of a replica count and a pod template.
- The Deployment controller operates on ReplicaSets rather than directly on pods.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx-set
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
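- For a Deployment:
- To support horizontal scaling, it modifies the replica count of the ReplicaSet it manages.
- To support rolling upgrades, it creates a new ReplicaSet and gradually shifts replicas to it. The UP-TO-DATE status reports how many replicas already run the latest pod template.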
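A rolling upgrade can be pictured as a Deployment shifting replicas from an old ReplicaSet to a new one. The sketch below is a simplified simulation, not the controller's actual algorithm. It assumes that new pods become ready instantly, and it borrows the names of the Deployment fields `maxSurge` and `maxUnavailable` as plain integers. The function name `rolling_update_steps` is hypothetical.

```python
def rolling_update_steps(replicas: int, max_surge: int = 1, max_unavailable: int = 1):
    """Return the (old, new) replica counts after each simulated step."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("max_surge and max_unavailable cannot both be 0")
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up the new ReplicaSet without exceeding replicas + max_surge pods in total.
        surge_room = replicas + max_surge - (old + new)
        new += max(0, min(replicas - new, surge_room))
        # Scale down the old ReplicaSet while keeping at least
        # replicas - max_unavailable pods running.
        removable = (old + new) - (replicas - max_unavailable)
        old -= max(0, min(old, removable))
        steps.append((old, new))
    return steps


for old, new in rolling_update_steps(3):
    print(f"old={old} new={new}")
```

In this model the "new" count is what the UP-TO-DATE status would report at each step.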
- Motivation: The Deployment controller assumes that all pods are stateless and interchangeable. However, distributed applications usually carry state.
- StatefulSet abstracts application state into two kinds:
- Topology status. For example:
- Application A must start before application B.
- When pods are recreated, they must keep the same network identifiers as before.
- Storage status. For example:
- Each application instance is bound to its own storage data, and a recreated pod must see the same data as before.
Internals
- A StatefulSet manages pods directly.
- Kubernetes numbers these pods and, through the headless service, generates a DNS record for each of them in the cluster DNS server. As long as the pod numbering stays unchanged, the DNS record names do not need to change.
- A StatefulSet allocates an independent PVC for each pod, and Kubernetes binds a PV to each PVC. As a result, even if a pod is deleted and recreated, the data on its volume is retained.
Headless service
Big picture
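- Service: A mechanism for exposing a group of pods to other applications under a stable address.
- There are two ways to reach a service:
- VIP: A virtual IP address that forwards requests to one of the backing pods.
- DNS: A domain name that resolves to an address. This comes in two kinds:
- Headless service: The domain name resolves directly to pod IP addresses.
- Normal service: The domain name resolves to the service's VIP.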
Example definition
- clusterIP is set to None. This means that after the service is created, it will not have a virtual IP address. All it has is a domain name.
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
- All pods represented by the headless service are selected by the label "app: nginx". Each pod gets a DNS record of the form:
<pod-name>.<svc-name>.<namespace>.svc.cluster.local
- How does a StatefulSet use this DNS record to maintain pod topology status?
- When the StatefulSet is created from the YAML below, its controller names each pod "statefulset name"-"ordinal index" (here web-0 and web-1).
- As long as the StatefulSet is not deleted, visiting web-0 always lands on application instance 0, and visiting web-1 always lands on application instance 1, even if the pods are recreated with new IP addresses.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.9.1
        ports:
        - containerPort: 80
          name: web
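To make the naming scheme concrete, the following sketch derives the stable DNS names for the pods of the StatefulSet above. It assumes the `default` namespace and the `cluster.local` cluster domain, neither of which is stated in the manifest. The helper name `stateful_pod_dns` is hypothetical.

```python
def stateful_pod_dns(statefulset: str, service: str, replicas: int,
                     namespace: str = "default",
                     cluster_domain: str = "cluster.local") -> list:
    """Build <pod-name>.<svc-name>.<namespace>.svc.<cluster-domain> for each ordinal."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


# StatefulSet "web" with serviceName "nginx" and 2 replicas.
for name in stateful_pod_dns("web", "nginx", 2):
    print(name)
```

Because the names depend only on the StatefulSet name, the ordinal, and the service name, they stay the same when a pod is recreated with a different IP address.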
Limitations of using volume
- Using a volume directly requires detailed knowledge of the storage system itself.
- For example, the following Ceph volume definition exposes this information:
- The Ceph storage user name, the storage server locations, and the authorization file location
- PersistentVolume / PersistentVolumeClaim to the rescue.
apiVersion: v1
kind: Pod
metadata:
  name: rbd
spec:
  containers:
  - image: kubernetes/pause
    name: rbd-rw
    volumeMounts:
    - name: rbdpd
      mountPath: /mnt/rbd
  volumes:
  - name: rbdpd
    rbd:
      monitors:
      - '10.16.154.78:6789'
      - '10.16.154.82:6789'
      - '10.16.154.83:6789'
      pool: kube
      image: foo
      fsType: ext4
      readOnly: true
      user: admin
      keyring: /etc/ceph/keyring
      imageformat: "2"
      imagefeatures: "layering"
Process
- Define a PVC that describes the volume attributes the pod needs, such as size and access mode.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pv-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
- Reference the PVC inside the pod.
apiVersion: v1
kind: Pod
metadata:
  name: pv-pod
spec:
  containers:
  - name: pv-container
    image: nginx
    ports:
    - containerPort: 80
      name: "http-server"
    volumeMounts:
    - mountPath: "/usr/share/nginx/html"
      name: pv-storage
  volumes:
  - name: pv-storage
    persistentVolumeClaim:
      claimName: pv-claim
- The PV that satisfies the claim is defined here:
kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv-volume
  labels:
    type: local
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  rbd:
    monitors:
    # Run kubectl get pods -n rook-ceph and take the IPs of the pods whose names start with rook-ceph-mon- to build the list below
    - '10.16.154.78:6789'
    - '10.16.154.82:6789'
    - '10.16.154.83:6789'
    pool: kube
    image: foo
    fsType: ext4
    readOnly: true
    user: admin
    keyring: /etc/ceph/keyring
Internals
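- A PVC is like an interface and a PV is its implementation: the pod states what storage it needs, and the PV carries the details of how that storage is provided.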
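The interface/implementation relationship can be illustrated with a simplified matching routine. This is a sketch under stated assumptions, not the real binding logic: it considers only capacity and access modes, and it picks the smallest volume that fits. The classes `PV` and `PVC` and the function `bind` are hypothetical names. Actual binding also takes attributes such as the storage class into account.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PV:
    name: str
    capacity_gi: int
    access_modes: frozenset
    claimed_by: Optional[str] = None  # name of the bound PVC, if any


@dataclass
class PVC:
    name: str
    request_gi: int
    access_modes: frozenset


def bind(pvc: PVC, volumes: List[PV]) -> Optional[PV]:
    """Bind the claim to the smallest unclaimed PV that satisfies it, or return None."""
    candidates = [
        pv for pv in volumes
        if pv.claimed_by is None
        and pv.capacity_gi >= pvc.request_gi
        and pvc.access_modes <= pv.access_modes  # every requested mode is offered
    ]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda pv: pv.capacity_gi)
    chosen.claimed_by = pvc.name
    return chosen


volumes = [PV("pv-volume", 10, frozenset({"ReadWriteOnce"}))]
claim = PVC("pv-claim", 1, frozenset({"ReadWriteOnce"}))
bound = bind(claim, volumes)
print(bound.name if bound else "unbound")
```

The pod only ever names `pv-claim`; which volume ends up behind it is decided by the matching step.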
- Pods
- Objects
ImagePullPolicy
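- The default value is IfNotPresent, which pulls the image only when it is not already present on the node. If the image tag is :latest or is omitted, the default becomes Always, and the image is pulled each time the pod is created.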
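The defaulting behavior for an omitted imagePullPolicy can be sketched as a small function. This is an illustration of the documented rule, not Kubernetes source code, and the function name `default_image_pull_policy` is hypothetical. It assumes image references of the form `[registry[:port]/]name[:tag][@digest]`.

```python
def default_image_pull_policy(image: str) -> str:
    """Return the policy Kubernetes would default to for this image reference."""
    if "@" in image:
        # Pinned by digest: the content cannot change, so no forced pull.
        return "IfNotPresent"
    # Look only at the last path segment so a registry port is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        # No tag means the implicit "latest" tag.
        return "Always"
    tag = last_segment.rsplit(":", 1)[1]
    return "Always" if tag == "latest" else "IfNotPresent"


for image in ["nginx", "nginx:latest", "nginx:1.7.9", "localhost:5000/nginx"]:
    print(image, "->", default_image_pull_policy(image))
```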
LifeCycle
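- For example
- PostStart: Runs immediately after the container is started. It is not guaranteed to run before the container's ENTRYPOINT.
- PreStop: Runs before the container is stopped. The hook is blocking: the container is stopped only after the hook completes or the termination grace period expires.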
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: lifecycle-demo-container
    image: nginx
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
      preStop:
        exec:
          command: ["/usr/sbin/nginx","-s","quit"]
- Secret: Used to store sensitive data such as database credentials.
- ConfigMap: Used to store configuration that does not need encryption.
- Downward API: Used to make the pod's own information accessible to the containers inside the pod.
- ServiceAccountToken: A special type of Secret that stores the access-control credentials a pod uses to talk to the Kubernetes API.
- Scheduling related containers one by one leads to the gang scheduling problem: how to orchestrate a group of containers that must be placed together.
- Mesos tries to solve it with resource hoarding, and Google Omega tries to solve it with optimistic locking.
- Kubernetes avoids the problem because the pod, not the container, is the smallest scheduling unit.
- Each container is a single process.
- Within a container, PID 1 is the application process itself, and all other processes are descendants of the PID 1 process.
- There can be many relationships between containers: exchanging files, communicating over localhost or a socket file, making frequent remote procedure calls, or sharing some Linux namespaces.
- A pod is only a logical concept: a group of containers with shared resources. All containers in a pod share the same network namespace and can share the same volumes.
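- Why can't a pod be realized with plain docker run commands?
- It creates a start-order dependency between containers. In the command below, container B must already be running before container A can start.
$ docker run --net=container:B --volumes-from=B --name=A image-A ...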
- Kubernetes uses an intermediate container instead: the infra container. The other containers are associated with each other by joining the infra container's namespaces.
- Infra container: Written in C and extremely lightweight. It uses a special container image called k8s.gcr.io/pause. It always stays in the paused state, and the image is well under 1 MB after decompression.
- Container design model: When users want to run multiple applications in one container, they should first consider whether the applications could be designed as multiple containers in a pod.
- All containers inside a pod share the same network namespace, so network-related configuration and management can be done once for the pod's namespace.
- Anything at the machine level (network, storage, security, orchestration) or at the Linux namespace level is defined on the pod rather than on individual containers.
Sample: War and webapp
- Problem: A Java web application is packaged as a war file, which must be placed under Tomcat's webapps directory.
- Attempts to solve this with Docker alone:
- Put the war under Tomcat's webapps directory in the image. Cons: The container image must be rebuilt whenever the war is upgraded.
- Reference the war from a volume. Cons: To make the war in the volume accessible to containers on every host, you need to maintain a distributed file system.
- Solution with a pod: the sidecar model. Build the war and Tomcat into separate container images and combine them inside a pod.
- Init containers run to completion, in the order they are listed, before the regular containers start.
apiVersion: v1
kind: Pod
metadata:
  name: javaweb-2
spec:
  initContainers:
  - image: geektime/sample:v2
    name: war
    command: ["cp", "/sample.war", "/app"]
    volumeMounts:
    - mountPath: /app
      name: app-volume
  containers:
  - image: geektime/tomcat:7.0
    name: tomcat
    command: ["sh","-c","/root/apache-tomcat-7.0.42-v2/bin/start.sh"]
    volumeMounts:
    - mountPath: /root/apache-tomcat-7.0.42-v2/webapps
      name: app-volume
    ports:
    - containerPort: 8080
      hostPort: 8001
  volumes:
  - name: app-volume
    emptyDir: {}
NodeSelector
- Use case: Associate a pod with a node.
# The pod can only run on a node that carries the label disktype: ssd.
apiVersion: v1
kind: Pod
...
spec:
  nodeSelector:
    disktype: ssd
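The matching rule behind nodeSelector is that every key/value pair in the selector must appear among the node's labels. A minimal sketch follows, with hypothetical node names and with `node_matches` and `eligible_nodes` as illustrative helper names:

```python
def node_matches(node_labels: dict, node_selector: dict) -> bool:
    """True when every selector key/value pair is present in the node's labels."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def eligible_nodes(nodes: dict, node_selector: dict) -> list:
    """Return the names of the nodes the pod is allowed to run on."""
    return [name for name, labels in nodes.items() if node_matches(labels, node_selector)]


# Hypothetical cluster: only node-a carries disktype=ssd.
nodes = {
    "node-a": {"disktype": "ssd", "zone": "z1"},
    "node-b": {"disktype": "hdd", "zone": "z1"},
    "node-c": {"zone": "z2"},
}
print(eligible_nodes(nodes, {"disktype": "ssd"}))
```

An empty selector matches every node, which corresponds to a pod that declares no nodeSelector.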
NodeName
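- Use case: Holds the name of the node the pod is assigned to. The scheduler normally sets this field; setting it by hand bypasses scheduling and is mostly useful for testing or debugging.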
HostAliases
- Use case: Define entries in the pod's hosts file (/etc/hosts).
# Define the aliases foo.remote and bar.remote for the IP address 10.1.2.3.
apiVersion: v1
kind: Pod
...
spec:
  hostAliases:
  - ip: "10.1.2.3"
    hostnames:
    - "foo.remote"
    - "bar.remote"
...
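The effect of hostAliases on the pod's hosts file can be approximated as follows. This is a sketch only: the function name `render_host_aliases` is hypothetical, and the exact whitespace the kubelet writes is an assumption here (one line per IP, with tab-separated hostnames).

```python
def render_host_aliases(host_aliases: list) -> str:
    """Render hostAliases entries as hosts-file lines: IP followed by its hostnames."""
    lines = []
    for alias in host_aliases:
        lines.append("\t".join([alias["ip"], *alias["hostnames"]]))
    return "\n".join(lines)


host_aliases = [{"ip": "10.1.2.3", "hostnames": ["foo.remote", "bar.remote"]}]
print(render_host_aliases(host_aliases))
```

Declaring the entries in the pod spec matters because the kubelet manages the hosts file, so manual edits inside a container can be overwritten.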
Namespace related
- Use case: Let the pod's containers share the host's network, IPC, and PID namespaces.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  hostNetwork: true
  hostIPC: true
  hostPID: true
  containers:
  - name: nginx
    image: nginx
  - name: shell
    image: busybox
    stdin: true
    tty: true
- Problem: In a multi-host environment, two containerized applications on different hosts might get the same IP address and then register duplicate entries in the service registry.
- Solution:
- Don't use the containers' IP addresses; use the physical machines' IP addresses instead. However, this requires containers to know the physical machines' IP addresses, which is a poor abstraction from an architecture perspective.
- Kubernetes uses a model similar to VXLAN and replaces docker0 with cni0. The reasons are:
- Kubernetes does not use Docker's CNM model.
- The first step in creating a pod is to create an infra container that holds the pod's network namespace.
- Within the CNI model
- All containers can use their own IP addresses to communicate with other containers, without NAT.
- All hosts can use their own IP addresses to communicate with all containers, and vice versa, without NAT.
- The IP address a container sees for itself is the same address that other containers and hosts see for it.