The smallest schedulable unit in Kubernetes — Pod
Overview
As I mentioned in the previous blog, many of the technologies Kubernetes relies on come from the operating system. For example, a container is a specific process that cooperates with cgroups and Linux namespaces in the OS. However, there is another essential OS concept used in Kubernetes: the process group. A process group contains several processes, and those processes have a super affinity relationship with each other. The process group is implemented as the Pod in Kubernetes. Moreover, cgroups and Linux namespaces can already serve a whole group of processes, not just a single one.
Why we need the Pod (container group) in Kubernetes
Imagine a scenario of scheduling multiple processes that belong to the same group:
// Group foo
// There is a special scheduling requirement:
// A and B have to be scheduled onto the same node.
Process A -- 10 CPU
Process B -- 10 CPU
Process C -- 5 CPU

// Nodes
Node A -- 15 CPU
Node B -- 20 CPU
If Process A and Process C were scheduled onto Node A first, the scheduler wouldn't know how to place Process B: nearly all of Node A's CPU has already been consumed by Process A and Process C, yet the scheduling rule requires that Process A and Process B run on the same node. That is a significant problem; our scheduler seems to have picked a lousy scheduling strategy.
The root cause of this situation is that the scheduler places processes belonging to the same group separately. There is a super affinity relationship between Process A and Process B, so we should handle them as a whole. And if you replace Process with container, the same problem happens in the Kubernetes world.
Until now, you might think a new scheduling strategy could solve this problem, and that creating a special new concept in Kubernetes is unnecessary. If you have the same confusion as me, let's go back to the process in the operating system.
When I run the pstree -g command on my development VM, part of the output looks like this:
├─nvim(163523)─┬─gopls(163556)─┬─{gopls}(163556)
│              │               ├─{gopls}(163556)
│              │               ├─{gopls}(163556)
│              │               ├─{gopls}(163556)
│              │               ├─{gopls}(163556)
│              │               └─{gopls}(163556)
│              ├─node(163533)─┬─gopls(163533)─┬─{gopls}(163533)
│              │              │               ├─{gopls}(163533)
│              │              │               ├─{gopls}(163533)
│              │              │               └─{gopls}(163533)
│              │              ├─{node}(163533)
│              │              ├─{node}(163533)
│              │              └─{node}(163533)
│              └─{nvim}(163523)
You can see that many processes cooperate to make nvim work well, and each process also needs multiple threads that cooperate. The different processes (and threads) play different roles, but their target is the same: finishing all the tasks of the program being run on the OS.
Although the Docker community advises developers to take the single container pattern as a standard when containerizing applications, we cannot implement a production-level service within one binary file. It must include many containers that cooperate to serve customers. So, how do we handle the affinity relationship among multiple containers while obeying the container development pattern at the same time? The answer is: the Pod.
The Pod is a good proposal; it solves both problems above: the scheduling strategy and the affinity relationship. The Pod is a logical concept in Kubernetes.
How to implement the Pod in Kubernetes?
For scheduling strategy
For the scheduling strategy of the Pod, Kubernetes defines a pod template with many fields to organize multiple containers. Every container can specify its own requests and limits for computing resources, which means the pod's requirement for computing resources is the sum of its containers' requirements.
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: db
    image: mysql
    env:
    - name: MYSQL_ROOT_PASSWORD
      value: "password"
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
  - name: wp
    image: wordpress
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
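If you apply this spec, the scheduler picks one node with enough room for the summed requests of both containers (500m CPU and 128Mi memory here). A quick way to verify it (a sketch; it assumes the spec is saved as frontend.yaml and kubectl points at a running cluster):

kubectl apply -f frontend.yaml
# both containers land on the same node, chosen for the pod as a whole
kubectl get pod frontend -o wide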
For container design pattern
The container design pattern usually means that multiple containers need to share some resources in order to cooperate and serve users well. So the first problem we need to solve is creating an environment in which multiple containers can share the network, volumes, and so on. A direct thought is to implement it at the Docker level by specifying volume and network parameters when a container starts:
docker run --volumes-from=web-server --net=container:web-server mysql
If we implement it as above, the web-server container needs to start earlier than mysql, which breaks the equal relationship among the containers belonging to the same pod. So we need another container that is always started first. We call it the infra container.
Infra container
Principle
I drew a graph to help us understand the infra container.
(Figure: http://o6sfmikvw.bkt.clouddn.com/729CEF17-3EB5-4DD9-BDA6-FD8D927E6A9E.png)
Containers within the pod usually share volume and network resources. For the network, the infra container creates its own Linux network namespace first, and the MySQL and web-server containers are added into that namespace when they are created. That means they can use localhost to communicate with each other. If you understand the mount namespace and the infra container, you know that volumes are implemented at the pod level (the infra-container level): we can mount any volume into the infra container, and as long as MySQL and web-server share the mount namespace with the infra container, they can see the same volume directory.
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  volumes:
  - name: share-data
    hostPath:
      path: /data
  containers:
  - name: db
    image: mysql
    volumeMounts:
    - name: share-data
      mountPath: /data
    env:
    - name: MYSQL_ROOT_PASSWORD
      value: "password"
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
  - name: wp
    image: wordpress
    volumeMounts:
    - name: share-data
      mountPath: /data
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
The example above creates a volume directory on the host machine and mounts it into the frontend pod. The volume directory can be seen by both the wordpress and mysql containers: any data one container writes to /data is visible to the other.
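A quick way to confirm the sharing (a sketch; the pod and container names come from the spec above):

# write a file from the db container into the shared volume...
kubectl exec frontend -c db -- sh -c 'echo hello > /data/msg'
# ...and read it back from the wp container
kubectl exec frontend -c wp -- cat /data/msg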
Implementation
If you run the docker ps command on a Kubernetes node, you will see several containers that use the same image: gcr.io/google_containers/pause.
(Figure: http://o6sfmikvw.bkt.clouddn.com/C55377DA-80EC-4730-86D5-FA0E5EB4D043.png)
The pause container's lifecycle is the same as the pod's. Google uses the pause container to implement the concept of the infra container; you can also develop your own. The pause container is usually responsible for two things:
- Creating a sharing environment for its child containers:
  1. Linux namespaces (e.g. the Linux network namespace)
  2. Volumes
- Enabling PID namespace sharing among the containers grouped by the pod, so that the pause container can act as the super parent process and help us reap orphan containers.
Actually, you can use the unshare command to create a process in new Linux namespaces:

sudo unshare --pid --uts --ipc --mount -f chroot rootfs /bin/sh
A child process usually inherits the namespaces of its parent, but a process created with the unshare command gets its own new Linux namespaces. Still, it is too complicated to implement the infra container with raw Linux commands.
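You can still observe what unshare does with a small experiment: compare the namespace inodes under /proc outside and inside the new namespace (a sketch; it requires root):

# the PID namespace of the current shell
readlink /proc/$$/ns/pid
# a child created by unshare reports a different inode, i.e. a new namespace
sudo unshare --pid --fork --mount-proc readlink /proc/self/ns/pid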
Apart from sharing namespaces and volumes, the pause container plays another important role: being the super parent process, responsible for reaping zombie processes.
A zombie process is a child process that has already finished all its tasks, but whose entry remains in the process table of the OS kernel because its parent process hasn't cleaned it up by calling the wait function.
Actually, there are two reasons that can cause a zombie process:
- The parent process forgets to call the wait system call.
- The parent process dies before the child process, and the new parent process either doesn't exist or doesn't call wait.
For the first reason, I think we cannot resolve it; it depends on the developers. For the second reason, we can implement the reaping function in the infra container: once the original parent cannot clean up a child process running in the same pod, the super parent process in the infra container recycles it. To implement this function, we need to enable PID namespace sharing among the containers in the same pod.
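In current Kubernetes versions, this switch is exposed as a pod-level field named shareProcessNamespace. A minimal sketch (the pod name and image are just examples):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: shared-pid
spec:
  shareProcessNamespace: true   # pause becomes PID 1 for every container
  containers:
  - name: app
    image: nginx
EOF

With this field set, a process listing inside any container of the pod shows /pause as PID 1, and it reaps orphaned processes for the whole pod.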
Here is the source code of Google's pause container:
/*
Copyright 2016 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void sigdown(int signo) {
  psignal(signo, "Shutting down, got signal");
  exit(0);
}

static void sigreap(int signo) {
  while (waitpid(-1, NULL, WNOHANG) > 0)
    ;
}

int main() {
  if (getpid() != 1)
    /* Not an error because pause sees use outside of infra containers. */
    fprintf(stderr, "Warning: pause should be the first process\n");

  if (sigaction(SIGINT, &(struct sigaction){.sa_handler = sigdown}, NULL) < 0)
    return 1;
  if (sigaction(SIGTERM, &(struct sigaction){.sa_handler = sigdown}, NULL) < 0)
    return 2;
  if (sigaction(SIGCHLD, &(struct sigaction){.sa_handler = sigreap,
                                             .sa_flags = SA_NOCLDSTOP},
                NULL) < 0)
    return 3;

  for (;;)
    pause();
  fprintf(stderr, "Error: infinite loop terminated\n");
  return 42;
}
Through the code above, you can see that pause becomes the first (super parent) process of the pod. When it catches the SIGCHLD signal, it calls waitpid to reap the terminated children. Most importantly, the pause process calls pause() to wait for signals in an infinite loop.
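You can also try the reaping behavior locally: compile the file and run it as PID 1 of a fresh PID namespace, so the getpid() != 1 warning is not printed (a sketch; it assumes the source is saved as pause.c and gcc is installed):

gcc -o pause pause.c
# --fork makes ./pause the first process (PID 1) in the new PID namespace
sudo unshare --pid --fork ./pause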
Init container
In addition to the infra container, there is another special container called the init container. It runs before the containers defined in the spec.containers section of the pod, so we can use this mechanism to do some preparation in advance, such as copying necessary files or performing setup scripts.
Besides, if you want to do some work right before a container starts or stops, you can use the container lifecycle hooks: Container Lifecycle Hooks - Kubernetes.
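As a small illustration of the init container mechanism (all names and images below are just examples), an init container can prepare a file on a shared volume before the main container starts:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:
  - name: setup                # must exit successfully before "app" starts
    image: busybox
    command: ['sh', '-c', 'echo ready > /work/flag']
    volumeMounts:
    - name: work
      mountPath: /work
  containers:
  - name: app
    image: busybox
    command: ['sh', '-c', 'cat /work/flag && sleep 3600']
    volumeMounts:
    - name: work
      mountPath: /work
  volumes:
  - name: work
    emptyDir: {}
EOF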
Examples of container design pattern
If you recall how container logs are usually collected, you will recognize another container design pattern: the sidecar.
(Figure: http://o6sfmikvw.bkt.clouddn.com/logging-with-sidecar-agent.png)
We deploy a logging-agent container within the pod as a sidecar container, and it shares a volume called logging with the app container. By default, a container writes its logs to stdout. Instead, we can use a logging SDK in our service to write logs into a file on the persistent volume shared with the logging-agent, and the logging-agent forwards the logs to a remote logging service such as Elasticsearch. The drawback is that we can no longer see the container's logs through kubectl logs. One approach to resolve this is GitHub - phusion/baseimage-docker: A minimal Ubuntu base image modified for Docker-friendliness. It includes a logging service, syslog-ng, which can redirect logs from the file to stdout. Generally, I prefer to find or develop a convenient base image with a small size for my services.
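A minimal sketch of this sidecar pattern (the images, names, and paths are illustrative; a real agent would be fluentd, filebeat, and so on):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: app-with-logging-sidecar
spec:
  containers:
  - name: app
    image: busybox
    command: ['sh', '-c', 'while true; do date >> /var/log/app.log; sleep 5; done']
    volumeMounts:
    - name: logging
      mountPath: /var/log
  - name: logging-agent        # reads the same file via the shared volume
    image: busybox
    command: ['sh', '-c', 'tail -F /var/log/app.log']
    volumeMounts:
    - name: logging
      mountPath: /var/log
  volumes:
  - name: logging
    emptyDir: {}
EOF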
Implement the Pod by ourselves
Let's create a fake pod with the help of the pause container.
First of all, we can run a pause container:
docker run -d --name pause -p 8080:80 gcr.io/google_containers/pause-amd64:3.0
Then, we can run a blog application that shares resources with the pause container.
docker run -d --name ghost --net=container:pause --ipc=container:pause --pid=container:pause ghost
When we run the command docker inspect <ghost_container_id>, we can see part of the config:
1 "Config": {
2 "Hostname": "52784ef438f9",
3 "Domainname": "",
4 "User": "",
5 "AttachStdin": false,
6 "AttachStdout": false,
7 "AttachStderr": false,
8 "ExposedPorts": {
9 "2368/tcp": {}
10 },
11 "Tty": false,
12 "OpenStdin": false,
13 "StdinOnce": false,
14 "Env": [
15 "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
16 "NODE_VERSION=10.17.0",
17 "YARN_VERSION=1.19.1",
18 "GOSU_VERSION=1.10",
19 "NODE_ENV=production",
20 "GHOST_CLI_VERSION=1.13.1",
21 "GHOST_INSTALL=/var/lib/ghost",
22 "GHOST_CONTENT=/var/lib/ghost/content",
23 "GHOST_VERSION=3.0.3"
That means we can access the blog through port 2368. But we want to run another nginx container within the same Linux network namespace as the ghost container; it is a straightforward way to help us observe the shared network namespace.
docker run -d --name nginx -v `pwd`/nginx.conf:/etc/nginx/nginx.conf --net=container:pause --ipc=container:pause --pid=container:pause nginx
The nginx config is as below:
error_log stderr;
events { worker_connections 1024; }
http {
    access_log /dev/stdout combined;
    server {
        listen 80 default_server;
        server_name example.com www.example.com;
        location / {
            proxy_pass http://127.0.0.1:2368;
        }
    }
}
Now we can visit http://localhost:8080 to access the ghost blog.
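A quick check from the host (a sketch):

# port 8080 is published by the pause container; nginx listens on 80 inside
# the shared network namespace and proxies to ghost on 127.0.0.1:2368
curl -I http://localhost:8080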
(Figure: http://o6sfmikvw.bkt.clouddn.com/2D37EECB-3FE2-4329-B55A-F7B332771CE2.png)
The sharing topology among the containers is as below:
- Pause <-> Ghost
- Pause <-> Nginx
- Nginx <-> Ghost
When you can see the home page of the Ghost blog through the Nginx server's port, you know those containers share the same Linux namespaces.
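You can also confirm this from the host by comparing the containers' namespace inodes (a sketch; it requires root on the docker host):

for c in pause ghost nginx; do
  pid=$(docker inspect -f '{{.State.Pid}}' "$c")
  # all three iterations should print the same net namespace inode
  sudo readlink "/proc/$pid/ns/net"
done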
(Figure: http://o6sfmikvw.bkt.clouddn.com/pause_container.png)
Managing the lifecycles and relationships of containers by hand is too complex, so Kubernetes creates the Pod to help us do that.
Summary
Finally, after reading this blog, I want to tell you three things:
- It is better to understand any new concept of Kubernetes from the operating system perspective, because Kubernetes relies on many fundamental technologies of the operating system.
- Even in a concept as small and basic as the Pod, there are many pieces of knowledge to learn.
- Don't dig too deep into one particular concept; knowledge has no end. Knowing what it is, why it exists, and how it works is enough.