How Kubernetes Deletes A Pod Gracefully

Overview

This blog introduces the graceful deletion process of a pod at the source code level. It may be a little complicated; if you just want to understand the process at a high level, you can refer to the official documentation.

Hey! My pod was stuck in Terminating state

During my on-call time, some users asked for help with pods that could not be deleted successfully from the Kubernetes cluster. Sometimes the pod was stuck in the Terminating state; sometimes "kubectl delete" had already returned, but the pod object still existed.

However, users don't always give me a chance to investigate. A "popular" workaround on the Internet tells them that "kubectl delete --force --grace-period=0" will help. Actually, force deletion is a dangerous operation if you don't understand its consequences. It may cause data inconsistency in the Kubernetes cluster, and it destroys the troubleshooting scene, so we may lose the chance to find an implicit but important issue.

The best approach to troubleshooting this problem is to first understand the expected deletion behavior, compare it with the actual behavior, and then fix the difference. So in this blog I want to introduce the graceful deletion process of Kubernetes and the mechanisms around it. I hope it helps you troubleshoot this kind of issue.

The deletion process

When you send a request to delete a pod object to the kube-apiserver component, via the RESTful API or the kubectl command, it is first handled by the StandardStorage of kube-apiserver. The primary responsibility of kube-apiserver is to maintain resource data in the etcd cluster; other components like kube-controller-manager and kube-scheduler just watch for data changes and move resources from the actual state to the desired state.
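To make the entry point concrete, here is a minimal sketch of what such a deletion request might look like when issued through client-go. The kubeconfig path, namespace, and pod name are assumptions made up for this example; the call is roughly equivalent to running "kubectl delete pod my-pod --grace-period=30".

package main

import (
    "context"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from a local kubeconfig (the path is an assumption for this example).
    config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatal(err)
    }

    // Ask kube-apiserver to delete the pod with an explicit 30-second grace period.
    grace := int64(30)
    err = clientset.CoreV1().Pods("default").Delete(context.TODO(), "my-pod",
        metav1.DeleteOptions{GracePeriodSeconds: &grace})
    if err != nil {
        log.Fatal(err)
    }
    log.Println("deletion request accepted; the pod will now terminate gracefully")
}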

The generic Store in the kube-apiserver code base implements all the methods of StandardStorage, and its Delete function is the entry point.

// Delete removes the item from storage.
func (e *Store) Delete(ctx context.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    ...
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)
    ...

First Round – from deletion request to kube-apiserver

In the rest.BeforeDelete function, kube-apiserver invokes the CheckGracefulDelete method to determine whether the resource you want to delete needs to be deleted gracefully. Any resource can implement its own graceful deletion logic through the RESTGracefulDeleteStrategy interface; currently, only the native Pod resource does.

func BeforeDelete(strategy RESTDeleteStrategy, ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) (graceful, gracefulPending bool, err error) {
    ...
    gracefulStrategy, ok := strategy.(RESTGracefulDeleteStrategy)
    if !ok {
        // If we're not deleting gracefully there's no point in updating Generation, as we won't update
        // the object before deleting it.
        return false, false, nil
    }
    // if the object is already being deleted, no need to update generation.
    if objectMeta.GetDeletionTimestamp() != nil {
        ...
    }

    if !gracefulStrategy.CheckGracefulDelete(ctx, obj, options) {
        return false, false, nil
    }
    ...
}

In the CheckGracefulDelete function, the graceful deletion period is calculated in different ways depending on the state of the pod. These cases can be divided into three types:

// RESTGracefulDeleteStrategy must be implemented by the registry that supports
// graceful deletion.
type RESTGracefulDeleteStrategy interface {
    // CheckGracefulDelete should return true if the object can be gracefully deleted and set
    // any default values on the DeleteOptions.
    CheckGracefulDelete(ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) bool
}

// CheckGracefulDelete allows a pod to be gracefully deleted. It updates the DeleteOptions to
// reflect the desired grace value.
func (podStrategy) CheckGracefulDelete(ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) bool {
    if options == nil {
        return false
    }
    pod := obj.(*api.Pod)
    period := int64(0)
    // user has specified a value
    if options.GracePeriodSeconds != nil {
        period = *options.GracePeriodSeconds
    } else {
        // use the default value if set, or deletes the pod immediately (0)
        if pod.Spec.TerminationGracePeriodSeconds != nil {
            period = *pod.Spec.TerminationGracePeriodSeconds
        }
    }
    // if the pod is not scheduled, delete immediately
    if len(pod.Spec.NodeName) == 0 {
        period = 0
    }
    // if the pod is already terminated, delete immediately
    if pod.Status.Phase == api.PodFailed || pod.Status.Phase == api.PodSucceeded {
        period = 0
    }
    // ensure the options and the pod are in sync
    options.GracePeriodSeconds = &period
    return true
}

Running Pod

It picks up the GracePeriodSeconds of DeleteOptions first. You can specify it through a kubectl argument or in the API request, for example "kubectl delete pod mypod --grace-period=10". If you don't specify it, pod.Spec.TerminationGracePeriodSeconds is used instead, which defaults to 30 seconds.

Pending Pod

The main thing kube-scheduler does for every pod is write the name of a schedulable node into pod.Spec.NodeName, so that the kubelet on that node can run the pod. So if pod.Spec.NodeName is empty, the pod has never run anywhere, and it is safe to delete it at once (a grace period of 0 means delete immediately).

Terminated Pod

A terminated pod (phase Failed or Succeeded) is handled almost the same way as a pending pod: it is not running, so it can be deleted immediately as well.

After the graceful deletion period is confirmed, kube-apiserver sets DeletionTimestamp and DeletionGracePeriodSeconds on the pod.

func BeforeDelete(strategy RESTDeleteStrategy, ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) (graceful, gracefulPending bool, err error) {
    ...
    if !gracefulStrategy.CheckGracefulDelete(ctx, obj, options) {
        return false, false, nil
    }
    ...
    now := metav1.NewTime(metav1.Now().Add(time.Second * time.Duration(*options.GracePeriodSeconds)))
    objectMeta.SetDeletionTimestamp(&now)
    objectMeta.SetDeletionGracePeriodSeconds(options.GracePeriodSeconds)
    ...
    return true, false, nil

kube-apiserver persists DeletionTimestamp and DeletionGracePeriodSeconds on the pod through the updateForGracefulDeletionAndFinalizers function.

// Delete removes the item from storage.
func (e *Store) Delete(ctx context.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    ...
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)
    ...
    if graceful || pendingFinalizers || shouldUpdateFinalizers {
        err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, obj)
    }

In the updateForGracefulDeletionAndFinalizers function, kube-apiserver not only updates DeletionTimestamp based on the graceful deletion time, but also updates the finalizers of the pod if new finalizers need to be added or removed.

A finalizer is another important concept in resource deletion. It is related to the garbage collector, cascading deletion, and ownerReferences. If a resource object supports cascading deletion, it must implement the GarbageCollectionDeleteStrategy interface. For now, you can think of finalizers as a set of pre-delete hooks: each one is processed and then removed from the list once its work is finished.

According to the logic of updateForGracefulDeletionAndFinalizers, the deletion of the pod is controlled by two factors:

  1. finalizers
  2. graceful time

Both of them require the resource object to implement the GarbageCollectionDeleteStrategy and RESTGracefulDeleteStrategy interfaces. Only when the finalizer list is empty and the graceful time is 0 can the pod be removed from storage; otherwise the deletion is blocked.

func (e *Store) updateForGracefulDeletionAndFinalizers(ctx context.Context, name, key string, options *metav1.DeleteOptions, preconditions storage.Preconditions, in runtime.Object) (err error, ignoreNotFound, deleteImmediately bool, out, lastExisting runtime.Object) {
    lastGraceful := int64(0)
    var pendingFinalizers bool
    out = e.NewFunc()
    err = e.Storage.GuaranteedUpdate(
        ctx,
        key,
        out,
        false, /* ignoreNotFound */
        &preconditions,
        storage.SimpleUpdate(func(existing runtime.Object) (runtime.Object, error) {
            graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, existing, options)
            if err != nil {
                return nil, err
            }
            if pendingGraceful {
                return nil, errAlreadyDeleting
            }

            // Add/remove the orphan finalizer as the options dictates.
            // Note that this occurs after checking pendingGraceful, so
            // finalizers cannot be updated via DeleteOptions if deletion has
            // started.
            existingAccessor, err := meta.Accessor(existing)
            if err != nil {
                return nil, err
            }
            needsUpdate, newFinalizers := deletionFinalizersForGarbageCollection(ctx, e, existingAccessor, options)
            if needsUpdate {
                existingAccessor.SetFinalizers(newFinalizers)
            }

            lastGraceful = *options.GracePeriodSeconds
            lastExisting = existing
            return existing, nil
        }),
        dryrun.IsDryRun(options.DryRun),
    )
    switch err {
    case nil:
        // If there are pending finalizers, we never delete the object immediately.
        if pendingFinalizers {
            return nil, false, false, out, lastExisting
        }
        if lastGraceful > 0 {
            return nil, false, false, out, lastExisting
        }
        return nil, true, true, out, lastExisting
    ...
}

After kube-apiserver has updated the DeletionTimestamp of the pod, deleteImmediately is returned as false and the Delete call exits early. The first round of the pod deletion is finished. In the next round, the kubelet takes the lead.

// Delete removes the item from storage.
func (e *Store) Delete(ctx context.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    ...
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)
    if err != nil {
        return nil, false, err
    }
    ...
    if graceful || pendingFinalizers || shouldUpdateFinalizers {
        err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, obj)
    }

    // !deleteImmediately covers all cases where err != nil. We keep both to be future-proof.
    if !deleteImmediately || err != nil {
        return out, false, err
    }
    ...

Second Round – from kubelet to kube-apiserver

First of all, you need to know that there is a function called syncLoop in the main loop of the kubelet. It watches for resource changes from multiple sources, the most common one being kube-apiserver, and performs reconciliation work based on those changes.

func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
    klog.Info("Starting kubelet main sync loop.")
    ...
    for {
        ...
        kl.syncLoopMonitor.Store(kl.clock.Now())
        if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
            break
        }
        kl.syncLoopMonitor.Store(kl.clock.Now())
    }
}

func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
    syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
    select {
    case u, open := <-configCh:
        if !open {
            klog.Errorf("Update channel is closed. Exiting the sync loop.")
            return false
        }

        switch u.Op {
        ...
        case kubetypes.DELETE:
            klog.V(2).Infof("SyncLoop (DELETE, %q): %q", u.Source, format.Pods(u.Pods))
            // DELETE is treated as a UPDATE because of graceful deletion.
            handler.HandlePodUpdates(u.Pods)
        ...
        }

Because of graceful deletion, a pod deletion is handled as an update event. Every pod event is handled by an object called podWorkers. It maintains a map of per-pod update channels and invokes the syncPodFn function to sync the latest change to the "pod". Note that this pod is not the resource object in kube-apiserver; it is the logical pod on the host where the kubelet runs.

func (p *podWorkers) managePodLoop(podUpdates <-chan UpdatePodOptions) {
    var lastSyncTime time.Time
    for update := range podUpdates {
        err := func() error {
            podUID := update.Pod.UID
            ...
            err = p.syncPodFn(syncPodOptions{
                mirrorPod:      update.MirrorPod,
                pod:            update.Pod,
                podStatus:      status,
                killPodOptions: update.KillPodOptions,
                updateType:     update.UpdateType,
            })

So, in the syncPod function (registered as syncPodFn), the pod is killed by the Kubelet.killPod function. Inside containerRuntime.KillPod, the containers of the pod are killed in parallel.

func (kl *Kubelet) syncPod(o syncPodOptions) error {
    if !runnable.Admit || pod.DeletionTimestamp != nil || apiPodStatus.Phase == v1.PodFailed {
        var syncErr error
        if err := kl.killPod(pod, nil, podStatus, nil); err != nil {
            ...
        }
    }
}

func (kl *Kubelet) killPod(pod *v1.Pod, runningPod *kubecontainer.Pod, status *kubecontainer.PodStatus, gracePeriodOverride *int64) error {
    ...
    // Call the container runtime KillPod method which stops all running containers of the pod
    if err := kl.containerRuntime.KillPod(pod, p, gracePeriodOverride); err != nil {
        return err
    }
    ...
}

When the container runtime manager kills a container, it first calculates the grace period based on DeletionGracePeriodSeconds and then tries to stop the container.

// killContainer kills a container through the following steps:
// * Run the pre-stop lifecycle hooks (if applicable).
// * Stop the container.
func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, gracePeriodOverride *int64) error {
    ...
    // From this point, pod and container must be non-nil.
    gracePeriod := int64(minimumGracePeriodInSeconds)
    switch {
    case pod.DeletionGracePeriodSeconds != nil:
        gracePeriod = *pod.DeletionGracePeriodSeconds
    case pod.Spec.TerminationGracePeriodSeconds != nil:
        gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
    }
    ...
    err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
    if err != nil {
        klog.Errorf("Container %q termination failed with gracePeriod %d: %v", containerID.String(), gracePeriod, err)
    } else {
        klog.V(3).Infof("Container %q exited normally", containerID.String())
    }

    m.containerRefManager.ClearRef(containerID)

    return err
}

Stopping a container is straightforward: the kubelet just invokes the StopContainer interface of the container runtime. In my environment the container runtime is Docker. Docker sends SIGTERM to the container first and, after the grace period has passed, finally sends SIGKILL.
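The practical consequence for application developers is that the process in the container gets the whole grace period between SIGTERM and SIGKILL to shut down cleanly. The following is a minimal sketch, not kubelet code, of a containerized Go process that makes use of this window; the short sleep is only a stand-in for real cleanup work.

package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    // Receive the SIGTERM that the container runtime delivers first.
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM)

    log.Println("serving...")
    <-sigCh

    // We are now inside the grace period: finish in-flight requests, flush
    // buffers, and close connections before SIGKILL arrives.
    log.Println("SIGTERM received, shutting down gracefully")
    time.Sleep(2 * time.Second) // stand-in for real cleanup
    os.Exit(0)
}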

func (r *RemoteRuntimeService) StopContainer(containerID string, timeout int64) error {
    // Use timeout + default timeout (2 minutes) as timeout to leave extra time
    // for SIGKILL container and request latency.
    t := r.timeout + time.Duration(timeout)*time.Second
    ctx, cancel := getContextWithTimeout(t)
    defer cancel()

    r.errorMapLock.Lock()
    delete(r.lastError, containerID)
    delete(r.errorPrinted, containerID)
    r.errorMapLock.Unlock()
    _, err := r.runtimeClient.StopContainer(ctx, &runtimeapi.StopContainerRequest{
        ContainerId: containerID,
        Timeout:     timeout,
    })
    if err != nil {
        klog.Errorf("StopContainer %q from runtime service failed: %v", containerID, err)
        return err
    }

    return nil
}

After all containers of the pod are cleaned up by the process above, the kubelet sends a deletion request for that pod again. This is done by an object called the pod status manager, which is responsible for syncing pod status from the kubelet to kube-apiserver. The only difference from the user's deletion request is that the grace period is 0, which means kube-apiserver can delete the pod from etcd storage at once.

// syncPod syncs the given status with the API server. The caller must not hold the lock.
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    if !m.needsUpdate(uid, status) {
        klog.V(1).Infof("Status for pod %q is up-to-date; skipping", uid)
        return
    }
    ...
    if m.canBeDeleted(pod, status.status) {
        deleteOptions := metav1.DeleteOptions{
            GracePeriodSeconds: new(int64),
            // Use the pod UID as the precondition for deletion to prevent deleting a
            // newly created pod with the same name and namespace.
            Preconditions: metav1.NewUIDPreconditions(string(pod.UID)),
        }
        err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, deleteOptions)
        if err != nil {
            klog.Warningf("Failed to delete status for pod %q: %v", format.Pod(pod), err)
            return
        }
        klog.V(3).Infof("Pod %q fully terminated and removed from etcd", format.Pod(pod))
        m.deletePodStatus(uid)
    }

Now the second round has ended, but the deletion process isn't finished yet. In the next round, kube-apiserver comes back as the leader.

Third Round – from kube-apiserver to etcd

The starting point of the third round is again the Delete method of StandardStorage in kube-apiserver, and it is almost the same as the first round. In the updateForGracefulDeletionAndFinalizers function, it checks two points:

  1. whether the finalizers of the resource object are empty
  2. whether GracePeriodSeconds is 0

When both requirements are satisfied, deleteImmediately is returned as true. This time the Delete method does not exit early: kube-apiserver finally deletes the pod from storage.

// Delete removes the item from storage.
func (e *Store) Delete(ctx context.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    ...
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)
    ...
    if graceful || pendingFinalizers || shouldUpdateFinalizers {
        err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, obj)
    }

    // the first round exits from here
    if !deleteImmediately || err != nil {
        return out, false, err
    }

    ...

    // the third round finally deletes the pod from storage for good
    klog.V(6).Infof("going to delete %s from registry: ", name)
    out = e.NewFunc()
    if err := e.Storage.Delete(ctx, key, out, &preconditions, dryrun.IsDryRun(options.DryRun)); err != nil {
        // Please refer to the place where we set ignoreNotFound for the reason
        // why we ignore the NotFound error.
        if storage.IsNotFound(err) && ignoreNotFound && lastExisting != nil {
            // The lastExisting object may not be the last state of the object
            // before its deletion, but it's the best approximation.
            out, err := e.finalizeDelete(ctx, lastExisting, true)
            return out, true, err
        }
        return nil, false, storeerr.InterpretDeleteError(err, qualifiedResource, name)
    }
    out, err = e.finalizeDelete(ctx, out, true)
    return out, true, err

About Finalizers

You can see that, besides the graceful time, the finalizers of a resource object are another important factor that can block the deletion process. Finalizers are usually set when the resource object is created, or added later by a controller. When a controller finds that the DeletionTimestamp of an object is not nil, it processes its deletion logic based on the finalizer names. Finalizers are just a string slice containing those names.
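To make the idea concrete, a controller that owns a finalizer usually follows the same pattern: when it sees a non-nil DeletionTimestamp, it performs its cleanup and then removes its own finalizer so the deletion can proceed. The sketch below only illustrates that pattern; the finalizer name and the releaseExternalResources function are hypothetical and not taken from any real controller.

package example

import (
    "context"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// myFinalizer is a hypothetical finalizer name used only for this sketch.
const myFinalizer = "example.com/cleanup"

// releaseExternalResources is a hypothetical cleanup step guarded by the finalizer.
func releaseExternalResources(pod *v1.Pod) error { return nil }

// reconcileDeletion sketches the typical finalizer handling in a controller.
func reconcileDeletion(ctx context.Context, client kubernetes.Interface, pod *v1.Pod) error {
    // The object is not being deleted yet, so the finalizer has nothing to do.
    if pod.DeletionTimestamp == nil {
        return nil
    }
    // Run the pre-delete work this finalizer stands for; keep the finalizer
    // (and return an error) so the controller retries if it fails.
    if err := releaseExternalResources(pod); err != nil {
        return err
    }
    // Remove our finalizer. Once the list is empty and the grace period is
    // over, kube-apiserver can finally remove the object from storage.
    var remaining []string
    for _, f := range pod.Finalizers {
        if f != myFinalizer {
            remaining = append(remaining, f)
        }
    }
    pod.Finalizers = remaining
    _, err := client.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{})
    return err
}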

In this blog I will not go into finalizers too deeply, because they belong to another big topic, garbage collection. I simply assume that the pod either has no finalizers or that all of its finalizers are removed successfully.

What can cause the deletion process to get stuck

According to the expected behavior of the graceful deletion process, we can conclude:

  1. If the graceful period is 0 and there are no finalizers, nothing can block the deletion process
  2. If the graceful period isn't 0, or some finalizers are left on the resource object, either of them may block the deletion process

For the finalizers part, there are two possibilities:

  1. Native controllers or custom controllers (for Custom Resources) don't process the deletion logic behind their finalizers, or that logic failed.
  2. The deletion logic behind the finalizers finished successfully, but the controller failed when it tried to remove the finalizers from the object.

For the graceful period, the trouble most likely lies with the kubelet (a small diagnostic sketch follows the list below).

  1. The kubelet got stuck when it invoked the container runtime to clean up the containers, so the pod status manager never sends the grace-period-0 deletion. There can be deeper reasons behind this, for example a volume of the pod that cannot be detached or unmounted, or a container process stuck in the uninterruptible state. You have to investigate the actual scenario.
  2. The kubelet cannot connect to kube-apiserver. This may happen before the kubelet receives the DELETE event for the pod, or after the kubelet has finished its cleanup work. In the former case, the kubelet never gets a chance to do the cleanup; in the latter case, it cannot sync the result back to kube-apiserver.
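When you actually face a stuck pod, the first step is to look at exactly these fields on the live object. Here is a small diagnostic sketch; the kubeconfig path, namespace, and pod name are assumptions for the example.

package main

import (
    "context"
    "fmt"
    "log"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        log.Fatal(err)
    }

    pod, err := clientset.CoreV1().Pods("default").Get(context.TODO(), "stuck-pod", metav1.GetOptions{})
    if err != nil {
        log.Fatal(err)
    }

    // A pod stays in Terminating as long as either factor blocks the deletion:
    // a non-empty finalizer list, or a kubelet that never confirms the cleanup
    // with a grace-period-0 delete.
    fmt.Println("finalizers:", pod.Finalizers)
    fmt.Println("nodeName:  ", pod.Spec.NodeName)
    if pod.DeletionTimestamp != nil {
        fmt.Println("deletionTimestamp:", pod.DeletionTimestamp.Time)
    }
    if pod.DeletionGracePeriodSeconds != nil {
        fmt.Println("deletionGracePeriodSeconds:", *pod.DeletionGracePeriodSeconds)
    }
}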

About Force Deletion

As I mentioned at the beginning of this blog, some people use a workaround like the one below to delete resource objects.

  1. remove the finalizers of the resource object they want to delete
  2. run the kubectl delete command with the --force and --grace-period=0 arguments (a sketch of the equivalent API calls follows this list)
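Under the hood, those two steps translate into an update that clears the finalizer list, followed by a delete with a zero grace period. The following is only a hedged sketch of the equivalent client-go calls; the clientset is passed in, and you should use something like this only after you have verified the node-side cleanup already happened.

package example

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// forceDeletePod mirrors the manual workaround: clear the finalizers, then
// delete with GracePeriodSeconds=0. This bypasses the normal cleanup, so only
// use it when you are sure the cleanup on the node has already finished.
func forceDeletePod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
    pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return err
    }

    // Step 1: remove all finalizers so nothing blocks the final removal.
    pod.Finalizers = nil
    if _, err := client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
        return err
    }

    // Step 2: ask kube-apiserver to remove the object from storage immediately.
    zero := int64(0)
    return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{GracePeriodSeconds: &zero})
}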

If you still want to use this workaround to fix a "deletion process is stuck" issue, you must first confirm that the cleanup work for the resource has actually finished. Otherwise, you will run into further data inconsistency problems; for example, if a volume of the previous pod wasn't unmounted successfully, your new pod cannot be started.

By the way, even if you use the --force and --grace-period=0 arguments, there is still a 2-second grace period for that pod, but it is usually too short to matter.

Reference

  1. https://zhuanlan.zhihu.com/p/161072336 (In Chinese)
  2. https://cizixs.com/2017/06/07/kubelet-source-code-analysis-part-2 (In Chinese)
  3. https://lwn.net/Articles/288056/
  4. https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination