How Kubernetes Deletes A Pod Gracefully
Overview
This blog introduces the graceful deletion process of a pod at the source code level. It may be a little complicated; if you just want to understand the process at a high level, you can refer to the official documentation.
Hey! My pod was stuck in Terminating state
During my on-call time, some users asked for my help with pods that could not be deleted successfully from the Kubernetes cluster. Sometimes the pod was stuck in the Terminating state; sometimes "kubectl delete" had already returned, but the pod object still existed.
But the user doesn't always give me a chance to fix it. A "popular" workaround on the Internet tells them that "kubectl delete --force --grace-period=0" can help. Actually, force deletion is a dangerous operation if you don't know its consequences. It may cause data inconsistency in the Kubernetes cluster. It also destroys the troubleshooting scene, so we may lose the chance to find a hidden but important issue.
The best approach to troubleshooting this problem is to know the expected deletion behavior first, compare it with the actual behavior, and then fix the difference. So, in this blog I want to introduce the graceful deletion process of Kubernetes and the related working mechanisms. I hope it helps you troubleshoot this kind of issue.
The deletion process
When you send a request to delete a pod object to the kube-apiserver component, through the RESTful API or the kubectl command, it is first handled by the StandardStorage layer of kube-apiserver. The primary responsibility of kube-apiserver is to maintain resource data in the etcd cluster; other components like kube-controller-manager and kube-scheduler just watch for data changes and move resources from the actual state to the desired state.
The Store type in the kube-apiserver code implements all methods of StandardStorage, and its Delete function is called first.
// Delete removes the item from storage.
func (e *Store) Delete(ctx context.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    ...
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)
    ...
First Round – from deletion request to kube-apiserver
The rest.BeforeDelete function invokes the CheckGracefulDelete method to determine whether the resource you want to delete needs to be deleted gracefully. Any resource can implement its own graceful deletion logic through the RESTGracefulDeleteStrategy interface; currently, only the native Kubernetes Pod resource does.
func BeforeDelete(strategy RESTDeleteStrategy, ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) (graceful, gracefulPending bool, err error) {
    ...
    gracefulStrategy, ok := strategy.(RESTGracefulDeleteStrategy)
    if !ok {
        // If we're not deleting gracefully there's no point in updating Generation, as we won't update
        // the object before deleting it.
        return false, false, nil
    }
    // if the object is already being deleted, no need to update generation.
    if objectMeta.GetDeletionTimestamp() != nil {
        ...
    }

    if !gracefulStrategy.CheckGracefulDelete(ctx, obj, options) {
        return false, false, nil
    }
    ...
}
In the CheckGracefulDelete function, the graceful deletion period is calculated in different ways depending on the state of the pod. These fall into three cases:
// RESTGracefulDeleteStrategy must be implemented by the registry that supports
// graceful deletion.
type RESTGracefulDeleteStrategy interface {
    // CheckGracefulDelete should return true if the object can be gracefully deleted and set
    // any default values on the DeleteOptions.
    CheckGracefulDelete(ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) bool
}

// CheckGracefulDelete allows a pod to be gracefully deleted. It updates the DeleteOptions to
// reflect the desired grace value.
func (podStrategy) CheckGracefulDelete(ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) bool {
    if options == nil {
        return false
    }
    pod := obj.(*api.Pod)
    period := int64(0)
    // user has specified a value
    if options.GracePeriodSeconds != nil {
        period = *options.GracePeriodSeconds
    } else {
        // use the default value if set, or deletes the pod immediately (0)
        if pod.Spec.TerminationGracePeriodSeconds != nil {
            period = *pod.Spec.TerminationGracePeriodSeconds
        }
    }
    // if the pod is not scheduled, delete immediately
    if len(pod.Spec.NodeName) == 0 {
        period = 0
    }
    // if the pod is already terminated, delete immediately
    if pod.Status.Phase == api.PodFailed || pod.Status.Phase == api.PodSucceeded {
        period = 0
    }
    // ensure the options and the pod are in sync
    options.GracePeriodSeconds = &period
    return true
}
Running Pod
It picks up the GracePeriodSeconds of DeleteOptions first. You can specify it with a kubectl argument or in the API request, for example "kubectl delete --grace-period=10" (a client-go sketch follows the three cases below). If you don't specify it, pod.Spec.TerminationGracePeriodSeconds is used, which is 30 seconds by default.
Pending Pod
The only thing kube-scheduler does for every pod is write the name of a schedulable node into pod.Spec.NodeName, so that the kubelet on that node can run the pod. So, if pod.Spec.NodeName is empty, the pod has never run, and it is safe to delete it at once (a grace period of 0 means delete immediately).
Terminated Pod
A terminated pod is handled almost the same way as a pending pod, because its containers are not running either.
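As mentioned in the Running Pod case, the grace period can also be set directly in an API request. Here is a minimal client-go sketch of doing that; the kubeconfig path, namespace, and pod name are placeholder assumptions, not values from this blog:
package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // The kubeconfig path, namespace, and pod name below are placeholders.
    config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // This plays the same role as "kubectl delete pod my-pod --grace-period=10":
    // the value ends up in DeleteOptions and is picked up by CheckGracefulDelete.
    grace := int64(10)
    err = clientset.CoreV1().Pods("default").Delete(context.TODO(), "my-pod",
        metav1.DeleteOptions{GracePeriodSeconds: &grace})
    if err != nil {
        panic(err)
    }
}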
After the graceful deletion period is confirmed, kube-apiserver sets DeletionTimestamp and DeletionGracePeriodSeconds on the pod.
func BeforeDelete(strategy RESTDeleteStrategy, ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) (graceful, gracefulPending bool, err error) {
    ...
    if !gracefulStrategy.CheckGracefulDelete(ctx, obj, options) {
        return false, false, nil
    }
    ...
    now := metav1.NewTime(metav1.Now().Add(time.Second * time.Duration(*options.GracePeriodSeconds)))
    objectMeta.SetDeletionTimestamp(&now)
    objectMeta.SetDeletionGracePeriodSeconds(options.GracePeriodSeconds)
    ...
    return true, false, nil
kube-apiserver then persists DeletionTimestamp and DeletionGracePeriodSeconds on the pod through the updateForGracefulDeletionAndFinalizers function.
// Delete removes the item from storage.
func (e *Store) Delete(ctx context.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    ...
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)
    ...
    if graceful || pendingFinalizers || shouldUpdateFinalizers {
        err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, obj)
    }
In the updateForGracefulDeletionAndFinalizers function, kube-apiserver not only updates the DeletionTimestamp based on the graceful deletion time, but also updates the finalizers of the pod if it finds new ones.
A finalizer is another important concept in resource deletion. It is related to the garbage collector, cascading deletion, and ownerReference. If a resource object supports cascading deletion, it must implement the GarbageCollectionDeleteStrategy interface. For now, you can think of finalizers as a set of pre-delete hooks: they are performed one by one and removed once they are finished.
According to the logic of updateForGracefulDeletionAndFinalizers, the deletion of the pod is controlled by two factors:
- finalizers
- graceful deletion time
Both require the resource object to implement the GarbageCollectionDeleteStrategy and RESTGracefulDeleteStrategy interfaces. Only when the finalizers list is empty and the graceful deletion time is 0 can the pod be deleted from storage; otherwise, the deletion is blocked.
func (e *Store) updateForGracefulDeletionAndFinalizers(ctx context.Context, name, key string, options *metav1.DeleteOptions, preconditions storage.Preconditions, in runtime.Object) (err error, ignoreNotFound, deleteImmediately bool, out, lastExisting runtime.Object) {
    lastGraceful := int64(0)
    var pendingFinalizers bool
    out = e.NewFunc()
    err = e.Storage.GuaranteedUpdate(
        ctx,
        key,
        out,
        false, /* ignoreNotFound */
        &preconditions,
        storage.SimpleUpdate(func(existing runtime.Object) (runtime.Object, error) {
            graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, existing, options)
            if err != nil {
                return nil, err
            }
            if pendingGraceful {
                return nil, errAlreadyDeleting
            }

            // Add/remove the orphan finalizer as the options dictates.
            // Note that this occurs after checking pendingGraceful, so
            // finalizers cannot be updated via DeleteOptions if deletion has
            // started.
            existingAccessor, err := meta.Accessor(existing)
            if err != nil {
                return nil, err
            }
            needsUpdate, newFinalizers := deletionFinalizersForGarbageCollection(ctx, e, existingAccessor, options)
            if needsUpdate {
                existingAccessor.SetFinalizers(newFinalizers)
            }

            lastGraceful = *options.GracePeriodSeconds
            lastExisting = existing
            return existing, nil
        }),
        dryrun.IsDryRun(options.DryRun),
    )
    switch err {
    case nil:
        // If there are pending finalizers, we never delete the object immediately.
        if pendingFinalizers {
            return nil, false, false, out, lastExisting
        }
        if lastGraceful > 0 {
            return nil, false, false, out, lastExisting
        }
        return nil, true, true, out, lastExisting
    ...
}
After kube-apiserver has updated the DeletionTimestamp of the pod, deleteImmediately is returned as false. The first round of the pod deletion is finished. In the next round, kubelet takes the lead.
// Delete removes the item from storage.
func (e *Store) Delete(ctx context.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    ...
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)
    if err != nil {
        return nil, false, err
    }
    ....
    if graceful || pendingFinalizers || shouldUpdateFinalizers {
        err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, obj)
    }

    // !deleteImmediately covers all cases where err != nil. We keep both to be future-proof.
    if !deleteImmediately || err != nil {
        return out, false, err
    }
    ...
Second Round – from kubelet to kube-apiserver
First of all, you need to know that there is a function called syncLoop in the main loop of kubelet. It watches for resource changes from multiple sources, the most common being kube-apiserver, and does reconciliation work based on those changes.
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
    klog.Info("Starting kubelet main sync loop.")
    ...
    for {
        ...
        kl.syncLoopMonitor.Store(kl.clock.Now())
        if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
            break
        }
        kl.syncLoopMonitor.Store(kl.clock.Now())
    }
}

func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
    syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
    select {
    case u, open := <-configCh:
        if !open {
            klog.Errorf("Update channel is closed. Exiting the sync loop.")
            return false
        }

        switch u.Op {
        ...
        case kubetypes.DELETE:
            klog.V(2).Infof("SyncLoop (DELETE, %q): %q", u.Source, format.Pods(u.Pods))
            // DELETE is treated as a UPDATE because of graceful deletion.
            handler.HandlePodUpdates(u.Pods)
            ....
        }
For pod deletion, because of graceful deletion, the event is handled as an update. Every pod event is handled by the podWorkers object. It maintains a podUpdates map and invokes the syncPodFn function to sync the latest change to the "pod". Note that this pod is not the resource object in kube-apiserver; it is the logical pod on the host where kubelet is running.
func (p *podWorkers) managePodLoop(podUpdates <-chan UpdatePodOptions) {
    var lastSyncTime time.Time
    for update := range podUpdates {
        err := func() error {
            podUID := update.Pod.UID
            ...
            err = p.syncPodFn(syncPodOptions{
                mirrorPod:      update.MirrorPod,
                pod:            update.Pod,
                podStatus:      status,
                killPodOptions: update.KillPodOptions,
                updateType:     update.UpdateType,
            })
So, in the syncPodFn function, the pod is killed by the kubelet.killPod function. Inside containerRuntime.KillPod, the containers of the pod are killed in parallel.
func (kl *Kubelet) syncPod(o syncPodOptions) error {
    if !runnable.Admit || pod.DeletionTimestamp != nil || apiPodStatus.Phase == v1.PodFailed {
        var syncErr error
        if err := kl.killPod(pod, nil, podStatus, nil); err != nil {
            ...
        }
    }
}

func (kl *Kubelet) killPod(pod *v1.Pod, runningPod *kubecontainer.Pod, status *kubecontainer.PodStatus, gracePeriodOverride *int64) error {
    ...
    // Call the container runtime KillPod method which stops all running containers of the pod
    if err := kl.containerRuntime.KillPod(pod, p, gracePeriodOverride); err != nil {
        return err
    }
    ...
}
When the container runtime manager kills the containers, it first calculates the grace period based on DeletionGracePeriodSeconds, and then tries to stop the containers.
// killContainer kills a container through the following steps:
// * Run the pre-stop lifecycle hooks (if applicable).
// * Stop the container.
func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, gracePeriodOverride *int64) error {
    ...

    // From this point, pod and container must be non-nil.
    gracePeriod := int64(minimumGracePeriodInSeconds)
    switch {
    case pod.DeletionGracePeriodSeconds != nil:
        gracePeriod = *pod.DeletionGracePeriodSeconds
    case pod.Spec.TerminationGracePeriodSeconds != nil:
        gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
    }
    ...
    err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
    if err != nil {
        klog.Errorf("Container %q termination failed with gracePeriod %d: %v", containerID.String(), gracePeriod, err)
    } else {
        klog.V(3).Infof("Container %q exited normally", containerID.String())
    }

    m.containerRefManager.ClearRef(containerID)

    return err
}
Stopping a container is straightforward: kubelet just invokes the Stop interface of the container runtime. In my environment, the container runtime is Docker. Docker sends SIGTERM to the container first and, after the grace period has passed, finally sends SIGKILL.
func (r *RemoteRuntimeService) StopContainer(containerID string, timeout int64) error {
    // Use timeout + default timeout (2 minutes) as timeout to leave extra time
    // for SIGKILL container and request latency.
    t := r.timeout + time.Duration(timeout)*time.Second
    ctx, cancel := getContextWithTimeout(t)
    defer cancel()

    r.errorMapLock.Lock()
    delete(r.lastError, containerID)
    delete(r.errorPrinted, containerID)
    r.errorMapLock.Unlock()
    _, err := r.runtimeClient.StopContainer(ctx, &runtimeapi.StopContainerRequest{
        ContainerId: containerID,
        Timeout:     timeout,
    })
    if err != nil {
        klog.Errorf("StopContainer %q from runtime service failed: %v", containerID, err)
        return err
    }

    return nil
}
After all containers of the pod have been cleaned up through the process above, kubelet sends the deletion request for that pod again. This is done by an object called the pod status manager, which is responsible for syncing pod status from kubelet to kube-apiserver. The only difference from the user's deletion request is that the grace period is 0, which means kube-apiserver can delete the pod from etcd storage at once.
// syncPod syncs the given status with the API server. The caller must not hold the lock.
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    if !m.needsUpdate(uid, status) {
        klog.V(1).Infof("Status for pod %q is up-to-date; skipping", uid)
        return
    }
    ...
    if m.canBeDeleted(pod, status.status) {
        deleteOptions := metav1.DeleteOptions{
            GracePeriodSeconds: new(int64),
            // Use the pod UID as the precondition for deletion to prevent deleting a
            // newly created pod with the same name and namespace.
            Preconditions: metav1.NewUIDPreconditions(string(pod.UID)),
        }
        err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, deleteOptions)
        if err != nil {
            klog.Warningf("Failed to delete status for pod %q: %v", format.Pod(pod), err)
            return
        }
        klog.V(3).Infof("Pod %q fully terminated and removed from etcd", format.Pod(pod))
        m.deletePodStatus(uid)
    }
Now the second round has ended, but the deletion process isn't finished yet. In the next round, kube-apiserver comes back as the leader.
Third Round – from kube-apiserver to etcd
The starting point of the third round is the Delete method of StandardStorage in kube-apiserver, almost the same as the first round. In the updateForGracefulDeletionAndFinalizers function, it checks two things:
- whether the finalizers of the resource object are empty
- whether GracePeriodSeconds is 0
When both conditions are satisfied, deleteImmediately is returned as true. This time the function does not exit early, and kube-apiserver finally deletes the pod from storage.
// Delete removes the item from storage.
func (e *Store) Delete(ctx context.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    ...
    graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)
    ...
    if graceful || pendingFinalizers || shouldUpdateFinalizers {
        err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, obj)
    }

    // the first round exits from here
    if !deleteImmediately || err != nil {
        return out, false, err
    }

    ...

    // the third round deletes the pod from storage for good
    klog.V(6).Infof("going to delete %s from registry: ", name)
    out = e.NewFunc()
    if err := e.Storage.Delete(ctx, key, out, &preconditions, dryrun.IsDryRun(options.DryRun)); err != nil {
        // Please refer to the place where we set ignoreNotFound for the reason
        // why we ignore the NotFound error.
        if storage.IsNotFound(err) && ignoreNotFound && lastExisting != nil {
            // The lastExisting object may not be the last state of the object
            // before its deletion, but it's the best approximation.
            out, err := e.finalizeDelete(ctx, lastExisting, true)
            return out, true, err
        }
        return nil, false, storeerr.InterpretDeleteError(err, qualifiedResource, name)
    }
    out, err = e.finalizeDelete(ctx, out, true)
    return out, true, err
About Finalizers
You can see that, besides the graceful deletion time, the finalizers of a resource object are another factor that can block the deletion process. Finalizers are usually set when the resource object is created by a controller. When the controller finds that the DeletionTimestamp of the object is not nil, it runs its deletion logic based on the finalizer names. Finalizers are just a string slice containing the names of the finalizers.
I will not go into much detail here, because finalizers are part of another big topic, garbage collection. For the rest of this blog, I simply assume there are no finalizers, or that all finalizers are removed successfully.
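To make the finalizer pattern a little more concrete, here is a minimal controller-side sketch, shown with a Pod only for simplicity. The finalizer name example.com/cleanup and the cleanupExternalResources helper are hypothetical placeholders for illustration, not anything from the Kubernetes code quoted above:
package example

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// myFinalizer is a hypothetical finalizer name used only for this sketch.
const myFinalizer = "example.com/cleanup"

// reconcilePodDeletion sketches the usual controller-side finalizer pattern.
func reconcilePodDeletion(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
    // Nothing to do until kube-apiserver has set DeletionTimestamp (first round above).
    if pod.DeletionTimestamp == nil {
        return nil
    }
    // Only act while our finalizer is still present on the object.
    found := false
    for _, f := range pod.Finalizers {
        if f == myFinalizer {
            found = true
            break
        }
    }
    if !found {
        return nil
    }
    // Placeholder for the controller-specific cleanup work.
    if err := cleanupExternalResources(ctx, pod); err != nil {
        // Keep the finalizer so the deletion stays blocked until cleanup succeeds.
        return err
    }
    // Remove our finalizer; once the list is empty and the grace period has
    // elapsed, kube-apiserver can finally remove the object from storage.
    remaining := make([]string, 0, len(pod.Finalizers))
    for _, f := range pod.Finalizers {
        if f != myFinalizer {
            remaining = append(remaining, f)
        }
    }
    pod.Finalizers = remaining
    _, err := client.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{})
    return err
}

// cleanupExternalResources is a stand-in for whatever the controller needs to
// clean up before the object is allowed to disappear.
func cleanupExternalResources(ctx context.Context, pod *corev1.Pod) error {
    return nil
}
If the cleanup keeps failing, the finalizer stays on the object, and the object stays in Terminating, which is exactly the symptom described at the beginning of this blog.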
What can cause the deletion process to get stuck
According to the expected behavior of the graceful deletion process, we know:
- If the grace period is 0 and there are no finalizers, nothing can block the deletion process.
- If the grace period isn't 0, or some finalizers are left on the resource object, either of them may block the deletion process.
For the finalizers part, there are two possibilities:
- Native controllers or custom controllers (for custom resources) don't process the deletion logic behind the finalizers, or the processing fails.
- The deletion logic of the finalizers finishes successfully, but the controller fails when it tries to remove the finalizers.
For the grace period part, the trouble most likely lies with kubelet:
- When kubelet invokes the container runtime to clean up the containers, it gets stuck, so the pod status manager never updates GracePeriodSeconds to 0. There are usually deeper reasons behind this, like a pod volume that cannot be detached or unmounted, or a container process stuck in the Uninterruptible state; we need to investigate the actual trouble scenario (see the inspection sketch after this list).
- Kubelet cannot connect to kube-apiserver. This may happen before kubelet receives the DELETE event of the pod, or after kubelet has finished the cleanup work. In the former case kubelet never gets a chance to do the cleanup; in the latter case kubelet cannot sync the result back to kube-apiserver.
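When you hit a stuck pod, a quick first step is to look at the fields that drive the process described above. Here is a small client-go sketch; inspectStuckPod is a made-up helper name, and the comments only restate what this blog covered:
package example

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// inspectStuckPod prints the fields that control graceful deletion for a pod
// that is stuck in Terminating. Namespace and name are supplied by the caller.
func inspectStuckPod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
    pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return err
    }
    // A non-nil DeletionTimestamp shows the first round (kube-apiserver) already happened.
    fmt.Println("DeletionTimestamp:", pod.DeletionTimestamp)
    // The grace period recorded by kube-apiserver during the first round.
    if pod.DeletionGracePeriodSeconds != nil {
        fmt.Println("DeletionGracePeriodSeconds:", *pod.DeletionGracePeriodSeconds)
    }
    // Any finalizer left here blocks the final removal from etcd.
    fmt.Println("Finalizers:", pod.Finalizers)
    // NodeName tells you which kubelet is responsible for the cleanup work.
    fmt.Println("NodeName:", pod.Spec.NodeName)
    return nil
}
If Finalizers is not empty, look at the controller that owns those finalizers; if the finalizers are gone but the pod is still there, look at the kubelet on the node named in NodeName.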
About Force Deletion
As I mentioned at the beginning of this blog, some people use a workaround like the one below to delete resource objects:
- remove the finalizers of the resource object they want to delete
- run the kubectl delete command with the --force and --grace-period=0 arguments
If you still want to use this workaround to fix a "deletion process stuck" issue, you must first confirm that the cleanup work for the resource you want to delete has really finished. Otherwise, you will run into further data inconsistency problems. For example, if the volume of the previous pod wasn't unmounted successfully, your new pod cannot start.
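For reference, this is roughly what the workaround does at the API level. It is a hedged sketch with a made-up forceRemovePod helper, not a recommended procedure, and it deliberately skips the safety mechanisms described in this blog:
package example

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

// forceRemovePod mirrors the manual workaround: clear the finalizers, then
// delete with a grace period of 0. Only use it after confirming the cleanup
// work on the node has really finished.
func forceRemovePod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
    // Clearing the finalizers removes the first blocker described above.
    patch := []byte(`{"metadata":{"finalizers":null}}`)
    if _, err := client.CoreV1().Pods(namespace).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
        return err
    }
    // A zero grace period removes the second blocker, so kube-apiserver can
    // delete the object from etcd without waiting for kubelet.
    zero := int64(0)
    return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{GracePeriodSeconds: &zero})
}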
By the way, even if you use the --force and --grace-period=0 arguments, there is still a 2-second grace period for that pod. But it's too short to matter.