
TCP doesn't terminate gracefully if node is down #1865

@anupamdialpad
What happened?

Before a pod terminates we make it unready so that new connections don't get routed to it. From that point on, only the nodes that already NATed the Service ExternalIP to the pod IP still have the pod IP entry in their IPVS tables. If, during this window, the node that did the NAT of the ExternalIP to the pod goes down, there is no way left to reach the terminating pod.
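
The "unready before termination" window described above doesn't need anything exotic to reproduce; a readiness probe plus a preStop hook is enough. The Deployment below is only an illustrative sketch (image, labels, probe path and sleep duration are assumptions, not our actual manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: debian-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: debian-server            # assumed label, for illustration only
  template:
    metadata:
      labels:
        app: debian-server
    spec:
      terminationGracePeriodSeconds: 120
      containers:
        - name: server
          image: debian:bookworm     # assumed; any image with netcat works
          command: ["/bin/sh", "-c", "touch /tmp/ready; while true; do nc -lv 0.0.0.0 8099; done"]
          ports:
            - containerPort: 8099
          readinessProbe:            # pod is "ready" only while /tmp/ready exists
            exec:
              command: ["cat", "/tmp/ready"]
            periodSeconds: 5
          lifecycle:
            preStop:                 # drop readiness first, keep serving existing connections
              exec:
                command: ["/bin/sh", "-c", "rm -f /tmp/ready && sleep 90"]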

What did you expect to happen?

Even if other nodes go down, as long as the pod has not actually terminated there should still be a way to reach it.

How can we reproduce the behavior you experienced?

  1. Create a cluster with 2 nodes that sit in two different regions.
  2. The Service has DSR and Maglev (mh) enabled (annotation excerpt below; a fuller example manifest follows it):
apiVersion: v1
kind: Service
metadata:
  annotations:
    kube-router.io/service.dsr: "tunnel"
    kube-router.io/service.scheduler: "mh"
    kube-router.io/service.schedflags: "flag-1,flag-2"
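
For completeness, a full Service manifest consistent with the kubectl output in the next step would look roughly like this (the selector is an assumption; name, external IP and port are taken from the outputs in this report):

apiVersion: v1
kind: Service
metadata:
  name: debian-server-lb
  annotations:
    kube-router.io/service.dsr: "tunnel"
    kube-router.io/service.scheduler: "mh"
    kube-router.io/service.schedflags: "flag-1,flag-2"   # shown by IPVS as mh-fallback,mh-port
spec:
  type: ClusterIP
  externalIPs:
    - 199.27.151.10
  selector:
    app: debian-server        # assumed label, matching the sketch above
  ports:
    - port: 8099
      targetPort: 8099
      protocol: TCP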
  3. There are 3 pods behind this service. All the pods are running on eqx-sjc-kubenode1-staging:
root@gce-del-km-staging-anupam:~/anupam/manifests $ kubectl get svc,endpoints
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP     PORT(S)    AGE
service/debian-server-lb   ClusterIP   192.168.97.188   199.27.151.10   8099/TCP   6d7h

NAME                         ENDPOINTS                                      AGE
endpoints/debian-server-lb   10.36.0.3:8099,10.36.0.5:8099,10.36.0.6:8099   6d7h

root@gce-del-km-staging-anupam:~/anupam/manifests $ kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE    IP              NODE
debian-server-8b5467777-cbwt2   1/1     Running   0          18m    10.36.0.6       eqx-sjc-kubenode1-staging 
debian-server-8b5467777-vts6l   1/1     Running   0          2d5h   10.36.0.3       eqx-sjc-kubenode1-staging
debian-server-8b5467777-wxfrv   1/1     Running   0          19m    10.36.0.5       eqx-sjc-kubenode1-staging 
  4. IPVS entries are successfully applied by kube-router on both nodes (see the note after the two outputs):
root@eqx-sjc-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn   
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0         
  -> 10.36.0.5:8099               Masq    1      0          0         
  -> 10.36.0.6:8099               Masq    1      0          0         
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0         
  -> 10.36.0.5:8099               Tunnel  1      0          0         
  -> 10.36.0.6:8099               Tunnel  1      0          0 

root@tlx-dal-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn       
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0         
  -> 10.36.0.5:8099               Masq    1      0          0         
  -> 10.36.0.6:8099               Masq    1      0          0         
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0         
  -> 10.36.0.5:8099               Tunnel  1      0          0         
  -> 10.36.0.6:8099               Tunnel  1      1          0   
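
The FWM 3754 entry above is the fwmark-based virtual service kube-router creates for the DSR external IP. To double-check that mapping on a node, something along these lines can be used; the exact rule layout depends on the kube-router version, so treat this as a rough pointer rather than the authoritative place to look:

# Which mangle rule marks traffic for the external IP (rule layout varies by version)
iptables -t mangle -S PREROUTING | grep 199.27.151.10

# List just the fwmark-based IPVS service and its tunnel destinations
ipvsadm -L -n -f 3754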
  5. In all 3 pods, start a TCP server on port 8099 using nc -lv 0.0.0.0 8099.
  6. From a client that is closer to tlx-dal-kubenode1-staging, open a session using nc <service-ip> 8099.
  7. Make one pod unready. This leaves the pod IP entry in IPVS only on tlx-dal-kubenode1-staging (at weight 0, kept for the existing connection); one way to flip a pod to unready is sketched after the outputs below:
NAME                            READY   STATUS    RESTARTS   AGE    IP              NODE
debian-server-8b5467777-cbwt2   0/1     Running   0          18m    10.36.0.6       eqx-sjc-kubenode1-staging 
debian-server-8b5467777-vts6l   1/1     Running   0          2d5h   10.36.0.3       eqx-sjc-kubenode1-staging
debian-server-8b5467777-wxfrv   1/1     Running   0          19m    10.36.0.5       eqx-sjc-kubenode1-staging 

root@tlx-dal-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn       
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0         
  -> 10.36.0.5:8099               Masq    1      0          0         
  -> 10.36.0.6:8099               Masq    1      0          0         
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0         
  -> 10.36.0.5:8099               Tunnel  1      0          0         
  -> 10.36.0.6:8099               Tunnel  0      1          0   

root@tlx-dal-kubenode1-staging:~/anupam/kr-ecv $ ipvsadm -Lcn 
IPVS connection entries
pro expire state       source             virtual            destination
TCP 14:58  ESTABLISHED 103.35.125.24:41876 199.27.151.10:8099 10.36.0.6:8099

root@eqx-sjc-kubenode1-staging:~/anupam/kr-ecv $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn  
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0         
  -> 10.36.0.5:8099               Masq    1      0          0         
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0         
  -> 10.36.0.5:8099               Tunnel  1      0          0   
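
One way to flip the pod to unready and watch the drain (the probe path follows the illustrative Deployment sketch near the top of this report, so it is an assumption):

# Break the readiness probe inside the pod (assumed /tmp/ready path from the sketch above)
kubectl exec debian-server-8b5467777-cbwt2 -- rm -f /tmp/ready

# Watch the endpoint drop out of the Service
kubectl get endpoints debian-server-lb -w

# On each node, watch the real-server list; the node holding the connection keeps 10.36.0.6 at weight 0
watch -n1 'ipvsadm -L -n'

# On the node that NATed the client, the tracked connection entry for the terminating pod
ipvsadm -Lcn | grep 10.36.0.6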
  8. Shut down tlx-dal-kubenode1-staging. Now the connection is completely broken.
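
To confirm where the client's packets end up after the shutdown, a quick capture on the surviving node helps (illustrative only; the client and external IPs are the ones from the outputs above):

# On eqx-sjc-kubenode1-staging: do packets for the old flow still arrive, and does anything answer?
tcpdump -ni any 'host 103.35.125.24 and port 8099'

# Does the surviving node have a connection entry for this flow at all?
ipvsadm -Lcn | grep 103.35.125.24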

System Information (please complete the following information)

  • Kube-Router Version (kube-router --version): 2.5.0, built on 2025-02-14T20:20:43Z, go1.23.6
  • Kube-Router Parameters:
--kubeconfig=/usr/local/kube-router/kube-router.kubeconfig 
--run-router=true 
--run-firewall=true 
--run-service-proxy=true 
--v=3 
--peer-router-ips=103.35.124.1 
--peer-router-asns=65322 
--cluster-asn=65321 
--enable-ibgp=false 
--enable-overlay=false 
--bgp-graceful-restart=true 
--bgp-graceful-restart-deferral-time=30s 
--bgp-graceful-restart-time=5m 
--advertise-external-ip=true 
--ipvs-graceful-termination 
--runtime-endpoint=unix:///run/containerd/containerd.sock 
--enable-ipv6=true 
--routes-sync-period=1m0s 
--iptables-sync-period=1m0s 
--ipvs-sync-period=1m0s 
--hairpin-mode=true 
--advertise-pod-cidr=true
  • Kubernetes Version (kubectl version): 1.29.14
  • Cloud Type: on premise
  • Kubernetes Deployment Type: manual
  • Kube-Router Deployment Type: on host
  • Cluster Size: 2 nodes
  • kernel version: 5.10.0-34-amd64
