### What happened?
Before a pod terminates, we mark it unready so that new connections are not routed to it. From that point on, only the nodes that NAT the Service ExternalIP to the pod IP still have the pod IP entry in their IPVS tables. If, during this window, the node that performed the NAT of ExternalIP to the pod goes down, there is no way to reach the terminating pod.
### What did you expect to happen?
Even if other nodes go down, as long as the pod has not terminated there should be a way to reach it.
### How can we reproduce the behavior you experienced?
- Create a cluster with 2 nodes which are in two different regions.
- Create a Service with DSR and Maglev (`mh`) enabled:

```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    kube-router.io/service.dsr: "tunnel"
    kube-router.io/service.scheduler: "mh"
    kube-router.io/service.schedflags: "flag-1,flag-2"
```
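For reference, a complete manifest consistent with the outputs below might look like this sketch; the name, port, and ExternalIP come from the `kubectl` output further down, while the selector label is an assumption:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: debian-server-lb
  annotations:
    kube-router.io/service.dsr: "tunnel"
    kube-router.io/service.scheduler: "mh"
    kube-router.io/service.schedflags: "flag-1,flag-2"
spec:
  externalIPs:
    - 199.27.151.10        # advertised via --advertise-external-ip=true
  selector:
    app: debian-server     # assumed label on the debian-server pods
  ports:
    - port: 8099
      targetPort: 8099
```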
- There are 3 pods behind this Service, all running on `eqx-sjc-kubenode1-staging`:
```console
root@gce-del-km-staging-anupam:~/anupam/manifests $ kubectl get svc,endpoints
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP     PORT(S)    AGE
service/debian-server-lb   ClusterIP   192.168.97.188   199.27.151.10   8099/TCP   6d7h

NAME                         ENDPOINTS                                      AGE
endpoints/debian-server-lb   10.36.0.3:8099,10.36.0.5:8099,10.36.0.6:8099   6d7h

root@gce-del-km-staging-anupam:~/anupam/manifests $ kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE    IP          NODE
debian-server-8b5467777-cbwt2   1/1     Running   0          18m    10.36.0.6   eqx-sjc-kubenode1-staging
debian-server-8b5467777-vts6l   1/1     Running   0          2d5h   10.36.0.3   eqx-sjc-kubenode1-staging
debian-server-8b5467777-wxfrv   1/1     Running   0          19m    10.36.0.5   eqx-sjc-kubenode1-staging
```
- IPVS entries are successfully applied by kube-router on both nodes:
```console
root@eqx-sjc-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
  -> 10.36.0.6:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
  -> 10.36.0.6:8099               Tunnel  1      0          0
```

```console
root@tlx-dal-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
  -> 10.36.0.6:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
  -> 10.36.0.6:8099               Tunnel  1      1          0
```
- In all 3 pods, start a TCP server on port 8099 using `nc -lv 0.0.0.0 8099`.
- From a client that is closer to `tlx-dal-kubenode1-staging`, create a session using `nc <service-ip> 8099`.
- Make a pod unready. This keeps the pod IP entry in IPVS on `tlx-dal-kubenode1-staging` only:

```console
NAME                            READY   STATUS    RESTARTS   AGE    IP          NODE
debian-server-8b5467777-cbwt2   0/1     Running   0          18m    10.36.0.6   eqx-sjc-kubenode1-staging
debian-server-8b5467777-vts6l   1/1     Running   0          2d5h   10.36.0.3   eqx-sjc-kubenode1-staging
debian-server-8b5467777-wxfrv   1/1     Running   0          19m    10.36.0.5   eqx-sjc-kubenode1-staging
```
```console
root@tlx-dal-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
  -> 10.36.0.6:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
  -> 10.36.0.6:8099               Tunnel  0      1          0

root@tlx-dal-kubenode1-staging:~/anupam/kr-ecv $ ipvsadm -Lcn
IPVS connection entries
pro expire state       source               virtual             destination
TCP 14:58  ESTABLISHED 103.35.125.24:41876  199.27.151.10:8099  10.36.0.6:8099
```

```console
root@eqx-sjc-kubenode1-staging:~/anupam/kr-ecv $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
```
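The report does not say how the pod was made unready. A minimal sketch of one way to do it without killing the `nc` process, assuming you can add a file-based readiness probe to the debian-server container:

```yaml
# Hypothetical probe on the debian-server container: the pod flips to
# 0/1 (unready) as soon as the marker file disappears.
readinessProbe:
  exec:
    command: ["test", "-f", "/tmp/ready"]
  periodSeconds: 2
  failureThreshold: 1
```

```sh
# Remove the marker while the TCP session stays established
kubectl exec debian-server-8b5467777-cbwt2 -- rm /tmp/ready
```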
- Shut down `tlx-dal-kubenode1-staging`. Now the connection is completely broken.
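To make the last step concrete, here is one way to take the node down and confirm the breakage; the exact commands are an assumption, not from the original report:

```sh
# On tlx-dal-kubenode1-staging: simulate the node going down
systemctl poweroff

# The client's established nc session now stalls: the connection entry
# lived only on tlx-dal-kubenode1-staging, and on the surviving node the
# unready endpoint 10.36.0.6 has already been removed from IPVS.
ipvsadm -L -n   # run on eqx-sjc-kubenode1-staging
```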
### System Information (please complete the following information)
- Kube-Router Version (`kube-router --version`): 2.5.0, built on 2025-02-14T20:20:43Z, go1.23.6
- Kube-Router Parameters:

```
--kubeconfig=/usr/local/kube-router/kube-router.kubeconfig
--run-router=true
--run-firewall=true
--run-service-proxy=true
--v=3
--peer-router-ips=103.35.124.1
--peer-router-asns=65322
--cluster-asn=65321
--enable-ibgp=false
--enable-overlay=false
--bgp-graceful-restart=true
--bgp-graceful-restart-deferral-time=30s
--bgp-graceful-restart-time=5m
--advertise-external-ip=true
--ipvs-graceful-termination
--runtime-endpoint=unix:///run/containerd/containerd.sock
--enable-ipv6=true
--routes-sync-period=1m0s
--iptables-sync-period=1m0s
--ipvs-sync-period=1m0s
--hairpin-mode=true
--advertise-pod-cidr=true
```

- Kubernetes Version (`kubectl version`): 1.29.14
- Cloud Type: on premise
- Kubernetes Deployment Type: manual
- Kube-Router Deployment Type: on host
- Cluster Size: 2 nodes
- Kernel Version: 5.10.0-34-amd64