|`etcd_request_duration_seconds`| Etcd request latency in seconds for each operation and object type. |
|`etcd_db_total_size_in_bytes` or <br />`apiserver_storage_db_total_size_in_bytes` (starting with EKS v1.26) or <br />`apiserver_storage_size_bytes` (starting with EKS v1.28) | Etcd database size. |

Consider using the [Kubernetes Monitoring Overview Dashboard](https://grafana.com/grafana/dashboards/14623) to visualize and monitor Kubernetes API server request and latency metrics as well as etcd latency metrics.

The following Prometheus query can be used to monitor the current size of etcd. The query assumes there is a job called `kube-apiserver` for scraping metrics from the API server metrics endpoint and that the EKS version is below v1.26.
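A representative expression is sketched below; the exact query may differ in your environment, and newer EKS versions expose the metric under a different name, as noted in the table above.

```
# Representative query; on newer EKS versions use
# apiserver_storage_db_total_size_in_bytes or apiserver_storage_size_bytes
max(etcd_db_total_size_in_bytes{job="kube-apiserver"})
```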
## Handling Cluster Upgrades
Since April 2021, the Kubernetes release cycle has changed from four releases a year (once a quarter) to three releases a year. A new minor version (like 1.**21** or 1.**22**) is released approximately [every fifteen weeks](https://kubernetes.io/blog/2021/07/20/new-kubernetes-release-cadence/#what-s-changing-and-when). Starting with Kubernetes 1.19, each minor version is supported for approximately twelve months after it is first released. With the advent of Kubernetes v1.28, the compatibility skew between the control plane and worker nodes has expanded from n-2 to n-3 minor versions. To learn more, see [Best Practices for Cluster Upgrades](../../upgrades/index.md).
## Cluster Endpoint Connectivity
When working with Amazon EKS (Elastic Kubernetes Service), you may encounter connection timeouts or errors during events such as Kubernetes control plane scaling or patching. These events can cause the kube-apiserver instances to be replaced, potentially resulting in different IP addresses being returned when the FQDN is resolved. This section outlines best practices for Kubernetes API consumers to maintain reliable connectivity. Note: Implementing these best practices may require updates to client configurations or scripts so that they handle DNS re-resolution and retries effectively.

The main issue stems from DNS client-side caching and the potential for stale IP addresses of the EKS endpoint - a _public NLB for the public endpoint or X-ENI for the private endpoint_. When the kube-apiserver instances are replaced, the Fully Qualified Domain Name (FQDN) may resolve to new IP addresses. However, due to the DNS Time to Live (TTL) settings, which are set to 60 seconds in the AWS-managed Route 53 zone, clients may continue to use outdated IP addresses for a short period of time.
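You can observe this behavior by resolving the cluster endpoint yourself and inspecting the returned records and their TTL. A minimal sketch is shown below; the cluster name and the endpoint hostname are placeholders for your own values.

```
# Look up the API server endpoint for your cluster (placeholder cluster name)
aws eks describe-cluster --name my-cluster --query 'cluster.endpoint' --output text

# Resolve the endpoint (placeholder hostname) and inspect the A records and their TTL
dig +noall +answer EXAMPLE1234567890.gr7.us-east-1.eks.amazonaws.com
```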
To mitigate these issues, Kubernetes API consumers (such as kubectl, CI/CD pipelines, and custom applications) should implement the following best practices:

- Implement DNS re-resolution so that clients pick up the new endpoint IP addresses instead of reusing stale cached entries.
- Implement Retries with Backoff and Jitter. For background, see the AWS Builders' Library article [Timeouts, retries, and backoff with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/); a combined Python sketch appears after this list.
- Implement Client Timeouts. Set appropriate timeouts to prevent long-running requests from blocking your application. Be aware that some Kubernetes client libraries, particularly those generated by OpenAPI generators, may not allow setting custom timeouts easily.
- Example 1 with kubectl:
```
kubectl get pods --request-timeout 10s # default: no timeout
```
- Example 2 with Python: the [Kubernetes Python client provides a `_request_timeout` parameter](https://github.com/kubernetes-client/python/blob/release-30.0/kubernetes/client/api_client.py#L120).
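Below is a minimal Python sketch, assuming the official `kubernetes` client library, that combines a per-request timeout with retries using exponential backoff and full jitter; the function name and values are illustrative, not prescribed by this guide. Full jitter spreads retries out so that many clients that failed at the same moment do not hammer the newly replaced endpoints in lockstep.

```
import random
import time

from kubernetes import client, config


def list_pods_with_retries(namespace="default", max_attempts=5):
    """List pods, retrying likely-transient failures with backoff and jitter."""
    config.load_kube_config()  # use config.load_incluster_config() when running in a pod
    v1 = client.CoreV1Api()

    for attempt in range(max_attempts):
        try:
            # _request_timeout bounds how long a single API call may take
            return v1.list_namespaced_pod(namespace, _request_timeout=10)
        except Exception as exc:  # narrow to ApiException and connection errors in real code
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter, capped at 30 seconds
            delay = random.uniform(0, min(30, 2 ** attempt))
            print(f"Request failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```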
By implementing these best practices, you can significantly improve the reliability and resilience of your applications when interacting with the Kubernetes API. Remember to test these implementations thoroughly, especially under simulated failure conditions, to ensure they behave as expected during actual scaling or patching events.
## Running large clusters
EKS actively monitors the load on control plane instances and automatically scales them to ensure high performance. However, you should account for potential performance issues and limits within Kubernetes and quotas in AWS services when running large clusters.