Amazon EKS Add-ons are packaged Kubernetes operational software components that AWS can install and manage on your clusters. The EKS Community Add-ons are a set of AWS-validated open-source tools (currently metrics-server, kube-state-metrics, Prometheus node exporter, cert-manager and external-dns) that simplify cluster observability, security and operations. In addition, AWS and partners provide other important EKS add-ons such as the Node Monitoring Agent, SageMaker HyperPod Task Governance, Mountpoint for Amazon S3 (CSI driver), Network Flow Monitor and Pod Identity Agent. Each of these adds critical capabilities to an EKS cluster. Below we cover each add-on in detail: what it does, how it integrates with EKS, security and cost considerations, use cases and example configuration.
This blog filters out the noise, spotlighting the 10 most impactful EKS add-ons, mapped directly to the operational questions your team actually asks. You’ll learn when to use each, how to install them quickly and what hidden costs or gotchas to watch for.
Every EKS cluster ships with CoreDNS, kube-proxy and the VPC CNI, but real workloads demand more: autoscaling support, TLS renewal, workload-level IAM roles and log aggregation.
Amazon now curates and maintains several add-ons, offering:
Pre-validated compatibility with your Kubernetes version
One-click or IaC upgrades
Built-in patching and lifecycle support
But here’s the catch: more isn’t better. Overloading your cluster with unused add-ons adds:
CPU overhead
IAM sprawl
Surprise bills from CloudWatch, Route 53 and S3 traffic
Amazon provides the following five community add-ons for EKS. You install each via the EKS add-on APIs/console (eksctl or aws eks create-addon) and then use them like any other Kubernetes component. All are free, open-source software (no AWS charges beyond underlying resource usage).
The metrics-server collects resource metrics (CPU, memory) from each kubelet and exposes them through the Kubernetes Metrics API for autoscalers. In other words, it enables native HPA (Horizontal Pod Autoscaler) and VPA by supplying up-to-date pod/node usage stats.
Integration/Architecture: Installed as an EKS add-on (eksctl create addon --cluster myCluster --name metrics-server). It runs in the kube-system namespace and hooks into the API server. The metrics it provides feed HPA/VPA controllers and monitoring systems.
Security: The metrics-server runs with minimal privileges (only reads node stats) and uses TLS between components. It does not require AWS IAM roles. It’s important to secure the metrics API (e.g. with RBAC and network policies) because it exposes cluster-wide resource data.
Cost/Optimization: No AWS charges. It enables autoscaling based on real usage, helping prevent over-provisioning of nodes (saving EC2 costs) and scaling up when needed (improving utilization).
Use Cases: Use metrics-server to enable HPA/VPA on workloads, ensure services scale with load and feed dashboards/alerts. It is critical for any dynamic scaling strategy.
To install, run:
eksctl create addon --cluster myCluster --name metrics-server --version latest
After installation, the HPA can query /apis/metrics.k8s.io/v1beta1 to retrieve pod metrics.
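For instance, a minimal HorizontalPodAutoscaler that relies on metrics-server for CPU utilization could look like the following sketch (the target Deployment name web is a placeholder):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # placeholder Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
Without metrics-server installed, this HPA would report its metrics as unknown and never scale.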
The kube-state-metrics add-on exposes cluster object state (deployments, pods, nodes, etc.) as Prometheus metrics. Unlike metrics-server (which shows usage), kube-state-metrics exports the state of objects (count of pods, container restart counts, resource request/limits, labels, etc.). Monitoring tools like Prometheus/Grafana consume these metrics for deep cluster monitoring.
Integration/Architecture: Installed as an EKS add-on (eksctl create addon --cluster myCluster --name kube-state-metrics). It runs in its own namespace (kube-state-metrics by default) and watches the Kubernetes API to emit metrics via an HTTP endpoint. It is typically scraped by a Prometheus server.
Security: It has read-only access to the Kubernetes API and exposes an unauthenticated HTTP metrics endpoint. Protect it with Kubernetes RBAC (a cluster-reader role) and network policies so that only trusted systems (e.g. Prometheus) can access it. It does not involve AWS IAM.
Cost/Optimization: It’s open-source and runs on your nodes. By enabling detailed monitoring, it helps spot inefficient resource usage (like unused replicas) and informs rightsizing decisions, indirectly helping save costs.
Use Cases: Use kube-state-metrics to track replica counts, deployments status, pod health, resource usage requests vs limits and integrate with alerting (e.g. alert if any pod is constantly restarting). It’s essential for observability of Kubernetes state.
Example: Install with eksctl:
eksctl create addon --cluster myCluster --name kube-state-metrics --version latest
A Prometheus scrape job for it might then look like:
- job_name: 'kube-state-metrics'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app, __meta_kubernetes_pod_label_release]
      action: keep
      regex: kube-state-metrics;metrics-server
The prometheus-node-exporter add-on runs a host-level agent on every node to collect OS and hardware metrics (CPU load, disk I/O, network, etc.). It is a DaemonSet that exports node metrics in Prometheus format.
Integration/Architecture: Added via eksctl create addon --cluster myCluster --name prometheus-node-exporter. It installs a privileged DaemonSet on all Linux worker nodes (in the prometheus-node-exporter namespace) and exposes /metrics. Prometheus or the CloudWatch agent can scrape these metrics to monitor the underlying hosts.
Security: Because it runs with host privileges (to read /proc, etc.), it must be trusted. Only deploy on nodes where you trust the agent image. Secure the metrics endpoint with network policy or firewall so that only your monitoring system can access it.
Cost/Optimization: The agent itself is free and lightweight. It enables visibility into node-level resource utilization (CPU, memory, disk, network). You can use these insights to right-size instance types, identify bottlenecks or detect zombie processes, all of which help optimize cloud costs.
Use Cases: Monitor node health and capacity (e.g. disk space, CPU steal, NIC errors). Alert on OS-level conditions (disk filling up, memory exhaustion). Useful in benchmarks or large clusters to ensure nodes are not overtaxed.
Example: Install via eksctl:
eksctl create addon --cluster myCluster --name prometheus-node-exporter --version latest
Then configure Prometheus to scrape each node exporter’s /metrics endpoint on port 9100.
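A scrape job for it might look like the following sketch (the namespace is the add-on’s default; adjust if you deployed it elsewhere):
- job_name: 'node-exporter'
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names: ['prometheus-node-exporter']
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      action: keep
      regex: '9100'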
cert-manager automates TLS certificate issuance and renewal in Kubernetes. It integrates with ACME CAs (like Let’s Encrypt) and internal issuers to provision Secret objects with keys and certs, and it keeps Ingress resources and other consumers up to date.
Integration/Architecture: Added via eksctl create addon --cluster myCluster --name cert-manager. It deploys several controllers in the cert-manager namespace. Workflows typically involve creating a ClusterIssuer or Issuer resource that defines how to obtain certificates (e.g. via an ACME DNS or HTTP challenge, or from AWS Private CA via the aws-privateca-issuer plugin).
Security: cert-manager manages private keys and interacts with external CAs, so protect it carefully. Use RBAC to limit who can create Issuers or modify secrets. If using ACME, ensure DNS records are validated securely. You may need to give cert-manager an IAM role/credential (e.g. for Route53 DNS validation).
Cost/Optimization: The software is free. Automating certs reduces manual overhead and avoids downtime from expired certificates. If you use Let’s Encrypt (free) or AWS Certificate Manager (public certificates are free), you save the time and cost of manual certificate management.
Use Cases: Commonly used to provision HTTPS certificates for Ingresses. For example, with Kubernetes Ingress or ALB ingress, cert-manager can automatically request and renew certs. Also useful for securing internal services with mTLS.
Example: A simplified ACME ClusterIssuer config:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - dns01:
        route53:
          region: us-west-2
          hostedZoneID: ZXXXXXXXX
This tells cert-manager to use Let’s Encrypt with DNS-01 validation via Route 53. The ACME account key is stored in the letsencrypt-prod-key Secret. Annotate Ingress resources with cert-manager.io/cluster-issuer: letsencrypt-prod to have certificates requested and renewed automatically.
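For example, an Ingress requesting a certificate from this issuer could look like this sketch (host, Service and Secret names are placeholders):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - app.example.com
    secretName: app-example-com-tls   # cert-manager creates and renews this Secret
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80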
The external-dns add-on automatically manages DNS records in Route 53 (or other DNS providers) based on Kubernetes resources. It watches Services and Ingress objects (with certain annotations) and creates/updates matching DNS A/ALIAS records.
Integration/Architecture: Installed via eksctl create addon --cluster myCluster --name external-dns. It runs as a Deployment (often in the external-dns namespace) and needs an IAM role that allows Route 53 changes. AWS publishes a managed policy (AmazonRoute53FullAccess) you can use by default, but you should scope it down (list Route 53 zones, change record sets only in specific zones).
Security: Giving it broad DNS write permissions is powerful: a compromised pod could hijack DNS records. Mitigate risk by scoping the IAM role to only needed hosted zones. Run the service account with IRSA or Pod Identity for least-privilege. Audit the zone change logs.
Cost/Optimization: The add-on itself is free. It saves operational overhead (no manual DNS edits). Indirectly, it can reduce downtime (by automatically correcting records if services change IPs). There is no direct AWS cost per se (Route 53 does have API call/record costs, but minimal for normal usage).
Use Cases: Multi-environment clusters (dev/stage/prod) where services should have DNS names. Automated blue/green or canary deployments where new service IPs update DNS. Multi-cluster setups using DNS for service discovery.
Example: To install External DNS and attach it to a service account with IAM:
eksctl create addon --cluster myCluster --name external-dns --version latest \
--service-account-role-arn arn:aws:iam::123456789012:role/ExternalDNSRole
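Once running, external-dns watches for annotated resources. A hypothetical LoadBalancer Service that should receive a Route 53 record could look like:
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com   # record external-dns will manage
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080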
Fluent Bit is a lightweight, high-performance log processor and forwarder. Deployed as a DaemonSet, it collects container logs and metrics from each node and ships them to destinations such as Amazon CloudWatch Logs, Amazon S3 or Amazon Kinesis Data Firehose. Fluent Bit uses minimal CPU/memory, making it ideal for Kubernetes.
Architecture: In an EKS cluster, Fluent Bit runs on every node to tail log files (typically from /var/log/containers). It can filter or transform logs before forwarding, and it supports multiple outputs. For example, you might send application logs to CloudWatch Logs for real-time analysis and to S3 for long-term retention. Because it is resource-efficient by design, it has less impact on node performance than heavier agents.
Security: Fluent Bit can be configured to use IAM roles (via IRSA) when writing to AWS services, ensuring log data is sent securely. By centralizing logs off-node, it minimizes the risk of losing logs if a node fails. It also reduces the need to run privileged logging agents since it handles log collection within the normal node context.
Cost: Fluent Bit itself adds minimal cost (tiny resource use). It provides cost savings by enabling log filtering and smart routing: for example, you might drop debug logs or batch them to S3 instead of CloudWatch, reducing CloudWatch storage costs. Centralized logging helps detect issues faster (reducing MTTR), which is a hidden cost saving.
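As an illustration, a Fluent Bit output stanza that routes container logs to CloudWatch Logs might look like the sketch below (region and log group name are assumptions; the cloudwatch_logs plugin ships with the AWS-distributed Fluent Bit image):
[OUTPUT]
    Name              cloudwatch_logs
    Match             kube.*
    region            us-west-2
    log_group_name    /eks/myCluster/application
    log_stream_prefix fluentbit-
    auto_create_group true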
The Node Monitoring Agent (NMA) runs on every worker node as a DaemonSet and collects detailed health metrics (CPU, memory, disk, networking, GPU health, etc.). It detects node-level problems (e.g. kernel panics, out-of-memory, GPU driver failures, disk I/O errors) that the kubelet alone may not catch. On detecting an issue, NMA updates the node’s status and emits Kubernetes events, enabling EKS Auto Repair to automatically replace unhealthy nodes.
Performance & Resilience: By continuously monitoring low-level signals (including GPU-specific issues via NVIDIA’s DCGM), NMA greatly improves cluster availability. It automatically replaces failing nodes before workloads degrade, reducing unplanned downtime and freeing engineers from manual node health checks. Benchmarked key benefits include “improved workload availability: [auto-detecting and replacing unhealthy nodes]… reduced operational overhead”. In GPU-heavy workloads, early detection of ECC memory errors or driver crashes helps keep ML training reliably running. Integrating with Kubernetes disruption budgets and Karpenter, NMA ensures node failures are healed smoothly under Kubernetes safety policies.
Security: A healthy node infrastructure also supports security. NMA can surface signals like unexpected process crashes or resource exhaustion (which might indicate a breach or crypto-mining attack), allowing proactive investigation. And because the agent is managed and patched by AWS, it runs the latest secure binaries by default. (No sensitive credentials are handled by NMA, so security impact is mainly in improving overall cluster hygiene.)
Cost Optimization: Detecting and repairing node failures automatically minimizes business impact (lost work and outages often cost far more than the spare resources used for auto-repair). It also reduces labor costs by eliminating manual troubleshooting. There is some cost trade‑off: auto-repair may replace nodes (creating short-lived extra instances), but this cost is typically offset by avoiding longer outages. In practice, customers see better long-term resource ROI by maintaining healthy nodes and maximizing cluster utilization.
Integration/Architecture: Installed as an EKS add-on (eksctl create addon --cluster myCluster --name eks-node-monitoring-agent). It runs on each node and parses system logs to detect failures. In EKS Auto Mode clusters this agent is built in; for classic clusters you enable it manually. When integrated with managed node groups, EKS auto repair can automatically replace failing instances.
Example: Enable it on a new cluster or existing node group by running:
eksctl create addon --cluster myCluster --name eks-node-monitoring-agent --version latest
Then enable auto-repair on your managed node group:
aws eks update-nodegroup-config --cluster-name myCluster --nodegroup-name myNodeGroup --node-repair-config enabled=true
This setup continuously monitors and fixes node health with no extra manual scripts.
HyperPod Task Governance centrally manages who can run tasks and how many resources they get. Admins allocate GPUs/CPU quotas to teams, assign task priority levels and enable “lend & borrow” policies so idle capacity can be shared. The add-on provides a dashboard showing real-time cluster utilization (vCPUs, GPUs) by team and priority, helping to pinpoint over- or under-utilized resources.
The add-on (amazon-sagemaker-hyperpod-taskgovernance) provides multi-tenant scheduling and governance for AI/ML workloads on EKS. It lets administrators define compute quotas, priority classes and lend/borrow policies per team or project. Internally it integrates with the open-source Kueue scheduler and HyperPod training operators, but as an AWS add-on it simplifies setup. Under the hood, each team’s allocation is represented by a ClusterQueue (quota) and a WorkloadPriorityClass (priority) in Kubernetes. Here’s a snippet using eksctl to enable the add-on:
eksctl create addon --cluster myCluster --name amazon-sagemaker-hyperpod-taskgovernance --version v1.0.0-eksbuild.1
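To illustrate the underlying objects, a team’s quota and priority might translate into Kueue resources roughly as follows (names, quota values and the default-flavor ResourceFlavor are hypothetical; the exact objects the add-on manages may differ):
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-queue
spec:
  namespaceSelector: {}                  # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor               # assumes a ResourceFlavor of this name exists
      resources:
      - name: cpu
        nominalQuota: 64
      - name: nvidia.com/gpu
        nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: team-a-high
value: 100
description: "High-priority training jobs for team A"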
The Mountpoint for Amazon S3 CSI driver lets Kubernetes pods mount S3 buckets as volumes. Under the hood it uses the Mountpoint for S3 file system (developed by AWS Labs) and exposes a standard Kubernetes CSI interface (s3.csi.aws.com). This lets containerized apps read (and stream) S3 objects through a POSIX-like interface without changing the app.
Mountpoint presents an S3 bucket as a read/write filesystem (for sequential writes) inside pods. This is ideal for data-intensive workloads (e.g. distributed training) that require high aggregate throughput from S3. The driver supports high-speed, concurrent reads from many clients and sequential writes to create new files.
The Mountpoint for Amazon S3 CSI Driver (aws-mountpoint-s3-csi-driver) is a CSI plugin that mounts S3 buckets as volumes in EKS pods. It is built on AWS’s Mountpoint technology and presents an S3 bucket as a POSIX-like file system (with some POSIX limitations) so containers can read/write S3 objects without code changes. Applications can use standard filesystem I/O on this mount, benefiting from S3’s scalability and throughput.
Integration/Architecture: Install the add-on (via the console or eksctl) to deploy the CSI driver controllers/daemonsets. Then you create a PersistentVolume (static provisioning) that references an existing S3 bucket. For example:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: s3-csi
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-pv-handle
    volumeAttributes:
      bucketName: my-training-data-bucket
The pod can then mount this volume. Note that dynamic provisioning (auto-creating buckets) is not supported; you must reference an existing S3 bucket. After installing the add-on, create the PersistentVolume as above and bind it with a PersistentVolumeClaim. A pod spec might then look like:
apiVersion: v1
kind: Pod
metadata:
  name: s3-reader
spec:
  containers:
  - name: app
    image: amazonlinux
    command: ["/bin/sh", "-c", "ls /mnt/data"]
    volumeMounts:
    - name: data
      mountPath: /mnt/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-s3
where pvc-s3 is a PersistentVolumeClaim bound to the s3-pv volume above. This pod will see the contents of the S3 bucket in /mnt/data.
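For completeness, the pvc-s3 claim could be defined as a static claim bound directly to that volume, along these lines:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-s3
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: s3-csi          # must match the PV's storageClassName
  resources:
    requests:
      storage: 10Gi
  volumeName: s3-pv                 # bind directly to the static PV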
The AWS Network Flow Monitor Agent for EKS (a CloudWatch add-on) collects network flow telemetry (per-connection statistics and aggregated performance data) from each node and sends it to the Network Flow Monitor backend. This provides deep visibility into pod-to-pod and pod-to-service network traffic.
Cost Optimization: There is some overhead: the flow logs generate CloudWatch ingestion and storage costs proportional to traffic volume. However, by identifying and eliminating inefficient network usage (for example, unnecessary cross-AZ data transfer or chatty services) organizations can reduce data-transfer and monitoring costs in the long run. The value of proactive network troubleshooting often outweighs the relatively small cost of the logs. In sum, this add-on trades minimal monitoring expense for much greater insight into network efficiency and security.
The AWS Network Flow Monitor Agent (aws-network-flow-monitoring-agent) is an observability add-on that captures cluster network flows. Deployed as a DaemonSet, it passively collects TCP connection statistics (source/destination IPs, ports, bytes, etc.) from every node and ships flow reports to Amazon CloudWatch’s Network Flow Monitor service. This provides pod-to-pod and pod-to-external connectivity visibility akin to VPC Flow Logs, but at the Kubernetes level.
Integration/Architecture: Install it as an EKS add-on (aws-network-flow-monitoring-agent). It requires that the Pod Identity Agent add-on be installed first. You must create an IAM role with the CloudWatchNetworkFlowMonitorAgentPublishPolicy attached and associate it with the agent’s service account via Pod Identity (see below). Then install via the console or CLI. For example:
aws eks create-addon --cluster-name myCluster \
--addon-name aws-network-flow-monitoring-agent \
--pod-identity-associations serviceAccount=aws-network-flow-monitor-agent-service-account,roleArn=arn:aws:iam::123456789012:role/FlowMonitorRole
The Pod Identity Agent add-on (eks-pod-identity-agent) enables IAM roles for pods using the EKS Auth API. It runs a DaemonSet on each node, listening on a link-local address to serve AWS credentials to pods that are associated with IAM roles. Essentially, it allows Kubernetes service accounts to assume IAM roles without configuring an OIDC identity provider (an alternative to IRSA).
What it does: When a Pod uses a service account mapped to an IAM role (via an EKS Pod Identity association), the agent provides that pod with short-lived AWS credentials. This is similar to how EC2 instance profiles work, but scoped to individual pods. EKS Pod Identity eliminates the need for third-party tools (like kube2iam or kiam) and does not require configuring an OIDC provider.
Performance & Resilience: The agent greatly improves scalability of credential delivery. AWS documentation notes that instead of every pod having to fetch credentials from AWS independently, “each set of temporary credentials is assumed by the EKS Auth service… then the Pod Identity Agent… issues the credentials to the SDKs. Thus the load is reduced to once for each node, instead of once per pod. In practice, this means large clusters with many pods avoid redundant AWS API calls (faster start-up and less risk of throttling). The agent’s host-networked operation also ensures high availability of the credential endpoint – if one node’s agent fails, only pods on that node are affected and can be rescheduled.
Security: Pod Identity enforces least-privilege IAM for pods. Only pods using a service account with a Pod Identity association receive that IAM role’s credentials. Other pods on the same node cannot access those credentials, thanks to the agent’s isolation. This provides strict credential isolation: “a Pod’s containers can only retrieve credentials for the IAM role that’s associated with [its] service account”. Administrators bind IAM roles to service accounts via EKS APIs; since each role has a narrow trust policy (only allowing the pods.eks.amazonaws.com service principal), there’s no risk of pods obtaining overly broad credentials. CloudTrail auditing can track Pod Identity usage for compliance. (Note: in EKS Auto Mode, Pod Identity is built in and need not be installed manually.)
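For reference, the trust policy on such a role is short; a minimal version looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "pods.eks.amazonaws.com" },
      "Action": ["sts:AssumeRole", "sts:TagSession"]
    }
  ]
}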
Cost Optimization: Using the EKS Pod Identity Agent saves operational cost by removing the need for custom credential brokers. It also avoids leakage risks of static AWS credentials or long-lived tokens, thereby potentially saving on security incident costs. By integrating pod IAM with existing EKS IAM roles, it cuts down on setup complexity (and related human errors) compared to managing separate identity systems. The actual run-time cost is minimal (no extra AWS API charges beyond the normal AssumeRole calls, which are negligible).
Installation: Install the agent add-on, for example via the CLI:
aws eks create-addon --cluster-name myCluster --addon-name eks-pod-identity-agent
After that, in the AWS console or CLI you create a Pod Identity association that links a Kubernetes service account (namespace + name) to an IAM role ARN. Pods using that service account will then automatically receive the role’s credentials.
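A hypothetical association for a service account my-app-sa in the default namespace could be created like this:
aws eks create-pod-identity-association \
  --cluster-name myCluster \
  --namespace default \
  --service-account my-app-sa \
  --role-arn arn:aws:iam::123456789012:role/MyAppRole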
Whether you’re scaling AI workloads, improving network visibility, or cutting cloud costs, these EKS add-ons provide powerful tools to improve your Kubernetes operations on AWS.
Adopting them as part of a secure, observable and cost-efficient EKS architecture gives your DevOps and platform teams a strong foundation to innovate without complexity.