Amazon EKS and Grafana stack
Build secure Amazon EKS cluster with Grafana stack
I will outline the steps for setting up an Amazon EKS environment that prioritizes security, including the configuration of standard applications.
The Amazon EKS setup should align with the following criteria:
- Utilize two Availability Zones (AZs), or a single zone if possible, to reduce payments for cross-AZ traffic
- Spot instances
- Less expensive region -
us-east-1 - Most price efficient EC2 instance type
t4g.medium(2 x CPU, 4GB RAM) using AWS Graviton based on ARM - Use Bottlerocket OS for a minimal operating system, CPU, and memory footprint
- Leverage Network Load Balancer (NLB) for highly cost-effective and optimized load balancing, seamlessly integrated with kgateway.
- Karpenter to enable automatic node scaling that matches the specific resource requirements of pods
- The Amazon EKS control plane must be encrypted using KMS
- Worker node EBS volumes must be encrypted
- Cluster logging to CloudWatch must be configured
- Network Policies should be enabled where supported
- EKS Pod Identities should be used to allow applications and pods to communicate with AWS APIs
Build Amazon EKS
Requirements
You will need to configure the AWS CLI and set up other necessary secrets and variables.
1
2
3
4
5
6
7
# AWS Credentials
export AWS_ACCESS_KEY_ID="xxxxxxxxxxxxxxxxxx"
export AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export AWS_SESSION_TOKEN="xxxxxxxx"
export AWS_ROLE_TO_ASSUME="arn:aws:iam::7xxxxxxxxxx7:role/Gixxxxxxxxxxxxxxxxxxxxle"
export GOOGLE_CLIENT_ID="10xxxxxxxxxxxxxxxud.apps.googleusercontent.com"
export GOOGLE_CLIENT_SECRET="GOxxxxxxxxxxxxxxxtw"
If you plan to follow this document and its tasks, you will need to set up a few environment variables, such as:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# AWS Region
export AWS_REGION="${AWS_REGION:-us-east-1}"
# Hostname / FQDN definitions
export CLUSTER_FQDN="${CLUSTER_FQDN:-k01.k8s.mylabs.dev}"
# Base Domain: k8s.mylabs.dev
export BASE_DOMAIN="${CLUSTER_FQDN#*.}"
# Cluster Name: k01
export CLUSTER_NAME="${CLUSTER_FQDN%%.*}"
export MY_EMAIL="petr.ruzicka@gmail.com"
export TMP_DIR="${TMP_DIR:-${PWD}}"
export KUBECONFIG="${KUBECONFIG:-${TMP_DIR}/${CLUSTER_FQDN}/kubeconfig-${CLUSTER_NAME}.conf}"
# Tags used to tag the AWS resources
export TAGS="${TAGS:-Owner=${MY_EMAIL},Environment=dev,Cluster=${CLUSTER_FQDN}}"
export AWS_PARTITION="aws"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text) && export AWS_ACCOUNT_ID
mkdir -pv "${TMP_DIR}/${CLUSTER_FQDN}"
Confirm that all essential variables have been properly configured:
1
2
3
4
5
6
7
8
: "${AWS_ACCESS_KEY_ID?}"
: "${AWS_REGION?}"
: "${AWS_SECRET_ACCESS_KEY?}"
: "${AWS_ROLE_TO_ASSUME?}"
: "${GOOGLE_CLIENT_ID?}"
: "${GOOGLE_CLIENT_SECRET?}"
echo -e "${MY_EMAIL} | ${CLUSTER_NAME} | ${BASE_DOMAIN} | ${CLUSTER_FQDN}\n${TAGS}"
Install the required tools:
You can bypass these procedures if you already have all the essential software installed.
Configure AWS Route 53 Domain delegation
The DNS delegation tasks should be executed as a one-time operation.
Create a DNS zone for the EKS clusters:
1
2
3
4
5
6
7
export CLOUDFLARE_EMAIL="petr.ruzicka@gmail.com"
export CLOUDFLARE_API_KEY="1xxxxxxxxx0"
aws route53 create-hosted-zone --output json \
--name "${BASE_DOMAIN}" \
--caller-reference "$(date)" \
--hosted-zone-config="{\"Comment\": \"Created by petr.ruzicka@gmail.com\", \"PrivateZone\": false}" | jq
Utilize your domain registrar to update the nameservers for your zone (e.g., mylabs.dev) to point to Amazon Route 53 nameservers. Here’s how to discover the required Route 53 nameservers:
1
2
3
4
NEW_ZONE_ID=$(aws route53 list-hosted-zones --query "HostedZones[?Name==\`${BASE_DOMAIN}.\`].Id" --output text)
NEW_ZONE_NS=$(aws route53 get-hosted-zone --output json --id "${NEW_ZONE_ID}" --query "DelegationSet.NameServers")
NEW_ZONE_NS1=$(echo "${NEW_ZONE_NS}" | jq -r ".[0]")
NEW_ZONE_NS2=$(echo "${NEW_ZONE_NS}" | jq -r ".[1]")
Establish the NS record in k8s.mylabs.dev (your BASE_DOMAIN) for proper zone delegation. This operation’s specifics may vary based on your domain registrar; I use Cloudflare and employ Ansible for automation:
1
2
ansible -m cloudflare_dns -c local -i "localhost," localhost -a "zone=mylabs.dev record=${BASE_DOMAIN} type=NS value=${NEW_ZONE_NS1} solo=true proxied=no account_email=${CLOUDFLARE_EMAIL} account_api_token=${CLOUDFLARE_API_KEY}"
ansible -m cloudflare_dns -c local -i "localhost," localhost -a "zone=mylabs.dev record=${BASE_DOMAIN} type=NS value=${NEW_ZONE_NS2} solo=false proxied=no account_email=${CLOUDFLARE_EMAIL} account_api_token=${CLOUDFLARE_API_KEY}"
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
localhost | CHANGED => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": true,
"result": {
"record": {
"content": "ns-885.awsdns-46.net",
"created_on": "2020-11-13T06:25:32.18642Z",
"id": "dxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxb",
"locked": false,
"meta": {
"auto_added": false,
"managed_by_apps": false,
"managed_by_argo_tunnel": false,
"source": "primary"
},
"modified_on": "2020-11-13T06:25:32.18642Z",
"name": "k8s.mylabs.dev",
"proxiable": false,
"proxied": false,
"ttl": 1,
"type": "NS",
"zone_id": "2xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxe",
"zone_name": "mylabs.dev"
}
}
}
localhost | CHANGED => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": true,
"result": {
"record": {
"content": "ns-1692.awsdns-19.co.uk",
"created_on": "2020-11-13T06:25:37.605605Z",
"id": "9xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxb",
"locked": false,
"meta": {
"auto_added": false,
"managed_by_apps": false,
"managed_by_argo_tunnel": false,
"source": "primary"
},
"modified_on": "2020-11-13T06:25:37.605605Z",
"name": "k8s.mylabs.dev",
"proxiable": false,
"proxied": false,
"ttl": 1,
"type": "NS",
"zone_id": "2xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxe",
"zone_name": "mylabs.dev"
}
}
}
Create the service-linked role
Creating the service-linked role for Spot Instances is a one-time operation.
Create the AWSServiceRoleForEC2Spot role to use Spot Instances in the Amazon EKS cluster:
1
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
Details: Work with Spot Instances
Create Route53 zone and KMS key infrastructure
Generate a CloudFormation template that defines an Amazon Route 53 zone and an AWS Key Management Service (KMS) key.
Add the new domain CLUSTER_FQDN to Route 53, and set up DNS delegation from the BASE_DOMAIN.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
tee "${TMP_DIR}/${CLUSTER_FQDN}/aws-cf-route53-kms.yml" << \EOF
AWSTemplateFormatVersion: 2010-09-09
Description: Route53 entries and KMS key
Parameters:
BaseDomain:
Description: "Base domain where cluster domains + their subdomains will live - Ex: k8s.mylabs.dev"
Type: String
ClusterFQDN:
Description: "Cluster FQDN (domain for all applications) - Ex: k01.k8s.mylabs.dev"
Type: String
ClusterName:
Description: "Cluster Name - Ex: k01"
Type: String
Resources:
HostedZone:
Type: AWS::Route53::HostedZone
Properties:
Name: !Ref ClusterFQDN
RecordSet:
Type: AWS::Route53::RecordSet
Properties:
HostedZoneName: !Sub "${BaseDomain}."
Name: !Ref ClusterFQDN
Type: NS
TTL: 60
ResourceRecords: !GetAtt HostedZone.NameServers
KMSAlias:
Type: AWS::KMS::Alias
Properties:
AliasName: !Sub "alias/eks-${ClusterName}"
TargetKeyId: !Ref KMSKey
KMSKey:
Type: AWS::KMS::Key
Properties:
Description: !Sub "KMS key for ${ClusterName} Amazon EKS"
EnableKeyRotation: true
PendingWindowInDays: 7
KeyPolicy:
Version: "2012-10-17"
Id: !Sub "eks-key-policy-${ClusterName}"
Statement:
- Sid: Allow direct access to key metadata to the account
Effect: Allow
Principal:
AWS:
- !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:root"
Action:
- kms:*
Resource: "*"
- Sid: Allow access through EBS for all principals in the account that are authorized to use EBS
Effect: Allow
Principal:
AWS: "*"
Action:
- kms:Encrypt
- kms:Decrypt
- kms:ReEncrypt*
- kms:GenerateDataKey*
- kms:CreateGrant
- kms:DescribeKey
Resource: "*"
Condition:
StringEquals:
kms:ViaService: !Sub "ec2.${AWS::Region}.amazonaws.com"
kms:CallerAccount: !Sub "${AWS::AccountId}"
S3AccessPolicy:
Type: AWS::IAM::ManagedPolicy
Properties:
ManagedPolicyName: !Sub "eksctl-${ClusterName}-s3-access-policy"
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- s3:AbortMultipartUpload
- s3:DeleteObject
- s3:GetObject
- s3:ListMultipartUploadParts
- s3:ListObjects
- s3:PutObject
- s3:PutObjectTagging
Resource: !Sub "arn:aws:s3:::${ClusterFQDN}/*"
- Effect: Allow
Action:
- s3:ListBucket
Resource: !Sub "arn:aws:s3:::${ClusterFQDN}"
Outputs:
KMSKeyArn:
Description: The ARN of the created KMS Key to encrypt EKS related services
Value: !GetAtt KMSKey.Arn
Export:
Name:
Fn::Sub: "${AWS::StackName}-KMSKeyArn"
KMSKeyId:
Description: The ID of the created KMS Key to encrypt EKS related services
Value: !Ref KMSKey
Export:
Name:
Fn::Sub: "${AWS::StackName}-KMSKeyId"
S3AccessPolicyArn:
Description: IAM policy ARN for S3 access by EKS workloads
Value: !Ref S3AccessPolicy
Export:
Name:
Fn::Sub: "${AWS::StackName}-S3AccessPolicy"
EOF
# shellcheck disable=SC2001
eval aws cloudformation deploy --capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides "BaseDomain=${BASE_DOMAIN} ClusterFQDN=${CLUSTER_FQDN} ClusterName=${CLUSTER_NAME}" \
--stack-name "${CLUSTER_NAME}-route53-kms" --template-file "${TMP_DIR}/${CLUSTER_FQDN}/aws-cf-route53-kms.yml" --tags "${TAGS//,/ }"
AWS_CLOUDFORMATION_DETAILS=$(aws cloudformation describe-stacks --stack-name "${CLUSTER_NAME}-route53-kms" --query "Stacks[0].Outputs[? OutputKey==\`KMSKeyArn\` || OutputKey==\`KMSKeyId\` || OutputKey==\`S3AccessPolicyArn\`].{OutputKey:OutputKey,OutputValue:OutputValue}")
AWS_KMS_KEY_ARN=$(echo "${AWS_CLOUDFORMATION_DETAILS}" | jq -r ".[] | select(.OutputKey==\"KMSKeyArn\") .OutputValue")
AWS_KMS_KEY_ID=$(echo "${AWS_CLOUDFORMATION_DETAILS}" | jq -r ".[] | select(.OutputKey==\"KMSKeyId\") .OutputValue")
AWS_S3_ACCESS_POLICY_ARN=$(echo "${AWS_CLOUDFORMATION_DETAILS}" | jq -r ".[] | select(.OutputKey==\"S3AccessPolicyArn\") .OutputValue")
After running the CloudFormation stack, you should see the following Route53 zones:
Route53 k01.k8s.mylabs.dev zone
You should also see the following KMS key:
Create Karpenter infrastructure
Use CloudFormation to set up the infrastructure needed by the EKS cluster. See CloudFormation for a complete description of what cloudformation.yaml does for Karpenter.
1
2
3
4
5
6
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/website/content/en/v1.8/getting-started/getting-started-with-karpenter/cloudformation.yaml > "${TMP_DIR}/${CLUSTER_FQDN}/cloudformation-karpenter.yml"
eval aws cloudformation deploy \
--stack-name "${CLUSTER_NAME}-karpenter" \
--template-file "${TMP_DIR}/${CLUSTER_FQDN}/cloudformation-karpenter.yml" \
--capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides "ClusterName=${CLUSTER_NAME}" --tags "${TAGS//,/ }"
Create Amazon EKS
I will use eksctl to create the Amazon EKS cluster.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
tee "${TMP_DIR}/${CLUSTER_FQDN}/eksctl-${CLUSTER_NAME}.yml" << EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: ${CLUSTER_NAME}
region: ${AWS_REGION}
tags:
karpenter.sh/discovery: ${CLUSTER_NAME}
$(echo "${TAGS}" | sed "s/,/\\n /g; s/=/: /g")
availabilityZones:
- ${AWS_REGION}a
- ${AWS_REGION}b
accessConfig:
accessEntries:
- principalARN: arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/admin
accessPolicies:
- policyARN: arn:${AWS_PARTITION}:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy
accessScope:
type: cluster
iam:
withOIDC: true
podIdentityAssociations:
- namespace: aws-load-balancer-controller
serviceAccountName: aws-load-balancer-controller
roleName: eksctl-${CLUSTER_NAME}-aws-load-balancer-controller
wellKnownPolicies:
awsLoadBalancerController: true
- namespace: cert-manager
serviceAccountName: cert-manager
roleName: eksctl-${CLUSTER_NAME}-cert-manager
wellKnownPolicies:
certManager: true
- namespace: external-dns
serviceAccountName: external-dns
roleName: eksctl-${CLUSTER_NAME}-external-dns
wellKnownPolicies:
externalDNS: true
- namespace: karpenter
serviceAccountName: karpenter
roleName: eksctl-${CLUSTER_NAME}-karpenter
permissionPolicyARNs:
- arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:policy/KarpenterControllerPolicy-${CLUSTER_NAME}
- namespace: loki
serviceAccountName: loki
roleName: eksctl-${CLUSTER_NAME}-loki
permissionPolicyARNs:
- ${AWS_S3_ACCESS_POLICY_ARN}
- namespace: mimir
serviceAccountName: mimir
roleName: eksctl-${CLUSTER_NAME}-mimir
permissionPolicyARNs:
- ${AWS_S3_ACCESS_POLICY_ARN}
- namespace: tempo
serviceAccountName: tempo
roleName: eksctl-${CLUSTER_NAME}-tempo
permissionPolicyARNs:
- ${AWS_S3_ACCESS_POLICY_ARN}
- namespace: velero
serviceAccountName: velero
roleName: eksctl-${CLUSTER_NAME}-velero
permissionPolicyARNs:
- ${AWS_S3_ACCESS_POLICY_ARN}
permissionPolicy:
Version: "2012-10-17"
Statement:
- Effect: Allow
Action: [
"ec2:DescribeVolumes",
"ec2:DescribeSnapshots",
"ec2:CreateTags",
"ec2:CreateSnapshot",
"ec2:DeleteSnapshots"
]
Resource:
- "*"
iamIdentityMappings:
- arn: "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}"
username: system:node:
groups:
- system:bootstrappers
- system:nodes
addons:
- name: coredns
- name: eks-pod-identity-agent
- name: kube-proxy
- name: snapshot-controller
- name: aws-ebs-csi-driver
configurationValues: |-
defaultStorageClass:
enabled: true
controller:
extraVolumeTags:
$(echo "${TAGS}" | sed "s/,/\\n /g; s/=/: /g")
loggingFormat: json
- name: vpc-cni
configurationValues: |-
enableNetworkPolicy: "true"
env:
ENABLE_PREFIX_DELEGATION: "true"
managedNodeGroups:
- name: mng01-ng
amiFamily: Bottlerocket
instanceType: t4g.medium
desiredCapacity: 2
availabilityZones:
- ${AWS_REGION}a
minSize: 2
maxSize: 3
volumeSize: 20
volumeEncrypted: true
volumeKmsKeyID: ${AWS_KMS_KEY_ID}
privateNetworking: true
nodeRepairConfig:
enabled: true
bottlerocket:
settings:
kubernetes:
seccomp-default: true
secretsEncryption:
keyARN: ${AWS_KMS_KEY_ARN}
cloudWatch:
clusterLogging:
logRetentionInDays: 1
enableTypes:
- all
EOF
eksctl create cluster --config-file "${TMP_DIR}/${CLUSTER_FQDN}/eksctl-${CLUSTER_NAME}.yml" --kubeconfig "${KUBECONFIG}" || eksctl utils write-kubeconfig --cluster="${CLUSTER_NAME}" --kubeconfig "${KUBECONFIG}"
Enhance the security posture of the EKS cluster by addressing the following concerns:
1
2
3
AWS_VPC_ID=$(aws ec2 describe-vpcs --filters "Name=tag:alpha.eksctl.io/cluster-name,Values=${CLUSTER_NAME}" --query 'Vpcs[*].VpcId' --output text)
AWS_SECURITY_GROUP_ID=$(aws ec2 describe-security-groups --filters "Name=vpc-id,Values=${AWS_VPC_ID}" "Name=group-name,Values=default" --query 'SecurityGroups[*].GroupId' --output text)
AWS_NACL_ID=$(aws ec2 describe-network-acls --filters "Name=vpc-id,Values=${AWS_VPC_ID}" --query 'NetworkAcls[*].NetworkAclId' --output text)
The default security group should have no rules configured:
1 2
aws ec2 revoke-security-group-egress --group-id "${AWS_SECURITY_GROUP_ID}" --protocol all --port all --cidr 0.0.0.0/0 | jq || true aws ec2 revoke-security-group-ingress --group-id "${AWS_SECURITY_GROUP_ID}" --protocol all --port all --source-group "${AWS_SECURITY_GROUP_ID}" | jq || true
The VPC NACL allows unrestricted SSH access, and the VPC NACL allows unrestricted RDP access:
1 2
aws ec2 create-network-acl-entry --network-acl-id "${AWS_NACL_ID}" --ingress --rule-number 1 --protocol tcp --port-range "From=22,To=22" --cidr-block 0.0.0.0/0 --rule-action Deny aws ec2 create-network-acl-entry --network-acl-id "${AWS_NACL_ID}" --ingress --rule-number 2 --protocol tcp --port-range "From=3389,To=3389" --cidr-block 0.0.0.0/0 --rule-action Deny
The VPC should have Route 53 DNS resolver with logging enabled:
1 2 3 4 5 6 7 8 9
AWS_CLUSTER_LOG_GROUP_ARN=$(aws logs describe-log-groups --query "logGroups[?logGroupName=='/aws/eks/${CLUSTER_NAME}/cluster'].arn" --output text) AWS_CLUSTER_ROUTE53_RESOLVER_QUERY_LOG_CONFIG_ID=$(aws route53resolver create-resolver-query-log-config \ --name "${CLUSTER_NAME}-vpc-dns-logs" \ --destination-arn "${AWS_CLUSTER_LOG_GROUP_ARN}" \ --creator-request-id "$(uuidgen)" --query 'ResolverQueryLogConfig.Id' --output text) aws route53resolver associate-resolver-query-log-config \ --resolver-query-log-config-id "${AWS_CLUSTER_ROUTE53_RESOLVER_QUERY_LOG_CONFIG_ID}" \ --resource-id "${AWS_VPC_ID}"
Prometheus Operator CRDs
Prometheus Operator CRDs provides the Custom Resource Definitions (CRDs) that define the Prometheus operator resources. These CRDs are required before installing ServiceMonitor resources.
Install the prometheus-operator-crds Helm chart to set up the necessary CRDs:
1
helm install prometheus-operator-crds oci://ghcr.io/prometheus-community/charts/prometheus-operator-crds
AWS Load Balancer Controller
The AWS Load Balancer Controller is a controller that manages Elastic Load Balancers for a Kubernetes cluster.
Install the aws-load-balancer-controller Helm chart and modify its default values:
1
2
3
4
5
6
7
8
9
10
11
12
# renovate: datasource=helm depName=aws-load-balancer-controller registryUrl=https://aws.github.io/eks-charts
AWS_LOAD_BALANCER_CONTROLLER_HELM_CHART_VERSION="1.14.1"
helm repo add --force-update eks https://aws.github.io/eks-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-aws-load-balancer-controller.yml" << EOF
serviceAccount:
name: aws-load-balancer-controller
clusterName: ${CLUSTER_NAME}
serviceMonitor:
enabled: true
EOF
helm upgrade --install --version "${AWS_LOAD_BALANCER_CONTROLLER_HELM_CHART_VERSION}" --namespace aws-load-balancer-controller --create-namespace --wait --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-aws-load-balancer-controller.yml" aws-load-balancer-controller eks/aws-load-balancer-controller
Pod Scheduling PriorityClasses
Configure PriorityClasses to control the scheduling priority of pods in your cluster. PriorityClasses allow you to influence which pods are scheduled or evicted first when resources are constrained. These classes help ensure that critical workloads receive scheduling priority over less important workloads.
Create custom PriorityClass resources to define priority levels for different workload types:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
tee "${TMP_DIR}/${CLUSTER_FQDN}/k8s-scheduling-priorityclass.yml" << EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical-priority
value: 100001000
globalDefault: false
description: "This priority class should be used for critical workloads only"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 100000000
globalDefault: false
description: "This priority class should be used for high priority workloads"
EOF
Add Storage Classes and Volume Snapshots
Configure persistent storage for your EKS cluster by setting up GP3 storage classes and volume snapshot capabilities. This ensures encrypted, expandable storage with proper backup functionality.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
tee "${TMP_DIR}/${CLUSTER_FQDN}/k8s-storage-snapshot-storageclass-volumesnapshotclass.yml" << EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: gp3
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
type: gp3
encrypted: "true"
kmsKeyId: ${AWS_KMS_KEY_ARN}
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: ebs-vsc
annotations:
snapshot.storage.kubernetes.io/is-default-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Delete
EOF
Delete the gp2 StorageClass, as gp3 will be used instead:
1
kubectl delete storageclass gp2 || true
Karpenter
Karpenter is a Kubernetes node autoscaler built for flexibility, performance, and simplicity.
Install the karpenter Helm chart and customize its default values to fit your environment and storage backend:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# renovate: datasource=github-tags depName=aws/karpenter-provider-aws
KARPENTER_HELM_CHART_VERSION="1.8.2"
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-karpenter.yml" << EOF
serviceMonitor:
enabled: true
settings:
clusterName: ${CLUSTER_NAME}
eksControlPlane: true
interruptionQueue: ${CLUSTER_NAME}
featureGates:
spotToSpotConsolidation: true
EOF
helm upgrade --install --version "${KARPENTER_HELM_CHART_VERSION}" --namespace karpenter --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-karpenter.yml" karpenter oci://public.ecr.aws/karpenter/karpenter
Configure Karpenter by applying the following provisioner definition:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
tee "${TMP_DIR}/${CLUSTER_FQDN}/k8s-karpenter-nodepool.yml" << EOF | kubectl apply -f -
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: Bottlerocket
amiSelectorTerms:
- alias: bottlerocket@latest
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: "${CLUSTER_NAME}"
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: "${CLUSTER_NAME}"
role: "KarpenterNodeRole-${CLUSTER_NAME}"
tags:
Name: "${CLUSTER_NAME}-karpenter"
$(echo "${TAGS}" | sed "s/,/\\n /g; s/=/: /g")
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 2Gi
volumeType: gp3
encrypted: true
kmsKeyID: ${AWS_KMS_KEY_ARN}
- deviceName: /dev/xvdb
ebs:
volumeSize: 20Gi
volumeType: gp3
encrypted: true
kmsKeyID: ${AWS_KMS_KEY_ARN}
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: default
spec:
template:
spec:
requirements:
# keep-sorted start
- key: "karpenter.k8s.aws/instance-memory"
operator: Gt
values: ["4095"]
- key: "topology.kubernetes.io/zone"
operator: In
values: ["${AWS_DEFAULT_REGION}a"]
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["t4g", "t3a"]
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: kubernetes.io/arch
operator: In
values: ["arm64", "amd64"]
# keep-sorted end
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
EOF
cert-manager
cert-manager adds certificates and certificate issuers as resource types in Kubernetes clusters and simplifies the process of obtaining, renewing, and using those certificates.
The cert-manager ServiceAccount was created by eksctl. Install the cert-manager Helm chart and modify its default values:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
# renovate: datasource=helm depName=cert-manager registryUrl=https://charts.jetstack.io
CERT_MANAGER_HELM_CHART_VERSION="1.19.1"
helm repo add --force-update jetstack https://charts.jetstack.io
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-cert-manager.yml" << EOF
global:
priorityClassName: high-priority
crds:
enabled: true
replicaCount: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: cert-manager
app.kubernetes.io/component: controller
topologyKey: kubernetes.io/hostname
extraArgs:
- --enable-certificate-owner-ref=true
serviceAccount:
name: cert-manager
enableCertificateOwnerRef: true
webhook:
replicaCount: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: cert-manager
app.kubernetes.io/component: webhook
topologyKey: kubernetes.io/hostname
cainjector:
replicaCount: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: cert-manager
app.kubernetes.io/component: cainjector
topologyKey: kubernetes.io/hostname
prometheus:
servicemonitor:
enabled: true
EOF
helm upgrade --install --version "${CERT_MANAGER_HELM_CHART_VERSION}" --namespace cert-manager --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-cert-manager.yml" cert-manager jetstack/cert-manager
Install Velero
Velero is an open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes. It enables disaster recovery, data migration, and scheduled backups by integrating with cloud storage providers such as AWS S3.
Install the velero Helm chart and modify its default values:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
# renovate: datasource=helm depName=velero registryUrl=https://vmware-tanzu.github.io/helm-charts
VELERO_HELM_CHART_VERSION="11.1.1"
helm repo add --force-update vmware-tanzu https://vmware-tanzu.github.io/helm-charts
cat > "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-velero.yml" << EOF
initContainers:
- name: velero-plugin-for-aws
# renovate: datasource=docker depName=velero/velero-plugin-for-aws extractVersion=^(?<version>.+)$
image: velero/velero-plugin-for-aws:v1.13.0
volumeMounts:
- mountPath: /target
name: plugins
priorityClassName: high-priority
metrics:
serviceMonitor:
enabled: true
# prometheusRule:
# enabled: true
# spec:
# - alert: VeleroBackupPartialFailures
# annotations:
# message: Velero backup {{ \$labels.schedule }} has {{ \$value | humanizePercentage }} partially failed backups.
# expr: velero_backup_partial_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
# for: 15m
# labels:
# severity: warning
# - alert: VeleroBackupFailures
# annotations:
# message: Velero backup {{ \$labels.schedule }} has {{ \$value | humanizePercentage }} failed backups.
# expr: velero_backup_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
# for: 15m
# labels:
# severity: warning
# - alert: VeleroBackupSnapshotFailures
# annotations:
# message: Velero backup {{ \$labels.schedule }} has {{ \$value | humanizePercentage }} failed snapshot backups.
# expr: increase(velero_volume_snapshot_failure_total{schedule!=""}[1h]) > 0
# for: 15m
# labels:
# severity: warning
# - alert: VeleroRestorePartialFailures
# annotations:
# message: Velero restore {{ \$labels.schedule }} has {{ \$value | humanizePercentage }} partially failed restores.
# expr: increase(velero_restore_partial_failure_total{schedule!=""}[1h]) > 0
# for: 15m
# labels:
# severity: warning
# - alert: VeleroRestoreFailures
# annotations:
# message: Velero restore {{ \$labels.schedule }} has {{ \$value | humanizePercentage }} failed restores.
# expr: increase(velero_restore_failure_total{schedule!=""}[1h]) > 0
# for: 15m
# labels:
# severity: warning
configuration:
backupStorageLocation:
- name:
provider: aws
bucket: ${CLUSTER_FQDN}
prefix: velero
config:
region: ${AWS_DEFAULT_REGION}
volumeSnapshotLocation:
- name:
provider: aws
config:
region: ${AWS_DEFAULT_REGION}
serviceAccount:
server:
name: velero
credentials:
useSecret: false
# Create scheduled backup to periodically backup the let's encrypt production resources in the "cert-manager" namespace:
schedules:
monthly-backup-cert-manager-production:
labels:
letsencrypt: production
schedule: "@monthly"
template:
ttl: 2160h
includedNamespaces:
- cert-manager
includedResources:
- certificates.cert-manager.io
- secrets
labelSelector:
matchLabels:
letsencrypt: production
EOF
helm upgrade --install --version "${VELERO_HELM_CHART_VERSION}" --namespace velero --create-namespace --wait --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-velero.yml" velero vmware-tanzu/velero
Restore cert-manager objects
The following steps will guide you through restoring a Let’s Encrypt production certificate, previously backed up by Velero to S3, onto a new cluster.
Initiate the restore process for the cert-manager objects.
1
2
while [ -z "$(kubectl -n velero get backupstoragelocations default -o jsonpath='{.status.lastSyncedTime}')" ]; do sleep 5; done
velero restore create --from-schedule velero-monthly-backup-cert-manager-production --labels letsencrypt=production --wait --existing-resource-policy=update
View details about the restore process:
1
velero restore describe --selector letsencrypt=production --details
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
Name: velero-monthly-backup-cert-manager-production-20251030075321
Namespace: velero
Labels: letsencrypt=production
Annotations: <none>
Phase: Completed
Total items to be restored: 3
Items restored: 3
Started: 2025-10-30 07:53:22 +0100 CET
Completed: 2025-10-30 07:53:24 +0100 CET
Backup: velero-monthly-backup-cert-manager-production-20250921155028
Namespaces:
Included: all namespaces found in the backup
Excluded: <none>
Resources:
Included: *
Excluded: nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
Cluster-scoped: auto
Namespace mappings: <none>
Label selector: <none>
Or label selector: <none>
Restore PVs: auto
CSI Snapshot Restores: <none included>
Existing Resource Policy: update
ItemOperationTimeout: 4h0m0s
Preserve Service NodePorts: auto
Uploader config:
HooksAttempted: 0
HooksFailed: 0
Resource List:
cert-manager.io/v1/Certificate:
- cert-manager/ingress-cert-production(created)
v1/Secret:
- cert-manager/ingress-cert-production(created)
- cert-manager/letsencrypt-production-dns(created)
Verify that the certificate was restored properly:
1
kubectl describe certificates -n cert-manager ingress-cert-production
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
Name: ingress-cert-production
Namespace: cert-manager
Labels: letsencrypt=production
velero.io/backup-name=velero-monthly-backup-cert-manager-production-20250921155028
velero.io/restore-name=velero-monthly-backup-cert-manager-production-20251030075321
Annotations: <none>
API Version: cert-manager.io/v1
Kind: Certificate
Metadata:
Creation Timestamp: 2025-10-30T06:53:23Z
Generation: 1
Resource Version: 5521
UID: 33422558-3105-4936-87d8-468befb5dc2b
Spec:
Common Name: *.k01.k8s.mylabs.dev
Dns Names:
*.k01.k8s.mylabs.dev
k01.k8s.mylabs.dev
Issuer Ref:
Group: cert-manager.io
Kind: ClusterIssuer
Name: letsencrypt-production-dns
Secret Name: ingress-cert-production
Secret Template:
Labels:
Letsencrypt: production
Status:
Conditions:
Last Transition Time: 2025-10-30T06:53:23Z
Message: Certificate is up to date and has not expired
Observed Generation: 1
Reason: Ready
Status: True
Type: Ready
Not After: 2025-12-20T10:53:07Z
Not Before: 2025-09-21T10:53:08Z
Renewal Time: 2025-11-20T10:53:07Z
Events: <none>
ExternalDNS
ExternalDNS synchronizes exposed Kubernetes Services and Ingresses with DNS providers.
ExternalDNS will manage the DNS records. The external-dns ServiceAccount was created by eksctl. Install the external-dns Helm chart and modify its default values:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# renovate: datasource=helm depName=external-dns registryUrl=https://kubernetes-sigs.github.io/external-dns/
EXTERNAL_DNS_HELM_CHART_VERSION="1.19.0"
helm repo add --force-update external-dns https://kubernetes-sigs.github.io/external-dns/
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-external-dns.yml" << EOF
serviceAccount:
name: external-dns
priorityClassName: high-priority
serviceMonitor:
enabled: true
interval: 20s
policy: sync
domainFilters:
- ${CLUSTER_FQDN}
EOF
helm upgrade --install --version "${EXTERNAL_DNS_HELM_CHART_VERSION}" --namespace external-dns --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-external-dns.yml" external-dns external-dns/external-dns
Ingress NGINX Controller
ingress-nginx is an Ingress controller for Kubernetes that uses nginx as a reverse proxy and load balancer.
Install the ingress-nginx Helm chart and modify its default values:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
# renovate: datasource=helm depName=ingress-nginx registryUrl=https://kubernetes.github.io/ingress-nginx
INGRESS_NGINX_HELM_CHART_VERSION="4.13.3"
helm repo add --force-update ingress-nginx https://kubernetes.github.io/ingress-nginx
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-ingress-nginx.yml" << EOF
controller:
config:
annotations-risk-level: Critical
use-proxy-protocol: true
allowSnippetAnnotations: true
ingressClassResource:
default: true
extraArgs:
default-ssl-certificate: cert-manager/ingress-cert-production
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: ingress-nginx
app.kubernetes.io/component: controller
topologyKey: kubernetes.io/hostname
replicaCount: 2
service:
annotations:
# https://www.qovery.com/blog/our-migration-from-kubernetes-built-in-nlb-to-alb-controller/
# https://www.youtube.com/watch?v=xwiRjimKW9c
service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: ${TAGS//\'/}
service.beta.kubernetes.io/aws-load-balancer-name: eks-${CLUSTER_NAME}
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: proxy_protocol_v2.enabled=true
service.beta.kubernetes.io/aws-load-balancer-type: external
# loadBalancerClass: eks.amazonaws.com/nlb
metrics:
enabled: true
serviceMonitor:
enabled: true
# prometheusRule:
# enabled: true
# rules:
# - alert: NGINXConfigFailed
# expr: count(nginx_ingress_controller_config_last_reload_successful == 0) > 0
# for: 1s
# labels:
# severity: critical
# annotations:
# description: bad ingress config - nginx config test failed
# summary: uninstall the latest ingress changes to allow config reloads to resume
# - alert: NGINXCertificateExpiry
# expr: (avg(nginx_ingress_controller_ssl_expire_time_seconds{host!="_"}) by (host) - time()) < 604800
# for: 1s
# labels:
# severity: critical
# annotations:
# description: ssl certificate(s) will expire in less then a week
# summary: renew expiring certificates to avoid downtime
# - alert: NGINXTooMany500s
# expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"5.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
# for: 1m
# labels:
# severity: warning
# annotations:
# description: Too many 5XXs
# summary: More than 5% of all requests returned 5XX, this requires your attention
# - alert: NGINXTooMany400s
# expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"4.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
# for: 1m
# labels:
# severity: warning
# annotations:
# description: Too many 4XXs
# summary: More than 5% of all requests returned 4XX, this requires your attention
priorityClassName: critical-priority
EOF
helm upgrade --install --version "${INGRESS_NGINX_HELM_CHART_VERSION}" --namespace ingress-nginx --create-namespace --wait --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-ingress-nginx.yml" ingress-nginx ingress-nginx/ingress-nginx
Loki
Grafana Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost-effective and easy to operate, as it does not index the contents of the logs, but rather a set of labels for each log stream.
Install the loki Helm chart and customize its default values to fit your environment and storage requirements:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
# renovate: datasource=helm depName=loki registryUrl=https://grafana.github.io/helm-charts
LOKI_HELM_CHART_VERSION="6.45.2"
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-loki.yml" << EOF
global:
priorityClassName: high-priority
deploymentMode: SingleBinary
loki:
auth_enabled: false
commonConfig:
replication_factor: 2
storage:
bucketNames:
chunks: ${CLUSTER_FQDN}
ruler: ${CLUSTER_FQDN}
admin: ${CLUSTER_FQDN}
s3:
region: ${AWS_REGION}
endpoint: s3.${AWS_REGION}.amazonaws.com
object_store:
storage_prefix: ruzickap
s3:
endpoint: s3.${AWS_REGION}.amazonaws.com
region: ${AWS_REGION}
schemaConfig:
configs:
- from: 2024-04-01
store: tsdb
object_store: s3
schema: v13
index:
prefix: loki_index_
period: 24h
storage_config:
aws:
region: ${AWS_REGION}
# bucketnames: loki-chunk
# bucketnames: loki-chunk
# s3forcepathstyle: false
# s3: s3://s3.${AWS_REGION}.amazonaws.com/loki-storage
# endpoint: s3.${AWS_REGION}.amazonaws.com
limits_config:
retention_period: 1w
# Log retention in Loki is achieved through the Compactor (https://grafana.com/docs/loki/v3.5.x/get-started/components/#compactor)
# compactor:
# delete_request_store: s3
# retention_enabled: true
lokiCanary:
kind: Deployment
ingress:
enabled: true
ingressClassName: nginx
annotations:
gethomepage.dev/enabled: "true"
gethomepage.dev/description: A horizontally-scalable, highly-available log aggregation system
gethomepage.dev/group: Apps
gethomepage.dev/icon: https://raw.githubusercontent.com/grafana/loki/5a8bc848dbe453ce27576d2058755a90f79d07b6/docs/sources/logo.png
gethomepage.dev/name: Loki
nginx.ingress.kubernetes.io/auth-url: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/auth
nginx.ingress.kubernetes.io/auth-signin: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/start?rd=\$scheme://\$host\$request_uri
hosts:
- loki.${CLUSTER_FQDN}
tls:
- hosts:
- loki.${CLUSTER_FQDN}
singleBinary:
replicas: 2
write:
replicas: 0
read:
replicas: 0
backend:
replicas: 0
EOF
helm upgrade --install --version "${LOKI_HELM_CHART_VERSION}" --namespace loki --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-loki.yml" loki grafana/loki
Mimir
Grafana Mimir is an open source, horizontally scalable, multi-tenant time series database for Prometheus metrics, designed for high availability and cost efficiency. It enables you to centralize metrics from multiple clusters or environments, and integrates seamlessly with Grafana dashboards for visualization and alerting.
Install the mimir-distributed Helm chart and customize its default values to fit your environment and storage backend:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
# renovate: datasource=helm depName=mimir-distributed registryUrl=https://grafana.github.io/helm-charts
MIMIR_DISTRIBUTED_HELM_CHART_VERSION="6.1.0-weekly.368"
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-mimir-distributed.yml" << EOF
serviceAccount:
name: mimir
mimir:
structuredConfig:
limits:
compactor_blocks_retention_period: 30d
# {"ts":"2025-11-04T19:30:40.472926117Z","level":"error","msg":"non-recoverable error","component_path":"/","component_id":"prometheus.remote_write.mimir","subcomponent":"rw","remote_name":"5b0906","url":"http://mimir-gateway.mimir.svc.cluster.local/api/v1/push","failedSampleCount":2000,"failedHistogramCount":0,"failedExemplarCount":0,"err":"server returned HTTP status 400 Bad Request: received a series whose number of labels exceeds the limit (actual: 31, limit: 30) series: 'karpenter_nodes_allocatable{arch=\"amd64\", capacity_type=\"spot\", container=\"controller\", endpoint=\"http-metrics\", instance=\"192.168.92.152:8080\", instance_capability_flex=\"false\", instance_category=\"t\"…' (err-mimir-max-label-names-per-series). To adjust the related per-tenant limit, configure -validation.max-label-names-per-series, or contact your service administrator.\n"}
max_label_names_per_series: 50
# Default is 150000
max_global_series_per_user: 300000
common:
# https://grafana.com/docs/mimir/v2.17.x/configure/configuration-parameters/
storage:
backend: s3
s3:
endpoint: s3.${AWS_REGION}.amazonaws.com
region: ${AWS_REGION}
storage_class: ONEZONE_IA
alertmanager_storage:
s3:
bucket_name: ${CLUSTER_FQDN}
storage_prefix: mimiralertmanager
blocks_storage:
s3:
bucket_name: ${CLUSTER_FQDN}
storage_prefix: mimirblocks
ruler_storage:
s3:
bucket_name: ${CLUSTER_FQDN}
storage_prefix: mimirruler
alertmanager:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: alertmanager
topologyKey: kubernetes.io/hostname
distributor:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: distributor
topologyKey: kubernetes.io/hostname
ingester:
zoneAwareReplication:
enabled: false
replicas: 2
priorityClassName: high-priority
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: ingester
topologyKey: kubernetes.io/hostname
overrides_exporter:
priorityClassName: high-priority
ruler:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: ruler
topologyKey: kubernetes.io/hostname
ruler_querier:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: ruler-querier
topologyKey: kubernetes.io/hostname
ruler_query_frontend:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: ruler-query-frontend
topologyKey: kubernetes.io/hostname
ruler_query_scheduler:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: ruler-query-scheduler
topologyKey: kubernetes.io/hostname
querier:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: querier
topologyKey: kubernetes.io/hostname
query_frontend:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: query-frontend
topologyKey: kubernetes.io/hostname
query_scheduler:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: query-scheduler
topologyKey: kubernetes.io/hostname
store_gateway:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: store-gateway
topologyKey: kubernetes.io/hostname
compactor:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: compactor
topologyKey: kubernetes.io/hostname
# https://github.com/grafana/helm-charts/blob/main/charts/rollout-operator/values.yaml
rollout_operator:
serviceMonitor:
enabled: true
priorityClassName: high-priority
minio:
enabled: false
kafka:
priorityClassName: high-priority
gateway:
priorityClassName: high-priority
replicas: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mimir
app.kubernetes.io/component: gateway
topologyKey: kubernetes.io/hostname
EOF
helm upgrade --install --version "${MIMIR_DISTRIBUTED_HELM_CHART_VERSION}" --namespace mimir --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-mimir-distributed.yml" mimir grafana/mimir-distributed
Tempo
Grafana Tempo is an open source, easy-to-use, and high-scale distributed tracing backend. It is designed to be cost-effective and simple to operate, as it only requires object storage to operate its backend and does not index the trace data.
Install the tempo Helm chart and customize its default values to fit your environment and storage requirements:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# renovate: datasource=helm depName=tempo registryUrl=https://grafana.github.io/helm-charts
TEMPO_HELM_CHART_VERSION="1.24.0"
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-tempo.yml" << EOF
replicas: 2
tempo:
# https://youtu.be/PmE9mgYaoQA?t=817
metricsGenerator:
enabled: true
remoteWriteUrl: http://mimir-gateway.mimir.svc.cluster.local/api/v1/push
storage:
trace:
backend: s3
s3:
bucket: ${CLUSTER_FQDN}
endpoint: s3.${AWS_REGION}.amazonaws.com
serviceMonitor:
enabled: true
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: tempo
app.kubernetes.io/name: tempo
topologyKey: kubernetes.io/hostname
priorityClassName: high-priority
EOF
helm upgrade --install --version "${TEMPO_HELM_CHART_VERSION}" --namespace tempo --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-tempo.yml" tempo grafana/tempo
Pyroscope
Grafana Pyroscope is a Continuous Profiling Platform.
Install the pyroscope Helm chart and customize its default values to fit your environment and storage requirements:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# renovate: datasource=helm depName=pyroscope registryUrl=https://grafana.github.io/helm-charts
PYROSCOPE_HELM_CHART_VERSION="1.16.0"
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-pyroscope.yml" << EOF
pyroscope:
replicaCount: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: pyroscope
topologyKey: kubernetes.io/hostname
priorityClassName: high-priority
ingress:
enabled: true
className: nginx
annotations:
gethomepage.dev/enabled: "true"
gethomepage.dev/description: Continuous Profiling Platform
gethomepage.dev/group: Apps
gethomepage.dev/icon: https://raw.githubusercontent.com/grafana/pyroscope/d3818254b7c70a43104effcfd300ff885035ac50/images/logo.png
gethomepage.dev/name: Pyroscope
nginx.ingress.kubernetes.io/auth-url: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/auth
nginx.ingress.kubernetes.io/auth-signin: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/start?rd=\$scheme://\$host\$request_uri
hosts:
- pyroscope.${CLUSTER_FQDN}
tls:
- hosts:
- pyroscope.${CLUSTER_FQDN}
serviceMonitor:
enabled: true
EOF
helm upgrade --install --version "${PYROSCOPE_HELM_CHART_VERSION}" --namespace pyroscope --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-pyroscope.yml" pyroscope grafana/pyroscope
Grafana Kubernetes Monitoring Helm chart
The Grafana Kubernetes Monitoring Helm chart offers a complete solution for configuring infrastructure, zero-code instrumentation, and gathering telemetry.
Install the k8s-monitoring Helm chart and customize its default values to fit your environment and storage requirements:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
# renovate: datasource=helm depName=k8s-monitoring registryUrl=https://grafana.github.io/helm-charts
K8S_MONITORING_HELM_CHART_VERSION="3.6.0"
# https://github.com/suxess-it/kubriX/blob/main/platform-apps/charts/k8s-monitoring/values-kubrix-default.yaml
# https://github.com/ar2pi/potato-cluster/blob/main/kubernetes/helm/grafana-k8s-monitoring/values.yaml
# https://github.com/valesordev/valesor.dev/blob/main/infra/alloy/k8s-monitoring-values.yaml
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-k8s-monitoring.yml" << EOF
cluster:
name: "${CLUSTER_NAME}"
destinations:
- name: prometheus
type: prometheus
url: http://mimir-gateway.mimir.svc.cluster.local/api/v1/push
# tenantId: "1"
- name: loki
type: loki
url: http://loki-gateway.loki.svc.cluster.local/loki/api/v1/push
# tenantId: "1"
- name: otlpgateway
type: otlp
url: http://tempo.tempo.svc.cluster.local:4317
tls:
insecure: true
insecureSkipVerify: true
- name: pyroscope
type: pyroscope
url: http://pyroscope.pyroscope.svc.cluster.local:4040
tls:
insecure_skip_verify: true
# VVV disable ???
clusterMetrics:
enabled: true
# clusterEvents:
# enabled: true
# nodeLogs:
# enabled: true
# podLogs:
# enabled: true
# applicationObservability:
# enabled: true
# receivers:
# otlp:
# grpc:
# enabled: true
# http:
# enabled: true
autoInstrumentation:
enabled: true
annotationAutodiscovery:
enabled: true
prometheusOperatorObjects:
enabled: true
# profiling:
# enabled: true
# # vvv enabled ????
# profilesReceiver:
# enabled: false
alloy-metrics:
enabled: true
# alloy-singleton:
# enabled: true
# alloy-logs:
# enabled: true
# alloy:
# clustering:
# enabled: true
# alloy-receiver:
# enabled: true
# alloy-profiles:
# enabled: true
collectorCommon:
alloy:
priorityClassName: system-node-critical
controller:
priorityClassName: system-node-critical
EOF
helm upgrade --install --version "${K8S_MONITORING_HELM_CHART_VERSION}" --namespace k8s-monitoring --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-k8s-monitoring.yml" k8s-monitoring grafana/k8s-monitoring
Grafana
Grafana is an open-source analytics and monitoring platform that allows you to query, visualize, alert on, and understand your metrics, logs, and traces. It provides a powerful and flexible way to create dashboards and visualizations for monitoring your Kubernetes cluster and applications.
Install the grafana Helm chart and modify its default values:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
# renovate: datasource=helm depName=grafana registryUrl=https://grafana.github.io/helm-charts
GRAFANA_HELM_CHART_VERSION="10.1.4"
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-grafana.yml" << EOF
serviceMonitor:
enabled: true
ingress:
enabled: true
ingressClassName: nginx
annotations:
gethomepage.dev/description: Visualization Platform
gethomepage.dev/enabled: "true"
gethomepage.dev/group: Observability
gethomepage.dev/icon: grafana.svg
gethomepage.dev/name: Grafana
gethomepage.dev/app: grafana
gethomepage.dev/pod-selector: "app.kubernetes.io/name=grafana"
nginx.ingress.kubernetes.io/auth-url: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/auth
nginx.ingress.kubernetes.io/auth-signin: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/start?rd=\$scheme://\$host\$request_uri
nginx.ingress.kubernetes.io/configuration-snippet: |
auth_request_set \$email \$upstream_http_x_auth_request_email;
proxy_set_header X-Email \$email;
path: /
pathType: Prefix
hosts:
- grafana.${CLUSTER_FQDN}
tls:
- hosts:
- grafana.${CLUSTER_FQDN}
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: Mimir
type: prometheus
url: http://mimir-gateway.mimir.svc.cluster.local/prometheus
access: proxy
isDefault: true
jsonData:
prometheusType: Mimir
prometheusVersion: 2.9.1
# httpHeaderName1: X-Scope-OrgID
# secureJsonData:
# httpHeaderValue1: 1
- name: Loki
type: loki
url: http://loki-gateway.loki.svc.cluster.local/
access: proxy
# jsonData:
# httpHeaderName1: X-Scope-OrgID
# secureJsonData:
# httpHeaderValue1: "1"
- name: Tempo
type: tempo
url: http://tempo.tempo.svc.cluster.local:3200
access: proxy
- name: Pyroscope
type: grafana-pyroscope-datasource
url: http://pyroscope.pyroscope.svc.cluster.local:4040
notifiers:
notifiers.yaml:
notifiers:
- name: email-notifier
type: email
uid: email1
org_id: 1
is_default: true
settings:
addresses: ${MY_EMAIL}
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: "default"
orgId: 1
folder: ""
type: file
disableDeletion: false
editable: false
options:
path: /var/lib/grafana/dashboards/default
dashboards:
default:
# keep-sorted start numeric=yes
1860-node-exporter-full:
# renovate: depName="Node Exporter Full"
gnetId: 1860
revision: 37
datasource: Prometheus
# 3662-prometheus-2-0-overview:
# # renovate: depName="Prometheus 2.0 Overview"
# gnetId: 3662
# revision: 2
# datasource: Prometheus
# 9614-nginx-ingress-controller:
# # renovate: depName="NGINX Ingress controller"
# gnetId: 9614
# revision: 1
# datasource: Prometheus
# 12006-kubernetes-apiserver:
# # renovate: depName="Kubernetes apiserver"
# gnetId: 12006
# revision: 1
# datasource: Prometheus
# # https://github.com/DevOps-Nirvana/Grafana-Dashboards
# 14314-kubernetes-nginx-ingress-controller-nextgen-devops-nirvana:
# # renovate: depName="Kubernetes Nginx Ingress Prometheus NextGen"
# gnetId: 14314
# revision: 2
# datasource: Prometheus
# 15038-external-dns:
# # renovate: depName="External-dns"
# gnetId: 15038
# revision: 3
# datasource: Prometheus
15757-kubernetes-views-global:
# renovate: depName="Kubernetes / Views / Global"
gnetId: 15757
revision: 42
15758-kubernetes-views-namespaces:
# renovate: depName="Kubernetes / Views / Namespaces"
gnetId: 15758
revision: 41
15759-kubernetes-views-nodes:
# renovate: depName="Kubernetes / Views / Nodes"
gnetId: 15759
revision: 40
# https://grafana.com/orgs/imrtfm/dashboards - https://github.com/dotdc/grafana-dashboards-kubernetes
15760-kubernetes-views-pods:
# renovate: depName="Kubernetes / Views / Pods"
gnetId: 15760
revision: 37
15761-kubernetes-system-api-server:
# renovate: depName="Kubernetes / System / API Server"
gnetId: 15761
revision: 18
16006-mimir-alertmanager-resources:
# renovate: depName="Mimir / Alertmanager resources"
gnetId: 16006
revision: 17
16007-mimir-alertmanager:
# renovate: depName="Mimir / Alertmanager"
gnetId: 16007
revision: 17
16008-mimir-compactor-resources:
# renovate: depName="Mimir / Compactor resources"
gnetId: 16008
revision: 17
16009-mimir-compactor:
# renovate: depName="Mimir / Compactor"
gnetId: 16009
revision: 17
16010-mimir-config:
# renovate: depName="Mimir / Config"
gnetId: 16010
revision: 17
16011-mimir-object-store:
# renovate: depName="Mimir / Object Store"
gnetId: 16011
revision: 17
16012-mimir-overrides:
# renovate: depName="Mimir / Overrides"
gnetId: 16012
revision: 17
16013-mimir-queries:
# renovate: depName="Mimir / Queries"
gnetId: 16013
revision: 17
16014-mimir-reads-networking:
# renovate: depName="Mimir / Reads networking"
gnetId: 16014
revision: 17
16015-mimir-reads-resources:
# renovate: depName="Mimir / Reads resources"
gnetId: 16015
revision: 17
16016-mimir-reads:
# renovate: depName="Mimir / Reads"
gnetId: 16016
revision: 17
16017-mimir-rollout-progress:
# renovate: depName="Mimir / Rollout progress"
gnetId: 16017
revision: 17
16018-mimir-ruler:
# renovate: depName="Mimir / Ruler"
gnetId: 16018
revision: 17
16019-mimir-scaling:
# renovate: depName="Mimir / Scaling"
gnetId: 16019
revision: 17
16020-mimir-slow-queries:
# renovate: depName="Mimir / Slow queries"
gnetId: 16020
revision: 17
16021-mimir-tenants:
# renovate: depName="Mimir / Tenants"
gnetId: 16021
revision: 17
16022-mimir-top-tenants:
# renovate: depName="Mimir / Top tenants"
gnetId: 16022
revision: 16
16023-mimir-writes-networking:
# renovate: depName="Mimir / Writes networking"
gnetId: 16023
revision: 16
16024-mimir-writes-resources:
# renovate: depName="Mimir / Writes resources"
gnetId: 16024
revision: 17
16026-mimir-writes:
# renovate: depName="Mimir / Writes"
gnetId: 16026
revision: 17
17605-mimir-overview-networking:
# renovate: depName="Mimir / Overview networking"
gnetId: 17605
revision: 13
17606-mimir-overview-resources:
# renovate: depName="Mimir / Overview resources"
gnetId: 17606
revision: 13
17607-mimir-overview:
# renovate: depName="Mimir / Overview"
gnetId: 17607
revision: 13
17608-mimir-remote-ruler-reads:
# renovate: depName="Mimir / Remote ruler reads"
gnetId: 17608
revision: 13
17609-mimir-remote-ruler-reads-resources:
# renovate: depName="Mimir / Remote ruler reads resources"
gnetId: 17609
revision: 13
# 19105-prometheus:
# # renovate: depName="Prometheus"
# gnetId: 19105
# revision: 6
# datasource: Prometheus
# 19268-prometheus:
# # renovate: depName="Prometheus All Metrics"
# gnetId: 19268
# revision: 1
# datasource: Prometheus
# 20340-cert-manager:
# # renovate: depName="cert-manager"
# gnetId: 20340
# revision: 1
# datasource: Prometheus
# 20842-cert-manager-kubernetes:
# # renovate: depName="Cert-manager-Kubernetes"
# gnetId: 20842
# revision: 1
# datasource: Prometheus
19923-beyla-red-metrics:
# renovate: depName="Beyla RED Metrics"
gnetId: 19923
revision: 3
datasource: Prometheus
# keep-sorted end
grafana.ini:
analytics:
check_for_updates: false
auth.basic:
enabled: false
auth.proxy:
enabled: true
header_name: X-Email
header_property: email
users:
auto_assign_org_role: Admin
smtp:
enabled: true
host: mailpit-smtp.mailpit.svc.cluster.local:25
from_address: grafana@${CLUSTER_FQDN}
networkPolicy:
enabled: true
EOF
helm upgrade --install --version "${GRAFANA_HELM_CHART_VERSION}" --namespace grafana --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-grafana.yml" grafana grafana/grafana
Mailpit
Mailpit will be used to receive email alerts from Prometheus.
Install the mailpit Helm chart and modify its default values:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# renovate: datasource=helm depName=mailpit registryUrl=https://jouve.github.io/charts/
MAILPIT_HELM_CHART_VERSION="0.29.1"
helm repo add --force-update jouve https://jouve.github.io/charts/
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-mailpit.yml" << EOF
replicaCount: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: mailpit
topologyKey: kubernetes.io/hostname
ingress:
enabled: true
ingressClassName: nginx
annotations:
gethomepage.dev/enabled: "true"
gethomepage.dev/description: An email and SMTP testing tool with API for developers
gethomepage.dev/group: Apps
gethomepage.dev/icon: https://raw.githubusercontent.com/axllent/mailpit/61241f11ac94eb33bd84e399129992250eff56ce/server/ui/favicon.svg
gethomepage.dev/name: Mailpit
nginx.ingress.kubernetes.io/auth-url: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/auth
nginx.ingress.kubernetes.io/auth-signin: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/start?rd=\$scheme://\$host\$request_uri
hostname: mailpit.${CLUSTER_FQDN}
EOF
helm upgrade --install --version "${MAILPIT_HELM_CHART_VERSION}" --namespace mailpit --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-mailpit.yml" mailpit jouve/mailpit
kubectl label namespace mailpit pod-security.kubernetes.io/enforce=baseline
Screenshot:
OAuth2 Proxy
Use OAuth2 Proxy to protect application endpoints with Google Authentication.
Install the oauth2-proxy Helm chart and modify its default values:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
# renovate: datasource=helm depName=oauth2-proxy registryUrl=https://oauth2-proxy.github.io/manifests
OAUTH2_PROXY_HELM_CHART_VERSION="8.3.2"
helm repo add --force-update oauth2-proxy https://oauth2-proxy.github.io/manifests
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-oauth2-proxy.yml" << EOF
config:
clientID: ${GOOGLE_CLIENT_ID}
clientSecret: ${GOOGLE_CLIENT_SECRET}
cookieSecret: "$(openssl rand -base64 32 | head -c 32 | base64)"
configFile: |-
cookie_domains = ".${CLUSTER_FQDN}"
set_authorization_header = "true"
set_xauthrequest = "true"
upstreams = [ "file:///dev/null" ]
whitelist_domains = ".${CLUSTER_FQDN}"
authenticatedEmailsFile:
enabled: true
restricted_access: |-
${MY_EMAIL}
ingress:
enabled: true
ingressClassName: nginx
annotations:
gethomepage.dev/enabled: "true"
gethomepage.dev/description: A reverse proxy that provides authentication with Google, Azure, OpenID Connect and many more identity providers
gethomepage.dev/group: Cluster Management
gethomepage.dev/icon: https://raw.githubusercontent.com/oauth2-proxy/oauth2-proxy/899c743afc71e695964165deb11f50b9a0703c97/docs/static/img/logos/OAuth2_Proxy_icon.svg
gethomepage.dev/name: OAuth2-Proxy
hosts:
- oauth2-proxy.${CLUSTER_FQDN}
tls:
- hosts:
- oauth2-proxy.${CLUSTER_FQDN}
priorityClassName: critical-priority
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: authentication-proxy
app.kubernetes.io/instance: oauth2-proxy
topologyKey: kubernetes.io/hostname
replicaCount: 2
metrics:
servicemonitor:
enabled: true
EOF
helm upgrade --install --version "${OAUTH2_PROXY_HELM_CHART_VERSION}" --namespace oauth2-proxy --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-oauth2-proxy.yml" oauth2-proxy oauth2-proxy/oauth2-proxy
Homepage
Install Homepage to provide a nice dashboard.
Install the homepage Helm chart and modify its default values:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
# renovate: datasource=helm depName=homepage registryUrl=http://jameswynn.github.io/helm-charts
HOMEPAGE_HELM_CHART_VERSION="2.1.0"
helm repo add --force-update jameswynn http://jameswynn.github.io/helm-charts
cat > "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-homepage.yml" << EOF
enableRbac: true
serviceAccount:
create: true
ingress:
main:
enabled: true
annotations:
gethomepage.dev/enabled: "true"
gethomepage.dev/name: Homepage
gethomepage.dev/description: A modern, secure, highly customizable application dashboard
gethomepage.dev/group: Apps
gethomepage.dev/icon: homepage.png
nginx.ingress.kubernetes.io/auth-url: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/auth
nginx.ingress.kubernetes.io/auth-signin: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/start?rd=\$scheme://\$host\$request_uri
ingressClassName: nginx
hosts:
- host: ${CLUSTER_FQDN}
paths:
- path: /
pathType: Prefix
tls:
- hosts:
- ${CLUSTER_FQDN}
config:
bookmarks:
services:
widgets:
- logo:
icon: kubernetes.svg
- kubernetes:
cluster:
show: true
cpu: true
memory: true
showLabel: true
label: "${CLUSTER_NAME}"
nodes:
show: true
cpu: true
memory: true
showLabel: true
kubernetes:
mode: cluster
settings:
hideVersion: true
title: ${CLUSTER_FQDN}
favicon: https://raw.githubusercontent.com/homarr-labs/dashboard-icons/38631ad11695467d7a9e432d5fdec7a39a31e75f/svg/kubernetes.svg
layout:
Apps:
icon: mdi-apps
Observability:
icon: mdi-chart-bell-curve-cumulative
Cluster Management:
icon: mdi-tools
env:
- name: HOMEPAGE_ALLOWED_HOSTS
value: ${CLUSTER_FQDN}
- name: LOG_TARGETS
value: stdout
EOF
helm upgrade --install --version "${HOMEPAGE_HELM_CHART_VERSION}" --namespace homepage --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-homepage.yml" homepage jameswynn/homepage
Clean-up
Back up the certificate before deleting the cluster (in case it was renewed):
1
2
3
if [[ "$(kubectl get --raw /api/v1/namespaces/cert-manager/services/cert-manager:9402/proxy/metrics | awk '/certmanager_http_acme_client_request_count.*acme-v02\.api.*finalize/ { print $2 }')" -gt 0 ]]; then
velero backup create --labels letsencrypt=production --ttl 2160h --from-schedule velero-monthly-backup-cert-manager-production
fi
Stop Karpenter from launching additional nodes:
1
2
helm uninstall -n karpenter karpenter || true
helm uninstall -n ingress-nginx ingress-nginx || true
Remove any remaining EC2 instances provisioned by Karpenter (if they still exist):
1
2
3
4
for EC2 in $(aws ec2 describe-instances --filters "Name=tag:kubernetes.io/cluster/${CLUSTER_NAME},Values=owned" "Name=tag:karpenter.sh/nodepool,Values=*" Name=instance-state-name,Values=running --query "Reservations[].Instances[].InstanceId" --output text); do
echo "🗑️ Removing Karpenter EC2: ${EC2}"
aws ec2 terminate-instances --instance-ids "${EC2}"
done
Disassociate a Route 53 Resolver query log configuration from an Amazon VPC:
1
2
3
4
5
6
7
8
RESOLVER_QUERY_LOG_CONFIGS_ID=$(aws route53resolver list-resolver-query-log-configs --query "ResolverQueryLogConfigs[?contains(DestinationArn, '/aws/eks/${CLUSTER_NAME}/cluster')].Id" --output text)
if [[ -n "${RESOLVER_QUERY_LOG_CONFIGS_ID}" ]]; then
RESOLVER_QUERY_LOG_CONFIG_ASSOCIATIONS_RESOURCEID=$(aws route53resolver list-resolver-query-log-config-associations --filters "Name=ResolverQueryLogConfigId,Values=${RESOLVER_QUERY_LOG_CONFIGS_ID}" --query 'ResolverQueryLogConfigAssociations[].ResourceId' --output text)
if [[ -n "${RESOLVER_QUERY_LOG_CONFIG_ASSOCIATIONS_RESOURCEID}" ]]; then
aws route53resolver disassociate-resolver-query-log-config --resolver-query-log-config-id "${RESOLVER_QUERY_LOG_CONFIGS_ID}" --resource-id "${RESOLVER_QUERY_LOG_CONFIG_ASSOCIATIONS_RESOURCEID}"
sleep 5
fi
fi
Clean up AWS Route 53 Resolver query log configurations:
1
2
3
4
AWS_CLUSTER_ROUTE53_RESOLVER_QUERY_LOG_CONFIG_ID=$(aws route53resolver list-resolver-query-log-configs --query "ResolverQueryLogConfigs[?Name=='${CLUSTER_NAME}-vpc-dns-logs'].Id" --output text)
if [[ -n "${AWS_CLUSTER_ROUTE53_RESOLVER_QUERY_LOG_CONFIG_ID}" ]]; then
aws route53resolver delete-resolver-query-log-config --resolver-query-log-config-id "${AWS_CLUSTER_ROUTE53_RESOLVER_QUERY_LOG_CONFIG_ID}"
fi
Remove the EKS cluster and its created components:
1
2
3
if eksctl get cluster --name="${CLUSTER_NAME}"; then
eksctl delete cluster --name="${CLUSTER_NAME}" --force
fi
Remove the Route 53 DNS records from the DNS Zone:
1
2
3
4
5
6
7
8
9
10
CLUSTER_FQDN_ZONE_ID=$(aws route53 list-hosted-zones --query "HostedZones[?Name==\`${CLUSTER_FQDN}.\`].Id" --output text)
if [[ -n "${CLUSTER_FQDN_ZONE_ID}" ]]; then
aws route53 list-resource-record-sets --hosted-zone-id "${CLUSTER_FQDN_ZONE_ID}" | jq -c '.ResourceRecordSets[] | select (.Type != "SOA" and .Type != "NS")' |
while read -r RESOURCERECORDSET; do
aws route53 change-resource-record-sets \
--hosted-zone-id "${CLUSTER_FQDN_ZONE_ID}" \
--change-batch '{"Changes":[{"Action":"DELETE","ResourceRecordSet": '"${RESOURCERECORDSET}"' }]}' \
--output text --query 'ChangeInfo.Id'
done
fi
Delete Instance profile which belongs to Karpenter role:
1
2
3
4
5
6
if AWS_INSTANCE_PROFILES_FOR_ROLE=$(aws iam list-instance-profiles-for-role --role-name "KarpenterNodeRole-${CLUSTER_NAME}" --query 'InstanceProfiles[].{Name:InstanceProfileName}' --output text); then
if [[ -n "${AWS_INSTANCE_PROFILES_FOR_ROLE}" ]]; then
aws iam remove-role-from-instance-profile --instance-profile-name "${AWS_INSTANCE_PROFILES_FOR_ROLE}" --role-name "KarpenterNodeRole-${CLUSTER_NAME}"
aws iam delete-instance-profile --instance-profile-name "${AWS_INSTANCE_PROFILES_FOR_ROLE}"
fi
fi
Remove the CloudFormation stack:
1
2
3
4
5
aws cloudformation delete-stack --stack-name "${CLUSTER_NAME}-route53-kms"
aws cloudformation delete-stack --stack-name "${CLUSTER_NAME}-karpenter"
aws cloudformation wait stack-delete-complete --stack-name "${CLUSTER_NAME}-route53-kms"
aws cloudformation wait stack-delete-complete --stack-name "${CLUSTER_NAME}-karpenter"
aws cloudformation wait stack-delete-complete --stack-name "eksctl-${CLUSTER_NAME}-cluster"
Remove volumes and snapshots related to the cluster (as a precaution):
1
2
3
4
5
6
7
8
9
10
for VOLUME in $(aws ec2 describe-volumes --filter "Name=tag:KubernetesCluster,Values=${CLUSTER_NAME}" "Name=tag:kubernetes.io/cluster/${CLUSTER_NAME},Values=owned" --query 'Volumes[].VolumeId' --output text); do
echo "💾 Removing Volume: ${VOLUME}"
aws ec2 delete-volume --volume-id "${VOLUME}"
done
# Remove EBS snapshots associated with the cluster
for SNAPSHOT in $(aws ec2 describe-snapshots --owner-ids self --filter "Name=tag:Name,Values=${CLUSTER_NAME}-dynamic-snapshot*" "Name=tag:kubernetes.io/cluster/${CLUSTER_NAME},Values=owned" --query 'Snapshots[].SnapshotId' --output text); do
echo "📸 Removing Snapshot: ${SNAPSHOT}"
aws ec2 delete-snapshot --snapshot-id "${SNAPSHOT}"
done
Remove the CloudWatch log group:
1
2
3
if [[ "$(aws logs describe-log-groups --query "logGroups[?logGroupName==\`/aws/eks/${CLUSTER_NAME}/cluster\`] | [0].logGroupName" --output text)" = "/aws/eks/${CLUSTER_NAME}/cluster" ]]; then
aws logs delete-log-group --log-group-name "/aws/eks/${CLUSTER_NAME}/cluster"
fi
Remove the ${TMP_DIR}/${CLUSTER_FQDN} directory:
1
2
3
4
5
6
7
8
9
10
if [[ -d "${TMP_DIR}/${CLUSTER_FQDN}" ]]; then
for FILE in "${TMP_DIR}/${CLUSTER_FQDN}"/{kubeconfig-${CLUSTER_NAME}.conf,{aws-cf-route53-kms,cloudformation-karpenter,eksctl-${CLUSTER_NAME},helm_values-{aws-load-balancer-controller,cert-manager,external-dns,grafana,homepage,ingress-nginx,k8s-monitoring,karpenter,loki,mailpit,mimir-distributed,oauth2-proxy,pyroscope,tempo,velero},k8s-{karpenter-nodepool,scheduling-priorityclass,storage-snapshot-storageclass-volumesnapshotclass}}.yml}; do
if [[ -f "${FILE}" ]]; then
rm -v "${FILE}"
else
echo "❌ File not found: ${FILE}"
fi
done
rmdir "${TMP_DIR}/${CLUSTER_FQDN}"
fi
Enjoy … 😉













