# AutoPilot Deployment
All dashboards and APIs are deployed to a dedicated EKS cluster in a completely separate AWS account. This ensures that the production environment is completely isolated from the rest of the infrastructure. This guide walks you through the steps needed to create a new EKS cluster and a new RDS cluster so that they are ready for the AutoPilot application to be deployed to them.
Assumptions
- A brand new AWS account has been created
- SSO has been set up for this account with proper users, groups, & permission sets
- You are currently signed in to the account using SSO via the CLI
- AWS access is configured through an SSO profile named production-autopilot-sso
- You have access to vault.jetrails.com via the CLI
- You have kubectl and kubectx installed
# Deploy CloudFormation Stacks
First we have to launch some stacks using the jetrails/aws-cloudformation-templates repo. You can find the latest versions of these compiled templates by looking at our CI system; look for builds on the master branch, since those are built and promoted to production. The examples below assume you compiled them locally and are in the aws-cloudformation-templates repo directory.
This stack exports the prefix list that our cluster will use to whitelist Cloudflare. Please note that if the account does not have an S3 bucket, one can easily be created by AWS by uploading any template manually through the CloudFormation GUI:
```
aws cloudformation create-stack \
  --profile autopilot \
  --template-body file://./dist/prefix-lists-$USER.yml \
  --capabilities CAPABILITY_NAMED_IAM \
  --stack-name cloudflare-prefix-lists
```
This stack deploys resources for AWS Backup. We simply tag resources a certain way and AWS Backup backs them up for us.
```
aws cloudformation create-stack \
  --profile autopilot \
  --template-body file://./dist/backup-$USER.yml \
  --capabilities CAPABILITY_NAMED_IAM \
  --stack-name backup
```
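Both create-stack calls return immediately. If you want to block until the stacks finish before moving on, the standard CloudFormation waiters work:

```
aws cloudformation wait stack-create-complete --profile autopilot --stack-name cloudflare-prefix-lists
aws cloudformation wait stack-create-complete --profile autopilot --stack-name backup
```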
Now we can spin up a stack using the CloudFormation template that is found in ./stacks/k8s-cluster.yaml. This template contains everything that is needed to spin up an EKS cluster with encrypted secrets. Make sure that you check the default parameters, since they contain some IPs that we use to whitelist access to the control plane.
```
aws cloudformation deploy \
  --profile autopilot \
  --template-file ./stacks/k8s-cluster.yaml \
  --stack-name az-use1-k8s-production \
  --capabilities CAPABILITY_NAMED_IAM
```
After the stack has successfully deployed, we can print out the outputs using the following command:
```
aws cloudformation describe-stacks \
  --profile autopilot \
  --stack-name az-use1-k8s-production
```
We can now save some of the outputs into variables because we will use them in the upcoming commands:
```
CLUSTER_NAME=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`ClusterName`].OutputValue' \
    --output text
)
ELASTIC_IP_1=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`ElasticIp1`].OutputValue' \
    --output text
)
ELASTIC_IP_2=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`ElasticIp2`].OutputValue' \
    --output text
)
ELASTIC_IP_3=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`ElasticIp3`].OutputValue' \
    --output text
)
VPC_ID=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`VpcId`].OutputValue' \
    --output text
)
CIDR_BLOCK=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`CidrBlock`].OutputValue' \
    --output text
)
PUBLIC_SUBNET_ID_1=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`PublicSubnetId1`].OutputValue' \
    --output text
)
PUBLIC_SUBNET_ID_2=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`PublicSubnetId2`].OutputValue' \
    --output text
)
AVAILABILITY_ZONE=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`AvailabilityZone1`].OutputValue' \
    --output text
)
FILE_SYSTEM_ID=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`FileSystemId`].OutputValue' \
    --output text
)
EFS_CSI_DRIVER_ROLE_ARN=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`EfsCsiDriverRoleArn`].OutputValue' \
    --output text
)
ACCESS_POINT_REDIS_SESSIONS=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`AccessPointRedisSessionsId`].OutputValue' \
    --output text
)
ACCESS_POINT_RABBITMQ=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`AccessPointRabbitmqId`].OutputValue' \
    --output text
)
```
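Since each variable differs only by its OutputKey, a small helper keeps this less repetitive. This is just a local convenience, not something the repo provides:

```
# Hypothetical helper -- not part of the repo, shown for convenience only
stack_output () {
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query "Stacks[0].Outputs[?OutputKey==\`$1\`].OutputValue" \
    --output text
}

# Example usage: VPC_ID=$(stack_output VpcId)
```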
# Setup EKS Cluster
Now we can update our ~/.kube/config by adding the new cluster to it using the following command:
```
aws eks update-kubeconfig \
  --profile autopilot \
  --alias az-use1-k8s-production \
  --name $CLUSTER_NAME
```
You should now have access to the created cluster; you can verify the connection by running the following:
```
kubectx az-use1-k8s-production
kubectl get ns
```
Once you have confirmed that you have access to the created EKS cluster, we will provision the cluster itself. Let's start by creating supporting k8s objects:
```
kubectl create namespace kube-critical
kubectl label namespace/kube-critical name=kube-critical
kubens kube-critical
kubectl apply -f kube/priority-class
```
In the kube-critical namespace, we will install an ingress nginx controller and a CRD to sync secrets from our vault deployment (vault.jetrails.com).
# Install Ingress NGINX Controller
```
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --values values/ingress-nginx.yaml
```
You can run kubectl get svc and record the hostname of the attached load balancer. This hostname should be used as the target of a CNAME record for az-use1-k8s-production.jetrails.com (proxied).
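If you prefer to grab the hostname directly, something like this should work, assuming the chart's default controller service name for this release:

```
# Service name assumes the ingress-nginx chart defaults
kubectl -n kube-critical get svc ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```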
Now we need to lock down traffic to the load balancer so that only Cloudflare can access port 443. Go to the security group that was created and attached to the load balancer, and edit the inbound rules: remove the existing HTTPS rule that allows all traffic, then create a brand new rule that references our managed Cloudflare IPv4 prefix list.
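If you would rather do this from the CLI, a sketch like the following should be equivalent; the security group and prefix list IDs are placeholders you need to look up first:

```
# Placeholder IDs -- look these up in the console or via the CLI first
LB_SECURITY_GROUP_ID="sg-xxxxxxxxxxxxxxxxx"
CLOUDFLARE_PREFIX_LIST_ID="pl-xxxxxxxxxxxxxxxxx"

# Remove the open HTTPS rule...
aws ec2 revoke-security-group-ingress \
  --profile autopilot \
  --group-id "$LB_SECURITY_GROUP_ID" \
  --protocol tcp --port 443 --cidr 0.0.0.0/0

# ...and replace it with one scoped to the Cloudflare prefix list
aws ec2 authorize-security-group-ingress \
  --profile autopilot \
  --group-id "$LB_SECURITY_GROUP_ID" \
  --ip-permissions "IpProtocol=tcp,FromPort=443,ToPort=443,PrefixListIds=[{PrefixListId=$CLOUDFLARE_PREFIX_LIST_ID,Description=Cloudflare}]"
```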
# Install Vault Secrets Operator
Next we will install the vault-secrets-operator chart which syncs secrets from vault.jetrails.com.
```
helm repo add ricoberger https://ricoberger.github.io/helm-charts
helm repo update
helm upgrade --install vault-secrets-operator ricoberger/vault-secrets-operator --values values/vault-secrets-operator.yaml
```
Gather needed information:
```
kubectl apply -f kube/secret/vault-secrets-operator-secret.yaml
export VAULT_SECRETS_OPERATOR_NAMESPACE=$(kubectl get sa vault-secrets-operator -o jsonpath="{.metadata.namespace}")
export VAULT_SECRET_NAME="vault-secrets-operator-secret"
export SA_JWT_TOKEN=$(kubectl get secret $VAULT_SECRET_NAME -o jsonpath="{.data.token}" | base64 --decode; echo)
export SA_CA_CRT=$(kubectl get secret $VAULT_SECRET_NAME -o jsonpath="{.data['ca\.crt']}" | base64 --decode; echo)
export K8S_HOST=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
```
Create a kubernetes auth method with a custom path:
```
vault auth enable --path="az-use1-k8s-production" kubernetes
```
Figure out the token issuer (run kubectl proxy in another tab before running the next command):
```
# Request a token, extract the JWT payload, and read the issuer claim
export TOKEN_ISSUER=$(
  curl --silent http://127.0.0.1:8001/api/v1/namespaces/default/serviceaccounts/default/token \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{"apiVersion": "authentication.k8s.io/v1", "kind": "TokenRequest"}' \
    | jq -r '.status.token' \
    | cut -d . -f2 \
    | base64 --decode \
    | jq -r '.iss'
)
```
```
vault write auth/az-use1-k8s-production/config \
  issuer="$TOKEN_ISSUER" \
  token_reviewer_jwt="$SA_JWT_TOKEN" \
  kubernetes_host="$K8S_HOST" \
  kubernetes_ca_cert="$SA_CA_CRT"
vault write auth/az-use1-k8s-production/role/vault-secrets-operator \
  bound_service_account_names="vault-secrets-operator" \
  bound_service_account_namespaces="$VAULT_SECRETS_OPERATOR_NAMESPACE" \
  policies=k8s-production \
  ttl=24h
```
Now we have to whitelist the connection to our vault server (on DigitalOcean) from our k8s cluster (on AWS). You can find the Elastic IPs that need to be whitelisted in the CloudFormation outputs; we saved them to environment variables in the earlier steps.
```
echo $ELASTIC_IP_1
echo $ELASTIC_IP_2
echo $ELASTIC_IP_3
```
The best place to save them is in Cloudflare's k8s_nodes list.
Next, you will want to kill the vault-secrets-operator pod to force a restart (or you can wait for a restart). Then it is time to test it out:
```
cat <<EOF | kubectl apply -f -
apiVersion: ricoberger.de/v1alpha1
kind: VaultSecret
metadata:
  name: test
spec:
  keys:
    - ca.crt
  path: ssl/origin-pull.cloudflare.com
  type: Opaque
EOF
```
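If the sync is working, the operator should materialize a native secret named test within a few seconds. These read-only checks are a quick way to confirm:

```
kubectl get vaultsecret test
kubectl get secret test -o jsonpath="{.data['ca\.crt']}" | base64 --decode
```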
Once done, clean up:
```
kubectl delete vaultsecret test
```
# Environment For Applications
We are done setting up what we need. We can now create new namespaces for our applications to be deployed to:
```
kubectl create namespace api
kubectl label namespace/api name=api
kubens api
kubectl apply -f kube/network-policy/network-separation.yaml
kubectl create namespace portals
kubectl label namespace/portals name=portals
kubens portals
kubectl apply -f kube/network-policy/network-separation.yaml
```
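For context, network-separation.yaml in the repo is the source of truth. A hypothetical sketch of the kind of policy it applies, assuming the goal is to allow same-namespace traffic plus the ingress controller, might look like this:

```
# Hypothetical sketch only -- the real policy is kube/network-policy/network-separation.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: network-separation
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}            # pods in the same namespace
        - namespaceSelector:
            matchLabels:
              name: kube-critical    # the ingress-nginx namespace
```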
Done! We now want to create a service account, generate an access token for it, and store it in vault so we can use it in our CI/CD pipeline.
```
kubens api
kubectl apply -f kube/drone-access/drone-access.yaml
export DRONE_DEPLOY_TOKEN_API=$(kubectl get secret drone-deploy-token -o jsonpath="{.data.token}" | base64 --decode; echo)
vault kv put az-use1-k8s-production.jetrails.com/api/drone-helm/token \
  api="$K8S_HOST" \
  token="$DRONE_DEPLOY_TOKEN_API" \
  x-drone-branches="master" \
  x-drone-events="push" \
  x-drone-repos="jetrails/api"
kubens portals
kubectl apply -f kube/drone-access/drone-access.yaml
export DRONE_DEPLOY_TOKEN_PORTALS=$(kubectl get secret drone-deploy-token -o jsonpath="{.data.token}" | base64 --decode; echo)
vault kv put az-use1-k8s-production.jetrails.com/portals/drone-helm/token \
  api="$K8S_HOST" \
  token="$DRONE_DEPLOY_TOKEN_PORTALS" \
  x-drone-branches="master" \
  x-drone-events="push" \
  x-drone-repos="jetrails/portals"
```
# Setup RDS Cluster
Finally, let's create an RDS cluster for our application to use:
```
DATABASE_NAME=`jrctl utility mkpass -S -l 16`
DATABASE_USER=`jrctl utility mkpass -S -l 16`
DATABASE_PASS=`jrctl utility mkpass -S -l 32`
aws cloudformation deploy \
  --profile autopilot \
  --template-file ./stacks/rds-cluster.yaml \
  --stack-name rds-cluster \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides \
    VpcId="$VPC_ID" \
    CidrBlock="$CIDR_BLOCK" \
    PublicSubnetId1="$PUBLIC_SUBNET_ID_1" \
    PublicSubnetId2="$PUBLIC_SUBNET_ID_2" \
    AvailabilityZone="$AVAILABILITY_ZONE" \
    DatabaseName="$DATABASE_NAME" \
    DatabaseUser="$DATABASE_USER" \
    DatabasePass="$DATABASE_PASS"
```
Extract the database endpoint from the stack outputs:
```
DATABASE_ENDPOINT=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name rds-cluster \
    --query 'Stacks[0].Outputs[?OutputKey==`DatabaseWriteEndpoint`].OutputValue' \
    --output text
)
```
Finally, put the credentials into vault:
```
vault kv put aws.amazon.com/production/database \
  hostname="$DATABASE_ENDPOINT" \
  database="$DATABASE_NAME" \
  username="$DATABASE_USER" \
  password="$DATABASE_PASS" \
  port="3306" \
  cli-command="mysql -h $DATABASE_ENDPOINT -u $DATABASE_USER -p$DATABASE_PASS $DATABASE_NAME"
```
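As a quick smoke test, you can try connecting from inside the cluster with a throwaway client pod. The image tag here is an assumption; any MySQL client image will do:

```
# Throwaway client pod; removed automatically when the session ends
kubectl run mysql-client --rm -it --restart=Never --image=mysql:8.0 -- \
  mysql -h "$DATABASE_ENDPOINT" -u "$DATABASE_USER" -p"$DATABASE_PASS" "$DATABASE_NAME" -e "SELECT 1;"
```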
# Setup Persistent Storage
For simplicity and cost-related reasons, it turns out that using EFS directly is the best option. The following docs were used to write the CFN template for the EFS CSI driver:
- https://aws.amazon.com/awstv/watch/5aa92cec069/
- https://www.eksworkshop.com/docs/fundamentals/storage/efs/efs-csi-driver
If you already installed the main cluster stack, then almost everything has already been provisioned for you.
The only thing you will need to do is install the EFS CSI driver and create a storage class for it. Let's start by installing the EFS CSI driver:
```
aws eks create-addon \
  --region=us-east-1 \
  --profile autopilot \
  --cluster-name $CLUSTER_NAME \
  --addon-name aws-efs-csi-driver \
  --service-account-role-arn $EFS_CSI_DRIVER_ROLE_ARN
```
Now we can wait for the addon to be active:
```
aws eks wait addon-active \
  --region=us-east-1 \
  --profile autopilot \
  --cluster-name $CLUSTER_NAME \
  --addon-name aws-efs-csi-driver
```
Once that is done, we can create a storage class for the EFS CSI driver:
```
kubectl apply -f - <<EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: $FILE_SYSTEM_ID
  directoryPerms: "700"
EOF
```
You can now use the storage class in your deployments. Here is an example of how to do that with dynamic provisioning:
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
    - name: app
      image: alpine
      command: ["/bin/sh"]
      args: ["-c", "sleep 500000"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: efs-claim
```
For persisting data for Redis and RabbitMQ, we will use static provisioning and specify a root path. We can do this by specifying the access points that the stack made for us:
```
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-redis-sessions-data
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: $FILE_SYSTEM_ID::$ACCESS_POINT_REDIS_SESSIONS
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-rabbitmq-data
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: $FILE_SYSTEM_ID::$ACCESS_POINT_RABBITMQ
EOF
```
Since we manually deployed the PVs, we must deploy the PVCs with our applications. More information about that is in the helm chart located in the jetrails/api repository.
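For reference, a claim that binds to one of these static volumes would look roughly like this; the claim name is hypothetical, since the real PVCs ship with the application charts:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-redis-sessions-claim       # hypothetical name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  volumeName: efs-redis-sessions-data  # pins the claim to the static PV above
  resources:
    requests:
      storage: 1Gi
```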
# Installing Monitoring Software
Add repos for the charts we want to install:
```
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
```
Then we should create a new namespace for our monitoring software:
```
kubectl create namespace kube-monitoring
kubectl label namespace/kube-monitoring name=kube-monitoring
kubens kube-monitoring
kubectl apply -f kube/priority-class
```
Install the secrets that contain the Slack webhook endpoint and the origin-pull certificate:
```
kubectl apply -f kube/secret/monitoring-slack-webhook.yaml
kubectl apply -f kube/secret/origin-pull.yaml
```
Install the datasources configmaps:
```
kubectl apply -f kube/configmap/datasources.yaml
```
Install loki-stack helm chart:
```
SLACK_WEBHOOK=`kubectl -n kube-monitoring get secret monitoring-slack-webhook -o jsonpath='{.data.endpoint}' | base64 -d`
helm upgrade --install loki grafana/loki-stack --values values/loki.yaml --set grafana.notifiers.slack.settings.url="$SLACK_WEBHOOK"
```
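Before moving on, it is worth checking that the stack's pods settle into a Running state:

```
kubectl -n kube-monitoring get pods
```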
We need to set up a CNAME for grafana.jetrails.com to point to this cluster. You can edit the hostname in the values file specified above. If you are using Cloudflare, make sure you create a page rule that disables performance features and sets the cache level to bypass.
Now you need to update the loki-promtail configmap after the helm chart is installed:
```
kubectl apply -f kube/configmap/loki-promtail.yaml
```
Note: You can ignore the warning for now. If you are updating the configmap, you will need to restart the promtail pod.
Get admin credentials by running:
```
kubectl get secret loki-grafana -o jsonpath="{.data.admin-user}" | base64 -d
kubectl get secret loki-grafana -o jsonpath="{.data.admin-password}" | base64 -d
```
When you log in to the Grafana UI, navigate to Administration -> Service accounts. Create a service account with the Admin role, create a token for it, and copy it somewhere. That's your Grafana API key.
```
API_KEY="<REPLACE-ME>"
```
Now you can run this to configure grafana alerts:
```
./scripts/configure_grafana.sh "$API_KEY" "$SLACK_WEBHOOK"
```
Go to Home > Alerting > Notification policies and edit the default policy. Change the default contact point to "autopilot-logs" and save.
For more info, check out this GitHub task.
# Upgrading EKS Cluster On AWS
Make sure you have kubectl and eksctl installed.
Before upgrading, use the pluto CLI to determine if any k8s objects are deprecated and if any charts/templates need to be updated in order to work with the latest k8s version.
Versions need to be upgraded one minor version at a time, so if you are planning on upgrading from 1.25 to 1.30, then you need to upgrade to 1.26 first, then 1.27, 1.28, 1.29, and finally 1.30. You should upgrade the control plane first, then core-dns, kube-proxy, and aws-node, and finally the node group.
# Gather Information
```
print_k8s_info () {
  k8sVersion=`kubectl version -o json | jq -r '.serverVersion.gitVersion'`
  coreDnsVersion=`kubectl get deployment -n kube-system coredns -o=jsonpath='{$.spec.template.spec.containers[:1].image}'`
  kubeProxyVersion=`kubectl get daemonset -n kube-system kube-proxy -o=jsonpath='{$.spec.template.spec.containers[:1].image}'`
  awsVpcCniVersion=`kubectl get daemonset -n kube-system aws-node -o=jsonpath='{$.spec.template.spec.containers[:1].image}'`
  k8sNodeVersion=`kubectl get nodes -o json | jq -r '.items[].status.nodeInfo.kubeletVersion' | xargs`
  echo "Kubernetes Version: $k8sVersion"
  echo "CoreDNS Version: $coreDnsVersion"
  echo "Kube Proxy Version: $kubeProxyVersion"
  echo "AWS VPC CNI Version: $awsVpcCniVersion"
  echo "Kubernetes Node Version: $k8sNodeVersion"
}
print_k8s_info
```
# Upgrade Commands
The control plane needs to be upgraded via the CFN template. Once it is upgraded, run the following:
```
eksctl utils update-coredns --profile autopilot --cluster ControlPlane-plTeNmD7jV86 --approve
eksctl utils update-kube-proxy --profile autopilot --cluster ControlPlane-plTeNmD7jV86 --approve
eksctl utils update-aws-node --profile autopilot --cluster ControlPlane-plTeNmD7jV86 --approve
eksctl upgrade nodegroup --profile autopilot --name ManagedNodeGroup-zmkenIMnwlzk --cluster ControlPlane-plTeNmD7jV86 --kubernetes-version=1.29
```
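After each hop, it's worth re-running print_k8s_info above to confirm everything reports the new version. Assuming eksctl defaults, this read-only check also shows the node group's current state:

```
eksctl get nodegroup --profile autopilot --cluster ControlPlane-plTeNmD7jV86
```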
# Upgrade Addons
```
CLUSTER_NAME=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`ClusterName`].OutputValue' \
    --output text
)
EFS_CSI_DRIVER_ROLE_ARN=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`EfsCsiDriverRoleArn`].OutputValue' \
    --output text
)
aws eks update-addon \
  --region=us-east-1 \
  --profile autopilot \
  --cluster-name $CLUSTER_NAME \
  --addon-name aws-efs-csi-driver \
  --addon-version v2.1.8-eksbuild.1 \
  --service-account-role-arn $EFS_CSI_DRIVER_ROLE_ARN
aws eks wait addon-active \
  --region=us-east-1 \
  --profile autopilot \
  --cluster-name $CLUSTER_NAME \
  --addon-name aws-efs-csi-driver
```
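Once the waiter returns, you can confirm the version that is actually running; describe-addon is read-only:

```
aws eks describe-addon \
  --region us-east-1 \
  --profile autopilot \
  --cluster-name $CLUSTER_NAME \
  --addon-name aws-efs-csi-driver \
  --query 'addon.addonVersion' \
  --output text
```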