# AutoPilot Deployment

All dashboards and APIs are deployed to a dedicated EKS cluster in a completely separate AWS account. This ensures that the production environment is fully isolated from the rest of our infrastructure. This guide walks you through the steps needed to create a new EKS cluster and a new RDS cluster so that the AutoPilot application can be deployed to them.

# Deploy CloudFormation Stacks

First we have to launch some stacks using the jetrails/aws-cloudformation-templates repo. You can find the latest versions of the compiled templates by looking at our CI system; look for builds on the master branch, since those are built and promoted to production. The examples below assume you compiled the templates locally and are in the aws-cloudformation-templates repo directory.

This stack exports the prefix list that our cluster will use to whitelist Cloudflare. Please note that if the account does not yet have a CloudFormation template bucket in S3, one can easily be created by AWS by uploading any template manually through the CloudFormation GUI:

aws cloudformation create-stack \
  --profile autopilot \
  --template-body file://./dist/prefix-lists-$USER.yml \
  --capabilities CAPABILITY_NAMED_IAM \
  --stack-name cloudflare-prefix-lists

This stack deploys resources for AWS Backup. We simply tag resources a certain way and AWS Backup will back them up for us.

aws cloudformation create-stack \
  --profile autopilot \
  --template-body file://./dist/backup-$USER.yml \
  --capabilities CAPABILITY_NAMED_IAM \
  --stack-name backup

Now we can spin up a stack using the CloudFormation template found in ./stacks/k8s-cluster.yaml. This template contains everything needed to spin up an EKS cluster with encrypted secrets. Make sure you check the default parameters, since they contain the IPs we use to whitelist access to the control plane.

aws cloudformation deploy \
  --profile autopilot \
  --template-file ./stacks/k8s-cluster.yaml \
  --stack-name az-use1-k8s-production \
  --capabilities CAPABILITY_NAMED_IAM

After the stack has successfully deployed, we can print out the outputs using the following command:

aws cloudformation describe-stacks \
  --profile autopilot \
  --stack-name az-use1-k8s-production

We can now save some of the outputs into variables because we will use them in the upcoming commands:

stack_output () {
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query "Stacks[0].Outputs[?OutputKey==\`$1\`].OutputValue" \
    --output text
}

CLUSTER_NAME=$(stack_output ClusterName)
ELASTIC_IP_1=$(stack_output ElasticIp1)
ELASTIC_IP_2=$(stack_output ElasticIp2)
ELASTIC_IP_3=$(stack_output ElasticIp3)
VPC_ID=$(stack_output VpcId)
CIDR_BLOCK=$(stack_output CidrBlock)
PUBLIC_SUBNET_ID_1=$(stack_output PublicSubnetId1)
PUBLIC_SUBNET_ID_2=$(stack_output PublicSubnetId2)
AVAILABILITY_ZONE=$(stack_output AvailabilityZone1)
FILE_SYSTEM_ID=$(stack_output FileSystemId)
EFS_CSI_DRIVER_ROLE_ARN=$(stack_output EfsCsiDriverRoleArn)
ACCESS_POINT_REDIS_SESSIONS=$(stack_output AccessPointRedisSessionsId)
ACCESS_POINT_RABBITMQ=$(stack_output AccessPointRabbitmqId)

# Setup EKS Cluster

Now we can update our ~/.kube/config by adding the new cluster to it using the following command:

aws eks update-kubeconfig \
  --profile autopilot \
  --alias az-use1-k8s-production \
  --name $CLUSTER_NAME

You should now have access to the created cluster. You can verify the connection by running the following:

kubectx az-use1-k8s-production
kubectl get ns

Once you have confirmed access to the created EKS cluster, we can provision the cluster itself. Let's start by creating supporting k8s objects:

kubectl create namespace kube-critical
kubectl label namespace/kube-critical name=kube-critical
kubens kube-critical
kubectl apply -f kube/priority-class

In the kube-critical namespace, we will install an ingress nginx controller and a CRD to sync secrets from our vault deployment (vault.jetrails.com).

# Install Ingress NGINX Controller

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --values values/ingress-nginx.yaml

You can run kubectl get svc and record the hostname of the attached load balancer. A CNAME record for az-use1-k8s-production.jetrails.com (proxied) should be created pointing at that hostname.

Now we need to lock down traffic to the load balancer so that only Cloudflare can access port 443. Go to the security group that was created and attached to the load balancer, and edit the inbound rules to look like the following:

(Screenshots: load balancer security group inbound rules.)

Notice that the existing HTTPS rule that allowed all traffic was removed first, and then a brand new rule was created that referenced our managed Cloudflare IPv4 prefix list.

# Install Vault Secrets Operator

Next we will install the vault-secrets-operator chart which syncs secrets from vault.jetrails.com.

helm repo add ricoberger https://ricoberger.github.io/helm-charts
helm repo update
helm upgrade --install vault-secrets-operator ricoberger/vault-secrets-operator --values values/vault-secrets-operator.yaml

Gather needed information:

kubectl apply -f kube/secret/vault-secrets-operator-secret.yaml
export VAULT_SECRETS_OPERATOR_NAMESPACE=$(kubectl get sa vault-secrets-operator -o jsonpath="{.metadata.namespace}")
export VAULT_SECRET_NAME="vault-secrets-operator-secret"
export SA_JWT_TOKEN=$(kubectl get secret $VAULT_SECRET_NAME -o jsonpath="{.data.token}" | base64 --decode; echo)
export SA_CA_CRT=$(kubectl get secret $VAULT_SECRET_NAME -o jsonpath="{.data['ca\.crt']}" | base64 --decode; echo)
export K8S_HOST=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')

Create the Kubernetes auth method at a custom path:

vault auth enable --path="az-use1-k8s-production" kubernetes

Figure out the token issuer (run kubectl proxy in another tab before running the next command):

export TOKEN_ISSUER=$(curl --silent http://127.0.0.1:8001/api/v1/namespaces/default/serviceaccounts/default/token -H "Content-Type: application/json" -X POST -d '{"apiVersion": "authentication.k8s.io/v1", "kind": "TokenRequest"}' | jq -r '.status.token' | cut -d . -f2 | base64 --decode | jq -r '.iss')
vault write auth/az-use1-k8s-production/config \
	issuer="$TOKEN_ISSUER" \
	token_reviewer_jwt="$SA_JWT_TOKEN" \
	kubernetes_host="$K8S_HOST" \
	kubernetes_ca_cert="$SA_CA_CRT"
vault write auth/az-use1-k8s-production/role/vault-secrets-operator \
	bound_service_account_names="vault-secrets-operator" \
	bound_service_account_namespaces="$VAULT_SECRETS_OPERATOR_NAMESPACE" \
	policies=k8s-production \
	ttl=24h
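The issuer-extraction one-liner above can trip on base64url tokens, since JWT segments strip their base64 padding and use a slightly different alphabet. A standalone variant that handles both (a sketch; `jwt_issuer` is a hypothetical helper, not part of the repo):

```shell
# Hypothetical helper: extract the `iss` claim from a JWT payload.
# Converts base64url to standard base64 and restores stripped padding.
jwt_issuer () {
  local payload
  payload=$(printf '%s' "$1" | cut -d . -f 2 | tr '_-' '/+')
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 --decode | jq -r '.iss'
}

# Example with a toy (unsigned) token:
header=$(printf '{"alg":"none"}' | base64 | tr -d '=\n')
claims=$(printf '{"iss":"https://example.test"}' | base64 | tr -d '=\n')
jwt_issuer "$header.$claims.sig"
```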

Now we have to whitelist the connection to our Vault server (on DigitalOcean) from our K8s cluster (on AWS). The Elastic IPs that need to be whitelisted are in the CloudFormation outputs; we saved them to environment variables in the earlier steps.

echo $ELASTIC_IP_1
echo $ELASTIC_IP_2
echo $ELASTIC_IP_3

The best place to save them is in Cloudflare's k8s_nodes list:

(Screenshots: Cloudflare k8s_nodes IP list.)

Next, you will want to kill the vault-secrets-operator pod to force a restart (or you can wait for a restart). Then it is time to test it out:

cat <<EOF | kubectl apply -f -
apiVersion: ricoberger.de/v1alpha1
kind: VaultSecret
metadata:
  name: test
spec:
  keys:
    - ca.crt
  path: ssl/origin-pull.cloudflare.com
  type: Opaque
EOF

Once done, clean up:

kubectl delete vaultsecret test

# Environment For Applications

We are done setting up what we need. We can now create a new namespace for our application to be deployed to:

kubectl create namespace api
kubectl label namespace/api name=api
kubens api
kubectl apply -f kube/network-policy/network-separation.yaml
kubectl create namespace portals
kubectl label namespace/portals name=portals
kubens portals
kubectl apply -f kube/network-policy/network-separation.yaml

Done! We now want to create a service account, generate an access token for it, and store the token in Vault so we can use it in our CI/CD pipeline.

kubens api
kubectl apply -f kube/drone-access/drone-access.yaml
export DRONE_DEPLOY_TOKEN_API=$(kubectl get secret drone-deploy-token -o jsonpath="{.data.token}" | base64 --decode; echo)
vault kv put az-use1-k8s-production.jetrails.com/api/drone-helm/token \
  api="$K8S_HOST" \
  token="$DRONE_DEPLOY_TOKEN_API" \
  x-drone-branches="master" \
  x-drone-events="push" \
  x-drone-repos="jetrails/api"
kubens portals
kubectl apply -f kube/drone-access/drone-access.yaml
export DRONE_DEPLOY_TOKEN_PORTALS=$(kubectl get secret drone-deploy-token -o jsonpath="{.data.token}" | base64 --decode; echo)
vault kv put az-use1-k8s-production.jetrails.com/portals/drone-helm/token \
  api="$K8S_HOST" \
  token="$DRONE_DEPLOY_TOKEN_PORTALS" \
  x-drone-branches="master" \
  x-drone-events="push" \
  x-drone-repos="jetrails/portals"

# Setup RDS Cluster

Finally, let's create an RDS cluster for our application to use:

DATABASE_NAME=`jrctl utility mkpass -S -l 16`
DATABASE_USER=`jrctl utility mkpass -S -l 16`
DATABASE_PASS=`jrctl utility mkpass -S -l 32`
aws cloudformation deploy \
  --profile autopilot \
  --template-file ./stacks/rds-cluster.yaml \
  --stack-name rds-cluster \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides \
    VpcId="$VPC_ID" \
    CidrBlock="$CIDR_BLOCK" \
    PublicSubnetId1="$PUBLIC_SUBNET_ID_1" \
    PublicSubnetId2="$PUBLIC_SUBNET_ID_2" \
    AvailabilityZone="$AVAILABILITY_ZONE" \
    DatabaseName="$DATABASE_NAME" \
    DatabaseUser="$DATABASE_USER" \
    DatabasePass="$DATABASE_PASS"
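The database name, user, and password above are generated with our internal jrctl tooling. If jrctl is unavailable, an equivalent random alphanumeric string can be generated with openssl; a sketch (`gen_pass` is a stand-in, not part of jrctl):

```shell
# Stand-in for `jrctl utility mkpass -S -l N`: prints a random
# alphanumeric string of the requested length.
gen_pass () {
  local s
  s=$(openssl rand -base64 64 | tr -dc 'a-zA-Z0-9')
  printf '%s\n' "${s:0:$1}"
}

DATABASE_NAME=$(gen_pass 16)
DATABASE_USER=$(gen_pass 16)
DATABASE_PASS=$(gen_pass 32)
```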

Extract the database endpoint from the stack outputs:

DATABASE_ENDPOINT=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name rds-cluster \
    --query 'Stacks[0].Outputs[?OutputKey==`DatabaseWriteEndpoint`].OutputValue' \
    --output text
)

Finally, put the credentials into Vault:

vault kv put aws.amazon.com/production/database \
  hostname="$DATABASE_ENDPOINT" \
  database="$DATABASE_NAME" \
  username="$DATABASE_USER" \
  password="$DATABASE_PASS" \
  port="3306" \
  cli-command="mysql -h $DATABASE_ENDPOINT -u $DATABASE_USER -p$DATABASE_PASS $DATABASE_NAME"

# Setup Persistent Storage

For simplicity and cost related reasons, it turns out that using EFS directly is the best option. The AWS EFS CSI driver docs were used to write the CFN template for it.

If you already installed the main cluster stack, then almost everything has already been provisioned for you.

The only thing you will need to do is install the EFS CSI driver and create a storage class for it. Let's start by installing the EFS CSI driver:

aws eks create-addon \
  --region=us-east-1 \
  --profile autopilot \
  --cluster-name $CLUSTER_NAME \
  --addon-name aws-efs-csi-driver \
  --service-account-role-arn $EFS_CSI_DRIVER_ROLE_ARN

Now we can wait for the addon to be active:

aws eks wait addon-active \
  --region=us-east-1 \
  --profile autopilot \
  --cluster-name $CLUSTER_NAME \
  --addon-name aws-efs-csi-driver

Once that is done, we can create a storage class for the EFS CSI driver:

kubectl apply -f - <<EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: $FILE_SYSTEM_ID
  directoryPerms: "700"
EOF

You can now use the storage class in your deployments. Here is an example of how to do that with dynamic provisioning:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 1Gi

---

apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
    - name: app
      image: alpine
      command: ["/bin/sh"]
      args: ["-c", "sleep 500000"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: efs-claim

For persisting data for Redis and RabbitMQ, we will use static provisioning and specify a root path. We can do this by referencing the access points that the stack made for us:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-redis-sessions-data
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: $FILE_SYSTEM_ID::$ACCESS_POINT_REDIS_SESSIONS

---

apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-rabbitmq-data
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: $FILE_SYSTEM_ID::$ACCESS_POINT_RABBITMQ
EOF

Since we manually deployed the PV, we must deploy the PVC with our applications. More information about that is in the helm chart located in the jetrails/api repository.
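For reference, a PVC that binds to one of these static PVs pins it via volumeName; a minimal sketch (the claim name here is illustrative — the real PVC lives in the helm chart in the jetrails/api repository):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-redis-sessions-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  # Bind to the statically provisioned PV created above.
  volumeName: efs-redis-sessions-data
  resources:
    requests:
      storage: 1Gi
```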

# Installing Monitoring Software

Add repos for the charts we want to install:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

Then we should create a new namespace for our monitoring software:

kubectl create namespace kube-monitoring
kubectl label namespace/kube-monitoring name=kube-monitoring
kubens kube-monitoring
kubectl apply -f kube/priority-class

Install the secret that contains the slack webhook endpoint:

kubectl apply -f kube/secret/monitoring-slack-webhook.yaml
kubectl apply -f kube/secret/origin-pull.yaml

Install the datasources configmaps:

kubectl apply -f kube/configmap/datasources.yaml

Install loki-stack helm chart:

SLACK_WEBHOOK=`kubectl -n kube-monitoring get secret monitoring-slack-webhook -o jsonpath='{.data.endpoint}' | base64 -d`
helm upgrade --install loki grafana/loki-stack --values values/loki.yaml --set grafana.notifiers.slack.settings.url="$SLACK_WEBHOOK"

We need to set up a CNAME for grafana.jetrails.com pointing to this cluster. You can edit the hostname in the values file specified above. If you are using Cloudflare, make sure you create a page rule to disable performance features and set the cache level to bypass.

Now you need to update the loki-promtail configmap after the helm chart is installed:

kubectl apply -f kube/configmap/loki-promtail.yaml

Note: you can ignore the warning for now. Whenever you update the configmap, you will need to restart the promtail pod so it picks up the change.

Get admin credentials by running:

kubectl get secret loki-grafana -o jsonpath="{.data.admin-user}" | base64 -d
kubectl get secret loki-grafana -o jsonpath="{.data.admin-password}" | base64 -d

When you log in to the Grafana UI, navigate to Administration -> Service accounts. Create a service account with admin privileges, create a token for it, and copy it somewhere. That token is your Grafana API key.

API_KEY="<REPLACE-ME>"

Now you can run this to configure Grafana alerts:

./scripts/configure_grafana.sh "$API_KEY" "$SLACK_WEBHOOK"

Go to Home > Alerting > Notification policies and edit the default policy. Change default contact point to be "autopilot-logs" and save.

For more info, check out the related GitHub task.

# Upgrading EKS Cluster On AWS

Make sure you have kubectl and eksctl installed.

Before upgrading, use pluto-cli to determine whether any k8s objects are deprecated and whether any charts/templates need to be updated to work with the latest k8s version.

Versions need to be upgraded one minor version at a time, so if you are planning on upgrading from 1.25 to 1.30, you need to upgrade to 1.26 first, then 1.27, 1.28, 1.29, and finally 1.30. You should upgrade the control plane first, then core-dns, kube-proxy, aws-node, and finally the node group.
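Since upgrades go one minor version at a time, the sequence of hops can be scripted; a small sketch (`upgrade_path` is a hypothetical helper, not in the repo, and assumes both versions share a major version):

```shell
# Hypothetical helper: print each minor version between $1 (exclusive)
# and $2 (inclusive), one per line, e.g. 1.25 -> 1.30 prints 1.26..1.30.
upgrade_path () {
  local major=${1%%.*} from=${1##*.} to=${2##*.} v
  for (( v = from + 1; v <= to; v++ )); do
    echo "${major}.${v}"
  done
}

upgrade_path 1.25 1.30
```

Each printed version is one full control plane + addons + node group upgrade cycle.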

# Gather Information

print_k8s_info () {
  k8sVersion=`kubectl version -o json | jq -r '.serverVersion.gitVersion'`
  coreDnsVersion=`kubectl get deployment -n kube-system coredns -o=jsonpath='{$.spec.template.spec.containers[:1].image}'`
  kubeProxyVersion=`kubectl get daemonset -n kube-system kube-proxy -o=jsonpath='{$.spec.template.spec.containers[:1].image}'`
  awsVpcCniVersion=`kubectl get daemonset -n kube-system aws-node -o=jsonpath='{$.spec.template.spec.containers[:1].image}'`
  k8sNodeVersion=`kubectl get nodes -o json | jq -r '.items[].status.nodeInfo.kubeletVersion' | xargs`

  echo "Kubernetes Version:       $k8sVersion"
  echo "CoreDNS Version:          $coreDnsVersion"
  echo "Kube Proxy Version:       $kubeProxyVersion"
  echo "AWS VPC CNI Version:      $awsVpcCniVersion"
  echo "Kubernetes Node Version:  $k8sNodeVersion"
}

print_k8s_info

# Upgrade Commands

The control plane needs to be upgraded via the CFN template.

eksctl utils update-coredns --profile autopilot --cluster ControlPlane-plTeNmD7jV86 --approve
eksctl utils update-kube-proxy --profile autopilot --cluster ControlPlane-plTeNmD7jV86 --approve
eksctl utils update-aws-node --profile autopilot --cluster ControlPlane-plTeNmD7jV86 --approve
eksctl upgrade nodegroup --profile autopilot --name ManagedNodeGroup-zmkenIMnwlzk --cluster ControlPlane-plTeNmD7jV86 --kubernetes-version=1.29

# Upgrade Addons

CLUSTER_NAME=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`ClusterName`].OutputValue' \
    --output text
)
EFS_CSI_DRIVER_ROLE_ARN=$(
  aws cloudformation describe-stacks \
    --profile autopilot \
    --stack-name az-use1-k8s-production \
    --query 'Stacks[0].Outputs[?OutputKey==`EfsCsiDriverRoleArn`].OutputValue' \
    --output text
)
aws eks update-addon \
  --region=us-east-1 \
  --profile autopilot \
  --cluster-name $CLUSTER_NAME \
  --addon-name aws-efs-csi-driver \
  --addon-version v2.1.8-eksbuild.1 \
  --service-account-role-arn $EFS_CSI_DRIVER_ROLE_ARN
aws eks wait addon-active \
  --region=us-east-1 \
  --profile autopilot \
  --cluster-name $CLUSTER_NAME \
  --addon-name aws-efs-csi-driver