# QA Run Book

This runbook describes how to test certain aspects of an AutoPilot deployment. Before promoting a build to production, it is vital to test it in our staging environment, which you can access at autopilot-staging.jetrails.com. Each deployment variant has a checklist that must be completed before the build can be considered ready for production. The order of the checklist is important and should be followed as closely as possible.

# Deployment Variants

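| # | Application | Template Type | Preset | Database Strategy | S3 Bucket (Media) | Clone Step |
|---|-------------|---------------|--------|-------------------|-------------------|------------|
| 1 | Wordpress | All-In-One | Micro | - | - | No |
| 2 | Magento | All-In-One | Small | - | - | Yes |
| 3 | Shopware | All-In-One | Medium | - | - | No |
| 4 | LEMP | All-In-One | Large | - | - | No |
| 5 | Magento | Cluster | Medium | RDS | No | No |
| 6 | Magento | Cluster | Large | EC2 | No | Yes |
| 7 | Magento | Cluster | X-Large | EC2 | Yes | No |
| 8 | Shopware | Cluster | Large | EC2 | No | No |
| 9 | Shopware | Cluster | Medium | EC2 | No | No |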

# Deployment Checklist

  1. Create Deployment
  2. Whitelist IP Address & SSH Key
  3. Visual Inspection
  4. Interact With Admin Backend
  5. Shell Inspection
  6. Working With Web Files
  7. Backup & Restore
  8. Actions & Adjustments
  9. Hibernation
  10. Auto Scaling
  11. Monitoring
  12. Clone
  13. Cleanup

# Create Deployment

Go to the AutoPilot Staging website and create an account if you do not already have one. Next, make sure you have your own AutoPilot organization in which to create deployments. Having your own account is vital because AWS imposes resource limits on a per-account basis. While inside your organization, click the "New Deployment" button next to your list of deployments.

new-deployment-button

Choose the application and deployment type that you would like to test, and make sure you are using the "STABLE" channel.

choose-template

Fill in any deployment name or press the button at the right of the field to generate a random name. Choose the preset that you would like to use, for example, "Development" or "Medium".

choose-deployment-name-and-preset

Fill in the "Domain Name" field with a subdomain of a domain that you control. If you are testing a cluster deployment, choose whether or not to deploy media to an S3 bucket.

domain-name-and-provisioning-option

Finally, scroll all the way down and press the "Launch Deployment" button. Wait until the deployment is fully provisioned; this includes both infrastructure and application provisioning. Once done, your page should look like this:

deployment-provisioned

# Whitelist IP Address & SSH Key

Go to the "Security" tab. Once there, you can whitelist your IP address and SSH key by filling out the two forms.

whitelist-ip-and-key

# Visual Inspection

First visit the Private Preview Domain and confirm that the application homepage is displayed.

Next, visit the Public Load Balancer and confirm that the maintenance page is displayed.

maintenance-page

You should have access to the code editor which will allow you to view the source code of your deployment as well as launch a shell directly from your browser. Please play around with it and report any issues or provide general feedback about this feature.

# Interact With Admin Backend

Using the values from the "Sensitive Info" card on the "Overview" tab, login to the backend of your application. Depending on the application, you might be presented with setup steps that you need to complete.

If you are prompted to update software, do so via the admin backend, reporting any issues that you encounter (applicable for Wordpress & Shopware).

For applications that have a plugin system, enable and disable plugins via the admin backend. This is to ensure that the php-fpm process has the proper permissions to write to the filesystem. If you encounter any issues, report them.

Next, try to upload an image using the admin backend and confirm that the upload functionality works. Each application has a different way of doing this, so please refer to the application's documentation.

# Shell Inspection

SSH into the server using the connection string from the "Shell Access" card on the "Overview" tab. Once SSH'd into the jump host, you can run cluster list to see the status of all the instances in your deployment (All-In-One will only have one instance). Make sure that all the instances that you expect to be in your cluster are there and are marked "HEALTHY".
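The health check above can be scripted. The following is a minimal sketch, not part of the AutoPilot tooling. It assumes that cluster list prints one instance per line with the status word somewhere on that line; the real output format is not documented here, so adjust the pattern if it differs. The function names are hypothetical.

```shell
#!/bin/sh
# Count instances reported as HEALTHY in `cluster list` output read from stdin.
# Assumption: one instance per line, with the status word on that line.
count_healthy() {
    # -w matches HEALTHY only as a whole word, so UNHEALTHY is not counted.
    # grep exits non-zero when nothing matches; `|| true` hides that exit code.
    grep -cw 'HEALTHY' || true
}

# Succeed only when the number of healthy instances equals the expected count.
all_healthy() {
    expected="$1"
    actual="$(count_healthy)"
    [ "$actual" -eq "$expected" ]
}
```

On the jump host this could be used as `cluster list | all_healthy 3 && echo "all instances healthy"`, where 3 is the number of instances you expect.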

Next, you can run cluster logs -f to see if there were any errors during the provisioning process. This will tail the logs of the boot & provisioning process of all the instances in your deployment.

If you are testing a cluster, try to ssh into the leader web node by running ssh web and make sure that you can access that instance without any issues.

While on the jump node for All-In-One deployments or the leader web node for Cluster deployments, try to install a custom program with apt. For example:

sudo apt -y install neovim

This will become important later on in the checklist.

The final step in this section is to verify the functionality of some CLI tools. If applicable to your deployment, ensure the following tools are working correctly:

  • rabbitmqctl
  • redis-session
  • redis-cache
  • varnish-compile
  • varnishadm
  • varnishlog
  • varnishncsa
  • varnishstat

Please try to run commands that you would normally use to interact with these services and report any issues that you encounter. For example, for varnishlog you might want to run varnishlog -q 'ReqURL eq "/"' and verify that you see the expected output.
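As a first pass before exercising each tool by hand, a short loop can report which of the listed tools are present on the PATH. This is only a sketch: report_tools is a hypothetical helper, and a "missing" result is not necessarily a failure, since not every tool applies to every deployment.

```shell
#!/bin/sh
# Print "found" or "missing" for each tool name passed as an argument.
report_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found   $tool"
        else
            echo "missing $tool"
        fi
    done
}

# Tool names taken from the list above.
report_tools rabbitmqctl redis-session redis-cache varnish-compile \
    varnishadm varnishlog varnishncsa varnishstat
```

Presence on the PATH only shows that a tool is installed; you still need to run it against the live service to confirm it works.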

# Working With Web Files

While ssh'd into the jump host node, go to your web root directory:

cd /var/www/[DOMAIN_NAME]/live

The way you interact with web files differs depending on the application you are testing. Magento and Wordpress require you to remain as the cluster user (uid 9001) while Shopware requires you to be the www-data user.

Whenever you interact with files under /var/www/[DOMAIN_NAME], make sure you are the appropriate user.

If you are testing Shopware, you can become the www-data user by running:

sudo -u www-data bash -l

This is also aliased to www.

If you are testing Magento or Shopware, you can test compiling the theme while in the live directory. Run the first command below for Magento, or the second for Shopware:

php bin/magento setup:static-content:deploy -f
php bin/console theme:compile

All applications install the cron under the cluster user (uid 9001). Please check that the cron is installed by running as the cluster user:

crontab -l
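To make the check less dependent on eyeballing the output, the listing can be piped through a small filter. This is a sketch with a hypothetical helper name: it counts lines that are neither blank nor comments, which also includes any environment variable assignments in the crontab.

```shell
#!/bin/sh
# Count active crontab lines read from stdin, ignoring blank lines and comments.
count_cron_lines() {
    # -v inverts the match, -c counts; `|| true` hides grep's exit code on zero matches.
    grep -cv -e '^[[:space:]]*#' -e '^[[:space:]]*$' || true
}
```

Run as the cluster user: `crontab -l 2>/dev/null | count_cron_lines`. A result of 0 suggests the cron was not installed.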

The final step of this section is to confirm that none of these operations have caused the site to go down. You can do this by visiting the configured domain and confirming that the site is still working.
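A browser visit is the real test, but a quick status check from a terminal helps when repeating this step throughout the checklist. The sketch below uses curl; the helper names and the example URL are hypothetical. Treating 2xx and 3xx responses as "up" is an assumption, and a page can return 200 while still being visibly broken.

```shell
#!/bin/sh
# Fetch only the HTTP status code for a URL (curl prints 000 if the request fails).
http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 15 "$1"
}

# Treat 2xx and 3xx responses as "up"; anything else as "down".
is_up_status() {
    case "$1" in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}
```

For example: `is_up_status "$(http_status "https://shop.example.com/")" && echo "site is up"`, substituting your configured domain.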

# Backup & Restore

We want to create a backup of the "Operating System" as well as the "Web Files & Media". In All-In-One deployments everything is backed up together, while clusters have separate backups for the OS and the web files. If you are testing a cluster, please request one of each and name them whatever you like.

backup-cluster

If you are testing an All-In-One deployment, you can create a single backup of everything together.

backup-aio

You can press the refresh button to monitor the progress of your backup. Once the backup is complete and available, please move on to the next step.

wait-for-backup

Once the backup is complete, go to the "Adjust" tab and change the "Operating System Image" from the "Default Image" to the backup you just created. Press update and wait. For cluster deployments, this can be on any instance you want. In theory, we should be able to swap out the operating system of any instance in our deployment and it should be able to automatically heal itself.

choose-different-os-image

Once the OS image has been changed, ensure that the site is still working by visiting the configured domain. It might take a couple of minutes to fully come back up (especially if the jump host was swapped out, since it also acts as the storage node).

If you swapped out the jump host OS, you might get a warning that the host key has changed. You might see something like this:

% ssh jrc-ldwb-1z58@3.226.166.49
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:NQ3JYU8/lUV2ipne7nWvTdGnp3bynWFxp6WgMYZz+PY.
Please contact your system administrator.
Add correct host key in /Users/raffi/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/raffi/.ssh/known_hosts:1727
Host key for 3.226.166.49 has changed and you have requested strict checking.
Host key verification failed.

You can solve this by removing the offending key from your known hosts file:

ssh-keygen -R 3.226.166.49

SSH into the instance whose OS image you swapped and confirm that the custom program you installed earlier is still there. For example, if you installed neovim, you can run:

nvim --version

Next, we want to mount a backup to the recovery mount. Visit the "Adjust" tab and choose the backup you created earlier. Press update and wait.

choose-recovery-mount

You can monitor the progress of the recovery mount by running the following command on the jump host:

watch 'tail /var/log/cfn-hup.log && echo && ls -la /mnt && echo && lsblk'

Once the recovery mount is fully mounted, you can ls -la /mnt/recovery-ro-data and confirm your data exists there.

Next, via the "Adjust" tab, unmount the recovery mount and monitor the progress by running the same command as above.

remove-recovery-mount

Next, we want to grow the volume size of the persistent storage. Via the "Adjust" tab, grow the volume size of the persistent storage and monitor the progress by running the same command as above.

grow-volume

To confirm that the volume size has been increased, look at the output of the lsblk command. Both the volume and the filesystem on it should have grown to the size you specified.
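lsblk reports block device sizes, while df reports the size of the mounted filesystem, so it is worth looking at both. The sketch below compares a byte count from lsblk against the size you requested. The helper names and the device path in the usage note are hypothetical.

```shell
#!/bin/sh
# Convert a byte count (as printed by `lsblk -b`) to whole GiB.
bytes_to_gib() {
    echo $(( $1 / 1073741824 ))
}

# Succeed when a size in bytes is at least the requested number of GiB.
size_at_least_gib() {
    [ "$(bytes_to_gib "$1")" -ge "$2" ]
}
```

For example, after growing the volume to 100 GiB: `size_at_least_gib "$(lsblk -b -d -n -o SIZE /dev/nvme1n1)" 100 && echo "volume grown"`. Replace /dev/nvme1n1 with the device shown by lsblk for your persistent storage, and run df -h on its mount point to confirm the filesystem grew as well.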

# Actions & Adjustments

Go to the "Adjust" tab and change the instance type of any instance. This will cause some downtime since the instance will be stopped and started.

change-instance-type

Once complete, confirm that the site is still working by visiting the configured domain.

Next, go to the "Actions" tab. We need to test actions and make sure they work as expected.

test-actions

Choose a service and restart it by pressing the "Restart" button. Confirm that the service restarted by inspecting it with journalctl. For example, if you restart the nginx service, you can run:

sudo journalctl -fu nginx

Next, test the "Privilege Escalation" action. First disable it, then enable it. Validate each action by running the following command after it completes. With privilege escalation disabled the command is expected to be denied, and with it enabled it should print root:

# Run this as the cluster user (uid 9001)
sudo whoami

# Hibernation

Go to the "Actions" tab and press the "Hibernate" button on the "Hibernate" card.

hibernate-deployment

Wait for the deployment to go into hibernation. You should see all the instances go into the "stopped" state.

Once the deployment is hibernated, you can wake it up by pressing the "Wake Up" button in the same location. Monitor that all the instances come back up and are in the running state.

Once the deployment is back up, you can visit the configured domain and confirm that the site is still working. If you initially do not see the site, wait a couple minutes and try again.
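The "wait and try again" step can be automated with a small retry loop. This is a generic sketch rather than anything AutoPilot provides: retry is a hypothetical helper that reruns a command until it succeeds or the attempts run out. It uses plain global variables to stay POSIX-compatible, so avoid reusing the names attempts, delay, and n in the command being retried.

```shell
#!/bin/sh
# Run a command until it succeeds.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the final one.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}
```

For example, `retry 12 15 curl -fsS -o /dev/null "https://shop.example.com/"` polls the site every 15 seconds for up to 12 attempts, where the URL stands in for your configured domain. The -f flag makes curl fail on HTTP error responses.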

# Auto Scaling

You might notice that new deployments no longer have the autoscaling tab available. Old deployments will still have it available, but new deployments will not. You will need to modify the desired capacity of the web tier through the "Adjust" tab instead.

Go to the "Adjust" tab and click on "AutoScaling Web Configuration". Change the "Min Nodes" and "Max Nodes" values to 2 and press the "Update" button.

auto-scaling-values

You can wait for the nodes to register themselves by running watch cluster list on the jump host. Make sure that the new node is marked as "HEALTHY" and that the site is still working.

# Monitoring

Once your deployment has had some time to run, visit the "Monitoring" tab to ensure that all the metrics are being collected. Look for any anomalies in the metrics and report them. If there are gaps in the metrics, it is likely that there is a sizing issue.

monitoring-tab

# Clone

If your deployment variant includes the clone step (see the "Clone Step" column under Deployment Variants), clone the deployment by going to the "Settings" tab and interacting with the "Clone" interface. No instructions on how to clone are provided here because it will be interesting to see how you figure it out on your own. Please leave feedback in your notes.

# Cleanup

You are all done! Thank you for your help! You can now delete your deployment by going to the "Settings" tab. Type your deployment name in the field and press the "Delete" button.

delete-deployment