#
QA Run Book
This runbook describes how to test an AutoPilot deployment. Before promoting a build to production, it is vital to test it in our staging environment, which you can access at autopilot-staging.jetrails.com. Each deployment variant has a checklist that must be completed before the build can be considered ready for production. The order of the checklist matters and should be followed as closely as possible.
Note
When working in parallel with other people, it is important to work within your own organization because AWS imposes resource limits. For example, there is a default limit of 5 Elastic IPs per AWS account. An All-In-One deployment uses one Elastic IP, while a Cluster deployment uses two.
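The arithmetic behind this limit can be sketched in a few lines of shell. It only restates the numbers from the note above; max_deployments is a hypothetical helper for illustration, not part of AutoPilot or the AWS CLI.

```shell
# max_deployments LIMIT IPS_PER_DEPLOYMENT
# Prints how many deployments of one variant fit under an Elastic IP limit.
# Hypothetical helper for illustration only.
max_deployments() {
  echo $(( $1 / $2 ))
}

max_deployments 5 1   # All-In-One: one Elastic IP each
max_deployments 5 2   # Cluster: two Elastic IPs each
```

With the default limit, a shared account runs out of Elastic IPs after five All-In-One deployments or two Cluster deployments.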
#
Deployment Variants
#
Deployment Checklist
- Create Deployment
- Whitelist IP Address & SSH Key
- Visual Inspection
- Interact With Admin Backend
- Shell Inspection
- Working With Web Files
- Backup & Restore
- Actions & Adjustments
- Hibernation
- Auto Scaling
- Monitoring
- Clone
- Cleanup
#
Create Deployment
Go to the AutoPilot Staging website and create an account if you do not have one already. Next, make sure you have your own AutoPilot organization where you will be creating deployments. It is vital that you have your own account, since AWS imposes resource limits on a per-account basis. While inside your organization, click the "New Deployment" button next to your list of deployments.
Info
For organizations that do not have a payment method on file, you will be asked to provide one before deploying.
On our staging environment, we make sure that the checksum of the credit card is valid, but we do not actually validate the information.
For this reason, you can put in any information you like. Some sample credit card numbers that pass the checksum test are 4242 4242 4242 4242 for Visa or 5454 5454 5454 5454 for Mastercard.
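If you want to use a different test number, you can check it locally first. The sketch below assumes the checksum in question is the standard Luhn check used by card numbers; the staging site's actual validation code is not shown in this runbook, and luhn_valid is a hypothetical helper name.

```shell
# luhn_valid NUMBER
# Returns 0 if NUMBER passes the Luhn checksum, 1 otherwise.
# Assumption: the staging checksum is the standard Luhn check.
luhn_valid() {
  number=$(printf '%s' "$1" | tr -d ' ')   # drop the spaces used for grouping
  case "$number" in
    ''|*[!0-9]*) return 1 ;;               # reject empty or non-numeric input
  esac
  sum=0
  position=0                               # 0 is the rightmost digit
  while [ -n "$number" ]; do
    digit=${number#"${number%?}"}          # take the last character
    number=${number%?}                     # then drop it
    if [ $((position % 2)) -eq 1 ]; then   # double every second digit from the right
      digit=$((digit * 2))
      if [ "$digit" -gt 9 ]; then
        digit=$((digit - 9))
      fi
    fi
    sum=$((sum + digit))
    position=$((position + 1))
  done
  [ $((sum % 10)) -eq 0 ]
}

# Example:
# luhn_valid "4242 4242 4242 4242" && echo "checksum ok"
```

Both sample numbers above pass this check, while changing any single digit makes it fail.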
Choose the application and deployment type that you would like to test, and make sure you are using the "STABLE" channel.
Fill in any deployment name or press the button at the right of the field to generate a random name. Choose the preset that you would like to use, for example, "Development" or "Medium".
Fill in the "Domain Name" field with a subdomain of a domain that you control. If you are testing a cluster deployment, choose whether you would like to deploy media to an S3 bucket.
Finally, scroll all the way down and press the "Launch Deployment" button. Wait until the deployment is fully provisioned; this includes both infrastructure and application provisioning. Once done, your page should look like this:
#
Whitelist IP Address & SSH Key
Go to the "Security" tab. Once there, you can whitelist your IP address and SSH key by filling out the two forms.
#
Visual Inspection
First visit the Private Preview Domain and confirm that the application homepage is displayed.
Next, visit the Public Load Balancer and confirm that the maintenance page is displayed.
Info
The default SSL certificate used by the load balancer is a wildcard certificate for our preview domain. For this reason, you will need to accept the SSL warning that your browser displays. In Chromium-based browsers, you can do this by typing "thisisunsafe" while focused on the warning page. Alternatively, you can use the GUI that your browser displays to proceed after acknowledging the warning.
You should have access to the code editor which will allow you to view the source code of your deployment as well as launch a shell directly from your browser. Please play around with it and report any issues or provide general feedback about this feature.
#
Interact With Admin Backend
Using the values from the "Sensitive Info" card on the "Overview" tab, login to the backend of your application. Depending on the application, you might be presented with setup steps that you need to complete.
If you are prompted to update software, do so via the admin backend and report any issues that you encounter (applicable to WordPress & Shopware).
For applications that have a plugin system, enable and disable plugins via the admin backend. This is to ensure that the php-fpm process has the proper permissions to write to the filesystem. If you encounter any issues, report them.
Info
You cannot interact with extensions in Magento, so you can skip the above part.
Next, try to upload an image using the admin backend and confirm that the upload functionality works. Each application has a different way of doing this, so please refer to the application's documentation.
#
Shell Inspection
SSH into the server using the connection string from the "Shell Access" card on the "Overview" tab.
Once SSH'd into the jump host, you can run cluster list to see the status of all the instances in your deployment (All-In-One will only have one instance).
Make sure that all the instances that you expect to be in your cluster are there and are marked "HEALTHY".
Next, you can run cluster logs -f to see if there were any errors during the provisioning process. This will tail the logs of the boot & provisioning process of all the instances in your deployment.
If you are testing a cluster, SSH into the leader web node by running ssh web and make sure that you can access that instance without any issues.
While on the jump node for All-In-One deployments or the leader web node for Cluster deployments, try to install a custom program with apt. For example:
sudo apt -y install neovim
This will become important later on in the checklist.
The final step in this section is to verify the functionality of some CLI tools. If applicable to your deployment, ensure the following tools are working correctly:
- rabbitmqctl
- redis-session
- redis-cache
- varnish-compile
- varnishadm
- varnishlog
- varnishncsa
- varnishstat
Please try to run commands that you would normally use to interact with these services and report any issues that you encounter. For example, for varnishlog you might want to run varnishlog -q 'ReqURL eq "/"' and verify that you see the expected output.
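Before exercising each tool by hand, a quick presence check can save time. The following is a minimal sketch; missing_tools is a hypothetical helper, and it only confirms that each name resolves to a command in the current shell, not that the underlying service is healthy.

```shell
# missing_tools TOOL...
# Prints each tool that cannot be found and returns 1 if any are missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example (trim the list to the services your deployment actually runs):
# missing_tools rabbitmqctl redis-session redis-cache varnish-compile \
#   varnishadm varnishlog varnishncsa varnishstat
```

A tool reported as missing on a deployment that should have it is worth reporting, even if the service itself appears to be running.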
#
Working With Web Files
While ssh'd into the jump host node, go to your web root directory:
cd /var/www/[DOMAIN_NAME]/live
The way you interact with web files differs depending on the application you are testing.
Magento and WordPress require you to remain as the cluster user (uid 9001), while Shopware requires you to be the www-data user.
If you are interacting with any files in /var/www/[DOMAIN_NAME], make sure you are the appropriate user.
If you are testing Shopware, you can become the www-data user by running:
sudo -u www-data bash -l
This is also aliased to www.
If you are testing Magento or Shopware, you can test compiling the theme by running the command for your application while in the live directory:
# Magento
php bin/magento setup:static-content:deploy -f
# Shopware
php bin/console theme:compile
All applications install the cron under the cluster user (uid 9001). Please check that the cron is installed by running the following as the cluster user:
crontab -l
The final step of this section is to confirm that none of these operations have caused the site to go down. You can do this by visiting the configured domain and confirming that the site is still working.
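Since this "is the site still up" check recurs throughout the checklist, it can also be done from a terminal. The sketch below is one way to do it with curl; site_is_up is a hypothetical helper name, and the 10-second timeout is an arbitrary choice. It does not replace looking at the page in a browser, since a page can return a successful status and still render incorrectly.

```shell
# site_is_up URL
# Returns 0 if the URL can be fetched (following redirects) without a
# 4xx or 5xx response, non-zero otherwise.
site_is_up() {
  curl --fail --silent --show-error --location \
       --max-time 10 --output /dev/null "$1"
}

# Example (replace the placeholder with your configured domain):
# site_is_up "https://[DOMAIN_NAME]/" && echo "site is up"
```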
#
Backup & Restore
We want to create a backup of the "Operating System" as well as the "Web Files & Media". In All-In-One deployments everything is backed up together, while clusters have separate backups for the OS and the web files. If you are testing a cluster, please request one of each and name them whatever you like.
If you are testing an All-In-One deployment, you can create a single backup of everything together.
You can press the refresh button to monitor the progress of your backup. Once the backup is complete and available, please move on to the next step.
Once the backup is complete, go to the "Adjust" tab and change the "Operating System Image" from the "Default Image" to the backup you just created. Press update and wait. For cluster deployments, this can be on any instance you want. In theory, we should be able to swap out the operating system of any instance in our deployment and it should be able to automatically heal itself.
Once the OS image has been changed, ensure that the site is still working by visiting the configured domain. It might take a couple of minutes to fully come back up (especially if the jump host was swapped out, since it also acts as the storage node).
If you swapped out the jump host OS, you might get a warning that the host key has changed. You might see something like this:
% ssh jrc-ldwb-1z58@3.226.166.49
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ED25519 key sent by the remote host is
SHA256:NQ3JYU8/lUV2ipne7nWvTdGnp3bynWFxp6WgMYZz+PY.
Please contact your system administrator.
Add correct host key in /Users/raffi/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/raffi/.ssh/known_hosts:1727
Host key for 3.226.166.49 has changed and you have requested strict checking.
Host key verification failed.
You can solve this by removing the offending key from your known hosts file:
ssh-keygen -R 3.226.166.49
SSH into the instance whose OS image you swapped and confirm that the custom program you installed earlier is still there. For example, if you installed neovim, you can run:
nvim --version
Info
Remember: Backups of the OS are taken of the web leader node only
Next, we want to mount a backup to the recovery mount. Visit the "Adjust" tab and choose the backup you created earlier. Press update and wait.
You can monitor the progress of the recovery mount by running the following command on the jump host:
watch 'tail /var/log/cfn-hup.log && echo && ls -la /mnt && echo && lsblk'
Once the recovery mount is fully mounted, you can ls -la /mnt/recovery-ro-data and confirm your data exists there.
Next via the "Adjust" tab, unmount the recovery mount and monitor the progress by running the same command as above.
Next, we want to grow the volume size of the persistent storage. Via the "Adjust" tab, grow the volume size of the persistent storage and monitor the progress by running the same command as above.
To confirm that the volume size has been increased, just look at the output of the lsblk command.
Both the volume and the filesystem on it should now reflect the size you specified.
#
Actions & Adjustments
Go to the "Adjust" tab and change the instance type of any instance. This will cause some downtime since the instance will be stopped and started.
Once complete, confirm that the site is still working by visiting the configured domain.
Next, go to the "Actions" tab. We need to test actions and make sure they work as expected.
Choose a service and restart it by pressing the "Restart" button.
Confirm that the service restarted by inspecting it with journalctl.
For example, if you restart the nginx service, you can run:
sudo journalctl -fu nginx
Info
If you are on a cluster, make sure you are looking on the node where that service is running. For example, the opensearch service runs on the opensearch instance.
Next, test the "Privilege Escalation" action. First choose to disable it and then enable it. Validate that the action was successful by running the following command in between both actions:
# Run this as the cluster user (uid 9001)
sudo whoami
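A non-interactive variant of this check can be handy when repeating it after each toggle. The sketch assumes that enabling "Privilege Escalation" gives the cluster user passwordless sudo, which this runbook does not state explicitly; if a password is required, sudo -n reports failure even when escalation is enabled. can_escalate is a hypothetical helper name.

```shell
# can_escalate
# Returns 0 if sudo works without prompting, non-zero otherwise.
# -n makes sudo fail instead of asking for a password.
can_escalate() {
  sudo -n true >/dev/null 2>&1
}

# Example (run as the cluster user, uid 9001):
# if can_escalate; then echo "escalation enabled"; else echo "escalation disabled"; fi
```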
#
Hibernation
Go to the "Actions" tab and press the "Hibernate" button on the "Hibernate" card.
Wait for the deployment to go into hibernation. You should see all the instances go into the "stopped" state.
Once the deployment is hibernated, you can wake it up by pressing the "Wake Up" button in the same location. Monitor that all the instances come back up and are in the running state.
Once the deployment is back up, you can visit the configured domain and confirm that the site is still working. If you initially do not see the site, wait a couple minutes and try again.
#
Auto Scaling
Info
Skip this step if you are testing an All-In-One deployment.
You might notice that new deployments no longer have the autoscaling tab available. Old deployments will still have it available, but new deployments will not. You will need to modify the desired capacity of the web tier through the "Adjust" tab instead.
Go to the "Adjust" tab and click on "AutoScaling Web Configuration". Change the "Min Nodes" and "Max Nodes" values to 2 and press the "Update" button.
You can wait for the nodes to register themselves by running watch cluster list on the jump host.
Make sure that the new node is marked as "HEALTHY" and that the site is still working.
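If you prefer a number over eyeballing the list, the healthy instances can be counted. This is a sketch under a loud assumption: it presumes cluster list prints one line per instance and that healthy instances carry the word HEALTHY on their line. The real output format is not documented in this runbook, so adjust the pattern to what you actually see. count_healthy is a hypothetical helper name.

```shell
# count_healthy
# Reads instance status lines on stdin and prints how many contain the
# whole word HEALTHY. -w keeps a word such as UNHEALTHY from matching.
# Assumption: one instance per line, status printed as a separate word.
count_healthy() {
  grep -cw 'HEALTHY' || true   # grep exits 1 when the count is 0
}

# Example on the jump host:
# cluster list | count_healthy
```

After setting "Min Nodes" and "Max Nodes" to 2, the count should go up by one compared to the count before the change.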
#
Monitoring
Once your deployment has had some time to run, visit the "Monitoring" tab to ensure that all the metrics are being collected. Look for any anomalies in the metrics and report them. If there are gaps in the metrics, it is likely that there is a sizing issue.
#
Clone
If your test specifies to clone the deployment, then you can do so by going to the "Settings" tab and interacting with the "Clone" interface. No instructions on how to clone are provided here because it will be interesting to see how you would figure it out on your own. Please leave feedback in notes.
#
Cleanup
You are all done! Thank you for your help! You can now delete your deployment by going to the "Settings" tab. Type your deployment name in the field and press the "Delete" button.