Full Disk Prevented Partition Growth
Date: 2026-04-09
What Happened
On 2026-04-09, a cluster node ran out of disk space. Its volumes were grown via the AutoPilot console, but the partitions on them were not resized, so filesystem-level failures continued until the partitions were grown manually.
Errors from /var/log/cfn-init-cmd.log:
touch: cannot touch '/mnt/jrc-comms/vars-cluster.tmp': No space left on device
...
ERROR: mount is missing for "/mnt/comms"
...
ERROR: disk size (200G) and partition size (200G) sizes are the same
Root Cause
When AutoPilot extended the volume, cfn-hup triggered grow-volume directly. However, the script failed because /mnt/jrc-comms was full and it could not write its temporary files.
By the time cfn-hup triggered again and grow-volume could actually execute, the EBS volume had already fully grown. Because the disk was full, metadata about the new size could not be written, causing grow-volume to see the disk and partition as the same size and exit without growing the partition or filesystem.
Resolution
Confirmed the disk had been extended to 200G in AWS but the partitions were still at their original sizes:
# lsblk|grep /mnt
nvme2n1 259:7 0 200G 0 disk
└─nvme2n1p1 259:8 0 150G 0 part /mnt/persistent-data
# df -h|grep /mnt
/dev/nvme2n1p1 150G 150G 0 100% /mnt/persistent-data
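The same disk-versus-partition comparison can be scripted. The following is a minimal sketch, not the logic used by grow-volume: the function names needs_grow and device_needs_grow and the 16 MiB slack are assumptions made for illustration. It reads sizes from lsblk in bytes, so it does not need to write anything to the full filesystem.

```shell
#!/bin/sh
# Hypothetical helpers; not part of /opt/jrc/sbin/grow-volume.

# Allowed gap between disk and partition size (partition-table overhead).
# 16 MiB is an assumed value, not taken from the incident.
SLACK_BYTES=$((16 * 1024 * 1024))

# needs_grow DISK_BYTES PART_BYTES
# Succeeds (exit 0) when the partition is smaller than the disk
# by more than the slack, i.e. there is room to grow into.
needs_grow() {
    [ $(( $1 - $2 )) -gt "$SLACK_BYTES" ]
}

# device_needs_grow DISK PART
# Reads both sizes in bytes from lsblk and compares them.
# Returns 2 if either size cannot be read.
device_needs_grow() {
    disk_bytes=$(lsblk -bdno SIZE "$1") || return 2
    part_bytes=$(lsblk -bdno SIZE "$2") || return 2
    needs_grow "$disk_bytes" "$part_bytes"
}
```

For the device in this incident, the call would be: device_needs_grow /dev/nvme2n1 /dev/nvme2n1p1 && echo "partition can be grown"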
Option 1 (recommended) - Use the built-in script for whichever partition needs to grow:
/opt/jrc/sbin/grow-volume /mnt/persistent-data
Option 2 - Manually grow the partition and filesystem:
growpart /dev/nvme2n1 1
xfs_growfs /mnt/persistent-data
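Option 2 as written assumes an XFS filesystem, which is what this node used. On a node where the filesystem type is not known in advance, the grow command could be chosen from the type reported by findmnt. This is a sketch only: grow_fs_command is a hypothetical helper, and the ext2/ext3/ext4 branch is an assumption that was not exercised in this incident.

```shell
#!/bin/sh
# Hypothetical helper; not part of /opt/jrc/sbin/grow-volume.

# grow_fs_command FSTYPE MOUNTPOINT DEVICE
# Prints the command that would grow the filesystem.
# Prints nothing and returns 1 for filesystem types it does not know.
grow_fs_command() {
    case "$1" in
        xfs)            printf 'xfs_growfs %s\n' "$2" ;;  # XFS is grown via its mount point
        ext2|ext3|ext4) printf 'resize2fs %s\n' "$3" ;;   # ext* is grown via the block device
        *)              return 1 ;;
    esac
}
```

Example: grow_fs_command "$(findmnt -no FSTYPE /mnt/persistent-data)" /mnt/persistent-data /dev/nvme2n1p1 prints the command to review before running it.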
Once the partition and filesystem have been grown, run the boot hooks to restore any services or mounts that failed due to the full partition:
/opt/jrc/sbin/run-hooks boot
Impact
Cluster operations were disrupted. Writes to /mnt/jrc-comms failed and the /mnt/comms mount was unavailable until the partitions were manually resized.
Existing Guardrails
EC2 disk utilization alarms are in place for managed customers.
Possible Fixes (dev team)
- Investigate how grow-volume should handle the case where cfn-init fails due to a full partition before it can execute
- Verify that cfn-hup retries result in grow-volume catching the size difference before the extension fully propagates
- Investigate the ability to auto-grow partitions and filesystems when a disk utilization alarm threshold is hit
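One possible direction for the full-partition case, sketched under assumptions: before writing temporary files, probe the preferred directory and fall back to another location (for example a tmpfs such as /run) if the probe fails. pick_tmpdir is a hypothetical helper and does not reflect how grow-volume or cfn-init currently choose their temporary paths.

```shell
#!/bin/sh
# Hypothetical helper; not part of /opt/jrc/sbin/grow-volume.

# pick_tmpdir PREFERRED FALLBACK
# Prints PREFERRED if a small probe file can be created and written there,
# otherwise prints FALLBACK. Best-effort: it does not guarantee that
# larger writes to PREFERRED will succeed afterwards.
pick_tmpdir() {
    probe=$(mktemp "$1/.probe.XXXXXX" 2>/dev/null) \
        && printf 'x' > "$probe" 2>/dev/null \
        && { rm -f -- "$probe"; printf '%s\n' "$1"; return 0; }
    # Clean up a probe file that was created but could not be written.
    [ -n "$probe" ] && rm -f -- "$probe"
    printf '%s\n' "$2"
}
```

Example: tmpdir=$(pick_tmpdir /mnt/jrc-comms /run) would select /run while /mnt/jrc-comms is full, so the script can still record state and proceed to grow the partition.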