Full Disk Prevented Partition Growth

Date: 2026-04-09

What Happened

On 2026-04-09, volumes on a cluster node were grown via the AutoPilot console, but the partitions and their filesystems were not resized. The node had already run out of disk space, so filesystem-level failures continued until the partitions were grown manually.

Errors from /var/log/cfn-init-cmd.log:

touch: cannot touch '/mnt/jrc-comms/vars-cluster.tmp': No space left on device
...
ERROR: mount is missing for "/mnt/comms"
...
ERROR: disk size (200G) and partition size (200G) sizes are the same

Root Cause

When AutoPilot extended the volume, cfn-hup triggered grow-volume directly. However, the script failed because /mnt/jrc-comms was full and it could not write its temporary files.

By the time cfn-hup triggered again and grow-volume could actually execute, the EBS volume had already fully grown. Because the disk was full, metadata about the new size could not be written, causing grow-volume to see the disk and partition as the same size and exit without growing the partition or filesystem.
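The comparison that went wrong here can be made without writing anything to the full mount. The following is a minimal sketch, not the actual grow-volume logic: the function name needs_grow and the 1 MiB slack value are assumptions made for illustration, and the sizes are expected in bytes as reported by lsblk -b.

```shell
#!/bin/sh
# Hypothetical helper, not part of grow-volume: decide whether a partition
# should be grown by comparing byte sizes passed in as arguments.
# SLACK_BYTES allows for partition-table overhead (assumed value: 1 MiB).
SLACK_BYTES=1048576

needs_grow() {
    disk_bytes=$1
    part_bytes=$2
    # Succeeds (exit 0) only when the disk is larger than partition + slack.
    [ "$disk_bytes" -gt $(( $part_bytes + $SLACK_BYTES )) ]
}

# Example wiring, using the device names from this incident (run as root):
#   disk=$(lsblk -bdno SIZE /dev/nvme2n1 | tr -d ' ')
#   part=$(lsblk -bdno SIZE /dev/nvme2n1p1 | tr -d ' ')
#   needs_grow "$disk" "$part" && echo "partition can be grown"
```

Because both sizes are read from the kernel at the time of the check, a retry that arrives after the EBS extension has finished would still see a 200G disk and a 150G partition.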

Resolution

Confirmed the disk had been extended to 200G in AWS but the partitions were still at their original sizes:

# lsblk | grep nvme2n1
nvme2n1      259:7    0  200G  0 disk
└─nvme2n1p1  259:8    0  150G  0 part /mnt/persistent-data

# df -h|grep /mnt
/dev/nvme2n1p1   150G  150G     0  100% /mnt/persistent-data

Option 1 (recommended) - Use the built-in script for whichever partition needs to grow:

/opt/jrc/sbin/grow-volume /mnt/persistent-data

Option 2 - Manually grow the partition, then the filesystem (XFS in this case):

growpart /dev/nvme2n1 1
xfs_growfs /mnt/persistent-data
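The second command in Option 2 depends on the filesystem type. As a hedged sketch for nodes where the type is not known in advance, the helper below prints the matching grow command; grow_fs_command is a hypothetical name, and only XFS and ext2/3/4 are handled.

```shell
#!/bin/sh
# Hypothetical helper: print the filesystem-grow command for a mount,
# given its filesystem type. It only prints the command; it does not run it.
grow_fs_command() {
    fstype=$1
    mountpoint=$2
    partition=$3
    case "$fstype" in
        xfs)
            # xfs_growfs operates on the mounted filesystem.
            echo "xfs_growfs $mountpoint" ;;
        ext2|ext3|ext4)
            # resize2fs takes the partition device rather than the mount point.
            echo "resize2fs $partition" ;;
        *)
            echo "unsupported filesystem type: $fstype" >&2
            return 1 ;;
    esac
}

# Example wiring, using the paths from this incident:
#   fstype=$(findmnt -no FSTYPE /mnt/persistent-data)
#   grow_fs_command "$fstype" /mnt/persistent-data /dev/nvme2n1p1
```

Printing the command instead of executing it lets the operator review it before running it on a node that is already degraded.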

Once the partition and filesystem have been grown, run the boot hooks to restore any services or mounts that failed due to the full partition:

/opt/jrc/sbin/run-hooks boot

Impact

Cluster operations were disrupted. Writes to /mnt/jrc-comms failed and the /mnt/comms mount was unavailable until the partitions were manually resized.

Existing Guardrails

EC2 disk utilization alarms are in place for managed customers.

Possible Fixes (dev team)

  • Investigate how grow-volume should handle the case where cfn-init fails because of a full partition before grow-volume can run
  • Verify that grow-volume still detects the disk/partition size difference when a cfn-hup retry arrives after the volume extension has fully propagated
  • Investigate the ability to auto-grow partitions and filesystems when a disk utilization alarm threshold is hit
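One direction for the first item is to stop relying on the full mount for temporary files. The sketch below is illustrative only and is not how grow-volume currently behaves: pick_scratch_dir and the grow-volume.XXXXXX template are hypothetical names. It returns the first candidate directory in which a small file can actually be created and written.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate directory where a temporary
# file can be created and written, so that a full mount does not block the
# resize. Returns non-zero if no candidate is usable.
pick_scratch_dir() {
    for dir in "$@"; do
        # mktemp fails if the directory is missing or a file cannot be created.
        if tmp=$(mktemp "$dir/grow-volume.XXXXXX" 2>/dev/null); then
            # Also try a small write, since file creation alone may succeed.
            if printf 'probe' 2>/dev/null > "$tmp"; then
                rm -f "$tmp"
                echo "$dir"
                return 0
            fi
            rm -f "$tmp"
        fi
    done
    return 1
}

# Example wiring (the candidate order is an assumption):
#   scratch=$(pick_scratch_dir /mnt/jrc-comms /run /tmp) || exit 1
```

Listing the usual location first keeps normal behavior unchanged, and the fallbacks are only used when that location cannot accept a write.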