I/O issues (January 2026)

As of January 2026, we have identified a potential I/O bottleneck in our current setup. The compute nodes and the head node share a storage system that also serves multiple labs. When several labs run a large number of compute nodes or process large datasets at the same time, overall I/O performance can degrade, causing slowdowns across the board.

Behaviors observed:

  • Users can log in to the head node, and basic command-line operations still behave normally.
  • Operations that touch any file on the shared filesystem (including loading your environment, or any other I/O) slow down significantly, on both the head node and the compute nodes; see the timing sketch after this list.
  • Jobs that are currently running appear to idle (they remain in the running state but produce no new output).
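
If you want to check whether the shared storage is the culprit, a quick comparison of small-file I/O latency on the shared filesystem versus node-local /tmp can help. The sketch below is a rough probe, not an official diagnostic: the shared and local paths are assumptions and should be adjusted to your own setup.

    #!/usr/bin/env python3
    """Rough latency probe: time a small write/read on the shared filesystem
    vs. node-local /tmp. Paths are placeholders -- point SHARED_DIR at a
    directory that actually lives on the shared storage (e.g. your home dir)."""

    import os
    import tempfile
    import time

    SHARED_DIR = os.path.expanduser("~")   # assumed to be on the shared filesystem
    LOCAL_DIR = "/tmp"                     # assumed to be node-local storage
    PAYLOAD = b"x" * (1 << 20)             # 1 MiB test payload


    def probe(directory: str) -> float:
        """Write and read back a small file in `directory`; return elapsed seconds."""
        start = time.perf_counter()
        with tempfile.NamedTemporaryFile(dir=directory, delete=True) as f:
            f.write(PAYLOAD)
            f.flush()
            os.fsync(f.fileno())           # force the write to hit the filesystem
            f.seek(0)
            f.read()
        return time.perf_counter() - start


    if __name__ == "__main__":
        for label, path in [("shared", SHARED_DIR), ("local", LOCAL_DIR)]:
            print(f"{label:6s} ({path}): {probe(path):.3f} s")

If the shared-path timing is orders of magnitude slower than the local one during a slowdown, that is consistent with the shared-storage bottleneck described above.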

To-do:

  1. Confirm the current storage usage and status with the cluster admins.
  2. If the admins confirm the issue, promptly cancel any of your jobs that can be rerun later or that are particularly expensive, to avoid wasting compute costs (see the job-listing sketch after this list).
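
For step 2, the sketch below shows one way to review your running jobs before deciding what to cancel. It assumes the cluster uses Slurm (squeue/scancel), which this note does not confirm; substitute your scheduler's equivalent commands if it differs.

    #!/usr/bin/env python3
    """List your own running jobs (assuming Slurm) so you can decide which
    ones to cancel. Cancelling is left as a manual step on purpose."""

    import getpass
    import subprocess

    user = getpass.getuser()

    # Show job id, name, elapsed time, and node count for your own jobs.
    result = subprocess.run(
        ["squeue", "-u", user, "-o", "%i %j %M %D"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)

    # After reviewing the list, cancel a job you can rerun later, e.g.:
    #   subprocess.run(["scancel", "<jobid>"], check=True)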

We are exploring solutions to mitigate this. One approach is to use EC2 instance types with fast local storage (such as instances with NVMe drives), which would reduce dependence on the shared filesystem. Another option is to adjust instance configurations to improve I/O performance for specific workloads.
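
As an illustration of the local-storage approach, a job could stage its inputs from the shared filesystem onto node-local NVMe scratch, do all heavy I/O locally, and copy results back once at the end. The sketch below uses placeholder paths; the actual NVMe mount point depends on the instance type and how it is configured.

    #!/usr/bin/env python3
    """Illustrative staging pattern for instances with fast local storage:
    one bulk copy in, all heavy I/O against local scratch, one bulk copy out.
    All paths below are placeholders, not real cluster paths."""

    import shutil
    from pathlib import Path

    SHARED_INPUT = Path("/shared/data/input.bin")       # placeholder shared-storage path
    SHARED_OUTPUT = Path("/shared/results/output.bin")  # placeholder shared-storage path
    LOCAL_SCRATCH = Path("/local/scratch")              # placeholder NVMe mount point


    def run_with_local_staging() -> None:
        LOCAL_SCRATCH.mkdir(parents=True, exist_ok=True)
        local_input = LOCAL_SCRATCH / SHARED_INPUT.name
        local_output = LOCAL_SCRATCH / SHARED_OUTPUT.name

        # One bulk read from the shared filesystem instead of many small accesses.
        shutil.copy2(SHARED_INPUT, local_input)

        # ... do all heavy I/O against local_input / local_output here ...
        local_output.write_bytes(local_input.read_bytes())  # stand-in for real processing

        # One bulk write back to shared storage at the end of the job.
        SHARED_OUTPUT.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local_output, SHARED_OUTPUT)


    if __name__ == "__main__":
        run_with_local_staging()

Keeping shared-filesystem traffic to a single copy in and a single copy out per job is what reduces the pressure on the shared storage, regardless of which instance type we end up choosing.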