Parallel Processing: A Comprehensive Overview of Cluster Job Submission and Management

Author: Haochen Sun, August 2023

Understanding Processor Architecture

To efficiently manage the processing of extensive datasets, such as computing over 10,000 files in parallel, it is crucial to understand the computational resources at our disposal. The High Performance Computing cluster (hereafter referred to as the cluster) typically consists of approximately 100 nodes. Each computational node contains 2 CPUs, every CPU has 16 cores, and each core has 2 threads (processors). The processor is the smallest processing unit, so each node can handle 2 × 16 × 2 = 64 tasks simultaneously. Alongside processors, memory plays a critical role in computation.

It’s important to recognize that configurations vary depending on the model of the machine. For instance, a CPU on another computer might only have 4 cores. On a Linux system, running the cat /proc/cpuinfo command provides detailed information about your specific setup.
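
On the cluster itself, a couple of standard Linux commands summarize this topology; a quick sketch (the exact lscpu field names may vary slightly between versions):

# Count the logical processors (threads) visible to the operating system
nproc
grep -c ^processor /proc/cpuinfo

# Summarize sockets (CPUs), cores per socket, and threads per core
lscpu | grep -E '^CPU\(s\)|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'

On a node as described above, this should report 2 sockets, 16 cores per socket, and 2 threads per core, i.e. 64 logical CPUs.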

Memory Allocation and Management

  • Each node is typically equipped with approximately 256 GB of memory (some nodes have 512 GB), which is shared among all threads on that node.
  • If 64 tasks each use 1 GB of memory, they only need 64 GB in total, so all of them fit comfortably on a single node. Since processor limitations already cap a node at 64 simultaneous tasks, most of the node’s memory would simply sit unused. Thanks to the job scheduler running on the cluster, such small tasks may instead be placed on nodes where the balance between memory and CPU works out better, taking into account other users’ jobs active on the cluster at the time.
  • Conversely, if each task requires 200 GB of memory, a node’s 256 GB would be effectively exhausted by a single task running on one processor, leaving all the other processors idle (a quick way to check both cases is sketched after this list).
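
A rough back-of-the-envelope check captures both situations; the sketch below assumes the 256 GB, 64-processor node described above, and mem_per_task_gb is whatever your own task actually needs:

# How many tasks fit on one node: limited by whichever runs out first,
# processors or memory.
node_mem_gb=256        # memory per node (GB)
node_processors=64     # logical processors (threads) per node
mem_per_task_gb=1      # assumed memory per task (GB); replace with your own estimate

by_memory=$(( node_mem_gb / mem_per_task_gb ))
tasks_per_node=$(( by_memory < node_processors ? by_memory : node_processors ))
echo "At most ${tasks_per_node} such tasks fit on one node"

With 1 GB per task this prints 64 (processor-bound); with 200 GB per task it prints 1 (memory-bound), matching the two cases above.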

Example: How to Efficiently Manage 100,000 Computing Tasks

An initial consideration might be to process all 100,000 files sequentially in a single job:

  • Task 1: process file1
  • Task 2: process file2
  • …
  • Task 100,000: process file100000

However, if each task takes a minute, the entire job would require 100,000 minutes (nearly 70 days) and would engage only a single processor. This strategy cannot unlock the potential of high performance computing.
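
In shell terms, the sequential approach is just one long loop; a sketch, where process stands in for whatever command handles a single file:

# Sequential baseline: a single processor, roughly 100,000 minutes
# if each file takes about a minute
for i in $(seq 1 100000); do
    process "file${i}"   # placeholder for the real per-file command
done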

Parallelization Strategy

Introducing the concept of job_size can facilitate much more efficient processing. If job_size is set to 1,000, each job handles 1,000 tasks, so the computational load is divided into 100 jobs that can be executed simultaneously. Since each task takes about a minute, every job requires roughly 1,000 minutes; with all 100 jobs allocated to processors at the same time, the entire process completes in roughly 1,000 minutes instead of 100,000.

The structure would be as follows:

  • Job 1:
    • Task 1: process file1
    • …
    • Task 1000: process file1000
  • Job 2:
    • Task 1001: process file1001
    • …
    • Task 2000: process file2000
  • …
  • Job 100:
    • Task 99001: process file99001
    • …
    • Task 100000: process file100000

However, it’s worth noting that partitioning tasks too finely can overburden the job manager. For instance, setting job_size to 1 would create 100,000 simultaneous jobs, which would also require an impractically large number of processors.

As Rui suggested, our cluster recommends a maximum of about 200 jobs running at the same time; any jobs submitted beyond that limit will stay in the qw (queued-waiting) state until earlier ones finish. In practice it is therefore good to keep some room for temporary use, for example by submitting at most ~190 jobs at once.
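
Before submitting, it is easy to sanity-check how many jobs a given job_size will produce, and afterwards to count how many are still waiting; a sketch (the qstat state column may differ slightly between schedulers):

n_files=100000
job_size=1000
n_jobs=$(( (n_files + job_size - 1) / job_size ))   # ceiling division: 100 jobs here
echo "job_size=${job_size} produces ${n_jobs} jobs"

# Jobs submitted beyond the concurrency limit sit in the qw state:
qstat -u "$USER" | awk '$5 == "qw"' | wc -l          # count queued-waiting jobs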

Adjusting Parameters in SoS Workflows

The parameters numThreads and trunk_workers also appear in the SoS notebook, but in general the suggestion is not to change them:

  • numThreads does not determine the number of threads a job can utilize. Unless the functions being executed are explicitly designed for multi-threading, this parameter has no effect. Code we write ourselves typically does not have this feature, so changing numThreads will not help.
  • trunk_workers: generally, it’s advisable to retain its default value of 1.

Parameters can be set in the notebook as:

[global]
parameter: job_size = 10
parameter: mem = "40G"
parameter: numThreads = 1

Or within the job submission script:

sos run xxx.ipynb --mem 40G --walltime 12h

Troubleshooting

Several issues might arise during the execution of jobs:

  • If a job remains in the queue for an extended period, it may be requesting more memory than any node currently has available. Reducing the memory request can expedite job execution; a way to inspect a pending job is sketched after this list.
  • Apart from the SoS notebook parameters, other factors might influence job execution. These can be uncovered by inspecting the underlying processes.
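
For a job stuck in the queue, inspecting it directly usually reveals what it asked for; a sketch (the exact fields shown depend on how the scheduler is configured):

qstat -u "$USER"     # list your jobs and their states (r = running, qw = waiting)
qstat -j <job_id>    # show the job's resource requests (h_rt, h_vmem) and, if enabled, why it cannot be scheduled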

Behind the Execution Process

  • The Job Manager oversees and regulates the execution of all tasks. It is integral for effective job management and monitoring.
  • The manager, too, relies on parameters. Key parameters include h_rt (maximum runtime) and h_vmem (memory allocation). Requesting ample time and memory is pivotal for the manager to function well; in particular, make sure the maximum runtime is long enough to cover your final job.
  • The yml file passed with -c keeps the workload balanced and avoids overloading the cluster. If efficiency seems amiss, examining this yml file can be enlightening. For example, if max_running_jobs is set to 50, only 50 jobs will be submitted at a time and the others must wait until those finish. The same limit can also be controlled with the -J parameter.

Parameters of the job manager can be set in the header of the job submission script:

#$ -l h_rt=150:00:00
#$ -l h_vmem=30G
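
Put together, a minimal submission script might look like the sketch below; -cwd and -N are optional conveniences, and xxx.ipynb is the notebook from the earlier example:

#!/bin/bash
#$ -l h_rt=150:00:00     # maximum runtime requested from the job manager
#$ -l h_vmem=30G         # memory requested from the job manager
#$ -cwd                  # run from the current working directory
#$ -N sos_submit         # job name (arbitrary)

sos run xxx.ipynb --mem 40G --walltime 12h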

Parameters of the yml file can be found in the yml file itself, e.g.:

max_cores: 40
max_running_jobs: 50
max_walltime: '1000:00:00'
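
Both of these limits are typically passed when the workflow is launched; a minimal sketch, where cluster.yml is only a placeholder for whichever yml file your cluster actually uses:

sos run xxx.ipynb --mem 40G --walltime 12h -c cluster.yml -J 50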

Optimal Task Processing Strategies

  • Scenario A: Tasks are time-intensive but require minimal memory.
    • Solution: Reduce the job size so that more jobs can run simultaneously, while keeping the total job count within the -J limit (see the sketch after this list).
  • Scenario B: Tasks are executed swiftly and have moderate memory requirements.
    • Solution: Increase the job size, consolidating tasks to minimize the time spent on job submission.
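
For Scenario A, a simple rule of thumb is to pick the smallest job_size that keeps the number of jobs under the practical cap; a sketch using the ~190-job headroom suggested earlier:

n_files=100000
max_jobs=190
# Scenario A: many slow, low-memory tasks, so maximize parallelism within the cap
job_size=$(( (n_files + max_jobs - 1) / max_jobs ))   # ceiling division: 527 here
echo "job_size >= ${job_size} keeps the job count at or below ${max_jobs}"
# Scenario B: fast tasks, so choose a much larger job_size to amortize submission overhead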

Conclusion

While efficiency is paramount, the quality of execution should not be compromised. If tasks are extremely memory-intensive, even a well-thought-out strategy might not yield the desired results. Optimizing and refining the codebase remains essential.