Trouble-shoot SoS workflow submitted to HPC

Suppose you submit a workflow like this to our HPC, using this configuration template. Then something goes wrong — error messages, absence of output files, or all jobs simply quit right after submission. What should you do? This document is written to help with that.

Terminology & concepts

How SoS workflow submission works is that there is a SoS parser that will generate a collection of computing script for each of the substeps. Once that is done, a SoS job manager / master process will distributes them as smaller jobs to cluster job manager, which will actually run your computing script.

Errors can happen when your script is buggy, and/or when the script failed to communicate with the cluster job manager, and/or when the cluster job manager failed to communicate with SoS master job manager.

Debug from some common error messages

“Parser error”, shown as a red Python error message in the *.log file, means that SoS cannot generate your computing script before it is able to submit it to the cluster. This is a common error indicating your SoS coding is buggy. It often results from syntax errors in the input/output statements or bugs in the [global] section of your SoS notebook. You can dry run your SoS command with -n before the actual analysis to make sure your SoS code has the correct syntax, before submitting any jobs. Note that you should remove all other single-dash parameters, such as -c,-s,-j,-q, when “dryrun” the SoS command.

Example of a “parser error”:

INFO: te51e0f8b6d3c433f submitted to csg with job id 2300214
ERROR: [PEER_1]: [te51e0f8b6d3c433f]:
NameError                                 Traceback (most recent call last)
script_1387602820059950656 in <module>
      ent.set_data_matrix([[data.transpose()]],samples_names=[data.columns.values.tolist()], features_names=[data.index.values.tolist()])
      ent.set_model_options(factors= num_factor , spikeslab_weights=False, ard_weights=False)
----> ent.set_train_options(iter={max_iter}, convergence_mode="{convergence_mode}",dropR2 = {f'dropR2 = {r2_tol}' if r2_tol else "" }, startELBO=1, freqELBO=1,tolerance={tol}, gpu_mode=False, verbose=True, seed=42){_output:ar})

NameError: name 'tol' is not defined
[PEER]: Exits with 1 pending step (PEER_2)

“Script error” means that there are bugs in your actual code. They usually appear as a 3-4 line message with both red and green in your .log file. For example:

[td9456b5c4242ecd6]: Executing script in Singularity returns an error (exitcode=1, stderr=/mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad_3/cache/
The script has been saved to /home/hs3163/.sos/7df85b31c6799c73/singularity_run_27033.R. To reproduce the error please run:
singularity exec  containers/stephenslab.sif Rscript /home/hs3163/.sos/7df85b31c6799c73/singularity_run_27033.R

This error message indicates that one of the subtasks failed during the SoS run. This error is specific to this subtask. It occurred when running the script using the containers/stephenslab.sif container. To see the actual error message, you should check the /mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad_3/cache/ file.

The computing script in this case is /home/hs3163/.sos/7df85b31c6799c73/singularity_run_27033.R. Most of the time, reading the .stderr files and the computing script can help you identify the bugs. If not, it is likely an error in your input data. In this case, you can start a shell in the container, and open up R environment inside of it, to perform line by line debug:

singularity shell containers/stephenslab.sif
# then run each line in the *computing script* interactively

“Cluster error” usually appears as a purple warning message in the .log file, without a clear cause of the problem. It is not very obvious and relies on multiple other files to help you identify the issue, see section below.

Debug from various output log files

There is a more advance 4-step debug based on log files:

What are the log files?

  1. cat *.log -> parser & manager error
  2. sos status {task_id} -v4
  3. cat *.stderr -> script error
  4. cat *.err -> cluster error

In the notebook, you may notice this line

 &> toy_example.log

which writes a .log file. You may also notice this:

stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'

which writes .stderr and .stdout files.

In the job configuration template you may notice:

      #$ -o /home/{user_name}/.sos/{job_name}.out
      #$ -e /home/{user_name}/.sos/{job_name}.err

which writes to .out and .err files.

All these files contain different yet relevant information about the job:

  1. The .log file tracks the messages/status of the “master” process of SoS that manages different smaller jobs. It gives you an overview of each SoS job submitted.
  2. The .err file tracks messages from a particular job that you can see in your qstat output. It captures feedback from the job manager on the cluster for a given job submitted by the SoS master process (above).
  3. The .stderr file (and the .stdout file) tracks the messages from the actual code, or script you run. For example if all your code does is to print("hello, world") then .stdout file will have a line hello, world and .stderr file will be empty.

At the high level:

  1. Check .log is to check if SoS job manager on all jobs in the submission is wrong.
  2. Check .err is to check if the cluster job manager on a particular job is wrong.
  3. Check .stderr is to check if your code is wrong.

So when you think your job fails, you check it in the order outlined above.

How to debug from these files?

The first step, cat *.log, should show you the status of the jobs:

cat ~/Work/MASH_test_h_rt5/sumstat.log
INFO: Running processing_1:
INFO: tde5c207196319c89 submitted to csg2 with job id 3939924
INFO: ta46449c96b861d6f submitted to csg2 with job id 3939925
INFO: t5530b8f3c07275cf submitted to csg2 with job id 3939926
INFO: t3348c7def618a6b7 submitted to csg2 with job id 3939927

Each of these task_id correspond to a job submitted to the cluster, managed by SoS, so this task_id is different from the cluster’s job ID you find in qstat. You should identify the problematic jobs by reading the output above and get their task_id.

The second step is to find the actual job submitted from the task_id in question. To do so, run

sos status tde5c207196319c89 -v4

as an example. It should print lots of details about the job, see below:

tde5c207196319c89	aborted

Created 1 day 4 hr ago
Started 1 day 4 hr ago
Ran for 13 min
extract_data = function(vcfObj,keep_snp)
vcfObj = VariantAnnotation::readVcf("{_input}")
GENE = unique(VariantAnnotation::info(vcfObj)$GENE)%>%as.list()
info = VariantAnnotation::info(vcfObj)%>%as_tibble(rownames = "ID")
output_list = map(GENE, ~extract_data(vcfObj,info%>%filter(GENE %in% .x )%>%pull(ID)))
output_path = tibble(gene = GENE%>%unlist)%>%mutate(path = map_chr(GENE,~paste0("{_output:nn}.",.x,".rds")) )
output_path%>%write_delim("{_output}","\t",col_names = FALSE)

""", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container)

MASH_test_h_rt5 processing_1 processing_1_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_

{'container': '/mnt/vast/hpc/csg/containers/bioinfo.sif'}

__signature_vars__    {'_input', '_output', 'R', 'container'}
_depends              []
_index                1
_input                [file_target(/mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_celltypes_analysis/output/data_intergration/TensorQTL/ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr11.norminal.cis_long_table.merged.vcf.gz)]
_output               [file_target(/mnt/vast/hpc/csg/rf2872/Work/MASH_test_h_rt5/output/RDS/ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr11.norminal.cis_long_table.merged_rds.list)]
_runtime              {'max_cores': 40,
 'max_walltime': '240:00:00',
 'mem': 30000000000,
 'queue': 'csg2',
 'run_mode': 'run',
 'sig_mode': 'default',
 'tags': 'processing_1_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr11.norminal.cis_long_table.merged_rds',
 'trunk_size': 1,
 'trunk_workers': 1,
 'verbosity': 2,
 'walltime': '05:00:00',
 'workdir': '/mnt/vast/hpc/csg/rf2872/Work/MASH_test_h_rt5'}
container             '/mnt/vast/hpc/csg/containers/bioinfo.sif'
step_name             'processing_1'
workflow_id           'w918f8be368481a89'

execution script:
#$ -l h_rt=05:00:00
#$ -l h_vmem=30G
#$ -pe openmp 1
#$ -N job_tde5c207196319c89
#$ -o /home/rf2872/.sos/tde5c207196319c89.out
#$ -e /home/rf2872/.sos/tde5c207196319c89.err
#$ -cwd
#$ -S /bin/bash
#$ -q csg2.q -l t_pri
module load Singularity
module load R/4.0
export PATH=$HOME/miniconda3/bin:$PATH
export SINGULARITY_BIND="/mnt/mfs/ctcn/:/mnt/mfs/ctcn/"
set -e
sos execute tde5c207196319c89 -v 2 -s default

standard error:
Task tde5c207196319c89 inactive for more than 61 seconds, might have been killed.

The most important are the TASK section, which contains the script you actually executed. There you may find string patterns like {_output:n}.stderr which captures the runtime information of this script.

The third step is to see what might be wrong with this script during its execution earlier. To do so, you need to track its stderr. In the ENVIRONMENT section of the output of sos status -v4 command, you should see the _output variable and its value which is a string of file names. Depending on your workflow, there can be more than one output file. But you should see some .stderr file on your file system sitting next to those output files — that captures runtime information for the script. You can then check the information using (In the cases where the error is not caused by your script it may seem normal):

cat ~/Work/MASH_test_h_rt5/output/RDS/ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr6.norminal.cis_long_table.merged_rds.list.stderr
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Usually, if something is wrong with your script and your script quits with a complaint like an error message, then the .stderr file should capture that.

But if this file is empty, or it contains explicit information to indicate that it completed successfully, then the third step is to see if the output file, which you already know what they are, are actually generated and intact. If that is the case, it means the error message might simply be a miscommunication between job managers which indeed happen from time to time. You might get away with just adding -s build to your sos run command to skip these existing files next time when you rerun. Obviously, cautions should be taken here and you should do this only if you are very sure that the output is good.

If you cannot find the output file and there is not much that stderr file can tell you, then you should look into the execution script section of the output of sos status -v4 command. There you should see and check into the .err file:

cat ~/.sos/tde5c207196319c89.err

which should output the name of the compute node the job was executed (as a result of this line of configuration), along with other possible error messages that you may Google search to figure out, or to post to #computing-matters channel on slack for a discussion.

Now, if you still have no clue from the above steps, it is then time to work interactively with the job to figure out what’s going on. You have two options:

  • Save contents of execution script to a file, call it eg, then submit it directly see if it works:

The goal of this step is to bypass the SoS job manager and test directly through the cluster job manager, to see what might be wrong.

  • With information on the computing node that the failed job was executed (from the aforementioned cat *.err), use qlogin to go to that node (in this case, queue csg2.q node node75), and execute directly the command sos execute that you find under the execution script section:
qlogin -q csg2.q -l t_pri -l h="node75"
sos execute tde5c207196319c89 -v 2 -s default

The goal of this step is to test and rule out if there is an issue on a particular node. Notice here that you have not specified memory requested and wall time in the qlogin. It is going to use the default which may or may not be sufficient.

If the job fails on this particular node and if you know that other similar job worked, you can try to run it in another node, eg

qlogin -q csg2.q -l t_pri -l h="node80"

see if it work. If so, you know something must be wrong with node75. You should report it to the #computing-matters channel.

Additional debug tips

  1. “Print” often. You can always add some print statement in your script to track where the code progress at. The output of print can be found in stdout files

Clean up job cache folder

All the runtime information printed via sos status are saved under ~/.sos. These files will be automatically purged from time to time, or can be manually done by using

sos purge

As you might have noticed from previous sections, depending on the settings in the job configuration template, numerous *.out and *.err files might be saved to ~/.sos folder as well. A more aggressive way to clean up is to simply delete the entire folder:

cd ~
rm -rf .sos

Caution that when you delete all the runtime history will be lost, and rerun will be triggered next time you execute the same previously executed command, unless you use the option -s build to skip rerun when output exists.