Troubleshooting SoS workflows submitted to HPC
Suppose you submit a workflow like this to our HPC, using this configuration template. Then something goes wrong — error messages, absence of output files, or all jobs simply quit right after submission. What should you do? This document is written to help with that.
Terminology & concepts
SoS workflow submission works as follows: a SoS parser first generates a collection of computing scripts, one for each substep. Once that is done, a SoS job manager (the master process) distributes them as smaller jobs to the cluster job manager, which actually runs your computing scripts.
Errors can happen when your script is buggy, when the script fails to communicate with the cluster job manager, and/or when the cluster job manager fails to communicate with the SoS master job manager.
Debug from some common error messages
“Parser error”, shown as a red Python error message in the *.log file, means that SoS cannot generate your computing script and therefore never submits it to the cluster. This is a common error indicating that your SoS code is buggy. It often results from syntax errors in the input/output statements or bugs in the [global] section of your SoS notebook. You can dry run your SoS command with -n before the actual analysis to make sure your SoS code has the correct syntax, before submitting any jobs. Note that you should remove all other single-dash parameters, such as -c, -s, -j, -q, when dry-running the SoS command.
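For example, assuming your workflow lives in a notebook called my_workflow.ipynb with a step named my_step (both names are placeholders for your own), a dry run would look like:
sos run my_workflow.ipynb my_step -n
If your SoS syntax is correct, this should complete without submitting any jobs, typically printing the scripts it would have run.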
Example of a “parser error”:
INFO: te51e0f8b6d3c433f submitted to csg with job id 2300214
ERROR: [PEER_1]: [te51e0f8b6d3c433f]:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
script_1387602820059950656 in <module>
ent.set_data_matrix([[data.transpose()]],samples_names=[data.columns.values.tolist()], features_names=[data.index.values.tolist()])
ent.set_model_options(factors= num_factor , spikeslab_weights=False, ard_weights=False)
----> ent.set_train_options(iter={max_iter}, convergence_mode="{convergence_mode}",dropR2 = {f'dropR2 = {r2_tol}' if r2_tol else "" }, startELBO=1, freqELBO=1,tolerance={tol}, gpu_mode=False, verbose=True, seed=42)
ent.build()
ent.run()
ent.save({_output:ar})
NameError: name 'tol' is not defined
[PEER]: Exits with 1 pending step (PEER_2)
“Script error” means that there are bugs in your actual code. Such errors usually appear as a 3-4 line message, with both red and green text, in your .log file. For example:
[td9456b5c4242ecd6]: Executing script in Singularity returns an error (exitcode=1, stderr=/mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad_3/cache/demo.ENSG00000275963.unisusie.fit.stderr).
The script has been saved to /home/hs3163/.sos/7df85b31c6799c73/singularity_run_27033.R. To reproduce the error please run:
singularity exec containers/stephenslab.sif Rscript /home/hs3163/.sos/7df85b31c6799c73/singularity_run_27033.R
---------------------------------------------------------------------------
This error message indicates that one of the subtasks failed during the SoS run. This error is specific to this subtask. It occurred when running the script using the containers/stephenslab.sif container. To see the actual error message, you should check the /mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad_3/cache/demo.ENSG00000275963.unisusie.fit.stderr file.
The computing script in this case is /home/hs3163/.sos/7df85b31c6799c73/singularity_run_27033.R. Most of the time, reading the .stderr files and the computing script can help you identify the bugs. If not, it is likely an error in your input data. In this case, you can start a shell in the container and open up the R environment inside it, to debug line by line:
singularity shell containers/stephenslab.sif
R
# then run each line in the *computing script* interactively
“Cluster error” usually appears as a purple warning message in the .log file, without a clear cause of the problem. The cause is often not obvious, and you will need several other files to identify the issue; see the section below.
Debug from various output log files
There is a more advanced, 4-step debugging procedure based on log files:
What are the log files?
1. cat *.log -> parser & manager error
2. sos status {task_id} -v4
3. cat *.stderr -> script error
4. cat *.err -> cluster error
In the notebook, you may notice this line
&> toy_example.log
which writes a .log file. You may also notice this:
stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
which writes .stderr and .stdout files.
In the job configuration template you may notice:
#$ -o /home/{user_name}/.sos/{job_name}.out
#$ -e /home/{user_name}/.sos/{job_name}.err
which writes to .out and .err files.
All these files contain different yet relevant information about the job:
- The .log file tracks the messages/status of the “master” process of SoS that manages the different smaller jobs. It gives you an overview of each SoS job submitted.
- The .err file tracks messages from a particular job that you can see in your qstat output. It captures feedback from the cluster job manager for a given job submitted by the SoS master process (above).
- The .stderr file (and the .stdout file) tracks the messages from the actual code, or script, you run. For example, if all your code does is print("hello, world"), then the .stdout file will have a line hello, world and the .stderr file will be empty.
At a high level:
- Check .log to see whether something is wrong with the SoS job manager across all jobs in the submission.
- Check .err to see whether something is wrong with the cluster job manager for a particular job.
- Check .stderr to see whether something is wrong with your code.
So when you think your job has failed, check these files in the order outlined above.
How to debug from these files?
The first step, cat *.log, should show you the status of the jobs:
cat ~/Work/MASH_test_h_rt5/sumstat.log
INFO: Running processing_1:
INFO: tde5c207196319c89 submitted to csg2 with job id 3939924
INFO: ta46449c96b861d6f submitted to csg2 with job id 3939925
INFO: t5530b8f3c07275cf submitted to csg2 with job id 3939926
INFO: t3348c7def618a6b7 submitted to csg2 with job id 3939927
Each of these task_ids corresponds to a job submitted to the cluster and managed by SoS; this task_id is different from the cluster's job ID that you find in qstat.
You should identify the problematic jobs by reading the output above and note their task_ids.
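For a long log, a quick way to locate the problematic tasks (just a convenience; reading the log directly works too) is to filter for error and warning lines, for example:
grep -E "ERROR|WARNING" ~/Work/MASH_test_h_rt5/sumstat.log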
The second step is to find the actual job submitted from the task_id in question. To do so, run
sos status tde5c207196319c89 -v4
as an example. It should print lots of details about the job, see below:
tde5c207196319c89 aborted
Created 1 day 4 hr ago
Started 1 day 4 hr ago
Ran for 13 min
TASK:
=====
R(fr"""library("dplyr")
library("tibble")
library("purrr")
library("readr")
extract_data = function(vcfObj,keep_snp)
vcfObj = VariantAnnotation::readVcf("{_input}")
GENE = unique(VariantAnnotation::info(vcfObj)$GENE)%>%as.list()
info = VariantAnnotation::info(vcfObj)%>%as_tibble(rownames = "ID")
output_list = map(GENE, ~extract_data(vcfObj,info%>%filter(GENE %in% .x )%>%pull(ID)))
output_path = tibble(gene = GENE%>%unlist)%>%mutate(path = map_chr(GENE,~paste0("{_output:nn}.",.x,".rds")) )
walk2(output_list,output_path$path,~.x%>%saveRDS(.y))
output_path%>%write_delim("{_output}","\t",col_names = FALSE)
""", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container)
TAGS:
=====
MASH_test_h_rt5 processing_1 processing_1_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_
GLOBAL:
=======
{'container': '/mnt/vast/hpc/csg/containers/bioinfo.sif'}
ENVIRONMENT:
============
__signature_vars__ {'_input', '_output', 'R', 'container'}
_depends []
_index 1
_input [file_target(/mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_celltypes_analysis/output/data_intergration/TensorQTL/ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr11.norminal.cis_long_table.merged.vcf.gz)]
_output [file_target(/mnt/vast/hpc/csg/rf2872/Work/MASH_test_h_rt5/output/RDS/ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr11.norminal.cis_long_table.merged_rds.list)]
_runtime {'max_cores': 40,
'max_walltime': '240:00:00',
'mem': 30000000000,
'queue': 'csg2',
'run_mode': 'run',
'sig_mode': 'default',
'tags': 'processing_1_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr11.norminal.cis_long_table.merged_rds',
'trunk_size': 1,
'trunk_workers': 1,
'verbosity': 2,
'walltime': '05:00:00',
'workdir': '/mnt/vast/hpc/csg/rf2872/Work/MASH_test_h_rt5'}
container '/mnt/vast/hpc/csg/containers/bioinfo.sif'
step_name 'processing_1'
workflow_id 'w918f8be368481a89'
execution script:
================
#!/bin/sh
#$ -l h_rt=05:00:00
#$ -l h_vmem=30G
#$ -pe openmp 1
#$ -N job_tde5c207196319c89
#$ -o /home/rf2872/.sos/tde5c207196319c89.out
#$ -e /home/rf2872/.sos/tde5c207196319c89.err
#$ -cwd
#$ -S /bin/bash
#$ -q csg2.q -l t_pri
module load Singularity
module load R/4.0
export PATH=$HOME/miniconda3/bin:$PATH
export SINGULARITY_BIND="/mnt/mfs/ctcn/:/mnt/mfs/ctcn/"
set -e
sos execute tde5c207196319c89 -v 2 -s default
standard error:
================
Task tde5c207196319c89 inactive for more than 61 seconds, might have been killed.
The most important part is the TASK section, which contains the script that was actually executed. There you may find string patterns like {_output:n}.stderr, which capture the runtime information of this script.
The third step is to see what might have gone wrong with this script during its earlier execution. To do so, you need to track its stderr. In the ENVIRONMENT section of the sos status -v4 output, you should see the _output variable, whose value is a list of file names. Depending on your workflow, there can be more than one output file. You should see a .stderr file on your file system sitting next to those output files; it captures the runtime information of the script. You can then check it as follows (in cases where the error is not caused by your script, its contents may look perfectly normal):
cat ~/Work/MASH_test_h_rt5/output/RDS/ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr6.norminal.cis_long_table.merged_rds.list.stderr
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Usually, if something is wrong with your script and it quits with a complaint such as an error message, the .stderr file should capture it.
But if this file is empty, or it explicitly indicates that the script completed successfully, then the next thing to check is whether the output files, which you already know from _output, were actually generated and are intact. If they were, the error message might simply be a miscommunication between the job managers, which does happen from time to time. You might get away with adding -s build to your sos run command to skip these existing files the next time you rerun. Obviously, caution should be taken here: do this only if you are very sure that the output is good.
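For example, if your original submission command looked like the placeholder below (the notebook, workflow, and config names are illustrative only, not from this document), the rerun that skips steps whose output already exists would be:
sos run my_workflow.ipynb my_step -c config.yml -q csg2 -s build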
If you cannot find the output files and the .stderr file does not tell you much, then you should look into the execution script section of the sos status -v4 output. There you will find the path of the .err file (on the #$ -e line); check it with:
cat ~/.sos/tde5c207196319c89.err
which should output the name of the compute node on which the job was executed (as a result of this line of the configuration), along with other possible error messages that you can search online, or post to the #computing-matters channel on Slack for discussion.
Now, if you still have no clue from the above steps, it is then time to work interactively with the job to figure out what’s going on. You have two options:
- Save the contents of the execution script section to a file, e.g. test.sh, then submit it directly to see if it works:
qsub test.sh
The goal of this step is to bypass the SoS job manager and test directly through the cluster job manager, to see what might be wrong.
- With the information on the compute node where the failed job was executed (from the aforementioned cat *.err), use qlogin to go to that node (in this case, queue csg2.q, node node75), and directly execute the sos execute command that you find under the execution script section:
qlogin -q csg2.q -l t_pri -l h="node75"
sos execute tde5c207196319c89 -v 2 -s default
The goal of this step is to test for, and rule out, an issue with a particular node. Notice that you have not specified the requested memory or walltime in the qlogin; it will use the defaults, which may or may not be sufficient (see the sketch after this list for requesting resources explicitly).
If the job fails on this particular node and you know that other similar jobs worked, you can try to run it on another node, e.g.
qlogin -q csg2.q -l t_pri -l h="node80"
and see if it works. If it does, you know something must be wrong with node75, and you should report it to the #computing-matters channel.
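As noted above, qlogin without resource requests falls back on the cluster defaults. If you suspect those are too small, you can mirror the #$ -l lines from the execution script when starting the interactive session (the values below are taken from this example job; adjust them to yours):
qlogin -q csg2.q -l t_pri -l h="node75" -l h_vmem=30G -l h_rt=05:00:00
sos execute tde5c207196319c89 -v 2 -s default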
Additional debug tips
- “Print” often. You can always add print statements in your script to track how far the code has progressed. The output of print can be found in the .stdout files.
Clean up job cache folder
All the runtime information printed via sos status is saved under ~/.sos. These files are automatically purged from time to time, or you can purge them manually with
sos purge
As you might have noticed in previous sections, depending on the settings in the job configuration template, numerous *.out and *.err files may be saved to the ~/.sos folder as well.
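If you only want to reclaim space from these job log files without touching the rest of the task cache, a more targeted cleanup (a gentler alternative to the full deletion below) is:
rm -f ~/.sos/*.out ~/.sos/*.err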
A more aggressive way to clean up is to simply delete the entire folder:
cd ~
rm -rf .sos
Be cautious: when you delete this folder, all runtime history is lost, and a rerun will be triggered the next time you execute a previously executed command, unless you use the option -s build to skip the rerun when the output already exists.