Troubleshooting SoS workflows submitted to HPC
Suppose you submit a workflow like this to our HPC, using this configuration template. Then something goes wrong: you see error messages, output files are missing, or all jobs simply quit right after submission. What should you do? This document is written to help with that.
Terminology & concepts
SoS workflow submission works as follows: a SoS parser generates a collection of computing scripts, one for each substep. Once that is done, a SoS job manager (the master process) distributes them as smaller jobs to the cluster job manager, which actually runs your computing scripts.
Errors can happen when your script is buggy, when the script fails to communicate with the cluster job manager, and/or when the cluster job manager fails to communicate with the SoS master job manager.
Debug from some common error messages
“Parser error”, shown as a red Python error message in the *.log file, means that SoS cannot generate your computing script, so nothing is submitted to the cluster. This is a common error indicating that your SoS code is buggy. It often results from syntax errors in the input/output statements or bugs in the [global] section of your SoS notebook. You can dry run your SoS command with -n before the actual analysis to make sure your SoS code has the correct syntax, before submitting any jobs. Note that you should remove all other single-dash parameters, such as -c, -s, -j and -q, when dry-running the SoS command.
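For example, a dry run of a typical submission might look like the sketch below; the notebook, workflow and parameter names are hypothetical placeholders for your own command:
# normal submission (hypothetical): sos run my_workflow.ipynb my_step --cwd output -c csg.yml -q csg &> my_workflow.log
# dry run: keep the workflow and its double-dash parameters, drop -c/-s/-j/-q, and add -n
sos run my_workflow.ipynb my_step --cwd output -n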
Example of a “parser error”:
INFO: te51e0f8b6d3c433f submitted to csg with job id 2300214
ERROR: [PEER_1]: [te51e0f8b6d3c433f]:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
script_1387602820059950656 in <module>
ent.set_data_matrix([[data.transpose()]],samples_names=[data.columns.values.tolist()], features_names=[data.index.values.tolist()])
ent.set_model_options(factors= num_factor , spikeslab_weights=False, ard_weights=False)
----> ent.set_train_options(iter={max_iter}, convergence_mode="{convergence_mode}",dropR2 = {f'dropR2 = {r2_tol}' if r2_tol else "" }, startELBO=1, freqELBO=1,tolerance={tol}, gpu_mode=False, verbose=True, seed=42)
ent.build()
ent.run()
ent.save({_output:ar})
NameError: name 'tol' is not defined
[PEER]: Exits with 1 pending step (PEER_2)
“Script error” means that there are bugs in your actual code. It usually appears as a 3-4 line message, with both red and green text, in your .log file. For example:
[td9456b5c4242ecd6]: Executing script in Singularity returns an error (exitcode=1, stderr=/mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad_3/cache/demo.ENSG00000275963.unisusie.fit.stderr).
The script has been saved to /home/hs3163/.sos/7df85b31c6799c73/singularity_run_27033.R. To reproduce the error please run:
singularity exec containers/stephenslab.sif Rscript /home/hs3163/.sos/7df85b31c6799c73/singularity_run_27033.R
---------------------------------------------------------------------------
This error message indicates that one of the subtasks failed during the SoS run. The error is specific to this subtask, and it occurred when running the script inside the containers/stephenslab.sif container. To see the actual error message, check the /mnt/vast/hpc/csg/molecular_phenotype_calling/eqtl/output/susie_per_gene_tad_3/cache/demo.ENSG00000275963.unisusie.fit.stderr file. The computing script in this case is /home/hs3163/.sos/7df85b31c6799c73/singularity_run_27033.R. Most of the time, reading the .stderr file and the computing script will help you identify the bugs. If not, the problem is likely in your input data. In that case, you can start a shell in the container and open an R session inside it to debug line by line:
singularity shell containers/stephenslab.sif
R
# then run each line in the *computing script* interactively
“Cluster error” usually appears as a purple warning message in the .log file, without a clear cause. Diagnosing it is less straightforward and relies on several other files to help you identify the issue; see the section below.
Debug from various output log files
There is a more advanced, 4-step debugging procedure based on log files:
What are the log files?
cat *.log -> parser & manager error
sos status {task_id} -v4
cat *.stderr -> script error
cat *.err -> cluster error
In the notebook, you may notice this line:
&> toy_example.log
which writes a .log file. You may also notice this:
stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
which writes .stderr and .stdout files.
In the job configuration template you may notice:
#$ -o /home/{user_name}/.sos/{job_name}.out
#$ -e /home/{user_name}/.sos/{job_name}.err
which writes .out and .err files.
All these files contain different yet relevant information about the job:
- The .log file tracks the messages/status of the “master” SoS process that manages the different smaller jobs. It gives you an overview of each SoS job submitted.
- The .err file tracks messages from a particular job that you can see in your qstat output. It captures feedback from the cluster job manager for a given job submitted by the SoS master process (above).
- The .stderr file (and the .stdout file) tracks messages from the actual code or script you run. For example, if all your code does is print("hello, world"), then the .stdout file will contain the line hello, world and the .stderr file will be empty.
At a high level:
- Checking the .log file tells you whether something is wrong with the SoS job manager across all jobs in the submission.
- Checking the .err file tells you whether something is wrong with the cluster job manager for a particular job.
- Checking the .stderr file tells you whether something is wrong with your own code.
So when you think your job has failed, check these files in the order outlined above.
How to debug from these files?
The first step, cat *.log, should show you the status of the jobs:
cat ~/Work/MASH_test_h_rt5/sumstat.log
INFO: Running processing_1:
INFO: tde5c207196319c89 submitted to csg2 with job id 3939924
INFO: ta46449c96b861d6f submitted to csg2 with job id 3939925
INFO: t5530b8f3c07275cf submitted to csg2 with job id 3939926
INFO: t3348c7def618a6b7 submitted to csg2 with job id 3939927
Each of these task_id values corresponds to a job submitted to the cluster and managed by SoS, so the task_id is different from the cluster’s job ID that you find in qstat. You should identify the problematic jobs by reading the output above and note their task_id.
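If the log is long, a quick way to spot problematic tasks is to search the .log file; the keywords below are just common ones and may differ in your setup:
# flag tasks whose submission or execution went wrong in the master log
grep -E "ERROR|WARNING|aborted|failed" ~/Work/MASH_test_h_rt5/sumstat.log
# given a cluster job id from qstat, find the matching SoS task_id on the same line
grep "job id 3939924" ~/Work/MASH_test_h_rt5/sumstat.log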
The second step is to find the actual job submitted for the task_id in question. To do so, run
sos status tde5c207196319c89 -v4
as an example. It should print lots of details about the job, see below:
tde5c207196319c89 aborted
Created 1 day 4 hr ago
Started 1 day 4 hr ago
Ran for 13 min
TASK:
=====
R(fr"""library("dplyr")
library("tibble")
library("purrr")
library("readr")
extract_data = function(vcfObj,keep_snp)
vcfObj = VariantAnnotation::readVcf("{_input}")
GENE = unique(VariantAnnotation::info(vcfObj)$GENE)%>%as.list()
info = VariantAnnotation::info(vcfObj)%>%as_tibble(rownames = "ID")
output_list = map(GENE, ~extract_data(vcfObj,info%>%filter(GENE %in% .x )%>%pull(ID)))
output_path = tibble(gene = GENE%>%unlist)%>%mutate(path = map_chr(GENE,~paste0("{_output:nn}.",.x,".rds")) )
walk2(output_list,output_path$path,~.x%>%saveRDS(.y))
output_path%>%write_delim("{_output}","\t",col_names = FALSE)
""", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container)
TAGS:
=====
MASH_test_h_rt5 processing_1 processing_1_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_
GLOBAL:
=======
{'container': '/mnt/vast/hpc/csg/containers/bioinfo.sif'}
ENVIRONMENT:
============
__signature_vars__ {'_input', '_output', 'R', 'container'}
_depends []
_index 1
_input [file_target(/mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_celltypes_analysis/output/data_intergration/TensorQTL/ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr11.norminal.cis_long_table.merged.vcf.gz)]
_output [file_target(/mnt/vast/hpc/csg/rf2872/Work/MASH_test_h_rt5/output/RDS/ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr11.norminal.cis_long_table.merged_rds.list)]
_runtime {'max_cores': 40,
'max_walltime': '240:00:00',
'mem': 30000000000,
'queue': 'csg2',
'run_mode': 'run',
'sig_mode': 'default',
'tags': 'processing_1_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr11.norminal.cis_long_table.merged_rds',
'trunk_size': 1,
'trunk_workers': 1,
'verbosity': 2,
'walltime': '05:00:00',
'workdir': '/mnt/vast/hpc/csg/rf2872/Work/MASH_test_h_rt5'}
container '/mnt/vast/hpc/csg/containers/bioinfo.sif'
step_name 'processing_1'
workflow_id 'w918f8be368481a89'
execution script:
================
#!/bin/sh
#$ -l h_rt=05:00:00
#$ -l h_vmem=30G
#$ -pe openmp 1
#$ -N job_tde5c207196319c89
#$ -o /home/rf2872/.sos/tde5c207196319c89.out
#$ -e /home/rf2872/.sos/tde5c207196319c89.err
#$ -cwd
#$ -S /bin/bash
#$ -q csg2.q -l t_pri
module load Singularity
module load R/4.0
export PATH=$HOME/miniconda3/bin:$PATH
export SINGULARITY_BIND="/mnt/mfs/ctcn/:/mnt/mfs/ctcn/"
set -e
sos execute tde5c207196319c89 -v 2 -s default
standard error:
================
Task tde5c207196319c89 inactive for more than 61 seconds, might have been killed.
The most important part is the TASK section, which contains the script you actually executed. There you may find string patterns like {_output:n}.stderr, which capture the runtime information of this script ({_output:n} refers to the output file path with its extension removed).
The third step is to see what might have gone wrong with this script during its execution. To do so, you need to track its stderr. In the ENVIRONMENT section of the sos status -v4 output, you should see the _output variable, whose value is a list of file names. Depending on your workflow, there can be more than one output file. You should find a .stderr file sitting next to those output files on your file system; it captures the runtime information of the script. You can then check it as follows (if the error is not caused by your script, its content may look perfectly normal):
cat ~/Work/MASH_test_h_rt5/output/RDS/ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.bed.processed_phenotype.per_chrom_ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.log2cpm.ALL_Ast_End_Exc_Inh_Mic_OPC_Oli.covariate.pca.resid.PEER.cov.chr6.norminal.cis_long_table.merged_rds.list.stderr
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Usually, if something is wrong with your script and it quits with an error message, the .stderr file should capture that.
But if this file is empty, or it explicitly indicates that the script completed successfully, then the next thing to check is whether the output files, which you already know from the _output variable, were actually generated and are intact. If they were, the error message might simply reflect a miscommunication between job managers, which does happen from time to time. You might get away with adding -s build to your sos run command to skip these existing files the next time you rerun. Obviously, caution should be taken here: only do this if you are very sure that the output is good.
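For example, a rerun that treats the existing, verified output as up to date might look like the sketch below; the notebook, workflow and configuration names are hypothetical placeholders for your own command:
# rerun the same workflow, but rebuild signatures from the existing output files instead of recomputing them
sos run my_workflow.ipynb my_step -c csg.yml -q csg -s build &> my_workflow.log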
If you cannot find the output files and the stderr file does not tell you much, then you should look into the execution script section of the sos status -v4 output. There you will see the path of the .err file, which you should check:
cat ~/.sos/tde5c207196319c89.err
which should output the name of the compute node the job was executed on (as a result of this line of configuration), along with other possible error messages that you can Google, or post to the #computing-matters channel on Slack for discussion.
Now, if you still have no clue from the above steps, it is then time to work interactively with the job to figure out what’s going on. You have two options:
- Save the contents of the execution script section to a file, e.g. test.sh, then submit it directly to see if it works:
qsub test.sh
The goal of this step is to bypass the SoS job manager and test directly through the cluster job manager, to see what might be wrong.
- Using the information on the compute node where the failed job was executed (from the aforementioned cat *.err), use qlogin to go to that node (in this case, queue csg2.q, node node75), and directly execute the sos execute command that you find under the execution script section:
qlogin -q csg2.q -l t_pri -l h="node75"
sos execute tde5c207196319c89 -v 2 -s default
The goal of this step is to test for, and rule out, an issue with a particular node. Notice that you have not specified the requested memory or wall time in the qlogin command; it will use the defaults, which may or may not be sufficient.
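If you suspect the defaults are too small, you can request resources explicitly in the interactive session, for example by copying the h_rt and h_vmem values from the execution script shown above (adjust as needed):
# interactive session on node75 with the same wall time and memory limits as the batch job
qlogin -q csg2.q -l t_pri -l h="node75" -l h_rt=05:00:00 -l h_vmem=30G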
If the job fails on this particular node and you know that other similar jobs worked, you can try running it on another node, e.g.
qlogin -q csg2.q -l t_pri -l h="node80"
and see if it works there. If so, you know something must be wrong with node75, and you should report it to the #computing-matters channel.
Additional debug tips
- “Print” often. You can always add print statements in your script to track how far the code has progressed. The output of print can be found in the .stdout files; see the sketch below.
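For example, assuming your script prints a progress marker at each major step, you can watch the messages appear in the corresponding .stdout file while the job runs (the file path and marker below are hypothetical placeholders next to your output files):
# follow progress messages as the job writes them
tail -f output/RDS/my_result.rds.list.stdout
# or, after the jobs finish, search all stdout files for a given marker
grep -l "finished step 2" output/RDS/*.stdout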
Clean up job cache folder
All the runtime information printed via sos status is saved under ~/.sos. These files will be automatically purged from time to time, or can be purged manually using:
sos purge
As you might have noticed from previous sections, depending on the settings in the job configuration template, numerous *.out and *.err files might be saved to the ~/.sos folder as well.
A more aggressive way to clean up is to simply delete the entire folder:
cd ~
rm -rf .sos
Be cautious: when you delete this folder, all the runtime history will be lost, and a rerun will be triggered the next time you execute a previously executed command, unless you use the -s build option to skip the rerun when the output already exists.