Estimating Computing Costs for Genomics Project Budgets, January 2026

We propose a simple framework for cost estimation with examples for common genomics workflows.

Method

For cloud-based HPC, cost estimation is most accurate when based on actual instance hourly rates rather than traditional service units (SU). We use representative rates from each instance tier for straightforward estimation.

Computing cost estimation follows this formula:

Annual Cost = (Hours per task × Hourly rate × Number of tasks × Contingency factor) + Prototyping cost + Interactive session cost

We recommend a contingency factor of 2× to account for the reality that research pipelines are rarely run only once. This multiplier covers failed jobs, parameter tuning, and complete reruns.

Prototyping cost covers the initial phase of establishing and validating a pipeline before processing the full dataset. We estimate it as the cost of processing 10% of your samples during pipeline development, at the same per-sample cost as production runs.

Interactive session cost accounts for ongoing data exploration, visualization, and analysis using tools like RStudio or JupyterLab. A reasonable estimate assumes 12 hours per day, 7 days per week, throughout the year for each lab member actively working on the project. Using a cpu4mem32a instance ($0.23/hr), this amounts to approximately $1,000 per person per year.
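As a minimal sketch, the formula can be written as a short Python helper. The inputs in the example call are hypothetical, and the per-person interactive cost defaults to the $1,000/year figure above.

```python
def annual_compute_cost(hours_per_task, hourly_rate, n_tasks,
                        contingency=2.0, prototype_fraction=0.10,
                        n_lab_members=1, interactive_per_person=1000.0):
    """(Hours per task x hourly rate x number of tasks x contingency)
    + prototyping cost + interactive session cost."""
    production = hours_per_task * hourly_rate * n_tasks
    prototyping = hours_per_task * hourly_rate * prototype_fraction * n_tasks
    interactive = n_lab_members * interactive_per_person
    return production * contingency + prototyping + interactive

# Hypothetical inputs: 10 hrs/sample at $0.45/hr, 200 samples, one lab member.
print(annual_compute_cost(10, 0.45, 200))  # 2890.0
```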

MemVerge Spot Savings as Discount Factor

The compute rates considered in this post are on-demand instance list prices. With our MemVerge Spot instance integration, actual costs are typically 20% to 70% lower, depending on the application and instance availability. If your budget estimate exceeds available funds, you may, at your discretion, apply a MemVerge discount factor (0.3 to 0.8) to the total compute cost and cite the expected Spot savings in your justification. This discount applies only to compute costs, not storage.
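For example, applying a 0.6 discount factor (a hypothetical choice within the 0.3 to 0.8 range) to the compute and storage totals from Example 1 below would look like:

```python
compute_cost = 25_972   # annual compute estimate (see Example 1 below)
storage_cost = 14_472   # annual storage estimate; storage is never discounted
discount_factor = 0.6   # hypothetical choice within the 0.3-0.8 range
total = compute_cost * discount_factor + storage_cost
print(f"${total:,.0f}")  # $30,055
```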

Computing

Reference Rates

The table below shows representative hourly rates based on the lowest-priced instance in each compute resource tier.

| Workload Type | Instance | Specs | Hourly Rate | Typical Use Case |
|---|---|---|---|---|
| Light CPU | cpu2mem8a | 2 CPU, 8GB | $0.09 | Preprocessing, small scripts |
| Standard CPU | cpu4mem32a | 4 CPU, 32GB | $0.23 | BWA alignment, variant calling |
| Heavy CPU | cpu8mem64a | 8 CPU, 64GB | $0.45 | GATK, multi-sample calling |
| High Memory | cpu16mem128a | 16 CPU, 128GB | $0.90 | Sequence assembly, large matrices |
| Very High Memory | cpu32mem256a | 32 CPU, 256GB | $1.81 | Memory-intensive workflows |
| Single GPU | gpu1mem64 | 1 A10G, 64GB | $1.62 | Model training, inference |
| Multi-GPU | gpu4mem96 | 4 A10G, 96GB | $5.67 | Large model training |
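For the calculations later in this post, these reference rates can be kept in a small Python dictionary (values copied from the table above; update them as pricing changes):

```python
HOURLY_RATES = {  # USD per hour, on-demand list prices
    "cpu2mem8a": 0.09, "cpu4mem32a": 0.23, "cpu8mem64a": 0.45,
    "cpu16mem128a": 0.90, "cpu32mem256a": 1.81,
    "gpu1mem64": 1.62, "gpu4mem96": 5.67,
}
```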

Storage

See Storage Pricing for current rates.

| Storage Tier | Monthly Cost | Annual Cost | Best For |
|---|---|---|---|
| S3 Standard | $18/TB | $216/TB | Active analysis data |
| S3 Intelligent-Tiering | $11-18/TB | $132-216/TB | Mixed access patterns |
| S3 Glacier | $3.50/TB | $42/TB | Long-term archival |
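A small helper for annualizing storage costs from the monthly rates above; this is a sketch that ignores request, retrieval, and egress charges, and the dictionary and function names are illustrative only.

```python
S3_MONTHLY_PER_TB = {"standard": 18.00, "glacier": 3.50}  # USD per TB per month

def annual_storage_cost(tb, tier="standard"):
    """Annual S3 cost for `tb` terabytes in the given tier."""
    return tb * S3_MONTHLY_PER_TB[tier] * 12

print(annual_storage_cost(67))             # 14472.0 (Example 1's storage subtotal)
print(annual_storage_cost(67, "glacier"))  # 2814.0 if the same data were archived
```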

Budget Justification Template

The project will utilize the Columbia University Department of Neurology high-performance computing environment hosted on Amazon Web Services. Computing costs are estimated based on [X samples/analyses] requiring approximately [Y] CPU-hours for [workflow description] at an average rate of $[Z] per hour, totaling $[amount]. A 2× contingency factor is applied to account for pipeline development, parameter optimization, and at least one complete reanalysis cycle. An additional $[amount] is budgeted for prototyping on [N] samples to establish and validate the pipeline, and $[amount] for interactive analysis sessions ([N] lab members × $1,000/year). Compute costs assume utilization of MemVerge SpotSurfer for Spot instance integration, which historically reduces compute costs by approximately [X]%. Storage costs include [N] TB of S3 storage at $[rate]/TB/year for [data description]. Total computing costs: $[final amount] per year.

Examples

The cost estimates in these examples were generated by Claude AI and reviewed with minor adjustments by Gao Wang.

Benchmark references

Compute cost

| Workflow | Benchmark | Source |
|---|---|---|
| 30x WGS (alignment + sorting + markdup + BQSR) | 4-9 hours on 96 vCPU | NVIDIA Parabricks docs |
| BWA-MEM WGS alignment only | ~300 CPU-hours per sample | Engström et al. 2016 via CNAG |
| General alignment rule of thumb | ~1 CPU-hour per GB of compressed FASTQ | Community benchmark |

To convert these benchmarks to our instance types: a 30x WGS sample taking 4-9 hours on 96 vCPUs would take approximately 48-108 hours on a cpu8mem64a (8 vCPU) instance, or 12-27 hours on a cpu32mem256a (32 vCPU) instance.
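To do this conversion in code, a simple helper assuming near-linear scaling with vCPU count (real scaling is usually sub-linear, so treat the results as optimistic lower bounds on wall-clock time):

```python
def scale_hours(benchmark_hours, benchmark_vcpus, target_vcpus):
    """Rescale a published wall-clock benchmark to a different vCPU count."""
    return benchmark_hours * benchmark_vcpus / target_vcpus

# 30x WGS: 4-9 hours on 96 vCPUs -> cpu8mem64a (8 vCPU) and cpu32mem256a (32 vCPU)
print(scale_hours(4, 96, 8), scale_hours(9, 96, 8))    # 48.0 108.0
print(scale_hours(4, 96, 32), scale_hours(9, 96, 32))  # 12.0 27.0
```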

Typical data sizes:

| Data Type | Size | Notes |
|---|---|---|
| WGS FASTQ (30x, compressed) | ~50 GB per sample | Paired-end 150bp |
| WGS BAM (30x) | ~80 GB per sample | ~40 GB if CRAM |
| RNA-seq FASTQ | ~3-5 GB per sample | 30-50M read pairs |
| scRNA-seq (10x) | ~15-20 GB per library | Varies with cell count |
| Genotype data (PLINK .bed) | ~2.5 GB per 10K samples | At ~1M SNPs |
| Genotype data (PLINK .bed) | ~25 GB per 10K samples | At ~10M variants (WGS) |
| Genotype data (VCF, bgzipped) | ~50-200 GB per 10K samples | Depends on variant count and annotations |
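These sizes make back-of-envelope storage estimates straightforward. As a sketch, the raw-data footprint for 500 WGS samples (the scenario in Example 1) kept in S3 Standard, excluding downstream results:

```python
n_samples = 500
fastq_tb = n_samples * 50 / 1000   # 25 TB of compressed FASTQ
bam_tb = n_samples * 80 / 1000     # 40 TB of BAM files (roughly half as CRAM)
annual_cost = (fastq_tb + bam_tb) * 18 * 12   # S3 Standard at $18/TB/month
print(f"{fastq_tb + bam_tb:.0f} TB, ${annual_cost:,.0f}/yr")  # 65 TB, $14,040/yr
```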

Example 1: Whole Genome Sequencing Analysis

Process 500 whole genome samples (30x coverage) through an alignment and variant calling pipeline. Assumes 2 lab members working on the project.

Compute estimate per sample:

| Step | Instance | Wall-clock Hours | Cost/Sample |
|---|---|---|---|
| BWA-MEM alignment | cpu32mem256a | 10 hrs | $18.10 |
| Sorting and deduplication | cpu8mem64a | 2 hrs | $0.90 |
| GATK HaplotypeCaller | cpu8mem64a | 8 hrs | $3.60 |
| Quality control | cpu4mem32a | 1 hr | $0.23 |
| Compute subtotal | | ~21 hrs | $22.83 |

Storage estimate:

| Data | Size | Annual Cost |
|---|---|---|
| FASTQ (retained) | 25 TB | $5,400 |
| BAM files | 40 TB | $8,640 |
| gVCF results | 2 TB | $432 |
| Storage subtotal | 67 TB | $14,472 |

Annual budget:

| Category | Calculation | Cost |
|---|---|---|
| Production compute | 500 samples × $22.83 | $11,415 |
| With 2× contingency | $11,415 × 2 | $22,830 |
| Prototyping (50 samples) | 50 × $22.83 | $1,142 |
| Interactive sessions (2 people) | 2 × $1,000 | $2,000 |
| Compute total | | $25,972 |
| Storage | | $14,472 |
| Total Year 1 | | ~$40,444 |
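The same budget can be reproduced programmatically. This sketch simply re-applies the arithmetic in the tables above, so small rounding differences from the hand-computed values are expected.

```python
per_sample = 18.10 + 0.90 + 3.60 + 0.23       # $22.83 compute per WGS sample
n_samples, n_members = 500, 2

production  = n_samples * per_sample          # $11,415
contingency = production * 2                  # $22,830
prototyping = 0.10 * n_samples * per_sample   # $1,142 (50 prototype samples)
interactive = n_members * 1000                # $2,000
compute_total = contingency + prototyping + interactive
storage_total = (25 + 40 + 2) * 216           # 67 TB x $216/TB/yr = $14,472

print(f"Compute: ${compute_total:,.0f}")                       # Compute: $25,972
print(f"Total Year 1: ${compute_total + storage_total:,.0f}")  # Total Year 1: $40,444
```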

Example 2: Single-Nucleus RNA-seq Analysis

snRNA-seq analysis of 500 samples with cell type-specific differential expression across 6 cell types. Assumes 1 lab member working on the project.

Compute estimate per sample:

| Step | Instance | Hours | Cost/Sample |
|---|---|---|---|
| STAR alignment + quantification | cpu16mem128a | 2 hrs | $1.80 |
| Quality control and doublet removal | cpu4mem32a | 0.5 hr | $0.12 |
| Per-sample subtotal | | 2.5 hrs | $1.92 |

Batch processing (once per project):

| Step | Instance | Hours | Cost |
|---|---|---|---|
| Data aggregation | cpu16mem128a | 8 hrs | $7.20 |
| Cell type annotation | cpu8mem64a | 4 hrs | $1.80 |
| DE analysis (6 cell types) | cpu8mem64a | 6 hrs | $2.70 |
| GSEA enrichment analysis | cpu4mem32a | 3 hrs | $0.69 |
| Batch subtotal | | 21 hrs | $12.39 |

Storage estimate:

| Data | Size | Annual Cost |
|---|---|---|
| FASTQ files | 7.5 TB | $1,620 |
| STARsolo outputs (BAM + matrices) | 10 TB | $2,160 |
| Integrated AnnData objects | 0.1 TB | $22 |
| Storage subtotal | ~17.6 TB | $3,802 |

Annual budget:

| Category | Calculation | Cost |
|---|---|---|
| Production compute | (500 × $1.92) + $12.39 | $972 |
| With 2× contingency | $972 × 2 | $1,944 |
| Prototyping (50 samples) | (50 × $1.92) + $12.39 | $108 |
| Interactive sessions (1 person) | 1 × $1,000 | $1,000 |
| Compute total | | $3,052 |
| Storage | | $3,802 |
| Total Year 1 | | ~$6,854 |

Example 3: GWAS Fine-Mapping

Fine-mapping analysis across 100 genomic loci using SuSiE, with 10,000 samples. Assumes 1 lab member working on the project.

Note: Fine-mapping runtime varies significantly with the number of variants per region and with LD complexity. These are conservative estimates for typical 1-2 Mb regions.

Compute estimate per locus:

| Step | Instance | Hours | Cost/Locus |
|---|---|---|---|
| LD matrix computation | cpu8mem64a | 0.5 hr | $0.23 |
| SuSiE fine-mapping | cpu4mem32a | 0.25 hr | $0.06 |
| Colocalization | cpu4mem32a | 0.25 hr | $0.06 |
| Compute subtotal | | ~1 hr | $0.35 |

Storage estimate:

| Data | Size | Annual Cost |
|---|---|---|
| Genotype data (PLINK) | 50 GB | $11 |
| Summary statistics | 10 GB | $2 |
| LD matrices (if retained) | 200 GB | $43 |
| Fine-mapping results | 1 GB | $0.22 |
| Storage subtotal | ~260 GB | $56 |

Annual budget:

| Category | Calculation | Cost |
|---|---|---|
| Production compute | 100 loci × $0.35 | $35 |
| With 2× contingency | $35 × 2 | $70 |
| Prototyping (10 loci) | 10 × $0.35 | $4 |
| Interactive sessions (1 person) | 1 × $1,000 | $1,000 |
| Compute total | | $1,074 |
| Storage | | $56 |
| Total Year 1 | | ~$1,130 |