Estimating Computing Costs for Genomics Project Budgets, January 2026

We propose a simple framework for cost estimation with examples for common genomics workflows.

Method

For cloud-based HPC, cost estimation is most accurate when based on actual instance hourly rates rather than traditional service units (SU). We use representative rates from each instance tier for straightforward estimation.

Computing cost estimation follows this formula:

Annual Cost = (Hours per task × Hourly rate × Number of tasks × Contingency factor) + Prototyping cost + Interactive session cost

We recommend a contingency factor of 2× to account for the reality that research pipelines are rarely run only once. This multiplier covers failed jobs, parameter tuning, and complete reruns.

Prototyping cost covers the initial phase of establishing and validating a pipeline before processing the full dataset. We estimate it as the cost of processing 10% of your samples during pipeline development, at the same per-sample cost as production runs.

Interactive session cost accounts for ongoing data exploration, visualization, and analysis using tools like RStudio or JupyterLab. A reasonable estimate assumes 12 hours per day, 7 days per week, throughout the year for each lab member actively working on the project. Using a cpu4mem32a instance ($0.23/hr), this amounts to approximately $1,000 per person per year.
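As a minimal sketch, the formula can be written as a short Python helper. The inputs in the example call are hypothetical, and the per-person interactive cost defaults to the $1,000/year figure above.

```python
def annual_compute_cost(hours_per_task, hourly_rate, n_tasks,
                        contingency=2.0, prototype_fraction=0.10,
                        n_lab_members=1, interactive_per_person=1000.0):
    """(Hours per task x hourly rate x number of tasks x contingency)
    + prototyping cost + interactive session cost."""
    production = hours_per_task * hourly_rate * n_tasks
    prototyping = hours_per_task * hourly_rate * prototype_fraction * n_tasks
    interactive = n_lab_members * interactive_per_person
    return production * contingency + prototyping + interactive

# Hypothetical inputs: 10 hrs/sample at $0.45/hr, 200 samples, one lab member.
print(annual_compute_cost(10, 0.45, 200))  # 2890.0
```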

MemVerge Spot Savings as Discount Factor

The compute rates considered in this post are on-demand instance list prices. With our MemVerge Spot instance integration, actual costs are typically 20% to 70% lower, depending on the application and instance availability. If your budget estimate exceeds available funds, you may, at your discretion, apply a MemVerge discount factor (0.3 to 0.8) to the total compute cost and cite the expected Spot savings in your justification. This discount applies only to compute costs, not storage.
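For example, applying a 0.6 discount factor (a hypothetical choice within the 0.3 to 0.8 range) to the compute and storage totals from Example 1 below would look like:

```python
compute_cost = 25_972   # annual compute estimate (see Example 1 below)
storage_cost = 14_472   # annual storage estimate; storage is never discounted
discount_factor = 0.6   # hypothetical choice within the 0.3-0.8 range
total = compute_cost * discount_factor + storage_cost
print(f"${total:,.0f}")  # $30,055
```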

Computing

Reference Rates

The table below shows representative hourly rates based on the lowest-priced instance in each compute resource tier.

| Workload Type | Instance | Specs | Hourly Rate | Typical Use Case |
|---|---|---|---|---|
| Light CPU | cpu2mem8a | 2 CPU, 8GB | $0.09 | Preprocessing, small scripts |
| Standard CPU | cpu4mem32a | 4 CPU, 32GB | $0.23 | BWA alignment, variant calling |
| Heavy CPU | cpu8mem64a | 8 CPU, 64GB | $0.45 | GATK, multi-sample calling |
| High Memory | cpu16mem128a | 16 CPU, 128GB | $0.90 | Sequence assembly, large matrices |
| Very High Memory | cpu32mem256a | 32 CPU, 256GB | $1.81 | Memory-intensive workflows |
| Single GPU | gpu1mem64 | 1 A10G, 64GB | $1.62 | Model training, inference |
| Multi-GPU | gpu4mem96 | 4 A10G, 96GB | $5.67 | Large model training |
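For the calculations later in this post, these reference rates can be kept in a small Python dictionary (values copied from the table above; update them as pricing changes):

```python
HOURLY_RATES = {  # USD per hour, on-demand list prices
    "cpu2mem8a": 0.09, "cpu4mem32a": 0.23, "cpu8mem64a": 0.45,
    "cpu16mem128a": 0.90, "cpu32mem256a": 1.81,
    "gpu1mem64": 1.62, "gpu4mem96": 5.67,
}
```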

Storage

See Storage Pricing for current rates.

| Storage Tier | Monthly Cost | Annual Cost | Best For |
|---|---|---|---|
| S3 Standard | $18/TB | $216/TB | Active analysis data |
| S3 Intelligent-Tiering | $11-18/TB | $132-216/TB | Mixed access patterns |
| S3 Glacier | $3.50/TB | $42/TB | Long-term archival |
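A small helper for annualizing storage costs from the monthly rates above; this is a sketch that ignores request, retrieval, and egress charges, and the dictionary and function names are illustrative only.

```python
S3_MONTHLY_PER_TB = {"standard": 18.00, "glacier": 3.50}  # USD per TB per month

def annual_storage_cost(tb, tier="standard"):
    """Annual S3 cost for `tb` terabytes in the given tier."""
    return tb * S3_MONTHLY_PER_TB[tier] * 12

print(annual_storage_cost(67))             # 14472.0 (Example 1's storage subtotal)
print(annual_storage_cost(67, "glacier"))  # 2814.0 if the same data were archived
```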

Budget Justification Template

The project will utilize the Columbia University Department of Neurology high-performance computing environment hosted on Amazon Web Services. Computing costs are estimated based on [X samples/analyses] requiring approximately [Y] CPU-hours for [workflow description] at an average rate of $[Z] per hour, totaling $[amount]. A 2× contingency factor is applied to account for pipeline development, parameter optimization, and at least one complete reanalysis cycle. An additional $[amount] is budgeted for prototyping on [N] samples to establish and validate the pipeline, and $[amount] for interactive analysis sessions ([N] lab members × $1,000/year). Compute costs assume utilization of MemVerge SpotSurfer for Spot instance integration, which historically reduces compute costs by approximately [X]%. Storage costs include [N] TB of S3 storage at $[rate]/TB/year for [data description]. Total computing costs: $[final amount] per year.

Examples

The cost estimates in these examples were generated by Claude AI and reviewed with minor adjustments by Gao Wang.

Benchmark references

Compute cost

| Workflow | Benchmark | Source |
|---|---|---|
| 30x WGS (alignment + sorting + markdup + BQSR) | 4-9 hours on 96 vCPU | NVIDIA Parabricks docs |
| BWA-MEM WGS alignment only | ~300 CPU-hours per sample | Engström et al. 2016 via CNAG |
| General alignment rule of thumb | ~1 CPU-hour per GB of compressed FASTQ | Community benchmark |

To convert these benchmarks to our instance types: a 30x WGS sample taking 4-9 hours on 96 vCPUs would take approximately 48-108 hours on a cpu8mem64a (8 vCPU) instance, or 12-27 hours on a cpu32mem256a (32 vCPU) instance.
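To do this conversion in code, a simple helper assuming near-linear scaling with vCPU count (real scaling is usually sub-linear, so treat the results as optimistic lower bounds on wall-clock time):

```python
def scale_hours(benchmark_hours, benchmark_vcpus, target_vcpus):
    """Rescale a published wall-clock benchmark to a different vCPU count."""
    return benchmark_hours * benchmark_vcpus / target_vcpus

# 30x WGS: 4-9 hours on 96 vCPUs -> cpu8mem64a (8 vCPU) and cpu32mem256a (32 vCPU)
print(scale_hours(4, 96, 8), scale_hours(9, 96, 8))    # 48.0 108.0
print(scale_hours(4, 96, 32), scale_hours(9, 96, 32))  # 12.0 27.0
```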

Typical data sizes:

| Data Type | Size | Notes |
|---|---|---|
| WGS FASTQ (30x, compressed) | ~50 GB per sample | Paired-end 150bp |
| WGS BAM (30x) | ~80 GB per sample | ~40 GB if CRAM |
| RNA-seq FASTQ | ~3-5 GB per sample | 30-50M read pairs |
| scRNA-seq (10x) | ~15-20 GB per library | Varies with cell count |
| Genotype data (PLINK .bed) | ~2.5 GB per 10K samples | At ~1M SNPs |
| Genotype data (PLINK .bed) | ~25 GB per 10K samples | At ~10M variants (WGS) |
| Genotype data (VCF, bgzipped) | ~50-200 GB per 10K samples | Depends on variant count and annotations |
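These sizes make back-of-envelope storage estimates straightforward. As a sketch, the raw-data footprint for 500 WGS samples (the scenario in Example 1) kept in S3 Standard, excluding downstream results:

```python
n_samples = 500
fastq_tb = n_samples * 50 / 1000   # 25 TB of compressed FASTQ
bam_tb = n_samples * 80 / 1000     # 40 TB of BAM files (roughly half as CRAM)
annual_cost = (fastq_tb + bam_tb) * 18 * 12   # S3 Standard at $18/TB/month
print(f"{fastq_tb + bam_tb:.0f} TB, ${annual_cost:,.0f}/yr")  # 65 TB, $14,040/yr
```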

Example 1: Whole Genome Sequencing Analysis

Process 500 whole genome samples (30x coverage) through an alignment and variant calling pipeline. Assumes 2 lab members working on the project.

Compute estimate per sample:

| Step | Instance | Wall-clock Hours | Cost/Sample |
|---|---|---|---|
| BWA-MEM alignment | cpu32mem256a | 10 hrs | $18.10 |
| Sorting and deduplication | cpu8mem64a | 2 hrs | $0.90 |
| GATK HaplotypeCaller | cpu8mem64a | 8 hrs | $3.60 |
| Quality control | cpu4mem32a | 1 hr | $0.23 |
| Compute subtotal | | ~21 hrs | $22.83 |

Storage estimate:

| Data | Size | Annual Cost |
|---|---|---|
| FASTQ (retained) | 25 TB | $5,400 |
| BAM files | 40 TB | $8,640 |
| gVCF results | 2 TB | $432 |
| Storage subtotal | 67 TB | $14,472 |

Annual budget:

| Category | Calculation | Cost |
|---|---|---|
| Production compute | 500 samples × $22.83 | $11,415 |
| With 2× contingency | $11,415 × 2 | $22,830 |
| Prototyping (50 samples) | 50 × $22.83 | $1,142 |
| Interactive sessions (2 people) | 2 × $1,000 | $2,000 |
| Compute total | | $25,972 |
| Storage | | $14,472 |
| Total Year 1 | | ~$40,444 |
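The same budget can be reproduced programmatically. This sketch simply re-applies the arithmetic in the tables above, so small rounding differences from the hand-computed values are expected.

```python
per_sample = 18.10 + 0.90 + 3.60 + 0.23       # $22.83 compute per WGS sample
n_samples, n_members = 500, 2

production  = n_samples * per_sample          # $11,415
contingency = production * 2                  # $22,830
prototyping = 0.10 * n_samples * per_sample   # $1,142 (50 prototype samples)
interactive = n_members * 1000                # $2,000
compute_total = contingency + prototyping + interactive
storage_total = (25 + 40 + 2) * 216           # 67 TB x $216/TB/yr = $14,472

print(f"Compute: ${compute_total:,.0f}")                       # Compute: $25,972
print(f"Total Year 1: ${compute_total + storage_total:,.0f}")  # Total Year 1: $40,444
```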

Example 2: Single-Nucleus RNA-seq Analysis

snRNA-seq analysis of 500 samples with cell type-specific differential expression across 6 cell types. Assumes 1 lab member working on the project.

Compute estimate per sample:

| Step | Instance | Hours | Cost/Sample |
|---|---|---|---|
| STAR alignment + quantification | cpu16mem128a | 2 hrs | $1.80 |
| Quality control and doublet removal | cpu4mem32a | 0.5 hr | $0.12 |
| Per-sample subtotal | | 2.5 hrs | $1.92 |

Batch processing (once per project):

| Step | Instance | Hours | Cost |
|---|---|---|---|
| Data aggregation | cpu16mem128a | 8 hrs | $7.20 |
| Cell type annotation | cpu8mem64a | 4 hrs | $1.80 |
| DE analysis (6 cell types) | cpu8mem64a | 6 hrs | $2.70 |
| GSEA enrichment analysis | cpu4mem32a | 3 hrs | $0.69 |
| Batch subtotal | | 21 hrs | $12.39 |

Storage estimate:

| Data | Size | Annual Cost |
|---|---|---|
| FASTQ files | 7.5 TB | $1,620 |
| STARsolo outputs (BAM + matrices) | 10 TB | $2,160 |
| Integrated AnnData objects | 0.1 TB | $22 |
| Storage subtotal | ~17.6 TB | $3,802 |

Annual budget:

| Category | Calculation | Cost |
|---|---|---|
| Production compute | (500 × $1.92) + $12.39 | $972 |
| With 2× contingency | $972 × 2 | $1,944 |
| Prototyping (50 samples) | (50 × $1.92) + $12.39 | $108 |
| Interactive sessions (1 person) | 1 × $1,000 | $1,000 |
| Compute total | | $3,052 |
| Storage | | $3,802 |
| Total Year 1 | | ~$6,854 |

Example 3: GWAS Fine-Mapping

Fine-mapping analysis across 100 genomic loci using SuSiE, with 10,000 samples. Assumes 1 lab member working on the project.

Note: Fine-mapping runtime varies significantly with the number of variants per region and with LD complexity. These are conservative estimates for typical 1-2 Mb regions.

Compute estimate per locus:

| Step | Instance | Hours | Cost/Locus |
|---|---|---|---|
| LD matrix computation | cpu8mem64a | 0.5 hr | $0.23 |
| SuSiE fine-mapping | cpu4mem32a | 0.25 hr | $0.06 |
| Colocalization | cpu4mem32a | 0.25 hr | $0.06 |
| Compute subtotal | | ~1 hr | $0.35 |

Storage estimate:

| Data | Size | Annual Cost |
|---|---|---|
| Genotype data (PLINK) | 50 GB | $11 |
| Summary statistics | 10 GB | $2 |
| LD matrices (if retained) | 200 GB | $43 |
| Fine-mapping results | 1 GB | $0.22 |
| Storage subtotal | ~260 GB | $56 |

Annual budget:

| Category | Calculation | Cost |
|---|---|---|
| Production compute | 100 loci × $0.35 | $35 |
| With 2× contingency | $35 × 2 | $70 |
| Prototyping (10 loci) | 10 × $0.35 | $4 |
| Interactive sessions (1 person) | 1 × $1,000 | $1,000 |
| Compute total | | $1,074 |
| Storage | | $56 |
| Total Year 1 | | ~$1,130 |