Estimating Computing Costs for Genomics Project Budgets, January 2026
Method
For cloud-based HPC, cost estimation is most accurate when based on actual instance hourly rates rather than traditional service units (SUs). We use a representative rate from each instance tier to keep estimation straightforward.
Computing cost estimation follows this formula:
Annual Cost = (Hours per task × Hourly rate × Number of tasks × Contingency factor) + Prototyping cost + Interactive session cost
We recommend a contingency factor of 2x to account for the reality that research pipelines are rarely run only once. This multiplier covers failed jobs, parameter tuning, and complete reruns of the analysis.
Prototyping cost covers the initial phase of establishing and validating a pipeline before the full dataset is processed. We estimate it as processing 10% of your samples during pipeline development, at the same per-sample cost as production runs.
Interactive session cost accounts for ongoing data exploration, visualization, and analysis using tools like RStudio or JupyterLab. A reasonable estimate assumes 12 hours per day, 7 days per week, throughout the year (roughly 4,380 hours) for each lab member actively working on the project. At the cpu4mem32a rate ($0.23/hr), this amounts to approximately $1,000 per person per year.
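As a minimal sketch of how these components combine (the function and argument names below are ours, not part of any existing tool; the defaults mirror the recommendations above):

```python
def estimate_annual_compute_cost(
    hours_per_task,                  # wall-clock hours per task (sample, locus, ...)
    hourly_rate,                     # $/hr for the chosen instance tier
    num_tasks,                       # number of production tasks
    contingency=2.0,                 # recommended 2x for failed jobs and reruns
    prototyping_fraction=0.10,       # ~10% of tasks processed during pipeline development
    num_members=1,                   # lab members doing interactive analysis
    interactive_per_person=1000.0,   # ~$1,000/person/year on cpu4mem32a
):
    """Annual cost = production x contingency + prototyping + interactive sessions."""
    per_task = hours_per_task * hourly_rate
    production = per_task * num_tasks * contingency
    prototyping = per_task * num_tasks * prototyping_fraction
    interactive = num_members * interactive_per_person
    return production + prototyping + interactive
```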
MemVerge Spot Savings as Discount Factor
The compute rates in this post are on-demand instance list prices. With our MemVerge Spot instance integration, actual costs are typically 20% to 70% lower, depending on the application and instance availability. If your budget estimate exceeds available funds, you can apply a MemVerge discount factor of 0.3 to 0.8 (i.e., multiply your total compute cost by the factor, corresponding to 70% down to 20% savings) and cite it in your budget justification. The discount applies only to compute costs, not storage.
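Continuing the sketch above, the discount factor multiplies compute costs only; storage stays at list price. The inputs below are the Example 1 totals from further down, and 0.55 is just an illustrative mid-range factor:

```python
memverge_factor = 0.55        # choose a value between 0.3 and 0.8 at your discretion
compute_total = 25_972.0      # compute total from Example 1 below
storage_total = 14_472.0      # storage total from Example 1 below (not discounted)

budget_with_discount = compute_total * memverge_factor + storage_total
print(f"Budget with MemVerge discount: ${budget_with_discount:,.0f}")  # ~$28,757
```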
Computing
Reference Rates
The table below shows representative hourly rates based on the lowest-priced instance in each compute resource tier.
| Workload Type | Instance | Specs | Hourly Rate | Typical Use Case |
|---|---|---|---|---|
| Light CPU | cpu2mem8a | 2 CPU, 8GB | $0.09 | Preprocessing, small scripts |
| Standard CPU | cpu4mem32a | 4 CPU, 32GB | $0.23 | BWA alignment, variant calling |
| Heavy CPU | cpu8mem64a | 8 CPU, 64GB | $0.45 | GATK, multi-sample calling |
| High Memory | cpu16mem128a | 16 CPU, 128GB | $0.90 | Sequence assembly, large matrices |
| Very High Memory | cpu32mem256a | 32 CPU, 256GB | $1.81 | Memory-intensive workflows |
| Single GPU | gpu1mem64 | 1 A10G, 64GB | $1.62 | Model training, inference |
| Multi-GPU | gpu4mem96 | 4 A10G, 96GB | $5.67 | Large model training |
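For quick scripting against these tiers, the rates can be kept in a small lookup table (a sketch mirroring the table above; update it whenever list prices change):

```python
# On-demand list prices from the table above, in $/hour
HOURLY_RATE = {
    "cpu2mem8a":    0.09,   # light CPU
    "cpu4mem32a":   0.23,   # standard CPU
    "cpu8mem64a":   0.45,   # heavy CPU
    "cpu16mem128a": 0.90,   # high memory
    "cpu32mem256a": 1.81,   # very high memory
    "gpu1mem64":    1.62,   # single A10G GPU
    "gpu4mem96":    5.67,   # multi-GPU (4x A10G)
}
```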
Storage
See Storage Pricing for current rates.
| Storage Tier | Monthly Cost | Annual Cost | Best For |
|---|---|---|---|
| S3 Standard | $18/TB | $216/TB | Active analysis data |
| S3 Intelligent-Tiering | $11-18/TB | $132-216/TB | Mixed access patterns |
| S3 Glacier | $3.50/TB | $42/TB | Long-term archival |
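A matching helper for storage (annual rates copied from the table; the tier keys are our own shorthand, and the Intelligent-Tiering value is simply the midpoint of its range):

```python
ANNUAL_STORAGE_RATE = {                # $/TB/year
    "s3_standard": 216.0,
    "s3_intelligent_tiering": 174.0,   # midpoint of the $132-216 range; adjust to your access pattern
    "s3_glacier": 42.0,
}

def annual_storage_cost(tb, tier="s3_standard"):
    """Annual storage cost for a given number of terabytes in one tier."""
    return tb * ANNUAL_STORAGE_RATE[tier]
```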
Budget Justification Template
The project will utilize the Columbia University Department of Neurology high-performance computing environment hosted on Amazon Web Services. Computing costs are estimated based on [X samples/analyses] requiring approximately [Y] CPU-hours for [workflow description] at an average rate of $[Z] per hour, totaling $[amount]. A 2× contingency factor is applied to account for pipeline development, parameter optimization, and at least one complete reanalysis cycle. An additional $[amount] is budgeted for prototyping on [N] samples to establish and validate the pipeline, and $[amount] for interactive analysis sessions ([N] lab members × $1,000/year). Compute costs assume utilization of MemVerge SpotSurfer for Spot instance integration, which historically reduces compute costs by approximately [X]%. Storage costs include [N] TB of S3 storage at $[rate]/TB/year for [data description]. Total computing costs: $[final amount] per year.
Examples
The cost estimates in these examples were generated by Claude AI and reviewed with minor adjustments by Gao Wang.
Benchmark references
Compute cost
| Workflow | Benchmark | Source |
|---|---|---|
| 30x WGS (alignment + sorting + markdup + BQSR) | 4-9 hours on 96 vCPU | NVIDIA Parabricks docs |
| BWA-MEM WGS alignment only | ~300 CPU-hours per sample | Engström et al. 2016 via CNAG |
| General alignment rule of thumb | ~1 CPU-hour per GB of compressed FASTQ | Community benchmark |
To convert these benchmarks to our instance types, scale wall-clock time by vCPU count (assuming roughly linear scaling): a 30x WGS sample taking 4-9 hours on 96 vCPUs would take approximately 48-108 hours on a cpu8mem64a (8 vCPU) instance, or 12-27 hours on a cpu32mem256a (32 vCPU) instance.
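The conversion is plain proportional scaling by vCPU count, which is only a rough approximation since real pipelines rarely scale perfectly linearly:

```python
def scale_hours(hours_on_reference, reference_vcpus, target_vcpus):
    """Rescale benchmark wall-clock hours to a different vCPU count, assuming linear scaling."""
    return hours_on_reference * reference_vcpus / target_vcpus

# 30x WGS pipeline: 4-9 hours on 96 vCPU (Parabricks benchmark)
print(scale_hours(4, 96, 8),  scale_hours(9, 96, 8))   # 48.0 108.0 hrs on cpu8mem64a
print(scale_hours(4, 96, 32), scale_hours(9, 96, 32))  # 12.0 27.0 hrs on cpu32mem256a
```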
Typical data sizes:
| Data Type | Size | Notes |
|---|---|---|
| WGS FASTQ (30x, compressed) | ~50 GB per sample | Paired-end 150bp |
| WGS BAM (30x) | ~80 GB per sample | ~40 GB if CRAM |
| RNA-seq FASTQ | ~3-5 GB per sample | 30-50M read pairs |
| scRNA-seq (10x) | ~15-20 GB per library | Varies with cell count |
| Genotype data (PLINK .bed) | ~2.5 GB per 10K samples | At ~1M SNPs |
| Genotype data (PLINK .bed) | ~25 GB per 10K samples | At ~10M variants (WGS) |
| Genotype data (VCF, bgzipped) | ~50-200 GB per 10K samples | Depends on variant count and annotations |
Example 1: Whole Genome Sequencing Analysis
Process 500 whole-genome samples (30x coverage) through an alignment and variant-calling pipeline. Assumes 2 lab members working on the project.
Compute estimate per sample:
| Step | Instance | Wall-clock Hours | Cost/Sample |
|---|---|---|---|
| BWA-MEM alignment | cpu32mem256a | 10 hrs | $18.10 |
| Sorting and deduplication | cpu8mem64a | 2 hrs | $0.90 |
| GATK HaplotypeCaller | cpu8mem64a | 8 hrs | $3.60 |
| Quality control | cpu4mem32a | 1 hr | $0.23 |
| Compute subtotal | | ~21 hrs | $22.83 |
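The per-sample subtotal is just the sum of hours × hourly rate over these steps; a sketch reusing the HOURLY_RATE lookup from the Reference Rates section:

```python
# (instance, wall-clock hours) for each step of the per-sample WGS pipeline above
WGS_STEPS = [
    ("cpu32mem256a", 10),  # BWA-MEM alignment
    ("cpu8mem64a",    2),  # sorting and deduplication
    ("cpu8mem64a",    8),  # GATK HaplotypeCaller
    ("cpu4mem32a",    1),  # quality control
]

per_sample_cost = sum(HOURLY_RATE[instance] * hours for instance, hours in WGS_STEPS)
print(f"${per_sample_cost:.2f} per sample")  # $22.83
```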
Storage estimate:
| Data | Size | Annual Cost |
|---|---|---|
| FASTQ (retained) | 25 TB | $5,400 |
| BAM files | 40 TB | $8,640 |
| gVCF results | 2 TB | $432 |
| Storage subtotal | 67 TB | $14,472 |
Annual budget:
| Category | Calculation | Cost |
|---|---|---|
| Production compute | 500 samples × $22.83 | $11,415 |
| Contingency (2×) | $11,415 × 2 | $22,830 |
| Prototyping (50 samples) | 50 × $22.83 | $1,142 |
| Interactive sessions (2 people) | 2 × $1,000 | $2,000 |
| Compute total | | $25,972 |
| Storage | | $14,472 |
| Total Year 1 | | ~$40,444 |
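Putting the pieces together for this example, reusing annual_storage_cost from the storage sketch above (all inputs are the figures from the tables, so this only reproduces the arithmetic):

```python
per_sample    = 22.83                      # compute subtotal per sample (table above)
production    = 500 * per_sample           # $11,415
contingency   = production * 2             # $22,830
prototyping   = 50 * per_sample            # ~$1,142
interactive   = 2 * 1000                   # $2,000
compute_total = contingency + prototyping + interactive    # ~$25,972

storage_total = annual_storage_cost(25 + 40 + 2)            # 67 TB S3 Standard -> $14,472
print(f"Total Year 1: ${compute_total + storage_total:,.0f}")  # ~$40,444
```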
Example 2: Single-Nucleus RNA-seq Analysis
snRNA-seq analysis of 500 samples with cell type-specific differential expression across 6 cell types. Assumes 1 lab member working on the project.
Compute estimate per sample:
| Step | Instance | Hours | Cost/Sample |
|---|---|---|---|
| STAR alignment + quantification | cpu16mem128a | 2 hrs | $1.80 |
| Quality control and doublet removal | cpu4mem32a | 0.5 hr | $0.12 |
| Per-sample subtotal | | 2.5 hrs | $1.92 |
Batch processing (once per project):
| Step | Instance | Hours | Cost |
|---|---|---|---|
| Data aggregation | cpu16mem128a | 8 hrs | $7.20 |
| Cell type annotation | cpu8mem64a | 4 hrs | $1.80 |
| DE analysis (6 cell types) | cpu8mem64a | 6 hrs | $2.70 |
| GSEA enrichment analysis | cpu4mem32a | 3 hrs | $0.69 |
| Batch subtotal | | 21 hrs | $12.39 |
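Because the batch steps run once per project rather than per sample, production compute here is N × per-sample cost plus the one-off batch cost; the same batch cost is also incurred during prototyping. A self-contained sketch using the figures above:

```python
n_samples   = 500
per_sample  = 1.92    # STAR alignment + QC, per sample (table above)
batch       = 12.39   # aggregation, annotation, DE, GSEA (once per project)

production  = n_samples * per_sample + batch   # ~$972
prototyping = 50 * per_sample + batch          # ~$108
print(f"production ~ ${production:,.0f}, prototyping ~ ${prototyping:,.0f}")
```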
Storage estimate:
| Data | Size | Annual Cost |
|---|---|---|
| FASTQ files | 7.5 TB | $1,620 |
| STARsolo outputs (BAM + matrices) | 10 TB | $2,160 |
| Integrated AnnData objects | 0.1 TB | $22 |
| Storage subtotal | ~17.6 TB | $3,802 |
Annual budget:
| Category | Calculation | Cost |
|---|---|---|
| Production compute | (500 × $1.92) + $12.39 | $972 |
| Contingency (2×) | $972 × 2 | $1,944 |
| Prototyping (50 samples) | (50 × $1.92) + $12.39 | $108 |
| Interactive sessions (1 person) | 1 × $1,000 | $1,000 |
| Compute total | | $3,052 |
| Storage | | $3,802 |
| Total Year 1 | | ~$6,854 |
Example 3: GWAS Fine-Mapping
Fine-mapping analysis across 100 genomic loci using SuSiE, with 10,000 samples. Assumes 1 lab member working on the project.
Note: Fine-mapping runtime varies significantly with the number of variants per region and with LD complexity. These are conservative estimates for typical 1-2 Mb regions.
Compute estimate per locus:
| Step | Instance | Hours | Cost/Locus |
|---|---|---|---|
| LD matrix computation | cpu8mem64a | 0.5 hr | $0.23 |
| SuSiE fine-mapping | cpu4mem32a | 0.25 hr | $0.06 |
| Colocalization | cpu4mem32a | 0.25 hr | $0.06 |
| Compute subtotal | | ~1 hr | $0.35 |
Storage estimate:
| Data | Size | Annual Cost |
|---|---|---|
| Genotype data (PLINK) | 50 GB | $11 |
| Summary statistics | 10 GB | $2 |
| LD matrices (if retained) | 200 GB | $43 |
| Fine-mapping results | 1 GB | $0.22 |
| Storage subtotal | ~260 GB | $56 |
Annual budget:
| Category | Calculation | Cost |
|---|---|---|
| Production compute | 100 loci × $0.35 | $35 |
| Contingency (2×) | $35 × 2 | $70 |
| Prototyping (10 loci) | 10 × $0.35 | $4 |
| Interactive sessions (1 person) | 1 × $1,000 | $1,000 |
| Compute total | | $1,074 |
| Storage | | $56 |
| Total Year 1 | | ~$1,130 |