How to Size a GPU Cluster for LLM Training
Learn how to calculate the right number of GPUs, memory, and networking for training large language models efficiently.
So you're building an LLM. Congrats! Now comes the hard part: figuring out how many GPUs you actually need.
Too few? Your training crawls. Too many? You're burning cash on idle silicon. Let's do the math and get this right.
The Quick Formula (For the Impatient)
Minimum GPUs to hold the model weights (FP32 weights plus ~20% overhead; full training memory comes in Step 1):
```
GPUs = (Parameters × 4 bytes × 1.2) / GPU memory per GPU
```
Training Time Estimate:
```
Days = (Tokens × Parameters × 6) / (GPUs × TFLOPS × 10^12 × 86,400 × Efficiency)
```
Where efficiency is ~0.3-0.5 for real workloads.
Example: Llama 3 70B on 1T tokens
- Minimum GPUs (weights only): (70e9 × 4 bytes × 1.2) / 80 GB = 4.2 GPUs → use 8
- Training time on those 8 GPUs: (1e12 × 70e9 × 6) / (8 × 989 × 10^12 × 86,400 × 0.4) ≈ 1,536 days, i.e. over four years
Scared yet? Let's optimize this.
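If you want to sanity-check these numbers before going further, here's a minimal Python sketch of both quick formulas (989 TFLOPS and 40% efficiency are just the example values from above; the function names are mine):

```python
def min_gpus_for_weights(params: float, gpu_mem_gb: float, overhead: float = 1.2) -> float:
    """Minimum GPUs to hold FP32 weights (4 bytes/param) plus ~20% overhead."""
    weights_gb = params * 4 * overhead / 1e9
    return weights_gb / gpu_mem_gb

def training_days(tokens: float, params: float, gpus: int, tflops: float, efficiency: float) -> float:
    """Days to train: 6 FLOPs per parameter per token, divided by delivered compute."""
    total_flops = 6 * params * tokens
    flops_per_sec = gpus * tflops * 1e12 * efficiency
    return total_flops / flops_per_sec / 86_400

# Llama 3 70B on 1T tokens, 8x H100 80GB at 40% efficiency
print(min_gpus_for_weights(70e9, 80))           # ~4.2 -> round up to 8
print(training_days(1e12, 70e9, 8, 989, 0.4))   # ~1,536 days
```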
Step 1: Calculate Model Memory Requirements
Base Model Size
Your model's memory footprint in mixed-precision training:
| Component | Memory Formula |
|---|---|
| Model weights | Parameters × 4 bytes (FP32) |
| Gradients | Parameters × 4 bytes |
| Optimizer states (Adam) | Parameters × 8 bytes |
| Total training state | Parameters × 16 bytes |
(Mixed-precision recipes split this differently, e.g. BF16 weights plus an FP32 master copy, but with Adam the total still lands around 16 bytes per parameter.)
Example - Llama 3 70B:
- 70B parameters × 16 bytes = 1,120 GB
- With H100 80GB: Need 14 GPUs minimum
Add Overhead (20-30%)
Real training needs:
- Activation memory (depends on batch size)
- Communication buffers (for multi-GPU)
- Framework overhead (PyTorch/JAX)
Rule of thumb: Add 20-30% to your calculation.
Llama 3 70B adjusted: 1,120 GB × 1.25 = 1,400 GB → 18 GPUs (H100 80GB)
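Here's a minimal sketch of the Step 1 arithmetic, using the 16 bytes/parameter rule of thumb and the 25% overhead factor from above (function names and defaults are illustrative, not from any library):

```python
import math

def training_memory_gb(params: float, bytes_per_param: int = 16, overhead: float = 1.25) -> float:
    """Full training state (weights + gradients + Adam states) plus activation/framework overhead."""
    return params * bytes_per_param * overhead / 1e9

def gpus_needed(params: float, gpu_mem_gb: float = 80) -> int:
    """Smallest GPU count whose combined memory covers the training state."""
    return math.ceil(training_memory_gb(params) / gpu_mem_gb)

print(training_memory_gb(70e9))   # ~1,400 GB
print(gpus_needed(70e9))          # 18x H100 80GB
```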
Step 2: Choose Your Parallelism Strategy
Data Parallelism (DP)
- Same model on every GPU
- Different batch on each GPU
- When to use: Model fits in 1 GPU
- Scaling: near-linear until the global batch size gets too large to be useful
Tensor Parallelism (TP)
- Split model layers across GPUs
- High bandwidth required (NVLink)
- When to use: Model doesn't fit in 1 GPU
- Scaling: 2-8 GPUs max (communication bottleneck)
Pipeline Parallelism (PP)
- Split model vertically (layers → GPUs)
- Less communication than TP
- When to use: Very large models
- Scaling: 8-64 GPUs
FSDP (Fully Sharded Data Parallel)
- Meta's workhorse for Llama (PyTorch's ZeRO-style sharded data parallelism)
- Shards model, gradients, optimizer across GPUs
- When to use: Modern default for LLM training
- Scaling: 64-512+ GPUs
Llama 3 70B example config:
- FSDP across 16 GPUs (sketched below; tight against Step 1's ~1,400 GB estimate, so plan on aggressive activation checkpointing, or step up to 24-32 GPUs for headroom)
- TP=1, PP=1, DP=16
- Global batch size = 4M tokens (256 micro-batches)
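For reference, a minimal PyTorch FSDP sketch of that kind of setup. It uses a toy transformer stack as a stand-in for a real 70B model; the pieces that carry over are the auto-wrap policy at the decoder-layer boundary, FULL_SHARD sharding, and BF16 mixed precision. Launch with torchrun, one process per GPU.

```python
import functools
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


class Block(nn.Module):
    """Stand-in for a transformer decoder layer (the unit FSDP shards at)."""
    def __init__(self, dim: int = 4096):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=32, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        a, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x))
        x = x + a
        return x + self.mlp(self.norm2(x))


def main():
    dist.init_process_group("nccl")                       # one process per GPU, e.g. via torchrun
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = nn.Sequential(*[Block() for _ in range(8)])   # toy stack; swap in your real model here

    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={Block},                    # shard at the decoder-layer boundary
    )
    model = FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        sharding_strategy=ShardingStrategy.FULL_SHARD,    # shard params, grads, and optimizer state
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        device_id=torch.cuda.current_device(),
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    return model, optimizer


if __name__ == "__main__":
    main()
```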
Step 3: Calculate Training Time
The Formula (Simplified)
```
Training time (seconds) = (6 × Parameters × Tokens) / (GPUs × TFLOPS × 10^12 × Efficiency)
```
Divide by 86,400 to get days.
Where:
- 6 = FLOPs per parameter per token (forward + backward)
- TFLOPS = your GPU's dense tensor-core throughput (H100 SXM ≈ 989 TFLOPS BF16/FP16)
- Efficiency = 0.3-0.5 for real workloads (communication, CPU, I/O)
Real Examples
GPT-3 175B on 300B tokens:
- H100 config: 128 GPUs
- Training time: (6 × 175e9 × 300e9) / (128 × 989 × 10^12 × 0.4) ≈ 6.2M seconds ≈ 72 days
- Cost at $2/hr/GPU: 128 × 72 × 24 × $2 ≈ $442K
Llama 2 7B on 2T tokens:
- H100 config: 8 GPUs
- Training time: (6 × 7e9 × 2e12) / (8 × 989 × 10^12 × 0.45 × 86,400) ≈ 273 days
- Cost: 8 × 273 × 24 × $2 ≈ $105K (scaling out to 64 GPUs cuts the calendar time roughly 8x at about the same cost)
Mistral 7B on 1T tokens:
- H100 config: 8 GPUs
- Training time: ~135 days (half the tokens of the Llama 2 run above, so roughly half the time)
- Cost: ~$52K
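It's often more useful to run the formula backwards: given a deadline, how many GPUs do you need, and what does the run cost? A minimal sketch using the same assumptions as above ($2/GPU-hour, H100 at 989 TFLOPS, 40% efficiency; function names are mine):

```python
import math

def gpus_for_deadline(params, tokens, target_days, tflops=989, efficiency=0.4):
    """Solve the training-time formula for GPU count given a wall-clock target."""
    total_flops = 6 * params * tokens
    flops_per_gpu = tflops * 1e12 * efficiency * target_days * 86_400
    return math.ceil(total_flops / flops_per_gpu)

def training_cost_usd(params, tokens, tflops=989, efficiency=0.4, usd_per_gpu_hour=2.0):
    """GPU-hours are fixed by total compute, so cost barely depends on cluster size."""
    gpu_seconds = 6 * params * tokens / (tflops * 1e12 * efficiency)
    return gpu_seconds / 3600 * usd_per_gpu_hour

print(gpus_for_deadline(175e9, 300e9, target_days=30))   # ~308 GPUs for the GPT-3 run in one month
print(f"${training_cost_usd(175e9, 300e9):,.0f}")        # ~$442K regardless of GPU count
```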
Step 4: Size Your Network
Intra-Node (NVLink/NVSwitch)
For GPUs in the same server:
- NVLink 4.0: 900 GB/s per GPU (H100/H200)
- Required for: Tensor Parallelism
- Rule: TP within a single 8-GPU node
Inter-Node (InfiniBand/RoCE)
For multi-server clusters:
- 400GbE: $3K/port → Good for <64 GPUs
- InfiniBand NDR: $8K/port → Best for 64-512 GPUs
- RoCE v2: $2K/port → Budget option
Llama 3 70B @ 16 GPUs:
- 2x 8-GPU servers
- NVLink within each node
- 400GbE between nodes
- Bandwidth: 50 GB/s inter-node (sufficient)
GPT-4 @ 512 GPUs:
- 64x 8-GPU servers
- InfiniBand NDR fabric (one 400 Gb/s NIC per GPU)
- Bandwidth: ~400 GB/s aggregate per node
- Cost: ~$500K for networking alone
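To sanity-check whether an inter-node link can keep up, here's a back-of-envelope sketch. It models only a ring all-reduce of BF16 gradients and assumes no compute/communication overlap (FSDP also all-gathers parameters, adding roughly another 1.5x of traffic), so treat it as a rough floor check, not a performance model. Function names and the 3-hour batch/step assumptions are mine:

```python
def ring_allreduce_s(params, n_ranks, link_gbytes_per_s, bytes_per_elem=2):
    """Ring all-reduce pushes ~2*(N-1)/N of the gradient bytes through each link."""
    payload = params * bytes_per_elem * 2 * (n_ranks - 1) / n_ranks
    return payload / (link_gbytes_per_s * 1e9)

def step_compute_s(params, tokens_per_step, n_gpus, tflops=989, efficiency=0.4):
    """Compute time for one global step at 6 FLOPs per parameter per token."""
    return 6 * params * tokens_per_step / (n_gpus * tflops * 1e12 * efficiency)

# Llama 3 70B, 16 GPUs in 2 nodes, BF16 gradients, one ~50 GB/s (400GbE) link per node
comm = ring_allreduce_s(70e9, n_ranks=2, link_gbytes_per_s=50)   # inter-node ring of 2 nodes
compute = step_compute_s(70e9, tokens_per_step=4e6, n_gpus=16)
print(f"comm {comm:.1f}s vs compute {compute:.0f}s per step")    # ~2.8s vs ~265s: the 400GbE link keeps up
```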
Step 5: Storage & Data Pipeline
Training Data Storage
| Model Size | Dataset Size | Storage Type | Bandwidth Needed |
|---|---|---|---|
| 7B | 2TB | NVMe | 10 GB/s |
| 70B | 5TB | NVMe | 25 GB/s |
| 175B | 10TB | Parallel FS | 50 GB/s |
Don't bottleneck on I/O. A $500K GPU cluster waiting on a $5K SSD is a tragedy.
Recommended:
- <16 GPUs: 8x NVMe in RAID 0 (60 GB/s)
- 16-64 GPUs: Parallel filesystem (WekaFS, BeeGFS)
- 64+ GPUs: Dedicated storage cluster (Weka, Vast, DDN)
Checkpointing
Save model state every N hours. The sizes below are FP32 weights only; a fully resumable checkpoint with optimizer state is roughly 4x larger:
- 7B model: ~28 GB checkpoint
- 70B model: ~280 GB checkpoint
- 175B model: ~700 GB checkpoint
Bandwidth: 10-20 GB/s to persistent storage, so checkpoint writes don't stall training for long.
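To put numbers on checkpointing overhead, a minimal sketch (FP32 weights-only checkpoints, synchronous writes, and a 3-hour interval as an example assumption; async or sharded checkpointing makes this even cheaper):

```python
def checkpoint_write_s(params, write_gbytes_per_s, bytes_per_param=4):
    """Seconds to write an FP32 weights-only checkpoint at the given storage bandwidth."""
    return params * bytes_per_param / 1e9 / write_gbytes_per_s

def checkpoint_overhead(params, write_gbytes_per_s, interval_hours):
    """Fraction of wall-clock time lost to (synchronous) checkpoint writes."""
    return checkpoint_write_s(params, write_gbytes_per_s) / (interval_hours * 3600)

# 70B model, 20 GB/s to persistent storage, checkpoint every 3 hours
print(checkpoint_write_s(70e9, 20))       # ~14 s per checkpoint
print(checkpoint_overhead(70e9, 20, 3))   # ~0.0013, i.e. about 0.1% of wall-clock time
```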
Step 6: Power & Cooling Budget
Power Requirements
| Server Config | Idle | Training | Peak |
|---|---|---|---|
| 8x H100 PCIe | 2kW | 3.5kW | 4kW |
| 8x H100 SXM | 3kW | 6.5kW | 7kW |
| 8x H200 SXM | 3kW | 6.5kW | 7kW |
| 8x B200 | 4kW | 9kW | 10kW |
64 GPU cluster (8 servers):
- H100 SXM: 8 × 6.5kW = 52kW
- Add networking, storage, cooling: 70kW total
- Monthly power cost @ $0.10/kWh: $5,040
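The power math in one tiny sketch (720 hours/month and the $0.10/kWh example rate from above):

```python
def monthly_power_cost(total_kw, usd_per_kwh=0.10, hours_per_month=720):
    """Monthly electricity cost for a cluster drawing total_kw on average."""
    return total_kw * hours_per_month * usd_per_kwh

# 8 servers of 8x H100 SXM (~52 kW) plus ~18 kW of networking, storage, and cooling overhead
print(monthly_power_cost(70))   # $5,040 at $0.10/kWh
```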
Cooling
- Air cooling: workable for most 8x H100-class servers, but rack density quickly becomes the limit
- Liquid cooling: plan for it with B200-class servers or dense racks
- Cost: $50K-$200K for liquid cooling infrastructure
Real-World Cluster Configurations
Startup Config: 8x H100 PCIe
Use case: Llama 2 7B finetuning, small experiments
- GPUs: 8x H100 80GB PCIe
- Server: 2x 4U Supermicro
- Networking: 100GbE
- Storage: 32TB NVMe
- Cost: $280K
- Power: 8kW
Mid-Size: 32x H100 SXM
Use case: Llama 3 70B training, multi-project lab
- GPUs: 32x H100 80GB SXM
- Servers: 4x 8-GPU HGX
- Networking: 400GbE
- Storage: Weka 100TB
- Cost: $1.2M
- Power: 30kW
Enterprise: 256x H200 SXM
Use case: GPT-4 scale training, foundation model development
- GPUs: 256x H200 141GB SXM
- Servers: 32x 8-GPU HGX
- Networking: InfiniBand NDR
- Storage: Weka 1PB
- Cost: $12M
- Power: 250kW
The Honest Sizing Guide
For Finetuning (<10B models):
- 4-8 GPUs (H100 PCIe or RTX 6000)
- Budget: $150K-$300K
- Training time: Days to weeks
For Training Small Models (7-13B):
- 8-16 GPUs (H100 SXM)
- Budget: $350K-$700K
- Training time: 2-4 weeks at modest token budgets (a full 2T-token run takes months; see Step 3)
For Training Large Models (70B):
- 16-64 GPUs (H100/H200 SXM)
- Budget: $1M-$3M
- Training time: 1-3 months
For Foundation Models (175B+):
- 128-512 GPUs (H200/B200 SXM)
- Budget: $8M-$30M
- Training time: 3-6 months
Don't Make These Mistakes
1. Undersizing Memory
→ Your 70B model doesn't fit in 8 GPUs. Now you need 18. Oops.
2. Ignoring Networking
→ 400GbE between 64 GPUs? Enjoy 40% utilization.
3. Cheap Storage
→ Your $10M GPU cluster is bottlenecked by a $10K NAS.
4. No Redundancy
→ One GPU dies. Training stops for 2 weeks while you wait for RMA.
5. Wrong GPU Choice
→ Bought H100 PCIe for multi-GPU training. NVLink would've been 2x faster.
Need Help Sizing Your Cluster?
We've designed GPU clusters from 4 to 512 GPUs. We know the mistakes because we've made them.
Call (850) 407-7265 for a free sizing consultation. We'll run your workload analysis and quote a custom configuration.
Request Quote | View GPU Catalog
