Inference workloads demand different hardware than training. Our inference nodes prioritize low latency, high throughput, and cost-per-token efficiency for serving LLMs, vision models, and embeddings at scale.
From single-GPU workstations for prototyping to multi-GPU servers for production deployments, we offer configurations validated for popular inference frameworks.
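As a rough illustration of the cost-per-token metric mentioned above, the sketch below divides an assumed hourly GPU price by an assumed sustained throughput; both numbers are placeholders, not quoted pricing or benchmark results.

```python
# Back-of-the-envelope cost-per-token arithmetic (illustrative assumptions only).
GPU_HOURLY_COST = 2.50    # assumed $/hour for one inference GPU -- placeholder
TOKENS_PER_SECOND = 100   # assumed sustained generation throughput -- placeholder

tokens_per_hour = TOKENS_PER_SECOND * 3600
cost_per_million_tokens = GPU_HOURLY_COST / tokens_per_hour * 1_000_000
print(f"~${cost_per_million_tokens:.2f} per 1M generated tokens")
# -> ~$6.94 per 1M generated tokens under these assumptions
```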
*GPU illustration: H100 80GB HBM3*
- Desktop inference for development and prototyping
- Production inference with multi-GPU parallelism (see the serving sketch below)
- High-throughput inference for large context windows
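As a minimal sketch of how these tiers map to a popular inference framework, the example below uses vLLM's offline Python API. The model name, dtype, and `tensor_parallel_size` value are illustrative assumptions, not a validated configuration; substitute your own checkpoint and GPU count.

```python
# Minimal vLLM sketch: the same code runs a single-GPU prototype or a
# multi-GPU production deployment by changing tensor_parallel_size.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    tensor_parallel_size=2,                    # 1 for a single-GPU workstation
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of tensor parallelism."], params)
print(outputs[0].outputs[0].text)
```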
| Model Size | Recommended GPU | Batch Size | Tokens/sec (est.) |
|---|---|---|---|
| 7B params | RTX 6000 Ada (48GB) | 1-8 | ~80-120 |
| 13B params | RTX 6000 Ada (48GB) | 1-4 | ~50-70 |
| 70B params | 2× L40S (48GB each) | 1-2 | ~30-40 |
| 405B params | 2× H200 (141GB each) | 1 | ~15-25 |
* Estimates assume typical context lengths, with FP16/BF16 precision for models that fit in memory at full precision; 70B+ deployments generally require weight quantization to fit within the listed GPU memory. Actual throughput varies by model architecture, quantization level, and serving framework.
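To see why model size drives the GPU recommendations above, here is a rough fit check that adds FP16 weight memory to an approximate KV-cache estimate. The layer, head, and context values are assumptions typical of a 70B-class decoder model, not measured figures.

```python
# Rough memory fit check: FP16 weights plus an approximate KV cache vs. available GPU memory.
# All constants are assumptions for a typical decoder-only transformer.
def fits_in_memory(params_billion, gpu_gb, bytes_per_param=2,
                   layers=80, kv_heads=8, head_dim=128, context=8192, batch=1):
    weights_gb = params_billion * bytes_per_param        # e.g. 70B at FP16 ~ 140 GB
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes, per token
    kv_gb = 2 * layers * kv_heads * head_dim * 2 * context * batch / 1e9
    total_gb = weights_gb + kv_gb
    return total_gb, total_gb <= gpu_gb

total, ok = fits_in_memory(70, 96)            # 70B on 2x L40S (96 GB total)
print(f"~{total:.0f} GB needed, fits: {ok}")  # does not fit at FP16; quantization needed
```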
Our voice agent can help you select the right inference hardware based on your model size, latency requirements, and budget.