Hardware
NVIDIA H100 and H200 GPUs are the production inference workhorses for 2026; the B200 (Blackwell) began shipping in volume in late 2025 and changes the cost math for the largest deployments. Consumer GPUs (RTX 4090, 5090) work for development and small internal tools. Cloud GPU instances (AWS p5 with H100, p5e with H200, Azure ND H100/H200 v5, GCP A3 Mega) suit variable workloads. Rough rule of thumb: one H100 handles about 20 concurrent users at acceptable latency for a 70B-class model with reasonable context length; a B200 roughly doubles that. Llama 3.3 70B, Mistral Large 2, Qwen2.5-72B, and DeepSeek-V3 are the leading open-weight choices for CRM-grade work.
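The rule of thumb above turns into a quick sizing estimate. The sketch below is a planning aid only: the per-GPU user counts come straight from the paragraph, while the `headroom` safety margin and the function name are assumptions added for illustration.

```python
import math

# Rule-of-thumb figures from the text; planning estimates for a
# 70B-class model, not measured benchmarks.
USERS_PER_GPU = {"H100": 20, "B200": 40}


def gpus_needed(concurrent_users: int, gpu: str = "H100", headroom: float = 0.25) -> int:
    """Estimate GPU count for a peak concurrent-user load.

    `headroom` is a hypothetical safety margin (25% by default) so the
    fleet is not sized exactly at the rule-of-thumb limit.
    """
    if concurrent_users < 0:
        raise ValueError("concurrent_users must be non-negative")
    per_gpu = USERS_PER_GPU[gpu]
    return math.ceil(concurrent_users * (1 + headroom) / per_gpu)


if __name__ == "__main__":
    print(gpus_needed(150, "H100"))  # 10
    print(gpus_needed(150, "B200"))  # 5
```

Measure real throughput with your own prompts and context lengths before committing to a number; long contexts cut the users-per-GPU figure sharply.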
Orchestration
vLLM and TGI (Text Generation Inference) lead production inference serving. SGLang and TensorRT-LLM are the closest alternatives for specific workloads. Run on Kubernetes for scaling, with NVIDIA’s GPU Operator and the Kubernetes device plugin handling GPU scheduling. Monitor with Prometheus and Grafana, capturing queue depth, time-to-first-token, tokens-per-second, GPU utilization, and KV cache hit rate. Tag usage by team so cost tracking and internal chargeback are possible. Treat LLM serving as a first-class platform service with an SRE rotation, not a research project.
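Two of the metrics named above, time-to-first-token and tokens-per-second, are easy to define loosely and then measure inconsistently. The sketch below pins down one reasonable definition of each from per-request timestamps. The `RequestTrace` fields are hypothetical names, not the schema of vLLM, TGI, or any other server; in practice these values would come from the serving layer's own metrics endpoint.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds for one request; field names are hypothetical."""
    enqueued_at: float      # request accepted by the server
    first_token_at: float   # first output token emitted
    finished_at: float      # last output token emitted
    output_tokens: int


def time_to_first_token(t: RequestTrace) -> float:
    """Seconds from acceptance to first token, so queue wait is included."""
    return t.first_token_at - t.enqueued_at


def decode_tokens_per_second(t: RequestTrace) -> float:
    """Output tokens per second once the first token has arrived."""
    decode_time = t.finished_at - t.first_token_at
    if decode_time <= 0 or t.output_tokens <= 1:
        return 0.0
    return (t.output_tokens - 1) / decode_time


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    traces = [
        RequestTrace(0.0, 0.5, 2.5, 41),
        RequestTrace(1.0, 1.2, 4.2, 91),
    ]
    ttfts = [time_to_first_token(t) for t in traces]
    print("p95 TTFT (s):", percentile(ttfts, 95))
```

Alert on tail percentiles rather than averages: a healthy mean can hide a queue that is starving a minority of users.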
Production stack — typical
Inference: vLLM 0.6+ on Kubernetes
Routing: LiteLLM or NVIDIA Triton router
Quantization: AWQ or GPTQ for 70B class -> single H100
Caching: KV cache + prefix caching enabled
Metrics: Prometheus, Grafana
Tracing: OpenTelemetry + Langfuse
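The quantization line in the stack above rests on simple memory arithmetic: a 70B-parameter model needs about 140 GB for weights at 16 bits, which does not fit an 80 GB H100, and about 35 GB at 4 bits, which does. The sketch below works that out. The `kv_reserve_gb` allowance is a hypothetical placeholder for KV cache and activations, and the calculation ignores the small overhead that AWQ and GPTQ add for scales and zero points.

```python
H100_MEMORY_GB = 80  # memory capacity of an 80 GB H100


def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_on_one_gpu(
    params_billions: float,
    bits_per_weight: float,
    gpu_memory_gb: float = H100_MEMORY_GB,
    kv_reserve_gb: float = 20.0,
) -> bool:
    """True if weights plus a reserved KV-cache budget fit in GPU memory.

    kv_reserve_gb is an assumed allowance; the real figure depends on
    context length, batch size, and the serving engine.
    """
    needed = weight_memory_gb(params_billions, bits_per_weight) + kv_reserve_gb
    return needed <= gpu_memory_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(bits, weight_memory_gb(70, bits), fits_on_one_gpu(70, bits))
```

The 8-bit case is the instructive one: 70 GB of weights technically fits in 80 GB, but leaves too little room for KV cache to serve many concurrent users.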
Cost Math
A self-hosted 70B-class model lands at roughly $0.20-0.40 per million tokens at scale once GPU amortization, electricity, and ops time are factored in. Proprietary API pricing in 2026: Anthropic Claude Sonnet at ~$3 input / $15 output per million tokens, OpenAI GPT-5 in similar territory, Gemini comparable. The break-even sits around 10-30M tokens per month depending on the workload mix and the discount on committed-use pricing for the proprietary alternative. Below 10M tokens per month, commercial APIs are operationally cheaper because the ops overhead of self-hosting amortizes poorly. Above 100M tokens per month, self-hosting is hard to beat on raw cost, but factor in the model upgrade treadmill that proprietary providers absorb for free.
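The comparison above can be reproduced with your own inputs. The sketch below is one way to set it up, under stated assumptions: the $3 / $15 API prices come from the paragraph, but the throughput figure, the input/output mix, and the fixed monthly overhead are hypothetical values you must replace with measurements from your own workload.

```python
import math


def self_hosted_cost_per_million(
    gpu_hourly_usd: float, tokens_per_second: float, utilization: float = 1.0
) -> float:
    """USD per million tokens for one GPU at a given aggregate throughput."""
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    millions_per_hour = tokens_per_second * utilization * 3600 / 1e6
    return gpu_hourly_usd / millions_per_hour


def api_cost_per_million(
    input_share: float, input_price: float = 3.0, output_price: float = 15.0
) -> float:
    """Blended API price per million tokens for a given input/output mix."""
    return input_share * input_price + (1 - input_share) * output_price


def break_even_millions_per_month(
    fixed_monthly_usd: float, api_per_million: float, self_hosted_per_million: float
) -> float:
    """Monthly volume, in millions of tokens, where the two options cost the same."""
    saving = api_per_million - self_hosted_per_million
    if saving <= 0:
        return math.inf  # the API is never more expensive per token
    return fixed_monthly_usd / saving


if __name__ == "__main__":
    # Assumed: $2.50 per GPU-hour and 2,500 tokens/s aggregate throughput.
    print(round(self_hosted_cost_per_million(2.50, 2500), 3))
    # Assumed: 80% of tokens are input, 20% are output.
    print(round(api_cost_per_million(0.8), 2))
```

The break-even is most sensitive to `fixed_monthly_usd` (staff time, idle capacity, tooling), so estimate that term first and run the calculation across a range of values.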
Skills Required
Self-hosting requires ML ops or platform engineering expertise. Model updates (re-quantize, re-test, re-deploy), scaling under load, incident response when a GPU faults, and model swaps when a better open-weight model releases are all continuous work. Do not embark on self-hosting unless you have the team or the budget to hire one. The 2026 rate for an experienced LLM platform engineer is $200-350K total comp in major US markets; staff a minimum of two so the on-call rotation is humane. Pilot on cloud GPU first; avoid upfront hardware commitment until the workload is stable.
When Self-Hosting Wins
Beyond the cost math, self-hosting wins when data residency rules forbid sending content to a third-party API (defense, classified, certain healthcare and financial workloads), when the use case requires fine-tuning or significant prompt customization, and when compliance posture (FedRAMP High, IL5/IL6, certain EU sovereignty requirements) makes proprietary APIs awkward. The 2026 EU sovereign-cloud push, including the EuroStack initiative, has put more European enterprises on the self-hosted path.
Common Failure Modes
The recurring failures: starting on self-hosted before the workload is large enough; underestimating ops capacity; running an open-weight model six months stale because nobody scheduled the upgrade; losing the eval harness between version swaps so quality regressions go undetected; and treating GPU capacity as fixed instead of bursting to cloud during peaks.
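The eval-harness failure is the cheapest one to guard against. A minimal sketch of a regression gate, assuming each model version is scored per task on a fixed eval set with higher scores being better; the task names and the 0.02 tolerance are hypothetical.

```python
def regression_report(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return score deltas for tasks where the candidate regressed.

    A task is flagged when the candidate scores more than `tolerance`
    below the baseline. Raises KeyError if the candidate skipped a task,
    so a shrinking eval set cannot pass silently.
    """
    missing = sorted(set(baseline) - set(candidate))
    if missing:
        raise KeyError(f"candidate is missing eval tasks: {missing}")
    return {
        task: candidate[task] - base
        for task, base in baseline.items()
        if base - candidate[task] > tolerance
    }


if __name__ == "__main__":
    baseline = {"summarize_call": 0.90, "extract_fields": 0.80}
    candidate = {"summarize_call": 0.91, "extract_fields": 0.70}
    regressions = regression_report(baseline, candidate)
    if regressions:
        print("block rollout, regressed tasks:", sorted(regressions))
```

Keeping the baseline scores in version control alongside the deployment config makes the gate survive model swaps, which is exactly where harnesses tend to get lost.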
Cost Considerations
H100 list price was around $30-40K through 2024-2025; reserved cloud capacity runs $2-3 per GPU-hour. The B200 commands a premium and shifts the math. Budget electricity, cooling, and data-center contract terms separately from compute. Plan one full re-quantization and re-eval cycle every 6 months at a minimum.
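The buy-versus-rent question reduces to how many rented GPU-hours equal the purchase price. The sketch below uses midpoints of the ranges above ($35K purchase, $2.50 per hour) and deliberately leaves out electricity, cooling, and hosting, so real payback takes longer than the figure it prints. The utilization parameter is an assumption you should set from observed load.

```python
HOURS_PER_YEAR = 24 * 365


def payback_hours(purchase_usd: float, cloud_hourly_usd: float) -> float:
    """GPU-hours of cloud rental that cost as much as buying the card."""
    if cloud_hourly_usd <= 0:
        raise ValueError("cloud_hourly_usd must be positive")
    return purchase_usd / cloud_hourly_usd


def payback_years(
    purchase_usd: float, cloud_hourly_usd: float, utilization: float = 1.0
) -> float:
    """Years to reach payback when the GPU is busy `utilization` of the time.

    Excludes electricity, cooling, and data-center costs.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hours = payback_hours(purchase_usd, cloud_hourly_usd)
    return hours / (HOURS_PER_YEAR * utilization)


if __name__ == "__main__":
    print(payback_hours(35_000, 2.50))               # 14000.0
    print(round(payback_years(35_000, 2.50), 2))     # about 1.6 at full load
    print(round(payback_years(35_000, 2.50, 0.5), 2))
```

At half utilization the payback period doubles, which is the arithmetic behind piloting on cloud GPUs until the workload is stable.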
What to do this week
Estimate your monthly token volume across all CRM AI use cases. If under 10M, self-hosting is unlikely to pay back this year; if over 50M, the math is worth modeling carefully. Either way, the volume number is the input every other decision flows from.
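A spreadsheet works for this estimate, and so does a few lines of code. The sketch below sums tokens across use cases and applies the 10M and 50M thresholds from the paragraph above. The use-case names and volumes are invented for illustration, and the wording for the band between the two thresholds is an assumption, since the text does not give guidance for it.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    requests_per_day: int
    tokens_per_request: int  # prompt plus completion


def monthly_tokens(use_cases: list[UseCase], days: int = 30) -> int:
    """Total tokens per month across all use cases."""
    return sum(u.requests_per_day * u.tokens_per_request for u in use_cases) * days


def recommendation(total_tokens: int) -> str:
    """Map monthly volume to the guidance in the text (10M and 50M cutoffs)."""
    millions = total_tokens / 1e6
    if millions < 10:
        return "self-hosting unlikely to pay back this year"
    if millions > 50:
        return "model the self-hosting math carefully"
    return "in between: re-estimate as volume grows"  # assumed wording


if __name__ == "__main__":
    # Hypothetical CRM workloads; replace with your own measurements.
    cases = [
        UseCase("email drafting", 200, 1500),
        UseCase("call summaries", 50, 4000),
    ]
    total = monthly_tokens(cases)
    print(total, recommendation(total))
```

Pull `requests_per_day` and `tokens_per_request` from API billing exports or gateway logs where possible; guessed token counts are often off by a wide margin.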