The Stack
Ollama orchestrates local LLM inference. Supports Llama 4, Mistral Large, Qwen, DeepSeek, and others. CRM agents integrate via Ollama’s OpenAI-compatible API. Self-hosted; no per-token bill.
Ollama runs on macOS, Linux, and Windows. Pulls quantized models (typically Q4_K_M or Q5_K_M GGUF) and serves them via an OpenAI-compatible HTTP endpoint on port 11434. The OpenAI compatibility means existing CRM agent code that points at OpenAI swaps with a base-URL change. For production, run Ollama on Linux with NVIDIA CUDA or Apple Silicon Metal acceleration; the macOS desktop app is for development only. vLLM and TGI are alternatives for higher-throughput production deployments.
ollama pull llama3.3:70b-instruct-q4_K_M
ollama serve # exposes OpenAI-compatible API at :11434
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="llama3.3:70b-instruct-q4_K_M",
messages=[{"role":"user","content":"Summarize this case"}]
)
When It Fits
Regulated industries requiring data residency. High-volume workflows where cloud LLM costs hurt. Edge deployments (retail stores, field service) without reliable internet. Custom fine-tuning workflows.
Strong fit signals. Healthcare, financial services, or government with sovereignty requirements that even AWS Bedrock or Azure OpenAI in-region can’t satisfy. Token volume above 10M/day where cloud spend exceeds $20K/month — self-host typically breaks even at that level. Edge or on-premise scenarios (retail POS, manufacturing lines, branch offices) where latency and offline capability matter. Fine-tuning workloads where you need full control over base weights and training data.
When It Doesn’t
Low volume (cloud is cheaper operationally). Workflows needing best-in-class reasoning (open models lag proprietary by 6-12 months typically). Teams without ML ops expertise (hosting has ongoing maintenance).
Concrete thresholds. Below 1M tokens/day, cloud LLM is straightforwardly cheaper after operational overhead. Tasks requiring multi-step agentic reasoning (Sonnet 4.5 leads by 8-12 points on tool-use benchmarks). Teams without dedicated MLOps capacity — Ollama is easy to start, harder to keep available 24/7 with capacity planning, security patching, and model updates. Multimodal workloads where Gemini 2.5 Pro or Claude still outperform open alternatives meaningfully.
Hybrid
Most successful deployments: open-source for high-volume classification/extraction, cloud LLM for complex reasoning. Route per task. Balance cost, quality, and control.
A working hybrid pattern. Local Ollama hosting Llama 3.3 70B or Qwen 2.5 32B for: intent classification, entity extraction, language detection, sentiment scoring, embedding generation. Cloud Claude Sonnet 4.5 or GPT-5 for: multi-step agent reasoning, complex case summarization, regulated decisions, and any task where eval scores show a meaningful quality gap. Route via Portkey or LiteLLM with per-task model selection — same API surface, different backend. Track cost-per-task and quality-per-task to validate routing decisions.
Common Failure Modes
Five recurring patterns. Running Q2 quantization to save GPU memory and getting noticeably worse outputs. Skipping prompt-cache equivalents (vLLM’s prefix caching) and paying full inference cost on repeated context. No GPU monitoring, hitting OOM in production. Treating Ollama desktop as production-ready. Forgetting model updates: open weights ship security and quality improvements that require explicit pulls.
What to Do This Week
Stand up Ollama on a single GPU box and benchmark Llama 3.3 70B against your current cloud model on 100 representative CRM prompts.