The Release
Meta’s Llama 4 ships in two main flavors: Scout (109B params, 17B active, 10M context window) and Maverick (400B, 17B active, quality leader). Mixture-of-experts architecture — lower inference cost per call than dense models.
The MoE architecture activates only 17B parameters per token, so per-token compute approximates a 17B dense model while quality approaches frontier dense models 5-10x larger. License remains the Llama Community License with the 700M monthly active user threshold for commercial use — most CRM deployments are well under that ceiling. Available via Bedrock, Vertex AI, Azure AI Foundry, Together, and Groq, with fp8 weights downloadable for self-host. A third sibling, Behemoth (~2T parameters), is in training as a teacher model.
Multimodal
Both Scout and Maverick natively multimodal. Image + text inputs handled. For CRM use cases involving customer-submitted photos (claims, support, product feedback), multimodal capabilities matter.
Native multimodal means image and text share a unified token space, not a bolted-on vision encoder. Resolution caps at 1024x1024 per image with up to 8 images per request. Handles charts, screenshots, and document photos at quality competitive with Gemini 2.5 Pro on the MMMU benchmark. CRM-relevant use cases: insurance claim photo triage, retail return verification, B2B case attachments where the agent needs to see the screenshot before responding.
Context Window
Scout’s 10M context is remarkable. Full account histories, year-long conversation transcripts, complete product catalogs — all fit in one prompt. Opens use cases previously requiring elaborate RAG engineering.
The 10M window enables a class of CRM workflows where retrieval was the bottleneck. Drop a five-year account history (calls, emails, opportunities, support cases) into a single prompt without chunking. Quality on long-context retrieval (needle-in-haystack) holds above 95% across the full window in published benchmarks. Trade-off: latency scales with context, and long-context inference still costs real money — Together AI lists Scout at roughly $0.18/M input tokens for the long-context tier. RAG remains cheaper for most production loads; Scout’s window is the right tool for one-shot deep analysis.
Where It Fits
High-volume agentic workflows where open-source inference is cheaper than proprietary APIs. Long-context tasks (full-history customer support). Multimodal CRM use cases. Self-hosted deployments for regulated data.
Strong fit signals. Token budget exceeding $50K/month where switching to Llama 4 on Together or Groq cuts spend 60-80%. Regulatory or sovereignty constraints requiring on-prem or specific-region inference. Multimodal workflows at scale where Gemini’s price is acceptable but lock-in is not. Weak fit. Reasoning-heavy multi-step agentic workflows where Claude Sonnet 4.5 still outperforms by 8-12 points on tool-use benchmarks. Cases where time-to-market beats inference cost.
Cost Considerations
Self-hosting Maverick requires roughly 8x H100 GPUs at fp8 for production-grade throughput, costing $25-$40K/month on demand or $300-$600K capex. Scout fits on 4x H100s. Hosted inference (Together, Groq, Bedrock) eliminates the hardware lift and prices Scout 1.5-2.5x cheaper per token than Claude Haiku for comparable tasks. Most teams should start hosted and only self-host when token volume justifies $20K+ monthly hardware spend.
What to Do This Week
Run a single high-volume CRM workflow on Scout via Together or Bedrock and compare cost-per-resolution against your incumbent provider over 1,000 requests.