Time to First Token — time from sending a request until the first output token arrives. Includes network latency, queuing, and the entire prefill phase (processing all input tokens to build the KV cache). Longer prompts → bigger TTFT.
Targets: chatbot <500ms · code completion <100ms · batch jobs tolerate seconds
Time Per Output Token — average time between generating each successive token after the first. Measures the decode phase speed. Also called ITL (Inter-Token Latency) in some tools, though ITL is token-weighted while TPOT is request-weighted.
30ms TPOT ≈ 33 tok/s ≈ ~1,600 words/min (faster than reading speed)
End-to-End Request Latency — total time from request to final token. What the user actually waits for.
A fast TTFT with slow generation still yields poor perceived UX.
Tokens/sec, Requests/sec, Goodput — system-level throughput. TPS = total output tokens generated per second across all concurrent requests. RPS = completed requests per second. Goodput = requests/sec that meet your SLO thresholds (e.g. TTFT <500ms AND TPOT <15ms). Goodput is the metric that actually correlates with user satisfaction.
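Goodput as defined above can be computed directly from per-request measurements. A minimal sketch, using the example SLO thresholds from the text (function and variable names are illustrative):

```python
def goodput(requests, window_s, ttft_slo=0.5, tpot_slo=0.015):
    """Requests per second that meet BOTH SLOs (example thresholds:
    TTFT < 500ms AND TPOT < 15ms).

    `requests`: list of (ttft_s, tpot_s) tuples for requests completed
    within a `window_s`-second observation window.
    """
    good = sum(1 for ttft, tpot in requests
               if ttft < ttft_slo and tpot < tpot_slo)
    return good / window_s

# Four completed requests in a 60 s window; only two meet both SLOs
reqs = [(0.3, 0.012), (0.6, 0.010), (0.4, 0.020), (0.45, 0.014)]
gp = goodput(reqs, window_s=60)   # 2 / 60 ≈ 0.033 req/s
```

Note that raw RPS here would be 4/60, twice the goodput — the gap between the two is exactly the traffic your users perceive as broken.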
| Metric | Measures | Affected By | Optimize When |
|---|---|---|---|
| TTFT | Responsiveness start | Prompt length, KV cache build, queue depth, prefill compute | Interactive chat, streaming UX |
| TPOT / ITL | Generation smoothness | Model size, batch size, KV cache growth, decode compute | Streaming text, real-time apps |
| E2E | Total wait | TTFT + TPOT × tokens | Short-response use cases |
| TPS | System capacity | GPU count, batching strategy, memory bandwidth | Multi-user serving |
| Goodput | Useful throughput | All of the above within SLO bounds | Production SLAs |
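TTFT, TPOT, and E2E in the table above can all be derived from one set of token-arrival timestamps. A minimal sketch against any streaming token iterator (simulated here; in practice the iterator would be an OpenAI-compatible streaming response):

```python
import time

def measure_latency(stream):
    """Compute (TTFT, TPOT, E2E) in seconds from a token stream."""
    start = time.perf_counter()
    token_times = [time.perf_counter() for _ in stream]  # one timestamp per token
    ttft = token_times[0] - start        # prefill + queue + network
    e2e = token_times[-1] - start        # total wait for the final token
    # TPOT: average gap between successive tokens after the first
    tpot = (e2e - ttft) / (len(token_times) - 1)
    return ttft, tpot, e2e

def fake_stream(n=5, delay=0.01):
    """Stand-in for a real streaming response: n tokens, ~10ms apart."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tpot, e2e = measure_latency(fake_stream())
```

This also makes the E2E row concrete: by construction, `e2e == ttft + tpot × (tokens − 1)`.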
MoE replaces dense FFN layers with multiple specialized expert sub-networks and a learned router/gate that selects a subset of experts per token. Since early 2025, MoE has become the dominant architecture for frontier models — over 60% of major open-source releases use it.
Faster training: MoEs match dense model quality with far less compute.
Lower inference FLOPs: only the active parameters are computed per token. Qwen3-235B activates just 22B per token, so its inference compute is roughly that of a ~22B dense model.
Scaling capacity: Total param count → capacity. Active params → cost. You can grow capacity without proportional compute increase.
Expert specialization: Different experts learn to handle different domains/token types, improving multi-task generalization.
High VRAM: ALL experts must be loaded in memory, even though only a subset activate. Mixtral 8×7B has 47B total params but ~12B active. You need RAM for the 47B.
Load balancing: Without auxiliary losses, some experts starve while others are overloaded. Active research area in 2025-26.
Fine-tuning fragility: MoE historically overfits during fine-tuning. Newer techniques (LoRA on shared layers, expert freezing) help.
Communication overhead: Multi-GPU expert parallelism adds latency from all-to-all token routing.
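The router described above is just a small linear layer plus top-k selection. A toy token-level sketch in plain Python — all shapes, names, and the trivial "experts" are illustrative, not any real model's implementation:

```python
import math
import random

def moe_layer(x, gate_w, experts, top_k=2):
    """Toy token-level top-k MoE routing.

    x: token hidden state (list of floats); gate_w: one router weight
    vector per expert; experts: callables standing in for small FFNs.
    """
    # Router: one score per expert, then keep the top-k indices
    logits = [sum(xi * wi for xi, wi in zip(x, w)) for w in gate_w]
    top = sorted(range(len(logits)), key=logits.__getitem__)[-top_k:]
    # Softmax over the selected experts only
    exps = [math.exp(logits[i]) for i in top]
    weights = [e / sum(exps) for e in exps]
    # Only the chosen experts run — this is the "active params" saving
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        for j, v in enumerate(experts[i](x)):
            out[j] += w * v
    return out

random.seed(0)
d, n_experts = 8, 4
gate_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
# Each "expert" here is just elementwise scaling by a fixed random vector
experts = [(lambda s: (lambda x: [si * xi for si, xi in zip(s, x)]))(
    [random.gauss(0, 1) for _ in range(d)]) for _ in range(n_experts)]
y = moe_layer([random.gauss(0, 1) for _ in range(d)], gate_w, experts)
```

With `n_experts=4, top_k=2`, half the expert parameters sit idle for this token — which is simultaneously the FLOP saving and the reason all experts must still be resident in memory.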
| Model | Total Params | Active | Experts | Routing | Context | Notes |
|---|---|---|---|---|---|---|
| Qwen3-235B-A22B | 235B | 22B | 128 (top-8) | Token-level | 128K | Hybrid thinking/non-thinking modes, 119+ languages |
| Qwen3-30B-A3B | 30B | 3B | 128 (top-8) | Token-level | 128K | Runnable on M4 Max 36GB at Q4_K_M |
| DeepSeek-V3.1 | 671B | 37B | 256 (top-8) | Fine-grained | 128K | Multi-head latent attention, strong code |
| DeepSeek-R1 | 671B | 37B | 256 | Fine-grained | 128K | Reasoning specialist, FP4 & 1.78-bit variants exist |
| Llama 4 Maverick | 400B | 17B | 128 (top-1) | Token-level | 1M | Meta's first MoE, multimodal, ultra-long context |
| GPT-5 | Undisclosed | — | — | Dynamic | 400K | OpenAI's first confirmed MoE architecture |
| GPT-OSS-120B | 117B | 5.1B | 128 (top-4) | — | 128K | Open-weight, MXFP4 fits on 1× H100 |
| Kimi K2 | 1T | 32B | MoE | — | 128K | Moonshot, trillion-parameter scale |
| Mixtral 8×22B | 141B | ~39B | 8 (top-2) | Token-level | 64K | Mistral, strong generalist |
| Mistral Large 3 | Undisclosed | — | — | — | 128K | Production frontier from Mistral |
Quantization maps FP16/FP32 weights to lower-precision integers, dramatically shrinking model size and speeding inference. The GGUF format (from llama.cpp) is the universal standard for local inference. Key families: legacy (Q4_0, Q8_0), K-quants (Q4_K_M, Q5_K_M, Q6_K), and I-quants (IQ4_XS). For GPU serving, AWQ and GPTQ offer higher throughput via specialized kernels.
| Quant | Bits/Weight | Size (7B) | PPL Δ (7B) | Quality | Use Case |
|---|---|---|---|---|---|
| Q8_0 | 8.0 | ~7.2 GB | +0.004 | Near-lossless | Baseline / when VRAM permits / code & RAG fidelity |
| Q6_K | 6.6 | ~5.5 GB | +0.009 | Excellent | Production sweet spot if memory allows |
| Q5_K_M | 5.7 | ~4.8 GB | +0.035 | Very Good | High quality with real savings; great for code & reasoning |
| Q4_K_M | 4.9 | ~4.1 GB | +0.054 | Good | ⭐ Most popular — best balance of quality/size/speed |
| IQ4_XS | 4.5 | ~3.8 GB | ~+0.06 | Good* | Max compression at 4-bit class; needs good imatrix |
| Q3_K_M | 3.9 | ~3.3 GB | +0.244 | Moderate | Tight VRAM only; noticeable quality loss |
| Q2_K | 3.4 | ~2.9 GB | +0.870 | Poor | Not recommended — extreme quality loss |
PPL Δ = perplexity increase vs FP16 baseline. Lower = better. K-quants use block + sub-block grouping for superior quality vs legacy types at the same bit-width. "M" = medium mixed precision (attention layers get higher precision).
A larger model at lower quant typically outperforms a smaller model at higher quant. Example: 32B @ Q4_K_M > 14B @ Q8_0 in most benchmarks. Always pick the biggest model that fits in VRAM at Q4_K_M or above.
General chat: Q4_K_M (bump to Q5 if inconsistent)
Coding: Q5_K_M or Q8_0 for fewer subtle errors
Reasoning/math: Q5_K_M+ stabilizes chain-of-thought
RAG/retrieval: Q5_K_M or Q8_0 for grounding accuracy
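The size column scales roughly linearly with parameter count, so fits for other model sizes can be estimated from the effective bits/weight. A rough sketch — effective bit-widths already include block metadata, but real GGUF files can still differ by a few percent from this estimate:

```python
def quant_size_gb(params_billion, bits_per_weight):
    """Approximate quantized model size in GB: params × effective bits/weight.

    Effective bit-widths (e.g. Q4_K_M ≈ 4.9) come from the table above.
    """
    return params_billion * bits_per_weight / 8

# 32B at Q4_K_M → roughly 19–20 GB, which is why it fits a 24GB GPU
# with headroom left for KV cache
size_32b = quant_size_gb(32, 4.9)   # ≈ 19.6 GB
```

The same arithmetic drives the "biggest model that fits" rule: a 32B @ Q4_K_M (~20 GB) and a 14B @ Q8_0 (~14 GB) both fit 24 GB, and the larger model usually wins on quality.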
GGUF: Best for llama.cpp/Ollama on CPU+GPU. Universal format.
AWQ: GPU-only, activation-aware. Best with Marlin kernels in vLLM (up to ~741 tok/s in published benchmarks).
GPTQ: GPU-only, Hessian-based. Slightly lower quality than AWQ.
MXFP4/NVFP4: Native 4-bit on Blackwell GPUs. Up to 4× cost reduction vs FP16.
MoE models benefit enormously from quantization because you need ALL experts in memory but only activate a few. Qwen3-30B-A3B at Q4_K_M fits in ~18GB — runnable on M4 Max 36GB with room for KV cache. DeepSeek-R1 has community 1.78-bit variants for extreme compression.
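The leftover memory goes to the KV cache, whose size is easy to estimate from the attention configuration. A sketch with illustrative GQA-style dimensions (not official Qwen3 specs):

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_el=2):
    """Approximate KV cache size: 2 (K and V) × layers × kv_heads ×
    head_dim × context tokens × bytes per element (FP16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_el / 1e9

# Illustrative: 48 layers, 8 KV heads, head_dim 128, 32K context at FP16
gb = kv_cache_gb(48, 8, 128, 32768)   # ≈ 6.4 GB
```

This is why an ~18 GB model on a 36 GB machine is comfortable at moderate context but gets tight as context grows — the cache scales linearly with tokens, and quantized KV (e.g. 8-bit) halves it.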
| Model | Provider | Architecture | Context | Input $/1M | Output $/1M | Strength |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | Dense | 200K | $15 | $75 | Most capable reasoning |
| Claude Sonnet 4.6 | Anthropic | Dense | 200K | $3 | $15 | Best quality/cost ratio |
| Claude Haiku 4.5 | Anthropic | Dense | 200K | $0.25 | $1.25 | Speed + cost leader |
| GPT-5.1 | OpenAI | MoE | 400K | $1.25 | $10 | Reasoning, agentic |
| GPT-4.1 | OpenAI | Dense | 1M | $2 | $8 | Huge context, fast |
| Gemini 2.5 Pro | Google | MoE | 1M | $1.25 | $10 | Multimodal, massive context |
| Gemini 2.5 Flash | Google | MoE | 1M | $0.15 | $0.60 | Best price-performance |
| Grok 4.1 Fast | xAI | — | 2M | $0.20 | — | Huge context, speed |
| Model | Type | Total / Active | Context | Local Fit | Strength |
|---|---|---|---|---|---|
| Qwen3-235B-A22B | MoE | 235B / 22B | 128K | Multi-GPU (Q4: ~120GB) | Top-tier open, hybrid thinking |
| Qwen3-32B | Dense | 32B / 32B | 128K | 24GB GPU @ Q4 | Strong all-rounder, great for code |
| Qwen3-30B-A3B | MoE | 30B / 3B | 128K | M4 Max 36GB @ Q4 | Fast MoE for lightweight local |
| DeepSeek-V3.1 | MoE | 671B / 37B | 128K | Cluster only | Top coding, reasoning |
| DeepSeek-R1 | MoE | 671B / 37B | 128K | Cluster (1.78-bit fits 24GB!) | Reasoning specialist |
| Llama 4 Maverick | MoE | 400B / 17B | 1M | Multi-GPU | Multimodal, 1M context |
| Llama 3.3 70B | Dense | 70B | 128K | 2× 24GB GPU @ Q4 | Mature, well-supported |
| Mistral Large 3 | MoE | — | 128K | Multi-GPU | Strong European frontier |
| GPT-OSS-120B | MoE | 117B / 5.1B | 128K | 1× H100 @ MXFP4 | OpenAI's first open model |
| Runtime | Best For | Format | Multi-GPU | Concurrency | Setup | Key Feature |
|---|---|---|---|---|---|---|
| Ollama | Dev workflow, single-user, prototyping | GGUF | No TP | Poor (serial) | 1 min | Docker-like UX, model hub, OpenAI-compat API |
| llama.cpp | Edge, CPU, max control, single-user | GGUF | Layer split + RPC | Low | Moderate | MCP client (Mar 2026), pure C++, runs anywhere |
| vLLM | Production serving, multi-user, throughput | HF, AWQ, GPTQ, GGUF | Full TP/PP | Excellent | Complex | PagedAttention, continuous batching, 35× RPS vs llama.cpp |
| LM Studio | Desktop GUI, beginners, quick eval | GGUF | No | Low | 1 min | Beautiful UI, model browser, Vulkan for iGPU |
| SGLang | Structured generation, multi-step reasoning | HF, AWQ | Yes | High | Complex | Program-level LLM orchestration on vLLM backend |
| ExLlamaV2 | NVIDIA GPU max perf, custom quant | EXL2, GPTQ | Multi-GPU split | Moderate | Moderate | Per-layer quant control, fast GPU inference |
| TensorRT-LLM | Enterprise NVIDIA, max throughput | Compiled | Full | Excellent | Hard | NVIDIA stack, Dynamo, FP4/FP8, NIM |
| LocalAI | OpenAI drop-in replacement, multimodal | GGUF, others | Limited | Moderate | Moderate | Broadest API compat, image/audio/embedding |
| Provider | Tier | Key Models | Differentiator | Pricing Tier |
|---|---|---|---|---|
| Anthropic | Frontier | Claude Opus/Sonnet/Haiku 4.x | Best coding (SWE-bench 80.9%), 200K context, safety-focused | $0.25–$15/M input |
| OpenAI | Frontier | GPT-5.1, GPT-4.1, GPT-5-mini | Largest ecosystem, 400K–1M context, batch API 50% off | $0.25–$2/M input |
| Google | Frontier | Gemini 2.5 Pro/Flash/Flash-Lite | 1M context, multimodal native, free tier (1K req/day) | $0.10–$1.25/M input |
| xAI | Frontier | Grok 4.1 Fast | 2M token context, speed-focused | $0.20/M input |
| DeepSeek | Value | V3.1, R1 | Reasoning at $0.55/M, open-weight, MoE | $0.14–$0.55/M input |
| Mistral AI | Value | Mistral Large 3, Codestral | Strong European option, open models, competitive pricing | $0.15–$2/M input |
| Fireworks AI | Infra | OSS models (Llama, Qwen, etc.) | Fast inference for open models, serverless, competitive | $0.10–$0.90/M |
| Together AI | Infra | OSS models + fine-tuning | Fine-tuning platform, serverless & dedicated | $0.10–$0.88/M |
| Groq | Speed | Llama, Mixtral, etc. | LPU hardware — lowest latency for open models | Competitive |
| SiliconFlow | Value | DeepSeek, Qwen, MiniMax | 2.3× faster inference, lowest-cost open model serving | Industry-low |
| Azure OpenAI | Enterprise | OpenAI models | VPC, HIPAA/SOC/ISO, 99.9% SLA, 27 regions | OpenAI pricing + regional uplift |
| AWS Bedrock | Enterprise | Claude, Llama, Mistral, Titan | Multi-model hub, AWS integration, guardrails | Varies by model |
Local LLM performance is primarily gated by memory bandwidth (tok/s during decode) and total VRAM/RAM (determines max model size). Compute matters more for prefill (TTFT) than decode (TPOT).
| Hardware | VRAM/RAM | Bandwidth | Max Model (Q4) | Sweet Spot |
|---|---|---|---|---|
| RTX 3090 | 24 GB | 936 GB/s | ~32B | Qwen3-32B Q4_K_M, great decode speed |
| RTX 4090 | 24 GB | 1,008 GB/s | ~32B | Same fit, faster prefill, FP8 support |
| RTX 5090 | 32 GB | 1,792 GB/s | ~50B | Bigger models + faster, FP4 native |
| 2× RTX 3090 | 48 GB | 1,872 GB/s | ~70B | Llama 70B Q4 via vLLM TP=2 or llama.cpp split |
| M4 Max (36 GB) | 36 GB unified | 546 GB/s | ~32B dense, ~30B MoE | Qwen3-32B Q4 or Qwen3-30B-A3B Q4_K_M |
| M4 Max (48 GB) | 48 GB unified | 546 GB/s | ~70B Q3, ~32B Q8 | Bigger context or higher quant |
| M5 Max (48 GB) | 48 GB unified | 614 GB/s | ~50B Q4 | 12% bandwidth uplift vs M4 Max; watch for 64GB config |
| 96 GB Mac/PC | 96 GB | varies | ~120B Q4 / Qwen3-235B Q2-3 | Big MoE models with aggressive quant |
| H100 (80 GB) | 80 GB HBM3 | 3,350 GB/s | ~120B | Production serving, GPT-OSS-120B @ MXFP4 |
| H200 (141 GB) | 141 GB HBM3e | 4,800 GB/s | ~200B+ | DeepSeek-V3 per-node, production frontier |
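Since decode is memory-bound, the bandwidth column above predicts decode speed directly: every generated token streams all active weights through memory once. A back-of-envelope sketch — the efficiency factor is a loose assumption (real systems typically land in the 0.5–0.7 range), not a measured constant:

```python
def decode_toks_per_sec(bandwidth_gbps, model_size_gb, efficiency=0.6):
    """Upper-bound decode estimate: tok/s ≈ bandwidth / bytes read per
    token (the quantized active weights), scaled by an assumed
    real-world efficiency factor."""
    return bandwidth_gbps / model_size_gb * efficiency

# RTX 3090 (936 GB/s) with a 32B model at Q4_K_M (~20 GB of weights)
tps = decode_toks_per_sec(936, 20)   # ≈ 28 tok/s
```

The same formula explains why MoE shines locally: Qwen3-30B-A3B reads only the ~3B active parameters per token, so it decodes several times faster than a dense 30B on identical hardware despite similar total size on disk.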
Gateways sit between your application and model providers, adding unified APIs, routing, cost tracking, and caching. Three distinct patterns: unified routers (aggregate providers behind one key), inference platforms (host open models as a service), and OSS proxies (self-hosted middleware).
900K+ models, 200K+ datasets, Spaces for demos. The de facto repository for open-weight models — every local runtime pulls from here. Free to use; paid private storage. GGUF quantizations (bartowski, unsloth, mlx-community) are distributed via Hub.
Instant API access to thousands of Hub models with zero setup. Rate-limited free tier; pay-per-request at scale. No GPU provisioning. Great for prototyping and evaluation but not for sustained production throughput.
Deploy any Hub model on dedicated hardware (T4, A10G, A100, H100) in your chosen cloud region. Auto-scaling, private VPC option, OpenAI-compat API. Per-hour from ~$0.60/hr (T4) to $6+/hr (A100). Best for custom fine-tuned models needing stable SLAs.
HuggingFace's production inference server. Continuous batching, PagedAttention, FlashAttention-2, quantization (GPTQ/AWQ/bitsandbytes), multi-GPU tensor parallelism. Powers HF Inference Endpoints. Self-host via Docker: ghcr.io/huggingface/text-generation-inference
| Gateway | Type | Models / Providers | Key Feature | Pricing |
|---|---|---|---|---|
| OpenRouter | Router | 300+ models, 30+ providers | Single API key for all providers, real-time price comparison, auto-fallback, free-model tier, provider rankings by latency | Pass-through + small markup |
| LiteLLM (proxy) | OSS Proxy | 100+ providers | Self-hosted, per-team cost tracking, rate limits per key, Redis caching, load balancing across providers, budget alerts | Free OSS; hosted tier available |
| Portkey AI | Enterprise Router | 250+ models | Observability, guardrails, semantic caching, canary deployments, prompt versioning, SOC2 | Free tier; $49+/mo |
| Helicone | Observability | Any OpenAI-compat | Drop-in logging proxy (one-line header change), cost analytics, prompt management, A/B testing, OSS | Free tier; usage-based |
| AI/ML API | Router | 200+ models | Low-cost aggregator focused on open models, free trial credits, OpenAI-compat | Competitive; usage-based |
| Platform | Model Selection | Billing | Standout | Use Case |
|---|---|---|---|---|
| Ollama | Hub models via ollama pull | Free (self-hosted) | See §05 Local Runtimes — Docker-like UX, OpenAI-compat API, model library at ollama.com | Local dev; see §05 for details |
| Replicate | 100K+ community models | Per-second GPU time | Cog framework for packaging any model; massive community ecosystem; image/video/audio supported | Niche models, rapid prototyping, image/video gen |
| Cloudflare Workers AI | 50+ curated models | Per 1K neurons; 10K/day free | Edge inference <50ms globally, no cold starts, serverless-native, zero GPU provisioning | Latency-critical, edge apps, low-volume prod |
| Modal | Any model (custom containers) | Per-second GPU time | Python-native serverless GPU, ~100ms cold starts, great for batch pipelines and fine-tuning jobs | Batch inference, research, custom pipelines |
| Baseten | Any HF model | Per-hour + compute | Fast cold starts (~2s via Truss framework), fine-tuning support, production ML APIs | Custom fine-tuned model APIs with SLA needs |
Provider fallback: `"route": "fallback"` tries providers in order
Load balance: `"route": "load-balance"` distributes across providers
Provider pinning: `"provider": {"order": ["Groq","Together"]}`
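Assembled into one request body, the routing options above look like the sketch below (field names as given in the text — treat the gateway's own API reference as authoritative for the exact schema):

```python
# Illustrative OpenRouter-style request body combining routing options.
payload = {
    "model": "meta-llama/llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello"}],
    "route": "fallback",                           # try providers in order
    "provider": {"order": ["Groq", "Together"]},   # preferred provider order
}
```

Sent as the JSON body of a normal chat-completions POST, this pins Groq first and falls back to Together if it errors or is rate-limited.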
OpenRouter uses provider/model-name slugs:
anthropic/claude-sonnet-4-6
openai/gpt-4.1
google/gemini-2.5-pro
deepseek/deepseek-r1
meta-llama/llama-3.3-70b
qwen/qwen3-235b-a22b
OpenRouter maintains a set of free models (rate-limited): Llama 3.3 70B, DeepSeek variants, Gemma 3, Mistral 7B, and others. Append `:free` to the model slug to opt in. Rate limits apply per IP. Useful for dev/CI pipelines.
Change two lines — base URL and model slug — to route any OpenAI SDK call through a gateway:
base_url="https://openrouter.ai/api/v1"
model="anthropic/claude-sonnet-4-6"
Works identically with LiteLLM proxy at localhost:4000.
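A minimal sketch of that two-line change with the official `openai` Python SDK (v1+); the environment variable name is illustrative, and the key must be a gateway key, not an OpenAI key:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # change 1: gateway URL
    api_key=os.environ["OPENROUTER_API_KEY"],  # gateway credential
)
resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",       # change 2: provider/model slug
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

Swap `base_url` for `http://localhost:4000` and the same code runs against a self-hosted LiteLLM proxy; everything else in the SDK call is untouched.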