LLM Performance & Deployment
Quick Reference

March 2026 · TPOT / TTFT / MoE / Quantization / Providers / Gateways

01 Inference Performance Metrics

TTFT Latency

Time to First Token — time from sending a request until the first output token arrives. Includes network latency, queuing, and the entire prefill phase (processing all input tokens to build the KV cache). Longer prompts → bigger TTFT.

TTFT = queue_time + prefill_time + network_latency

Targets: chatbot <500ms · code completion <100ms · batch OK at seconds

TPOT Latency

Time Per Output Token — average time between generating each successive token after the first. Measures the decode phase speed. Also called ITL (Inter-Token Latency) in some tools, though ITL is token-weighted while TPOT is request-weighted.

TPOT = (last_token_time − first_token_time) / (output_tokens − 1)

30ms TPOT ≈ 33 tok/s ≈ ~1,500 words/min (faster than reading speed)

E2E Latency

End-to-End Request Latency — total time from request to final token. What the user actually waits for.

E2E = TTFT + TPOT × (output_tokens − 1)

A fast TTFT with slow generation still yields poor perceived UX.
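The three formulas above compose directly. A minimal sketch that computes all three from a token-arrival trace (timestamps here are illustrative, not from any real benchmark):

```python
def latency_metrics(request_sent: float, token_times: list[float]) -> dict:
    """Compute TTFT, TPOT, and E2E latency from token arrival timestamps.

    request_sent: wall-clock time the request was sent (seconds)
    token_times:  arrival time of each output token, in order
    """
    ttft = token_times[0] - request_sent                 # queue + prefill + network
    n = len(token_times)
    tpot = (token_times[-1] - token_times[0]) / (n - 1)  # decode-phase average
    e2e = token_times[-1] - request_sent                 # total user wait
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}

# Illustrative stream: request at t=0, first token at 0.4s, then one every 30ms
times = [0.4 + 0.03 * i for i in range(100)]
m = latency_metrics(0.0, times)
print(round(m["ttft"], 3), round(m["tpot"], 3), round(m["e2e"], 3))  # 0.4 0.03 3.37
```

Note that the computed E2E equals TTFT + TPOT × (output_tokens − 1), matching the formula above.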

TPS / RPS / Goodput Throughput

Tokens/sec, Requests/sec, Goodput — system-level throughput. TPS = total output tokens generated per second across all concurrent requests. RPS = completed requests per second. Goodput = requests/sec that meet your SLO thresholds (e.g. TTFT <500ms AND TPOT <15ms). Goodput is the metric that actually correlates to user satisfaction.

Benchmarking gotcha: Tools disagree on definitions. GenAI-Perf excludes TTFT from TPOT; LLMPerf includes it. Always check which tool generated the numbers and whether they include queuing time.

Metric Relationships at a Glance

Metric | Measures | Affected By | Optimize When
TTFT | Responsiveness start | Prompt length, KV cache build, queue depth, prefill compute | Interactive chat, streaming UX
TPOT / ITL | Generation smoothness | Model size, batch size, KV cache growth, decode compute | Streaming text, real-time apps
E2E | Total wait | TTFT + TPOT × tokens | Short-response use cases
TPS | System capacity | GPU count, batching strategy, memory bandwidth | Multi-user serving
Goodput | Useful throughput | All of the above within SLO bounds | Production SLAs
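Goodput is straightforward to compute once you log per-request TTFT and TPOT. A sketch using the SLO thresholds from the example in this section (0.5 s TTFT and 15 ms TPOT, both placeholders for your own targets):

```python
def goodput(requests: list[dict], window_s: float,
            ttft_slo: float = 0.5, tpot_slo: float = 0.015) -> float:
    """Requests per second that met BOTH SLO thresholds within the window."""
    good = sum(1 for r in requests
               if r["ttft"] <= ttft_slo and r["tpot"] <= tpot_slo)
    return good / window_s

# 4 completed requests over a 2-second window; two violate an SLO
reqs = [
    {"ttft": 0.30, "tpot": 0.012},  # good
    {"ttft": 0.45, "tpot": 0.014},  # good
    {"ttft": 0.80, "tpot": 0.010},  # TTFT too slow
    {"ttft": 0.20, "tpot": 0.020},  # TPOT too slow
]
print(len(reqs) / 2.0, goodput(reqs, 2.0))  # RPS = 2.0, goodput = 1.0
```

This is why RPS can look healthy while users are unhappy: half of these requests counted toward RPS but missed the SLO.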

02 Mixture of Experts (MoE)

MoE replaces dense FFN layers with multiple specialized expert sub-networks and a learned router/gate that selects a subset of experts per token. Since early 2025, MoE has become the dominant architecture for frontier models — over 60% of major open-source releases use it.

Advantages

Faster training: MoEs match dense model quality with far less compute.

Faster inference FLOPs: Only active params are computed per token. Qwen3-235B activates only 22B per token — inference compute equivalent to a ~22B dense model.

Scaling capacity: Total param count → capacity. Active params → cost. You can grow capacity without proportional compute increase.

Expert specialization: Different experts learn to handle different domains/token types, improving multi-task generalization.

Tradeoffs

High VRAM: ALL experts must be loaded in memory, even though only a subset activate. Mixtral 8×7B has 47B total params but ~12B active. You need RAM for the 47B.

Load balancing: Without auxiliary losses, some experts starve while others are overloaded. Active research area in 2025-26.

Fine-tuning fragility: MoE historically overfits during fine-tuning. Newer techniques (LoRA on shared layers, expert freezing) help.

Communication overhead: Multi-GPU expert parallelism adds latency from all-to-all token routing.
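The router-plus-experts mechanism described above can be sketched in a few lines. This is a toy single-token illustration with random weights; real MoE layers batch tokens, use trained FFN experts, and add auxiliary load-balancing losses:

```python
import numpy as np

def moe_layer(x, w_gate, experts, k=2):
    """Toy token-level top-k MoE routing for a single token x.

    x:       (d,) token hidden state
    w_gate:  (d, n_experts) learned router weights
    experts: list of callables, each a small expert network
    """
    logits = x @ w_gate                    # router score per expert
    top = np.argsort(logits)[-k:]          # indices of the top-k experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                     # softmax over the selected experts only
    # Only the k selected experts run; the other n-k cost zero FLOPs,
    # but all expert weights must still be resident in memory.
    return sum(g * experts[i](x) for g, i in zip(gate, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
w_gate = rng.normal(size=(d, n_experts))
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
y = moe_layer(rng.normal(size=d), w_gate, experts, k=2)
print(y.shape)  # (16,)
```

The memory tradeoff is visible here: all 8 expert weight matrices exist, but each token touches only 2 of them.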

Notable MoE Models (2024–2026)

Model | Total Params | Active | Experts | Routing | Context | Notes
Qwen3-235B-A22B | 235B | 22B | 128 (top-8) | Token-level | 128K | Hybrid thinking/non-thinking modes, 119+ languages
Qwen3-30B-A3B | 30B | 3B | 128 (top-8) | Token-level | 128K | Runnable on M4 Max 36GB at Q4_K_M
DeepSeek-V3.1 | 671B | 37B | 256 (top-8) | Fine-grained | 128K | Multi-head latent attention, strong code
DeepSeek-R1 | 671B | 37B | 256 | Fine-grained | 128K | Reasoning specialist, FP4 & 1.78-bit variants exist
Llama 4 Maverick | 400B | 17B | 128 (top-1) | Token-level | 1M | Meta's first MoE, multimodal, ultra-long context
GPT-5 | Undisclosed | — | — | Dynamic | 400K | OpenAI's first confirmed MoE architecture
GPT-OSS-120B | 120B | ~20B | — | — | 128K | Open-source MoE, MXFP4 fits on 1× H100
Kimi K2 | 1T | 32B | — | — | 128K | Moonshot MoE, trillion-parameter scale
Mixtral 8×22B | 141B | ~39B | 8 (top-2) | Token-level | 65K | Mistral, strong generalist
Mistral Large 3 | — | — | — | — | 128K | Production frontier MoE from Mistral
2025-26 Trend: Innovation has shifted from simply scaling parameter counts to improving routing reliability, load balancing (similarity-preserving objectives, MaxScore constrained optimization), and controllable compute allocation across modalities. Fine-grained MoE (many small experts, e.g. DeepSeek's 256 experts) is preferred over coarse-grained (few large experts).

03 Quantization

Quantization maps FP16/FP32 weights to lower-precision integers, dramatically shrinking model size and speeding inference. The GGUF format (from llama.cpp) is the universal standard for local inference. Key families: legacy (Q4_0, Q8_0), K-quants (Q4_K_M, Q5_K_M, Q6_K), and I-quants (IQ4_XS). For GPU serving, AWQ and GPTQ offer higher throughput via specialized kernels.

GGUF Quantization Levels — Quality vs. Size

Quant | Bits/Weight | Size (7B) | PPL Δ (7B) | Quality | Use Case
Q8_0 | 8.0 | ~7.2 GB | +0.004 | Near-lossless | Baseline / when VRAM permits / code & RAG fidelity
Q6_K | 6.6 | ~5.5 GB | +0.009 | Excellent | Production sweet spot if memory allows
Q5_K_M | 5.7 | ~4.8 GB | +0.035 | Very Good | High quality with real savings; great for code & reasoning
Q4_K_M | 4.9 | ~4.1 GB | +0.054 | Good ⭐ | Most popular — best balance of quality/size/speed
IQ4_XS | 4.5 | ~3.8 GB | ~+0.06 | Good* | Max compression at 4-bit class; needs good imatrix
Q3_K_M | 3.9 | ~3.3 GB | +0.244 | Moderate | Tight VRAM only; noticeable quality loss
Q2_K | 3.4 | ~2.9 GB | +0.870 | Poor | Not recommended — extreme quality loss

PPL Δ = perplexity increase vs FP16 baseline. Lower = better. K-quants use block + sub-block grouping for superior quality vs legacy types at the same bit-width. "M" = medium mixed precision (attention layers get higher precision).
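The block-scaling idea behind these formats can be sketched in a few lines. This is a simplified illustration, not the actual GGUF bit layout: real Q8_0 stores an fp16 scale per 32-weight block roughly like this, while K-quants add sub-block scales and offsets on top:

```python
import numpy as np

def quantize_q8_block(w, block=32):
    """Simplified symmetric 8-bit block quantization (illustrative only)."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per block
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate fp32 weights from int8 codes + block scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_q8_block(w)
err = np.abs(dequantize(q, s) - w).max()
print("max abs reconstruction error:", float(err))
```

Each weight shrinks from 32 (or 16) bits to 8 plus a small per-block overhead, and the per-block scale is what keeps the rounding error proportional to each block's own magnitude rather than the global maximum.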

Quantization Strategy Cheat Sheet

🎯 Decision Rule

A larger model at lower quant typically outperforms a smaller model at higher quant. Example: 32B @ Q4_K_M > 14B @ Q8_0 in most benchmarks. Always pick the biggest model that fits in VRAM at Q4_K_M or above.

📊 By Task

General chat: Q4_K_M (bump to Q5 if inconsistent)
Coding: Q5_K_M or Q8_0 for fewer subtle errors
Reasoning/math: Q5_K_M+ stabilizes chain-of-thought
RAG/retrieval: Q5_K_M or Q8_0 for grounding accuracy

🔧 By Format (GPU Serving)

GGUF: Best for llama.cpp/Ollama on CPU+GPU. Universal format.
AWQ: GPU-only, activation-aware. Best with Marlin kernels in vLLM (741 tok/s).
GPTQ: GPU-only, Hessian-based. Slightly lower quality than AWQ.
MXFP4/NVFP4: Native 4-bit on Blackwell GPUs. 4× cost reduction.

⚡ MoE + Quantization

MoE models benefit enormously from quantization because you need ALL experts in memory but only activate a few. Qwen3-30B-A3B at Q4_K_M fits in ~18GB — runnable on M4 Max 36GB with room for KV cache. DeepSeek-R1 has community 1.78-bit variants for extreme compression.

04 Key Models — March 2026 Snapshot

Frontier / Cloud API Models

Model | Provider | Architecture | Context | Input $/1M | Output $/1M | Strength
Claude Opus 4.6 | Anthropic | Dense | 200K | $15 | $75 | Most capable reasoning
Claude Sonnet 4.6 | Anthropic | Dense | 200K | $3 | $15 | Best quality/cost ratio
Claude Haiku 4.5 | Anthropic | Dense | 200K | $0.25 | $1.25 | Speed + cost leader
GPT-5.1 | OpenAI | MoE | 400K | $1.25 | $10 | Reasoning, agentic
GPT-4.1 | OpenAI | Dense | 1M | $2 | $8 | Huge context, fast
Gemini 2.5 Pro | Google | MoE | 1M | $1.25 | $10 | Multimodal, massive context
Gemini 2.5 Flash | Google | MoE | 1M | $0.15 | $0.60 | Best price-performance
Grok 4.1 Fast | xAI | — | 2M | $0.20 | — | Huge context, speed

Open-Weight Models (Self-Hostable)

Model | Type | Total / Active | Context | Local Fit | Strength
Qwen3-235B-A22B | MoE | 235B / 22B | 128K | Multi-GPU (Q4: ~120GB) | Top-tier open, hybrid thinking
Qwen3-32B | Dense | 32B / 32B | 128K | 24GB GPU @ Q4 | Strong all-rounder, great for code
Qwen3-30B-A3B | MoE | 30B / 3B | 128K | M4 Max 36GB @ Q4 | Fast MoE for lightweight local
DeepSeek-V3.1 | MoE | 671B / 37B | 128K | Cluster only | Top coding, reasoning
DeepSeek-R1 | MoE | 671B / 37B | 128K | Cluster (1.78-bit fits 24GB!) | Reasoning specialist
Llama 4 Maverick | MoE | 400B / 17B | 1M | Multi-GPU | Multimodal, 1M context
Llama 3.3 70B | Dense | 70B | 128K | 2× 24GB GPU @ Q4 | Mature, well-supported
Mistral Large 3 | MoE | — | 128K | Multi-GPU | Strong European frontier
GPT-OSS-120B | MoE | 120B / ~20B | 128K | 1× H100 @ MXFP4 | OpenAI's first open model

05 Local Inference Runtimes

Runtime | Best For | Format | Multi-GPU | Concurrency | Setup | Key Feature
Ollama | Dev workflow, single-user, prototyping | GGUF | No TP | Poor (serial) | 1 min | Docker-like UX, model hub, OpenAI-compat API
llama.cpp | Edge, CPU, max control, single-user | GGUF | Layer split + RPC | Low | Moderate | MCP client (Mar 2026), pure C++, runs anywhere
vLLM | Production serving, multi-user, throughput | HF, AWQ, GPTQ, GGUF | Full TP/PP | Excellent | Complex | PagedAttention, continuous batching, 35× RPS vs llama.cpp
LM Studio | Desktop GUI, beginners, quick eval | GGUF | No | Low | 1 min | Polished UI, model browser, Vulkan for iGPU
SGLang | Structured generation, multi-step reasoning | HF, AWQ | Yes | High | Complex | Program-level LLM orchestration, RadixAttention prefix caching
ExLlamaV2 | NVIDIA GPU max perf, custom quant | EXL2, GPTQ | Multi-GPU split | Moderate | Moderate | Per-layer quant control, fast GPU inference
TensorRT-LLM | Enterprise NVIDIA, max throughput | Compiled | Full | Excellent | Hard | NVIDIA stack, Dynamo, FP4/FP8, NIM
LocalAI | OpenAI drop-in replacement, multimodal | GGUF, others | Limited | Moderate | Moderate | Broadest API compat, image/audio/embedding
Decision tree: Just exploring? → Ollama or LM Studio. Building apps? → Ollama (simple) or vLLM (production). Multi-GPU serving? → vLLM. Edge/embedded? → llama.cpp. NVIDIA enterprise? → TensorRT-LLM.

06 Cloud API Providers

Provider | Tier | Key Models | Differentiator | Pricing Tier
Anthropic | Frontier | Claude Opus/Sonnet/Haiku 4.x | Best coding (SWE-bench 80.9%), 200K context, safety-focused | $0.25–$15/M input
OpenAI | Frontier | GPT-5.1, GPT-4.1, GPT-5-mini | Largest ecosystem, 400K–1M context, batch API 50% off | $0.25–$2/M input
Google | Frontier | Gemini 2.5 Pro/Flash/Flash-Lite | 1M context, multimodal native, free tier (1K req/day) | $0.10–$1.25/M input
xAI | Frontier | Grok 4.1 Fast | 2M token context, speed-focused | $0.20/M input
DeepSeek | Value | V3.1, R1 | Reasoning at $0.55/M, open-weight, MoE | $0.14–$0.55/M input
Mistral AI | Value | Mistral Large 3, Codestral | Strong European option, open models, competitive pricing | $0.15–$2/M input
Fireworks AI | Infra | OSS models (Llama, Qwen, etc.) | Fast inference for open models, serverless, competitive | $0.10–$0.90/M
Together AI | Infra | OSS models + fine-tuning | Fine-tuning platform, serverless & dedicated | $0.10–$0.88/M
Groq | Speed | Llama, Mixtral, etc. | LPU hardware — lowest latency for open models | Competitive
SiliconFlow | Value | DeepSeek, Qwen, MiniMax | 2.3× faster inference, lowest-cost open model serving | Industry-low
Azure OpenAI | Enterprise | OpenAI models | VPC, HIPAA/SOC/ISO, 99.9% SLA, 27 regions | OpenAI rates + regional variation
AWS Bedrock | Enterprise | Claude, Llama, Mistral, Titan | Multi-model hub, AWS integration, guardrails | Varies by model
Cost optimization: 70-80% of production workloads perform identically on mid-tier models vs premium ones. Always A/B test cheaper tiers first. Use batch APIs (OpenAI 50% off) for non-real-time work. Build abstraction layers (LiteLLM, LangChain) to swap providers without rewrites.

07 Hardware Quick Guide

Local LLM performance is primarily gated by memory bandwidth (tok/s during decode) and total VRAM/RAM (determines max model size). Compute matters more for prefill (TTFT) than decode (TPOT).

Hardware | VRAM/RAM | Bandwidth | Max Model (Q4) | Sweet Spot
RTX 3090 | 24 GB | 936 GB/s | ~32B | Qwen3-32B Q4_K_M, great decode speed
RTX 4090 | 24 GB | 1,008 GB/s | ~32B | Same fit, faster prefill, FP8 support
RTX 5090 | 32 GB | 1,792 GB/s | ~50B | Bigger models + faster, FP4 native
2× RTX 3090 | 48 GB | 1,872 GB/s (aggregate) | ~70B | Llama 70B Q4 via vLLM TP=2 or llama.cpp split
M4 Max (36 GB) | 36 GB unified | 546 GB/s | ~32B dense, ~30B MoE | Qwen3-32B Q4 or Qwen3-30B-A3B Q4_K_M
M4 Max (48 GB) | 48 GB unified | 546 GB/s | ~70B Q3, ~32B Q8 | Bigger context or higher quant
M5 Max (48 GB) | 48 GB unified | 614 GB/s | ~50B Q4 | 12% bandwidth uplift vs M4 Max; watch for 64GB config
96 GB Mac/PC | 96 GB | varies | ~120B Q4 / Qwen3-235B Q2-3 | Big MoE models with aggressive quant
H100 (80 GB) | 80 GB HBM3 | 3,350 GB/s | ~120B | Production serving, GPT-OSS-120B @ MXFP4
H200 (141 GB) | 141 GB HBM3e | 4,800 GB/s | ~200B+ | DeepSeek-V3 per-node, production frontier
Rule of thumb: Model VRAM ≈ (params_in_billions × bits_per_weight) / 8 + KV cache overhead. For a 32B model at Q4_K_M (~4.9 bpw): 32 × 4.9 / 8 ≈ 19.6 GB for weights alone. Add 2-6 GB for KV cache depending on context length. Apple Silicon unified memory means the full system RAM is available — no separate VRAM pool.
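The rule of thumb as a tiny helper. The 4 GB default KV-cache headroom is an assumption for mid-length contexts; actual KV size scales with context length, layer count, and KV head dimension:

```python
def vram_estimate_gb(params_b: float, bits_per_weight: float,
                     kv_cache_gb: float = 4.0) -> float:
    """Weights footprint plus KV-cache headroom, in GB.

    kv_cache_gb is a rough placeholder; real KV size grows with
    context length and model architecture (add 2-6 GB typically).
    """
    return params_b * bits_per_weight / 8 + kv_cache_gb

# 32B model at Q4_K_M (~4.9 bpw): ~19.6 GB weights + ~4 GB KV cache
print(round(vram_estimate_gb(32, 4.9), 1))  # 23.6
```

Plugging in Qwen3-30B-A3B at Q4_K_M (30 × 4.9 / 8 ≈ 18.4 GB weights) shows why it fits on a 36 GB M4 Max with room to spare.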

08 Model Gateways & Access Layers

Gateways sit between your application and model providers, adding unified APIs, routing, cost tracking, and caching. Three distinct patterns: unified routers (aggregate providers behind one key), inference platforms (host open models as a service), and OSS proxies (self-hosted middleware).

Hugging Face Ecosystem

Hub Model Registry

900K+ models, 200K+ datasets, Spaces for demos. The de facto repository for open-weight models — every local runtime pulls from here. Free to use; paid private storage. GGUF quantizations (bartowski, unsloth, mlx-community) are distributed via Hub.

Serverless Inference API Free Tier

Instant API access to thousands of Hub models with zero setup. Rate-limited free tier; pay-per-request at scale. No GPU provisioning. Great for prototyping and evaluation but not for sustained production throughput.

POST api-inference.huggingface.co/models/{model_id}

Inference Endpoints Dedicated

Deploy any Hub model on dedicated hardware (T4, A10G, A100, H100) in your chosen cloud region. Auto-scaling, private VPC option, OpenAI-compat API. Per-hour from ~$0.60/hr (T4) to $6+/hr (A100). Best for custom fine-tuned models needing stable SLAs.

TGI — Text Generation Inference OSS Server

HuggingFace's production inference server. Continuous batching, PagedAttention, FlashAttention-2, quantization (GPTQ/AWQ/bitsandbytes), multi-GPU tensor parallelism. Powers HF Inference Endpoints. Self-host via Docker: ghcr.io/huggingface/text-generation-inference

Unified API Routers

Gateway | Type | Models / Providers | Key Feature | Pricing
OpenRouter | Router | 300+ models, 30+ providers | Single API key for all providers, real-time price comparison, auto-fallback, free-model tier, provider rankings by latency | Pass-through + small markup
LiteLLM (proxy) | OSS Proxy | 100+ providers | Self-hosted, per-team cost tracking, rate limits per key, Redis caching, load balancing across providers, budget alerts | Free OSS; hosted tier available
Portkey AI | Enterprise Router | 250+ models | Observability, guardrails, semantic caching, canary deployments, prompt versioning, SOC2 | Free tier; $49+/mo
Helicone | Observability | Any OpenAI-compat | Drop-in logging proxy (one-line header change), cost analytics, prompt management, A/B testing, OSS | Free tier; usage-based
AI/ML API | Router | 200+ models | Low-cost aggregator focused on open models, free trial credits, OpenAI-compat | Competitive; usage-based

Inference Platforms (Open Models as a Service)

Platform | Model Selection | Billing | Standout | Use Case
Ollama | Hub models via ollama pull | Free (self-hosted) | Docker-like UX, OpenAI-compat API, model library at ollama.com | Local dev; see §05 Local Runtimes for details
Replicate | 100K+ community models | Per-second GPU time | Cog framework for packaging any model; massive community ecosystem; image/video/audio supported | Niche models, rapid prototyping, image/video gen
Cloudflare Workers AI | 50+ curated models | Per 1K neurons; 10K/day free | Edge inference <50ms globally, no cold starts, serverless-native, zero GPU provisioning | Latency-critical, edge apps, low-volume prod
Modal | Any model (custom containers) | Per-second GPU time | Python-native serverless GPU, ~100ms cold starts, great for batch pipelines and fine-tuning jobs | Batch inference, research, custom pipelines
Baseten | Any HF model | Per-hour + compute | Fast cold starts (~2s via Truss framework), fine-tuning support, production ML APIs | Custom fine-tuned model APIs with SLA needs

OpenRouter Quick Reference

Route Configuration

Provider fallback: "route": "fallback" tries providers in order

Load balance: "route": "load-balance" distributes across providers

Provider pinning: "provider": {"order": ["Groq","Together"]}
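Putting these options together, a hypothetical request body might look like the sketch below. Field names follow the bullets above; verify the exact schema against the current OpenRouter API reference before relying on it:

```python
import json

# Hypothetical OpenRouter chat request body combining the routing options
# listed in this section. Schema details are assumptions, not verified.
payload = {
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello"}],
    "route": "fallback",                          # try providers in order
    "provider": {"order": ["Groq", "Together"]},  # provider pinning
}
print(json.dumps(payload, indent=2))
```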

Model Naming

OpenRouter uses provider/model-name slugs:

anthropic/claude-sonnet-4-6
openai/gpt-4.1
google/gemini-2.5-pro
deepseek/deepseek-r1
meta-llama/llama-3.3-70b
qwen/qwen3-235b-a22b

Free Models Tier

OpenRouter maintains a set of free models (rate-limited): Llama 3.3 70B, DeepSeek variants, Gemma 3, Mistral 7B, and others. Append :free to the model slug to opt in. Rate limits apply per IP. Useful for dev/CI pipelines.

Drop-in OpenAI Replacement

Change two lines — base URL and model slug — to route any OpenAI SDK call through a gateway:

base_url="https://openrouter.ai/api/v1"
model="anthropic/claude-sonnet-4-6"


Works identically with LiteLLM proxy at localhost:4000.
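Because the gateway endpoint is OpenAI-compatible plain HTTP, the same call can be made without any SDK. A sketch that builds the request but does not send it; the API key is a placeholder:

```python
import json
import urllib.request

# Build (but don't send) an OpenAI-compatible chat completion request to
# OpenRouter. "OPENROUTER_API_KEY" is a placeholder, not a real credential.
req = urllib.request.Request(
    "https://openrouter.ai/api/v1/chat/completions",
    data=json.dumps({
        "model": "anthropic/claude-sonnet-4-6",
        "messages": [{"role": "user", "content": "Say hi"}],
    }).encode(),
    headers={
        "Authorization": "Bearer OPENROUTER_API_KEY",
        "Content-Type": "application/json",
    },
    method="POST",
)
print(req.full_url)
# urllib.request.urlopen(req) would perform the actual POST; pointing the
# URL at http://localhost:4000/v1/chat/completions targets a LiteLLM proxy.
```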

LiteLLM vs OpenRouter: OpenRouter is hosted — fast setup, no infra, ideal for solo devs and quick integrations. LiteLLM proxy is self-hosted — more control, per-team budget enforcement, Redis semantic caching (reuse identical prompts), and request logging to your own store. Both expose OpenAI-compat endpoints so switching is trivial.
Gateway decision tree: One API key for many providers? → OpenRouter. Self-hosted cost control + logging? → LiteLLM proxy. Niche open model not on major clouds? → Replicate. Global edge inference? → Cloudflare Workers AI. Custom fine-tune with stable SLA? → HF Inference Endpoints. Add observability to existing setup? → Helicone. Serverless GPU batch jobs? → Modal.