Time to First Token — time from sending a request until the first output token arrives. Includes network latency, queuing, and the entire prefill phase (processing all input tokens to build the KV cache). Longer prompts → bigger TTFT.
Targets: chatbot <500ms · code completion <100ms · batch jobs tolerate seconds
Time Per Output Token — average time between generating each successive token after the first. Measures the decode phase speed. Also called ITL (Inter-Token Latency) in some tools, though ITL is token-weighted while TPOT is request-weighted.
30ms TPOT ≈ 33 tok/s ≈ ~1,600 words/min (faster than reading speed)
End-to-End Request Latency — total time from request to final token. What the user actually waits for.
A fast TTFT with slow generation still yields poor perceived UX.
Tokens/sec, Requests/sec, Goodput — system-level throughput. TPS = total output tokens generated per second across all concurrent requests. RPS = completed requests per second. Goodput = requests/sec that meet your SLO thresholds (e.g. TTFT <500ms AND TPOT <15ms). Goodput is the metric that actually correlates with user satisfaction.
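Goodput as defined above can be computed directly from per-request measurements. A minimal sketch, using the example SLO thresholds from the text (function and variable names are illustrative):

```python
def goodput(requests, window_s, ttft_slo=0.5, tpot_slo=0.015):
    """Requests per second that meet BOTH SLOs (example thresholds:
    TTFT < 500ms AND TPOT < 15ms).

    `requests`: list of (ttft_s, tpot_s) tuples for requests completed
    within a `window_s`-second observation window.
    """
    good = sum(1 for ttft, tpot in requests
               if ttft < ttft_slo and tpot < tpot_slo)
    return good / window_s

# Four completed requests in a 60 s window; only two meet both SLOs
reqs = [(0.3, 0.012), (0.6, 0.010), (0.4, 0.020), (0.45, 0.014)]
gp = goodput(reqs, window_s=60)   # 2 / 60 ≈ 0.033 req/s
```

Note that raw RPS here would be 4/60, twice the goodput — the gap between the two is exactly the traffic your users perceive as broken.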
| Metric | Measures | Affected By | Optimize When |
|---|---|---|---|
| TTFT | Responsiveness start | Prompt length, KV cache build, queue depth, prefill compute | Interactive chat, streaming UX |
| TPOT / ITL | Generation smoothness | Model size, batch size, KV cache growth, decode compute | Streaming text, real-time apps |
| E2E | Total wait | TTFT + TPOT × tokens | Short-response use cases |
| TPS | System capacity | GPU count, batching strategy, memory bandwidth | Multi-user serving |
| Goodput | Useful throughput | All of the above within SLO bounds | Production SLAs |
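TTFT, TPOT, and E2E in the table above can all be derived from one set of token-arrival timestamps. A minimal sketch against any streaming token iterator (simulated here; in practice the iterator would be an OpenAI-compatible streaming response):

```python
import time

def measure_latency(stream):
    """Compute (TTFT, TPOT, E2E) in seconds from a token stream."""
    start = time.perf_counter()
    token_times = [time.perf_counter() for _ in stream]  # one timestamp per token
    ttft = token_times[0] - start        # prefill + queue + network
    e2e = token_times[-1] - start        # total wait for the final token
    # TPOT: average gap between successive tokens after the first
    tpot = (e2e - ttft) / (len(token_times) - 1)
    return ttft, tpot, e2e

def fake_stream(n=5, delay=0.01):
    """Stand-in for a real streaming response: n tokens, ~10ms apart."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tpot, e2e = measure_latency(fake_stream())
```

This also makes the E2E row concrete: by construction, `e2e == ttft + tpot × (tokens − 1)`.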
MoE replaces dense FFN layers with multiple specialized expert sub-networks and a learned router/gate that selects a subset of experts per token. Since early 2025, MoE has become the dominant architecture for frontier models — over 60% of major open-source releases use it.
Faster training: MoEs match dense model quality with far less compute.
Lower inference FLOPs: only the active parameters are computed per token. Qwen3-235B activates just 22B per token, so its inference compute is roughly that of a ~22B dense model.
Scaling capacity: Total param count → capacity. Active params → cost. You can grow capacity without proportional compute increase.
Expert specialization: Different experts learn to handle different domains/token types, improving multi-task generalization.
High VRAM: ALL experts must be loaded in memory, even though only a subset activate. Mixtral 8×7B has 47B total params but ~12B active. You need RAM for the 47B.
Load balancing: Without auxiliary losses, some experts starve while others are overloaded. Active research area in 2025-26.
Fine-tuning fragility: MoE historically overfits during fine-tuning. Newer techniques (LoRA on shared layers, expert freezing) help.
Communication overhead: Multi-GPU expert parallelism adds latency from all-to-all token routing.
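The router described above is just a small linear layer plus top-k selection. A toy token-level sketch in plain Python — all shapes, names, and the trivial "experts" are illustrative, not any real model's implementation:

```python
import math
import random

def moe_layer(x, gate_w, experts, top_k=2):
    """Toy token-level top-k MoE routing.

    x: token hidden state (list of floats); gate_w: one router weight
    vector per expert; experts: callables standing in for small FFNs.
    """
    # Router: one score per expert, then keep the top-k indices
    logits = [sum(xi * wi for xi, wi in zip(x, w)) for w in gate_w]
    top = sorted(range(len(logits)), key=logits.__getitem__)[-top_k:]
    # Softmax over the selected experts only
    exps = [math.exp(logits[i]) for i in top]
    weights = [e / sum(exps) for e in exps]
    # Only the chosen experts run — this is the "active params" saving
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        for j, v in enumerate(experts[i](x)):
            out[j] += w * v
    return out

random.seed(0)
d, n_experts = 8, 4
gate_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
# Each "expert" here is just elementwise scaling by a fixed random vector
experts = [(lambda s: (lambda x: [si * xi for si, xi in zip(s, x)]))(
    [random.gauss(0, 1) for _ in range(d)]) for _ in range(n_experts)]
y = moe_layer([random.gauss(0, 1) for _ in range(d)], gate_w, experts)
```

With `n_experts=4, top_k=2`, half the expert parameters sit idle for this token — which is simultaneously the FLOP saving and the reason all experts must still be resident in memory.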
| Model | Total Params | Active | Experts | Routing | Context | Notes |
|---|---|---|---|---|---|---|
| Qwen3-235B-A22B | 235B | 22B | 128 (top-8) | Token-level | 128K | Hybrid thinking/non-thinking modes, 119+ languages |
| Qwen3-30B-A3B | 30B | 3B | 128 (top-8) | Token-level | 128K | Runnable on M4 Max 36GB at Q4_K_M |
| DeepSeek-V3.1 | 671B | 37B | 256 (top-8) | Fine-grained | 128K | Multi-head latent attention, strong code |
| DeepSeek-R1 | 671B | 37B | 256 | Fine-grained | 128K | Reasoning specialist, FP4 & 1.78-bit variants exist |
| Llama 4 Maverick | 400B | 17B | 128 (top-1) | Token-level | 1M | Meta's first MoE, multimodal, ultra-long context |
| GPT-5 | Undisclosed | — | — | Dynamic | 400K | OpenAI's first confirmed MoE architecture |
| GPT-OSS-120B | 117B | 5.1B | 128 (top-4) | — | 128K | Open-weight, MXFP4 fits on 1× H100 |
| Kimi K2 | 1T | 32B | MoE | — | 128K | Moonshot, trillion-parameter scale |
| Mixtral 8×22B | 141B | ~39B | 8 (top-2) | Token-level | 64K | Mistral, strong generalist |
| Mistral Large 3 | Undisclosed | — | — | — | 128K | Production frontier from Mistral |
Quantization maps FP16/FP32 weights to lower-precision integers, dramatically shrinking model size and speeding inference. The GGUF format (from llama.cpp) is the universal standard for local inference. Key families: legacy (Q4_0, Q8_0), K-quants (Q4_K_M, Q5_K_M, Q6_K), and I-quants (IQ4_XS). For GPU serving, AWQ and GPTQ offer higher throughput via specialized kernels.
| Quant | Bits/Weight | Size (7B) | PPL Δ (7B) | Quality | Use Case |
|---|---|---|---|---|---|
| Q8_0 | 8.0 | ~7.2 GB | +0.004 | Near-lossless | Baseline / when VRAM permits / code & RAG fidelity |
| Q6_K | 6.6 | ~5.5 GB | +0.009 | Excellent | Production sweet spot if memory allows |
| Q5_K_M | 5.7 | ~4.8 GB | +0.035 | Very Good | High quality with real savings; great for code & reasoning |
| Q4_K_M | 4.9 | ~4.1 GB | +0.054 | Good | ⭐ Most popular — best balance of quality/size/speed |
| IQ4_XS | 4.5 | ~3.8 GB | ~+0.06 | Good* | Max compression at 4-bit class; needs good imatrix |
| Q3_K_M | 3.9 | ~3.3 GB | +0.244 | Moderate | Tight VRAM only; noticeable quality loss |
| Q2_K | 3.4 | ~2.9 GB | +0.870 | Poor | Not recommended — extreme quality loss |
PPL Δ = perplexity increase vs FP16 baseline. Lower = better. K-quants use block + sub-block grouping for superior quality vs legacy types at the same bit-width. "M" = medium mixed precision (attention layers get higher precision).
A larger model at lower quant typically outperforms a smaller model at higher quant. Example: 32B @ Q4_K_M > 14B @ Q8_0 in most benchmarks. Always pick the biggest model that fits in VRAM at Q4_K_M or above.
General chat: Q4_K_M (bump to Q5 if inconsistent)
Coding: Q5_K_M or Q8_0 for fewer subtle errors
Reasoning/math: Q5_K_M+ stabilizes chain-of-thought
RAG/retrieval: Q5_K_M or Q8_0 for grounding accuracy
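The size column scales roughly linearly with parameter count, so fits for other model sizes can be estimated from the effective bits/weight. A rough sketch — effective bit-widths already include block metadata, but real GGUF files can still differ by a few percent from this estimate:

```python
def quant_size_gb(params_billion, bits_per_weight):
    """Approximate quantized model size in GB: params × effective bits/weight.

    Effective bit-widths (e.g. Q4_K_M ≈ 4.9) come from the table above.
    """
    return params_billion * bits_per_weight / 8

# 32B at Q4_K_M → roughly 19–20 GB, which is why it fits a 24GB GPU
# with headroom left for KV cache
size_32b = quant_size_gb(32, 4.9)   # ≈ 19.6 GB
```

The same arithmetic drives the "biggest model that fits" rule: a 32B @ Q4_K_M (~20 GB) and a 14B @ Q8_0 (~14 GB) both fit 24 GB, and the larger model usually wins on quality.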
GGUF: Best for llama.cpp/Ollama on CPU+GPU. Universal format.
AWQ: GPU-only, activation-aware. Best with Marlin kernels in vLLM (up to ~741 tok/s in published benchmarks).
GPTQ: GPU-only, Hessian-based. Slightly lower quality than AWQ.
MXFP4/NVFP4: Native 4-bit on Blackwell GPUs. Up to 4× cost reduction vs FP16.
MoE models benefit enormously from quantization because you need ALL experts in memory but only activate a few. Qwen3-30B-A3B at Q4_K_M fits in ~18GB — runnable on M4 Max 36GB with room for KV cache. DeepSeek-R1 has community 1.78-bit variants for extreme compression.
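The leftover memory goes to the KV cache, whose size is easy to estimate from the attention configuration. A sketch with illustrative GQA-style dimensions (not official Qwen3 specs):

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_el=2):
    """Approximate KV cache size: 2 (K and V) × layers × kv_heads ×
    head_dim × context tokens × bytes per element (FP16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_el / 1e9

# Illustrative: 48 layers, 8 KV heads, head_dim 128, 32K context at FP16
gb = kv_cache_gb(48, 8, 128, 32768)   # ≈ 6.4 GB
```

This is why an ~18 GB model on a 36 GB machine is comfortable at moderate context but gets tight as context grows — the cache scales linearly with tokens, and quantized KV (e.g. 8-bit) halves it.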
| Model | Provider | Architecture | Context | Input $/1M | Output $/1M | Strength |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | Dense | 200K | $15 | $75 | Most capable reasoning |
| Claude Sonnet 4.6 | Anthropic | Dense | 200K | $3 | $15 | Best quality/cost ratio |
| Claude Haiku 4.5 | Anthropic | Dense | 200K | $0.25 | $1.25 | Speed + cost leader |
| GPT-5.1 | OpenAI | MoE | 400K | $1.25 | $10 | Reasoning, agentic |
| GPT-4.1 | OpenAI | Dense | 1M | $2 | $8 | Huge context, fast |
| Gemini 2.5 Pro | Google | MoE | 1M | $1.25 | $10 | Multimodal, massive context |
| Gemini 2.5 Flash | Google | MoE | 1M | $0.15 | $0.60 | Best price-performance |
| Grok 4.1 Fast | xAI | — | 2M | $0.20 | — | Huge context, speed |
| Model | Type | Total / Active | Context | Local Fit | Strength |
|---|---|---|---|---|---|
| Qwen3-235B-A22B | MoE | 235B / 22B | 128K | Multi-GPU (Q4: ~120GB) | Top-tier open, hybrid thinking |
| Qwen3-32B | Dense | 32B / 32B | 128K | 24GB GPU @ Q4 | Strong all-rounder, great for code |
| Qwen3-30B-A3B | MoE | 30B / 3B | 128K | M4 Max 36GB @ Q4 | Fast MoE for lightweight local |
| DeepSeek-V3.1 | MoE | 671B / 37B | 128K | Cluster only | Top coding, reasoning |
| DeepSeek-R1 | MoE | 671B / 37B | 128K | Cluster (1.78-bit fits 24GB!) | Reasoning specialist |
| Llama 4 Maverick | MoE | 400B / 17B | 1M | Multi-GPU | Multimodal, 1M context |
| Llama 3.3 70B | Dense | 70B | 128K | 2× 24GB GPU @ Q4 | Mature, well-supported |
| Mistral Large 3 | MoE | — | 128K | Multi-GPU | Strong European frontier |
| GPT-OSS-120B | MoE | 117B / 5.1B | 128K | 1× H100 @ MXFP4 | OpenAI's first open model |
| Runtime | Best For | Format | Multi-GPU | Concurrency | Setup | Key Feature |
|---|---|---|---|---|---|---|
| Ollama | Dev workflow, single-user, prototyping | GGUF | No TP | Poor (serial) | 1 min | Docker-like UX, model hub, OpenAI-compat API |
| llama.cpp | Edge, CPU, max control, single-user | GGUF | Layer split + RPC | Low | Moderate | MCP client (Mar 2026), pure C++, runs anywhere |
| vLLM | Production serving, multi-user, throughput | HF, AWQ, GPTQ, GGUF | Full TP/PP | Excellent | Complex | PagedAttention, continuous batching, 35× RPS vs llama.cpp |
| LM Studio | Desktop GUI, beginners, quick eval | GGUF | No | Low | 1 min | Beautiful UI, model browser, Vulkan for iGPU |
| SGLang | Structured generation, multi-step reasoning | HF, AWQ | Yes | High | Complex | Program-level LLM orchestration on vLLM backend |
| ExLlamaV2 | NVIDIA GPU max perf, custom quant | EXL2, GPTQ | Multi-GPU split | Moderate | Moderate | Per-layer quant control, fast GPU inference |
| TensorRT-LLM | Enterprise NVIDIA, max throughput | Compiled | Full | Excellent | Hard | NVIDIA stack, Dynamo, FP4/FP8, NIM |
| LocalAI | OpenAI drop-in replacement, multimodal | GGUF, others | Limited | Moderate | Moderate | Broadest API compat, image/audio/embedding |
| Provider | Tier | Key Models | Differentiator | Pricing Tier |
|---|---|---|---|---|
| Anthropic | Frontier | Claude Opus/Sonnet/Haiku 4.x | Best coding (SWE-bench 80.9%), 200K context, safety-focused | $0.25–$15/M input |
| OpenAI | Frontier | GPT-5.1, GPT-4.1, GPT-5-mini | Largest ecosystem, 400K–1M context, batch API 50% off | $0.25–$2/M input |
| Google | Frontier | Gemini 2.5 Pro/Flash/Flash-Lite | 1M context, multimodal native, free tier (1K req/day) | $0.10–$1.25/M input |
| xAI | Frontier | Grok 4.1 Fast | 2M token context, speed-focused | $0.20/M input |
| DeepSeek | Value | V3.1, R1 | Reasoning at $0.55/M, open-weight, MoE | $0.14–$0.55/M input |
| Mistral AI | Value | Mistral Large 3, Codestral | Strong European option, open models, competitive pricing | $0.15–$2/M input |
| Fireworks AI | Infra | OSS models (Llama, Qwen, etc.) | Fast inference for open models, serverless, competitive | $0.10–$0.90/M |
| Together AI | Infra | OSS models + fine-tuning | Fine-tuning platform, serverless & dedicated | $0.10–$0.88/M |
| Groq | Speed | Llama, Mixtral, etc. | LPU hardware — lowest latency for open models | Competitive |
| SiliconFlow | Value | DeepSeek, Qwen, MiniMax | 2.3× faster inference, lowest-cost open model serving | Industry-low |
| Azure OpenAI | Enterprise | OpenAI models | VPC, HIPAA/SOC/ISO, 99.9% SLA, 27 regions | OpenAI pricing + regional uplift |
| AWS Bedrock | Enterprise | Claude, Llama, Mistral, Titan | Multi-model hub, AWS integration, guardrails | Varies by model |
Local LLM performance is primarily gated by memory bandwidth (tok/s during decode) and total VRAM/RAM (determines max model size). Compute matters more for prefill (TTFT) than decode (TPOT).
| Hardware | VRAM/RAM | Bandwidth | Max Model (Q4) | Sweet Spot |
|---|---|---|---|---|
| RTX 3090 | 24 GB | 936 GB/s | ~32B | Qwen3-32B Q4_K_M, great decode speed |
| RTX 4090 | 24 GB | 1,008 GB/s | ~32B | Same fit, faster prefill, FP8 support |
| RTX 5090 | 32 GB | 1,792 GB/s | ~50B | Bigger models + faster, FP4 native |
| 2× RTX 3090 | 48 GB | 1,872 GB/s | ~70B | Llama 70B Q4 via vLLM TP=2 or llama.cpp split |
| M4 Max (36 GB) | 36 GB unified | 546 GB/s | ~32B dense, ~30B MoE | Qwen3-32B Q4 or Qwen3-30B-A3B Q4_K_M |
| M4 Max (48 GB) | 48 GB unified | 546 GB/s | ~70B Q3, ~32B Q8 | Bigger context or higher quant |
| M5 Max (48 GB) | 48 GB unified | 614 GB/s | ~50B Q4 | 12% bandwidth uplift vs M4 Max; watch for 64GB config |
| 96 GB Mac/PC | 96 GB | varies | ~120B Q4 / Qwen3-235B Q2-3 | Big MoE models with aggressive quant |
| H100 (80 GB) | 80 GB HBM3 | 3,350 GB/s | ~120B | Production serving, GPT-OSS-120B @ MXFP4 |
| H200 (141 GB) | 141 GB HBM3e | 4,800 GB/s | ~200B+ | DeepSeek-V3 per-node, production frontier |
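Since decode is memory-bound, the bandwidth column above predicts decode speed directly: every generated token streams all active weights through memory once. A back-of-envelope sketch — the efficiency factor is a loose assumption (real systems typically land in the 0.5–0.7 range), not a measured constant:

```python
def decode_toks_per_sec(bandwidth_gbps, model_size_gb, efficiency=0.6):
    """Upper-bound decode estimate: tok/s ≈ bandwidth / bytes read per
    token (the quantized active weights), scaled by an assumed
    real-world efficiency factor."""
    return bandwidth_gbps / model_size_gb * efficiency

# RTX 3090 (936 GB/s) with a 32B model at Q4_K_M (~20 GB of weights)
tps = decode_toks_per_sec(936, 20)   # ≈ 28 tok/s
```

The same formula explains why MoE shines locally: Qwen3-30B-A3B reads only the ~3B active parameters per token, so it decodes several times faster than a dense 30B on identical hardware despite similar total size on disk.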
Gateways sit between your application and model providers, adding unified APIs, routing, cost tracking, and caching. Three distinct patterns: unified routers (aggregate providers behind one key), inference platforms (host open models as a service), and OSS proxies (self-hosted middleware).
900K+ models, 200K+ datasets, Spaces for demos. The de facto repository for open-weight models — every local runtime pulls from here. Free to use; paid private storage. GGUF quantizations (bartowski, unsloth, mlx-community) are distributed via Hub.
Instant API access to thousands of Hub models with zero setup. Rate-limited free tier; pay-per-request at scale. No GPU provisioning. Great for prototyping and evaluation but not for sustained production throughput.
Deploy any Hub model on dedicated hardware (T4, A10G, A100, H100) in your chosen cloud region. Auto-scaling, private VPC option, OpenAI-compat API. Per-hour from ~$0.60/hr (T4) to $6+/hr (A100). Best for custom fine-tuned models needing stable SLAs.
HuggingFace's production inference server. Continuous batching, PagedAttention, FlashAttention-2, quantization (GPTQ/AWQ/bitsandbytes), multi-GPU tensor parallelism. Powers HF Inference Endpoints. Self-host via Docker: ghcr.io/huggingface/text-generation-inference
| Gateway | Type | Models / Providers | Key Feature | Pricing |
|---|---|---|---|---|
| OpenRouter | Router | 300+ models, 30+ providers | Single API key for all providers, real-time price comparison, auto-fallback, free-model tier, provider rankings by latency | Pass-through + small markup |
| LiteLLM (proxy) | OSS Proxy | 100+ providers | Self-hosted, per-team cost tracking, rate limits per key, Redis caching, load balancing across providers, budget alerts | Free OSS; hosted tier available |
| Portkey AI | Enterprise Router | 250+ models | Observability, guardrails, semantic caching, canary deployments, prompt versioning, SOC2 | Free tier; $49+/mo |
| Helicone | Observability | Any OpenAI-compat | Drop-in logging proxy (one-line header change), cost analytics, prompt management, A/B testing, OSS | Free tier; usage-based |
| AI/ML API | Router | 200+ models | Low-cost aggregator focused on open models, free trial credits, OpenAI-compat | Competitive; usage-based |
| Platform | Model Selection | Billing | Standout | Use Case |
|---|---|---|---|---|
| Ollama | Hub models via ollama pull | Free (self-hosted) | See §05 Local Runtimes — Docker-like UX, OpenAI-compat API, model library at ollama.com | Local dev; see §05 for details |
| Replicate | 100K+ community models | Per-second GPU time | Cog framework for packaging any model; massive community ecosystem; image/video/audio supported | Niche models, rapid prototyping, image/video gen |
| Cloudflare Workers AI | 50+ curated models | Per 1K neurons; 10K/day free | Edge inference <50ms globally, no cold starts, serverless-native, zero GPU provisioning | Latency-critical, edge apps, low-volume prod |
| Modal | Any model (custom containers) | Per-second GPU time | Python-native serverless GPU, ~100ms cold starts, great for batch pipelines and fine-tuning jobs | Batch inference, research, custom pipelines |
| Baseten | Any HF model | Per-hour + compute | Fast cold starts (~2s via Truss framework), fine-tuning support, production ML APIs | Custom fine-tuned model APIs with SLA needs |
Provider fallback: `"route": "fallback"` tries providers in order
Load balance: `"route": "load-balance"` distributes across providers
Provider pinning: `"provider": {"order": ["Groq","Together"]}`
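Assembled into one request body, the routing options above look like the sketch below (field names as given in the text — treat the gateway's own API reference as authoritative for the exact schema):

```python
# Illustrative OpenRouter-style request body combining routing options.
payload = {
    "model": "meta-llama/llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello"}],
    "route": "fallback",                           # try providers in order
    "provider": {"order": ["Groq", "Together"]},   # preferred provider order
}
```

Sent as the JSON body of a normal chat-completions POST, this pins Groq first and falls back to Together if it errors or is rate-limited.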
OpenRouter uses provider/model-name slugs:
anthropic/claude-sonnet-4-6
openai/gpt-4.1
google/gemini-2.5-pro
deepseek/deepseek-r1
meta-llama/llama-3.3-70b
qwen/qwen3-235b-a22b
OpenRouter maintains a set of free models (rate-limited): Llama 3.3 70B, DeepSeek variants, Gemma 3, Mistral 7B, and others. Append `:free` to the model slug to opt in. Rate limits apply per IP. Useful for dev/CI pipelines.
Change two lines — base URL and model slug — to route any OpenAI SDK call through a gateway:
base_url="https://openrouter.ai/api/v1"
model="anthropic/claude-sonnet-4-6"
Works identically with LiteLLM proxy at localhost:4000.
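A minimal sketch of that two-line change with the official `openai` Python SDK (v1+); the environment variable name is illustrative, and the key must be a gateway key, not an OpenAI key:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # change 1: gateway URL
    api_key=os.environ["OPENROUTER_API_KEY"],  # gateway credential
)
resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",       # change 2: provider/model slug
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

Swap `base_url` for `http://localhost:4000` and the same code runs against a self-hosted LiteLLM proxy; everything else in the SDK call is untouched.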