| llm / go / devops / performance / benchmark

TTFT varies 13x across LLM providers. Here are the numbers.

Hourly probes across 15 frontier models from OpenAI, Anthropic, Google, DeepSeek, and xAI. Median TTFT ranges from 321ms to 4,226ms. Raw data included.

The claim

Every LLM provider publishes throughput numbers from ideal conditions. Nobody publishes what your production traffic actually experiences: time to first token (TTFT) measured continuously from a fixed location.

I set up an automated benchmark that probes 15 frontier models every hour and publishes all raw data. After 30+ hours of data across 5 providers, the results are clear: TTFT varies 13x depending on which provider you pick.

The setup

Every hour, a probe sends a minimal request (“Hi”, max 20 tokens) to each model via OpenRouter. The prompt is intentionally tiny to isolate infrastructure latency from model reasoning time.
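
To make that concrete, here is a minimal Go sketch of such a probe. This is my own approximation, not the actual llmprobe source; it assumes OpenRouter's OpenAI-compatible chat completions endpoint, its standard SSE framing, and an OPENROUTER_API_KEY environment variable.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"
)

func main() {
	// Minimal streaming probe: tiny prompt, capped output, time the first SSE data event.
	body, _ := json.Marshal(map[string]any{
		"model":      "google/gemini-2.5-flash-lite",
		"messages":   []map[string]string{{"role": "user", "content": "Hi"}},
		"max_tokens": 20,
		"stream":     true,
	})

	req, _ := http.NewRequest("POST", "https://openrouter.ai/api/v1/chat/completions", bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENROUTER_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	start := time.Now()
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var ttft time.Duration
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data: ") || strings.Contains(line, "[DONE]") {
			continue
		}
		if ttft == 0 {
			ttft = time.Since(start) // first streamed chunk = time to first token
		}
	}
	fmt.Printf("TTFT: %v, total latency: %v\n", ttft, time.Since(start))
}
```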

Models tested:

  • OpenAI: GPT-5.4, GPT-5.5, GPT-OSS-120B
  • Anthropic: Claude Sonnet 4.6, Claude Opus 4.6, Claude Opus 4.7
  • Google: Gemini 2.5 Flash, Gemini 2.5 Flash Lite, Gemini 2.0 Flash
  • DeepSeek: DeepSeek v3.2, DeepSeek v4 Flash, DeepSeek v4 Pro
  • xAI: Grok 4 Fast, Grok 4.1 Fast, Grok 4.3

The numbers

Model                         Median TTFT   Median Throughput   Median Latency   Samples
google/gemini-2.5-flash-lite        321ms         191.9 tok/s            395ms        17
google/gemini-2.5-flash             412ms         235.6 tok/s            464ms        17
google/gemini-2.0-flash-001         405ms         203.2 tok/s            468ms        17
openai/gpt-5.4                      912ms          44.8 tok/s          1,147ms        17
openai/gpt-5.5                    1,158ms          44.0 tok/s          1,501ms        17
openai/gpt-oss-120b               1,491ms       1,977.0 tok/s          1,584ms         2
anthropic/claude-opus-4.6         1,709ms          70.4 tok/s          1,939ms        17
deepseek/deepseek-v3.2            1,734ms          24.7 tok/s          2,372ms        17
anthropic/claude-sonnet-4.6       2,120ms          44.3 tok/s          2,842ms        17
anthropic/claude-opus-4.7         2,494ms          94.9 tok/s          2,599ms        17
x-ai/grok-4.1-fast                2,545ms       2,985.4 tok/s          2,593ms        17
deepseek/deepseek-v4-flash        3,122ms         251.6 tok/s          3,560ms        16
deepseek/deepseek-v4-pro          3,411ms         108.1 tok/s          3,816ms         3
x-ai/grok-4-fast                  3,618ms       1,338.7 tok/s          3,682ms        17
x-ai/grok-4.3                     4,226ms       1,114.8 tok/s          4,328ms        17

What this means

Google is the TTFT winner by a wide margin. All three Gemini models respond in under 500ms at the median. Gemini 2.5 Flash Lite at 321ms is the fastest first token across all 15 models.

OpenAI sits in the middle. GPT-5.4 at 912ms and GPT-5.5 at 1,158ms are solid but not exceptional.

Anthropic clusters between 1.7s and 2.5s. Claude Opus 4.6 at 1,709ms is reasonable. Claude Opus 4.7 at 2,494ms and Sonnet 4.6 at 2,120ms are notably slower on first token.

xAI and DeepSeek are the slowest to start streaming. Grok 4.3 takes a median 4,226ms before the first token arrives. That is 13x slower than Gemini Flash Lite.

Fastest TTFT != fastest generation

Throughput tells a completely different story. The xAI Grok models are the slowest to start but produce tokens at 1,000-3,000 tok/s once they get going. Grok 4.1 Fast at 2,985 tok/s is 121x faster than DeepSeek v3.2 at 24.7 tok/s.

If your use case is batch processing where TTFT does not matter, xAI and DeepSeek v4 Flash are strong choices. If your use case is interactive chat where users stare at a loading spinner, Google wins.
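
A rough way to see where that tradeoff flips is the back-of-envelope model total time ≈ TTFT + output tokens / throughput. The sketch below plugs in the medians from the table; the formula is my simplification, not part of the benchmark's methodology.

```go
package main

import "fmt"

// Back-of-envelope model: total time ~= TTFT + outputTokens/throughput.
func estimate(ttftMs, tokPerSec float64, outputTokens int) float64 {
	return ttftMs/1000.0 + float64(outputTokens)/tokPerSec
}

func main() {
	for _, n := range []int{50, 500, 5000} {
		gemini := estimate(321, 191.9, n)  // gemini-2.5-flash-lite medians
		grok := estimate(2545, 2985.4, n)  // grok-4.1-fast medians
		fmt.Printf("%5d tokens: gemini %.1fs, grok %.1fs\n", n, gemini, grok)
	}
}
```

Under this simplified model the crossover sits around 450 output tokens: below that, Gemini's head start wins; above it, Grok's raw throughput does.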

Why this matters for production

If you hardcode a 3-second timeout on your LLM calls, 5 of the 15 models in this benchmark would regularly fail. At 2 seconds, 8 of 15 exceed the timeout at their median end-to-end latency.

Most teams set timeouts based on what felt right during development with one provider. These numbers show that switching providers (or even models within the same provider) can push you past your timeout without changing any code.
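
One way out is to stop treating the timeout as a constant and make it part of the model configuration, derived from measured latency rather than gut feeling. A Go sketch; the budgets below are hypothetical, roughly 3x the median latencies measured above.

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// Hypothetical per-model budgets, roughly 3x the measured median latency,
// instead of one hardcoded constant shared by every provider.
var modelTimeout = map[string]time.Duration{
	"google/gemini-2.5-flash-lite": 1500 * time.Millisecond,
	"openai/gpt-5.4":               3500 * time.Millisecond,
	"x-ai/grok-4.3":                13 * time.Second,
}

// perModelContext returns a context whose deadline scales with the model's
// observed latency; unknown models get a conservative default.
func perModelContext(parent context.Context, model string) (context.Context, context.CancelFunc) {
	timeout, ok := modelTimeout[model]
	if !ok {
		timeout = 10 * time.Second
	}
	return context.WithTimeout(parent, timeout)
}

func main() {
	ctx, cancel := perModelContext(context.Background(), "x-ai/grok-4.3")
	defer cancel()

	// Any request built with this context (including the streaming body read)
	// is aborted when the per-model deadline passes.
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://openrouter.ai/api/v1/chat/completions", nil)
	if err != nil {
		panic(err)
	}
	_ = req
}
```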

Live dashboard and raw data

This benchmark runs continuously. The live dashboard with charts is at bench.jonathanwrede.de.

All raw data is published as JSONL and freely available at github.com/Jwrede/llm-bench-data. The model list updates daily based on OpenRouter’s popularity rankings.
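
If you want to recompute the medians yourself, something like the following works against that JSONL dump. The field names here are guesses, so check the repository for the actual schema.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

// Guessed record shape; the real field names live in the published JSONL.
type probe struct {
	Model  string  `json:"model"`
	TTFTMs float64 `json:"ttft_ms"`
}

func main() {
	f, err := os.Open("probes.jsonl")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	byModel := map[string][]float64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var p probe
		if err := json.Unmarshal(sc.Bytes(), &p); err != nil {
			continue // skip malformed lines
		}
		byModel[p.Model] = append(byModel[p.Model], p.TTFTMs)
	}

	for model, samples := range byModel {
		sort.Float64s(samples)
		// Upper median is good enough for a quick look.
		fmt.Printf("%-35s median TTFT %.0fms over %d samples\n",
			model, samples[len(samples)/2], len(samples))
	}
}
```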

The probing infrastructure is built with llmprobe, an open-source Go CLI that measures TTFT, latency, and throughput using raw HTTP and SSE parsing (no SDKs). It also works as an MCP server for Claude Code, so you can check provider health from your editor.