Token tracking — a deep dive

How we account for cache reads, cache writes, and streamed responses.

2026-04-12 · Engineering · 7 min read

Token usage looks simple from the outside: input plus output, multiply by price, done. In practice, modern model APIs surface four counters that are each billed differently: input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens.

The four counters

input_tokens — what you sent, not counting cached prefixes
output_tokens — what came back
cache_creation_input_tokens — first time a long prefix is cached (priced ~1.25× input)
cache_read_input_tokens — cache hit on a prefix (priced ~0.1× input)

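A naive integration adds the counters together and bills the sum at a single rate, which overcharges for cache reads. We bill each counter separately at the model's real per-token price. On a Sonnet 4.5 chat with a 50k-token cached system prompt, effective input cost drops by ~85% compared with paying the full input rate for every token.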
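To make the arithmetic concrete, here is a minimal sketch of per-counter billing next to one reading of the naive approach. The prices, the Usage class, and both function names are assumptions for illustration, not our production code or real list prices; the 1.25× and 0.1× multipliers are the approximate figures from the list above.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens (illustration only).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00
CACHE_WRITE_MULTIPLIER = 1.25  # approximate, relative to the input price
CACHE_READ_MULTIPLIER = 0.10   # approximate, relative to the input price


@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_creation_input_tokens: int = 0
    cache_read_input_tokens: int = 0


def cost_usd(u: Usage) -> float:
    """Bill each counter at its own rate."""
    input_rate = INPUT_PRICE_PER_MTOK / 1_000_000
    output_rate = OUTPUT_PRICE_PER_MTOK / 1_000_000
    return (
        u.input_tokens * input_rate
        + u.cache_creation_input_tokens * input_rate * CACHE_WRITE_MULTIPLIER
        + u.cache_read_input_tokens * input_rate * CACHE_READ_MULTIPLIER
        + u.output_tokens * output_rate
    )


def naive_cost_usd(u: Usage) -> float:
    """Treat every input-side token as a full-price input token."""
    input_rate = INPUT_PRICE_PER_MTOK / 1_000_000
    output_rate = OUTPUT_PRICE_PER_MTOK / 1_000_000
    input_side = (
        u.input_tokens
        + u.cache_creation_input_tokens
        + u.cache_read_input_tokens
    )
    return input_side * input_rate + u.output_tokens * output_rate


if __name__ == "__main__":
    # A chat turn that hits a 50k-token cached prefix.
    turn = Usage(input_tokens=2_000, output_tokens=500,
                 cache_read_input_tokens=50_000)
    print(f"per-counter: ${cost_usd(turn):.4f}")
    print(f"naive:       ${naive_cost_usd(turn):.4f}")
```

The exact saving depends on how much fresh input accompanies the cached prefix, so the numbers printed here are illustrative rather than a reproduction of the figure quoted above.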

Streaming complicates things

For SSE, the upstream sends the input count in message_start and the running output count in successive message_delta events. We track both and charge once at message_stop.

json

{
  "type": "message_delta",
  "delta": { "stop_reason": "end_turn" },
  "usage": { "output_tokens": 412 }
}

If the connection dies before message_stop, we still bill what was produced, because partial output costs money upstream too. Aborts that happen before any output arrives (output_tokens is still 0) are free.
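The streaming rules above can be sketched as a small accumulator. This is a simplified illustration, not our actual handler: StreamUsage, track_stream, and billable are hypothetical names, the events are assumed to be already decoded into dicts, the input count is assumed to sit under message.usage in message_start, and output_tokens in message_delta is treated as a running total rather than an increment.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StreamUsage:
    input_tokens: int = 0
    output_tokens: int = 0
    completed: bool = False  # True once message_stop was seen


def track_stream(events: Iterable[dict]) -> StreamUsage:
    """Fold a sequence of decoded SSE events into one usage record."""
    usage = StreamUsage()
    try:
        for event in events:
            kind = event.get("type")
            if kind == "message_start":
                # Assumed shape: input count under message.usage.
                start = event.get("message", {}).get("usage", {})
                usage.input_tokens = start.get("input_tokens", 0)
            elif kind == "message_delta":
                # Assumed to be a running total, so keep the latest value.
                delta = event.get("usage", {})
                usage.output_tokens = delta.get(
                    "output_tokens", usage.output_tokens
                )
            elif kind == "message_stop":
                usage.completed = True
                break
    except ConnectionError:
        # Connection died mid-stream: keep whatever was counted so far.
        pass
    return usage


def billable(usage: StreamUsage) -> Optional[StreamUsage]:
    """Return the usage to charge, or None when the request is free."""
    if usage.completed or usage.output_tokens > 0:
        return usage
    return None
```

Returning None for the free case keeps the decision in one place, so the caller writes a charge only when billable() hands back a record.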

Receipts

Every request lands a row in api_usage_logs with each counter and the model price in effect at the time of the request. The dashboard exposes daily and per-model breakdowns; the /ai/v1/usage/stats endpoint returns the same numbers as JSON.
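As a rough illustration of how rows like these roll up into daily and per-model breakdowns, here is a sketch. The UsageRow fields and the breakdown function are hypothetical: the real api_usage_logs schema and the /ai/v1/usage/stats response shape are not shown in this post, and the cache multipliers are the approximate ones quoted earlier.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class UsageRow:
    # Hypothetical columns: four counters plus a price snapshot.
    day: date
    model: str
    input_tokens: int
    output_tokens: int
    cache_creation_input_tokens: int
    cache_read_input_tokens: int
    input_price_per_mtok: float   # USD, as stored with the row
    output_price_per_mtok: float  # USD, as stored with the row

    @property
    def cost_usd(self) -> float:
        # Weight cache writes and reads by the approximate multipliers.
        input_side = (
            self.input_tokens
            + 1.25 * self.cache_creation_input_tokens
            + 0.10 * self.cache_read_input_tokens
        )
        return (
            input_side * self.input_price_per_mtok
            + self.output_tokens * self.output_price_per_mtok
        ) / 1_000_000


def breakdown(rows: Iterable[UsageRow]) -> dict:
    """Sum request count and cost per (day, model)."""
    totals = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0})
    for row in rows:
        bucket = totals[(row.day, row.model)]
        bucket["requests"] += 1
        bucket["cost_usd"] += row.cost_usd
    return dict(totals)
```

Because each row carries its own price snapshot in this sketch, a later price change would not alter the cost computed for older rows.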
