Token tracking — a deep dive

How we account for cache reads, cache writes, and streamed responses.

Engineering · 7 min read

Token usage looks simple from the outside: input plus output, multiply by price, done. In practice, modern model APIs surface four counters that each bill differently: input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens.

The four counters

  • input_tokens — what you sent, not counting cached prefixes
  • output_tokens — what came back
  • cache_creation_input_tokens — first time a long prefix is cached (priced ~1.25× input)
  • cache_read_input_tokens — cache hit on a prefix (priced ~0.1× input)

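A naive integration adds them all up, bills the total at a single rate, and overcharges. We bill each counter separately at the model's real per-token price. On a Sonnet 4.5 chat with a 50k-token cached system prompt, the effective input cost of a cache hit drops by ~85% compared with paying the full input price for the whole prompt.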
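To make the per-counter billing concrete, here is a minimal sketch in Python. The `Usage` dataclass, the helper functions, the per-million-token prices, and the 3,000 fresh input tokens are illustrative assumptions, not our production code or an official price list; the 1.25× and 0.1× multipliers are the approximate ratios from the list above.

```python
from dataclasses import dataclass

# Illustrative prices in USD per million tokens (assumed, not authoritative).
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00
CACHE_WRITE_MULT = 1.25  # approximate ratio from the counter list
CACHE_READ_MULT = 0.10   # approximate ratio from the counter list


@dataclass(frozen=True)
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_creation_input_tokens: int = 0
    cache_read_input_tokens: int = 0


def input_cost_usd(u: Usage) -> float:
    """Input-side cost with each counter billed at its own rate."""
    weighted = (
        u.input_tokens
        + u.cache_creation_input_tokens * CACHE_WRITE_MULT
        + u.cache_read_input_tokens * CACHE_READ_MULT
    )
    return weighted * INPUT_PRICE / 1_000_000


def naive_input_cost_usd(u: Usage) -> float:
    """Naive version: every input-side token billed at the full input price."""
    total = u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens
    return total * INPUT_PRICE / 1_000_000


def total_cost_usd(u: Usage) -> float:
    """Input-side cost plus output tokens at the output price."""
    return input_cost_usd(u) + u.output_tokens * OUTPUT_PRICE / 1_000_000


# A cache hit on a 50k-token system prompt plus 3,000 fresh input tokens (assumed).
hit = Usage(input_tokens=3_000, output_tokens=400, cache_read_input_tokens=50_000)
saving = 1 - input_cost_usd(hit) / naive_input_cost_usd(hit)
print(f"input-side saving: {saving:.1%}")  # prints "input-side saving: 84.9%"
```

With these assumed numbers the weighted input is 3,000 + 50,000 × 0.1 = 8,000 token-equivalents against 53,000 at full price, which is where a figure close to 85% comes from. A larger share of fresh input tokens would shrink the saving.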

Streaming complicates things

For SSE, the upstream sends the input count in message_start and a running (cumulative) output count in successive message_delta events. We keep the input count and the latest output count, then charge once at message_stop.

```json
{
  "type": "message_delta",
  "delta": { "stop_reason": "end_turn" },
  "usage": { "output_tokens": 412 }
}
```
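The bookkeeping described above can be sketched as a small accumulator. The `StreamUsage` class and its `feed` method are hypothetical names for illustration, and the event shapes are assumed to match the message_delta example above, with message_start carrying its usage under a `message` key. Treat it as a sketch rather than our production handler.

```python
import json
from dataclasses import dataclass


@dataclass
class StreamUsage:
    """Tracks token counters across the events of one streamed response."""
    input_tokens: int = 0
    output_tokens: int = 0
    finished: bool = False

    def feed(self, raw_event: str) -> None:
        event = json.loads(raw_event)
        kind = event.get("type")
        if kind == "message_start":
            # Assumed shape: the input count arrives once, nested under "message".
            usage = event.get("message", {}).get("usage", {})
            self.input_tokens = usage.get("input_tokens", 0)
        elif kind == "message_delta":
            # The output count is a running total, so keep the latest value
            # instead of summing across events.
            usage = event.get("usage", {})
            self.output_tokens = usage.get("output_tokens", self.output_tokens)
        elif kind == "message_stop":
            # The stream completed; this is the point where the charge is made.
            self.finished = True


events = [
    '{"type": "message_start", "message": {"usage": {"input_tokens": 1200}}}',
    '{"type": "message_delta", "delta": {}, "usage": {"output_tokens": 150}}',
    '{"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 412}}',
    '{"type": "message_stop"}',
]
tracker = StreamUsage()
for raw in events:
    tracker.feed(raw)
print(tracker)  # StreamUsage(input_tokens=1200, output_tokens=412, finished=True)
```

Overwriting rather than adding is the important design choice here: summing the two deltas in this example would report 562 output tokens instead of 412.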

If the connection dies before message_stop, we still bill what was produced, because the upstream charges for partial output too. Aborts during input (before any output_tokens > 0) are free.
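The abort rule above reduces to a small decision function. `billable_usage`, `Billable`, and the parameter names are hypothetical. Two details are assumptions rather than something this post states: that input tokens are charged together with partial output, and that a free abort is represented as zeroed counters.

```python
from typing import NamedTuple


class Billable(NamedTuple):
    input_tokens: int
    output_tokens: int


def billable_usage(input_tokens: int, output_tokens: int, saw_message_stop: bool) -> Billable:
    """Return the counters to charge for a stream that completed or was cut off."""
    if saw_message_stop or output_tokens > 0:
        # Completed, or cut off mid-output: bill what was produced.
        # Assumption: the input tokens are billed along with the partial output.
        return Billable(input_tokens, output_tokens)
    # Aborted before any output was produced: free.
    return Billable(0, 0)


print(billable_usage(1200, 412, True))   # Billable(input_tokens=1200, output_tokens=412)
print(billable_usage(1200, 87, False))   # Billable(input_tokens=1200, output_tokens=87)
print(billable_usage(1200, 0, False))    # Billable(input_tokens=0, output_tokens=0)
```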

Receipts

Every request writes a row to api_usage_logs with each counter and the model price in effect at the time of the request. The dashboard exposes daily and per-model breakdowns; the /ai/v1/usage/stats endpoint gives you the same numbers as JSON.
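As a sketch of what such a row and a daily, per-model breakdown could look like, here is an in-memory SQLite example. Only the table name api_usage_logs comes from this post; every column name, the model string, the timestamps, and the prices are hypothetical, and the real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE api_usage_logs (
        created_at TEXT NOT NULL,                     -- ISO 8601 timestamp
        model TEXT NOT NULL,
        input_tokens INTEGER NOT NULL,
        output_tokens INTEGER NOT NULL,
        cache_creation_input_tokens INTEGER NOT NULL,
        cache_read_input_tokens INTEGER NOT NULL,
        input_price_per_mtok REAL NOT NULL,           -- price stored with the row
        output_price_per_mtok REAL NOT NULL
    )
    """
)

# Two made-up requests: a cache write followed by a cache hit.
rows = [
    ("2025-01-01T09:00:00Z", "example-model", 3000, 400, 50000, 0, 3.0, 15.0),
    ("2025-01-01T09:05:00Z", "example-model", 3000, 412, 0, 50000, 3.0, 15.0),
]
conn.executemany("INSERT INTO api_usage_logs VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Daily, per-model totals for each counter.
breakdown = conn.execute(
    """
    SELECT substr(created_at, 1, 10) AS day, model,
           COUNT(*) AS requests,
           SUM(input_tokens), SUM(output_tokens),
           SUM(cache_creation_input_tokens), SUM(cache_read_input_tokens)
    FROM api_usage_logs
    GROUP BY day, model
    ORDER BY day, model
    """
).fetchall()
print(breakdown)
# [('2025-01-01', 'example-model', 2, 6000, 812, 50000, 50000)]
```

Storing the price on each row, rather than joining against a price table later, means a historical receipt keeps the rate that applied to that request even after prices change.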