Token tracking — a deep dive
How we account for cache reads, cache writes, and streamed responses.
Engineering · 7 min read
Token usage looks simple from the outside: input plus output, multiply by price, done. In practice, modern model APIs surface four counters that are each priced differently: input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens.
The four counters
- input_tokens — what you sent, not counting cached prefixes
- output_tokens — what came back
- cache_creation_input_tokens — first time a long prefix is cached (priced ~1.25× input)
- cache_read_input_tokens — cache hit on a prefix (priced ~0.1× input)
A naive integration lumps the counters together and bills them at a single rate, which overcharges whenever cache reads dominate. We bill each counter separately at the model's real per-token price. Effective input cost on a Sonnet 4.5 chat with a 50k-token cached system prompt drops by ~85%.
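The per-counter arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not our billing code: the Usage and Price types and the function names are hypothetical, the multipliers are the approximate ones from the list above, and the prices in the usage note are placeholder values chosen for the example.

```python
from dataclasses import dataclass

# Approximate multipliers relative to the base input price (from the list above).
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_READ_MULTIPLIER = 0.10


@dataclass(frozen=True)
class Usage:
    """The four counters reported for one request."""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_creation_input_tokens: int = 0
    cache_read_input_tokens: int = 0


@dataclass(frozen=True)
class Price:
    """USD per million tokens (hypothetical container, illustrative values)."""
    input_per_mtok: float
    output_per_mtok: float


def input_cost_usd(usage: Usage, price: Price) -> float:
    """Bill each input-side counter at its own rate."""
    weighted_tokens = (
        usage.input_tokens
        + usage.cache_creation_input_tokens * CACHE_WRITE_MULTIPLIER
        + usage.cache_read_input_tokens * CACHE_READ_MULTIPLIER
    )
    return weighted_tokens * price.input_per_mtok / 1_000_000


def naive_input_cost_usd(usage: Usage, price: Price) -> float:
    """The naive version: every input-side token at the full input rate."""
    all_input = (
        usage.input_tokens
        + usage.cache_creation_input_tokens
        + usage.cache_read_input_tokens
    )
    return all_input * price.input_per_mtok / 1_000_000


def total_cost_usd(usage: Usage, price: Price) -> float:
    """Input-side cost plus output tokens at the output rate."""
    output_cost = usage.output_tokens * price.output_per_mtok / 1_000_000
    return input_cost_usd(usage, price) + output_cost


# Assumed example: 3k uncached input tokens on top of a 50k-token cache hit.
example_price = Price(input_per_mtok=3.0, output_per_mtok=15.0)
example_usage = Usage(input_tokens=3_000, output_tokens=400,
                      cache_read_input_tokens=50_000)
savings = 1 - (input_cost_usd(example_usage, example_price)
               / naive_input_cost_usd(example_usage, example_price))
```

With these assumed numbers the weighted input is 8,000 token-equivalents against 53,000 raw tokens, so savings comes out at roughly 0.85. The exact figure depends on how much uncached input rides along with the cached prefix.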
Streaming complicates things
For streamed (SSE) responses, the upstream sends the input count in message_start and a running output count in successive message_delta events. We track both and charge once, at message_stop. A final message_delta looks like this:
{
  "type": "message_delta",
  "delta": { "stop_reason": "end_turn" },
  "usage": { "output_tokens": 412 }
}
If the connection dies before message_stop, we still bill what was produced, since partial output costs money for the upstream too. Aborts during input (before any output_tokens > 0) are free.
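The streaming rules above can be sketched as a small accumulator. This is an illustration under stated assumptions rather than our production handler: StreamUsage, settle, and dropped_stream are hypothetical names, events are assumed to arrive as already-parsed dicts, and the message_start payload shape (message.usage.input_tokens) is an assumption. Only the message_delta shape is taken from the sample above.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StreamUsage:
    input_tokens: int = 0
    output_tokens: int = 0
    completed: bool = False  # True once message_stop has been seen

    def observe(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "message_start":
            # Assumed shape: {"message": {"usage": {"input_tokens": N}}}
            usage = event.get("message", {}).get("usage", {})
            self.input_tokens = usage.get("input_tokens", 0)
        elif kind == "message_delta":
            # The output count is a running total, so overwrite instead of adding.
            usage = event.get("usage", {})
            self.output_tokens = usage.get("output_tokens", self.output_tokens)
        elif kind == "message_stop":
            self.completed = True


def settle(events: Iterable[dict]) -> Optional[StreamUsage]:
    """Consume a stream and return the usage to bill, or None if it is free."""
    tracker = StreamUsage()
    try:
        for event in events:
            tracker.observe(event)
            if tracker.completed:
                break
    except ConnectionError:
        pass  # connection died mid-stream: fall through and bill what we saw
    if not tracker.completed and tracker.output_tokens == 0:
        return None  # aborted during input, before any output was produced
    return tracker


def dropped_stream(events: list, fail_after: int) -> Iterator[dict]:
    """Demo helper: yield the first `fail_after` events, then simulate a drop."""
    yield from events[:fail_after]
    raise ConnectionError("upstream connection lost")


EVENTS = [
    {"type": "message_start", "message": {"usage": {"input_tokens": 50}}},
    {"type": "message_delta", "delta": {}, "usage": {"output_tokens": 120}},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"},
     "usage": {"output_tokens": 412}},
    {"type": "message_stop"},
]
```

Calling settle(EVENTS) yields a completed record with the final output count, settle(dropped_stream(EVENTS, 2)) yields a partial record that is still billable, and settle(dropped_stream(EVENTS, 1)) returns None because nothing was produced. Overwriting the output count on each delta keeps a retried or repeated event from being double-counted.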
Receipts
Every request lands a row in api_usage_logs with each counter and the model price in effect at the time of the request. The dashboard exposes daily and per-model breakdowns; the /ai/v1/usage/stats endpoint gives you the same numbers as JSON.
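To make the receipt idea concrete, here is a hedged sketch of what such a log table and a daily per-model rollup could look like, using Python's built-in sqlite3 module. Only the table name api_usage_logs and the four counters come from the text above; the column names, the price columns, the multipliers baked into the query, and the sample rows are all assumptions for illustration, not the real schema.

```python
import sqlite3

# Hypothetical schema: one row per request, with the price snapshotted on the row.
SCHEMA = """
CREATE TABLE api_usage_logs (
    id INTEGER PRIMARY KEY,
    created_at TEXT NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    cache_creation_input_tokens INTEGER NOT NULL,
    cache_read_input_tokens INTEGER NOT NULL,
    input_price_per_mtok REAL NOT NULL,
    output_price_per_mtok REAL NOT NULL
)
"""

# Daily, per-model rollup. The 1.25 and 0.1 multipliers are the approximate
# cache write and cache read rates relative to the input price.
DAILY_BY_MODEL = """
SELECT date(created_at) AS day,
       model,
       COUNT(*) AS requests,
       SUM(
           (input_tokens
            + 1.25 * cache_creation_input_tokens
            + 0.1 * cache_read_input_tokens) * input_price_per_mtok
           + output_tokens * output_price_per_mtok
       ) / 1000000.0 AS cost_usd
FROM api_usage_logs
GROUP BY day, model
ORDER BY day, model
"""


def daily_breakdown(conn: sqlite3.Connection) -> list:
    """Return (day, model, requests, cost_usd) tuples."""
    return conn.execute(DAILY_BY_MODEL).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO api_usage_logs (created_at, model, input_tokens, output_tokens,"
    " cache_creation_input_tokens, cache_read_input_tokens,"
    " input_price_per_mtok, output_price_per_mtok)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        # Made-up sample rows with placeholder prices.
        ("2025-01-01 10:00:00", "model-a", 3000, 400, 0, 50000, 3.0, 15.0),
        ("2025-01-01 18:30:00", "model-a", 1000, 100, 0, 0, 3.0, 15.0),
        ("2025-01-02 09:00:00", "model-a", 500, 50, 2000, 0, 3.0, 15.0),
    ],
)
rows = daily_breakdown(conn)
```

Storing the price on each row, rather than joining against a live price table, means historical totals stay stable when a model's price changes later.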