Headroom: The Local Tool That Cuts AI Agent Token Use by Up to 95%

Headroom is a fully local, open-source tool that compresses what your AI agent reads (tool outputs, logs, files, RAG chunks) for a claimed 60–95% fewer tokens, with published accuracy flat on GSM8K and slightly higher on TruthfulQA.

Jun 19, 2026


Table of contents

What problem it actually solves
How it works
The numbers that make it interesting
Where it plugs in
Honest caveats
Bottom line
Sources and further reading

If you run AI coding agents all day, you already know where the money goes: tokens. Every tool output, log dump, RAG chunk, and file your agent reads gets pushed into the model's context — and you pay for all of it, every turn. A new open-source project called Headroom goes after exactly that bill, and it's picked up more than 37,000 GitHub stars while doing it.

The pitch is blunt: "Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60–95% fewer tokens, same answers." It's built by Tejas Chopra (a Netflix engineer), released under Apache 2.0, and — the part that matters most for a lot of teams — it runs 100% locally.

What problem it actually solves

Modern agents are token-hungry not because the questions are long, but because the context is. A single "search the codebase" step can dump 18,000 tokens of file content into the prompt. Debugging a production incident can mean pasting 60,000+ tokens of logs. Most of that is boilerplate, repetition, and structure the model doesn't need verbatim.

Headroom sits between your agent and the model and squeezes that payload first. If you've felt the sting of the hidden costs of AI coding, this is aimed straight at the biggest line item.

How it works

Instead of one blunt "summarize everything" pass, Headroom routes content to a compressor that fits its type:

ContentRouter detects the content type and picks the right algorithm.
SmartCrusher handles structured JSON.
CodeCompressor is AST-aware, so it compresses source code by structure rather than mangling it as plain text.
Kompress-base is a trained model for general text.
CacheAligner stabilizes prompt prefixes so you still get KV-cache hits.
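
To make the routing idea concrete, here is a minimal Python sketch of type-based dispatch. It is not Headroom's code or API: the function names (detect_content_type, route_and_compress) and the three toy compressors are hypothetical stand-ins, and the real components are presumably far more sophisticated. It only shows the shape of "detect the type, then pick a compressor that fits".

```python
import json

def detect_content_type(text: str) -> str:
    """Coarse guess at the content type (hypothetical stand-in for a router)."""
    stripped = text.strip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass  # looked like JSON but was not, e.g. a "[INFO] ..." log line
    code_markers = ("def ", "class ", "import ")
    if any(line.lstrip().startswith(code_markers) for line in stripped.splitlines()):
        return "code"
    return "text"

def compress_json(text: str) -> str:
    # Toy: re-serialize without insignificant whitespace.
    return json.dumps(json.loads(text), separators=(",", ":"))

def compress_code(text: str) -> str:
    # Toy: drop blank lines and comment-only lines. An AST-aware
    # compressor would work on the parse tree instead.
    kept = [ln for ln in text.splitlines()
            if ln.strip() and not ln.strip().startswith("#")]
    return "\n".join(kept)

def compress_text(text: str) -> str:
    # Toy: collapse runs of identical consecutive lines (common in logs).
    out: list[str] = []
    for ln in text.splitlines():
        if not out or out[-1] != ln:
            out.append(ln)
    return "\n".join(out)

COMPRESSORS = {"json": compress_json, "code": compress_code, "text": compress_text}

def route_and_compress(text: str) -> str:
    """Pick the compressor that matches the detected type and apply it."""
    return COMPRESSORS[detect_content_type(text)](text)
```

The point of the dispatch table is that each content type gets a strategy that respects its structure, rather than one summarizer treating JSON, code, and logs as the same kind of string.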

The clever bit is reversible compression (CCR): originals are cached locally, and if the model decides it actually needs the full version, it calls a headroom_retrieve tool to pull it back. So compression isn't a lossy gamble — the raw data is one call away when it matters.
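
A rough sketch of how a reversible scheme like this could work, assuming a simple in-memory cache keyed by a content hash. The ReversibleStore class, its methods, and the stub format are all hypothetical illustrations; only the headroom_retrieve tool name comes from the project's description, and this code does not call it.

```python
import hashlib

class ReversibleStore:
    """Toy local cache: keeps originals so a short stub can be expanded later."""

    def __init__(self) -> None:
        self._originals: dict[str, str] = {}

    def compress(self, text: str, keep_lines: int = 3) -> str:
        lines = text.splitlines()
        if len(lines) <= keep_lines:
            return text  # already short; nothing to cache
        # Key the original by a short content hash so the model can ask for it.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._originals[key] = text
        head = "\n".join(lines[:keep_lines])
        omitted = len(lines) - keep_lines
        return f"{head}\n[{omitted} lines omitted; retrieve with key {key}]"

    def retrieve(self, key: str) -> str:
        """What a retrieval tool call would return: the untouched original."""
        return self._originals[key]


store = ReversibleStore()
log = "\n".join(f"line {i}" for i in range(100))
stub = store.compress(log)
key = stub.rsplit("key ", 1)[1].rstrip("]")
restored = store.retrieve(key)
```

Here the model would see only the stub, and a tool call carrying the key would bring back the full log. The real project presumably compresses far more intelligently than "keep the first three lines", but the cache-and-retrieve contract is the same idea.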

The numbers that make it interesting

Anyone can claim big compression ratios. The reason Headroom is worth a look is that accuracy didn't fall off a cliff in its published benchmarks:

Benchmark	Baseline	With Headroom
GSM8K (math reasoning)	0.870	0.870 (±0.000)
TruthfulQA	0.530	0.560 (+0.030)
SQuAD v2	—	97% accuracy at 19% compression

And on real workloads the token savings are large:

Workload	Before	After	Saved
Code search	17,765	1,408	92%
SRE debugging	65,694	5,118	92%
GitHub triage	54,174	14,761	73%
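
The "Saved" column is plain arithmetic on the before and after token counts, which you can reproduce (and later rerun on your own numbers) with a few lines of Python. The token counts below are copied from the table; the function name is just a label chosen for this sketch.

```python
def percent_saved(before: int, after: int) -> float:
    """Share of tokens removed, as a percentage of the original count."""
    return 100.0 * (before - after) / before

workloads = {
    "Code search": (17_765, 1_408),
    "SRE debugging": (65_694, 5_118),
    "GitHub triage": (54_174, 14_761),
}

for name, (before, after) in workloads.items():
    print(f"{name}: {percent_saved(before, after):.0f}% saved")
```

Rounded to whole percentages, this gives 92%, 92%, and 73%, matching the table.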

Flat accuracy on a math benchmark like GSM8K is the most striking result: compress the context that hard and you'd expect reasoning to degrade. In the published numbers, it didn't.

Where it plugs in

Headroom isn't tied to one editor. It ships in three shapes:

Library — call it directly from your own code.
Proxy — drop it in front of your model API so existing tools benefit with no code changes.
MCP server — expose it as a tool your agent can use. (New to that idea? See MCP explained simply.)

It lists support for Claude Code, Codex, Cursor, Aider, Copilot CLI, and OpenClaw — i.e. most of the agents people in this space actually use. If you're still deciding between those, our Cursor vs Claude Code vs Codex breakdown is a good companion read.

Honest caveats

A few things to keep in mind before you wire it into production:

It's young and moving fast. 37k stars in a short window is hype velocity, not a maturity signal. Expect rough edges and breaking changes.
It adds a moving part. A proxy or MCP layer between your agent and the model is one more thing that can fail, add latency, or mask bugs. The local-only design helps, but test it on real traffic before you trust it.
Benchmarks aren't your codebase. GSM8K staying flat and TruthfulQA ticking up slightly is encouraging, but the only number that matters is whether your agent gives the same answers on your tasks after compression. Measure it.
Stack reality: it's Python (3.10+) with a Rust core. Fine for most, but worth knowing if you're a pure-JS shop.
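
On the "measure it" point, one way to check is to run the same prompts through your agent with and without compression and count how often the answers agree. The sketch below assumes you can wrap each pipeline in a plain function from prompt to answer. The names agreement_rate, answer_raw, and answer_compressed are hypothetical, and exact string matching is a crude proxy you would likely replace with a task-specific check.

```python
from typing import Callable, Iterable

def agreement_rate(
    prompts: Iterable[str],
    answer_raw: Callable[[str], str],
    answer_compressed: Callable[[str], str],
) -> float:
    """Fraction of prompts where both pipelines give the same normalized answer."""
    total = 0
    matches = 0
    for prompt in prompts:
        total += 1
        raw = answer_raw(prompt).strip().lower()
        compressed = answer_compressed(prompt).strip().lower()
        if raw == compressed:
            matches += 1
    return matches / total if total else 0.0


# Stand-in pipelines for illustration only; swap in real agent calls.
def fake_raw(prompt: str) -> str:
    return "42" if "answer" in prompt else "unknown"

def fake_compressed(prompt: str) -> str:
    return " 42 " if "answer" in prompt else "not sure"

rate = agreement_rate(["the answer?", "something else"], fake_raw, fake_compressed)
```

With these stand-ins the two pipelines agree on one of two prompts, so rate is 0.5. On a real workload, a rate well below 1.0 is the signal to look at which inputs the compression is hurting.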

This is the kind of tool that rewards the workflow discipline we keep coming back to — if you already vibe-code without piling up technical debt, adding a measurable compression layer fits right in.

Bottom line

Token cost is the quiet tax on agentic coding, and most teams just eat it. Headroom is the first widely adopted, fully local attempt to cut it by a large margin without trading away answer quality, and it's free under Apache 2.0. Clone it, point it at a real workload, and check your own accuracy and token graphs. If the numbers hold up as the benchmarks suggest, 60–95% fewer tokens is hard to argue with.

Find Headroom on GitHub