
Your AI Tooling Budget Is Already Unsustainable - Here's What to Do About It


[Image: Burning through AI tooling budget - token waste visualization]

In April 2026, Uber’s CTO admitted the company had burned through its entire annual AI budget in four months.

Five thousand engineers. Claude Code adoption doubled by February. The math stopped working sometime around March.

Most teams won’t get a headline about it. They’ll just get a budget conversation they weren’t prepared for. Or they’ll hit Claude’s usage limits mid-sprint and wonder what happened.

Here’s what happened: the way AI coding tools consume tokens is fundamentally different from what the marketing pages suggest, and most teams have no idea what they’re actually spending or why.

This article breaks down where tokens actually go, and covers nine tools - including one we built at Jetpack Labs - that meaningfully reduce waste without slowing you down.


The Gap Between Subscription and Reality

The Claude Max plan is $200/month. That sounds reasonable. Then you start using Claude Code as an actual agent - letting it read files, run commands, iterate on solutions - and the numbers shift.

A single long agentic session burns 50,000 to 300,000 tokens. One complex debugging session can consume 30–90% of a 5-hour budget. Token overages, premium model access, and agentic workflows push real costs 2–5× past base subscriptions. Engineers using Claude Code as an agent are reporting $500–$2,000/month in API costs, well past what a subscription covers.

At the enterprise level, engineering leaders should expect AI tooling to hit 20–30% of total operational expenditure by late 2026. $1,000+ per developer annually is becoming the baseline for multi-tool teams.

Anthropic is currently subsidizing heavy Claude Max users significantly - some estimates put compute costs at $90,000/year against $2,400 in subscription revenue for power users. That model doesn’t last. Price adjustments and throttling are coming.

So the question isn’t whether this gets expensive. It’s whether you’re doing anything about it before it becomes a crisis.


Where Tokens Actually Go

Before you can fix it, you need to understand the three drains.

Input tokens are everything Claude reads: files you give it, bash output, tool responses, past context. These pile up fast. A standard Glob + Grep + Read workflow on a typical search returns ~8,000 tokens. Do that twenty times in a session and you’ve burned 160,000 tokens before Claude has written a single line of code.

Output tokens are everything Claude writes back: explanations, summaries, code, commentary. Most developers don’t realize how much this costs. Claude defaults to being verbose. It restates the task before doing it. It writes a paragraph of context before the code. It confirms what it just did after doing it. None of that is free.

Context waste is the sneaky one. It’s re-reading files Claude already processed. It’s correction loops where Claude makes the same mistake twice because it forgot a project rule. It’s a context window that fills up with noise - verbose bash output, full file contents when you needed two lines, progress bars and ANSI codes that nobody needed to see.

The good news: all three are tractable. The tools below attack each one directly.
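The input-token arithmetic above is easy to check. This is a back-of-the-envelope sketch only: the 8,000-token search and the twenty searches per session come from the paragraph above, while the 800-token "trimmed" figure is an assumption standing in for a search that returns a tenth as much.

```python
def session_input_tokens(searches: int, tokens_per_search: int) -> int:
    """Input tokens spent on search round-trips alone in one session."""
    return searches * tokens_per_search


# Figures from the text: ~8,000 tokens per search, twenty searches.
baseline = session_input_tokens(searches=20, tokens_per_search=8_000)

# Assumption: a trimmed search that returns a tenth as much.
trimmed = session_input_tokens(searches=20, tokens_per_search=800)

print(baseline)  # 160000
print(trimmed)   # 16000
print(f"{1 - trimmed / baseline:.0%} fewer input tokens")  # 90% fewer input tokens
```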


The Tools

ADA - Claude Code Token Optimizer

I’ll start with this one because we built it at Jetpack Labs, and it’s the one I use every day.

We named it after Ada Lovelace - the 19th-century mathematician widely regarded as the world’s first computer programmer. She was working with Charles Babbage on his Analytical Engine a century before computers existed. Seemed like the right name for a tool that makes a modern AI engine smarter about how it works.

ADA is a private plugin, not publicly released. I’m writing about it here because the approach generalises - if you’re building internal tooling, these are the levers worth pulling.

The core problem ADA solves: Claude Code does a lot of small round-trips by default. A search becomes Glob (847 files listed) + Grep (120 raw matches) + Read (full file, 400 lines). That’s roughly 8,000 tokens for information that could fit in 800.

ADA replaces those default tools with smarter equivalents that return exactly what the model needs - ranked, deduplicated, and trimmed.

Tool | Replaces | Savings
ada_search | Glob + Grep + Read | ~10× per search
ada_read | Read | 40–60% on large files
ada_edit | Edit + MultiEdit | Batches multi-file edits
ada_sql | Shell DB introspection | Schema in 50 lines vs 5,000
ada_recall | Re-reading past context | Prevents correction loops
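To make "ranked, deduplicated, and trimmed" concrete, here is a minimal sketch of how raw search matches could be compacted before they reach the model. It is not ADA's implementation (ADA is private). The `Match` type, the `compact_matches` function, and the scoring rule (count of query occurrences) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    path: str
    line_no: int
    text: str


def compact_matches(matches, query, limit=5, width=80):
    """Deduplicate, rank, and trim raw matches into short one-line results."""
    seen = set()
    unique = []
    for m in matches:
        key = (m.path, m.text.strip())  # same text in the same file counts once
        if key in seen:
            continue
        seen.add(key)
        unique.append(m)

    # Hypothetical ranking: more occurrences of the query ranks higher.
    # sorted() is stable, so ties keep their original order.
    needle = query.lower()
    ranked = sorted(unique, key=lambda m: m.text.lower().count(needle), reverse=True)

    # Return only the top results, each clipped to a fixed width.
    return [f"{m.path}:{m.line_no}: {m.text.strip()[:width]}" for m in ranked[:limit]]
```

The model then sees a handful of `path:line: text` entries instead of a file listing, every raw match, and a full file read.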

It also intercepts bash output before it hits context. git diff goes from hundreds of lines to 2-line hunks. git log strips email addresses and shortens hashes. Docker output drops sha256 digests. git push progress bars disappear entirely. Compression only triggers when savings exceed 15% - it doesn’t mangle useful output.
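A rough sketch of threshold-gated output compression follows. The regexes and the function name `compress_output` are assumptions, and character count is used as a crude stand-in for token count. The only detail taken from the text is the rule that compression applies only when savings exceed 15%.

```python
import re

ANSI = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")    # colour and cursor escape codes
FULL_SHA = re.compile(r"\b[0-9a-f]{40}\b")      # full-length git hashes
EMAIL = re.compile(r"\s*<[^<>\s]+@[^<>\s]+>")   # "<name@host>" in git log output


def compress_output(text: str, min_savings: float = 0.15) -> str:
    """Return a compacted copy of text, or text itself if little is saved."""
    if not text:
        return text
    cleaned = ANSI.sub("", text)
    cleaned = FULL_SHA.sub(lambda m: m.group(0)[:7], cleaned)  # shorten hashes
    cleaned = EMAIL.sub("", cleaned)
    lines = (line.rstrip() for line in cleaned.splitlines())
    cleaned = "\n".join(line for line in lines if line)        # drop blank lines

    # Character length is only a proxy for tokens in this sketch.
    savings = 1 - len(cleaned) / len(text)
    return cleaned if savings > min_savings else text
```

Returning the original untouched when savings are small is the design choice that keeps already-compact output from being mangled.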

The other feature I use constantly is context modes. Set balanced mode once and you get a 200K context window, 80% auto-compact threshold, and a response style rule that tells Claude to skip pleasantries and task restatements. In aggressive mode it drops to fragments-only output. In a typical session, switching from default to balanced roughly halves output tokens before you’ve touched anything else.

Real-world result: 30–58% fewer input tokens on actual coding tasks.

It installs as a local Claude Code plugin - no API, no account, nothing leaves your machine. Requires ripgrep and Node 20+.

(Private plugin - not publicly available)


RTK - Rust Token Killer

If you do nothing else on this list, install RTK.

It’s a single Rust binary that sits between your terminal commands and Claude’s context window. When Claude runs git status or npm install or cargo build, RTK intercepts the output, filters the noise, groups related items, truncates irrelevant parts, and deduplicates before passing it on.

In measurements across 2,900+ real-world commands, RTK removes an average of 89% of CLI output noise. In typical 30-minute AI coding sessions, teams report dropping from around 150,000 tokens to roughly 45,000 - about a 70% reduction.

Setup is a Claude Code PreToolUse hook that automatically rewrites commands (git status → rtk git status) without any manual steps. Zero dependencies. Works on macOS, Linux, and Windows. MIT license, no telemetry.
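The rewrite step itself is simple to picture. The sketch below shows only the command-prefixing logic, not RTK's actual hook. The `WRAPPED` allow-list and the function name are hypothetical, and how a hook hands the rewritten command back to Claude Code is deliberately left out because it depends on the hook's I/O contract.

```python
import shlex

# Hypothetical allow-list of commands worth routing through the wrapper.
WRAPPED = {"git", "npm", "cargo", "docker"}


def rewrite_command(command: str, wrapper: str = "rtk") -> str:
    """Prefix known noisy commands with the wrapper; leave everything else alone."""
    try:
        parts = shlex.split(command)
    except ValueError:
        return command  # unbalanced quotes: do not touch what we cannot parse
    if not parts or parts[0] == wrapper or parts[0] not in WRAPPED:
        return command
    # Note: only the first command is prefixed; pipelines and "&&" chains
    # are not handled in this sketch.
    return f"{wrapper} {command}"


print(rewrite_command("git status"))  # rtk git status
print(rewrite_command("ls -la"))      # ls -la
```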

The numbers are almost too good to believe until you run it for a week. Verbose npm output, full cargo compilation logs, long test runner output - most of it is noise that Claude was reading and paying for.

rtk-ai/rtk


code-review-graph - 49× Token Reduction on Monorepos

(Haven’t run this in production yet - it’s next on my list.)

This one hit GitHub Trending within days of release, and for good reason.

On a Next.js monorepo with 27,000 files, manually specifying file context on every task defeats the purpose of an AI assistant. code-review-graph builds a structural map of your codebase from the AST using Tree-sitter, stores it locally, and queries it through MCP whenever Claude needs to understand what a change actually touches.

The result: Claude reads only what matters. On the Next.js monorepo benchmark, adding a rate limiter went from 739K tokens to 15K - 49× fewer. The tool pointed Claude to the right 15 files out of 27,000 and skipped the rest. On a 125-file project, savings are more modest at 4.6× - but still meaningful.
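The underlying idea, building a dependency graph once and then asking "what does this change touch?", can be sketched in a few lines. This is a toy version under loud assumptions: it uses Python's built-in `ast` module instead of Tree-sitter, tracks only top-level imports between modules, and all function names are made up for illustration.

```python
import ast
from collections import defaultdict, deque


def imports_of(source: str) -> set[str]:
    """Top-level package names imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found


def build_reverse_graph(modules: dict[str, str]) -> dict[str, set[str]]:
    """Map each module to the set of project modules that import it."""
    dependents = defaultdict(set)
    for name, source in modules.items():
        for dep in imports_of(source):
            if dep in modules and dep != name:
                dependents[dep].add(name)
    return dependents


def impacted(changed: str, dependents: dict[str, set[str]]) -> set[str]:
    """Breadth-first walk: the changed module plus everything that depends on it."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        for parent in dependents.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Given a change to one module, only the modules returned by `impacted` need to be read; everything else in the repository can be skipped.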

Supports 19 languages: Python, TypeScript, JavaScript, Go, Rust, Java, C#, Ruby, Kotlin, Swift, PHP, C/C++, Vue SFC, Solidity, Dart, R, Perl, Lua, and Jupyter notebooks.

If you work on anything larger than a side project, this is the highest-leverage install on this list.

tirth8205/code-review-graph


Caveman - 75% Output Token Reduction

Output tokens are the easiest drain to ignore because they feel free. They’re not.

Caveman is a Claude Code skill that makes the agent communicate in minimal, fragment-heavy responses. No preamble. No post-task summaries. No “I’ve completed the task you requested and here’s a breakdown of what I did.” Just the answer.

The evidence: a March 2026 paper titled “Brevity Constraints Reverse Performance Hierarchies in Language Models” found that constraining large models to brief responses improved accuracy by 26 points on some benchmarks. Fewer tokens, better signal.

In real sessions, Caveman cuts output tokens by 65% on average, with a range of 22–87% depending on task type. Installation is a single skill file. A hook writes a flag at session start so the behaviour kicks in from the first message.

It sounds gimmicky. It isn’t. Claude’s verbose defaults exist for conversational usability, not for agentic workflows where you’re reading tool output, not chat.

I tried Caveman and the output compression is real. I ended up back on ADA though - the context modes and bash output compression cover enough of the same ground that running both felt redundant. If you’re not using ADA, Caveman is the obvious alternative.

JuliusBrussee/caveman


Claude Token Efficient - Drop One File, Save 90%

(Haven’t tried this one yet - it’s on my list alongside code-review-graph.)

This is the lowest-friction item on the list.

Drop a single CLAUDE.md into your repository. It enforces strict terseness rules - no task restatements, no summaries, fragments over full sentences - with zero code changes and zero tooling overhead.

The author benchmarks 90% token savings, taking a typical project docs load from 11K to 1.3K tokens. The magic is that it works at the CLAUDE.md level: Claude reads it at session start and applies the rules globally, before any tools run.

It’s not a substitute for the more structural tools - you can’t CLAUDE.md your way out of reading 27,000 files. But for output verbosity and input preamble, it’s a free win that takes 30 seconds to set up.

aymenfurter/claude-token-efficient


Token Optimizer MCP - 95%+ Reduction via Caching

Most MCP tools return full responses every time. If Claude calls read_file on the same config three times in a session, that’s three full reads at full token cost.

Token Optimizer MCP adds caching and Brotli compression on top of your existing MCP tools. Repeated reads hit cache instead of the model. The SQLite backend persists across sessions for content that doesn’t change frequently. It also provides three lean wrapper tools for agents: list_tools() (short descriptions only), get_tool_schema(name) (full schema for one specific tool on demand), and invoke_tool(name, input) (execute with structured inputs).
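The three wrapper tools follow a lazy-loading pattern: advertise short descriptions up front and hand out a full schema only when one is asked for. The sketch below illustrates that pattern, not the project's code. The function names come from the text above, but the `TOOLS` registry, its contents, and the `tool_input` parameter name are hypothetical.

```python
# Hypothetical in-memory registry; a real server would wrap existing MCP tools.
TOOLS = {
    "read_file": {
        "description": "Read a file from disk.",
        "schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        "handler": lambda args: f"<contents of {args['path']}>",
    },
}


def list_tools() -> list[dict]:
    """Short descriptions only; full schemas stay out of context."""
    return [{"name": name, "description": tool["description"]} for name, tool in TOOLS.items()]


def get_tool_schema(name: str) -> dict:
    """Full schema for one tool, fetched on demand."""
    return TOOLS[name]["schema"]


def invoke_tool(name: str, tool_input: dict):
    """Run a tool after checking that its required fields are present."""
    tool = TOOLS[name]
    missing = [key for key in tool["schema"].get("required", []) if key not in tool_input]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return tool["handler"](tool_input)
```

With many tools registered, the agent pays for one-line descriptions by default and for a full schema only when it actually picks a tool.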

The claimed 95%+ reduction holds only under the right conditions - high cache hit rate, repeated tool calls, stable content. Real-world savings will vary, but for codebases where Claude frequently re-reads the same files, the cache alone pays for itself quickly.

ooples/token-optimizer-mcp


Token Savior - Symbol Navigation, 97% Reduction

Token Savior attacks a specific problem: when Claude needs to understand code, it tends to read entire files. A 1,200-line service class when you needed the signature of one method. A full model file when you needed to check one relationship.

Token Savior navigates by symbol instead of file. It builds a persistent symbol index and lets Claude jump to the exact function, class, or method it needs without reading anything surrounding it. On code navigation tasks it claims a 97% reduction, and because the index persists between sessions, it doesn’t rebuild from scratch each time.

Works best in large, mature codebases where files are long and Claude spends a lot of time on comprehension reads.
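Symbol-level navigation is straightforward to illustrate for a single Python file. This is a minimal sketch, not Token Savior's implementation. It assumes Python source and the standard `ast` module, keeps the index in memory rather than persisting it, and the function names are invented for the example.

```python
import ast


def build_symbol_index(source: str) -> dict[str, tuple[int, int]]:
    """Map dotted symbol names to their (first_line, last_line) in the source."""
    index = {}

    def visit(node, prefix=""):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                name = f"{prefix}{child.name}"
                index[name] = (child.lineno, child.end_lineno)
                visit(child, prefix=f"{name}.")  # methods and nested definitions

    visit(ast.parse(source))
    return index


def read_symbol(source: str, index: dict[str, tuple[int, int]], name: str) -> str:
    """Return only the lines belonging to one symbol."""
    start, end = index[name]
    return "\n".join(source.splitlines()[start - 1:end])


SOURCE = '''class BillingService:
    def charge(self, amount):
        return amount * 2

    def refund(self, amount):
        return -amount

def helper():
    return 1
'''

index = build_symbol_index(SOURCE)
print(read_symbol(SOURCE, index, "BillingService.refund"))
```

The call at the end prints just the two lines of `refund`, which is the whole point: the rest of the class never enters context.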

token-savior/token-savior


Ghost Token Hunter - Context Quality, Not Just Size

Most token tools optimize quantity. This one optimizes quality.

“Ghost tokens” are invisible waste in your context: subtle formatting artifacts, repeated whitespace, BOM characters, encoding noise. They don’t show up in a diff but they degrade context quality and inflate token counts. Token Optimizer (alexgreensh) hunts them down, removes them, and restores clean context.

The framing is unusual but the problem is real, especially on projects that pull content from multiple sources - docs, API responses, database schemas. The cleaner your context, the more useful Claude’s responses become.
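A minimal sketch of this kind of cleanup, assuming the "ghost" characters are a BOM, zero-width spaces, word joiners, stray carriage returns, trailing whitespace, and runs of blank lines. The exact character set and the function name `clean_context` are assumptions, not the tool's actual rules.

```python
import re

# Characters deleted outright: BOM, zero-width space, word joiner.
# (The zero-width joiner is left alone because emoji sequences rely on it.)
INVISIBLE = dict.fromkeys(map(ord, "\ufeff\u200b\u2060"))


def clean_context(text: str) -> str:
    """Strip invisible characters and whitespace noise without touching indentation."""
    text = text.translate(INVISIBLE)
    text = text.replace("\r\n", "\n").replace("\r", "\n")       # normalise line endings
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)     # trailing whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)                      # collapse blank-line runs
    return text
```

Leading whitespace is preserved on purpose, since indentation carries meaning in code and YAML.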

alexgreensh/token-optimizer


Claude Context MCP - Semantic Codebase Search

This one is for teams whose primary token drain is codebase comprehension.

Zilliz’s Claude Context MCP uses hybrid vector search to make your entire codebase available as semantic context. Instead of Claude reading files linearly, you get relevance-ranked retrieval - similar to RAG but applied to your local codebase.

The claimed 40% cost reduction is for teams where Claude regularly needs to pull context from large, complex repositories. The setup is heavier than most tools on this list, but for teams with large monorepos that aren’t candidates for Tree-sitter-based tools, this is the alternative to reach for.
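Hybrid retrieval blends a similarity score with a keyword score and returns only the top-ranked chunks. The toy below shows the shape of that blend and nothing more. Real systems use learned embeddings and a vector database, whereas this sketch substitutes character-trigram counts for embeddings; the `alpha` weighting and all function names are assumptions.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def trigram_vector(text: str) -> Counter:
    """Stand-in for an embedding: counts of character trigrams."""
    joined = " ".join(tokens(text))
    return Counter(joined[i:i + 3] for i in range(len(joined) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def keyword_score(query: str, doc: str) -> float:
    """Fraction of query words that appear verbatim in the document."""
    wanted = set(tokens(query))
    return len(wanted & set(tokens(doc))) / len(wanted) if wanted else 0.0


def hybrid_search(query: str, chunks: dict[str, str], alpha: float = 0.5, top_k: int = 2) -> list[str]:
    """Rank chunks by a weighted blend of vector similarity and keyword overlap."""
    query_vec = trigram_vector(query)
    scored = {
        name: alpha * cosine(query_vec, trigram_vector(body)) + (1 - alpha) * keyword_score(query, body)
        for name, body in chunks.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_k]
```

The vector half catches near-matches such as "limiting" versus "limit", while the keyword half rewards exact identifier hits.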

zilliztech/claude-context


What I Actually Run

Here’s my current day-to-day stack, in order of impact:

ADA is always on. It’s the baseline. Context mode set to balanced. The combination of smarter search tools and bash output compression handles the bulk of input token waste.

RTK runs alongside it for the CLI output that ADA doesn’t intercept. Together they eliminate most of the terminal noise.

That combination, on real sessions, puts me in the 50–70% total token reduction range versus unoptimized defaults. Not theoretical - tracked via ADA’s SQLite session logging.

code-review-graph and claude-token-efficient are both on my list to trial next. The Tree-sitter-based approach in code-review-graph looks like the highest-leverage addition for larger codebases, and claude-token-efficient is low enough friction that there’s no good reason not to try it.

I tested Caveman and the output compression genuinely works - I just didn’t need it on top of ADA. If you’re running without ADA, it’s the obvious first move.


What to Do If You’re Managing a Team

The tooling is half the answer. The process is the other half.

Audit before you optimize. ADA’s session logging gives you actual numbers - baseline vs. optimized, cost per session, per-tool breakdown. Without that, you’re guessing. You want to know which sessions are expensive and why before you choose what to install.

Set context modes in your CLAUDE.md. Every project should have a project-level CLAUDE.md that sets verbosity expectations. If you don’t, Claude defaults to maximum verbosity for all developers, all sessions.

Track token spend the same way you track infrastructure spend. Engineering leaders should expect $1,000+ per developer annually. If you’re not measuring it now, you won’t see the problem until it’s already a budget conversation.
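If your tooling logs tool calls to SQLite, a per-tool and per-session breakdown is a couple of queries away. The sketch below assumes a hypothetical `tool_calls` table. ADA's real schema is not public, so the table name, columns, and sample rows are all invented for illustration.

```python
import sqlite3


def per_tool_breakdown(db: sqlite3.Connection) -> list[tuple[str, int]]:
    """Total tokens per tool, most expensive first."""
    return db.execute(
        """
        SELECT tool, SUM(input_tokens + output_tokens) AS total
        FROM tool_calls
        GROUP BY tool
        ORDER BY total DESC
        """
    ).fetchall()


def per_session_totals(db: sqlite3.Connection) -> list[tuple[str, int]]:
    """Total tokens per session, most expensive first."""
    return db.execute(
        """
        SELECT session_id, SUM(input_tokens + output_tokens) AS total
        FROM tool_calls
        GROUP BY session_id
        ORDER BY total DESC
        """
    ).fetchall()


# Hypothetical schema and sample data.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE tool_calls (session_id TEXT, tool TEXT, input_tokens INTEGER, output_tokens INTEGER)"
)
db.executemany(
    "INSERT INTO tool_calls VALUES (?, ?, ?, ?)",
    [
        ("s1", "Read", 4000, 0),
        ("s1", "Bash", 9000, 0),
        ("s1", "Grep", 1500, 0),
        ("s2", "Bash", 6000, 0),
        ("s2", "Edit", 500, 300),
    ],
)

print(per_tool_breakdown(db))  # [('Bash', 15000), ('Read', 4000), ('Grep', 1500), ('Edit', 800)]
print(per_session_totals(db))  # [('s1', 14500), ('s2', 6800)]
```

In this made-up data, Bash output is the biggest drain, which would point to a CLI-output filter as the first install.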

The 30% who’ve hit usage limits are using Claude right. They’re running real agentic workflows. The answer isn’t to use it less - it’s to use it more efficiently.


The Unsustainable Part

Anthropic’s current pricing is subsidized. The gap between what power users pay and what their usage actually costs is significant. That gap closes over time, either through price increases, rate limiting, or both.

The teams that have built efficient token practices now will be in a much better position when pricing normalizes. The teams that haven’t will face a choice: reduce AI usage or increase spend.

Neither of those is a good answer. Building efficient habits now is.

The tools above aren’t about getting less out of Claude Code. They’re about getting the same output for a fraction of the token cost. Same accuracy, same capability, dramatically less waste.

That’s the bet worth making.

© 2024 Shawn Mayzes. All rights reserved.