From $500/Day to $0: Investigating Local LLM Inference for Heavy Claude Code Users

I have a problem that sounds like a good problem to have: I’m spending a significant amount every week on Anthropic tokens, almost entirely through Claude Code. Not product API calls at scale. Just me, multiple terminal windows, and a workflow I’ve built around parallel AI-assisted development.

This article is not a how-to. It’s an investigation. I’ve been working through the math, the tradeoffs, and the architecture of moving to local inference, and I’m sharing everything I’ve found before I make a hardware decision I can’t undo.

If you’ve done this, I want to hear from you. More on that at the end.

The Actual Spend Pattern

Before drawing conclusions, I needed to understand exactly where the money was going. Tools like CodeBurn can give you a breakdown by project, model, and activity type across your Claude Code sessions.

What mine showed:

The vast majority of spend is from active development sessions, not product API calls at runtime
Coding, exploration, debugging, and feature development account for most of the cost
Cache hit rates are extremely high (mine sat at 97.2% over a 7-day window)
The heaviest sessions were on large, complex frontend refactoring projects

The cache hit rate turned out to be the most important number in the whole analysis.

What a 97.2% Cache Hit Rate Actually Means

This is where most “just go local” arguments fall apart, and it took me a while to fully internalize it.

When you run a Claude Code session on a large project, the first thing that happens is context loading. Your CLAUDE.md files, project instructions, codebase patterns - this can be 30,000 to 60,000 tokens just to “boot up.” At standard input pricing, that’s expensive.

Anthropic’s prompt caching stores the computed state of that context on their servers. Every subsequent message in the session still sends the full context, but the cached portion is billed at roughly 10% of the normal input price instead of the full rate.

A 97.2% cache hit rate means only about 2.8% of your input tokens are billed at full price. The rest cost a tenth as much, so your input budget stretches roughly 8x further (10x is the ceiling at a 100% hit rate).
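To make that arithmetic concrete, here is a minimal sketch. It assumes the hit rate is the fraction of input tokens read from cache, that cache reads cost 10% of the base input price as described above, and it ignores any premium charged for writing to the cache. The function name is illustrative.

```python
def effective_input_multiplier(hit_rate: float, cached_price_ratio: float = 0.10) -> float:
    """Average price paid per input token, as a fraction of list price."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Cached tokens are billed at the discounted ratio, the rest at full price.
    return hit_rate * cached_price_ratio + (1.0 - hit_rate) * 1.0


blended = effective_input_multiplier(0.972)
print(f"blended input price: {blended:.4f} of list")
print(f"effective discount: {1 / blended:.1f}x")
```

Under those assumptions the blended price comes out near an eighth of list price, which is the number any local setup has to compete with.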

Local inference stacks don’t give you an equivalent out of the box. Not oMLX, not llama.cpp for cross-session persistence, not any open source stack that approaches what Anthropic offers. Without extra engineering, every new session re-processes the full context from scratch.

So when you run the real math, a session that costs $5 with Anthropic’s cached pricing might cost the equivalent of $40-50 in compute time locally, even ignoring hardware costs.

The Workflow That’s Driving the Spend

Understanding the spend pattern changed how I thought about the solution. Here’s what a typical day actually looks like:

I run multiple projects in parallel across separate terminal windows. In one window, I might be doing a large frontend refactor - investigating legacy code, generating a section-by-section migration plan, having the system build out each section, then running an agent that reviews the output against our coding standards. In another window, a separate product is getting new features built from detailed tickets. In a third, something else entirely.

I context-switch between them. When one finishes a task and responds, I review it, give it the next instruction, and move to the next window. Rinse and repeat.

Each project has detailed CLAUDE.md files, skills, agents, and documented tribal knowledge. The AI doesn’t have to infer our conventions - they’re written down.

The Chrome/Playwright screenshot comparison work - checking that built UI matches designs - is a smaller portion of the total, but it’s the most vision-dependent piece.

The key insight from mapping this workflow: the majority of what I do is structured, well-documented, and follows repeatable patterns. That’s actually a favorable profile for open source models.

Why Purpose-Built Coding Models Change the Equation

The “local models aren’t good enough” argument made more sense when local meant 7B or 13B general-purpose models. The current generation is a different story - and the most important development isn’t just model size, it’s purpose-built coding models.

The headline model I’m investigating is Qwen3-Coder-Next - a coding-specific model scoring 58.7% on SWE-bench Verified with a 256K context window. That context window is larger than Anthropic’s 200K limit, which matters for large codebase sessions. Qwen3-72B remains relevant as a general-purpose fallback, but it’s not the primary recommendation anymore.

For the tasks that dominate my workflow:

Task	Local Viable?	Notes
Codebase investigation and summarizing	Yes	Reading and synthesizing is a strength
Section-by-section migration planning	Yes	Structured output with clear instructions
Building code from detailed specs	Yes	Where purpose-built coding models close the gap
Standards review agent	Yes	Pattern matching against documented rules
Multi-project feature development	Yes	Scoped, focused tasks
Screenshot to design comparison	Uncertain	Vision reasoning is a real gap
Overnight autonomous agentic runs	Depends on tooling	More on this below

The critical variable is documentation quality. Most of the quality gap between Claude and open source models comes from Claude being better at inferring what you actually want when instructions are ambiguous. If your CLAUDE.md files, skills, and tribal knowledge are thorough, you’ve already answered most of those questions. The model doesn’t need to infer as much.

A well-documented project on Qwen3-Coder-Next will outperform a poorly documented project on Sonnet. I’m reasonably confident of that.

The Hardware Question: Mac Studio M3 Ultra 256GB

If local inference makes sense, the hardware target for my use case is a Mac Studio M3 Ultra with 256GB unified memory.

Here’s why that specific configuration matters:

Memory math for concurrent sessions:

What’s resident	Memory needed
Qwen3-Coder-Next 4-bit weights	~40GB
Devstral (code review agent)	~24GB
Qwen2.5-VL-72B (vision)	~40GB
KV cache across active sessions	~22GB
Total	~126GB - fits with ~130GB to spare

On 256GB you can hold three specialized models resident simultaneously - a primary coding model, a dedicated review agent, and a full-size vision model - without any of them cold-starting when you switch tasks.
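The KV cache line in that table is the one that moves most with usage, so it is worth being able to recompute. A rough sketch for a standard transformer with grouped-query attention follows. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real configuration of any model named here, and models with other attention designs will differ.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes for a standard grouped-query-attention transformer."""
    # Factor of 2: one key vector and one value vector per layer, per KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return tokens * per_token


# Hypothetical architecture, fp16 cache, one 60K-token session.
size = kv_cache_bytes(tokens=60_000, n_layers=48, n_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB per 60K-token session")
```

Multiply by the number of sessions you expect to keep warm, and swap in the real values from the model’s config file, to sanity-check the ~22GB figure for your own setup.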

Speed estimates on M3 Ultra:

Model	Speed estimate
Qwen3-Coder-Next 4-bit	~40-55 tok/s
Devstral 4-bit	~80-100 tok/s
Qwen3-72B 4-bit (fallback)	~40-55 tok/s

For reference, Anthropic’s API with Sonnet typically streams at 60-100+ tok/s. On these estimates the M3 Ultra puts Devstral in that range and Qwen3-Coder-Next somewhat below it.

A Note on MoE Models

One option worth calling out separately: MoE (Mixture of Experts) models. DeepSeek-V4-Pro, for example, has 1.6T total parameters but only activates roughly 49B per token, so per-token compute is closer to a ~50B dense model than to its headline size. Memory is a different story: every expert has to stay resident, and 1.6T parameters at 4-bit is roughly 800GB of weights. That does not fit in 256GB, so a model of that size is out of reach here without offloading or far more aggressive quantization.

The MoE models that 256GB can hold fully resident are smaller ones. Whether one of those is actually better than a purpose-built coding model on real tasks is something I want to test - it’s a legitimate option to investigate, and 256GB is where the larger end of that range becomes feasible.
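For any MoE candidate, the first check is whether the full weight set fits, since resident memory follows total parameters rather than active ones. A minimal sketch, ignoring quantization metadata and runtime overhead (the function name is illustrative):

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal GB."""
    return total_params * bits_per_weight / 8 / 1e9


for label, params in [("1.6T-parameter MoE", 1.6e12), ("72B dense", 72e9)]:
    print(f"{label}: ~{weight_footprint_gb(params, 4):.0f} GB at 4-bit")
```

Compare the result against 256GB minus whatever the other resident models and KV caches already use.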

Can Local Hardware Replicate Prompt Caching?

This was the question I spent the most time on, because the cache hit math is so important.

The short answer: partially, with engineering work.

llama.cpp has KV cache save/reload functionality built in:

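# Start llama-server with prefix reuse and an on-disk location for saved KV state.
# 256 is an illustrative chunk size; the model path and directory are placeholders.
# Flag names can change between builds, so check llama-server --help first.
llama-server -m <model.gguf> --cache-reuse 256 --slot-save-path <kv-cache-dir>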

In theory, you could pre-compute your CLAUDE.md and codebase context for each project once, save that KV state to disk, and reload it at the start of each session. This approximates what Anthropic does, but with some differences:

Within a session: KV cache works well, similar to what you get today
Cross-session persistence: Possible but requires scripting and setup
Multiple concurrent sessions: 256GB gives you headroom to keep several cached states resident simultaneously

oMLX doesn’t expose this directly today. You’d need to run llama.cpp underneath or use LiteLLM as a router to get fine-grained cache control. It’s a real engineering project, not a plug-and-play solution.
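As a sketch of what the cross-session scripting might look like: my understanding is that llama-server, when started with a slot save path, exposes a per-slot save/restore action over HTTP. The endpoint shape below follows that understanding and should be verified against your build. The port, slot id, and filename are hypothetical.

```python
import json
import urllib.request


def build_slot_request(base_url: str, slot_id: int, action: str,
                       filename: str) -> urllib.request.Request:
    """Build a POST request to save or restore one slot's KV state."""
    if action not in ("save", "restore"):
        raise ValueError("action must be 'save' or 'restore'")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request to a running server and decode the JSON reply."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())


# Built but not sent here, so the example runs without a server.
req = build_slot_request("http://localhost:8080", 0, "save", "project-a.bin")
print(req.get_method(), req.full_url)
```

The idea would be to call save once after loading a project’s context, then restore at the start of each later session for that project. Whether that holds up across model reloads and context edits is exactly what needs testing.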

The Hybrid Architecture I’m Considering

Going fully local feels like the wrong call even if the hardware math works. The smarter approach is routing with purpose-built models for each task type.

LiteLLM can sit in front of everything as a proxy that speaks the Anthropic API format. Your tools and Claude Code don’t know the difference - they point at localhost.

Claude Code development sessions (coding, planning, refactoring)
    → Qwen3-Coder-Next on Mac Studio M3 Ultra (near-zero marginal cost)

Code review agents
    → Devstral on Mac Studio M3 Ultra (purpose-built for agentic coding tasks)

Screenshot/visual comparison
    → Anthropic API fallback (~5-10% of current spend)

When local model gets confused or drifts
    → Automatic fallback via LiteLLM router

The routing rules don’t require manual switching. You set them once based on request characteristics - presence of tool use, context length, task type - and it routes automatically.
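The routing logic itself is simple enough to sketch in plain Python. This is not LiteLLM configuration, just an illustration of the decision rules. The backend names, the request fields, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass

LOCAL_CONTEXT_LIMIT = 256_000  # context window quoted earlier for Qwen3-Coder-Next


@dataclass
class RoutedRequest:
    task_type: str            # e.g. "coding", "planning", "review", "vision"
    has_images: bool = False
    context_tokens: int = 0


def choose_backend(req: RoutedRequest) -> str:
    """Pick a backend from request characteristics, most restrictive rule first."""
    if req.has_images or req.task_type == "vision":
        return "anthropic-api"            # vision stays on the hosted model
    if req.context_tokens > LOCAL_CONTEXT_LIMIT:
        return "anthropic-api"            # too large for the local context window
    if req.task_type == "review":
        return "local-devstral"
    return "local-qwen3-coder-next"


print(choose_backend(RoutedRequest("coding", context_tokens=45_000)))
print(choose_backend(RoutedRequest("review")))
print(choose_backend(RoutedRequest("coding", has_images=True)))
```

Fallback when a local model drifts is a separate concern, since it depends on detecting failure after the fact rather than on anything visible in the request.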

How Documentation Quality Closes the Gap

This is worth its own section because it’s where the experiment lives or dies.

Claude compensates for vague instructions through inference. Open source models follow explicit instructions reliably but infer less. The quality gap between them isn’t fixed - it narrows directly in proportion to how well you’ve documented your project. That means the experiment outcome depends heavily on documentation quality, not just model quality.

A few things I’m planning to test specifically:

Add negative examples to CLAUDE.md files. Explicitly document what NOT to do, not just what to do. Claude has implicit guardrails from training that catch many common mistakes. Open source models don’t have the same implicit safety net - so you need to make the rules explicit.

Point agents at specific reference files rather than patterns. Instead of “follow existing patterns,” say “use src/components/UserCard.tsx as the reference implementation for this component.” Qwen3-Coder is very good at adapting from concrete examples - better than inference from description alone.

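Use Qwen3’s thinking mode selectively. The general-purpose Qwen3 models have a built-in thinking mode that you can toggle per turn with /think and /no_think in the prompt. Enable it for architecture decisions and novel debugging. Disable it for boilerplate generation where the speed overhead isn’t worth it. Whether a given coding variant supports the toggle is worth checking on its model card before relying on it.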

The teams I’ve seen succeed with open source models for production work have one thing in common: they didn’t treat documentation as a nice-to-have. They treated it as load-bearing infrastructure.

What I’m Still Not Sure About

I want to be honest about the open questions because they’re the reason I haven’t pulled the trigger yet.

Agentic chain reliability: In a long autonomous overnight run, a local model is more likely to make a subtle wrong assumption early that compounds over 50+ steps. That said, this concern is much smaller for a well-tooled project. Scoped, purpose-built agents with automated validation - contract checks, test coverage enforcement, lint on changed files - catch drift before it compounds. The concern applies to vanilla setups. A project with 20+ documented skills covering specific framework patterns is in a materially better position. I still don’t know where my specific setup lands until I run it.

The actual cache delta: I know my sessions run at 97.2% cache hits with Anthropic. I don’t know exactly what that translates to in wall-clock time and throughput on local hardware, even with KV cache scripting. The numbers I have are estimates.

Quality on the hard 20%: 80% of the work is probably equivalent. The other 20% - novel architectural decisions, subtle debugging, recovering from a wrong path - that’s where I’d expect a real gap. Whether that gap matters in practice depends on how often I hit those cases, and how much the tooling structure compensates.
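For the cache delta question, a rough per-turn latency model helps frame what to measure. The sketch below splits a turn into prompt processing for uncached tokens plus generation. The prefill speed is a hypothetical placeholder, not a measurement, and the decode speed is taken from the 40-55 tok/s estimate above.

```python
def turn_seconds(context_tokens: int, cached_tokens: int, output_tokens: int,
                 prefill_tps: float, decode_tps: float) -> float:
    """Estimated wall-clock seconds for one turn: prefill of uncached tokens plus decode."""
    uncached = max(context_tokens - cached_tokens, 0)
    return uncached / prefill_tps + output_tokens / decode_tps


PREFILL_TPS = 500.0  # hypothetical placeholder, replace with a measured value
DECODE_TPS = 45.0    # within the 40-55 tok/s estimate for the primary model

cold = turn_seconds(60_000, 0, 500, PREFILL_TPS, DECODE_TPS)
warm = turn_seconds(60_000, 58_000, 500, PREFILL_TPS, DECODE_TPS)
print(f"cold start: {cold:.0f}s, warm cache: {warm:.0f}s")
```

The gap between the cold and warm numbers is what the KV cache work has to recover, so measured prefill speed on a real 30-60K token context is the first number I need.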

The Hypothesis

Here’s where I land after working through all of this:

For a developer running the kind of parallel, well-documented, multi-project workflow I’ve described, at significant Anthropic spend, a 256GB Mac Studio M3 Ultra likely pays for itself in 2-4 weeks and cuts ongoing costs by 85-90%.
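The payback claim reduces to simple arithmetic. In the sketch below, the $500/day figure comes from the title and the 85-90% savings range from the hypothesis. The hardware price is a hypothetical placeholder, and electricity and engineering time are ignored.

```python
def payback_days(hardware_cost: float, daily_api_spend: float,
                 savings_fraction: float) -> float:
    """Days until cumulative API savings equal the hardware cost."""
    daily_savings = daily_api_spend * savings_fraction
    if daily_savings <= 0:
        raise ValueError("daily savings must be positive")
    return hardware_cost / daily_savings


HARDWARE_COST = 7_000.0  # hypothetical placeholder, use the actual quote
for fraction in (0.85, 0.90):
    days = payback_days(HARDWARE_COST, 500.0, fraction)
    print(f"{fraction:.0%} savings: ~{days:.0f} days")
```

The result is very sensitive to the savings fraction, which is the part of the hypothesis that has not been tested yet.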

The hardware works. Purpose-built coding models are close enough for the majority of the workload. The cache question is solvable with engineering effort. Screenshot and visual comparison - the remaining Anthropic spend - is a small slice of total sessions.

But I’m treating this as a hypothesis, not a conclusion. I haven’t run the experiment yet.

If you want the practical setup side - which backends work today, which models to pull for your hardware, and how the proxy situation has changed - I wrote a follow-up: Running Claude Code with a Local LLM in 2026: No Proxy Required.

What I Want to Know From You

If you’ve made this move, I have specific questions:

Are you running purpose-built coding models (Qwen3-Coder-Next, Devstral) or general 70B models?
Do you have documented skills and scoped agents per project, or vanilla Claude Code?
Have you implemented persistent KV cache across sessions and does it work reliably in practice?
What model are you running and on what hardware config?
What broke that you didn’t anticipate?

I’m going to write a follow-up once I’ve run the actual experiment. Whether I buy the hardware or decide the current setup is correct given the cache math, I’ll share the real numbers.

Drop a comment, reply on LinkedIn or Twitter, or reach out directly. This is an expensive decision and I’d rather learn from people who’ve done it than find out the hard way.