From $500/Day to $0: Investigating Local LLM Inference for Heavy Claude Code Users
I have a problem that sounds like a good problem to have: I’m spending a significant amount every week on Anthropic tokens, almost entirely through Claude Code. Not product API calls at scale. Just me, multiple terminal windows, and a workflow I’ve built around parallel AI-assisted development.
This article is not a how-to. It’s an investigation. I’ve been working through the math, the tradeoffs, and the architecture of moving to local inference, and I’m sharing everything I’ve found before I make a hardware decision I can’t undo.
If you’ve done this, I want to hear from you. More on that at the end.
The Actual Spend Pattern
Before drawing conclusions, I needed to understand exactly where the money was going. Tools like CodeBurn can give you a breakdown by project, model, and activity type across your Claude Code sessions.
What mine showed:
- The vast majority of spend is from active development sessions, not product API calls at runtime
- Coding, exploration, debugging, and feature development account for most of the cost
- Cache hit rates are extremely high (mine sat at 97.2% over a 7-day window)
- The heaviest sessions were on large, complex frontend refactoring projects
That last point about cache hits turned out to be the most important number in the whole analysis.
What a 97.2% Cache Hit Rate Actually Means
This is where most “just go local” arguments fall apart, and it took me a while to fully internalize it.
When you run a Claude Code session on a large project, the first thing that happens is context loading. Your CLAUDE.md files, project instructions, codebase patterns - this can be 30,000 to 60,000 tokens just to “boot up.” At standard input pricing, that’s expensive.
Anthropic’s prompt caching stores the computed state of that context on their servers. Every subsequent message in the session pays roughly 10% of the normal input cost to “hit” that cache instead of re-sending everything.
A 97.2% cache hit rate means almost none of your input tokens are billed at full price. With hits at roughly 10% of the normal rate, you’re effectively getting about 8x the context capacity for your money.
Local inference has no drop-in equivalent. Not in oMLX, not in llama.cpp out of the box for cross-session persistence, not in any open source stack that approaches what Anthropic offers. Unless you build that persistence yourself, every new session re-processes the full context from scratch.
So when you run the real math, a session that costs $5 with Anthropic’s cached pricing might cost the equivalent of $40-50 in compute time locally, even ignoring hardware costs.
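The cache arithmetic can be sketched in a few lines. This is a rough model, not Anthropic’s billing logic: it assumes cache hits cost 10% of the normal input rate (the figure used above) and that every non-cached token is billed at the full rate. The function and parameter names are illustrative.

```python
def blended_input_multiplier(hit_rate: float,
                             hit_cost: float = 0.10,
                             miss_cost: float = 1.0) -> float:
    """Average price per input token, as a fraction of the uncached price.

    hit_cost and miss_cost are assumptions: 0.10 is the "roughly 10%" figure
    from the text, and 1.0 treats every cache miss as a full-price token.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


if __name__ == "__main__":
    m = blended_input_multiplier(0.972)
    print(f"blended input cost: {m:.4f} of the uncached price")
    print(f"uncached input would cost about {1 / m:.1f}x more")
```

Passing a different `miss_cost` shows how sensitive the result is to what cache misses and cache writes are actually billed at.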
The Workflow That’s Driving the Spend
Understanding the spend pattern changed how I thought about the solution. Here’s what a typical day actually looks like:
I run multiple projects in parallel across separate terminal windows. In one window, I might be doing a large frontend refactor - investigating legacy code, generating a section-by-section migration plan, having the system build out each section, then running an agent that reviews the output against our coding standards. In another window, a separate product is getting new features built from detailed tickets. In a third, something else entirely.
I context-switch between them. When one finishes a task and responds, I review it, give it the next instruction, and move to the next window. Rinse and repeat.
Each project has detailed CLAUDE.md files, skills, agents, and documented tribal knowledge. The AI doesn’t have to infer our conventions - they’re written down.
The Chrome/Playwright screenshot comparison work - checking that built UI matches designs - is a smaller portion of the total, but it’s the most vision-dependent piece.
The key insight from mapping this workflow: the majority of what I do is structured, well-documented, and follows repeatable patterns. That’s actually a favorable profile for open source models.
Why Purpose-Built Coding Models Change the Equation
The “local models aren’t good enough” argument made more sense when local meant 7B or 13B general-purpose models. The current generation is a different story - and the most important development isn’t just model size, it’s purpose-built coding models.
The headline model I’m investigating is Qwen3-Coder-Next - a coding-specific model scoring 58.7% on SWE-bench Verified with a 256K context window. That context window is larger than Anthropic’s 200K limit, which matters for large codebase sessions. Qwen3-72B remains relevant as a general-purpose fallback, but it’s not the primary recommendation anymore.
For the tasks that dominate my workflow:
| Task | Local Viable? | Notes |
|---|---|---|
| Codebase investigation and summarizing | Yes | Reading and synthesizing is a strength |
| Section-by-section migration planning | Yes | Structured output with clear instructions |
| Building code from detailed specs | Yes | Where purpose-built coding models close the gap |
| Standards review agent | Yes | Pattern matching against documented rules |
| Multi-project feature development | Yes | Scoped, focused tasks |
| Screenshot to design comparison | Uncertain | Vision reasoning is a real gap |
| Overnight autonomous agentic runs | Depends on tooling | |
The critical variable is documentation quality. Most of the quality gap between Claude and open source models comes from Claude being better at inferring what you actually want when instructions are ambiguous. If your CLAUDE.md files, skills, and tribal knowledge are thorough, you’ve already answered most of those questions. The model doesn’t need to infer as much.
A well-documented project on Qwen3-Coder-Next will outperform a poorly documented project on Sonnet. I’m reasonably confident of that.
The Hardware Question: Mac Studio M3 Ultra 256GB
If local inference makes sense, the hardware target for my use case is a Mac Studio M3 Ultra with 256GB unified memory.
Here’s why that specific configuration matters:
Memory math for concurrent sessions:
| What’s resident | Memory needed |
|---|---|
| Qwen3-Coder-Next 4-bit weights | ~40GB |
| Devstral (code review agent) | ~24GB |
| Qwen2.5-VL-72B (vision) | ~40GB |
| KV cache across active sessions | ~22GB |
| Total | ~126GB - fits with ~130GB to spare |
On 256GB you can hold three specialized models resident simultaneously - a primary coding model, a dedicated review agent, and a full-size vision model - without any of them cold-starting when you switch tasks.
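The memory budget is simple addition, but writing it down makes it easy to re-run when a model or quantization changes. A minimal sketch, using the estimates from the table as inputs; the names are illustrative, and real headroom will be smaller because the OS and other apps also need memory.

```python
# Resident memory estimates in GB, copied from the table above.
RESIDENT_GB = {
    "Qwen3-Coder-Next 4-bit weights": 40,
    "Devstral (code review agent)": 24,
    "Qwen2.5-VL-72B (vision)": 40,
    "KV cache across active sessions": 22,
}
TOTAL_MEMORY_GB = 256


def memory_budget(resident: dict[str, int], total_gb: int) -> tuple[int, int]:
    """Return (used_gb, headroom_gb) for a set of resident allocations."""
    used = sum(resident.values())
    return used, total_gb - used


if __name__ == "__main__":
    used, headroom = memory_budget(RESIDENT_GB, TOTAL_MEMORY_GB)
    print(f"used: ~{used}GB, headroom: ~{headroom}GB")
```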
Speed estimates on M3 Ultra:
| Model | Speed estimate |
|---|---|
| Qwen3-Coder-Next 4-bit | ~40-55 tok/s |
| Devstral 4-bit | ~80-100 tok/s |
| Qwen3-72B 4-bit (fallback) | ~40-55 tok/s |
For reference, Anthropic’s API with Sonnet typically streams at 60-100+ tok/s. On these estimates, the M3 Ultra lands just under that range on Qwen3-Coder-Next and inside it on Devstral.
A Note on MoE Models
One option worth calling out separately: MoE (Mixture of Experts) models like DeepSeek-V4-Pro have 1.6T total parameters but only activate roughly 49B per token during inference, so per-token compute is closer to a dense ~50B model than to the total parameter count. The catch is that every expert still has to sit in memory. At 4 bits per weight, 1.6T parameters is roughly 800GB, which does not fit in 256GB, so what 256GB realistically holds is a smaller MoE or a far more aggressively quantized one.
This is a quality tier above dense 72B models that simply wasn’t accessible on smaller hardware configs. Whether it’s actually better than a purpose-built coding model on real tasks is something I want to test - but it’s a legitimate option to investigate, and large unified memory is what makes it worth considering at all.
Can Local Hardware Replicate Prompt Caching?
This was the question I spent the most time on, because the cache hit math is so important.
The short answer: partially, with engineering work.
llama.cpp has KV cache save/reload functionality built in:
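# llama-server: reuse the KV cache for a matching prompt prefix (256 is a placeholder chunk size)
--cache-reuse 256
# llama-server: allow a slot's KV state to be saved to and restored from disk (placeholder path)
--slot-save-path ./kv-slots
# llama-cli: save and reload the evaluated prompt state (placeholder file name)
--prompt-cache project-context.bin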
In theory, you could pre-compute your CLAUDE.md and codebase context for each project once, save that KV state to disk, and reload it at the start of each session. This approximates what Anthropic does, but with some differences:
- Within a session: KV cache works well, similar to what you get today
- Cross-session persistence: Possible but requires scripting and setup
- Multiple concurrent sessions: 256GB gives you headroom to keep several cached states resident simultaneously
oMLX doesn’t expose this directly today. You’d need to run llama.cpp underneath or use LiteLLM as a router to get fine-grained cache control. It’s a real engineering project, not a plug-and-play solution.
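As a starting point for the cross-session piece, here is a minimal sketch of saving and restoring a slot’s KV state over HTTP. It assumes llama-server was started with `--slot-save-path` and exposes `POST /slots/{id}?action=save|restore` as described in the llama.cpp server README; check that against your build. The host, port, slot id, and file name are hypothetical.

```python
import json
import urllib.request


def slot_action_request(base_url: str, slot_id: int, action: str,
                        filename: str) -> urllib.request.Request:
    """Build a POST request that saves or restores one server slot's KV cache."""
    if action not in ("save", "restore"):
        raise ValueError("action must be 'save' or 'restore'")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def run_slot_action(req: urllib.request.Request) -> dict:
    """Send the request to a running llama-server and return its JSON reply."""
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical values: local server on port 8080, slot 0, one file per project.
    req = slot_action_request("http://localhost:8080", 0, "save",
                              "frontend-refactor.bin")
    print(req.get_method(), req.full_url)
```

The idea would be to warm a slot once with the project context, save it, and restore it at the start of each later session. Whether that holds up across model or context changes is exactly the part that needs testing.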
The Hybrid Architecture I’m Considering
Going fully local feels like the wrong call even if the hardware math works. The smarter approach is routing with purpose-built models for each task type.
LiteLLM can sit in front of everything as a proxy that speaks the Anthropic API format. Your tools and Claude Code don’t know the difference - they point at localhost.
Claude Code development sessions (coding, planning, refactoring)
→ Qwen3-Coder-Next on Mac Studio M3 Ultra (near-zero marginal cost)
Code review agents
→ Devstral on Mac Studio M3 Ultra (purpose-built for agentic coding tasks)
Screenshot/visual comparison
→ Anthropic API fallback (~5-10% of current spend)
When local model gets confused or drifts
→ Automatic fallback via LiteLLM router
The routing rules don’t require manual switching. You set them once based on request characteristics - presence of tool use, context length, task type - and it routes automatically.
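To make the routing idea concrete, here is a rough sketch of the decision logic in plain Python. It is not LiteLLM configuration: the backend labels, the `Request` fields, and the task names are all hypothetical, and a real setup would express the same rules in whatever form the router supports.

```python
from dataclasses import dataclass

# Backend labels are illustrative; they are not LiteLLM model names.
LOCAL_CODER = "local/qwen3-coder-next"
LOCAL_REVIEW = "local/devstral"
ANTHROPIC = "anthropic/claude"


@dataclass
class Request:
    task_type: str            # e.g. "coding", "planning", "review", "visual"
    has_images: bool = False


def pick_backend(req: Request) -> str:
    """Choose a backend from request characteristics, mirroring the plan above."""
    if req.has_images or req.task_type == "visual":
        return ANTHROPIC      # screenshot/design comparison stays hosted
    if req.task_type == "review":
        return LOCAL_REVIEW   # dedicated review agent
    return LOCAL_CODER        # coding, planning, refactoring


def with_fallback(primary: str) -> list[str]:
    """Order of backends to try; the hosted API is the last resort."""
    return [primary] if primary == ANTHROPIC else [primary, ANTHROPIC]


if __name__ == "__main__":
    for req in (Request("coding"), Request("review"), Request("coding", True)):
        print(req.task_type, req.has_images, "->", with_fallback(pick_backend(req)))
```

Keeping the rules in one small function makes it easy to log every routing decision, which is how you would measure what share of traffic still goes to Anthropic.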
How Documentation Quality Closes the Gap
This is worth its own section because it’s where the experiment lives or dies.
Claude compensates for vague instructions through inference. Open source models follow explicit instructions reliably but infer less. The quality gap between them isn’t fixed - it narrows as your project documentation gets more explicit. That means the experiment’s outcome depends as much on documentation quality as on model quality.
A few things I’m planning to test specifically:
Add negative examples to CLAUDE.md files. Explicitly document what NOT to do, not just what to do. Claude has implicit guardrails from training that catch many common mistakes. Open source models don’t have the same implicit safety net - so you need to make the rules explicit.
Point agents at specific reference files rather than patterns. Instead of “follow existing patterns,” say “use src/components/UserCard.tsx as the reference implementation for this component.” Qwen3-Coder is very good at adapting from concrete examples - better than inference from description alone.
Use Qwen3’s thinking mode selectively. Qwen3 has a built-in thinking mode controlled by /think and /no_think switches in the prompt. Enable it for architecture decisions and novel debugging. Disable it for boilerplate generation where the speed overhead isn’t worth it.
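One way to apply the thinking switch consistently is a small helper that tags each prompt by task type. A minimal sketch: the task labels are hypothetical, and whether a given model variant and chat template honor the switch is something to verify before relying on it.

```python
# Task labels are illustrative; adjust them to your own workflow.
THINKING_TASKS = {"architecture", "debugging"}


def with_thinking_switch(prompt: str, task_type: str) -> str:
    """Append Qwen3's /think or /no_think soft switch to a user prompt."""
    switch = "/think" if task_type in THINKING_TASKS else "/no_think"
    return f"{prompt.rstrip()} {switch}"


if __name__ == "__main__":
    print(with_thinking_switch("Why does the auth refresh loop?", "debugging"))
    print(with_thinking_switch("Generate the CRUD handlers.", "boilerplate"))
```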
The teams I’ve seen succeed with open source models for production work have one thing in common: they didn’t treat documentation as a nice-to-have. They treated it as load-bearing infrastructure.
What I’m Still Not Sure About
I want to be honest about the open questions because they’re the reason I haven’t pulled the trigger yet.
Agentic chain reliability: In a long autonomous overnight run, a local model is more likely to make a subtle wrong assumption early that compounds over 50+ steps. That said, this concern is much smaller for a well-tooled project. Scoped, purpose-built agents with automated validation - contract checks, test coverage enforcement, lint on changed files - catch drift before it compounds. The concern applies to vanilla setups. A project with 20+ documented skills covering specific framework patterns is in a materially better position. I still don’t know where my specific setup lands until I run it.
The actual cache delta: I know my sessions run at 97.2% cache hits with Anthropic. I don’t know exactly what that translates to in wall-clock time and throughput on local hardware, even with KV cache scripting. The numbers I have are estimates.
Quality on the hard 20%: 80% of the work is probably equivalent. The other 20% - novel architectural decisions, subtle debugging, recovering from a wrong path - that’s where I’d expect a real gap. Whether that gap matters in practice depends on how often I hit those cases, and how much the tooling structure compensates.
The Hypothesis
Here’s where I land after working through all of this:
For a developer running the kind of parallel, well-documented, multi-project workflow I’ve described, at significant Anthropic spend, a 256GB Mac Studio M3 Ultra likely pays for itself in 2-4 weeks and cuts ongoing costs by 85-90%.
The hardware works. Purpose-built coding models are close enough for the majority of the workload. The cache question is solvable with engineering effort. Screenshot and visual comparison - the remaining Anthropic spend - is a small slice of total sessions.
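The payback claim is easy to sanity-check with your own numbers. A minimal sketch: the daily spend comes from the title and the savings fractions from the hypothesis above, but the hardware cost is a placeholder, not a quoted price. It ignores electricity and the engineering time for the cache work.

```python
def payback_days(hardware_cost: float, daily_spend: float,
                 savings_fraction: float) -> float:
    """Days until avoided API spend equals the hardware cost."""
    daily_savings = daily_spend * savings_fraction
    if daily_savings <= 0:
        raise ValueError("daily savings must be positive")
    return hardware_cost / daily_savings


if __name__ == "__main__":
    HARDWARE_COST = 10_000  # placeholder, not a quoted price
    for fraction in (0.85, 0.90):
        days = payback_days(HARDWARE_COST, 500, fraction)
        print(f"{fraction:.0%} savings: {days:.1f} days")
```

The result scales linearly with the hardware cost and inversely with spend, so a lighter user at a fraction of this spend would see a payback period that many times longer.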
But I’m treating this as a hypothesis, not a conclusion. I haven’t run the experiment yet.
What I Want to Know From You
If you’ve made this move, I have specific questions:
- Are you running purpose-built coding models (Qwen3-Coder-Next, Devstral) or general 70B models?
- Do you have documented skills and scoped agents per project, or vanilla Claude Code?
- Have you implemented persistent KV cache across sessions and does it work reliably in practice?
- What model are you running and on what hardware config?
- What broke that you didn’t anticipate?
I’m going to write a follow-up once I’ve run the actual experiment. Whether I buy the hardware or decide the current setup is correct given the cache math, I’ll share the real numbers.
Drop a comment, reply on LinkedIn or Twitter, or reach out directly. This is an expensive decision and I’d rather learn from people who’ve done it than find out the hard way.