Switch to light mode

Running Claude Code with a Local LLM in 2026: No Proxy Required

- 10 min read

Running Claude Code with a local LLM in 2026 using Ollama and oMLX

I wrote about this topic in early 2025. The setup involved cloning a proxy repo, running a Flask server to fake Anthropic’s API, and hoping the model followed instructions well enough to be useful. It worked, but barely.

That version of the problem is solved now.

In January 2026, Ollama released v0.14.0 with a native Anthropic Messages API endpoint. LM Studio added the same shortly after. oMLX showed up for Apple Silicon users and solved the RAM ceiling problem that made running large models impractical on most Macs. The result: you can run Claude Code against a local model today with two environment variables and one ollama pull.

Here is what the landscape actually looks like now, and what I would recommend depending on your hardware.


Why Bother Running Local at All

The Stack Overflow 2025 developer survey put some numbers on what most of us already feel. 84% of developers use AI coding tools. 64% worry about sending sensitive code to cloud providers. Trust in AI tools dropped 11 points year over year, even as usage climbed.

That is the tension. You want the productivity. You do not want your proprietary codebase sitting in someone else’s training pipeline.

Running locally resolves that. Your code never leaves your machine. No API costs either, which adds up fast on a team working with Claude Sonnet all day.

The tradeoff used to be that local models were not good enough to justify the hassle. That changed in early 2026, when models like Qwen3-Coder started posting SWE-bench scores near Claude Sonnet levels. The gap closed enough that local-first is now a legitimate default for everyday coding tasks, not just a curiosity.


The Three Backends Worth Knowing

Ollama (The Default Choice)

Ollama v0.14.0 added a native Anthropic-compatible /api/messages endpoint. Before this, you needed a proxy to translate Claude Code’s API calls into something Ollama understood. Now you just point Claude Code at Ollama directly.

unset ANTHROPIC_API_KEY
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"

claude --model qwen3-coder

One context setting worth adding: Claude Code needs at least 64K tokens of context to work well. Set this if you hit issues:

export OLLAMA_CTX_SIZE=65536

Ollama is cross-platform, well-maintained, and has the largest model library. Start here.

LM Studio (If You Prefer a GUI)

LM Studio 0.4.1 added an Anthropic-compatible /v1/messages endpoint. The setup is nearly identical, just pointed at a different port:

lms server start --port 1234

export ANTHROPIC_BASE_URL="http://localhost:1234"
export ANTHROPIC_AUTH_TOKEN="lmstudio"
export CLAUDE_CODE_ATTRIBUTION_HEADER=0

claude --model openai/qwen3-coder-30b

The practical advantage of LM Studio over Ollama is the GUI model browser and a feature called LM Link, which lets you run Claude Code on your laptop while the actual model inference runs on a separate, more powerful machine on your local network. If you have a desktop sitting nearby with more GPU headroom, this is useful.

oMLX (For Apple Silicon, Specifically)

oMLX is the most interesting addition to this space. It is a free, open-source inference server built for Apple Silicon Macs, and it solves a real problem: running models that exceed your physical unified memory.

It does this with a two-tier cache. Hot context stays in RAM. Older context blocks get offloaded to SSD automatically. In practice, this means a 16 GB MacBook Pro can run models that would otherwise require 24 GB of RAM, at the cost of some speed on cache misses.

Install it with:

brew install omlx
omlx start

The Anthropic-compatible endpoint runs at http://localhost:8000:

export ANTHROPIC_BASE_URL="http://localhost:8000"
export ANTHROPIC_AUTH_TOKEN="omlx"

claude --model qwen3-coder

It also includes a web admin panel and built-in chat at http://localhost:8000/admin/chat. If you are on a Mac and hitting RAM limits with Ollama, oMLX is worth trying before you give up on a model.

One limitation: Apple Silicon only. No Windows or Linux support.

Rapid-MLX (For Speed-Focused Mac Users)

Rapid-MLX is worth knowing about if raw throughput matters. It claims 4.2x faster inference than Ollama on Apple Silicon, 0.08s cached time-to-first-token, and ships with 17 tool-call parser formats - which matters because Claude Code’s agentic loop is tool-call heavy. It explicitly supports Claude Code via environment variables and is in active development (v0.6.75 as of early June 2026).

brew install raullenchai/rapid-mlx/rapid-mlx

unset ANTHROPIC_API_KEY
export ANTHROPIC_BASE_URL="http://localhost:8000"
export ANTHROPIC_AUTH_TOKEN="rapid-mlx"

claude --model qwen3-coder

It is newer and has a smaller community than Ollama, so expect rougher edges. But if you are running Claude Code heavily on a Mac and hitting latency walls, it is worth testing.

ExLlamaV3 + TabbyAPI (For NVIDIA GPU Users)

If you are on Windows or Linux with an NVIDIA GPU, the Mac-focused tools above are not your story. The combination to know is ExLlamaV3 with TabbyAPI.

ExLlamaV3 uses a new EXL3 quantization format that runs Llama-3.1-70B at 1.6 bits-per-weight in under 16 GB of VRAM. Models that previously required a multi-GPU server now fit on a single consumer card. TabbyAPI provides the OpenAI-compatible serving layer on top:

docker run -p 5000:5000 theroyallab/tabbyapi

# Use LiteLLM to bridge to Anthropic format for Claude Code:
pip install litellm
litellm --model openai/your-model --api-base http://localhost:5000 --port 4000

unset ANTHROPIC_API_KEY
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="tabbyapi"

claude

TabbyAPI exposes an OpenAI-compatible endpoint, so you need a lightweight bridge like LiteLLM to translate it for Claude Code. One command, no configuration files.


Which Model to Pull

The model matters more than the backend. Here is what I would pull depending on hardware:

8-10 GB VRAM or 16 GB unified memory:

ollama pull qwen2.5-coder:14b

Solid coding quality, fits comfortably in 16 GB, 64K context window.

24 GB VRAM or unified memory:

ollama pull qwen3-coder

This is the 30B active-parameter variant. 74% SWE-bench Verified score. 256K context. This is the one to reach for if your hardware supports it.

Best for autocomplete and fast edits:

ollama pull codestral:22b

Mistral’s code-specific model. Faster response times, purpose-built for fill-in-the-middle tasks.

Best for multi-file agentic work:

ollama pull devstral-small-2

Mistral built this specifically for multi-file editing and debugging workflows. 24B parameters, 128K context.

Fast tool-calling with 128K context:

ollama pull glm-4.7-flash

GLM-4.7-Flash is consistently strong at tool use, which matters for Claude Code’s agentic operations.

Good all-rounder that fits 16 GB:

ollama pull gemma3:12b

Google’s Gemma 4 12B dropped June 3, 2026. It fits comfortably in 16 GB of RAM, has native vision support, and Ollama 0.23+ has MTP speculative decoding for it. A reasonable middle-ground pick if Qwen2.5-Coder:14b is not working for you.

Successor to Qwen3-Coder if you have the hardware:

ollama pull qwen3-coder-next

Qwen3-Coder-Next is the 80B total / 3B active follow-up with an agentic-first design. On paper the benchmark numbers are strong. In practice it needs similar hardware to qwen3-coder - 24 GB unified memory or VRAM.


What Has Not Changed

Claude Code still works best with large context. Models under 32K context are going to struggle with anything beyond small, isolated tasks. Stick to models with at least 64K.

The agentic loop also puts real load on the model. Responses are longer and more structured than a simple chat. Smaller or weaker models tend to drift off-format and break the loop. If Claude Code starts producing garbled responses or stops completing tasks mid-way, the model is not keeping up, not Claude Code.


The Hybrid Pattern

Most developers I have talked to are not running fully local for everything. The more practical setup is: local model for routine tasks (autocomplete, small edits, explaining code), cloud for the hard 20% (large refactors, unfamiliar codebases, complex debugging).

Tools like Cline and Roo Code support per-task routing. You can configure them to hit Ollama by default and escalate to Anthropic when the task exceeds a threshold. That gives you the privacy and cost wins on the bulk of daily work while keeping the cloud option available when you actually need it.

LiteLLM is the standard proxy for teams doing this at scale. One config file routes by model name - local Ollama for the routine work, Anthropic for the hard stuff - and everything downstream sees a single API endpoint. If you use it, pin to version 1.83.0 or later; versions 1.82.7 through 1.82.8 had a supply chain incident in March 2026.

If you want a fully offline agentic coding tool that does not involve Claude Code at all, Goose is the most credible option right now. Block donated it to the Linux Foundation’s Agentic AI Foundation in December 2025 alongside MCP and AGENTS.md. It supports Ollama natively, runs as a desktop app or CLI, and has the institutional backing that suggests it will be around long-term.


What to Do Right Now

If you have Ollama installed:

ollama pull qwen2.5-coder:14b  # adjust to your hardware

unset ANTHROPIC_API_KEY
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"

claude

That is the whole setup. The proxy era is over.

If you are on Apple Silicon and pushing against RAM limits, try oMLX as a drop-in replacement for the Ollama backend. It handles the memory management automatically.

The quality bar for local coding models has crossed the threshold where this is worth doing seriously, not just experimenting with. The privacy and cost arguments were always there. Now the models are good enough to make them stick.


Thinking through local AI infrastructure for your team? Let's talk through the tradeoffs.

Schedule a Call Schedule a call
© 2024 Shawn Mayzes. All rights reserved.