Cloud AI Coding Is Getting Worse. Local Models Can Replace It.
I spent about $30 on GPU rentals and tested five local coding models on real tasks. Four of them passed everything. Here's the full picture.
Something is happening to cloud AI coding tools.
On March 23, Claude Code users started reporting that their usage quotas were draining in 90 minutes instead of five hours. Anthropic confirmed they’d been “adjusting” limits during peak hours, cutting session capacity by nearly half between 5am and 11am Pacific. Max subscribers paying $200 a month posted that they could only use Claude 12 out of 30 days. One user watched a single prompt burn their quota from 21% to 100%.
On April 4, Anthropic banned 135,000 OpenClaw instances from using flat-rate subscriptions. Five days later, OpenAI launched a $100/month ChatGPT Pro tier for Codex users, quietly noting they’re “rebalancing usage to support more sessions rather than longer sessions.” Translation: shorter sessions.
The economics make sense. The average Claude Code developer burns about $6/day in API-equivalent compute. Outliers hit $5,000/day. Subscriptions priced for human-paced chat can’t absorb that.
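The arithmetic behind that claim is simple (assuming a flat 30-day month; the per-day figures are the ones quoted above):

```python
# Rough subscription economics from the figures above.
# Assumption: 30-day billing month, API-equivalent compute cost per day.
monthly_price = 200                  # Max tier, $/month
avg_monthly_compute = 6 * 30         # average user: $180/month, roughly covered
outlier_monthly_compute = 5000 * 30  # outlier: $150,000/month, not remotely covered

print(avg_monthly_compute, outlier_monthly_compute)  # → 180 150000
```

The average user roughly breaks even for the provider; a single outlier erases the margin from hundreds of average subscribers.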
I use Claude Code every day. I’ve noticed the thinking getting shallower. And I’ve been watching this trend long enough to see where it’s going.
So I tested whether local models are ready to pick up the slack. The short answer: yes, for most coding tasks, right now.
What I Found
I tested five open-weight models on RunPod GPUs, running them through Claude Code on eight real coding tasks: bug fixes, TDD, refactoring, feature addition, security patches, debugging, implementing from test specs, and algorithm optimization. Real files, real pytest suites, real pass/fail.
Four of the five models passed every task. All tests green, no human intervention.
| Model | Size | Harness Score | Speed (GPU) | Fits 64GB Mac? |
|---|---|---|---|---|
| Gemma 4 31B | 19GB | 24/24 | 20 tok/s | Yes |
| Qwen3.5-27B | 16GB | 24/24 | 22 tok/s | Yes |
| Coder-Next 80B | 48GB | 24/24 | 114 tok/s | Tight |
| MiniMax M2.5 229B | 138GB | 24/24 | 65 tok/s | No (needs 256GB) |
| Devstral 2 123B | 75GB | 0/24 | 18 tok/s | No (needs 128GB+) |
Devstral writes the best code in isolation but couldn’t drive Claude Code due to a Mistral tool-format incompatibility. The model works, the plumbing doesn’t. Infrastructure matters as much as model quality.
The Nuance
These results come with context.
The harness matters. When I tested the same models through raw API calls (no Claude Code), they scored 13-22 out of 30 on agentic tasks. Through Claude Code, four went 24/24. The difference: Claude Code provides a feedback loop. It reads test failures, retries, iterates. The model doesn’t need to be perfect on the first try. It needs to be good enough to converge.
The tasks were well-defined. Bug fixes with test suites. Implementations with specs. Clear pass/fail criteria. Ask a local model to architect a new system or debug a race condition with no reproduction steps, and you’ll still want Opus.
Speed varies by hardware. Those GPU numbers don’t translate directly to Mac. On an M4 Max 64GB, Gemma 4 runs at about 6 tok/s. Usable but not fast. On the upcoming M5 Ultra 256GB, MiniMax runs at an estimated 45 tok/s. That’s responsive.
Code quality differs. MiniMax added thread safety to a Cache implementation without being asked. Coder-Next wrote the most documented code. Gemma 4 leaked internal reasoning into comments. All passed the tests, but you’d review the code differently.
The Speed Picture
Estimated useful tokens per second on Apple Silicon (bandwidth-extrapolated from GPU measurements):
| Model | M4 Max 64GB | M4 Max 128GB | M5 Ultra 256GB |
|---|---|---|---|
| Coder-Next 80B | 35 | 35 | 78 |
| MiniMax M2.5 | — | — | 45 |
| Qwen3.5-27B | 7 | 7 | 16 |
| Gemma 4 31B | 6 | 6 | 14 |
| Devstral 123B | — | 6 | 12 |
“—” = doesn’t fit. These are estimates; expect 10-20% variance on real hardware.
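The extrapolation method is worth showing. If decoding is memory-bandwidth-bound, an upper bound on tokens per second is bandwidth divided by bytes read per token; for a dense model that's the whole weight file, for an MoE model only the active experts. This is a ceiling, not a prediction, and real throughput lands well below it (the `active_fraction` value here is illustrative, not a spec of any model above):

```python
# Back-of-envelope decode-speed ceiling: bandwidth-bound estimate.
# Assumption: every token requires reading the active weights from memory.

def decode_ceiling(bandwidth_gb_s: float, model_gb: float,
                   active_fraction: float = 1.0) -> float:
    """Upper-bound tok/s. active_fraction < 1 models MoE architectures,
    where only a subset of weights is read per token."""
    return bandwidth_gb_s / (model_gb * active_fraction)

# M4 Max memory bandwidth is ~546 GB/s (published spec).
print(round(decode_ceiling(546, 19)))        # → 29: dense ~19GB model
print(round(decode_ceiling(546, 48, 0.08)))  # → 142: MoE, ~8% active (illustrative)
```

Overhead, KV-cache reads, and thermal limits push real numbers far under these ceilings, which is why the table reports measured-and-extrapolated figures rather than the formula's output.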
What I Spent
| Item | Cost |
|---|---|
| RunPod GPU time (all benchmarks + harness tests) | ~$30 |
| Anthropic API | $0 |
| Hardware purchased | $0 |
Everything in this series came from GPU rentals and scripts. The benchmark code and raw results are on GitHub.
The Rest of This Series
This is the first of six posts. The rest go deeper:
- *Can Local Models Be Coding Agents?* The raw tool-calling benchmark: 10 tasks across 3 difficulty tiers. Where models break without a harness.
- *Every Mac That Can Run Local AI Models*: Hardware buyer's guide with price/performance tables.
- *The Thinking Token Tax*: Why tok/s benchmarks are lying to you.
- *Building a Local Coding Stack*: Claude Code, Cline, Aider, OpenHands: what works with local models.
- *I Pointed Claude Code at Local Models*: The real test: code quality, architecture decisions, and what each model actually writes.