Cloud AI Coding Is Getting Worse. Local Models Can Replace It.

I spent about $30 on GPU rentals and tested five local coding models on real tasks. Four of the five passed everything. Here's the full picture.

Tags: ai, local-llm, coding

Something is happening to cloud AI coding tools.

On March 23, Claude Code users started reporting that their usage quotas were draining in 90 minutes instead of five hours. Anthropic confirmed they’d been “adjusting” limits during peak hours, cutting session capacity by nearly half between 5am and 11am Pacific. Max subscribers paying $200 a month posted that they could only use Claude 12 out of 30 days. One user watched a single prompt burn their quota from 21% to 100%.

On April 4, Anthropic banned 135,000 OpenClaw instances from using flat-rate subscriptions. Five days later, OpenAI launched a $100/month ChatGPT Pro tier for Codex users, quietly noting they’re “rebalancing usage to support more sessions rather than longer sessions.” Translation: shorter sessions.

The economics make sense. The average Claude Code developer burns about $6/day in API-equivalent compute. Outliers hit $5,000/day. Subscriptions priced for human-paced chat can’t absorb that.

I use Claude Code every day. I’ve noticed the thinking getting shallower. And I’ve been watching this trend long enough to see where it’s going.

So I tested whether local models are ready to pick up the slack. The short answer: yes, for most coding tasks, right now.

What I Found

I tested five open-weight models on RunPod GPUs, running them through Claude Code on eight real coding tasks: bug fixes, TDD, refactoring, feature addition, security patches, debugging, implementing from test specs, and algorithm optimization. Real files, real pytest suites, real pass/fail.
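
If you want to picture the harness, here's a minimal sketch of how a single task gets scored. The directory layout, prompt plumbing, and local-server setup are illustrative assumptions, not the exact harness code (which is on GitHub):

```python
import os
import subprocess

def run_task(task_dir: str, prompt: str) -> bool:
    """Drive Claude Code against one task directory, then let pytest judge."""
    # Point Claude Code at a local model server instead of Anthropic's API.
    # The exact proxy/env-var setup depends on your serving stack (assumption).
    env = {**os.environ, "ANTHROPIC_BASE_URL": "http://localhost:8000"}

    # Headless mode: -p runs one prompt without the interactive UI.
    subprocess.run(
        ["claude", "-p", prompt, "--dangerously-skip-permissions"],
        cwd=task_dir, env=env, timeout=1800,
    )

    # Real pass/fail: exit code 0 from pytest means all tests green.
    return subprocess.run(["pytest", "-q"], cwd=task_dir).returncode == 0
```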

Four of the five models passed every task. All tests green, no human intervention.

| Model | Size | Harness Score | Speed (GPU) | Fits 64GB Mac? |
|---|---|---|---|---|
| Gemma 4 31B | 19GB | 24/24 | 20 tok/s | Yes |
| Qwen3.5-27B | 16GB | 24/24 | 22 tok/s | Yes |
| Coder-Next 80B | 48GB | 24/24 | 114 tok/s | Tight |
| MiniMax M2.5 229B | 138GB | 24/24 | 65 tok/s | 256GB only |
| Devstral 2 123B | 75GB | 0/24 | 18 tok/s | 128GB+ |

Devstral writes the best code in isolation but couldn’t drive Claude Code due to a Mistral tool-format incompatibility. The model works, the plumbing doesn’t. Infrastructure matters as much as model quality.
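
For the curious, the mismatch looks roughly like this. The precise failure depends on the serving stack, so treat this as an illustration: Claude Code speaks Anthropic's Messages API, while Mistral-family models are trained to emit OpenAI-style function calls.

```python
# Anthropic's Messages API represents a tool call as a content block:
anthropic_tool_call = {
    "type": "tool_use",
    "id": "toolu_01",
    "name": "read_file",
    "input": {"path": "cache.py"},       # arguments as structured JSON
}

# Mistral-family models emit OpenAI-style tool calls instead, with the
# arguments serialized as a JSON *string* inside a "function" wrapper:
openai_style_tool_call = {
    "id": "call_01",
    "type": "function",
    "function": {"name": "read_file", "arguments": '{"path": "cache.py"}'},
}
```

A translation proxy has to reshape every call and every tool result in both directions; when that layer drops or mangles one, the agent loop stalls even though the model itself is fine.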

The Nuance

These results come with context.

The harness matters. When I tested the same models through raw API calls (no Claude Code), they scored 13-22 out of 30 on agentic tasks. Through Claude Code, four went 24/24. The difference: Claude Code provides a feedback loop. It reads test failures, retries, iterates. The model doesn’t need to be perfect on the first try. It needs to be good enough to converge.
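
Here's a sketch of what that loop buys you. This is not Claude Code's actual implementation, and `ask_model` is a hypothetical stand-in for one agent turn; the point is the shape of the retry cycle:

```python
import subprocess
from typing import Callable

def converge(task_dir: str, prompt: str,
             ask_model: Callable[[str], None], max_attempts: int = 5) -> bool:
    """Retry until the tests pass, feeding failures back into the prompt."""
    feedback = ""
    for _ in range(max_attempts):
        # One agent turn: the model reads and edits files in task_dir.
        ask_model(prompt + feedback)
        result = subprocess.run(
            ["pytest", "-q"], cwd=task_dir, capture_output=True, text=True
        )
        if result.returncode == 0:
            return True  # converged: all tests green
        # The feedback loop: failure output becomes context for the next try.
        feedback = "\n\nThe tests failed with:\n" + result.stdout
    return False
```

A model that scores 60% on first attempts can still finish at 100% inside a loop like this, which is exactly what the 24/24 results reflect.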

The tasks were well-defined. Bug fixes with test suites. Implementations with specs. Clear pass/fail criteria. Ask a local model to architect a new system or debug a race condition with no reproduction steps, and you’ll still want Opus.

Speed varies by hardware. Those GPU numbers don’t translate directly to Mac. On an M4 Max 64GB, Gemma 4 runs at about 6 tok/s. Usable but not fast. On the upcoming M5 Ultra 256GB, MiniMax runs at an estimated 45 tok/s. That’s responsive.

Code quality differs. MiniMax added thread safety to a Cache implementation without being asked. Coder-Next wrote the most documented code. Gemma 4 leaked internal reasoning into comments. All passed the tests, but you’d review the code differently.

The Speed Picture

Estimated useful tokens per second on Apple Silicon (bandwidth-extrapolated from GPU measurements):

| Model | M4 Max 64GB | M4 Max 128GB | M5 Ultra 256GB |
|---|---|---|---|
| Coder-Next 80B | 35 | 35 | 78 |
| MiniMax M2.5 | -- | -- | 45 |
| Qwen3.5-27B | 7 | 7 | 16 |
| Gemma 4 31B | 6 | 6 | 14 |
| Devstral 123B | -- | 6 | 12 |

-- = doesn’t fit. These are estimates with 10-20% variance expected on real hardware.
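
The extrapolation itself is simple bandwidth scaling: decode on a memory-bound model goes about as fast as you can stream the weights. The GPU bandwidth figure below is my assumption about the rental hardware, not a measured spec; the M4 Max number is Apple's published bandwidth.

```python
def estimate_tok_s(measured_tok_s: float,
                   source_bw_gb_s: float,
                   target_bw_gb_s: float) -> float:
    # Same model, same quantization: decode speed scales with memory
    # bandwidth when generation is bandwidth-bound. MoE models move
    # fewer bytes per token, so they beat this naive formula.
    return measured_tok_s * (target_bw_gb_s / source_bw_gb_s)

# e.g. Gemma 4 31B measured at 20 tok/s on a ~2,000 GB/s datacenter GPU
# (assumption) maps to roughly 5-6 tok/s on an M4 Max at 546 GB/s:
print(estimate_tok_s(20, 2000, 546))  # ~5.5
```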

What I Spent

| Item | Cost |
|---|---|
| RunPod GPU time (all benchmarks + harness tests) | ~$30 |
| Anthropic API | $0 |
| Hardware purchased | $0 |

Everything in this series came from GPU rentals and scripts. The benchmark code and raw results are on GitHub.

The Rest of This Series

This is the first of six posts. The rest go deeper: