Cloud AI Coding Is Getting Worse. Local Models Can Replace It.
I spent about $30 on GPU rentals and tested five local coding models on real tasks. Four of them passed everything. Here's the full picture.
Something is happening to cloud AI coding tools.
On March 23, Claude Code users started reporting that their usage quotas were draining in 90 minutes instead of five hours. Anthropic confirmed they’d been “adjusting” limits during peak hours, cutting session capacity by nearly half between 5am and 11am Pacific. Max subscribers paying $200 a month posted that they could only use Claude 12 out of 30 days. One user watched a single prompt burn their quota from 21% to 100%.
On April 4, Anthropic banned 135,000 OpenClaw instances from using flat-rate subscriptions. Five days later, OpenAI launched a $100/month ChatGPT Pro tier for Codex users, quietly noting they’re “rebalancing usage to support more sessions rather than longer sessions.” Translation: shorter sessions.
The economics make sense. The average Claude Code developer burns about $6/day in API-equivalent compute. Outliers hit $5,000/day. Subscriptions priced for human-paced chat can’t absorb that.
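The arithmetic behind that claim is simple (assuming a flat 30-day month; the per-day figures are the ones quoted above):

```python
# Rough subscription economics from the figures above.
# Assumption: 30-day billing month, API-equivalent compute cost per day.
monthly_price = 200                  # Max tier, $/month
avg_monthly_compute = 6 * 30         # average user: $180/month, roughly covered
outlier_monthly_compute = 5000 * 30  # outlier: $150,000/month, not remotely covered

print(avg_monthly_compute, outlier_monthly_compute)  # → 180 150000
```

The average user roughly breaks even for the provider; a single outlier erases the margin from hundreds of average subscribers.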
I use Claude Code every day. I’ve noticed the thinking getting shallower. And I’ve been watching this trend long enough to see where it’s going.
So I tested whether local models are ready to pick up the slack. The short answer: yes, for most coding tasks, right now.
What I Found
I tested five open-weight models on RunPod GPUs, running them through Claude Code on eight real coding tasks: bug fixes, TDD, refactoring, feature addition, security patches, debugging, implementing from test specs, and algorithm optimization. Real files, real pytest suites, real pass/fail.
Four of the five models passed every task. All tests green, no human intervention.
| Model | Size | Harness Score | Speed (GPU) | Fits 64GB Mac? |
|---|---|---|---|---|
| Gemma 4 31B | 19GB | 24/24 | 20 tok/s | Yes |
| Qwen3.5-27B | 16GB | 24/24 | 22 tok/s | Yes |
| Coder-Next 80B | 48GB | 24/24 | 114 tok/s | Tight |
| MiniMax M2.5 229B | 138GB | 24/24 | 65 tok/s | No (needs 256GB) |
| Devstral 2 123B | 75GB | 0/24 | 18 tok/s | No (needs 128GB+) |
Devstral writes the best code in isolation but couldn’t drive Claude Code due to a Mistral tool-format incompatibility. The model works, the plumbing doesn’t. Infrastructure matters as much as model quality.
The Nuance
These results come with context.
The harness matters. When I tested the same models through raw API calls (no Claude Code), they scored 13-22 out of 30 on agentic tasks. Through Claude Code, four went 24/24. The difference: Claude Code provides a feedback loop. It reads test failures, retries, iterates. The model doesn’t need to be perfect on the first try. It needs to be good enough to converge.
The tasks were well-defined. Bug fixes with test suites. Implementations with specs. Clear pass/fail criteria. Ask a local model to architect a new system or debug a race condition with no reproduction steps, and you’ll still want Opus.
Speed varies by hardware. Those GPU numbers don’t translate directly to Mac. On an M4 Max 64GB, Gemma 4 runs at about 6 tok/s. Usable but not fast. On the upcoming M5 Ultra 256GB, MiniMax runs at an estimated 45 tok/s. That’s responsive.
Code quality differs. MiniMax added thread safety to a Cache implementation without being asked. Coder-Next wrote the most documented code. Gemma 4 leaked internal reasoning into comments. All passed the tests, but you’d review the code differently.
The Speed Picture
Estimated useful tokens per second on Apple Silicon (bandwidth-extrapolated from GPU measurements):
| Model | M4 Max 64GB | M4 Max 128GB | M5 Ultra 256GB |
|---|---|---|---|
| Coder-Next 80B | 35 | 35 | 78 |
| MiniMax M2.5 | — | — | 45 |
| Qwen3.5-27B | 7 | 7 | 16 |
| Gemma 4 31B | 6 | 6 | 14 |
| Devstral 123B | — | 6 | 12 |
“—” = doesn’t fit. These are estimates; expect 10-20% variance on real hardware.
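The extrapolation method is worth showing. If decoding is memory-bandwidth-bound, an upper bound on tokens per second is bandwidth divided by bytes read per token; for a dense model that's the whole weight file, for an MoE model only the active experts. This is a ceiling, not a prediction, and real throughput lands well below it (the `active_fraction` value here is illustrative, not a spec of any model above):

```python
# Back-of-envelope decode-speed ceiling: bandwidth-bound estimate.
# Assumption: every token requires reading the active weights from memory.

def decode_ceiling(bandwidth_gb_s: float, model_gb: float,
                   active_fraction: float = 1.0) -> float:
    """Upper-bound tok/s. active_fraction < 1 models MoE architectures,
    where only a subset of weights is read per token."""
    return bandwidth_gb_s / (model_gb * active_fraction)

# M4 Max memory bandwidth is ~546 GB/s (published spec).
print(round(decode_ceiling(546, 19)))        # → 29: dense ~19GB model
print(round(decode_ceiling(546, 48, 0.08)))  # → 142: MoE, ~8% active (illustrative)
```

Overhead, KV-cache reads, and thermal limits push real numbers far under these ceilings, which is why the table reports measured-and-extrapolated figures rather than the formula's output.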
What I Spent
| Item | Cost |
|---|---|
| RunPod GPU time (all benchmarks + harness tests) | ~$30 |
| Anthropic API | $0 |
| Hardware purchased | $0 |
Everything in this series came from GPU rentals and scripts. The benchmark code and raw results are on GitHub.
The Rest of This Series
This is the first of six posts. The rest go deeper:
- *Can Local Models Be Coding Agents?* The raw tool-calling benchmark: 10 tasks across 3 difficulty tiers. Where models break without a harness.
- *Every Mac That Can Run Local AI Models*: Hardware buyer's guide with price/performance tables.
- *The Thinking Token Tax*: Why tok/s benchmarks are lying to you.
- *Building a Local Coding Stack*: Claude Code, Cline, Aider, OpenHands: what works with local models.
- *I Pointed Claude Code at Local Models*: The real test: code quality, architecture decisions, and what each model actually writes.