Two models, two futures: Qwen 3.6 vs Opus 4.7

Two major model releases today: Opus 4.7 and Qwen 3.6.

I run an 18-task agentic coding benchmark that takes 3-5 hours. Real software engineering work - tool use, multi-file edits, complex reasoning. Qwen 3.6 is by far the best self-hosted model I’ve tested. It scores better than state-of-the-art 120B parameter models. I get 230 tok/s on my 5090 with 262k context and VRAM to spare.

It scores similarly to Gemini 2.5 Flash. Not “good for local” - actually competitive with cloud providers. Excellent tool-calling support. Usable for real software engineering work.

This is the first self-hosted model that actually works.

the numbers

Qwen 3.6 on consumer hardware:

- 230 tok/s on a single 5090
- 262k context window, with VRAM to spare
- benchmark scores on par with Gemini 2.5 Flash, ahead of 120B-class open models

No rate limits. No data leaving your premises. No wondering if your code is training someone else's model. Just performance on hardware you own.
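To put a rough number on "GPU amortization": a back-of-the-envelope sketch of what local generation costs per million tokens. Only the 230 tok/s figure comes from this post; the card price, lifespan, and utilization rate are hypothetical placeholders, and electricity is ignored entirely.

```python
# Hypothetical GPU amortization sketch. Only the 230 tok/s figure is from
# the post; price, lifespan, and utilization are placeholder assumptions.
gpu_price = 2000.0                    # hypothetical 5090 price, $
lifespan_hours = 3 * 365 * 24 * 0.25  # 3 years at 25% utilization
tok_per_s = 230.0                     # measured generation speed

total_tokens = tok_per_s * lifespan_hours * 3600
cost_per_mtok = gpu_price / (total_tokens / 1e6)
print(f"amortized hardware cost: ${cost_per_mtok:.3f} per million tokens")
```

Even with pessimistic utilization assumptions, the amortized hardware cost lands well under a dollar per million tokens, which is the shape of the argument, not a precise quote.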

the other release

Opus 4.7: scores roughly 2-4% higher than 4.6 on my benchmark, and costs 4.5x as much per task.

The cache_create pricing is 4x higher. It builds more context per turn and uses a ton of thinking tokens. You’re paying for slightly better performance on problems that require extensive algorithmic reasoning - the hardest 10% of tasks where that extra reasoning depth matters. For everything else, you’re lighting money on fire.
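The tradeoff above reduces to simple arithmetic. The dollar figure below is a placeholder, not actual Opus pricing; only the 4.5x per-task cost ratio and the 2-4% score delta come from the benchmark results discussed here.

```python
# Hypothetical break-even arithmetic for the Opus 4.6 -> 4.7 upgrade.
# base_cost is a placeholder; the 4.5x multiplier and 2-4% score delta
# are the figures cited in the text.
base_cost = 1.00            # hypothetical $ per task on Opus 4.6
new_cost = base_cost * 4.5  # Opus 4.7 costs ~4.5x per task
score_gain = 0.03           # midpoint of the 2-4% improvement

extra_cost = new_cost - base_cost
cost_per_point = extra_cost / (score_gain * 100)
print(f"extra cost per task: ${extra_cost:.2f}")
print(f"extra cost per benchmark point: ${cost_per_point:.2f}")
```

Whatever the real per-task cost is, the ratio is what matters: you pay 3.5x extra for roughly three points of benchmark improvement.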

the comparison

|                         | qwen 3.6 (local)  | opus 4.7 (cloud) |
|-------------------------|-------------------|------------------|
| performance vs previous | beats 120B models | 2-4% over 4.6    |
| token speed             | 230 tok/s         | rate limited     |
| context window          | 262k              | varies by tier   |
| cost                    | gpu amortization  | 4.5x previous    |
| data privacy            | complete          | none             |
| tool calling            | excellent         | excellent        |

The cost curve is going the wrong direction for cloud. Each marginal improvement costs disproportionately more. We're hitting the wall where "better" means dramatically more expensive for minimal gains.

what this means

These releases represent two different futures.

One: AI becomes increasingly centralized and expensive. You pay more for less improvement, your data flows through someone else’s servers, you work within their rate limits.

The other: AI becomes distributed and sufficient. You run models good enough for real work on hardware from Microcenter. Performance improves with each release while staying on consumer hardware.

Qwen 3.6 is the first local model where I don’t miss the cloud for most engineering tasks. There’s still a capability gap - Opus 4.7 does handle those hardest 10% of problems better. That gap will probably always exist.

But when a consumer GPU matches Gemini 2.5 Flash, the question becomes real: why are we paying 4.5x more for marginal gains? For the first time, running production-quality AI on the GPU you bought for gaming is an actual choice, not a compromise.