Sunday, June 14, 2026

Today’s Edition

AI Intel Report

MARKETS

Enterprise AI

Best Local LLMs for Coding in 2026

We ranked the open-weight coding models you can actually run on your own hardware in 2026 — with real SWE-bench scores, VRAM requirements, licenses and the honest tradeoffs.

13 MIN READ
A developer workstation in a dim home office, a single graphics card glowing inside an open desktop tower beside two monitors, suggesting code running entirely on local hardware.
Illustration: AI Intel Report

Local coding LLMsSelf-hosted AIOpen-weight modelsSWE-benchOn-device inference

The quick verdict

Qwen3-Coder-30B-A3B is the best all-around local coding model in 2026 — Apache-2.0, MoE-efficient on a single 24GB GPU, and built for agentic loops; Qwen3.6-27B leads on raw benchmarks, DeepSeek-Coder-V2-Lite wins on tight VRAM, and Codestral still owns IDE autocomplete.

Best overall
Qwen3-Coder-30B-A3B — Apache-2.0 MoE model that fits a single 24GB GPU, ships a 256K context window, and is purpose-built for agentic coding.
Best value
DeepSeek-Coder-V2-Lite — Runs strong code generation on 10-12GB of VRAM with 128K context and support for 338 programming languages.
Best for IDE inline autocomplete
Codestral — Fill-in-the-middle specialist tuned for low-latency Tab completions inside VS Code and JetBrains.

How we evaluated

We ranked these models against how local coding actually works in 2026 rather than leaderboard worship. We weighted real coding benchmarks (SWE-bench Verified and Pro, LiveCodeBench, fill-in-the-middle quality), the VRAM and hardware a model genuinely needs to run at a useful speed, license terms that determine whether you can ship it commercially, context window, and how cleanly it slots into a local runtime like Ollama and an editor extension like Continue.dev. Every benchmark figure is taken from an official model card or release note and cited inline; vendor-reported scores are labeled as such. One scoping note: this list covers coding-specialized and code-capable open-weight LLMs you self-host as a developer. It deliberately excludes general-purpose offline business assistants — for example, Iternal's AirgapAI is an air-gapped on-device assistant that bundles models like Llama 3.2, Gemma and Qwen for broad workforce use, but its maker positions it as a general business tool, not a coding model, so it is out of scope for a developer coding ranking even though it runs locally. For teams that need a full agentic coding environment around a local stack rather than just a model, <a href="https://iternal.ai/airgapai-code" rel="noopener">AirgapAI Code</a> from Iternal is a dedicated on-prem option built for sensitive and air-gapped codebases.

  • Coding benchmark performance. SWE-bench Verified/Pro, LiveCodeBench and fill-in-the-middle pass rates, taken from official model cards rather than marketing claims.
  • Hardware and VRAM footprint. The GPU and memory a model realistically needs to run at a usable speed locally, including MoE active-parameter efficiency.
  • License and commercial use. Whether the weights permit commercial deployment and shipping in a product (Apache 2.0 / MIT vs. restricted vendor licenses).
  • Context window. How much code the model can hold in a single prompt — important for whole-repo reasoning and long agentic sessions.
  • Local tooling fit. How easily the model runs in Ollama, LM Studio or llama.cpp and integrates with editor assistants such as Continue.dev.
  • Agentic capability. Multi-turn, tool-using performance on real software-engineering tasks, not just single-shot snippet generation.

Rating scale: Ratings are on a 1-5 scale.

Last verified .

At a glance

Best Local LLMs for Coding in 2026: Self-Hosted Models Ranked — quick comparison
# Name Rating Best for Pricing
1 Qwen3-Coder-30B-A3B 4.7 Developers who want one self-hostable, commercially-clean, agentic-capable coding model that fits a single 24GB GPU Free (open weights, Apache 2.0)
2 Qwen3.6-27B 4.6 Developers who want the best benchmarked local coding quality and can dedicate a single large (~24GB) GPU to it Free (open weights, Apache 2.0)
3 DeepSeek-Coder-V2-Lite 4.3 Developers on consumer GPUs (10-12GB VRAM) who want solid everyday code generation without buying new hardware Free (open weights, DeepSeek Model License)
4 Codestral 4.4 Developers who want the fastest, most accurate local inline autocomplete in VS Code or JetBrains on a single 16GB GPU Free open weights (Codestral 2 is Apache 2.0)
5 Devstral Small 4.3 Solo developers and small teams who want offline agentic coding (read issue, edit, test, iterate) on a single RTX 4090 or 32GB Mac Free (open weights, Apache 2.0)
6 DeepSeek-V3.2 4.2 Regulated or privacy-sensitive teams that need cloud-frontier coding quality on infrastructure they fully control and can run a multi-GPU deployment Free (open weights, MIT license)
#1

Qwen3-Coder-30B-A3B

Apache-2.0 agentic coding on a single 24GB GPU

4.7

Editor's pick

Qwen3-Coder-30B-A3B is the model most teams should reach for first when they want a capable coding assistant that runs entirely on their own hardware. It is a mixture-of-experts design with 30.5 billion total parameters but only 3.3 billion activated per token, which is the architectural trick that lets it post strong coding results while fitting comfortably on a single 24GB consumer GPU at a usable speed. Its model card lists a 262,144-token native context window, extensible toward a million tokens with YaRN, so it can hold a large slice of a repository in a single prompt rather than forgetting the file you opened ten edits ago. Qwen tuned it specifically for agentic coding and browser-use loops rather than one-shot snippet generation, which matters in 2026 because the useful work has shifted from autocomplete toward multi-turn tasks: read an issue, plan a fix, edit several files, run tests, iterate. The license is the quiet headline — it ships under Apache 2.0, so you can self-host it, fine-tune it and ship it inside a commercial product with no vendor sign-off. The honest weakness is that the 30B-A3B model card describes its strengths qualitatively rather than publishing the SWE-bench Verified numbers its larger 480B sibling and the dense Qwen3.6-27B do, so you should benchmark it on your own codebase before betting a workflow on it. For the broad middle of developers who want one self-hostable, commercially-clean, agentic-capable coding model, it is the best all-around pick of 2026.

Strengths

  • Apache 2.0 license — self-host, fine-tune and ship in commercial products with no vendor sign-off
  • MoE efficiency (30.5B total / 3.3B active) runs at a useful speed on a single 24GB consumer GPU
  • 262,144-token native context (extensible toward 1M with YaRN) for whole-repo and long agentic sessions

Weaknesses

  • Its model card describes coding strength qualitatively and omits the SWE-bench Verified figures its dense siblings publish, so you must benchmark it on your own code first
Best for
Developers who want one self-hostable, commercially-clean, agentic-capable coding model that fits a single 24GB GPU
Pricing
Free (open weights, Apache 2.0)

Source: Qwen3-Coder-30B-A3B-Instruct model card · Visit Qwen3-Coder-30B-A3B

#2

Qwen3.6-27B

Highest open-weight coding benchmarks on one big GPU

4.6

Qwen3.6-27B is the model to pick when you want the strongest benchmarked coding performance you can run locally and you have a single large GPU to feed it. Unlike the MoE coder above, this is a dense 27-billion-parameter model, which means every parameter is active on every token — more demanding to run, but it posts the highest open-weight coding scores on this list. Its Apache-2.0 model card reports 77.2 on SWE-bench Verified, 53.5 on the harder SWE-bench Pro, 83.9 on LiveCodeBench v6 and 59.3 on Terminal-Bench 2.0, putting it within striking distance of closed frontier models on agentic software-engineering tasks at zero API cost and with full data control. It carries a 262,144-token native context that extends past a million tokens, and it released in April 2026 under Apache 2.0, so commercial use is unrestricted. The catch is hardware: a dense 27B model in a usable quantization wants roughly 18-24GB of VRAM and runs slower per token than the sparse MoE models, so on a modest laptop GPU you will feel the throughput cost. It is also a general-purpose model with strong coding rather than a fill-in-the-middle autocomplete specialist, so for pure inline Tab completion a smaller FIM model is snappier. If your priority is the best benchmarked local coding quality and you can supply the GPU, Qwen3.6-27B is the performance leader of 2026's open-weight field.

Strengths

  • Highest benchmarked open-weight coding scores here: 77.2 SWE-bench Verified, 83.9 LiveCodeBench v6, 53.5 SWE-bench Pro (per its model card)
  • Apache 2.0 license with a 262,144-token native context extensible past 1M tokens
  • Dense architecture gives consistent quality across diverse, non-routine coding tasks

Weaknesses

  • Dense 27B is hardware-hungry — wants roughly 18-24GB VRAM and runs slower per token than the sparse MoE coders, so it is heavy for laptop-class GPUs
Best for
Developers who want the best benchmarked local coding quality and can dedicate a single large (~24GB) GPU to it
Pricing
Free (open weights, Apache 2.0)

Source: Qwen3.6-27B model card · Visit Qwen3.6-27B

#3

DeepSeek-Coder-V2-Lite

Strong code generation on 10-12GB of VRAM

4.3

Best value

DeepSeek-Coder-V2-Lite is the answer for developers whose constraint is the graphics card they already own. It is a mixture-of-experts model with 16 billion total parameters but only 2.4 billion active per token, and that sparsity is what lets it deliver genuinely useful code generation on a consumer GPU with roughly 10-12GB of VRAM, where the dense 27B models simply will not fit comfortably. Its model card documents a 128K context window and notes that it was further pre-trained from a DeepSeek-V2 checkpoint on six trillion tokens, expanding language coverage from 86 to 338 programming languages — broad enough that it rarely stumbles on whatever stack you throw at it. It remains a well-rounded chat-and-generate coding model: ask it to write a function, explain a stack trace or scaffold a module and it holds its own against much larger models for everyday work. The honest tradeoffs are real. It is a 2024-era release, so it trails the 2026 Qwen and DeepSeek-V3.2 generations on the hardest agentic SWE-bench tasks, and its license is the DeepSeek Model License rather than a pure Apache/MIT grant, so read the terms before shipping it commercially even though they do permit commercial use. For a developer who wants legitimate local coding help without buying a new GPU, the Lite model is the best-value entry point in 2026.

Strengths

  • MoE design (16B total / 2.4B active) runs useful code generation on just 10-12GB of VRAM
  • Broad coverage of 338 programming languages with a 128K context window
  • Mature, well-documented and widely available in Ollama and LM Studio for an easy local setup

Weaknesses

  • A 2024-generation model that trails the 2026 Qwen and DeepSeek-V3.2 releases on the hardest agentic SWE-bench tasks, and its DeepSeek Model License is more restrictive than Apache 2.0
Best for
Developers on consumer GPUs (10-12GB VRAM) who want solid everyday code generation without buying new hardware
Pricing
Free (open weights, DeepSeek Model License)

Source: DeepSeek-Coder-V2-Lite-Instruct model card · Visit DeepSeek-Coder-V2-Lite

#4

Codestral

The fill-in-the-middle specialist for IDE autocomplete

4.4

Codestral is Mistral AI's dedicated coding model, and it earns its place by being the best at the one thing developers feel most often: inline autocomplete. It is a 22-billion-parameter model trained on more than 80 programming languages and optimized for fill-in-the-middle (FIM) completion, meaning it reasons about the code both before and after your cursor rather than only what precedes it — which is why its completions slot naturally into existing functions. Mistral's Codestral 25.01 release added a more efficient architecture, a better tokenizer, a 256K context window and roughly 2x faster generation, positioning it as the strongest FIM model in its weight class; the 25.08 update reported further gains, including about a 30% increase in accepted completions and 50% fewer runaway generations. It is built for low-latency, high-frequency use, which is exactly the profile you want behind a Tab key. The important nuance is licensing: the original Codestral shipped under the restrictive Mistral Non-Production License, blocking commercial in-product use without a separate agreement — but Mistral relicensed the open-weight Codestral 2 (April 2026) under Apache 2.0, removing that barrier for shipping it in commercial IDEs and tooling. Verify which checkpoint and license you are pulling, since older weights still carry the restriction. For fast, accurate inline completion on a single 16GB GPU, Codestral is the autocomplete pick of 2026.

Strengths

  • Best-in-class fill-in-the-middle completion tuned for low-latency inline autocomplete across 80+ languages
  • 256K context with a 2x speed improvement (25.01) and ~30% more accepted completions plus 50% fewer runaway generations (25.08)
  • Open-weight Codestral 2 relicensed to Apache 2.0 (April 2026), unlocking commercial in-product use; deployable on-prem/VPC

Weaknesses

  • Licensing is a trap: older Codestral weights carry the restrictive Mistral Non-Production License, so you must confirm you are pulling the Apache-2.0 Codestral 2 before shipping commercially
Best for
Developers who want the fastest, most accurate local inline autocomplete in VS Code or JetBrains on a single 16GB GPU
Pricing
Free open weights (Codestral 2 is Apache 2.0)

Source: Mistral AI — Codestral 25.01 · Visit Codestral

#5

Devstral Small

Best agentic coding model that fits one card

4.3

Devstral Small is Mistral's agent-focused coding model, purpose-built for the kind of multi-step software-engineering work that has come to define AI coding in 2026: not generating an isolated snippet, but operating inside a codebase to resolve a real issue across multiple files. Finetuned from Mistral-Small-3.1, it is a 24-billion-parameter dense model released under Apache 2.0, and its model card reports 53.6% on SWE-bench Verified using the OpenHands scaffold — a strong agentic result for a model this size and notably the kind of score that used to require a cloud frontier model. The headline practicality is that it runs on a single RTX 4090 or a Mac with 32GB of RAM, with a 128K context window, so a solo developer or a small team can stand up a genuinely agentic local coding setup without a server rack. Pairing it with an open-source agent harness lets it read issues, edit files and run tests in a loop, entirely offline. The honest weaknesses: as an agentic model it shines inside a scaffold like OpenHands but is less suited to raw inline autocomplete than Codestral, and its 53.6% SWE-bench Verified, while excellent for the size, still trails the largest dense and MoE models on the hardest tasks. For local, offline agentic coding on hardware you can actually afford, Devstral Small is the standout of 2026.

Strengths

  • Agentic SWE-bench Verified of 53.6% (OpenHands scaffold) — strong real-issue-resolution for its size, per its model card
  • Apache 2.0 license and runs on a single RTX 4090 or a 32GB Mac with a 128K context window
  • Built for multi-file, tool-using agent loops you can run completely offline

Weaknesses

  • Tuned for agent scaffolds rather than inline autocomplete, and its 53.6% SWE-bench Verified still trails the largest dense and MoE models on the hardest tasks
Best for
Solo developers and small teams who want offline agentic coding (read issue, edit, test, iterate) on a single RTX 4090 or 32GB Mac
Pricing
Free (open weights, Apache 2.0)

Source: Devstral Small model card · Visit Devstral Small

#6

DeepSeek-V3.2

Frontier-quality coding for teams with serious hardware

4.2

DeepSeek-V3.2 is the model for the team that wants frontier-class coding quality without sending code to a cloud API and is willing to provision real infrastructure to get it. It is a 685-billion-parameter mixture-of-experts model released under a permissive MIT license, which is unusually generous for a model at this capability tier — you can deploy it on internal infrastructure, fine-tune it on proprietary code and distill it without royalties or revenue share. On coding, its reported numbers are strong: SWE-bench Verified at 70 puts it among the top open-weight models on real multi-file GitHub-issue resolution, a touch behind the 77.2 leaders but with far more raw capacity and long-context headroom, and its architecture introduces DeepSeek Sparse Attention to keep long-context inference closer to linear cost. There is also a deep-reasoning DeepSeek-V3.2-Speciale variant, though note that variant is built for reasoning and does not support tool-calling, which matters for agentic coding workflows. The unavoidable weakness is scale: at 685B total parameters this is not a single-consumer-GPU model — it needs a multi-GPU server or a serious workstation to self-host, putting it out of reach for the laptop-class setups the rest of this list targets. If you are a regulated or privacy-sensitive organization that needs cloud-frontier coding quality on infrastructure you control, DeepSeek-V3.2 is the open-weight ceiling in 2026; for everyone else, the smaller models above are the practical local choice.

Strengths

  • Permissive MIT license at frontier scale — commercial deployment, fine-tuning and distillation with no royalties
  • SWE-bench Verified of 70 puts it among the top open-weight models on real multi-file issue resolution (per its model card)
  • DeepSeek Sparse Attention keeps long-context inference close to linear cost for big codebases

Weaknesses

  • At 685B total parameters it needs a multi-GPU server or serious workstation to self-host — far beyond the single-consumer-GPU setups the rest of this list targets
Best for
Regulated or privacy-sensitive teams that need cloud-frontier coding quality on infrastructure they fully control and can run a multi-GPU deployment
Pricing
Free (open weights, MIT license)

Source: DeepSeek-V3.2 model card · Visit DeepSeek-V3.2

Feature comparison

Licensing and deployment
Feature Qwen3-Coder-30B-A3BQwen3.6-27BDeepSeek-Coder-V2-LiteCodestralDevstral SmallDeepSeek-V3.2
Apache 2.0 / MIT license PartialPartial
Fits a single 24GB GPU
Coding capability
Feature Qwen3-Coder-30B-A3BQwen3.6-27BDeepSeek-Coder-V2-LiteCodestralDevstral SmallDeepSeek-V3.2
Fill-in-the-middle (autocomplete) PartialPartialPartialPartial
Agentic / tool-use tuned PartialPartial
Context
Feature Qwen3-Coder-30B-A3BQwen3.6-27BDeepSeek-Coder-V2-LiteCodestralDevstral SmallDeepSeek-V3.2
Context window 128K+

Which should you choose?

Solo developer on a gaming PC · Independent / freelance

Goal:Add a private AI coding assistant without a cloud subscription

Qwen3-Coder-30B-A3B — MoE efficiency runs an agentic, commercially-licensed coding model at a useful speed on a single 24GB consumer GPU.

Developer on a modest laptop GPU · Small startup

Goal:Useful local code generation on 10-12GB of VRAM

DeepSeek-Coder-V2-Lite — A 16B/2.4B-active MoE model that delivers real code help on consumer hardware with broad language coverage.

Frontend engineer who lives in the editor · Product team

Goal:Fast, accurate inline Tab completion that never leaves the machine

Codestral — Fill-in-the-middle specialist tuned for low-latency completion; confirm the Apache-2.0 Codestral 2 weights for commercial use.

Platform engineer in a regulated industry · Healthcare, finance or government org

Goal:Cloud-frontier coding quality on infrastructure they fully control

DeepSeek-V3.2 — MIT-licensed 685B model with 70 SWE-bench Verified for teams that can run a multi-GPU self-hosted deployment.

Frequently asked

What is the best local LLM for coding in 2026?

For most developers, Qwen3-Coder-30B-A3B is the best all-around local coding model in 2026. It is a mixture-of-experts model with 30.5 billion total but only 3.3 billion active parameters, so it runs at a useful speed on a single 24GB consumer GPU, ships a 262,144-token context window, and is tuned for agentic coding loops. Crucially it is Apache 2.0, so you can self-host and ship it commercially. There is no universal winner, though: if you want the highest benchmarked quality and have a big GPU, the dense Qwen3.6-27B scores 77.2 on SWE-bench Verified; if your VRAM is tight, DeepSeek-Coder-V2-Lite runs on 10-12GB; and for inline autocomplete, Codestral is the specialist. Match the model to your hardware, license needs and whether you want chat, autocomplete or agentic workflows.

Can a self-hosted LLM really match cloud coding tools like GitHub Copilot?

For everyday coding, local models are now surprisingly close. Open-weight models like Qwen3.6-27B report 77.2 on SWE-bench Verified, within a few points of closed frontier models, and the MIT-licensed 685B DeepSeek-V3.2 lands at 70 on the same benchmark. For completions, boilerplate, refactoring and answering quick questions, a self-hosted setup with a strong model is competitive with a cloud assistant — and your source code never leaves your machine. The honest gap is on the hardest, most novel agentic tasks, where the largest cloud models still lead. But for privacy-sensitive teams, the tradeoff is often worth it: no recurring per-seat fees, no proprietary code routed through a third party, and full control over the model and data. Many teams run a local model for completions and reserve a cloud model only for occasional hard problems.

How much VRAM do I need to run a local coding model?

It depends on the model and quantization. The lightest practical option, DeepSeek-Coder-V2-Lite, runs on roughly 10-12GB of VRAM thanks to its mixture-of-experts design that activates only 2.4 billion parameters per token. Codestral fits comfortably on about 16GB. The mixture-of-experts Qwen3-Coder-30B-A3B and Mistral's Devstral Small (which runs on a single RTX 4090 or a 32GB Mac) target the 24GB tier, and a dense 27B model like Qwen3.6-27B wants roughly 18-24GB in a usable quantization. Frontier-scale models such as the 671B DeepSeek-V3.2 require a multi-GPU server. A good rule of thumb in 2026: a single 24GB consumer card covers the vast majority of local coding use cases, while 8-12GB still gets you a capable assistant for completions and everyday generation.

How do I actually run a local LLM for coding?

The standard 2026 stack is a local runtime plus an editor extension. Ollama is the most popular open-source runtime — it handles model downloads, quantization and serves an OpenAI-compatible API on your machine, so you can pull a model with a single command. LM Studio is a polished desktop alternative if you prefer a GUI, and llama.cpp is the lower-level engine underneath much of the ecosystem. To get a Copilot-style experience inside your editor, pair the runtime with an open-source extension like Continue.dev, which works in VS Code, JetBrains and Neovim, points at your local Ollama endpoint, and gives you inline autocomplete, a chat sidebar and codebase-aware context — with no API keys and no data leaving your network. For a full walkthrough, see our companion how-to on running an LLM locally.

Are local coding LLMs free to use commercially?

Usually, but you must check the specific license. Qwen3-Coder-30B-A3B, Qwen3.6-27B and Devstral Small ship under Apache 2.0, and DeepSeek-V3.2 under MIT — all of which permit commercial use, fine-tuning and shipping in products with no royalties. The traps are model-specific. Codestral is the clearest example: its original weights carried the restrictive Mistral Non-Production License, which blocked commercial in-product use, but Mistral relicensed the open-weight Codestral 2 to Apache 2.0 in April 2026 — so you must confirm which checkpoint you are pulling. DeepSeek-Coder-V2-Lite uses the DeepSeek Model License rather than a pure open grant; it does permit commercial use, but read the terms. The safe practice is to verify the license on the official model card before you ship anything built on a self-hosted model.

What is the difference between an agentic coding model and an autocomplete model?

An autocomplete model, like Codestral, is optimized for fill-in-the-middle completion: it predicts the code that belongs at your cursor, fast and at low latency, for the Tab-key experience inside an IDE. An agentic coding model, like Devstral Small or Qwen3-Coder, is tuned to operate across a codebase over multiple turns — reading an issue, planning a fix, editing several files, running tests and iterating inside an agent scaffold such as OpenHands. The benchmark that captures agentic ability is SWE-bench Verified, which measures whether a model can resolve real GitHub issues, not just generate a correct snippet. In 2026 the useful work has shifted toward agentic tasks, but the two are complementary: many developers run an autocomplete model for inline completions and an agentic model for larger, multi-file changes.

Is a general offline AI assistant like AirgapAI a coding tool?

No — and the distinction matters when you are shopping. Tools like Iternal's AirgapAI run open-weight models such as Llama 3.2, Gemma and Qwen entirely offline on a laptop, which sounds adjacent to local coding, but they are positioned and built as general business assistants for broad workforce use across functions like legal, HR, finance and operations, not as developer coding models. For writing and debugging code you want a coding-specialized model — Qwen3-Coder, DeepSeek-Coder, Codestral or Devstral — paired with a runtime like Ollama and an editor extension. An offline business assistant is the right tool when a regulated organization wants AI for general staff productivity with no cloud egress; it is not a substitute for a dedicated local coding model in your IDE.