Sunday, June 14, 2026

Today’s Edition

AI Intel Report

MARKETS

Enterprise AI

Private LLM & Self-Hosted AI: Running ChatGPT-Class Models On Your Own Infrastructure (2026)

A private LLM runs inside infrastructure you control so prompts and documents never reach a third party. Here is what that means in 2026, which open-weight models and runtimes to use, and when self-hosting actually pays off.

10 MIN READ
A single GPU server rack glowing in a locked on-premises data closet, cables neatly bundled, a closed steel door in the foreground suggesting compute kept inside the building.
Illustration: AI Intel Report
In short

A private LLM is a large language model run inside infrastructure you control — your own GPUs, a single-tenant private cloud, or an air-gapped network — so prompts, documents, and outputs never reach a third party. Self-hosting it means you own the deployment, the data boundary, and the cost curve.

Three years of public chatbots taught enterprises a hard lesson: every prompt and document sent to a hosted model flows through someone else's servers. For a marketing draft that is fine. For source code, patient records, deal data, or classified material, it can be a leak or a compliance violation. The architectural answer is a private, self-hosted LLM — the same class of model you would reach through a public API, but running on hardware you control. In 2026 that is no longer an exotic choice. Open-weight models have caught up, the runtimes to serve them are mature, and the only real question left is whether the economics work for your volume.

What is a private, self-hosted LLM?

A private LLM is any deployment where the model weights and the inference run inside infrastructure your organization controls, rather than a shared multi-tenant service reached over the public internet. Self-hosting is the mechanism: you download open-weight model files — the actual parameters, published by labs such as Meta, Alibaba, or DeepSeek — and run them on your own GPUs or servers. The opposite is the public, API-delivered model, where you send a request and the provider's model on the provider's infrastructure returns a response. Privacy here is not a setting you toggle; it is a property of the architecture, decided by where the weights run and where the data goes. This is why self-hosting and private deployment are usually the same conversation.

Demand is real, not theoretical. Enterprise adoption of large language models climbed from under 5% in 2023 to over 80% by 2026, and open-weight models such as Llama now feature prominently in those rollouts, frequently deployed alongside proprietary ones. Security and data-privacy compliance is the single most-cited factor in choosing how to deploy — ranked the top consideration by 31% of enterprises in that same analysis — which is precisely the constraint that pushes regulated organizations toward keeping the model in-house.

How private is private? The deployment spectrum

"Private" is a spectrum of increasing isolation, with control rising and convenience falling at each step. Where you land depends on your data-residency, offline, and audit requirements.

The private LLM deployment spectrum, from local workstation to fully air-gapped
DeploymentWhat it meansControl level
Local / on-deviceA model running on one workstation, laptop, or developer GPUHigh (single user)
Private / sovereign cloudA single-tenant or region-locked cloud the provider isolates for youModerate
On-premisesModels on servers in your own data center, behind your firewallHigh
Air-gappedAn isolated network with no internet egress at all — nothing can leaveMaximum

Each step up removes a surface through which data could escape. A private cloud keeps a vendor in the loop but contractually isolates your workload; on-premises removes the public cloud; an air-gapped deployment removes the network itself, which is why it is the standard for classified, defense, and the most sensitive regulated environments. This page is a spoke in our wider on-premise AI coverage — the pillar maps the full stack, while this guide focuses on the model layer.

Which models can you actually self-host in 2026?

The decisive shift of the last two years is that open-weight models now rival, and on several public benchmarks beat, proprietary frontier systems for everyday enterprise tasks. The catch for a private deployment is that license and hardware footprint matter as much as raw benchmark scores. The table below summarizes the leading permissively licensed options and what it takes to run them, drawn from published model cards and 2026 deployment analyses.

Leading self-hostable open-weight models in 2026: license, size, and minimum hardware
ModelTotal / active paramsLicenseContextMin hardware (quantized)
Qwen3-32B32.8B denseApache 2.0~131K~33 GB VRAM — 1× H100
Llama 4 Scout109B / 17B (MoE)Llama 4 Communityup to 10M~55–65 GB — 1× H100
Llama 4 Maverick400B / 17B (MoE)Llama 4 Community1M~200–243 GB — 4× H100
Qwen3-235B-A22B235B / 22B (MoE)Apache 2.0128K~235 GB — 8× H100
DeepSeek V3.2~685B / 37B (MoE)MIT128K~640 GB — 8× H100

Figures above are sourced from a 2026 cross-model cost-and-hardware comparison and corroborated by community model roundups. The license column is the part teams most often skip and most regret: Apache 2.0 (Qwen3) and MIT (DeepSeek) impose the fewest restrictions, while Meta's Llama 4 license adds conditions for very large applications. For a regulated buyer, a permissive license is a feature.

What runtime serves a private LLM?

Downloading weights is the easy part; serving them to real users is where deployments succeed or fail. The runtime layer has cleanly split by workload. Ollama and LM Studio are experience layers — the fastest way for one developer to pull a model and start chatting, with an OpenAI-compatible local endpoint. llama.cpp is the low-level engine for edge, embedded, and unusual hardware. vLLM is the production serving system: its PagedAttention and continuous batching are built for many concurrent users on data-center GPUs. The difference at scale is large — a widely cited 2026 benchmark found vLLM reaching roughly 793 tokens per second at peak concurrency versus about 41 for Ollama on the same hardware. The lesson: prototype with Ollama, serve a team with vLLM.

When does self-hosting beat a public API?

Self-hosting swaps a per-token meter for fixed capacity, so the economics turn entirely on volume and utilization. A public API is cheap to start and scales linearly; a private deployment carries upfront hardware or reserved-capacity cost but no per-request charge. The crossover where self-hosting becomes cheaper sits, by 2026 estimates, in the range of tens of millions of tokens per day, depending on model size and how fully you keep the GPUs busy. For context, rented H100-class capacity ran around $2.50 per hour on-demand in mid-2026, so an idle GPU is pure waste. Below the crossover, bursty and experimental workloads favor the API. Above it — and for any always-on, high-volume, or strictly regulated workload — a well-utilized private deployment wins on both cost and control. Model your own read and write patterns before committing; the most expensive private LLM is the one running at 10% utilization.

The honest tradeoffs in 2026

Private and self-hosted LLMs are not free wins. You inherit the operational burden the cloud provider used to carry: capacity planning, model updates, quantization choices, security hardening, and the on-call rotation when inference falls over. You also inherit the hardest part of any AI system — the data layer. A self-hosted model is only as good as the documents you retrieve over, and poorly governed source data silently destroys retrieval accuracy regardless of which model you chose. Frameworks like the NIST AI Risk Management Framework are far easier to satisfy when the system lives inside your own boundary, and data-residency rules such as the EU's GDPR often make a private deployment the only compliant option. But the discipline that makes a private LLM worth it — clean, governed, well-retrieved data — is exactly the discipline most teams underestimate. Get that right and a private, self-hosted model is competitive with anything you could rent; get it wrong and no amount of GPU fixes it.

Frequently asked

What is a private LLM in simple terms?

A private LLM is a large language model deployed inside infrastructure your organization controls, so that prompts, documents, and model outputs never leave your trust boundary. Instead of sending a request to a shared public endpoint like a hosted ChatGPT API, you download open-weight model files and run inference on your own GPUs, in a single-tenant private cloud, or on a fully air-gapped network. The defining test is control: where the weights physically run, where the data goes, and whether any third party could see either. If only your organization can, it is a private LLM. The trade is that you take on hosting, securing, updating, and scaling the system yourself, rather than renting that work from a provider.

Is a self-hosted LLM the same as a private LLM?

They overlap heavily but are not identical. Self-hosting describes how you run the model — you operate it on your own hardware or rented servers rather than calling a managed API. Private describes the goal: keeping data and inference under your control. Most self-hosted deployments are private, but you can also achieve a degree of privacy in a single-tenant or sovereign cloud you do not physically own, and you can technically self-host a model on a box that is poorly isolated and therefore not very private. In practice the terms are used interchangeably for the same architecture: open-weight model, your infrastructure, your data boundary. The strongest form is a self-hosted model on an air-gapped network with no internet egress at all.

Which open-weight models can I self-host in 2026?

The open-weight field has caught up to proprietary frontier systems on most enterprise tasks. The permissively licensed choices in 2026 include Alibaba's Qwen3 family (Apache 2.0), DeepSeek's V3.2 (MIT), and Meta's Llama 4 Scout and Maverick (the Llama 4 Community License, which adds conditions above 700 million monthly active users). Smaller models such as Gemma and Phi-class systems run on a single consumer GPU, while the largest mixture-of-experts models need multiple data-center GPUs. For a regulated deployment, license terms matter as much as benchmarks: Apache 2.0 and MIT impose the fewest restrictions, which is why they dominate commercial on-prem stacks. Always read the actual license before shipping a product on top of a model.

How much hardware do I need to run a private LLM?

It depends entirely on model size and quantization. A 32-billion-parameter dense model like Qwen3-32B needs roughly 33 GB of VRAM quantized — a single H100-class GPU. Mid-size mixture-of-experts models such as Llama 4 Scout fit on one 80 GB GPU at INT4, while the largest open models like DeepSeek V3.2 or Qwen3-235B require around eight H100s. On the small end, models in the 14-billion-parameter range run on a 16 GB consumer card, and quantized sub-30B models run on Apple Silicon or a single workstation GPU. For a team, the practical question is not whether you can run a model but whether you can serve it to many concurrent users, which is where a production runtime and more GPUs matter.

When is self-hosting an LLM cheaper than a cloud API?

Self-hosting trades a per-token meter for fixed capacity, so the math turns on volume. Public APIs are cheap to start and scale linearly with usage; private deployments carry upfront hardware or reserved-capacity cost but no per-request charge. Analyses in 2026 place the crossover roughly in the range of tens of millions of tokens per day, depending on model size, GPU utilization, and your input-output ratio. Below that, bursty or experimental workloads usually favor a public API. Above it — and especially for always-on, high-volume, or strictly regulated workloads — a well-utilized private deployment can cost far less per response. The mistake is buying GPUs that sit idle; underused private hardware is more expensive than the API it replaced.

Are private LLMs as capable as ChatGPT or other public models?

For most enterprise workloads, yes. The capability gap between the best open-weight models and proprietary frontier systems has narrowed sharply, and open models now lead several public coding, math, and long-context benchmarks. For everyday business tasks — summarization, retrieval-augmented question answering, drafting, classification, and extraction — a well-deployed open model over clean, governed data performs competitively. The very largest proprietary systems may still lead on the hardest frontier reasoning tasks, but that edge rarely decides a typical deployment. In practice the limiter is not the model: it is the quality of your retrieval data and how well your infrastructure is tuned. Garbage data produces garbage answers no matter how strong the underlying model is.