Frontier Models

DeepSeek Releases DeepSpec Toolkit and Draft Models on Hugging Face

The open-source release equips Qwen3 and Gemma 4 users with pre-trained draft models and training scripts that implement speculative decoding to accelerate inference serving while preserving output quality.

By Diane Okafor | June 29, 2026 | 6 min read

A technician inspects racks of GPU servers in a data center. Illustration: AI Intel Report

DeepSpec is a full-stack codebase for training and evaluating draft models for speculative decoding.

DeepSeek has uploaded a series of draft models to Hugging Face as part of the DeepSpec collection, targeting users of the Qwen3 and Gemma 4 model families. These models are specifically engineered to support speculative decoding, a technique that pairs a smaller draft model with a larger target model to generate tokens more efficiently. The small model proposes sequences, and the target verifies them in one pass, accepting correct predictions and thereby increasing throughput without loss of accuracy. This approach is particularly valuable for enterprise deployments where serving latency directly impacts user experience and operational costs.

What background information explains the need for DeepSpec?

The development of DeepSpec stems from ongoing research into accelerating inference for large language models. Traditional autoregressive generation predicts one token at a time, and each token requires a full forward pass of the model, which makes long outputs slow. Speculative decoding mitigates this by exploiting the fact that a smaller model can predict many of those tokens correctly. DeepSpec builds on prior work in this area and supports the DSpark, DFlash, and Eagle3 variants, which are reported to improve accepted token lengths. The GitHub repository was updated on June 28, 2026, indicating recent refinements to the full-stack solution, which includes data preparation pipelines, training configurations, and evaluation metrics tailored to the Qwen3 and Gemma 4 targets.

Enterprise AI teams have long sought methods to reduce the computational overhead associated with running large models in production. With the proliferation of open-source models on platforms like Hugging Face, the ability to optimize inference has become a competitive differentiator. DeepSpec provides the tools necessary to train custom draft models or use the provided checkpoints, lowering the entry barrier for organizations looking to enhance their serving infrastructure. The MIT license ensures that the codebase can be freely used and modified, fostering community contributions and further innovation in the field.

What specific models and checkpoints are included in the release?

The collection on Hugging Face features several draft model checkpoints, including dspark_qwen3_4b_block7 for the 4 billion parameter Qwen3 model, along with dflash and eagle3 variants at the same size. Similar models exist for the 8B and 14B Qwen3 sizes, as well as for the 12B Gemma 4 model. Each checkpoint is optimized for a particular block or layer configuration, such as block7 or ttt7, to maximize compatibility with the target model's architecture. These ready-to-use models eliminate the need for extensive training on the part of the end user, enabling immediate experimentation with speculative decoding setups.

The naming convention for these models reflects their intended use case, with prefixes indicating the algorithm variant and suffixes denoting the target model and configuration details. Users can download these models directly from the collection page, which was updated recently to include the latest checkpoints. This accessibility supports rapid prototyping and integration into existing inference pipelines that utilize the Hugging Face ecosystem.
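As an illustration of that naming convention, the sketch below splits a checkpoint name into its parts. The layout `<algorithm>_<target family>_<size>[_<config>]` is an assumption inferred from the example names in this article, not a format documented by DeepSpec, and the `DraftCheckpoint` class and `parse_checkpoint_name` function are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DraftCheckpoint:
    algorithm: str         # e.g. "dspark"
    target_family: str     # e.g. "qwen3"
    target_size: str       # e.g. "4b"
    config: Optional[str]  # e.g. "block7"; None when the name has no suffix


def parse_checkpoint_name(name: str) -> DraftCheckpoint:
    """Split a name such as 'dspark_qwen3_4b_block7' into its parts.

    Assumed layout (inferred from the article's examples only):
    <algorithm>_<target family>_<size>[_<config>]
    """
    parts = name.split("_")
    if len(parts) not in (3, 4):
        raise ValueError(f"unexpected checkpoint name: {name!r}")
    algorithm, family, size = parts[:3]
    config = parts[3] if len(parts) == 4 else None
    return DraftCheckpoint(algorithm, family, size, config)


if __name__ == "__main__":
    for name in ("dspark_qwen3_4b_block7", "eagle3_qwen3_4b_ttt7", "dspark_gemma4_12b"):
        print(parse_checkpoint_name(name))
```

A parser like this can help filter a downloaded collection by target family or algorithm, but names that do not follow the assumed layout are rejected rather than guessed at.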

How does the speculative decoding process function in practice?

Speculative decoding has the draft model generate a short sequence of candidate tokens autoregressively, which is computationally inexpensive because the model is small. The target model then processes all of the candidates in parallel in a single forward pass and checks each one against the token it would have produced itself. Candidates are accepted up to the first mismatch. At that position the target model's own token is used instead, the remaining candidates are discarded, and drafting resumes from there. As a result, a single target model call can yield several tokens, which raises the tokens-per-second rate.
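The draft-then-verify loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DeepSpec's implementation: both models are stand-in functions that return the next token for a context, decoding is greedy, and the target is called once per position for readability, whereas a real target scores all positions in one forward pass. The names `speculative_step`, `generate`, `draft_model`, and `target_model` are hypothetical.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Return the tokens committed by one draft-then-verify step (greedy)."""
    # 1. Draft k candidate tokens autoregressively with the cheap model.
    candidates: List[int] = []
    for _ in range(k):
        candidates.append(draft(prefix + candidates))

    # 2. Verify each candidate against the target's own greedy choice.
    committed: List[int] = []
    for token in candidates:
        expected = target(prefix + committed)
        if token != expected:
            committed.append(expected)  # first mismatch: keep the target's token
            return committed
        committed.append(token)

    # 3. Every candidate matched, so the target contributes one extra token.
    committed.append(target(prefix + committed))
    return committed


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    """Repeat speculative steps until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + max_new_tokens]


# Toy models: the target counts upward modulo 10; the draft agrees with it
# except after a 4, where it guesses wrong.
def target_model(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], draft_model, target_model, k=3, max_new_tokens=12))
```

Because every committed token is either confirmed or supplied by the target, the output of this greedy variant is identical to what the target would generate alone; the draft only changes how many target calls are needed.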

The effectiveness of this technique depends on the quality of the draft model and its alignment with the target model's distribution. DeepSpec addresses this through specialized training procedures that align the draft models closely with their targets. The provided evaluation scripts allow users to measure acceptance rates and adjust parameters accordingly. In benchmarks, the DSpark variant has shown superior performance in terms of accepted token length compared to earlier methods.
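To see why draft quality matters, a rough calculation helps. The sketch below uses a common simplification from the speculative decoding literature: if each drafted token is accepted independently with probability `alpha` and the draft proposes `k` tokens per step, the expected number of tokens produced per target pass is (1 - alpha^(k+1)) / (1 - alpha). The independence assumption is an idealization, and the function name is hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per target pass under an i.i.d. acceptance model.

    alpha: assumed probability that any single drafted token is accepted.
    k:     number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus one from the target
    # Closed form of the geometric sum 1 + alpha + alpha**2 + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.8, 0.9):
        print(alpha, round(expected_tokens_per_step(alpha, k=4), 3))
```

Under this model, raising the acceptance probability from 0.5 to 0.8 with four drafted tokens moves the expectation from about 1.94 to about 3.36 tokens per target pass, which is why closer draft-to-target alignment pays off.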

What are the technical components of the DeepSpec codebase?

The DeepSpec repository contains comprehensive scripts for every stage of the draft model lifecycle. Data preparation scripts process datasets to create training examples suitable for speculative decoding tasks. Training configurations are provided for the DSpark, DFlash, and Eagle3 algorithms, with hyperparameters tuned for the supported model families. Evaluation tools compute metrics such as acceptance rate and speedup factors, enabling rigorous assessment of model performance. The entire stack is documented in the README, guiding users through the workflow from initial setup to deployment.
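The metrics named above can be computed from simple per-step logs. The following sketch is not taken from the DeepSpec evaluation scripts; it assumes a hypothetical log that records how many drafted tokens were accepted at each step, defines acceptance rate as accepted tokens divided by drafted tokens, and treats speedup as a ratio of measured throughputs. Other definitions of acceptance rate are in use, so the toolkit's own numbers may be computed differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SpecDecodeMetrics:
    acceptance_rate: float              # accepted draft tokens / drafted tokens
    mean_tokens_per_target_pass: float  # accepted tokens plus one target token per step


def summarize_steps(accepted_per_step: Sequence[int], draft_length: int) -> SpecDecodeMetrics:
    """Summarize a run from the number of draft tokens accepted at each step."""
    if draft_length <= 0:
        raise ValueError("draft_length must be positive")
    if not accepted_per_step:
        raise ValueError("need at least one step")
    if any(a < 0 or a > draft_length for a in accepted_per_step):
        raise ValueError("accepted counts must lie in [0, draft_length]")
    steps = len(accepted_per_step)
    total_accepted = sum(accepted_per_step)
    return SpecDecodeMetrics(
        acceptance_rate=total_accepted / (steps * draft_length),
        # Each step also yields one token chosen by the target itself.
        mean_tokens_per_target_pass=(total_accepted + steps) / steps,
    )


def speedup(baseline_tokens_per_s: float, speculative_tokens_per_s: float) -> float:
    """Throughput ratio of speculative decoding over plain decoding."""
    if baseline_tokens_per_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return speculative_tokens_per_s / baseline_tokens_per_s


if __name__ == "__main__":
    print(summarize_steps([4, 0, 2, 4], draft_length=4))
    print(speedup(50.0, 125.0))
```

Measured speedup is usually lower than the mean tokens per target pass would suggest, because the time spent running the draft model is not free.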

Integration with existing frameworks is facilitated by the use of standard formats for model checkpoints, ensuring compatibility with popular inference engines. The MIT license permits commercial use and modification, which is critical for enterprise adoption. Developers can extend the codebase to support new model architectures or refine the algorithms for specific use cases.

1. Clone the DeepSpec repository from GitHub and install dependencies.
2. Select a target model, such as Qwen3 8B, and a corresponding draft model size.
3. Run the data preparation scripts on relevant corpora to generate training data.
4. Execute the training script with the chosen algorithm configuration.
5. Evaluate the trained draft model using the provided metrics scripts.
6. Deploy the draft and target model pair in a serving framework for inference.

What are the implications for market stakeholders and enterprise users?

The open-sourcing of DeepSpec has broad implications for the AI industry, particularly for companies relying on Qwen3 and Gemma 4 models in their applications. By reducing inference times, organizations can handle higher query volumes with the same hardware, leading to cost savings and improved scalability. This is especially relevant in sectors such as customer service, content generation, and real-time analytics where response speed is paramount. The availability of pre-trained models means that smaller teams without dedicated research resources can still benefit from advanced techniques.

For model providers and platform operators, this release may accelerate the adoption of speculative decoding as a standard practice. It also positions DeepSeek as a contributor to the open-source ecosystem, potentially attracting talent and partnerships. Stakeholders in the supply chain for AI infrastructure, including cloud providers and hardware vendors, may see increased demand for optimized serving solutions that incorporate these methods.

How have experts and the community responded to this development?

Initial reactions from the AI community have highlighted the practical value of having ready draft models available alongside the training toolkit. This combination allows both researchers and practitioners to quickly test and deploy the technology. The detailed documentation and supported algorithms lower the learning curve for those new to speculative decoding.

"DeepSpec is a full-stack codebase for training and evaluating draft models for speculative decoding."
DeepSeek, via GitHub README

What can be expected in the future for this technology?

Looking ahead, the open-source nature of DeepSpec is likely to spur further research and improvements in draft model design. Community contributions could lead to better alignment techniques, support for additional model families, and integration with emerging inference optimizations. As more users adopt these tools, feedback will drive refinements to the codebase and potentially new variants that achieve even higher efficiency gains.

The release also sets a precedent for other organizations to share their internal tools for inference acceleration, contributing to a more collaborative environment in AI development. This could ultimately lead to faster progress in making large models more accessible and practical for a wider array of applications.

Draft Model Variants and Their Targets
Variant Name	Target Model	Size	Algorithm Type
dspark_qwen3_4b_block7	Qwen3	4B	DSpark
dflash_qwen3_4b_block7	Qwen3	4B	DFlash
eagle3_qwen3_4b_ttt7	Qwen3	4B	Eagle3
dspark_gemma4_12b	Gemma 4	12B	DSpark

Frequently asked questions

What is speculative decoding?

Speculative decoding uses a small draft model to propose several tokens that a larger target model verifies together in a single forward pass, so more than one token can be accepted per target model call while output quality is preserved.

How can users access the DeepSpec models?

The draft models are available on Hugging Face in the DeepSpec collection, and the full training and evaluation toolkit is on GitHub under the MIT license.