# DeepSeek Releases DeepSpec Toolkit and Draft Models on Hugging Face

> The open-source release equips Qwen3 and Gemma 4 users with pre-trained draft models and training scripts that implement speculative decoding to accelerate inference serving while preserving output quality.

*Published 2026-06-29 · By Diane Okafor*

DeepSpec is a full-stack codebase for training and evaluating draft models for speculative decoding.

DeepSeek has uploaded a series of draft models to Hugging Face as part of the DeepSpec collection, targeting users of the Qwen3 and Gemma 4 model families. The models are built for speculative decoding, a technique that pairs a small draft model with a larger target model to generate tokens more efficiently. The draft model proposes a short run of tokens, and the target verifies them in one pass, keeping the predictions that match its own output, which raises throughput without changing output quality. This matters most in enterprise deployments, where serving latency directly affects user experience and operating costs.

## What background information explains the need for DeepSpec?

DeepSpec grows out of ongoing research into faster inference for large language models. Standard autoregressive generation predicts one token at a time, which is slow for long outputs. Speculative decoding mitigates this by exploiting the fact that a smaller model can predict many tokens correctly. DeepSpec builds on prior work in this area and ships configurations for three algorithms, DSpark, DFlash, and Eagle3, with DSpark reported to improve accepted token length. The GitHub repository was updated on June 28, 2026, indicating recent refinements to the full stack, which includes data preparation pipelines, training configurations, and evaluation metrics tailored to the Qwen3 and Gemma 4 targets.

Enterprise AI teams have long sought methods to reduce the computational overhead associated with running large models in production. With the proliferation of open-source models on platforms like Hugging Face, the ability to optimize inference has become a competitive differentiator. DeepSpec provides the tools necessary to train custom draft models or use the provided checkpoints, lowering the entry barrier for organizations looking to enhance their serving infrastructure. The MIT license ensures that the codebase can be freely used and modified, fostering community contributions and further innovation in the field.

## What specific models and checkpoints are included in the release?

The collection on Hugging Face features several draft model checkpoints, including dspark_qwen3_4b_block7 for the 4-billion-parameter Qwen3 model, along with dflash and eagle3 variants at the same size. Similar checkpoints exist for the 8B and 14B Qwen3 sizes and for the 12B Gemma 4 model. Each checkpoint is tied to a particular block or layer configuration, indicated by a tag such as block7 or ttt7 in its name. Because the models are ready to use, users can experiment with speculative decoding without first training a draft model themselves.

The naming convention for these models reflects their intended use case, with prefixes indicating the algorithm variant and suffixes denoting the target model and configuration details. Users can download these models directly from the collection page, which was updated recently to include the latest checkpoints. This accessibility supports rapid prototyping and integration into existing inference pipelines that utilize the Hugging Face ecosystem.
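The naming pattern can be illustrated with a small parser. This is a minimal sketch that assumes every checkpoint name follows the form `algorithm_family_size[_config]`, which matches the names listed in this article but is not a documented DeepSpec rule. The `DraftName` class and `parse_draft_name` function are hypothetical helpers, not part of the toolkit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DraftName:
    algorithm: str         # e.g. "dspark", "dflash", "eagle3"
    target_family: str     # e.g. "qwen3", "gemma4"
    size: str              # e.g. "4b", "12b"
    config: Optional[str]  # e.g. "block7", "ttt7", or None when absent


def parse_draft_name(name: str) -> DraftName:
    """Split a checkpoint name of the ASSUMED form algo_family_size[_config]."""
    parts = name.split("_")
    if len(parts) not in (3, 4):
        raise ValueError(f"unexpected checkpoint name: {name!r}")
    config = parts[3] if len(parts) == 4 else None
    return DraftName(parts[0], parts[1], parts[2], config)


if __name__ == "__main__":
    print(parse_draft_name("dspark_qwen3_4b_block7"))
    print(parse_draft_name("dspark_gemma4_12b"))
```

A helper like this could be used to filter a list of checkpoint names by target family or algorithm before downloading anything.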

## How does the speculative decoding process function in practice?

In speculative decoding, the draft model first generates a short sequence of candidate tokens autoregressively, which is cheap because the model is small. The target model then scores all of the candidates in parallel in a single forward pass and checks each one against the token it would have produced itself. Matching tokens are accepted in order. At the first mismatch, the remaining candidates are discarded, the target's own token takes that position, and drafting resumes from there. Each target call can therefore yield several tokens, which raises the tokens-per-second rate.
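The draft-then-verify loop can be sketched with toy models. This is an illustrative sketch of the generic greedy form of speculative decoding, not DeepSpec's implementation. The "models" here are plain functions that map a token list to the next token, and `toy_target`, `toy_draft`, `speculative_step`, and `generate` are all hypothetical names. A real engine verifies all positions in one batched forward pass, whereas this sketch calls the target once per position for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def toy_target(seq: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Toy draft model: agrees with the target except after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Return the tokens committed by one draft-then-verify round."""
    # 1. The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position in order.
    committed: List[int] = []
    for tok in proposed:
        expected = target(prefix + committed)
        if tok != expected:
            # First mismatch: keep the target's token, drop the rest.
            committed.append(expected)
            return committed
        committed.append(tok)

    # 3. All k accepted: the target's pass also yields one extra token.
    committed.append(target(prefix + committed))
    return committed


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Run rounds until max_new_tokens tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new_tokens]


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], 12))
```

Because every committed token is either confirmed or supplied by the target, the greedy output is identical to what the target would produce on its own. Only the number of target calls changes.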

The effectiveness of this technique depends on how well the draft model matches the target model's output distribution. DeepSpec addresses this with training procedures that align each draft model with its target, and the provided evaluation scripts let users measure acceptance rates and adjust parameters accordingly. In reported benchmarks on Qwen3 models, DSpark raises accepted token length by 26–31% over Eagle3 and by 16–18% over DFlash (see source 3 below).
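To make the metrics concrete, the sketch below summarizes a log of verification rounds. It assumes a simple definition in which each round drafts `k` tokens and records how many the target accepted. DeepSpec's own evaluation scripts may define or name these quantities differently, and `acceptance_stats` is a hypothetical helper.

```python
from typing import Dict, List


def acceptance_stats(accepted_per_round: List[int], k: int) -> Dict[str, float]:
    """Summarize verification rounds.

    accepted_per_round[i] is how many of the k drafted tokens the target
    accepted in round i, so each entry must lie in [0, k].
    """
    if not accepted_per_round:
        raise ValueError("need at least one round")
    if any(a < 0 or a > k for a in accepted_per_round):
        raise ValueError("accepted counts must lie in [0, k]")
    rounds = len(accepted_per_round)
    total_accepted = sum(accepted_per_round)
    return {
        # Share of drafted tokens that survived verification.
        "acceptance_rate": total_accepted / (rounds * k),
        # +1 per round: the target always contributes one token itself.
        "tokens_per_target_call": (total_accepted + rounds) / rounds,
    }


if __name__ == "__main__":
    print(acceptance_stats([4, 1, 3, 0], k=4))
```

With the example log `[4, 1, 3, 0]` and `k=4`, half of the drafted tokens are accepted and each target call commits three tokens on average.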

## What are the technical components of the DeepSpec codebase?

The DeepSpec repository contains comprehensive scripts for every stage of the draft model lifecycle. Data preparation scripts process datasets to create training examples suitable for speculative decoding tasks. Training configurations are provided for the DSpark, DFlash, and Eagle3 algorithms, with hyperparameters tuned for the supported model families. Evaluation tools compute metrics such as acceptance rate and speedup factors, enabling rigorous assessment of model performance. The entire stack is documented in the README, guiding users through the workflow from initial setup to deployment.
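A speedup factor can be estimated from the tokens committed per target call with a simple cost model. The sketch below is a back-of-the-envelope approximation, not the formula used by DeepSpec's evaluation tools. It assumes the draft runs `k` sequential passes per round, that each draft pass costs a fixed fraction of a target pass, and that all other overhead is ignored. `estimated_speedup` and its parameters are hypothetical.

```python
def estimated_speedup(tokens_per_target_call: float, k: int,
                      draft_cost_ratio: float) -> float:
    """Rough speedup over plain autoregressive decoding.

    One round costs one target pass plus k draft passes, each priced at
    draft_cost_ratio target passes. Plain decoding yields one token per
    target pass, so the speedup is tokens per round over cost per round.
    """
    if tokens_per_target_call < 1 or k < 1 or draft_cost_ratio < 0:
        raise ValueError("invalid inputs")
    return tokens_per_target_call / (1.0 + k * draft_cost_ratio)


if __name__ == "__main__":
    # 3 tokens per target call, 4 drafted tokens, draft pass = 1/8 target pass
    print(estimated_speedup(3.0, k=4, draft_cost_ratio=0.125))
```

The model shows why both numbers matter: a draft that is accepted often but is expensive to run can still deliver little net gain.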

Integration with existing frameworks is facilitated by the use of standard formats for model checkpoints, ensuring compatibility with popular inference engines. The MIT license permits commercial use and modification, which is critical for enterprise adoption. Developers can extend the codebase to support new model architectures or refine the algorithms for specific use cases.

A typical workflow runs through the following steps:

1. Clone the DeepSpec repository from GitHub and install dependencies.
2. Select a target model, such as Qwen3 8B, and a corresponding draft model size.
3. Run the data preparation scripts on relevant corpora to generate training data.
4. Execute the training script with the chosen algorithm configuration.
5. Evaluate the trained draft model using the provided metrics scripts.
6. Deploy the draft and target models together in a serving framework for inference.

## What are the implications for market stakeholders and enterprise users?

The open-sourcing of DeepSpec has broad implications for the AI industry, particularly for companies relying on Qwen3 and Gemma 4 models in their applications. By reducing inference times, organizations can handle higher query volumes with the same hardware, leading to cost savings and improved scalability. This is especially relevant in sectors such as customer service, content generation, and real-time analytics where response speed is paramount. The availability of pre-trained models means that smaller teams without dedicated research resources can still benefit from advanced techniques.

For model providers and platform operators, this release may accelerate the adoption of speculative decoding as a standard practice. It also positions DeepSeek as a contributor to the open-source ecosystem, potentially attracting talent and partnerships. Stakeholders in the supply chain for AI infrastructure, including cloud providers and hardware vendors, may see increased demand for optimized serving solutions that incorporate these methods.

## How have experts and the community responded to this development?

Initial reactions from the AI community have highlighted the practical value of having ready draft models available alongside the training toolkit. This combination allows both researchers and practitioners to quickly test and deploy the technology. The detailed documentation and supported algorithms lower the learning curve for those new to speculative decoding.

> "DeepSpec is a full-stack codebase for training and evaluating draft models for speculative decoding."
>
> DeepSeek, via GitHub README

## What can be expected in the future for this technology?

Looking ahead, the open-source nature of DeepSpec is likely to spur further research and improvements in draft model design. Community contributions could lead to better alignment techniques, support for additional model families, and integration with emerging inference optimizations. As more users adopt these tools, feedback will drive refinements to the codebase and potentially new variants that achieve even higher efficiency gains.

The release also sets a precedent for other organizations to share their internal tools for inference acceleration, contributing to a more collaborative environment in AI development. This could ultimately lead to faster progress in making large models more accessible and practical for a wider array of applications.

Draft Model Variants and Their Targets

| Variant Name | Target Model | Size | Algorithm Type |
| --- | --- | --- | --- |
| dspark_qwen3_4b_block7 | Qwen3 | 4B | DSpark |
| dflash_qwen3_4b_block7 | Qwen3 | 4B | DFlash |
| eagle3_qwen3_4b_ttt7 | Qwen3 | 4B | Eagle3 |
| dspark_gemma4_12b | Gemma 4 | 12B | DSpark |

## Sources

1. [Lists multiple draft models: deepseek-ai/dspark_qwen3_4b_block7 (1B), dflash variants, eagle3 variants for Qwen3 and Gemma4; updated about 11 hours ago.](https://huggingface.co/collections/deepseek-ai/deepspec)
2. [Full-stack codebase with configs for DSpark/DFlash/Eagle3 targeting Qwen3 and Gemma4; released checkpoints table; MIT license; README details workflow and supported algorithms.](https://github.com/deepseek-ai/DeepSpec)
3. [DSpark raises accepted token length by 26–31% over Eagle3 and 16–18% over DFlash on Qwen3 models](https://www.marktechpost.com/2026/06/27/deepseek-releases-dspark-a-speculative-decoding-framework-that-accelerates-deepseek-v4-per-user-generation-60-85-over-mtp-1/)

---
Source: https://aiintelreport.com/frontier-models/deepseek-deepspec-draft-models-hugging-face