
DeepSeek Open-Sources DSpark Speculative Decoding for V4 Flash and Pro

The initiative supplies full-stack training code and attaches the module to existing checkpoints while extending support to Gemma and Qwen families for measurable inference gains.

By Marcus Vance, June 27, 2026

[Illustration: engineers attaching an accelerator module to a compute board in a data center. Credit: AI Intel Report]

DSpark is a draft-model algorithm for speculative decoding. DeepSeek has open-sourced it as part of the DeepSpec repository to boost inference performance.

DeepSeek has open-sourced the DSpark speculative decoding framework, making an advanced inference optimization available to outside developers. In speculative decoding, a small draft model proposes several tokens ahead and the larger target model verifies them together, so accepted tokens cost less than generating each one with the target alone. The release includes the DeepSpec GitHub repository, which contains full-stack code for training and evaluating draft models. Alongside it, DeepSeek attached the speculative decoding module to the DeepSeek-V4-Flash checkpoint on Hugging Face, producing the DeepSeek-V4-Flash-DSpark variant. The approach is designed to raise throughput without training new models from scratch. Because the code is open, developers and researchers can experiment with the technique and extend it to other model families. Releasing both the code and the integrated checkpoint points to an emphasis on practical tools for deploying frontier models.

What background information is available on the DeepSeek-V4 series?

The DeepSeek-V4 series was introduced in a recent arXiv paper as a preview of highly efficient models capable of handling million-token contexts. The series includes DeepSeek-V4-Pro, which has 1.6T total parameters with 49B activated, and DeepSeek-V4-Flash, which has 284B total parameters with 13B activated. Both models use a Mixture-of-Experts (MoE) architecture and support a context length of one million tokens through a hybrid attention mechanism, which combines different attention mechanisms to handle both short- and long-range dependencies in the input sequence. This design reduces computational requirements during inference, particularly in long-context settings. The paper highlights efficiency gains over previous versions in FLOPs and in KV cache memory usage.

As a result, the models can process very long documents or conversations without the typical quadratic scaling issues in standard transformers. The release of these models alongside the DSpark framework suggests a focus on practical deployment considerations for frontier scale models. The MoE design means only a fraction of parameters are active at any time, which contributes to the observed reductions in resource consumption during extended context processing. Stakeholders evaluating these models for production use will note the emphasis on both capability and operational efficiency in the provided specifications.

What details define the DSpark release and its integration?

According to the Hugging Face model card, DeepSeek-V4-Flash-DSpark is not a new model but rather the same checkpoint with an additional speculative decoding module attached. This design choice simplifies the deployment process for users who already have the base model. The module is designed to work with the existing weights, allowing for quick integration of the speculative decoding capability. The GitHub repository provides the necessary code to train custom draft models using the DSpark algorithm, which is one of three options available in the DeepSpec framework. The full-stack implementation covers both the training pipeline for draft models and the evaluation protocols needed to measure speculative decoding performance. This integrated release reduces the engineering effort required for organizations seeking to adopt the technique.

Users can therefore begin with the provided DeepSeek-V4-Flash-DSpark weights and extend the same module training approach to their preferred target models. Because the speculative module is separate from the base checkpoint, the two can in principle be updated independently, although a draft component generally needs at least some retraining when the target model's output distribution changes, since acceptance depends on how closely the draft matches the target. This modular structure supports the iterative improvement cycles common in production AI systems.
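For readers unfamiliar with how a draft module cooperates with a target checkpoint, the following is a minimal sketch of one greedy draft-then-verify loop. It is not DSpark or DeepSpec code: the function names and the toy integer "models" are hypothetical, and it covers only the greedy case, whereas production systems typically verify sampled tokens with a rejection-sampling rule.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[Token], k: int) -> List[Token]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposal. A real system scores all k
    #    positions in one batched forward pass; this loop is sequential
    #    only to keep the sketch short.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token, then stop
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target adds one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Generate max_new tokens using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new]


def greedy(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: plain one-token-at-a-time greedy decoding with the target."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


# Hypothetical toy models: the target counts upward modulo 10, and the
# draft agrees with it except right after a 4, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [3], k=4))
    print(generate(toy_target, toy_draft, [0], max_new=12))
```

The property worth noticing is that the output always equals what the target alone would have produced under greedy decoding. The draft only changes how many target passes are needed, which is why a module can be attached to an existing checkpoint without altering its outputs.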

Which target model families receive support from the DeepSpec framework?

The DeepSpec repository explicitly supports target models from the Qwen and Gemma families. This extension broadens the applicability of the speculative decoding technique beyond DeepSeek's own models. Developers working with Qwen or Gemma can now leverage the draft model training code to improve their inference speeds. The full-stack nature of the codebase includes tools for both training the draft models and evaluating their performance in speculative decoding setups. This open approach could lead to wider adoption of the technique across different model ecosystems. The documentation within the repository outlines the configuration steps required to adapt the draft model training for these external families, ensuring compatibility with their tokenizers and architectures.

Support for multiple families demonstrates the framework's flexibility and reduces the need for custom implementations when organizations operate heterogeneous model portfolios. Researchers can therefore compare performance across architectures using a shared training and evaluation pipeline, which facilitates standardized benchmarking of speculative decoding methods.

What technical specifics characterize the DeepSeek-V4 models?

DeepSeek-V4-Pro has 1.6 trillion total parameters but activates only 49 billion during inference because of its MoE design. Similarly, the Flash variant activates 13 billion of its 284 billion parameters. The one-million-token context is supported through hybrid attention, which helps manage the memory footprint of the KV cache. In long-context scenarios, the paper reports significant savings in both FLOPs and KV cache memory. The hybrid attention mechanism alternates between local and global attention patterns to balance accuracy and efficiency across varying sequence lengths.

DeepSeek-V4 Model Specifications
Model	Total Parameters	Activated Parameters	Context Length
DeepSeek-V4-Pro	1.6T	49B	1 million tokens
DeepSeek-V4-Flash	284B	13B	1 million tokens
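
To make the sparsity in the table concrete, the short sketch below computes the share of parameters activated for each model. It uses only the figures quoted above; the dictionary and function names are illustrative.

```python
# Total and activated parameter counts as quoted in the table above.
SPECS = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}


def activated_fraction(total: float, activated: float) -> float:
    """Return the share of parameters that are activated."""
    return activated / total


for name, (total, activated) in SPECS.items():
    share = activated_fraction(total, activated)
    print(f"{name}: {share:.1%} of parameters activated")
```

On these numbers the Pro model activates roughly 3.1% of its parameters and the Flash model roughly 4.6%.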

How are the supported algorithms in DeepSpec structured?

DSpark serves as the primary algorithm emphasized in the current release for its balance of training stability and decoding acceleration.
DFlash provides an alternative draft model approach included for direct performance comparisons within the same evaluation harness.
Eagle3 completes the set of three algorithms, offering additional options for users seeking varied trade-offs between draft model size and acceptance rate.

The inclusion of multiple algorithms allows users to choose the best fit for their specific use case and hardware setup. DSpark is highlighted in the release for its performance characteristics when paired with the V4 models. The evaluation tools in the repository enable side-by-side testing of the three algorithms against common target models, providing quantitative data on throughput improvements and acceptance rates. This comparative capability supports informed selection during production integration.
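
The link between acceptance rate and throughput can be illustrated with a simplified model that is common in the speculative decoding literature. It is an assumption for illustration, not a description of what the DeepSpec evaluation tools report. If each drafted token is accepted independently with probability alpha and the draft proposes k tokens per step, the expected number of tokens produced per target verification pass is (1 - alpha^(k+1)) / (1 - alpha). The parameter names alpha, k, and draft_cost_ratio below are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass under i.i.d. acceptance.

    alpha: probability that any single drafted token is accepted.
    k: number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # all k drafts accepted plus one bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Rough speedup over plain decoding.

    Assumes verifying k tokens costs about one ordinary target pass and
    each draft pass costs draft_cost_ratio of a target pass.
    """
    step_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, k) / step_cost


if __name__ == "__main__":
    for alpha in (0.6, 0.8, 0.9):
        tokens = expected_tokens_per_step(alpha, k=4)
        speedup = estimated_speedup(alpha, k=4, draft_cost_ratio=0.05)
        print(f"alpha={alpha}: {tokens:.2f} tokens/step, ~{speedup:.2f}x")
```

Under this model, gains flatten as k grows unless the acceptance rate is high, which is one reason a trade-off between draft model size and acceptance rate matters when choosing among algorithms.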

"DeepSeek-V4-Flash-DSpark is not a new model. It is the same checkpoint with an additional speculative decoding module attached." (DeepSeek-AI, model release note)

What market and stakeholder implications arise from this release?

The open-sourcing of DSpark and the DeepSpec codebase could lower the barrier for companies and researchers to implement advanced inference optimizations. Stakeholders in the AI deployment space may see reduced costs associated with running large models at scale. By supporting multiple model families, the framework has the potential to influence how inference is handled in production environments across the industry. The availability of the code on GitHub facilitates collaboration and further development by the community. Organizations evaluating inference stacks can incorporate the provided training pipeline without licensing restrictions, which accelerates internal experimentation cycles.

For enterprises using models like Gemma or Qwen, this provides a new tool to enhance their existing setups without requiring changes to the base models. The focus on efficiency in million-token contexts aligns with growing demand for processing long documents and complex tasks in AI applications. A smaller KV cache means less memory has to be provisioned per request on inference hardware, which can improve utilization in shared computing clusters.

What comes next in the development of these technologies?

Future developments may include further optimizations to the draft model training process and expansion to additional model families. The release of the full-stack code suggests that DeepSeek intends to continue contributing to the open source ecosystem for AI inference tools. Researchers can build upon the provided implementations to explore new speculative decoding strategies. Continued community contributions to the DeepSpec repository are expected to refine the training objectives and evaluation metrics over time.

The modular attachment method demonstrated with DeepSeek-V4-Flash-DSpark may serve as a template for integrating similar modules into other frontier models. This pattern could encourage additional open releases of speculative decoding components from other organizations, fostering a more interoperable set of inference acceleration tools across the ecosystem.

Frequently asked

What models does DSpark support besides DeepSeek's own?

The DeepSpec framework supports target models from the Qwen and Gemma families in addition to DeepSeek models. The repository provides configuration files and training scripts tailored for these external architectures.