Frontier Models
Sakana Fugu Ultra Leads SWE-Bench Pro with 73.7 Percent Score Using Conductor Orchestration
Sakana AI's 7B Fugu Ultra model coordinates multiple frontier models to outperform Claude Opus 4.8 and GPT-5.5 on software engineering benchmarks, offering an accessible API alternative.
Sakana Fugu Ultra is a 7 billion parameter conductor model developed by Sakana AI that uses multi-agent orchestration to coordinate swappable frontier models for superior performance on software engineering and coding tasks.
Sakana AI has released Fugu Ultra, a model that redefines how AI systems approach complex software engineering challenges. The 7B parameter conductor achieves a score of 73.7 percent on the SWE-Bench Pro benchmark. This result places it ahead of Claude Opus 4.8, which scores 69.2 percent, and GPT-5.5 at 58.6 percent. The Gemini 3.1 Pro model reaches 54.2 percent on the same benchmark. These figures come from evaluations using the mini-swe-agent scaffolding as described in Sakana's technical documentation. The approach allows the smaller model to leverage the strengths of multiple larger models without requiring access to the largest proprietary systems. Sakana positions Fugu Ultra as a way to access collective intelligence through orchestration. The model is grounded in research from the ICLR 2026 TRINITY and Conductor papers. This training enables it to dynamically select and combine agents based on the task at hand. Developers can access the system through an OpenAI-compatible API, with Fugu for balanced latency and Fugu Ultra for maximum quality on difficult tasks. The system trails some models like Anthropic Fable 5 on certain benchmarks but performs competitively with accessible models. This development highlights a shift toward orchestration as a key capability in frontier AI. The model also demonstrates strong results on additional tests, reaching 82.1 on TerminalBench 2.1, further validating the orchestration strategy across varied evaluation environments.
How does the Fugu Ultra conductor model coordinate multiple agents?
The technical foundation of Fugu Ultra lies in its ability to act as an orchestrator rather than a standalone solver. It draws from a swappable pool of frontier models, selecting the most appropriate ones for each subtask. The conductor model, trained on principles from the Conductor papers, evaluates the query and determines the optimal formation of agents. This process resembles a coach in sports, as described by Stefan Nielsen, choosing different players and tactics based on the opponent. The model uses the mini-swe-agent framework to structure the interactions among the selected agents. Each agent can be drawn from models like those from OpenAI, Anthropic, or Google, allowing flexibility. The system assembles the team, coordinates their efforts, and refines the output through iterative processes. This dynamic assembly enables Fugu Ultra to exceed the capabilities of any single model in the pool. The research indicates that this method provides access to collective knowledge that individual models cannot achieve alone. Training involved extensive simulations of agent interactions to optimize the orchestration strategies. The result is a system that can handle agentic coding tasks with high efficiency. Sakana AI emphasizes that the 7B size keeps the conductor lightweight while the heavy lifting is done by the coordinated experts. This architecture also helps in navigating export control restrictions by not relying on any single restricted model. The system is particularly effective for hard tasks where single models struggle with complex dependencies. Additional testing has shown that the conductor can adapt to new models quickly, reducing the time to integrate new capabilities. The training data included a wide range of task types to ensure generalizability. This comprehensive training contributes to the robust performance across different benchmarks. The model also shows promise in reducing hallucinations by cross-verifying outputs from multiple agents. This is an important feature for applications where accuracy is critical. Sakana AI is committed to ongoing research to further enhance these aspects of the system.
Further details from the technical report show that Fugu Ultra excels in both SWE-Bench Pro and TerminalBench 2.1. On the latter, it achieves a score of 82.1. The model is designed to handle the full lifecycle of software engineering tasks, from understanding requirements to generating and testing code. The orchestration allows for parallel execution where possible and sequential refinement when necessary. By using different models for different stages, the system can optimize for accuracy in reasoning, code generation, and debugging. The swappable pool means that new models can be added without retraining the conductor. This modularity is a key advantage in the rapidly evolving field of frontier models. Sakana's approach builds on the idea that no single model is best at everything, so combining them strategically yields better results. The training process for the conductor involved reinforcement learning on successful orchestration paths. This has led to the high performance observed in the benchmarks. The system is particularly effective for hard tasks where single models struggle with complex dependencies. The model supports extensive iteration cycles that allow agents to critique and improve upon each other's contributions before finalizing outputs.
| Model | SWE-Bench Pro Score |
|---|---|
| Fugu Ultra | 73.7 |
| Claude Opus 4.8 | 69.2 |
| GPT-5.5 | 58.6 |
| Gemini 3.1 Pro | 54.2 |
| Mythos Preview | 59.0 |
What market implications arise from the availability of Fugu Ultra?
The release of Fugu Ultra through an OpenAI-compatible API opens new possibilities for developers and enterprises. Companies can integrate the conductor model into their workflows without the need to manage multiple API keys for different providers. The two tiers allow users to choose based on their needs, with Fugu Ultra reserved for the most demanding tasks. This accessibility reduces the barriers to using advanced agentic systems. In terms of market dynamics, it challenges the dominance of single large models by offering a meta-layer that enhances their utility. Sakana AI notes that Fugu models trail inaccessible models like Anthropic Fable 5 on some benchmarks but match or exceed publicly accessible ones. This positions the product as a practical solution for many users. The approach also addresses concerns around export controls by allowing orchestration of available models. Stakeholders in the AI industry may see this as a way to democratize access to high-performance agentic capabilities. The API compatibility means minimal changes for existing applications built on OpenAI standards. Sakana expects adoption in software development teams looking for reliable coding assistance. The performance gains come from the intelligent combination rather than scale alone. This could influence how future models are developed, with more focus on orchestration capabilities. The pricing structure supports both experimentation and production use cases, broadening the potential user base beyond large organizations.
Market analysts may note that this model represents a new category of AI systems focused on coordination. The ability to swap models in the pool provides resilience against changes in model availability or performance. For example, if one provider updates their model, the conductor can incorporate the new version seamlessly. This flexibility is valuable in a field where models are frequently updated. The pricing and availability details are provided on the Sakana website, making it straightforward for potential users to evaluate. The emphasis on agentic coding and software engineering tasks aligns with growing demand in those areas. By achieving state-of-the-art results on SWE-Bench Pro and TerminalBench 2.1, Fugu Ultra demonstrates real-world applicability. The model is particularly suited for environments where multiple models are already in use, allowing the conductor to optimize their combined output. This could lead to cost savings by reducing the need for the most expensive models on every task. Sakana's strategy appears to be building an ecosystem around the conductor concept. Integration options with popular development environments further enhance its appeal to professional teams.
What expert perspectives exist on the Fugu orchestration method?
Sakana Fugu はスポーツに例えるなら、さまざまな選手(AIモデル)やフォーメーション(組み合わせ)、戦術(タスク処理)を対戦相手(クエリ)に応じて使い分けるコーチのような存在です。AIのチームを指揮するAIとして集合知にアクセスし、個々のモデルを上回るパフォーマンスを発揮しますStefan Nielsen, Chief Scientist, Sakana AI
Stefan Nielsen, Chief Scientist at Sakana AI, has provided insights into the philosophy behind the Fugu models. His description frames the conductor as a coach that adapts strategies based on the specific challenge. This analogy underscores the adaptive nature of the system. Other experts in the field have noted the potential of such orchestration to push beyond the limits of individual models. The technical report from the Fugu Team highlights the state-of-the-art performance in agentic coding. The approach is seen as a step toward more robust AI systems that can handle real-world software engineering scenarios. Reactions from the community focus on the practical benefits of the API access. The model provides a way to experiment with multi-agent systems without building the infrastructure from scratch. This lowers the entry barrier for researchers and developers interested in agentic AI. The success on benchmarks like SWE-Bench Pro validates the training methods used. Sakana AI continues to refine the conductor based on feedback and new research. The integration with existing tools through the compatible API facilitates broader adoption. Experts expect this to influence the design of future agent frameworks. Community discussions often highlight the potential for similar conductor models in other domains.
What steps outline the process of Fugu Ultra task handling?
- Analyze the incoming query to determine task requirements and complexity.
- Select appropriate expert models from the swappable pool based on strengths.
- Assemble the agents into an optimal formation for collaboration.
- Coordinate the execution using the mini-swe-agent scaffolding.
- Evaluate the outputs and refine through additional iterations if needed.
- Deliver the final result optimized for the specific task.
What role does the mini-swe-agent play in the Fugu system?
The mini-swe-agent serves as the scaffolding that structures the interactions between the conductor and the expert agents. It provides the framework for task decomposition, agent communication, and result aggregation. Without this scaffolding, the orchestration would lack the necessary structure to produce coherent outputs. The Sakana team has optimized this component to work seamlessly with the Fugu Ultra model. The scaffolding allows for standardized evaluation of agent contributions and facilitates the dynamic adjustments during task execution. This component is crucial for maintaining consistency across different model combinations. The technical report details how the mini-swe-agent enables the high scores observed in the benchmarks. It handles the logistics of agent calls and response parsing. The integration with the conductor model is a key innovation in the Fugu system. Developers using the API benefit from this pre-built scaffolding without needing to implement it themselves. The design ensures that the system can scale to more complex tasks by extending the scaffolding as needed. Sakana AI continues to iterate on this component to improve efficiency and accuracy. The use of this scaffolding is mentioned in the benchmark evaluations as a standard part of the testing protocol. This allows for fair comparisons with other systems using similar setups. The overall system benefits from this modular approach to agent management. The scaffolding also includes mechanisms for error detection and recovery that enhance reliability during extended task sessions.
What is next for Sakana Fugu models in the frontier AI landscape?
Looking ahead, Sakana AI is likely to expand the capabilities of the Fugu conductor series. Future iterations may incorporate additional models into the pool as new frontier systems are released. The company may also enhance the training of the conductor to handle even more complex multi-step tasks. Integration with other tools and platforms could broaden its use cases beyond software engineering. The research papers from ICLR 2026 provide a foundation for further advancements in orchestration techniques. Sakana aims to maintain the balance between model size and performance, keeping the conductor efficient. The API tiers may see updates to include more options for users. As benchmarks evolve, the team will continue to test and report new scores. The focus remains on providing a tool that accesses collective intelligence from multiple sources. This strategy positions Sakana as a key player in the agentic AI space. Developers can expect ongoing improvements in latency and quality. The overall trend suggests more emphasis on meta-models that orchestrate rather than replace existing systems. Sakana's work with Fugu Ultra sets a precedent for this direction in frontier model development. The company is exploring applications in additional domains to demonstrate the versatility of the conductor approach.
The development of Fugu Ultra reflects broader trends in the AI industry toward specialized orchestration models. As frontier models grow in capability, the need for systems that can effectively combine them becomes more apparent. Sakana AI's work provides a concrete example of how this can be achieved with a relatively small conductor model. The success on multiple benchmarks validates the underlying research from the TRINITY and Conductor papers. Future work may explore applying similar techniques to other domains beyond coding, such as scientific research or creative tasks. The API availability makes it possible for a wide range of users to experiment with these ideas. Sakana plans to release additional documentation and examples to support adoption. The company is also engaging with the research community to share insights from the development process. This open approach could accelerate progress in the field of multi-agent systems. The benchmark results serve as a starting point for further improvements and comparisons. Overall, Fugu Ultra represents an important step in making advanced AI orchestration accessible and effective. The modular design allows for rapid adaptation as new models emerge in the ecosystem.
Frequently asked
What is SWE-Bench Pro?
SWE-Bench Pro is a benchmark designed to evaluate AI models on realistic software engineering tasks drawn from GitHub issues. It measures the ability to generate correct code patches for complex problems.
How can developers access Sakana Fugu Ultra?
Developers can access Sakana Fugu Ultra through an OpenAI-compatible API. Two tiers are available: Fugu for balanced performance and Fugu Ultra for maximum quality on challenging tasks.
Does Fugu Ultra require access to restricted models?
Fugu Ultra orchestrates from a pool of publicly accessible frontier models. This design helps avoid export control issues associated with some advanced AI systems.