Companies scaling AI deployments are hitting an invisible performance wall. The culprit? Static speculators that can’t keep up with changing workloads.
Speculators are smaller AI models that work alongside larger language models during inference. They draft multiple tokens ahead of time, which the main model then verifies in parallel. This technique (called speculative decoding) has become essential for organizations trying to reduce inference costs and latency. Instead of generating tokens one by one, the system can accept multiple tokens at once, greatly improving throughput.
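A minimal sketch of that loop is shown below; the model objects and their methods (propose, verify) are illustrative placeholders for explanation, not any vendor’s actual API.

```python
# Minimal sketch of a speculative decoding loop (illustrative, not a real API).

def speculative_decode(target_model, draft_model, prompt_ids, k=5, max_new_tokens=256):
    """Generate up to max_new_tokens, accepting several draft tokens per verification step."""
    tokens = list(prompt_ids)
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        # 1. The small draft model proposes k tokens autoregressively (cheap to run).
        draft = draft_model.propose(tokens, num_tokens=k)

        # 2. The large target model scores the whole draft in a single forward pass
        #    and returns the longest accepted prefix plus its own correction token,
        #    so at least one token is produced per step.
        accepted = target_model.verify(tokens, draft)

        tokens.extend(accepted)
    return tokens
```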
Together AI today announced new research and a system called ATLAS (AdapTive-LEarning Speculator System) that aims to help enterprises overcome the challenge of static speculators. The technology provides self-learning inference optimization that can deliver up to 400% faster inference performance than the baseline available in existing inference technologies such as vLLM. The system addresses a critical problem: as AI workloads evolve, inference speeds degrade, even with specialized speculators.
The company, which got its start in 2023, has focused on optimizing inference on its enterprise AI platform. Earlier this year the company raised $305 million as customer adoption and demand grew.
"The companies we work with generally, as they scale, they see changing workloads, and then they don’t see the same amount of speed of speculative execution as before," Trey Dow, chief scientist at Together AI, told VentureBeat in an exclusive interview. "These scalpers generally don’t do well when their workload area starts to shift."
The workload drift problem no one talks about
Most speculators in production today are "static" models. They are trained once on a fixed dataset representing expected workloads, then deployed without any ability to adapt. Companies like Meta and Mistral ship pre-trained speculators alongside their main models. Inference platforms like vLLM use these static speculators to increase throughput without changing output quality.
But there is a catch. As an organization’s use of AI evolves, the static speculator’s accuracy declines.
"If you’re a company that produces programming agents, and most of your developers are writing in Python, and suddenly some of them switch to writing Rust or C, then you see that speed starts to drop," Dao explained. "The hitter has a mismatch between what he has been trained to do versus his actual workload."
This workload drift represents a hidden tax on scaling AI. Companies either accept degraded performance or invest in retraining custom speculators, a process that captures only a snapshot in time and quickly becomes outdated.
How adaptive speculators work: a dual model approach
ATLAS uses a dual-speculator architecture that combines stability with adaptation (a minimal sketch of the routing logic follows the list below):
Static speculator – A heavyweight model trained on broad data provides consistent baseline performance, acting as a "speed floor."
Adaptive speculator – A lightweight model that continuously learns from live traffic, specializing on the fly in emerging domains and usage patterns.
Confidence-aware controller – An orchestration layer that dynamically chooses which speculator to use and tunes the speculation "lookahead" based on confidence scores.
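To make the division of labor concrete, here is an illustrative sketch of how such a confidence-aware controller could route between the two speculators; the class, method names, and thresholds are assumptions made for explanation, not ATLAS internals.

```python
# Illustrative controller that picks a speculator and a lookahead depth.
# All names and thresholds here are assumed for the sake of explanation.

class SpeculatorController:
    def __init__(self, static_spec, adaptive_spec, confidence_threshold=0.7):
        self.static_spec = static_spec        # heavyweight, broadly trained "speed floor"
        self.adaptive_spec = adaptive_spec    # lightweight, learns from live traffic
        self.threshold = confidence_threshold

    def choose(self, context):
        # Confidence could be tracked, e.g., from how often recent drafts were accepted.
        conf = self.adaptive_spec.confidence(context)
        if conf >= self.threshold:
            # Trust the specialized drafter and speculate further ahead.
            return self.adaptive_spec, self.lookahead(conf)
        # Otherwise fall back to the static speculator with a conservative lookahead.
        return self.static_spec, 3

    def lookahead(self, conf):
        # Longer draft windows only pay off when acceptance is likely.
        return 4 + int(4 * conf)   # e.g. 4-8 draft tokens
```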
"Before the adaptive beater learns anything, we still have the fixed beater to help provide the speed boost initially," Ben Athiwaratakon, an AI scientist at Together AI, explained to VentureBeat. "Once the adaptive striker becomes more confident, the speed grows over time."
The technical innovation lies in balancing the acceptance rate (how often the target model agrees with the draft tokens) against draft latency. As the adaptive model learns from traffic patterns, the controller relies more on the lightweight speculator and extends the lookahead, which compounds the performance gains.
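That tradeoff can be quantified with a standard back-of-the-envelope model from the speculative decoding literature: if each draft token is accepted independently with probability a, a lookahead of k drafts yields the expected tokens per verification step computed below. The independence assumption is a simplification, but it shows why longer lookahead only pays off once acceptance rates are high.

```python
# Expected tokens produced per target-model verification step, assuming each
# draft token is accepted independently with probability a (a common
# simplification in the speculative decoding literature).

def expected_tokens_per_step(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

print(expected_tokens_per_step(0.6, 4))  # ~2.3 tokens per step
print(expected_tokens_per_step(0.9, 8))  # ~6.1 tokens per step: higher acceptance
                                         # makes a longer lookahead worthwhile
```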
Users do not need to tune any parameters. "On the user side, users don’t have to turn any knobs," Dao said. "On our side, we have tuned these knobs for users into a configuration that achieves good speedup."
Performance that rivals custom silicon
Together AI’s testing shows ATLAS reaching 500 tokens per second on DeepSeek-V3.1 when fully adapted. Even more impressive, those numbers on Nvidia B200 GPUs match or exceed specialized inference chips such as Groq’s custom hardware.
"Improving software and algorithms is able to bridge the gap with truly specialized hardware," Dow said. "We’ve seen 500 codes per second on these massive models which is faster than some dedicated chips."
The 400% inference speedup the company claims represents the cumulative effect of Together’s Turbo optimization suite. FP4 quantization delivers an 80% speedup over the FP8 baseline. The static Turbo speculator adds another 80-100% gain. The adaptive system layers on top. Each optimization compounds the benefits of the others.
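As a rough illustration of how those individually reported gains compound, consider multiplying them together; the adaptive-layer factor below is an assumed example value, not a published figure.

```python
# Rough arithmetic for multiplicative stacking of the reported gains.
# The adaptive-layer factor is an assumed example value for illustration.

fp4_gain = 1.8          # "80% speedup over the FP8 baseline"
static_spec_gain = 1.9  # midpoint of the reported 80-100% gain
adaptive_gain = 1.2     # assumed additional gain once the adaptive speculator warms up

total = fp4_gain * static_spec_gain * adaptive_gain
print(f"~{total:.1f}x over the baseline")   # ~4.1x, in line with the "up to 400%" figure
```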
Compared with standard inference engines such as vLLM or Nvidia’s TensorRT-LLM, the improvement is significant. Together AI benchmarks against whichever of the two is the stronger baseline for each workload before applying speculative optimizations.
The memory-compute tradeoff explained
The performance gains stem from exploiting a fundamental inefficiency in modern inference: wasted compute capacity.
Dao explained that during inference, much of the available compute typically sits idle.
"During inference, which is actually the dominant workload nowadays, you are mostly using the memory subsystem," He said.
Speculative decoding trades idle compute for reduced memory access. When the model generates one token at a time, it is memory-bound: the GPU sits idle waiting for memory. But when the speculator drafts five tokens and the target model verifies them simultaneously, compute utilization rises while memory access stays roughly constant.
"The total amount of computation to generate five codes is the same, but you only had to access memory once, instead of five times." Dow said.
Think of it as intelligent caching for AI
For infrastructure teams familiar with traditional database optimization, adaptive speculation serves as an intelligent caching layer, but with a crucial difference.
Traditional caching systems such as Redis or memcached require exact matches: they store a specific query result and retrieve it only when that exact query runs again. Adaptive speculators work differently.
"You can look at it as a smart way of caching, not exactly caching, but detecting some of the patterns that you see," Dao explained. "Broadly, we see that you’re working with similar code, or working with similar code, you know, and controlling the account in a similar way. We can then predict what the big model will say. We’re getting better and better at predicting that."
Instead of storing exact responses, the system learns patterns in how the model generates tokens. It recognizes that if you are editing Python files in a certain code base, certain token sequences become more likely. The speculator adapts to those patterns, improving its predictions over time without requiring exact-match inputs.
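The distinction can be sketched with a deliberately tiny, hypothetical n-gram drafter: unlike an exact-match cache, it keeps predicting usefully on inputs it has never seen verbatim, because it learns continuation statistics rather than storing responses.

```python
# Hypothetical stand-in for a lightweight adaptive drafter: learns continuation
# statistics from live traffic instead of caching exact query results.

from collections import Counter, defaultdict

class NGramDrafter:
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)

    def observe(self, tokens):
        # Learn continuation statistics from already-verified live traffic.
        for i in range(len(tokens) - self.n):
            ctx = tuple(tokens[i:i + self.n])
            self.counts[ctx][tokens[i + self.n]] += 1

    def draft(self, tokens, k=5):
        # Propose up to k likely continuations for a context never seen verbatim.
        out = []
        ctx = tuple(tokens[-self.n:])
        for _ in range(k):
            if ctx not in self.counts:
                break
            nxt = self.counts[ctx].most_common(1)[0][0]
            out.append(nxt)
            ctx = ctx[1:] + (nxt,)
        return out
```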
Use cases: RL training and evolving workloads
Two enterprise scenarios benefit particularly from adaptive speculators:
Reinforcement learning training: Static speculators quickly fall out of alignment as the policy evolves during training. ATLAS adapts continuously as the policy distribution shifts.
Evolving workloads: As organizations discover new use cases for AI, the composition of the workload changes. "Maybe they started using AI for chatbots, but then they realized it can write code, so they started shifting to code," Dao said. "Or they realize these AI systems can actually call tools, control computers, and do accounting and things like that."
In an active coding session, the adaptive system can specialize in the specific code base being edited, even on files never seen during training, which raises acceptance rates and decoding speed.
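Tying the earlier sketches together, an online-adaptation loop might look like the following, where every verified output is fed back into the lightweight drafter so it drifts toward whatever the live workload looks like; again, this is an illustrative sketch under the assumptions above, not Together AI’s implementation.

```python
# Illustrative serving loop that reuses the hypothetical sketches above:
# the adaptive drafter is updated from the tokens the target model verified.

def serve_with_adaptation(target_model, controller, adaptive_drafter, request_stream):
    for prompt_ids in request_stream:
        speculator, lookahead = controller.choose(prompt_ids)
        output = speculative_decode(target_model, speculator, prompt_ids, k=lookahead)
        # Feed verified tokens back so the lightweight drafter specializes on
        # the current workload (e.g. the code base being edited right now),
        # with no exact-match requirement.
        adaptive_drafter.observe(output)
```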
What it means for organizations and the inference ecosystem
ATLAS is now available on Together AI’s dedicated endpoints as part of the platform at no additional cost. The company’s 800,000-plus developers (up from 450,000 in February) have access to the optimization.
But the broader implications extend beyond a single vendor’s product. The shift from static to adaptive optimization represents a fundamental rethinking of how inference platforms work. As organizations deploy AI across multiple domains, the industry will need to move beyond one-time trained models toward systems that continuously learn and improve.
Together AI has historically released some of its research technologies as open source and collaborated with projects such as vLLM. Although the fully integrated ATLAS system is proprietary, some of the underlying technologies may ultimately impact the broader inference ecosystem.
For companies looking to lead in AI, the message is clear: adaptive algorithms on commodity hardware can match custom silicon at a fraction of the cost. As this approach matures throughout the industry, software optimization is increasingly outpacing specialized hardware.