The latest addition to the wave of small enterprise models comes from AI21 Labs, which is betting that bringing models to devices will free up traffic in data centers.
AI21's Jamba Reasoning 3B is a "small-scale" open-source model that can run extended reasoning, generate code, and produce responses grounded in ground truth. It handles a context window of more than 250,000 tokens and can run reasoning on-device.
Jamba Reasoning 3B works on devices such as laptops and mobile phones, the company said.
Ori Goshen, co-CEO of AI21, told VentureBeat that the company sees more enterprise use cases for small models, mainly because moving most inference onto devices frees up data center capacity.
“What we’re seeing now in the industry is an economic problem where there are very expensive data center buildouts, and the revenue that’s being generated by the data centers versus the rate of consumption of all of their chips shows that the math doesn’t add up,” Goshen said.
He added that in the future, "the industry in general will be hybrid, meaning some computation will happen locally on devices while other inference will move to GPU clusters."
Tested on a MacBook
Jamba Reasoning 3B combines the Mamba architecture with transformer layers, allowing it to run a 250K-token context window on-device. AI21 said this enables 2-4x faster inference speeds. Goshen said the Mamba architecture contributed significantly to the model's speed.
The hybrid architecture of Jamba Reasoning 3B also reduces its memory requirements, which in turn lowers its compute needs.
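To see why fewer attention layers matter at a 250K-token window, the back-of-the-envelope sketch below compares the attention KV-cache footprint of a hypothetical pure-transformer 3B model against a hybrid that keeps attention in only a fraction of its layers. The layer counts, head dimensions, and attention ratio are illustrative assumptions, not AI21's published figures:

```python
# Back-of-the-envelope KV-cache sizing at a 250K-token context.
# All architecture numbers below are illustrative assumptions for a
# ~3B-parameter model, not the actual Jamba Reasoning 3B config.

CONTEXT = 250_000   # tokens in the window
LAYERS = 32         # assumed total layer count
KV_HEADS = 8        # assumed grouped-query KV heads
HEAD_DIM = 128      # assumed per-head dimension
BYTES = 2           # fp16/bf16 bytes per element

def kv_cache_gb(attention_layers: int) -> float:
    """KV cache = 2 (K and V) * layers * heads * head_dim * tokens * bytes."""
    total = 2 * attention_layers * KV_HEADS * HEAD_DIM * CONTEXT * BYTES
    return total / 1e9

full_transformer = kv_cache_gb(LAYERS)   # every layer caches K/V
hybrid = kv_cache_gb(LAYERS // 8)        # e.g. 1-in-8 attention layers;
                                         # Mamba layers keep fixed-size state

print(f"pure transformer: {full_transformer:.1f} GB")   # ~32.8 GB
print(f"hybrid (1/8 attention): {hybrid:.1f} GB")       # ~4.1 GB
```

Under these assumed numbers, the hybrid's cache shrinks from tens of gigabytes to a few, which is the difference between needing a GPU server and fitting on a laptop.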
AI21 tested the model on a standard MacBook Pro and found that it could process 35 tokens per second.
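A throughput number like that can be reproduced locally with a simple timing loop. The sketch below assumes the model is published on Hugging Face under an id like `ai21labs/AI21-Jamba-Reasoning-3B`; the exact repo id and any extra Mamba-kernel dependencies are assumptions, so check AI21's model card:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo id; verify against AI21's official model card.
MODEL_ID = "ai21labs/AI21-Jamba-Reasoning-3B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

prompt = "Summarize the tradeoffs of running LLM inference on-device."
inputs = tokenizer(prompt, return_tensors="pt")

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, not the prompt.
new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```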
The model works best for tasks that involve function calling, policy-grounded generation, and tool routing, Goshen said. Simple requests, such as asking for information about an upcoming meeting and asking the model to create an agenda for it, can be handled on-device, he said. More complex reasoning tasks can be offloaded to GPU clusters.
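The hybrid split Goshen describes amounts to a routing decision in front of two backends. The sketch below illustrates that pattern; the complexity heuristic and both backends are placeholders, not AI21's implementation:

```python
# Hypothetical device/cloud router: cheap, latency-sensitive requests stay
# on the local small model; heavy reasoning goes to a GPU-cluster endpoint.

def looks_complex(prompt: str) -> bool:
    # Toy heuristic: very long prompts or explicit multi-step asks
    # get escalated to the cluster.
    return len(prompt) > 2_000 or "step by step" in prompt.lower()

def run_on_device(prompt: str) -> str:
    # e.g. call a locally loaded small model (see the sketch above)
    return f"[local] {prompt[:40]}..."

def run_on_cluster(prompt: str) -> str:
    # e.g. POST to an internal GPU-serving endpoint (placeholder)
    return f"[cluster] {prompt[:40]}..."

def route(prompt: str) -> str:
    return run_on_cluster(prompt) if looks_complex(prompt) else run_on_device(prompt)

print(route("Create an agenda for tomorrow's 30-minute standup."))
print(route("Prove the bound step by step and check each edge case."))
```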
Small models in organizations
Companies have been interested in using a mix of small models, some purpose-built for their industry and some that are distilled versions of larger LLMs.
In September, Meta released MobileLLM-R1, a family of reasoning models with parameter counts ranging from 140M to 950M. These models are designed for math, coding, and scientific reasoning rather than chat applications, and MobileLLM-R1 can run on compute-constrained devices.
Google's Gemma was one of the first small models to hit the market, designed to run on portable devices such as laptops and mobile phones. Gemma has since been expanded.
Companies like FICO have also started building their own models. FICO launched its FICO Focused Language and FICO Focused Sequence small models, which only answer finance-related questions.
Goshen said the big differentiator of AI21's model is that it is smaller than most yet can run reasoning tasks without sacrificing speed.
Benchmark tests
In benchmark tests, Jamba Reasoning 3B showed strong performance compared to other small models, including Qwen 4B, Meta's Llama 3.2 3B, and Microsoft's Phi-4-Mini.
It outperformed all of those models on IFBench and Humanity's Last Exam, though it came in second only to Qwen 4B on MMLU-Pro.
Another advantage of small models like Jamba Reasoning 3B is that they are highly portable and provide better privacy options for organizations, because inference is not sent to an external server, Goshen said.
"I think there's a world where you can better serve customer needs and experience, and the models that will be maintained on devices are a big part of it," he said.