EAGLET boosts AI agent performance on long-horizon tasks by generating custom plans


By [email protected]



2025 was supposed to be the year of "AI agents," according to Nvidia CEO Jensen Huang and others in the AI industry. In many ways it has been, with leading AI model providers such as OpenAI and Google, and even Chinese competitors like Alibaba, releasing fine-tuned AI models or applications designed to focus on a narrow set of tasks, such as web search and report writing.

But a big hurdle still stands between us and a future of high-performing, reliable AI agents: getting them to stay on task when that task stretches across many steps. Third-party benchmark tests show that even the most powerful AI models suffer higher failure rates the more steps a task takes, and the longer it runs (beyond a few hours).

A new academic framework called EAGLET proposes a practical, effective way to improve long-horizon task performance in LLM-based agents – without manual data labeling or retraining.

Developed by researchers from Tsinghua University, Peking University, DeepLang AI, and the University of Illinois Urbana-Champaign, EAGLET offers a "global planner" that can be integrated into an agent's existing workflow to reduce hallucinations and improve task efficiency.

EAGLET is a fine-tuned language model that interprets task instructions – typically delivered as prompts by the user or the agent's operating environment – and generates a high-level plan for the executor agent (powered by its own LLM). The planner does not intervene during execution, but its upfront guidance helps reduce planning errors and improve task completion rates.

Addressing the planning problem in long-horizon agents

Many LLM-based agents struggle to complete long-horizon tasks because they rely on reactive, step-by-step reasoning. This approach often leads to trial-and-error behavior, hallucinated plans, and inefficient action sequences.

EAGLET addresses this limitation by introducing a global planning module that works alongside the executor agent.

Instead of conflating planning and action generation in a single model, EAGLET separates them, allowing for more coherent task-level strategies.
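In pseudocode terms, the separation works roughly like this (a minimal sketch; `run_agent`, `planner`, and `executor` are illustrative names and assumptions, not the paper's actual API):

```python
# Illustrative sketch of EAGLET-style planner/executor decoupling.
# The planner is consulted exactly once, up front, and never
# intervenes during execution.

def run_agent(task, planner, executor, max_steps=30):
    plan = planner(task)  # one high-level global plan for the whole task
    history = []
    for _ in range(max_steps):
        # the global plan is part of the executor's context at every step
        action = executor(task, plan, history)
        history.append(action)
        if action == "DONE":
            break
    return history
```

The key property is that planning happens once, before any action is taken; the executor then reasons step by step with the plan as fixed context.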

Two-stage training pipeline without any human annotation

The EAGLET planner is trained using a two-stage process and does not require human-written plans or annotations.

The first stage involves generating synthetic plans with highly capable LLMs, such as GPT-5 and DeepSeek-V3.1-Think.

These plans are then filtered using a new strategy called symmetric consensus filtering, which retains only those that improve task performance for both expert and novice execution agents.

In the second stage, a rule-based reinforcement learning process refines the planner further, using a specially designed reward function to evaluate how well each plan helps multiple executor agents succeed.

Introducing the Executor Capability Gain Reward (ECGR)

One of EAGLET's key innovations is the Executor Capability Gain Reward (ECGR).

This reward measures the value of a generated plan by checking whether it helps both high- and low-capability agents complete tasks more successfully and in fewer steps.

It also includes a decay factor that favors shorter, more efficient task trajectories. This approach avoids over-rewarding plans that only benefit already-competent agents and promotes more generalizable planning guidance.
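The article does not reproduce the exact formula, so the following is a hedged sketch of what an ECGR-style reward could look like: each executor contributes a success signal discounted by a decay factor over trajectory length, and the contributions are averaged so a plan must help both agents to score well.

```python
# Hypothetical ECGR-style reward sketch (the exact formula is an assumption).
# success_*: 1.0 if that executor completed the task with this plan, else 0.0
# steps_*:   number of steps the executor took
# gamma:     decay factor that favors shorter trajectories

def ecgr_reward(success_high, steps_high, success_low, steps_low, gamma=0.95):
    r_high = success_high * gamma ** steps_high
    r_low = success_low * gamma ** steps_low
    # averaging means a plan that only helps the already-capable executor
    # scores at most half as well as one that helps both
    return 0.5 * (r_high + r_low)
```

Under this sketch, a plan that lets both agents finish in 5 steps outscores one that takes 20, and a plan that only the expert can exploit is penalized relative to a broadly useful one.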

Compatible with existing agents and models

The EAGLET planner is designed to be modular and "plug-and-play," meaning it can be dropped into existing agent pipelines without retraining the executor.
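In practice, "plug-and-play" likely amounts to injecting the generated plan into the executor's existing prompt. A minimal sketch (the function name and prompt wording are invented for illustration, not taken from the paper):

```python
# Hypothetical prompt assembly showing how a global plan can be injected
# into an existing ReAct-style executor prompt without retraining anything.
from typing import List, Optional

def build_prompt(task: str, plan: Optional[str], history: List[str]) -> str:
    parts = [f"Task: {task}"]
    if plan:  # EAGLET's addition: one global plan, prepended as context
        parts.append(f"Global plan:\n{plan}")
    parts.extend(history)  # the usual thought/action/observation trace
    parts.append("Next action:")
    return "\n".join(parts)
```

Because the plan is just extra context, the same executor model and prompting loop keep working with or without it.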

In evaluations, the planner boosted performance across a variety of base models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.

It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as methods such as Reflexion.

State-of-the-art performance across benchmarks

EAGLET was tested on three widely used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates scientific experiments in a text-based lab environment; ALFWorld, which tasks agents with completing household activities through natural language in a simulated home; and WebShop, which assesses goal-directed behavior in a realistic online shopping interface.

In all three domains, EAGLET-equipped executor agents outperformed their non-planning counterparts and other planning baselines, including MPO and KnowAgent.

In experiments conducted on the open source Llama-3.1-8B-Instruct model, EAGLET boosted average performance from 39.5 to 59.4, an increase of +19.9 points across tasks.

In ScienceWorld’s unseen scenarios, performance increased from 42.2 to 61.6.

In ALFWorld's seen scenarios, EAGLET improved scores from 22.9 to 54.3, a performance increase of more than 2.3x.

Gains were also seen with more capable models.

For example, GPT-4.1 improved from 75.5 to 82.2 on average with EAGLET, and GPT-5 rose from 84.5 to 88.1, despite already strong performance.

In some benchmarks, performance gains reached +11.8 points, such as when combining EAGLET with the ETO executor method on ALFWorld's unseen tasks.

Compared to other planning baselines such as MPO, EAGLET consistently provided higher task completion rates. For example, in ALFWorld’s unseen tasks using GPT-4.1, MPO achieved 79.1, while EAGLET scored 83.6 – a +4.5 point advantage.

Additionally, the paper indicates that agents using EAGLET complete tasks in fewer steps on average. Using GPT-4.1 as the executor, the average step count dropped from 13.0 (no planner) to 11.1 (with EAGLET). With GPT-5, it dropped from 11.4 to 9.4, supporting the claim of improved execution efficiency.

Efficiency gains in training and execution

Compared to RL-based methods such as GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with approximately one-eighth the training effort.

This efficiency carries over to execution as well: agents using EAGLET typically need fewer steps to complete tasks, which translates into reduced inference time and compute cost in production scenarios.

No public code – yet

As of the version submitted to arXiv, the authors have not released an open source implementation of EAGLET. It is not clear if or when the code will be released, under what license, or how it will be maintained, which may limit the framework’s usefulness for enterprise deployment in the near term.

VentureBeat has reached out to the authors to clarify these points and will update this piece when we hear back.

Questions remain about enterprise-grade deployment

Although the planner is described as plug-and-play, it remains unclear whether EAGLET can be easily integrated into popular enterprise agent frameworks such as LangChain or AutoGen, or whether it requires a custom stack to support decoupled planning and execution.

Likewise, the training setup leverages multiple executor agents, which may be difficult to replicate in enterprise environments with limited model access. VentureBeat asked the researchers whether the consensus filtering method could be adapted for teams that only have access to a single executor model or limited computational resources.

The EAGLET authors report success across model types and sizes, but the minimum viable model scale for practical deployment is not yet known. For example, can enterprise teams effectively use the planner with sub-10B-parameter open models in latency-sensitive environments? Additionally, the framework may offer industry-specific value in areas such as customer support or IT automation, but it remains to be seen how easily the planner can be fine-tuned or customized for such sectors.

Real-time versus pregenerated planning

Another open question is how EAGLET is best deployed in practice. Should the planner run in real time alongside executors in a loop, or is it better used offline to pregenerate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat has posed this question to the authors and will report back on any insights that emerge.

Strategic trade-offs for enterprise teams

For technical leaders at medium to large organizations, EAGLET represents a compelling proof of concept for improving the reliability and efficiency of LLM agents. But without released tooling or implementation guidelines, the framework still presents a build-versus-wait decision. Companies must weigh potential gains in task performance and efficiency against the cost of reproducing or approximating the training process in-house.

Potential use cases in enterprise settings

For organizations building agentic AI systems – especially in environments that require multi-step planning, such as IT automation, customer support, or web-based interactions – EAGLET offers a template for incorporating planning without retraining. Its ability to boost both open- and closed-source models, coupled with its efficient training approach, may make it an attractive starting point for teams seeking to improve agent performance with minimal overhead.


