OpenCUA’s open source computer-use agents rival proprietary models from OpenAI and Anthropic

A new framework from researchers at The University of Hong Kong (HKU) and collaborating institutions provides an open source foundation for building robust AI agents that can operate computers. The framework, called OpenCUA, includes the tools, data, and recipes needed to scale the development of computer-use agents (CUAs).

Models trained with this framework perform strongly on CUA benchmarks, outperforming existing open source models and competing closely with closed agents from leading AI labs such as OpenAI and Anthropic.

The challenge of building computer use agents

Computer-use agents are designed to complete tasks autonomously on a computer, from navigating websites to operating complex software. They can also help automate workflows in the enterprise. However, the most capable CUA systems are proprietary, with important details about their training data, how they are built, and their development processes kept undisclosed.

“As the lack of transparency limits technical advancements and raises safety concerns, the research community needs truly open CUA frameworks to study their capabilities, limitations, and risks,” the researchers state in their paper.

At the same time, open source efforts face their own set of hurdles. There has been no scalable infrastructure for collecting the diverse, large-scale data needed to train these agents. Existing open source datasets for graphical user interfaces (GUIs) contain limited data, and many research projects do not provide sufficient detail about their methods, which makes it difficult for others to replicate their work.

According to the paper, “These limitations collectively hinder advances in general-purpose CUAs and restrict a meaningful exploration of their scalability, generalizability, and potential learning approaches.”

OpenCUA

OpenCUA framework. Source: XLANG Lab at HKU

OpenCUA is an open source framework designed to address these challenges by scaling both the data collection and the models themselves. At its core is the AgentNet Tool, which records human demonstrations of computer tasks on different operating systems.

The tool streamlines data collection by running in the background on an annotator’s personal computer, capturing screen videos, mouse and keyboard inputs, and the underlying accessibility tree, which provides structured information about the elements on screen. This raw data is then processed into “state-action trajectories,” pairing a screenshot of the computer (the state) with the user’s corresponding action (a click, a key press, etc.). Annotators can then review, edit, and submit these demonstrations.
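To make the idea of a state-action trajectory concrete, here is a minimal sketch in Python. It is not the AgentNet Tool’s actual code or data format: the `RawEvent`, `Step`, and `build_trajectory` names, the timestamp-based pairing rule, and the action string format are all assumptions made for illustration. It pairs each recorded input event with the most recent screenshot captured at or before that event.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RawEvent:
    """One recorded input event (hypothetical format)."""
    timestamp: float
    kind: str    # e.g. "click" or "type"
    detail: str  # e.g. "x=10,y=20" or the typed text


@dataclass
class Step:
    """One state-action pair: what the screen showed, and what the user did."""
    screenshot: str  # identifier of the frame that represents the state
    action: str


def build_trajectory(frames: List[Tuple[float, str]],
                     events: List[RawEvent]) -> List[Step]:
    """Pair each event with the latest frame captured at or before it."""
    frames = sorted(frames)
    times = [t for t, _ in frames]
    steps = []
    for event in sorted(events, key=lambda e: e.timestamp):
        idx = bisect_right(times, event.timestamp) - 1
        if idx < 0:
            continue  # no screenshot exists yet for this event; skip it
        steps.append(Step(frames[idx][1], f"{event.kind}({event.detail})"))
    return steps


frames = [(0.0, "f0.png"), (1.0, "f1.png"), (2.0, "f2.png")]
events = [RawEvent(0.5, "click", "x=10,y=20"), RawEvent(2.0, "type", "hello")]
trajectory = build_trajectory(frames, events)
for step in trajectory:
    print(step.screenshot, "->", step.action)
```

A real recorder would also have to merge low-level events (for example, individual key presses into one typing action) and attach accessibility-tree data to each state, which this sketch leaves out.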

AgentNet Tool. Source: XLANG Lab at HKU

Using this tool, the researchers collected the AgentNet dataset, which contains more than 22,600 task demonstrations across Windows, macOS, and Ubuntu, spanning more than 200 applications and websites. “This dataset authentically captures the complexity of human behaviors and environmental dynamics from users’ personal computing environments,” the paper notes.

Recognizing that screen-recording tools raise significant data privacy concerns for enterprises, the researchers designed the AgentNet Tool with security in mind. Xinyuan Wang, co-author of the paper and a PhD student at HKU, explained that they implemented a multi-layer privacy protection framework. “First, annotators themselves can fully observe the data they generate … before deciding whether to submit it,” he told VentureBeat. The data then undergoes manual verification for privacy issues and automated scanning by a large model to detect any remaining sensitive content before release. “This layered process ensures enterprise-grade robustness for environments handling sensitive or financial data,” Wang added.

To accelerate evaluation, the team also curated AgentNetBench, an offline benchmark that provides multiple correct actions for each step, offering a more efficient way to measure an agent’s performance.
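The value of allowing multiple correct actions per step can be illustrated with a small scoring sketch. The article does not describe AgentNetBench’s actual matching rules, so the exact-string comparison, the `step_accuracy` function, and the action strings below are assumptions for illustration only.

```python
from typing import List, Set


def step_accuracy(predictions: List[str], accepted: List[Set[str]]) -> float:
    """Fraction of steps where the predicted action is one of the accepted ones."""
    if len(predictions) != len(accepted):
        raise ValueError("need exactly one prediction per benchmark step")
    if not predictions:
        return 0.0
    hits = sum(1 for pred, options in zip(predictions, accepted) if pred in options)
    return hits / len(predictions)


# Hypothetical three-step task: step 1 accepts either of two equivalent actions.
accepted = [
    {"click(OK)", "press(Enter)"},
    {"type(hello)"},
    {"scroll(down)"},
]
predictions = ["press(Enter)", "type(hi)", "scroll(down)"]
score = step_accuracy(predictions, accepted)
print(f"{score:.2f}")
```

With a single reference action per step, pressing Enter instead of clicking OK would count as an error even though it achieves the same result. Accepting a set of actions avoids penalizing such equivalent choices without having to run the agent in a live environment.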

A new recipe for training agents

The OpenCUA framework introduces a new pipeline for processing data and training computer-use agents. The first step converts the raw human demonstrations into clean state-action pairs suitable for training vision-language models (VLMs). However, the researchers found that simply training models on these pairs yields limited performance gains, even with large amounts of data.

OpenCUA chain-of-thought pipeline. Source: XLANG Lab at HKU

The key insight was to augment these trajectories with chain-of-thought (CoT) reasoning. This process generates a detailed “inner monologue” for each action, which includes planning, memory, and reflection. The structured reasoning is organized into three levels: a high-level observation of the screen, reflective thoughts that analyze the situation and plan the next steps, and finally the concise, executable action. This approach helps the agent develop a deeper understanding of the tasks.
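The three-level structure described above can be pictured as a simple record attached to every step. The sketch below is an assumption about how such a record might be represented and serialized into a text training target; the `ReasonedStep` class, its field names, and the `Observation` / `Thought` / `Action` labels are hypothetical and are not taken from the OpenCUA code.

```python
from dataclasses import dataclass


@dataclass
class ReasonedStep:
    """One trajectory step augmented with three levels of reasoning."""
    observation: str  # high-level description of what is on screen
    thought: str      # reflection on the situation and plan for the next step
    action: str       # concise, executable action

    def render(self) -> str:
        """Serialize the step as a text target, reasoning first, action last."""
        return (
            f"Observation: {self.observation}\n"
            f"Thought: {self.thought}\n"
            f"Action: {self.action}"
        )


step = ReasonedStep(
    observation="A settings dialog is open with an unsaved change.",
    thought="The change must be saved before closing, so click Save next.",
    action="click(Save)",
)
print(step.render())
```

Placing the action last means a model trained on this format has to produce its observation and plan before committing to an action, which is the general intuition behind chain-of-thought training.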

“We find natural language reasoning crucial for generalizable computer-use foundation models, helping CUAs internalize cognitive capabilities,” the researchers write.

This data synthesis pipeline is a general framework that companies can adapt to train agents on their own internal tools. According to Wang, an enterprise can record demonstrations of its own workflows and use the same “reflector” and “generator” pipeline to create the necessary training data. “This allows them to bootstrap a high-performing agent tailored to their internal tools without needing to handcraft reasoning traces manually,” he explained.

Putting OpenCUA to the test

The researchers applied the OpenCUA framework to train a range of open source VLMs, including variants of Qwen and Kimi-VL, with parameter sizes from 3 billion to 32 billion. The models were evaluated on a suite of online and offline benchmarks that test their ability to perform tasks and understand graphical user interfaces.

The 32-billion-parameter model, OpenCUA-32B, set a new state-of-the-art success rate among open source models on the OSWorld benchmark. It also surpassed OpenAI’s GPT-4o-based CUA and significantly closed the performance gap with Anthropic’s leading proprietary models.

OpenCUA shows a large improvement over base models (left) while competing with leading CUA models (right). Source: XLANG Lab at HKU

For enterprise developers and product leaders, the research offers several key findings. The OpenCUA method is broadly applicable, improving performance on models with different architectures (both dense and mixture-of-experts) and sizes. The trained agents also show strong generalization, performing well across a variety of tasks and operating systems.

According to Wang, the framework is particularly well suited to automating repetitive, labor-intensive enterprise workflows. “For example, in the AgentNet dataset, we already capture a few demonstrations involving EC2 instances on AWS and configuring annotation parameters on MTurk,” he told VentureBeat. “These tasks involve many sequential steps but follow repeatable patterns.”

However, Wang noted that bridging the gap to live deployment requires addressing key challenges around safety and reliability. “The biggest challenge in real deployment is safety and reliability: the agent must avoid mistakes that could inadvertently alter system settings or trigger harmful side effects beyond the intended task,” he said.

The researchers have released the code, dataset, and weights for their models.

As open source agents built on frameworks such as OpenCUA become more capable, they could fundamentally change the relationship between knowledge workers and their computers. Wang envisions a future in which proficiency in complex software becomes less important than the ability to clearly express goals to an AI agent.

He described two primary modes of work: “offline automation, where the agent leverages its broader software knowledge to pursue a task end-to-end,” and “online collaboration, where the agent responds in real time and works alongside the human, much like a colleague.” In essence, humans will provide the strategic “what,” while increasingly sophisticated AI agents handle the “how.”
