Google’s AI can now browse the web for you, click buttons, and fill out forms with Gemini 2.5 Computer Use

By [email protected]



Some of the largest providers of large language models (LLMs) have sought to move beyond multimodal chatbots and extend their models into "agents" that can take actions on the user’s behalf across websites. Recall OpenAI’s ChatGPT agent (formerly known as "Operator") and Anthropic’s Computer Use, both released within the past two years.

Now, Google is getting into the same game. Today, the search giant’s DeepMind AI lab unveiled a new, improved and specially trained version of its powerful Gemini 2.5 Pro LLM, known as "Gemini 2.5 Pro Computer Use," which can use a virtual browser to browse the web for you, retrieve information, fill out forms, and even take actions on websites, all from a single text prompt from the user.

"These are early days, but the model’s ability to interact with the web, such as scrolling, filling out forms, and navigating drop-down menus, is an important next step in building general-purpose agents," said Google CEO Sundar Pichai, as part of a longer statement on the social network X.

The model is not available to consumers directly from Google, though.

Instead, Google has partnered with another company, Browserbase, founded by former Twilio engineer Paul Klein in early 2024, which offers virtual "headless" web browsers intended for use by AI agents and applications. (A "headless" browser is one that does not require a graphical user interface, or GUI, to navigate the web, although in this case and others, Browserbase does display a graphical representation for the user.)

Users can try the new Gemini 2.5 Computer Use model directly on Browserbase and even compare it side-by-side with older, competing offerings from OpenAI and Anthropic in a new "Browser Arena" launched by the startup (although only one additional model can be selected alongside Gemini at a time).

For AI builders and developers, it is offered as a raw, albeit proprietary, LLM through the Gemini API in Google AI Studio for rapid prototyping, and through Google Cloud’s Vertex AI model selector and application-building platform.

The new offering builds on the capabilities of Gemini 2.5 Pro, which was released back in March 2025 but has been significantly updated several times since then, with a particular focus on enabling AI agents to interact directly with user interfaces, including browsers and mobile applications.

Overall, Gemini 2.5 Computer Use appears designed to let developers create agents that can independently complete interface-based tasks such as clicking, typing, scrolling, filling out forms, and navigating behind login screens.

Instead of relying solely on APIs or structured input, this model allows AI systems to interact with software visually and functionally, just as a human would.

Brief hands-on user tests

In my brief, unscientific initial hands-on tests on Browserbase, Gemini 2.5 Computer Use successfully navigated to Taylor Swift’s official website as instructed and gave me a summary of what was being sold or promoted at the top: a special edition of her latest album, "The Life of a Showgirl."

In another test, I had Gemini 2.5 Computer Use search Amazon for highly rated and well-reviewed solar lights that I could put in my backyard, and I was pleased to watch it successfully complete a Google Search CAPTCHA designed to weed out non-human users ("Check all the boxes with a motorcycle."). It did so within seconds.

However, once it got past that point, it stalled and was unable to complete the task, despite serving up a "task completed" message.

I should also note here that while OpenAI’s ChatGPT agent and Anthropic’s Claude can create and edit local files (such as PowerPoint presentations, spreadsheets, or text documents) on behalf of the user, Gemini 2.5 Computer Use does not currently provide direct file system access or native file creation capabilities.

Instead, it is designed to control and navigate web and mobile user interfaces through actions such as clicking, typing, and scrolling. Its output is limited to suggested UI actions or chatbot-style text responses; any structured output such as a document or file must be handled separately by a developer, often through custom code or third-party integrations.

Performance benchmarks

Google says Gemini 2.5 Computer Use has demonstrated leading results in multiple interface control benchmarks, especially when compared to other major AI systems including Claude Sonnet and OpenAI’s agent-based models.

Evaluations were conducted via Browserbase and Google’s own testing.

Some highlights include:

  • Online-Mind2Web (Browserbase): 65.7% for Gemini 2.5 vs. 61.0% (Claude Sonnet 4) and 44.3% (OpenAI Agent)

  • WebVoyager (Browserbase): 79.9% for Gemini 2.5 vs. 69.4% (Claude Sonnet 4) and 61.0% (OpenAI Agent)

  • AndroidWorld (DeepMind): 69.7% for Gemini 2.5 vs. 62.1% (Claude Sonnet 4); the OpenAI model could not be measured due to lack of access

  • OSWorld: not currently supported by Gemini 2.5; the top competitor’s score was 61.4%

In addition to strong accuracy, Google reports that the model runs with lower latency than other browser control solutions — a key factor in production use cases like UI automation and testing.

How it works

Agents powered by the Computer Use model operate within an interaction loop. They receive:

  • User task prompt

  • Screenshot of the interface

  • History of past actions

The model analyzes this input and produces a recommended user interface action, such as clicking a button or typing in a field.

If necessary, it can request confirmation from the end user for riskier tasks, such as making a purchase.

Once the action is performed, the interface state is updated and a new screenshot is sent back to the model. The loop continues until the task is completed or stopped due to an error or safety decision.
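As a rough illustration, the loop described above might be sketched in Python as follows. This is not Google’s SDK: the names `Action`, `Turn`, `run_agent`, `propose`, `execute`, and `screenshot` are all hypothetical, and the model call is replaced by a scripted stub so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    """A proposed UI action; `name` and `args` are illustrative fields."""
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class Turn:
    """One model response: either a next action or a final text answer."""
    action: Optional[Action] = None
    final_text: Optional[str] = None


def run_agent(task, propose, execute, screenshot, max_steps=10):
    """Repeat screenshot -> propose -> execute until the model stops or the cap is hit."""
    history = []
    for _ in range(max_steps):
        # The model sees the task, the current screen, and the past actions.
        turn = propose(task, screenshot(), history)
        if turn.action is None:
            return turn.final_text, history
        execute(turn.action)
        history.append(turn.action)
    return None, history  # step budget exhausted without a final answer


# Scripted stand-in for the model: click, type, then report completion.
SCRIPT = [
    Action("click_at", {"x": 500, "y": 120}),
    Action("type_text_at", {"x": 500, "y": 120, "text": "solar lights"}),
]


def fake_propose(task, shot, history):
    if len(history) < len(SCRIPT):
        return Turn(action=SCRIPT[len(history)])
    return Turn(final_text="done")


executed = []
answer, history = run_agent(
    "search for solar lights", fake_propose, executed.append, lambda: b"png-bytes"
)
print(answer, [a.name for a in history])
```

The `max_steps` cap is a design choice of this sketch rather than something the article describes; it simply keeps a misbehaving loop from running forever.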

The model uses a specialized tool called computer_use, which can be integrated into custom environments using tools such as Playwright or via the Browserbase demo sandbox.
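To give a sense of what that integration could look like, here is a minimal sketch of a dispatcher that turns a proposed action into calls on a Playwright-style page object. The function `apply_action`, the argument keys (`x`, `y`, `text`, `direction`), the 600-pixel scroll step, and the assumption that coordinates have already been converted to pixels are all mine, not taken from Google’s documentation. The `FakePage` class is a stand-in that records calls so the example runs without a browser; with Playwright installed, a real `page` exposes the same `mouse.click`, `keyboard.type`, and `mouse.wheel` methods.

```python
def apply_action(page, name: str, args: dict) -> None:
    """Translate one proposed action into calls on a Playwright-style page."""
    if name == "click_at":
        page.mouse.click(args["x"], args["y"])
    elif name == "type_text_at":
        page.mouse.click(args["x"], args["y"])  # focus the field first
        page.keyboard.type(args["text"])
    elif name == "scroll_document":
        # Hypothetical fixed step; positive delta scrolls down.
        dy = 600 if args.get("direction", "down") == "down" else -600
        page.mouse.wheel(0, dy)
    else:
        raise ValueError(f"unsupported action: {name}")


class _Recorder:
    """Stand-in for page.mouse or page.keyboard that logs calls instead of acting."""

    def __init__(self, log, prefix):
        self._log = log
        self._prefix = prefix

    def __getattr__(self, method):
        # Any method call is appended to the shared log as (name, args).
        return lambda *a: self._log.append((f"{self._prefix}.{method}", a))


class FakePage:
    """Minimal fake exposing the .mouse and .keyboard attributes used above."""

    def __init__(self):
        self.calls = []
        self.mouse = _Recorder(self.calls, "mouse")
        self.keyboard = _Recorder(self.calls, "keyboard")


page = FakePage()
apply_action(page, "type_text_at", {"x": 720, "y": 450, "text": "hi"})
print(page.calls)
```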

Use cases and adoption

According to Google, teams inside and outside the company have already begun using the model across several areas:

  • Google’s payments team reports that Gemini 2.5 Computer Use successfully recovered more than 60% of failed test executions, reducing a major source of engineering inefficiency.

  • Autotab, a third-party AI agent platform, said the model outperformed others on complex data analysis tasks, boosting performance by up to 18% on its toughest assessments.

  • Poke.com, a proactive AI assistant provider, says that the Gemini model often works 50% faster than competing solutions during interface interactions.

The model is also used in Google’s own product development efforts, including Project Mariner, the Firebase Testing Agent, and AI Mode in Search.

Safety measures

Because this model directly controls software interfaces, Google emphasizes a layered approach to safety:

  • A per-step safety service inspects each proposed action before execution.

  • Developers can specify system-level instructions to block or require confirmation for specific actions.

  • The model has built-in safeguards to avoid actions that might compromise security or violate Google’s prohibited usage policies.

For example, if the model encounters a CAPTCHA, it will generate an action to click on the checkbox and mark it as requiring user confirmation, ensuring that the system does not proceed without human supervision.
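On the developer’s side, honoring that confirmation flag could look something like the sketch below. It is only an illustration of the idea, not Google’s safety service: the function `gate_action`, the `requires_confirmation` key, and the `blocked` set are hypothetical names chosen for this example.

```python
from typing import Callable


def gate_action(
    action: dict,
    confirm: Callable[[dict], bool],
    blocked: frozenset = frozenset(),
) -> str:
    """Decide what to do with a proposed action: 'blocked', 'declined', or 'run'."""
    # Developer-defined rule: some action names are never executed.
    if action["name"] in blocked:
        return "blocked"
    # Flagged actions only proceed if a human approves them.
    if action.get("requires_confirmation") and not confirm(action):
        return "declined"
    return "run"


captcha_click = {"name": "click_at", "requires_confirmation": True}
print(gate_action(captcha_click, confirm=lambda a: False))  # declined
print(gate_action(captcha_click, confirm=lambda a: True))   # run
```

In a real agent the `confirm` callback would prompt the end user; here it is a lambda so the behavior is easy to see.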

Technical capabilities

The model supports a wide range of built-in UI actions such as:

  • click_at, type_text_at, scroll_document, drag_and_drop, and more

  • User-defined functions can be added to extend its reach to mobile or custom environments

  • Screen coordinates are normalized (0–1000 scale) and translated back to pixel dimensions during execution

It accepts image and text input and outputs text responses or function calls to perform tasks. The recommended screen resolution for best results is 1440×900, although it can work with other sizes.
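The coordinate translation is simple arithmetic. As a small sketch, assuming the 0–1000 scale maps linearly onto the screen and using the recommended 1440×900 size as the example, a hypothetical `denormalize` helper might look like this (the clamping to the last valid pixel is my own choice, not something the article specifies):

```python
def denormalize(x_norm: int, y_norm: int, width: int, height: int) -> tuple:
    """Map 0-1000 normalized coordinates onto a width x height pixel screen."""
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        raise ValueError("normalized coordinates must lie in 0..1000")
    # Integer math avoids float rounding; clamp so 1000 lands on the last pixel.
    x = min(x_norm * width // 1000, width - 1)
    y = min(y_norm * height // 1000, height - 1)
    return x, y


print(denormalize(500, 500, 1440, 900))    # (720, 450)
print(denormalize(1000, 1000, 1440, 900))  # (1439, 899)
```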

API pricing remains nearly identical to Gemini 2.5 Pro

Pricing for Gemini 2.5 Computer Use aligns closely with the standard Gemini 2.5 Pro model. Both follow the same per-token billing structure: input tokens are priced at $1.25 per million tokens for prompts of less than 200,000 tokens, and $2.50 per million tokens for prompts longer than that.

Output tokens follow a similar split, priced at $10.00 per million for smaller responses and $15.00 for larger ones.
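For a back-of-the-envelope estimate, the rates above can be put into a short calculation. This sketch makes two assumptions that go beyond the figures quoted here: that both the input and output rates switch tiers based on prompt length, and that a prompt of exactly 200,000 tokens falls in the lower tier. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

TIER_THRESHOLD = 200_000  # prompt length, in tokens, where the higher rates begin


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request from the per-million-token rates."""
    long_prompt = input_tokens > TIER_THRESHOLD
    in_rate = Decimal("2.50") if long_prompt else Decimal("1.25")
    out_rate = Decimal("15.00") if long_prompt else Decimal("10.00")
    return (in_rate * input_tokens + out_rate * output_tokens) / 1_000_000


# 100,000 input tokens and 5,000 output tokens at the lower tier.
print(estimate_cost(100_000, 5_000))   # 0.175
# 300,000 input tokens and 10,000 output tokens at the higher tier.
print(estimate_cost(300_000, 10_000))  # 0.9
```

Because each step of an agent loop sends a fresh screenshot plus history, a multi-step task would be billed as the sum of several such requests.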

Where the models diverge is in availability and additional features.

Gemini 2.5 Pro includes a free tier that allows developers to use the model at no cost, with no explicit token cap published, although usage may be subject to rate limits or quota constraints depending on the platform (e.g. Google AI Studio).

This free access includes both input and output tokens. Once developers exceed their quota or switch to the paid tier, standard pricing per token applies.

In contrast, Gemini 2.5 Computer Use is available exclusively through the paid tier. There is no free access currently available for this model, and all usage incurs token-based fees from the start.

In terms of features, Gemini 2.5 Pro supports optional capabilities like context caching (starting at $0.31 per million tokens) and grounding with Google Search (free for up to 1,500 requests per day, then $35 for every additional 1,000 requests). These are not available for Computer Use at this time.

Another distinction is data handling: output from the Computer Use model is not used to improve Google products in the paid tier, while free-tier usage of Gemini 2.5 Pro contributes to improving the model unless explicitly opted out.

In general, developers can expect similar token-based costs across both models, but they should consider tier access, built-in capabilities, and data usage policies when determining which model best suits their needs.


