Will updating your AI agents help or hinder their performance? The new Raindrop experimentation tool tells you

It seems that almost every week in the two years since ChatGPT's launch, new large language models (LLMs) have been released by competing labs or by OpenAI itself. Businesses are under intense pressure just to keep up with the exponential pace of change, let alone understand how to adapt to it: which, if any, of these new models should they adopt to power their workflows and the custom AI agents they build on top of them?

Help has arrived: AI application observability company Raindrop has launched Experiments, a new analytics feature the company describes as the first A/B testing suite designed specifically for enterprise AI agents. It lets companies see and compare how updating agents to new underlying models, or changing their instructions and tool access, affects their performance with real end users.

This release expands Raindrop's existing monitoring tools, giving developers and teams a way to see how their agents behave and evolve in real-world conditions.

Using Experiments, teams can track how changes, such as a new tool or prompt, a model update, or a rebuilt pipeline, affect AI performance across millions of user interactions. The new feature is available now to users on Raindrop's Pro subscription plan ($350 per month) at Raindrop.ai.

A data-driven lens on agent development

Raindrop co-founder and CTO Ben Hylak noted in the product announcement video that Experiments lets teams measure the effect of changing "literally anything," including tool usage, user intent, release rates, and differences across demographics such as language. The goal is to make model iteration more transparent and measurable.

The Experiments interface displays results visually, showing when an experiment is performing better or worse than its baseline. Increases in negative signals may indicate greater task failure or partial code output, while improvements in positive signals can reflect more complete responses or better user experiences.

By making this data easier to interpret, Raindrop encourages AI teams to approach agent iteration with the same rigor as modern software deployment: tracking results, sharing insights, and addressing regressions before they worsen.

Background: from AI observability to experimentation

The Experiments launch builds on the company's foundation as one of the first AI-native observability platforms, designed to help organizations monitor and understand how their generative AI systems behave in production.

As VentureBeat reported earlier this year, the company, originally known as Dawn AI, emerged to address what Hylak, a former human interface designer at Apple, calls the "black box problem" of AI performance: helping teams spot failures "as they happen" and explain to organizations what went wrong and why.

At the time, Hylak described how "AI products constantly fail, in ways that are both funny and terrifying," noting that unlike traditional software, which throws explicit exceptions, "AI products fail silently." The original Raindrop platform focused on detecting these silent failures by analyzing signals such as user feedback, task failures, refusals, and other conversational anomalies across millions of daily events.

The company's co-founders, Hylak, Alexis Gauba, and Zubin Singh Koticha, created Raindrop after running into the difficulty of debugging AI systems in production themselves.

"We started by building AI products, not infrastructure," Hylak told VentureBeat. "But very quickly, we realized that to grow anything serious, we needed tools to understand AI behavior, and those tools didn't exist."

With Experiments, Raindrop extends that same mission from detecting failures to measuring improvements. The new tool turns observability data into actionable comparisons, allowing organizations to test whether changes to models, prompts, or pipelines make their AI agents better or just different.

Solving the "evals pass, agents fail" problem

Although traditional evaluation frameworks are useful for measuring performance, they rarely capture the unpredictable behavior of AI agents operating in dynamic environments.

As Raindrop co-founder Alexis Gauba explained in a LinkedIn announcement: "Traditional evals don't really answer this question. They're great unit tests, but you can't predict your users' actions, and your agent runs for hours, calling hundreds of tools."

Gauba said the company kept hearing a common frustration from teams: "Evals pass, agents fail."

Experiments aims to fill this gap by showing what actually changes when developers ship updates to their systems.

The tool allows side-by-side comparisons across models, tools, intents, or properties, surfacing measurable differences in behavior and performance.

Designed for real-world AI behavior

In the announcement video, Raindrop described Experiments as a way to "compare anything and measure how your agent's behavior actually changes in production across millions of real interactions."

The platform helps users detect issues such as task failure spikes, forgetfulness, or new tools that lead to unexpected errors.

It can also be used in reverse: starting with a known problem, such as an agent getting stuck in a loop, and tracing it back to the model, tool, or flag that triggers it.

From there, developers can drill down into the details to find the root cause and quickly ship a fix.

Each experiment provides a visual analysis of metrics such as tool usage frequency, error rates, conversation duration, and response length.

Users can click into any comparison to access the underlying event data, giving them a clear view of how agent behavior has changed over time. Shareable links make it easy to collaborate with teammates or report results.
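To make that concrete, here is a minimal, hypothetical sketch of a baseline-versus-experiment comparison over agent event logs, written in Python with pandas. The cohort labels, column names, and sample rows are invented for illustration; they do not reflect Raindrop's actual schema or API.

```python
# Hypothetical sketch of a baseline-vs-experiment comparison over agent event logs.
# Column names and data are invented; this is not Raindrop's schema or API.
import pandas as pd

events = pd.DataFrame([
    # cohort, conversation_id, tool_calls, errored, response_chars, duration_s
    ("baseline",   "c1", 4, False, 1200, 35.0),
    ("baseline",   "c2", 7, True,   300, 80.0),
    ("experiment", "c3", 3, False, 1500, 28.0),
    ("experiment", "c4", 5, False, 1100, 40.0),
], columns=["cohort", "conversation_id", "tool_calls", "errored",
            "response_chars", "duration_s"])

# Aggregate the kinds of metrics the article mentions: tool-usage frequency,
# error rate, response length, and conversation duration, per cohort.
summary = events.groupby("cohort").agg(
    conversations=("conversation_id", "nunique"),
    avg_tool_calls=("tool_calls", "mean"),
    error_rate=("errored", "mean"),
    avg_response_chars=("response_chars", "mean"),
    avg_duration_s=("duration_s", "mean"),
)
print(summary)

# A simple "better or worse than baseline" readout, analogous to the
# positive/negative signals the Experiments UI surfaces.
delta = summary.loc["experiment"] - summary.loc["baseline"]
print("\nChange vs. baseline:\n", delta)
```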

Integration, scalability and accuracy

According to Hylak, Experiments integrates directly with “premium platforms that companies know and love (like Statsig!)” and is designed to work seamlessly with existing measurement and analytics pipelines.

Companies that lack these integrations can still compare performance over time, such as yesterday versus today, without additional setup.

Teams typically need about 2,000 users per day to produce statistically significant results, Hylak said.

To ensure the accuracy of comparisons, Experiments monitors the adequacy of sample size and alerts users if a test lacks enough data to draw valid conclusions.
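That guidance follows the standard statistics of an A/B test. As a rough, hypothetical illustration (not Raindrop's implementation), the two-proportion z-test below shows why a single day of roughly 1,000 users per arm is usually too little to confirm a one-point drop in task-failure rate, while ten times that volume is plenty:

```python
# Hypothetical illustration: a two-proportion z-test on task-failure rates,
# the kind of significance check an A/B suite runs before declaring a winner.
# This is NOT Raindrop's implementation; the numbers below are made up.
import math

def two_proportion_z_test(failures_a, n_a, failures_b, n_b):
    """Return (z, two-sided p-value) for the difference between two failure rates."""
    p_a, p_b = failures_a / n_a, failures_b / n_b
    p_pool = (failures_a + failures_b) / (n_a + n_b)             # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))                   # two-sided p-value
    return z, p_value

# Baseline agent: 6% task failures; updated agent: 5% task failures.
# With ~1,000 users per arm (about one day at 2,000 users/day), p ≈ 0.33,
# i.e. not significant; with 10x the traffic, p drops below 0.01.
for n in (1_000, 10_000):
    z, p = two_proportion_z_test(int(0.06 * n), n, int(0.05 * n), n)
    print(f"n per arm = {n:>6}: z = {z:.2f}, p = {p:.3f}")
```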

"We're obsessed with making sure that metrics like task failure and user frustration are metrics that you can alert an on-call engineer to," Hylak explained. Teams can dig into the specific conversations or events driving those metrics, ensuring transparency behind each aggregate number, he added.

Security and data protection

Raindrop operates as a cloud-hosted platform but also offers personally identifiable information (PII) redaction for organizations that need additional control.

Hylak said the company is SOC 2 compliant and has launched PII Keeper, a feature that uses AI to automatically remove sensitive information from stored data. "We take customer data protection very seriously," he stressed.
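For readers unfamiliar with redaction, the sketch below shows the basic idea in its simplest form: scrubbing recognizable PII from text before it is stored. It uses plain regular expressions purely as a stand-in; Raindrop describes its feature as AI-based, and nothing here reflects its actual implementation.

```python
# Deliberately simple, hypothetical PII-redaction sketch using regular expressions.
# Raindrop's feature is described as AI-based; this only illustrates the concept.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace recognizable PII substrings with typed placeholders before storage."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

print(redact("Contact Jane at jane@example.com or +1 (415) 555-0123."))
# -> Contact Jane at [EMAIL REDACTED] or [PHONE REDACTED].
```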

Pricing and plans

Experiments is part of Raindrop's Pro plan, which costs $350 per month or $0.0007 per interaction. The Pro tier also includes deep search tools, topic clustering, custom issue tracking, and semantic search capabilities.

Raindrop's Starter plan, at $65 per month or $0.001 per interaction, provides basic analytics including issue detection, user feedback flags, Slack alerts, and user tracking. Both plans come with a 14-day free trial.

Larger organizations can opt for the Enterprise plan, which offers custom pricing and advanced features such as single sign-on (SSO), custom alerts, integrations, PII redaction, and priority support.

Continuous improvement for AI systems

With Experiments, Raindrop positions itself at the intersection of AI analytics and software observability. Its focus on "measuring truth," as stated in the product video, reflects a broader industry push toward accountability and transparency in AI operations.

Instead of relying solely on offline benchmarks, Raindrop's approach centers on real user data and contextual understanding. The company hopes this will let AI developers iterate faster, identify root causes sooner, and ship better-performing models with confidence.


