From hallucinations to hardware: Lessons from a real-world computer vision project gone sideways


By Shruti Tiwari and Vadiraj Kulkarni




Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at a photo of a laptop and identify any physical damage, things like cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image-capable large language models (LLMs), but it quickly turned into something more complicated.

Along the way, we ran into problems with hallucinations, unreliable outputs and images that were not even laptops. To solve them, we ended up applying an agentic framework in an atypical way: not to automate tasks, but to improve the model's performance.

In this post, we will walk through what we tried, what didn't work and how the combination of approaches ultimately helped us build something reliable.

Where we started: monolithic prompting

Our initial approach was fairly standard for a multimodal model. We used a single, large prompt to pass an image to an image-capable LLM and asked it to identify visible damage. This monolithic approach is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.
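The article doesn't say which model or SDK the team used, so purely as an illustration, here is a minimal sketch of the monolithic pattern using the OpenAI Python client; the model name, prompt wording and helper function are assumptions, not the authors' actual setup:

```python
import base64
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are inspecting a laptop for physical damage. List any visible "
    "damage such as cracked screens, missing keys or broken hinges. "
    "If there is no damage, reply 'no damage'."
)

def detect_damage(image_path: str) -> str:
    """One big prompt, one image-capable LLM call: the monolithic approach."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any image-capable model works here (assumed choice)
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```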

We faced three main issues early:

  • Hallucinations: The model would sometimes invent damage that didn't exist or mislabel what it saw.
  • Junk image detection: There was no reliable way to flag images that weren't even laptops; photos of desks, walls or people sometimes received nonsensical damage reports.
  • Inconsistency: The combination of these problems made the model too unreliable for operational use.

This was the point at which it became clear we would need to iterate.

First fix: mixing image resolutions

One thing we noticed was how much image quality affected the model's output. Users uploaded all kinds of images, ranging from sharp and high-resolution to blurry. That led us to research highlighting how image resolution affects deep learning models.

We trained and tested the model using a mix of high- and low-resolution images. The idea was to make the model more resilient to the wide range of image qualities it would encounter in practice. This helped improve consistency, but the core issues of hallucination and junk image handling persisted.
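The article doesn't share the training pipeline; one common way to expose a model to mixed image quality is an augmentation that randomly degrades resolution during training. Here is a minimal sketch with torchvision, where the scale choices and target size are assumptions:

```python
import random
from PIL import Image
from torchvision import transforms

class RandomDownscale:
    """Downscale then upscale an image to simulate blurry, low-quality uploads."""

    def __init__(self, scales=(0.25, 0.5, 0.75, 1.0)):
        self.scales = scales

    def __call__(self, img: Image.Image) -> Image.Image:
        scale = random.choice(self.scales)
        if scale < 1.0:
            w, h = img.size
            small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))),
                               Image.BILINEAR)
            img = small.resize((w, h), Image.BILINEAR)  # restore original size
        return img

# Mix pristine and degraded images in the same training distribution.
train_transform = transforms.Compose([
    RandomDownscale(),
    transforms.Resize((448, 448)),  # assumed model input size
    transforms.ToTensor(),
])
```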

The multimodal detour: A text-only LLM goes multimodal

Encouraged by recent experiments in which captions are generated from images and then interpreted by a text-only language model, we decided to give the approach a try.

Here is how it works:

  • The LLM starts by generating multiple possible captions for an image.
  • Another model, a multimodal embedding model, checks how well each caption fits the image. In our case, we used SigLIP to score the similarity between the image and the text (see the sketch after this list).
  • The system keeps the top captions based on these scores.
  • The LLM uses those top captions to write new ones, trying to get closer to what the image actually shows.
  • The process repeats until the captions stop improving or a set iteration limit is hit.
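SigLIP is the one component the article names. As a sketch of the scoring step only, here is how image-caption similarity can be computed with the Hugging Face transformers implementation; the checkpoint, example captions and file name are illustrative assumptions:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

checkpoint = "google/siglip-base-patch16-224"  # an assumed public checkpoint
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

def score_captions(image: Image.Image, captions: list[str]) -> list[float]:
    """Return an image-text similarity score for each candidate caption."""
    inputs = processor(text=captions, images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits_per_image[0].tolist()  # one score per caption

# Keep the best-matching captions, as in the loop described above.
image = Image.open("laptop.jpg")  # hypothetical example image
candidates = [
    "a laptop with a cracked screen",
    "a laptop with several missing keys",
    "an undamaged laptop on a desk",
]
scores = score_captions(image, candidates)
top_caption = max(zip(scores, candidates))[1]
```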

While clever in theory, this approach introduced new problems for our use case:

  • Persistent hallucinations: The captions sometimes included fabricated damage, which the LLM then confidently reported.
  • Incomplete coverage: Even with multiple captions, some issues were missed entirely.
  • Increased complexity, minimal benefit: The added steps made the system more complex without reliably outperforming the earlier setup.

It was an interesting experiment, but ultimately not a solution.

A creative use of agentic frameworks

This was the turning point. Agentic frameworks are usually used to orchestrate task flows (think agents coordinating evaluations or customer service workflows), but we wondered whether breaking the image interpretation task into smaller, specialized agents might help.

We built an agentic framework structured like this:

  • Orchestrator agent: Examined the image and identified which laptop components were visible (screen, keyboard, chassis, ports). A minimal sketch of the pattern follows this list.
  • Component agents: Dedicated agents inspected each component for specific damage types; one for cracked screens, another for missing keys, for example.
  • Junk detection agent: A separate agent flagged whether the image was a laptop in the first place.
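The article doesn't name the framework the team used, but the pattern itself is simple enough to sketch in plain Python. Below, `ask` stands in for any function that sends a prompt plus an image to a vision LLM and returns its text answer (for instance, an adapted version of the `detect_damage` wrapper sketched earlier); all prompts and component names are illustrative:

```python
from typing import Callable

# (prompt, image_path) -> model's text answer; plug in any vision-LLM wrapper.
AskFn = Callable[[str, str], str]

COMPONENT_PROMPTS = {
    "screen": "Inspect only the screen. Is it cracked or scratched? "
              "Reply 'ok' or describe the damage.",
    "keyboard": "Inspect only the keyboard. Are keys missing or broken? "
                "Reply 'ok' or describe the damage.",
    "hinges": "Inspect only the hinges. Are they broken or loose? "
              "Reply 'ok' or describe the damage.",
}

def is_laptop(ask: AskFn, image_path: str) -> bool:
    """Junk detection agent: flag images that aren't laptops at all."""
    answer = ask("Does this photo show a laptop? Answer yes or no.", image_path)
    return answer.strip().lower().startswith("yes")

def inspect_laptop(ask: AskFn, image_path: str) -> dict:
    """Orchestrator: identify visible components, then dispatch component agents."""
    if not is_laptop(ask, image_path):
        return {"junk": True}
    visible = ask("Which of these components are visible: screen, keyboard, hinges?",
                  image_path)
    report = {"junk": False}
    for component, prompt in COMPONENT_PROMPTS.items():
        if component in visible.lower():
            report[component] = ask(prompt, image_path)  # one focused agent per part
    return report
```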

This modular, task-based approach produced far more accurate and interpretable results. Hallucinations dropped significantly, junk images were flagged reliably, and each agent's task was simple and focused enough to keep quality under control.

Blind spots: where the agentic approach fell short

As effective as it was, the approach wasn't perfect. Two main limitations emerged:

  • Increased latency: Running multiple sequential agents added to the total inference time.
  • Coverage gaps: Agents could only detect problems they were explicitly programmed to look for. If an image showed something unexpected that no agent was assigned to identify, it would go unnoticed.

We needed a way to balance precision with coverage.

The hybrid solution: combining agents with monolithic prompting

To close the gaps, we built a hybrid system:

  1. The agentic framework ran first, handling precise detection of known damage types and filtering out junk images. We limited the agents to the most essential ones to keep latency down.
  2. Then, a monolithic image-prompt LLM scanned the image for anything the agents might have missed.
  3. Finally, we fine-tuned the model on a curated set of high-priority images, such as frequently reported damage scenarios, to further improve accuracy and reliability.

This combination gave us the precision and interpretability of the agentic setup, the broad coverage of monolithic prompting and the extra confidence of targeted fine-tuning.
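Put together, the first two steps amount to running the focused agents and then one broad sweep. A minimal sketch, reusing `AskFn` and `inspect_laptop` from the agentic sketch above; the catch-all prompt is an assumption, and the fine-tuning step is omitted since it happens offline:

```python
def hybrid_inspect(ask: AskFn, image_path: str) -> dict:
    # Step 1: agentic pass for precise, known damage types and junk filtering.
    report = inspect_laptop(ask, image_path)
    if report.get("junk"):
        return report
    # Step 2: one monolithic catch-all prompt for anything the agents missed.
    report["other"] = ask(
        "Describe any other visible physical damage on this laptop, beyond "
        "the screen, keyboard and hinges. Reply 'none' if there is nothing else.",
        image_path,
    )
    return report
```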

What we learned

A few things became clear by the time we wrapped up this project:

  • Agentic frameworks are more versatile than they get credit for: While usually associated with workflow management, we found they can meaningfully boost model performance when applied in a structured, modular way.
  • Blending approaches beats relying on just one: The combination of precise, agent-based detection with the broad coverage of LLMs, plus a bit of fine-tuning where it mattered most, gave us far more reliable results than any single method alone.
  • Visual models are prone to hallucination: Even the most sophisticated setups can jump to conclusions or see things that aren't there. It takes thoughtful system design to keep those mistakes in check.
  • Varied image quality makes a difference: Training and testing with both crisp, high-resolution images and everyday low-quality ones helped the model stay resilient against the unpredictable photos it faced in the real world.
  • You need a way to catch junk images: A dedicated junk detection check was one of the simplest things we added, and it had an outsized impact on the system's overall reliability.

Final thoughts

What started as a simple idea, using an LLM prompt to detect physical damage in laptop images, quickly turned into a much deeper experiment in combining different AI techniques to tackle unpredictable, real-world problems. Along the way, we realized that some of the most useful tools were ones not originally designed for this kind of work.

Agentic frameworks, often seen as workflow tools, proved surprisingly effective when repurposed for tasks like structured damage detection and image filtering. With a bit of creativity, they helped us build a system that was not just more accurate, but easier to understand and maintain in practice.

Shruti Tiwari is an AI product manager at Dell Technologies.

Vadiraj Kulkarni is a data scientist at Dell Technologies.



