Breaking the data bottleneck: Salesforce ProVision accelerates multimodal AI training



As companies around the world double down on their AI projects, the availability of high-quality training data has become a major bottleneck. With the public internet largely exhausted as a data source, major players like OpenAI and Google are securing exclusive partnerships to expand their datasets, further restricting others' access.

To address this growing concern, Salesforce has taken a big step in the field of visual training data. The company has just introduced ProVision, a new framework that programmatically generates visual instruction data. These datasets are systematically compiled to enable the training of high-performance multimodal language models (MLMs) that can answer questions about images.

The company has already released the ProVision-10M dataset built with this approach and is using it to boost the performance and accuracy of various multimodal AI models.

For data professionals, this framework represents a major advance. By programmatically generating high-quality visual instruction data, ProVision reduces reliance on limited or inconsistent datasets, a common challenge in training multimodal systems.

Furthermore, the ability to systematically compile datasets ensures better control, scalability, and consistency, enabling faster iteration cycles and reducing the cost of domain-specific data acquisition. The work complements ongoing research in synthetic data generation and comes just one day after Nvidia launched Cosmos, a set of world foundation models purpose-built to generate physics-based videos from inputs such as text, images, and video for physical AI training.

Visual instruction data: a key component of multimodal artificial intelligence

Today, instruction datasets sit at the core of AI pre-training and fine-tuning. These specialized datasets help models effectively follow and respond to specific instructions or queries. In the case of multimodal AI, models learn to analyze content such as images by training on a range of data points accompanied by question-answer pairs, known as visual instruction data, that describe them.
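To make that concrete, a single visual instruction record usually pairs an image with one or more question-answer turns. Below is a minimal sketch of what such a record might look like; the field names are illustrative, not ProVision's exact schema.

```python
# Illustrative visual instruction record: an image paired with
# question-answer turns that a multimodal model learns to follow.
# Field names are hypothetical, not ProVision's exact schema.
record = {
    "image": "images/street_0001.jpg",
    "conversations": [
        {"question": "What color is the car on the left?", "answer": "Red."},
        {"question": "How many pedestrians are crossing?", "answer": "Two."},
    ],
}
```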

Now, here’s the thing: producing visual instruction datasets is very difficult. If an organization generates the data manually for each training image, it ends up spending a lot of time and human resources to complete the project. If it instead opts to use proprietary language models for the task, it has to contend with high computational costs and the risk of hallucinations, where the quality and accuracy of the question-answer pairs may not be good enough.

Moreover, relying on such models is a black-box mechanism: it makes it difficult to precisely explain the data generation process and to control or customize the outputs.

Enter Salesforce ProVision

To address these gaps, Salesforce’s AI research team came up with ProVision, a framework that uses scene graphs in conjunction with human-written programs to systematically synthesize vision-centric instruction data.

In essence, a scene graph is a structured representation of an image’s semantics, where the objects in the content are represented as nodes. The attributes of each object, such as color or size, are mapped directly to its node, while the relationships between objects are depicted as directed edges connecting the corresponding nodes. These representations can be obtained from manually annotated datasets such as Visual Genome, or generated with a scene graph generation pipeline that combines several state-of-the-art vision models covering different aspects of image semantics, from object and attribute detection to depth estimation.
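As a rough illustration, such a graph can be encoded with a handful of plain Python dictionaries: objects become nodes, attributes hang off those nodes, and relations become directed edges between node IDs. The layout below is purely illustrative, not the exact Visual Genome or ProVision schema.

```python
# Toy scene graph: objects as nodes, attributes attached to nodes,
# relations as directed edges between node ids. Illustrative layout
# only; Visual Genome and ProVision define their own schemas.
scene_graph = {
    "objects": {
        "o1": {"name": "car", "attributes": ["red"], "depth": 4.2},
        "o2": {"name": "pedestrian", "attributes": ["walking"], "depth": 6.8},
        "o3": {"name": "building", "attributes": ["tall"], "depth": 15.0},
    },
    "relations": [
        ("o2", "next to", "o1"),       # pedestrian next to car
        ("o1", "in front of", "o3"),   # car in front of building
    ],
}
```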

Once the scene graphs are ready, they are fed to Python programs built around textual templates, which act as full-fledged data generators capable of producing question-answer pairs for AI training pipelines.

“Each data generator uses hundreds of pre-defined templates, which systematically combine these annotations to produce diverse instruction data. These generators are designed to compare, retrieve, and reason about basic visual concepts of objects, attributes, and relationships, based on the detailed information encoded in each scene graph,” the researchers behind the framework wrote in a paper.
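To make the template mechanism concrete, here is a minimal, hypothetical sketch of such generators in Python, reusing the toy scene-graph shape from above. The templates and helper functions are illustrative, not ProVision’s actual code.

```python
# Hypothetical template-driven data generators in the spirit of
# ProVision: each one fills question/answer templates from scene
# graph annotations. A sketch, not the framework's actual code.

scene_graph = {  # toy graph in the shape sketched earlier
    "objects": {
        "o1": {"name": "car", "attributes": ["red"], "depth": 4.2},
        "o2": {"name": "pedestrian", "attributes": ["walking"], "depth": 6.8},
    },
    "relations": [("o2", "next to", "o1")],
}

def attribute_questions(graph):
    """Retrieve attributes: one Q&A pair per annotated attribute."""
    for obj in graph["objects"].values():
        for attr in obj["attributes"]:
            yield f"Is the {obj['name']} {attr}?", "Yes."

def relation_questions(graph):
    """Interpret relations encoded as directed edges between nodes."""
    objs = graph["objects"]
    for subj, rel, targ in graph["relations"]:
        yield (f"What is the relationship between the {objs[subj]['name']} "
               f"and the {objs[targ]['name']}?",
               f"The {objs[subj]['name']} is {rel} the {objs[targ]['name']}.")

def depth_questions(graph):
    """Compare depth annotations to answer 'which is closer' questions."""
    items = list(graph["objects"].values())
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            closer = a if a["depth"] < b["depth"] else b
            yield (f"Which is closer to the camera, the {a['name']} "
                   f"or the {b['name']}?", f"The {closer['name']}.")

# Compile instruction data by running every generator over the graph.
qa_pairs = [qa for gen in (attribute_questions, relation_questions,
                           depth_questions) for qa in gen(scene_graph)]
```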

Creating instruction data with Salesforce ProVision

ProVision-10M dataset for AI training

In its work, Salesforce used both approaches — augmenting manually annotated scene graphs and creating scene graphs from scratch — to prepare scene graphs that powered 24 single-image data generators and 14 multi-image generators.

“Using these data generators, we can automatically compile questions and answers given the scene graph of an image. For example, given an image of a busy street, ProVision can generate questions such as, ‘What is the relationship between the pedestrian and the car?’ or ‘Which is closer to the building, the red car or the pedestrian?’” lead researchers Jieyu Zhang and Le Xue noted in a blog post.

Data generators following the first approach, which augments scene graphs in Visual Genome with depth and segmentation annotations from Depth Anything V2 and SAM-2, helped produce 1.5 million single-image instruction data points and 4.2 million multi-image instruction data points. Meanwhile, the second, drawing on 120,000 high-resolution images from the DataComp dataset and models such as YOLO-World, CoCa, LLaVA-1.5, and Osprey, generated 2.3 million single-image instruction data points and 4.2 million multi-image instruction data points.

Overall, the four subsets together make up ProVision-10M, a dataset containing more than 10 million unique instruction data points. It is now available on Hugging Face and has already proven its effectiveness in AI training pipelines.
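For practitioners who want to inspect it, the dataset can presumably be pulled with the Hugging Face datasets library. The sketch below assumes the repo id Salesforce/ProVision-10M and a train split; check the dataset card for the actual configuration names.

```python
# Minimal sketch of streaming a few records from the dataset on
# Hugging Face. The repo id, split, and field names are assumptions;
# consult the dataset card for the exact configuration.
from datasets import load_dataset

ds = load_dataset("Salesforce/ProVision-10M", split="train", streaming=True)

# Peek at a handful of instruction records without a full download.
for i, example in enumerate(ds):
    print(example)  # expected: an image reference plus Q&A instruction fields
    if i == 2:
        break
```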

Specifically, when the company integrated ProVision-10M into its multimodal AI fine-tuning recipes, LLaVA-1.5 for single-image instruction data and Mantis-SigLIP-8B for multi-image instruction data, it saw notable improvements, with average model performance better than fine-tuning without ProVision data.

“When adopted in the instruction fine-tuning stage, our single-image instruction data yields an improvement of up to 7% on the 2D split and 8% on the 3D split of CVBench, along with a 3% performance increase on QBench2, RealWorldQA, and MMMU. Our multi-image instruction data leads to an 8% improvement on Mantis-Eval,” the researchers noted.

Fine-tuning with the ProVision dataset

Synthetic data is here to stay

While there are several tools and platforms, including Nvidia’s new Cosmos world foundation models, for generating different types of data (from images to videos) that can be used to train multimodal AI, only a few have tackled the problem of creating the instruction datasets that pair with that data.

Salesforce addresses this bottleneck with ProVision, giving organizations a way to go beyond manual labeling or black-box language models. The programmatic approach to data generation ensures the generation process can be interpreted and controlled efficiently while maintaining real-world accuracy.

In the long term, the company hopes researchers will build on this work to enhance scene graph generation pipelines and create more data generators covering new types of instruction data, such as those for videos.


