A new project makes Wikipedia data more accessible to AI models



On Wednesday, Wikimedia Deutschland announced a new database that will make the wealth of knowledge on Wikipedia more accessible to AI models.

The system, called the Wikidata Embedding Project, applies vector-based semantic search (a technique that helps computers understand the meaning of and relationships between words) to the data on Wikipedia and its sister platforms, a collection of nearly 120 million entries.

Combined with new support for the Model Context Protocol (MCP), a standard that helps AI systems communicate with data sources, the project makes Wikidata's knowledge more accessible to natural-language queries from LLMs.
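MCP itself is built on JSON-RPC 2.0, so a tool call is just a structured message from client to server. The sketch below shows the general shape of such a request in Python; the `semantic_search` tool name and its arguments are hypothetical stand-ins for illustration, not Wikidata's published MCP interface.

```python
import json

# MCP is built on JSON-RPC 2.0: a client asks a server to run a named tool
# with structured arguments. The tool name and arguments below are
# hypothetical; they illustrate the message shape, not Wikidata's actual
# MCP interface.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "semantic_search",  # hypothetical tool name
        "arguments": {"query": "scientist", "limit": 5},
    },
}

# A client would send this over the server's transport (stdio or HTTP)
# and read back a JSON-RPC response carrying the matching entries.
print(json.dumps(request, indent=2))
```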

The project was developed by Wikimedia's German branch in collaboration with the neural search company Jina.AI and DataStax, a real-time data company owned by IBM.

Wikidata has offered machine-readable data from Wikimedia properties for years, but the preexisting tools only allowed for keyword searches and SPARQL queries, a specialized query language. The new system will work better with retrieval-augmented generation (RAG) systems, which let AI models pull in external information, giving developers a chance to ground their models in knowledge verified by Wikipedia editors.
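For contrast, here is what the older route looks like: a minimal Python call to the public Wikidata Query Service, using a real SPARQL query that lists a few humans whose occupation is "scientist." This is the exact-match interface the article says the new project moves beyond; the property and item IDs (P31, P106, Q5, Q901) are standard Wikidata identifiers.

```python
import requests

# A minimal SPARQL query against the public Wikidata Query Service.
# P31 = "instance of", Q5 = "human", P106 = "occupation", Q901 = "scientist".
QUERY = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P106 wd:Q901 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikidata-example/0.1"},  # WDQS asks for a UA
    timeout=30,
)
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"])
```

Note how the query must name exact property and item IDs; a search phrased in everyday language has no place to land, which is the gap the embedding project targets.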

The data has also been structured to provide important semantic context. Querying the database for the word "scientist," for example, will produce lists of prominent nuclear scientists as well as scientists who worked at Bell Labs. It will also return translations of the word "scientist" into different languages, a Wikimedia-cleared image of scientists at work, and extrapolations to related concepts such as "researcher" and "physicist."
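A minimal sketch of the idea behind that behavior, assuming nothing about the project's actual stack: entries and queries are embedded as vectors, and results are ranked by cosine similarity rather than keyword overlap. The four-dimensional vectors below are made-up toy values standing in for real learned embeddings.

```python
import numpy as np

# Toy sketch of vector-based semantic search: entries and the query are
# mapped to vectors, and results are ranked by cosine similarity. These
# 4-dimensional vectors are invented for illustration; the real project
# uses learned embeddings over roughly 120 million Wikidata entries.
entries = {
    "nuclear physicist": np.array([0.9, 0.8, 0.1, 0.0]),
    "Bell Labs researcher": np.array([0.8, 0.7, 0.2, 0.1]),
    "apple pie recipe": np.array([0.0, 0.1, 0.9, 0.8]),
}
query = np.array([0.85, 0.75, 0.1, 0.05])  # stands in for embed("scientist")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, ~0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Unlike keyword search, entries that never contain the word "scientist"
# still rank highly when their vectors point the same way as the query's.
for label, vec in sorted(entries.items(), key=lambda kv: -cosine(query, kv[1])):
    print(f"{cosine(query, vec):.3f}  {label}")
```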

The database is publicly accessible on Toolforge. Wikidata is also hosting a webinar for interested developers on October 9.


The new project comes as AI developers scramble for high-quality data sources that can be used to fine-tune models. The training systems themselves have grown more sophisticated, often assembled as complex training environments rather than simple datasets, but they still require closely curated data to work well. For applications that demand high accuracy, the need for reliable data is especially urgent, and while some may look down on Wikipedia, its data is far more fact-oriented than catchall datasets such as Common Crawl, a massive collection of web pages scraped from across the internet.

In some cases, the pressure for high-quality data can have expensive consequences for AI labs. In August, Anthropic agreed to settle a lawsuit with a group of authors whose works had been used as training material, agreeing to pay $1.5 billion to end any claims of wrongdoing.

In a statement to the press, Wikidata AI project manager Philippe Saadé emphasized his project's independence from major AI labs and large tech companies. "This Embedding Project launch shows that powerful AI doesn't have to be controlled by a handful of companies," Saadé told reporters. "It can be open, collaborative, and built to serve everyone."




