Generative AI can now produce everything from emails to research papers, at least in English. But switch to a different language, and artificial intelligence begins to slip.
Kalika Bali, a principal researcher at Microsoft Research India, said at the Fortune Brainstorm AI Singapore conference on Wednesday that most large language models “are somewhat similar to a Fulbright researcher interested in Asia as their field of study.” “They know a lot about (the topic), but they miss the culture. It is an external view of the country’s culture.”
Bali pointed to a classic math question, “John and Mary own a key lime pie that they need to divide into five parts,” to show the problem with AI that lacks cultural awareness.
Generic AI models will translate the prompt directly. But as Bali pointed out, “In a country like India, most people do not know what a pie is, (not to mention) a key lime pie.”
To develop models that better understand local culture, more data is needed in local languages. But getting this data is not always simple.
Almost half of all web content is in English, which means there is no shortage of high-quality digital resources for LLMs to learn English from. For other languages that do not enjoy the same abundance, developers must explore different ways to get training data.
Kasima Tharnpipitchai, head of AI strategy at SCB 10X, highlighted the groundwork that native speakers must do to build a training dataset.
Tharnpipitchai led SCB 10X’s project to launch the Thai LLM Typhoon. To create a Thai dataset, Tharnpipitchai said, native speakers had to comb through large data collections by hand and identify which Thai data sources were high quality and which were not.
“There are no tricks here, you really have to do the work,” he said. “It is really just effort. It’s almost brute force.”
SCB 10X launched Typhoon a year and a half ago. Tharnpipitchai said that Typhoon was able to outperform GPT-3.5 in Thai, a result that he said says more about how weak GPT-3.5’s performance in Thai was than about their own work.
However, scraping non-English web data raises legal concerns.
Khalil Nooh, the founder and CEO of Malaysian startup Mesolitica, which is developing a Malay LLM, said that data owners have asked the company to remove their sources from its training dataset, which is available online because the model is open source.
This has further shrunk the already small pool of high-quality Malay data the company has. “The challenge we face is to work with the owners of private data collections,” Nooh said.
Nooh and Bali are both exploring synthetic data generation to help create more high-quality data in their target languages. Machines can translate abundant English-language content into other languages to supplement limited datasets. This is especially useful for LLMs that are trying to work in regional dialects with almost no digital presence.
“How can we capture all 16 dialects in Malaysia through synthetic (data)?” Nooh said.
But some obstacles to data access cannot be overcome by “brute force” or machine generation. In many communities, researchers must balance getting a full picture against managing cultural sensitivities when collecting data in local languages.
“In general, India is very positive,” Bali noted, but “there are things that you do not ask” when collecting data on the ground. Communities may not want to share information on certain topics, even if they are widely known among people in the region.
Nooh added that in Malaysia, the “three Rs” of race, religion and royalty are all sensitive topics.
Although there are no regulations on what LLMs can say in Malaysia, Nooh said that Mesolitica has gone ahead and prepared the required components in case they are needed.
To address cultural sensitivities in Thailand, Tharnpipitchai explained, SCB 10X has released a “security model” for public-sector use alongside the regular Typhoon model.