How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell





Most people interested in AI already know that large language models (LLMs), such as those behind ChatGPT, Anthropic’s Claude, and Google’s Gemini, are trained on enormous datasets: trillions of words pulled from websites, books, codebases, and, increasingly, other media such as images, audio and video. But why?

From this data, LLMs develop a statistical, generalized understanding of language, its patterns, and the world, encoded in the form of billions of parameters, or “settings,” in a network of artificial neurons (mathematical functions that transform input data into output signals).

Through exposure to all this training data, LLMs learn to detect and generalize patterns, which are reflected in the parameters of their artificial neurons. For example, the word “apple” often appears near terms related to food, fruit or trees, and sometimes computers. The model picks up that apples can be red, green or yellow (or occasionally other colors when rotten or rare), that it is spelled “apple” in English, and that it is edible. This statistical knowledge influences how the model responds when a user enters a prompt, shaping the output it generates based on the associations it “learned” from the training data.

But a big question remains, even among AI researchers: how much of their training data do LLMs use to build generalized representations of concepts, and how much do they instead memorize verbatim, storing it in a way that is identical or nearly identical to the original data?

This matters not only for better understanding how LLMs work (and when they go wrong), but also for how model providers defend themselves in copyright lawsuits brought by data creators and rights holders, such as artists and record labels. If LLMs are shown to reproduce large portions of their training data verbatim, courts could be more likely to side with plaintiffs arguing that the models unlawfully copied protected material. If not, that is, if the models are found to generate outputs based on generalized patterns rather than exact copies, developers may be able to continue scraping and training on copyrighted data under existing legal defenses such as fair use.

Now, we finally have an answer to the question of how much LLMs memorize versus generalize: a new study released this week from researchers at Meta, Google DeepMind, Cornell University and Nvidia finds that GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.

To understand what 3.6 bits means in practice:

  • A single bit is the smallest unit of digital data, representing either a 0 or a 1.
  • Storing 3.6 bits allows for about 12.13 distinct values, as calculated by 2^3.6 (see the quick calculation after this list).
  • That is roughly the amount of information needed to choose one option out of 12, like picking a month of the year or the outcome of a roll of a 12-sided die.
  • It is not enough to store even one English letter (which needs about 4.7 bits), but it is just enough to encode a character from a reduced set of 10 common English letters (which requires about 3.32 bits).
  • In bytes, 3.6 bits is 0.45 bytes, less than half the size of a typical character stored in ASCII (which uses 8 bits, or 1 byte).
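For readers who want to check the arithmetic themselves, here is a minimal Python sketch of the numbers above (values rounded; standard library only):

```python
import math

print(2 ** 3.6)        # ≈ 12.13 distinct values representable in 3.6 bits
print(math.log2(12))   # ≈ 3.58 bits needed to pick one of 12 options (a month, a d12 roll)
print(math.log2(26))   # ≈ 4.70 bits needed to pick one of 26 English letters
print(math.log2(10))   # ≈ 3.32 bits needed to pick one of 10 common letters
print(3.6 / 8)         # = 0.45 bytes, versus 1 byte for an ASCII character
```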

This figure is constant across reasonable architectural variations: different depths, widths and precisions produced similar results. The estimate held steady across model sizes and even precision levels, with full-precision models reaching slightly higher values (up to 3.83 bits per parameter).

More training data does not lead to more memorization; in fact, the model will memorize less about any single data point

One of the key takeaways of the research is that models do not memorize more when trained on more data. Instead, a model’s fixed capacity is spread across the dataset, meaning each individual data point receives less of it.

Jack Morris, the lead author, explained via the social network X: “Training on more data will force models to memorize less per sample.”

These findings may help ease concerns about large models memorizing copyrighted or sensitive content.

If memorization is limited and diluted across many examples, the probability of reproducing any one specific training example decreases. In essence, more training data leads to safer generalization behavior, not increased risk.

How the researchers reached these findings

To precisely quantify how much language models memorize, the researchers used an unconventional but powerful approach: they trained transformer models on datasets of uniformly random bitstrings. Each of these bitstrings was sampled independently, ensuring there were no patterns, structure or redundancy across examples.

Because each sample is unique and free of shared features, whatever ability the model shows in reconstructing or identifying these strings during evaluation directly reflects how much information it retained, that is, memorized, during training.

The main reason for this setup was to eliminate the possibility of generalization entirely. Unlike natural language, which is full of grammatical structure, semantic overlap and repeated concepts, uniformly random data contains no such information. Each example is essentially noise, with no statistical relationship to any other. In such a scenario, any ability the model shows to reproduce the training strings must come from memorization, because there is no distributional pattern to generalize from.
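To make the setup concrete, here is a minimal sketch of how such a dataset of uniform random bitstrings might be generated (the sizes below are illustrative, not the paper's):

```python
import numpy as np

def make_random_bitstring_dataset(num_samples: int, seq_len: int, seed: int = 0) -> np.ndarray:
    """Sample independent sequences of uniformly random bits.

    Every bit is an independent fair coin flip, so each example carries exactly
    seq_len bits of information and shares no structure with any other example.
    Anything a model trained on this data can later reconstruct must therefore
    have been memorized rather than generalized.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(num_samples, seq_len), dtype=np.int8)

# Illustrative sizes: 10,000 sequences of 64 bits each.
dataset = make_random_bitstring_dataset(num_samples=10_000, seq_len=64)
print(dataset.shape)  # (10000, 64)
```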

The authors argue that their method is perhaps one of the only principled ways to decouple memorization from learning in practice, because when LLMs are trained on real language, it is hard to tell whether a model that produces text matching its training data has memorized that input or has merely inferred the underlying structure from the patterns it observed.

This method allowed the researchers to map a direct relationship between the number of model parameters and the total information stored. By gradually increasing model size and training each variant to saturation, across hundreds of experiments on models ranging from 500K to 1.5 billion parameters, they observed a consistent result: 3.6 bits memorized per parameter, which they report as a fundamental measure of LLM memory capacity.
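Loosely in the spirit of that measurement, and simplified considerably, one could estimate memorized information as the gap between a random string's information content and the bits the trained model still needs to encode it. The helper and numbers below are hypothetical and purely for illustration; they are not the authors' exact procedure:

```python
def memorized_bits(seq_len_bits: int, model_nll_bits: float) -> float:
    """Simplified estimate of how many bits of one random bitstring a trained
    model retains: the string's information content minus the bits the model
    still needs to encode it (its negative log2-likelihood). A model that
    learned nothing about the string needs roughly seq_len_bits, giving ~0;
    a model that reproduces it perfectly needs ~0, giving ~seq_len_bits."""
    return max(0.0, seq_len_bits - model_nll_bits)

# Hypothetical per-sequence losses (in bits), purely for illustration.
nll_bits_per_sequence = [12.0, 30.5, 64.0, 5.2]
total_memorized = sum(memorized_bits(64, nll) for nll in nll_bits_per_sequence)

num_params = 500_000  # illustrative model size
print(f"~{total_memorized / num_params:.6f} bits memorized per parameter")
```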

The team also applied its methodology to models trained on real-world datasets. When trained on text, the models exhibited a balance between memorization and generalization.

Smaller datasets encouraged more memorization, but as dataset size increased, the models shifted toward learning generalizable patterns. This transition was marked by a phenomenon known as “double descent,” in which performance temporarily dips before improving once generalization kicks in.

The study also examined how model precision, comparing training in bfloat16 versus float32, affects memorization capacity. The researchers observed a modest increase from 3.51 to 3.83 bits per parameter when switching to full 32-bit precision. However, that roughly 9% gain is far smaller than the doubling of available bits would suggest, implying diminishing returns from higher precision.

Unique data is more likely to be memorized

The paper also proposes a scaling law that relates a model’s capacity and dataset size to the effectiveness of membership inference attacks.

These attacks try to determine whether a particular data point was part of a model’s training set. The research shows that such attacks become unreliable as datasets grow, which supports the argument that training at scale helps reduce privacy risk.
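As a rough illustration of the idea, here is the classic loss-threshold heuristic, not the specific attack analyzed in the paper, with made-up numbers:

```python
def loss_threshold_membership_guess(sample_loss_bits: float, threshold_bits: float) -> bool:
    """Classic loss-threshold heuristic: guess that a sample was in the
    training set when the model assigns it an unusually low loss, i.e. it
    compresses the sample unusually well. As a model's fixed capacity is
    spread over more and more data, member and non-member losses overlap
    and this guess degrades toward a coin flip."""
    return sample_loss_bits < threshold_bits

# Hypothetical per-sample losses (in bits), for illustration only.
member_losses = [2.1, 2.4, 2.2]      # samples that were in the training set
non_member_losses = [2.3, 2.6, 2.5]  # samples that were not
threshold = 2.35

hits = sum(loss_threshold_membership_guess(x, threshold) for x in member_losses)
false_alarms = sum(loss_threshold_membership_guess(x, threshold) for x in non_member_losses)
print(hits, false_alarms)  # heavily overlapping losses leave the attack with weak signal
```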

While the paper focuses on average-case behavior, some researchers have pointed out that certain types of data, such as highly unique or stylized writing, may still be more susceptible to memorization.

The authors acknowledge this limitation and emphasize that their method is designed to characterize general trends rather than edge cases.

Moving toward a greater human understanding of LLM memorization

By providing a principled and quantifiable definition of memorization, the study gives developers and researchers new tools for evaluating the behavior of language models. This helps not only with model transparency but also with compliance, privacy and ethical standards in AI development. The findings suggest that more data, not less, may be the safer path when training language models at scale.

To put the models’ total memorization capacity in perspective (the arithmetic is sketched in code after the list):

  • A 500K-parameter model can memorize roughly 1.8 million bits, or 225 KB of data.
  • A 1.5-billion-parameter model can hold about 5.4 billion bits, or 675 MB of raw information.
  • That is not comparable to typical file storage such as images (for example, an uncompressed 3.6 MB image is about 30 million bits), but it is significant when distributed across discrete textual patterns.
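These figures follow directly from multiplying the parameter count by 3.6 bits and converting to bytes; a minimal sketch:

```python
def capacity_bytes(num_params: int, bits_per_param: float = 3.6) -> float:
    """Estimated total memorization capacity in bytes (8 bits per byte)."""
    return num_params * bits_per_param / 8

print(f"{capacity_bytes(500_000) / 1_000:,.0f} KB")           # ≈ 225 KB
print(f"{capacity_bytes(1_500_000_000) / 1_000_000:,.0f} MB")  # ≈ 675 MB
```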

I am not a lawyer or legal expert, but I expect such research to be cited in the many ongoing lawsuits between AI providers and data creators and rights holders.


