LLMs as Data Compressors

We’ve all experienced the frustration of large file sizes, whether it’s trying to share a high-resolution video or store our ever-growing photo libraries. Traditional data compression techniques have come a long way, but certain data types like audio and video remain challenging to squeeze down without significant loss of quality. This got me thinking: what if we approached compression from a completely different angle, leveraging the power of large language models (LLMs)?

My intuition tells me that memorization and compression are two sides of the same coin for these sophisticated models. Think about it: if an LLM can efficiently represent a piece of data using fewer internal “bits” (the essence of compression), it stands to reason that it has also “learned” or “memorized” the underlying patterns and structure of that data. Consequently, it should be able to reconstruct it with high accuracy. Conversely, if the model encounters data it hasn’t seen before, its attempt to encode and decode it might result in a less efficient representation and a potentially flawed reconstruction.

This idea resonates strongly with the findings of a fascinating research paper titled “How much do language models memorize?” by Morris et al. (2025). In this work, the researchers delve deep into quantifying how much information language models actually “know” about the data they are trained on. They formally separate unintended memorization (information about the specific training dataset) from generalization (knowledge about the broader data-generation process).

What’s particularly insightful is their approach to measuring memorization using concepts from Kolmogorov complexity and Shannon information theory. Essentially, they suggest that if a model can assign a high probability (and thus a shorter encoded length) to a piece of data, it has, in a way, “compressed” it. This “compression rate,” estimated through model likelihoods, becomes a proxy for how much the model has memorized.
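To make that link concrete: information theory says a symbol with probability p can be encoded in about -log2(p) bits, so summing a model’s token log-probabilities over a sequence gives the length an ideal entropy coder driven by that model would achieve. Here’s a minimal sketch in Python; the log-probability values below are made up for illustration, but in practice they would come from a model’s next-token softmax.

```python
import math

def code_length_bits(token_log_probs):
    """Ideal code length for a sequence: the sum of -log2(p) over tokens.
    Inputs are natural-log probabilities, the form model APIs commonly
    return, so we convert from nats to bits by dividing by ln(2)."""
    return sum(-lp / math.log(2) for lp in token_log_probs)

# Hypothetical per-token log-probs a model might assign to a 5-token span.
log_probs = [-0.05, -1.2, -0.3, -2.1, -0.4]

bits = code_length_bits(log_probs)
print(f"~{bits:.2f} bits total, {bits / len(log_probs):.2f} bits/token")
```

A lower bits-per-token figure on data the model was trained on is, roughly speaking, the compression-rate proxy the paper uses as a memorization signal.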

The paper’s findings reveal some crucial points: LLMs have a measurable “capacity” for memorization, estimated at roughly 3.6 bits per parameter for GPT-family models; models tend to memorize data until they fill that capacity, after which they start to generalize more effectively; and how accurately a model can “compress” data it has seen reflects how well it has internally represented that information.
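For a rough sense of scale, here’s a back-of-the-envelope calculation using that ~3.6 bits/parameter figure; the parameter counts below are illustrative, not sizes studied in the paper.

```python
def capacity_megabytes(num_params, bits_per_param=3.6):
    """Estimated raw memorization capacity: parameters x bits/parameter,
    converted from bits to megabytes."""
    return num_params * bits_per_param / 8 / 1e6

# Illustrative parameter counts, not figures from Morris et al.
for n_params in (124e6, 1.5e9, 7e9):
    print(f"{n_params / 1e9:5.2f}B params -> ~{capacity_megabytes(n_params):,.0f} MB")
```

Even a multi-billion-parameter model can store only a few gigabytes verbatim, which helps explain why training on web-scale corpora pushes models past memorization and into generalization.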

This connection between memorization and the efficient representation of data opens up exciting possibilities beyond just understanding model behavior. Could we potentially train large models on vast datasets of audio, video, or other complex data types, not just for generative tasks, but also for highly efficient compression?
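Mechanically, such a compressor would look like classic arithmetic coding with the LLM as the probability model: at each step, the coder narrows an interval according to the probability the model assigns to the actual next token. The sketch below uses a toy hand-written “model” and floating-point intervals purely to show the idea; a real coder would query an actual LLM and use integer arithmetic to avoid precision loss.

```python
import math

def toy_model(prefix):
    """Stand-in for an LLM's next-token distribution. A real compressor
    would run the model on the prefix and read off its softmax."""
    if prefix and prefix[-1] == "the":
        return {"the": 0.01, "cat": 0.80, "sat": 0.14, ".": 0.05}
    return {"the": 0.40, "cat": 0.20, "sat": 0.20, ".": 0.20}

def encode(tokens, model):
    """Idealized arithmetic coding: narrow [lo, hi) by each token's
    probability slice. The final width equals the product of the token
    probabilities, so the message needs about -log2(width) bits --
    exactly the model-likelihood code length described above."""
    lo, hi = 0.0, 1.0
    for i, tok in enumerate(tokens):
        dist = model(tokens[:i])
        cum = 0.0
        for sym, p in dist.items():
            if sym == tok:
                width = hi - lo
                lo, hi = lo + width * cum, lo + width * (cum + p)
                break
            cum += p
    return lo, hi

lo, hi = encode(["the", "cat", "sat", "."], toy_model)
print(f"interval width {hi - lo:.6f} -> ~{-math.log2(hi - lo):.1f} bits")
```

Decoding mirrors this by finding which slice contains the encoded number at each step. The better the model predicts the data, the wider each slice and the fewer bits the interval costs, which is precisely the memorization-as-compression link above.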

Imagine a future where an LLM, having “learned” the intricate patterns of music, could compress audio files to a fraction of their current size without significant loss. Or perhaps a model trained on medical imaging data could create highly compressed yet accurate representations for storage and analysis.

Furthermore, the paper’s exploration of unintended memorization hints at an even more intriguing (though potentially complex) application: secure data encoding. If a model’s “compressed” representation is deeply tied to its unique learned weights, could this serve as a form of implicit encryption, where only a model with similar “knowledge” could effectively reconstruct the original data?

Of course, this is speculative and comes with significant challenges. Training such models would require immense computational resources and massive datasets. Defining appropriate evaluation metrics for “compression” in these contexts would also be crucial. Moreover, the security implications of relying on a model’s memorization for encoding would need careful consideration.

However, the fundamental link between a language model’s ability to “understand” and “memorize” data, as highlighted in the paper by Morris et al., strongly suggests that these powerful tools could revolutionize how we approach data compression and potentially even data security in the future. It’s a fascinating area of research, and I’m eager to see how these initial findings pave the way for innovative applications we can only begin to imagine.