Adventures with compression
Earlier this year, I learned about the Hutter Prize, a competition in which you are asked to compress a specific file -- 1 GB of Wikipedia data -- to a smaller size than any previous participant. At the time of writing this post, the challenge is to compress the 1 GB file into less than 114 MB. This challenge intrigued me. How can I make a specific file smaller?
Over the holidays, I have been thinking a lot about compression. How can I make the 1 GB file smaller? This led me down some interesting rabbit holes. In this post, I am going to talk about a few of the approaches I tried to make the file smaller. I am likely to get some technical details wrong. My impetus for this post is to help clarify my thoughts and to share knowledge that may help others who are interested in compression. I have not broken any new ground, but I have had fun.
Surprisals: A surprise
I am intrigued by word surprisals. Surprisal is a metric that measures how "surprising" a given word is in a document. This is in the same vein as measuring how common a word is, except surprisal measures information content -- how much information a word carries, given how unlikely it is -- whereas word counts measure raw frequency.
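To make this concrete, here is a minimal sketch of how surprisal can be estimated from a document's own word frequencies, assuming surprisal is the negative log probability of a word. The function name is illustrative, not the code from my actual script.

```python
import math
from collections import Counter

def word_surprisals(text):
    """Estimate the surprisal, in bits, of each word in a document.

    Surprisal is -log2(p), where p is the word's relative frequency
    in the document, so rarer words carry more information.
    """
    words = text.split(" ")
    counts = Counter(words)
    total = len(words)
    return {word: -math.log2(count / total) for word, count in counts.items()}

surprisals = word_surprisals("the coffee was good and the coffee was strong")
print(surprisals["coffee"])  # common word, lower surprisal
print(surprisals["strong"])  # rare word, higher surprisal
```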
I started down the rabbit hole with an idea: is there a way I could encode the surprisals for a document in such a way that the document is all surprisals and I have a dictionary that maps each surprisal back to the word? This was not directly related to the task at hand -- compressing Wikipedia -- but did put me on a useful path where I discovered dictionary encoding, an approach to compression.
Before I continue, I need to state explicitly what I mean by "word". For this post, I use "word" to refer to a sequence of characters. That sequence could involve letters, numbers, and symbols (e.g. brackets).
To extract words, I split a document by the space character. This allowed me to use my surprisal logic on any sequence of arbitrary text. This was important because my experimental dataset -- my blog -- contained a lot of symbols from code and formatting. I wanted a number for every sequence of text so that I could compress and decompress text exactly. When the output of decompression is the same as the original text, you have lossless compression: you lost no information during compression.
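As a quick illustration of why splitting only on spaces stays lossless, here is a tiny sketch (the helper names are illustrative): splitting on a single space and rejoining with a single space reproduces the original text exactly, newlines, punctuation, and code symbols included.

```python
def tokenize(text):
    # Split only on spaces so that newlines, punctuation, and code
    # symbols stay attached to their tokens and nothing is discarded.
    return text.split(" ")

def detokenize(tokens):
    return " ".join(tokens)

original = "def main():\n    print('hello, world')  # note the double space"
assert detokenize(tokenize(original)) == original  # the round trip is lossless
```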
I was not going to get far "compressing" a document by turning it into floating-point numbers, which is how my surprisal values were represented. I revisited the topic of quantization, a technique used to compress the weights of neural networks. Quantization maps floating-point values into smaller ranges; you can even turn them into integers. But, with every number changed, there is a loss of information. Given the vast vocabulary I was dealing with, I suspected there would be a collision somewhere down the line, so I spent time ensuring the numbers were unique.
My hypothesis for turning the words into integers (numbers without decimal points) was that numbers were likely to compress better. There are fewer digits -- 0 to 9 -- than there are characters. Common numbers would be compressed well by a Huffman tree, I assumed. My knowledge of Huffman trees is limited, but I think that is true. My knowledge of computer science theory is burgeoning. I had the idea that integers take up less space than strings, so turning words into numbers was the direction in which I wanted to go.
I declared a dictionary that mapped words to quantized surprisals. The word "coffee" might be 6 and "person" might be 18. If a quantized surprisal value was already used, I would assign the associated word a new number that was out of the range of quantization. Quantization functions return numbers in limited ranges.
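Here is a rough sketch of that idea, with illustrative names and a made-up number of quantization levels: each surprisal is scaled into a fixed integer range, and words whose quantized value is already taken get a number beyond that range.

```python
def build_codebook(surprisals, levels=256):
    """Map each word to a unique integer derived from its quantized surprisal.

    Surprisals are scaled into [0, levels); if a quantized value is already
    taken, the word gets the next unused number outside that range.
    """
    max_s = max(surprisals.values()) or 1.0
    codebook = {}
    used = set()
    next_overflow = levels  # collisions get numbers beyond the quantized range
    for word, s in surprisals.items():
        code = int(s / max_s * (levels - 1))  # quantize the float surprisal to an int
        if code in used:
            code = next_overflow
            next_overflow += 1
        codebook[word] = code
        used.add(code)
    return codebook
```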
At this point, I had arrived at the two components I would later use:
- I can substitute words with numbers, which will be my compressed document, and
- I need a dictionary that maps numbers back to their words.
My decompressor would:
- Read each number
- Find the word associated with each number in a dictionary, and
- Substitute the number with the word.
This could run quickly, taking only a few seconds on a document with tens of thousands of words.
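Put together, the compressor and decompressor amount to dictionary substitution in both directions. A minimal sketch, with an illustrative codebook:

```python
def compress(text, word_to_number):
    # Replace every word with its number; the numbers are the compressed document.
    return [word_to_number[word] for word in text.split(" ")]

def decompress(numbers, word_to_number):
    # Invert the dictionary, then substitute each number back for its word.
    number_to_word = {number: word for word, number in word_to_number.items()}
    return " ".join(number_to_word[n] for n in numbers)

codebook = {"i": 1, "love": 2, "coffee": 3}
encoded = compress("i love coffee", codebook)
assert decompress(encoded, codebook) == "i love coffee"
```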
Eventually, I had a script that could compress and decompress text. The output was the same as the input. I was thrilled. What's more, my files were smaller than they were at the beginning. The file size was measured by combining the dictionary needed to map numbers back to words with my compressed document. I compressed a document, yay!
There was just one problem: surprisals, I discovered, were unnecessary.
Dictionary encoding: Why not just use numbers?
My love of word entropy took me far. As I tinkered, I realised that it might be more efficient to assign smaller numbers (i.e. 1, 2, 3) to the most common words. Quantizing surprisals did not do this. If a word like "coffee" (which is in the top 10 used words in the corpus of text I used for experimentation, my blog) got the number 1024, I had a problem. Space could be saved if coffee was 1.
I decided to change my approach. I took the 50 words with the lowest surprisals -- roughly equivalent to the most common words -- and assigned them smaller numbers. Then I asked: why don't I use this approach for the whole corpus of text? I was no longer interested in numbers in the document representing both a connection to the word and its surprisal. This was not the goal of the Hutter Prize, but was what got me started thinking about this problem.
When I removed my surprisal quantization code, I realised that my compressor performance was not harmed. I do not have the exact performance change to hand, but I had the thought that I was over-complicating the problem.
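A minimal sketch of the simplified approach, with illustrative names: count the words and hand out numbers in order of how common they are, so the most frequent words get the smallest numbers.

```python
from collections import Counter

def build_frequency_codebook(text):
    """Assign the smallest numbers to the most common words."""
    counts = Counter(text.split(" "))
    return {word: i for i, (word, _) in enumerate(counts.most_common())}

codebook = build_frequency_codebook("the coffee was good and the coffee was strong")
print(codebook["coffee"])  # a common word gets a small number
print(codebook["strong"])  # a rare word gets a larger one
```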
File structure optimizations
With a sequence of numbers, I started to think about how I could compress them further. I researched integer compression. I did not get very far. Every number contained essential information. Without starting to write my own Huffman tree, I was out of ideas in terms of directly manipulating the data. Thus, I moved on to a different part of the compression problem: figuring out how to store my data.
I initially saved my compressed data as a raw text file filled with numbers. Then I learned that I could use bytes. I learned to use the Python struct module and the bytearray type to save my numbers as bytes. My compressor's performance improved. I did not compress the dictionary that maps numbers to words this way, though. While writing this, I realised that my dictionary-saving code encodes the integers that map to words rather than the words themselves. James facepalms, as this means his dictionary file should be bigger than it currently is in terms of file size.
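Here is a small sketch of the byte-packing idea, assuming each number fits in an unsigned 32-bit integer; the helper names are mine, not the exact code I wrote.

```python
import struct

def save_numbers(numbers, path):
    # Pack each number as an unsigned 32-bit little-endian integer
    # instead of writing it out as text.
    data = bytearray()
    for n in numbers:
        data += struct.pack("<I", n)
    with open(path, "wb") as f:
        f.write(data)

def load_numbers(path):
    with open(path, "rb") as f:
        data = f.read()
    # "<I" is 4 bytes, so unpack the buffer in 4-byte steps.
    return [struct.unpack_from("<I", data, offset)[0]
            for offset in range(0, len(data), 4)]
```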
Two-stage compression
Throughout all of this, I was using gzip, and later brotli, to compress my already-compressed text. Thus, my overall compression workflow comprised two stages:
- Run my compressor to create a file with bytes that could be unpacked to get numbers. These numbers could be decoded using the dictionary stored in a separate file.
- Run brotli to compress the file even more.
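A sketch of that second stage, assuming the brotli Python bindings; the file paths and function name are illustrative.

```python
import brotli  # Brotli Python bindings (pip install brotli)

def second_stage(in_path, out_path):
    # Take the byte-packed output of the first stage and run brotli over it.
    with open(in_path, "rb") as f:
        packed = f.read()
    compressed = brotli.compress(packed)
    with open(out_path, "wb") as f:
        f.write(compressed)
```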
Thinking of other avenues
I have had so much fun working on this project. I have spent many hours thinking about compression, going down rabbit holes. For example, I thought about:
- What if I think of the numbers in the document as strings? Could I use an algorithm like a Burrows-Wheeler transform to compress the data more? (There is a tiny sketch of the transform after this list.)
- Using a GPT model to predict the missing word in a sentence.
- Removing the last character from each word and using a trie with surprisals to efficiently predict the missing letter. (The trie will take up a not insignificant amount of space for large documents, I suspect.)
- Encoding the text as an image and compressing the image. I would need to use a lossless image compressor, and using RGB would increase the number of values associated with each word. Perhaps if I changed the image to greyscale? Or perhaps that is not worth exploring.
- Using a discrete Fourier transform (DFT). I know almost nothing about DFTs, but I thought it was worth a try. I did some reading and then tried to apply a DFT algorithm to my data. On the full corpus of my blog, the transformation was taking so long that I stopped it and moved on to a different idea.
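For the curious, here is a toy Burrows-Wheeler transform from the first idea above. This is the naive sort-all-rotations version -- nowhere near usable on a 1 GB file -- and it assumes the end marker does not appear in the input.

```python
def bwt(text, end_marker="$"):
    """Naive Burrows-Wheeler transform: sort every rotation, keep the last column.

    The transform groups similar characters together, which can help a
    later compressor. Assumes end_marker does not appear in the input.
    """
    text += end_marker  # unique sentinel so the transform can be inverted
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("banana"))  # annb$aa -- the repeated letters cluster together
```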
I need to think outside the box. The first three ideas are worth exploring further; the last two were wilder experiments. Indeed, part of solving problems like this is exploring different options and refining your focus to the ones that make sense. I am going to take a break from this project for a bit, but removing and predicting characters is something I want to explore more (including independently from compression).
Learning
This project has pushed my knowledge of computer science and compression in so many ways. I have pursued this project with passion, excitement, and intrigue. The problem is well defined: compress 1 GB of Wikipedia. How you do that is the hard part. Through trying to compress the file -- and my blog, which is substantially smaller than the Wikipedia file and thus easier to work with in testing -- I have learned about dictionary coding, byte streams, Burrows-Wheeler transforms, and more. I have followed many paths that did not work, but that is okay. I know what didn't work, and that information helps direct me to new areas of exploration.
I feel great that I implemented a dictionary coding technique without explicitly knowing what dictionary coding was at the time. Independently arriving at similar solutions to a defined problem is fascinating and inspiring.
If you have any advice on new directions I should explore, let me know. This topic is entirely new to me. To get this far, I am applying my knowledge of computational linguistics to help guide me, using the internet as a reference.
