More adventures with compression
Published under the Compression category.
I have been working on the Hutter Prize compression challenge, in which participants are tasked with compressing 1GB of Wikipedia into as small a file as possible. The file is marked up as XML data. The current record for compressing the 1GB file is ~113 MB.
To strive toward this goal, I have made my own, smaller dataset, comprising all of the posts on my blog. This dataset contains words, code, characters, and YAML markup. If my approach yields interesting results, I then compress the "enwik8" dataset, which contains the first 100 MB of Wikipedia. This gives me a rough idea of the results I can expect if I run compression on the full dataset.
In my last post on compression, I documented my journey with word surprisals, learning about dictionary encoding, and some of the (perhaps strange) directions I explored in pursuit of compressing the file. Since then, I have explored many new avenues, which I will document in this post.
I am publishing all my code in my adventures-with-compression GitHub repository. This repository contains the programming equivalent of sketches, though: code is (mostly) functional but unpolished.
Square root compression
I have developed a dictionary encoder that converts tokens -- any sequence of characters between spaces -- into numbers. My hypothesis is that the numbers will compress better than the raw text.
Thus far, this approach has yielded good results when I compress the numbers with another compressor like brotli. I do this second compression to see how small I can get the file; later, I can refine my own compression algorithm accordingly. My best attempt with dictionary coding resulted in compressing Wikipedia to below 25 MB. But, that is still several MB off the goal.
Here is an example of the compressed data:
290 291 292 202 265 266 288 293 131 252 112
Each number represents a token and can be reversed losslessly into the original data.
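For illustration, here is a minimal sketch of the kind of dictionary encoder I describe, with brotli applied as the second-stage compressor. The file name and function names are placeholders, and my actual code in the repository is more involved; this only shows the core idea.

import brotli

def encode(text):
    # Assign each unique token (any run of characters between spaces) an integer ID.
    dictionary = {}
    ids = []
    for token in text.split(" "):
        if token not in dictionary:
            dictionary[token] = len(dictionary)
        ids.append(dictionary[token])
    return ids, dictionary

def decode(ids, dictionary):
    # Reverse the mapping to recover the original text losslessly.
    reverse = {number: token for token, number in dictionary.items()}
    return " ".join(reverse[number] for number in ids)

text = open("blog.txt").read()          # placeholder path
ids, dictionary = encode(text)
assert decode(ids, dictionary) == text  # the round trip is lossless

# Second-stage compression of the number stream with brotli,
# to see how small the file can get.
number_stream = " ".join(str(number) for number in ids).encode("utf-8")
print(len(brotli.compress(number_stream)))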
I have been thinking: what can I do with these numbers? How can I make a big number smaller? Rather than thinking about the file as text, what if I think of it as numbers?
I thought about compressing the data using square roots. I decided to bundle the numbers into chunks of 20. Then, I calculated their square roots and tried to reverse the process. There was a problem, however: this process is lossy. The square roots are often large, irrational numbers like this:
3.195357634211014e+29
When I tried to write a reversal script, I could not get the original data back intact.
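To show concretely why the square roots are not reversible, here is a small demonstration using the example token IDs from earlier; the chunking is simplified to a single chunk.

import math

# Concatenate a chunk of token IDs into one big integer using string operations.
chunk = [290, 291, 292, 202, 265, 266, 288, 293, 131, 252, 112]
big = int("".join(str(number) for number in chunk))

root = math.sqrt(big)         # a float with only ~15-16 significant digits
recovered = round(root ** 2)  # squaring the float does not restore the exact integer

print(big == recovered)       # False: precision was lost in the square root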
I was excited about the square root approach, but I continued on the theme of numbers to arrive at a new idea: using the greatest common divisor (also known as the highest common factor) of a number to compress the file.
Highest common factor
The highest common factor of two numbers is the largest number that divides into both. I wondered: could I use this to make numbers smaller? I followed this process (a rough sketch follows the list):
- Turn the file into a series of chunks, each N numbers in size.
- Concatenate the numbers in the chunks into a big number, using string operations.
- Calculate the highest common factor between the chunk and 10**CHUNK_SIZE. At the end, CHUNK_SIZE was 30,000.
- Divide each number by its highest common factor.
- Save the number and its highest common factor into a file.
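Here is a rough sketch of that process, assuming the token IDs have already been produced. The on-disk format and chunk handling in my repository differ, so treat this as an outline rather than my exact code.

import math

CHUNK_SIZE = 30_000  # numbers per chunk in my final run

def compress_chunk(chunk):
    # Record how long each number is so the big number can be split apart again.
    lengths = [len(str(number)) for number in chunk]
    # Concatenate the numbers into one big integer using string operations.
    big = int("".join(str(number) for number in chunk))
    # Highest common factor between the big number and 10**CHUNK_SIZE.
    hcf = math.gcd(big, 10 ** CHUNK_SIZE)
    # Store the divided number, the factor, and the length list.
    return big // hcf, hcf, lengths

def decompress_chunk(quotient, hcf, lengths):
    # Multiply back, then re-split the digits using the length list.
    digits = str(quotient * hcf).zfill(sum(lengths))
    numbers, position = [], 0
    for length in lengths:
        numbers.append(int(digits[position:position + length]))
        position += length
    return numbers

chunk = [290, 291, 292, 202, 265, 266, 288, 293, 131, 252, 112]
assert decompress_chunk(*compress_chunk(chunk)) == chunk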
The division operation allowed me to shave off some of the numbers in the text file. To reconstitute the concatenated strings, I needed a word length list, which stores how long each number in each chunk is. While the file with the highest common factors was smaller than the uncompressed file, the reason was mostly not down to the factorizing operation. Rather, it was likely because I had removed many spaces from the file. There would also be an impact from the numbers being different; perhaps they had lower entropy. But I also had to store the word lengths, which meant all of my performance gains from the concatenation and factorization were lost.
There are likely more intelligent ways to calculate the "10**CHUNK_SIZE" value that I use in the highest common factor operation. With that said, any gains would not offset the overhead of the word length list this approach requires.
My results for compressing my blog were:
- The tokenized numbers: 2.1MB
- Using my tokenizer then brotli: 642KB
- Using my tokenizer, concatenating, then using the highest common factor operation: 631KB
- Using my tokenizer, concatenating, then using the highest common factor operation, including the word length list: 766KB
These numbers do not include the dictionary required to map the tokenized numbers back to words. The dictionary is 624KB in size. I plan to think about how to compress that later.
I tried lowest common multiple, and achieved:
- The tokenized numbers: 2.1MB
- Using my tokenizer then brotli: 642KB
- Using my tokenizer, concatenating, applying the lowest common multiple operation, and including the word length list: 764KB
The length list introduces one extra number that I have to store for every number in the file. These figures also do not include the 624KB dictionary.
Convert to a FLAC file
In my last post, I noted an interest in converting the numbers to an image, in an attempt to see how an image compression algorithm would perform on my data. Perhaps the premise is fundamentally flawed, for my data is not structured in the same way as an image. But I wanted to experiment. Learning what doesn't work helps inform what I should do next.
Continuing with the theme of thinking of my file as a different modality of information, I thought: what if I convert my data into a WAV file, then convert it to FLAC, a lossless audio compression format?
I wrote a script that converted my sequence of numbers into a WAV file. I had to normalize my data to the range of -32767 to 32767. In my normalizing code, I converted the result to an integer, which introduced some loss. Thus, I was unable to reconstruct the original file. Even setting that aside, the results from compression were not impressive. The resulting FLAC file was 867KB. I then compressed the FLAC file with brotli, resulting in an 859KB file. I was trending in the wrong direction with this approach. My file was smaller than the original 2.1MB file, but text compression yielded better results.
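For reference, here is a sketch of the kind of conversion I describe, assuming numpy and the standard-library wave module; the FLAC step itself is done with an external encoder.

import wave
import numpy as np

def numbers_to_wav(ids, path="tokens.wav", rate=44100):
    data = np.array(ids, dtype=np.float64)
    # Normalize to the 16-bit sample range. Casting to int16 rounds the values,
    # which is the lossy step that stopped me reconstructing the original file.
    scaled = (data - data.min()) / (data.max() - data.min())  # 0..1
    samples = ((scaled * 2 - 1) * 32767).astype(np.int16)     # -32767..32767
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(rate)
        f.writeframes(samples.tobytes())

# The WAV file can then be converted with a FLAC encoder, e.g. `flac --best tokens.wav`.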
Overfitting an LSTM
One idea in the back of my mind is that I could overfit a neural network on purpose so that it memorizes the tokenized numbers exactly. I experimented with this on a small sequence of numbers (n < 50). I manipulated the sequence to create training pairs, then used an LSTM architecture with around four layers. No dropout or regularization was applied. I trained for 2,000 epochs. The resulting network trained fast on my small sequence of numbers. But when I tried training on my entire blog, every epoch would have taken a significant amount of time. (I didn't take notes on the results, so I can only document here what I recall.)
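Below is a sketch of the kind of setup I used, written in PyTorch. The layer sizes, learning rate, and training-pair construction are approximations of what I recall rather than my exact code.

import torch
import torch.nn as nn

# Build (previous number, next number) training pairs from a short sequence of token IDs.
sequence = [290, 291, 292, 202, 265, 266, 288, 293, 131, 252, 112]
vocab = max(sequence) + 1
pairs = [(sequence[i], sequence[i + 1]) for i in range(len(sequence) - 1)]

class Memorizer(nn.Module):
    def __init__(self, vocab, hidden=128, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h[:, -1])

model = Memorizer(vocab)  # no dropout or regularization: overfitting is the point
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.tensor([[a] for a, _ in pairs])
targets = torch.tensor([b for _, b in pairs])

for epoch in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

# Once the loss is near zero, the network has memorized the sequence exactly.

Even on a toy sequence like this, the model only has to memorize a handful of transitions, which is why training is fast; the cost grows quickly when the input is my entire blog.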
With that said, my numbers were not allocated using a traditional algorithm like TF-IDF or byte-pair encoding, so there are perhaps optimizations in this direction. I have read that neural network compression is possible, but I don't have any experience building architectures from scratch. This is another exciting opportunity for learning!
I probably need to play with the LSTM architecture more, but any deep learning approach is going to come with drawbacks. I am curious about how a transformer model specifically trained on enwik8 (the 100 MB Wikipedia dataset) or enwik9 (the 1GB dataset) would perform. But, the size of the network may be prohibitive. The transformer network and its dictionary would have to be under 113 MB, which feels like a tall order.
Conclusion
I am having fun with this challenge. I have been walking in all sorts of directions. Since my last post, some people have emailed me with ideas to explore, too, for which I am incredibly grateful. If you have any feedback on what I noted above, please let me know!
Respond to this post by sending a Webmention.
Have a comment? Email me at readers@jamesg.blog.
