Adventures with compression: Part 3
Published under the Compression category.
I am continuing down the rabbit hole of compression. I aim to compress 1GB of Wikipedia -- the enwik9 dataset -- into as small a file as possible. I need to compress the file below 113 MB to beat the current record.
I have written two posts describing the various methods I have tried for compressing text data. I have decided to turn "Adventures with Compression" into a series in which I share notes on what I learn at each stage of my journey. I will likely make mistakes along the way. If you notice a flaw in my thinking, let me know; compression is a relatively new subject for me. I am excited to see what I learn.
In this post, I am going to talk about my adventures in chunking a document to compress it further.
Chunking
So far, I have a compression algorithm that turns a file into numbers: each unique word is assigned a number, and the numbers are allocated sequentially in order of first appearance. I could improve this by allocating numbers in order of word frequency, so the most common words get the lowest numbers. This is an optimization I might come back to later if I determine the word-level tokenizing I am doing right now is worth exploring. Part of me thinks it is now time to start using byte-pair encoding or another tokenization method.
Here is an example of the numbers in a compressed file:
152 153 109 109 109 154 109 155 109 142 109 109 ... 1137 1138 45 1139 1140 224 1141 1142 1143 1144 1145 234
These numbers go into the thousands for enwik8, a 100 MB version of the enwik9 dataset. In my last post, I noted that I wanted to reduce the size of these numbers. I took a few mathematical approaches that did not work out.
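For reference, the scheme so far works roughly like the sketch below. This is simplified and not my exact code; splitting on whitespace and the function name are assumptions for illustration:

def tokenize(text):
    # Assign each unique word a number, allocated sequentially in
    # order of first appearance (numbers start at 1).
    dictionary = {}
    tokens = []
    for word in text.split():
        if word not in dictionary:
            dictionary[word] = len(dictionary) + 1
        tokens.append(dictionary[word])
    return tokens, dictionary

tokens, dictionary = tokenize("the cat sat on the mat")
print(tokens)  # [1, 2, 3, 4, 1, 5]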
Yesterday, I had an idea: what if I ensure no number can be higher than 999? This would save one digit for every token that would otherwise fall in the 1000-9999 range, and two digits for every token that would fall in the 10000-99999 range.
To pursue this goal, I decided to process the original file in chunks. I wrote an algorithm that split the document into the minimum number of chunks needed so that no token was over 999. For each chunk, I kept a separate dictionary and allocated tokens in the range 1 - 999. All compressed text was concatenated and saved into a file with a " | " delimiter (although, in hindsight, an out-of-distribution number would be a more appropriate delimiter, to keep the data types consistent).
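The chunking step works roughly as follows. This is a greedy sketch under my own assumptions about names, file paths, and word splitting, not my exact implementation:

MAX_TOKEN = 999  # highest token value allowed within a chunk

def chunk_and_tokenize(words):
    # Greedily split the word stream into chunks whose vocabularies never
    # exceed MAX_TOKEN, keeping a separate {word: token} dictionary per chunk.
    chunks, dictionaries = [], []
    current_tokens, current_dict = [], {}
    for word in words:
        if word not in current_dict and len(current_dict) == MAX_TOKEN:
            # Vocabulary full: close this chunk and start a new one.
            chunks.append(current_tokens)
            dictionaries.append(current_dict)
            current_tokens, current_dict = [], {}
        if word not in current_dict:
            current_dict[word] = len(current_dict) + 1  # tokens 1 - 999
        current_tokens.append(current_dict[word])
    chunks.append(current_tokens)
    dictionaries.append(current_dict)
    return chunks, dictionaries

# The compressed file is the chunks joined with the " | " delimiter.
chunks, dictionaries = chunk_and_tokenize(open("enwik8").read().split())
compressed = " | ".join(" ".join(str(t) for t in chunk) for chunk in chunks)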
Once I had the separate dictionaries, I transformed them to make one big dictionary. The dictionary took this form:
{word: [token, token, ...]}
Where the index value of each token was equal to the chunk it was in. Here is an example:
"<page>": [91, 92, 95, 93, 99, 96, 96, 98, 87]
In the first chunk, <page> has the value 91. In the second chunk, <page> has the value 92. And so on. If a word did not appear in a given chunk, I filled that position with a 0. This preserves the property that the index of each value is the chunk in which that token number can be used.
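Merging the per-chunk dictionaries into this combined structure might look something like the sketch below (assuming the dictionaries produced by the chunking step above):

def merge_dictionaries(dictionaries):
    # Build {word: [token in chunk 0, token in chunk 1, ...]}, using 0
    # where a word does not appear in a chunk.
    vocabulary = set()
    for chunk_dict in dictionaries:
        vocabulary.update(chunk_dict)
    return {word: [chunk_dict.get(word, 0) for chunk_dict in dictionaries]
            for word in vocabulary}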
With this approach, and subsequent compression of my numbers using brotli, I was able to get the enwik8 dataset down to 12 MB. This was my best attempt at compressing the raw text yet. Because no number is above 999, there are more instances of every number, which I think brotli can compress better.
However, there was a problem with this approach that I did not notice until I was getting toward the end of my implementation: the dictionary needed to reconstitute the file was massive. So massive that I was running into out-of-memory errors while processing the data. The main issue was the 0s. The file was split into ~7,000 chunks, which meant every word needed 7,000 values to represent it. I was using regular Python list operations. At this stage, I realised NumPy would be needed. I have not gotten further since.
For fun, I decided to save the file without the 0s, and the dictionary came out at 20 MB. That would mean my compressed data, in full, was 30 MB. With that said, my dictionary file cannot be used without the 0s unless I change the structure to:
"<page>": [[91, 92, 95, 93, 99, 96, 96, 98, 87], [1, 3, 4, 5, ...]]
Where the first list stores the token values and the second list stores the chunks with which each token is associated. This adds more numbers in place of values that could all just be 0s. I am curious about how to work with sparse data structures. In theory, I could write all my data to a file stream sequentially and delete each value as it is written. NumPy may have some efficient operations I can use, too.
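One direction I want to try is storing the merged dictionary as a sparse matrix, with one row per word and one column per chunk, so the 0s are never stored explicitly. Here is a rough sketch with SciPy; SciPy is an assumption on top of the NumPy idea, and I have not tested this:

import numpy as np
from scipy.sparse import dok_matrix

def to_sparse(merged, num_chunks):
    # One row per word, one column per chunk; zeros are left implicit.
    words = sorted(merged)  # fix a row order so rows can be mapped back to words
    matrix = dok_matrix((len(words), num_chunks), dtype=np.uint16)
    for row, word in enumerate(words):
        for chunk, token in enumerate(merged[word]):
            if token != 0:
                matrix[row, chunk] = token
    return words, matrix.tocsr()  # CSR stores only the non-zero entries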
To decompress the file, I would:
- Split up the file into its chunks using the chunk delimiter.
- For each number in the chunk, check every word in the dictionary.
- For each word, take the n-th value in its token list, where n is the chunk being processed. If that value equals the current token, write the original word to a string. This string accumulates the decompressed text.
- Do this for every word in the chunk, then for every chunk.
As I write, I realise this is incredibly inefficient. If there are 100,000 words in the vocabulary, every token in the file might need to be compared against up to 100,000 dictionary entries. To counter this, I could build a reverse dictionary in memory (sketched after the list below) that takes the form:
{chunk: {token: word}}
The file could be decompressed as follows:
- Split up the file into its chunks using the chunk delimiter.
- Get the dictionary for the chunk.
- For each numeric token in the chunk, look up the word associated with it in that dictionary and append the word to a string that stores the decompressed text.
- Do this for every word in the chunk, then for every chunk.
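Sketched out, the reverse-dictionary decompression could look like this. It assumes the merged dictionary and " | " delimiter from earlier, and it glues words back together with spaces, which glosses over how whitespace would really be restored:

def decompress(compressed, merged):
    # Build the reverse dictionary: {chunk: {token: word}}.
    reverse = {}
    for word, tokens in merged.items():
        for chunk, token in enumerate(tokens):
            if token != 0:
                reverse.setdefault(chunk, {})[token] = word
    # Walk each chunk and translate its tokens back into words.
    words = []
    for chunk, block in enumerate(compressed.split(" | ")):
        lookup = reverse[chunk]
        for token in block.split():
            words.append(lookup[int(token)])
    return " ".join(words)  # simplification: real whitespace handling differs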
I am unsure how large this data structure would get, though. Efficiencies could be made. I am hesitant to go any further in this direction because the 20 MB dictionary feels almost like a non-starter. I would have to reduce my compressed file even more (which could be done by decreasing the range of numbers used in compression, I think), or I would have to shrink my dictionary, and I don't know how to do that yet.
When I was on a walk earlier today, I thought about ways to restructure this dictionary, but I could not think of an obvious way to compress the information in it. There are likely clever ways to compress sparse matrices, a structure that fits the description of my token values, where most entries are 0s. But that doesn't deal with the fact that, even without the 0s, the dictionary file is still large. I need to compress all the other data in the dictionary too.
Next steps
Through the chunking exercise, I reframed the problem. In my previous experiments, I was trying to compress the tokenized numbers. In this post, the problem was not the numbers; it was keeping the dictionary small. I am thus now thinking about ways to restructure the dictionary so that it compresses better. The ideas that have come to mind so far -- delta encoding, run-length encoding -- don't entirely add up. More thought is needed.
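For example, run-length encoding the 0s in each word's token list might look something like the sketch below, though I have not convinced myself it saves enough once the dictionary is brotli-compressed anyway:

def run_length_encode(tokens):
    # Collapse runs of zeros into (0, run_length) pairs; keep other tokens as-is.
    encoded, zeros = [], 0
    for token in tokens:
        if token == 0:
            zeros += 1
            continue
        if zeros:
            encoded.append((0, zeros))
            zeros = 0
        encoded.append(token)
    if zeros:
        encoded.append((0, zeros))
    return encoded

print(run_length_encode([91, 0, 0, 0, 92, 0, 95]))
# [91, (0, 3), 92, (0, 1), 95]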
Is it worth continuing with the chunking approach? I'm not entirely sure, but I am curious to see the results. If it doesn't work out, my next approach will probably be to use a GPT tokenizer to tokenize data and work with the results.
I am excited to see what I learn next. If you have any ideas or suggestions, let me know by emailing me at readers [at] jamesg [dot] blog. A few readers have already emailed in some ideas, for which I am sincerely grateful. To all of those who emailed in, I will get back to you in the next few days.
