Adventures with compression: Part 4
Published under the Compression category.
I am working toward compressing 1GB of Wikipedia into as small a file as I can. The current record for compressing the file, enwik9, is 113 MB. In pursuing this goal, I have learned a lot about compression and mathematics. For example, I have been asking myself questions like "can I compress data by calculating logarithms of numeric representations of the data and still losslessly recover the original?" I don't think this is possible in the general case, but I am learning.
The steps that I am following to compress the Wikipedia file are, generally:
- Tokenize the text data. This involves assigning a number to each word or sequence of characters. More common words or sequences should have lower numbers. Thus, "coffee coffee coffee" could become "425 425 425".
- Post-process the tokenized data. I have been trying to figure out how I can compress the tokenized numbers further.
- Save the tokenized data, ideally in a binary format.
- Use an existing compression tool, brotli, to compress the file further. (I haven't implemented Huffman coding, an effective algorithm for compressing data, yet.) This step lets me evaluate whether I am heading in the right direction: if my tokenized and compressed file, when compressed again with brotli, is smaller than my previous attempt, I know I am on the right track. A rough sketch of this pipeline follows the list.
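Here is that sketch. It is simplified compared to my actual code, and the file name is a placeholder: it ranks words by how common they are, packs the token IDs into a binary format, and compresses the result with the brotli Python package.

```python
import struct
from collections import Counter

import brotli  # pip install brotli

def tokenize_by_frequency(text):
    """Assign lower numbers to more common words."""
    words = text.split()
    counts = Counter(words)
    # The most common word gets ID 0, the next gets 1, and so on.
    vocab = {word: i for i, (word, _) in enumerate(counts.most_common())}
    return [vocab[word] for word in words], vocab

def pack_tokens(tokens):
    """Save the token IDs in a binary format (32-bit unsigned integers)."""
    return struct.pack(f"{len(tokens)}I", *tokens)

text = open("blog.txt").read()  # placeholder path
tokens, vocab = tokenize_by_frequency(text)
compressed = brotli.compress(pack_tokens(tokens))
print(f"{len(compressed) / 1_000_000:.2f} MB after brotli")
```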
Yesterday I explored an avenue I had not thought about since the beginning of the project: tokenization.
Tokenizing a file: What I learned
My original approach to tokenizing a file was to split the file up into words and assign numbers to each word based on how common the word was. A common word like "a" would thus have a lower number than a word like "coffee". I thought to myself yesterday: what about the other methods of tokenizing data? I wanted to see if there was a more efficient way I could tokenize my data.
There is plenty of prior art. Tokenizers are used in language models. For example, when you use a GPT model (e.g. GPT-3.5), the GPT tokenizer turns your words into specific numbers. These numbers are tracked in a dictionary. A word is often made up of more than one token. For example, "Alex's" might become "Alex" and "'s", with "Alex" having the number 192 and "'s" having the number 42. A language model tool then feeds the token numbers through a GPT model. The model returns numbers which can then be converted back into text.
I decided to try using the GPT-2 tokenizer to tokenize my text. Hugging Face, a company that builds natural language processing tools and hosts a community of language models, maintains a library called transformers which implements the tokenizer. The transformers.OpenAIGPTTokenizerFast class lets you calculate GPT tokens.
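As a rough illustration of how the round trip works (this is a sketch, not the exact code from my experiment; "openai-gpt" is the pretrained vocabulary the class downloads):

```python
from transformers import OpenAIGPTTokenizerFast

tokenizer = OpenAIGPTTokenizerFast.from_pretrained("openai-gpt")

# Turn text into token IDs...
ids = tokenizer.encode("I drink a lot of coffee")
print(ids)

# ...and turn the IDs back into text.
print(tokenizer.decode(ids))
```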
I used the GPT-2 tokenizer to tokenize my blog. I then applied a median transformation, which calculates the median of all the numbers and replaces each number with how far away it is from the median. This transformation wasn't relevant to this particular test, but the code was left over in the file in which I was experimenting.
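For reference, the median transformation looks roughly like this (a sketch, not my exact code):

```python
import statistics

def median_transform(tokens):
    """Replace each token ID with its signed distance from the median."""
    median = int(statistics.median(tokens))
    return [t - median for t in tokens], median

def reverse_median_transform(deltas, median):
    """Recover the original token IDs."""
    return [d + median for d in deltas]
```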
I tried to compress all my blog posts instead of the Wikipedia file. My blog posts, together, are 2.9MB uncompressed. Here are the results when I used the GPT-2 tokenizer and compressed the results with brotli:
Compressed text: 0.69 MB
Compressed dictionary: 0.24 MB
When I tokenized by word using my algorithm and compressed the results with brotli, I got the following results:
Compressed text: 0.55 MB
Compressed dictionary: 0.52 MB
I was excited when I saw the GPT numbers, but then I saw that many tokens were marked as <unk> when I reversed the tokenization. <unk> means a token is not in the dictionary.
I expected the dictionary to be smaller with the GPT tokenizer because it uses byte-pair encoding, which is an efficient way of encoding text. With that said, I forgot one crucial point: the GPT tokenizer, out of the box, doesn't work on all data. You need a tokenizer that has seen the words in your vocabulary.
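One way to check for this, sketched with the same tokenizer as above and a placeholder file name, is to count how many token IDs map to the tokenizer's unknown token:

```python
from transformers import OpenAIGPTTokenizerFast

tokenizer = OpenAIGPTTokenizerFast.from_pretrained("openai-gpt")

text = open("blog.txt").read()  # placeholder path
ids = tokenizer.encode(text)

# Any ID equal to unk_token_id cannot be reversed back into the original text.
unknown = sum(1 for i in ids if i == tokenizer.unk_token_id)
print(f"{unknown} of {len(ids)} tokens are {tokenizer.unk_token}")
```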
From this, I learned a key lesson: check what tokens a pre-trained tokenizer supports. It looks like you can pass your own vocabulary to the GPT tokenizer in transformers. I should look into doing that. I may (should?) be able to do compression with the GPT tokenizer, but I need to feed it my data first.
In addition, Hugging Face has a library for this called tokenizers which lets you train a tokenizer on your data. I plan to experiment with this package to try byte-pair encoding. tokenizers substantially reduces the amount of code you need to write to train tokenizers.
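Based on the library's quick tour, training a byte-pair encoding tokenizer on my own text should look something like this. I haven't run this against my data yet, and the file name and vocabulary size are placeholders:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a BPE tokenizer and train it on my own text so nothing maps to [UNK].
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]"])
tokenizer.train(files=["blog.txt"], trainer=trainer)

ids = tokenizer.encode("coffee coffee coffee").ids
print(ids)
print(tokenizer.decode(ids))
```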
Next steps
My main learning was that a pre-trained tokenizer needs to know the vocabulary with which I am working; otherwise, information will be lost. I have written my own tokenizers from the beginning of this project, so I was unsure what I needed to do to use an existing one. Suffice it to say, I am intrigued by the smaller dictionary size from byte-pair encoding. Dictionary size has been a sticking point with my word-based compression: there are over 1,000,000 unique tokens in the Wikipedia dictionary. I am excited to explore byte-pair encoding more.
