Information coding: Encoding sorted lists of information
Published on under the Compression category. Toggle Memex mode
In my adventures with compression, I was thinking about ways in which I could represent whether a word was in a chunk. A file would have many chunks. A mapping would store whether each chunk contained a word and, if so, what the code was for that word. The chunking approach I developed did not work, but I did come away with a valuable learning: you can use a sorted list to track preferences and numeric tokens to concisely represent what each value means.
This will be more intuitive with an example.
Consider a web application that lets you order songs in a Taylor Swift album. Let's say you have five songs on the web page:
- Midnight Rain
- Question...?
- Anti-Hero
- Maroon
- Bejeweled
A user can freely rearrange these.
Now suppose you want to create shareable links for users. How can you do this? You could have a database that stores each user's result but that involves overhead. You could also encode the user's sort order in a URL, like this:
/example/?result=maroon,bejewled,anti-hero,question,midnight-rain
The order of each item is:
- Maroon
- Bejeweled
- Anti-Hero
- Question...?
- Midnight Rain
An even more concise representation would be to store indices for the original order on the web page. Then, you can provide the indices in the query string, ordered:
/example?result=4,5,3,2,1
Each number represents the index of the song in the starting order. The position at which the number appears is the user's order.
| Original Order | User's Order | Index in Original List |
|---|---|---|
| Midnight Rain | Maroon | 4 |
| Question...? | Bejeweled | 5 |
| Anti-Hero | Anti-Hero | 3 |
| Maroon | Question...? | 2 |
| Bejeweled | Midnight Rain | 1 |
I like this pattern of:
- Having an ID for a value.
- Using a sorted list to connect the ID for a value with another piece of information (i.e. a user's preference of songs, to use the example above).
In my compression example, every word had an ID. But the ID was different depending on which "batch" the word was in. The document was split into at least five batches. So, I had a structure like this:
{"coffee": [0, 0, 501, 241, 0]}
The 0s indicate a batch without the word. The numbers indicate the numeric ID used for the word in each batch. The list is ordered, so 501 is the ID for "coffee" in the third batch; 241 is the ID for "coffee" in the fourth batch. This didn't turn out to be helpful for compression since I had to store a lot of 0s for my use case, but the general pattern is still interesting.
For the lyrics example above, being able to show a URL with four numbers instead of five (at least!) words, is substantially more efficient. There is a loss of information for the user and I do love descriptive URLs. With that said, for a shareable link for quiz state a short URL is better than a long one.
Responses
Comment on this post
Respond to this post by sending a Webmention.
Have a comment? Email me at readers@jamesg.blog.
