Word surprisal for automated linking and glossaries
Published under the Coding category.
Indexes have been on my mind of late. Yousef introduced me to "word surprisal", a statistical measure related to entropy, which formed the basis of what later became my personal website index. "Word surprisal" refers to how "surprising" a word is in the context of a larger body of text. The metric is probabilistic: generally, the more common a word is in a reference corpus, the less surprising it is. Surprisal works well when you have a broad, large corpus of text against which you can compare words. In my experiments, I found datasets from the New York Times to be a great starting point, although I am curious about comparing word surprisals across different corpora.
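To make the idea concrete, here is a minimal sketch of one common way to compute surprisal: the negative log (base 2) of a word's probability in a reference corpus. The tiny reference text, the simple tokenizer, and the add-one smoothing for unseen words are all assumptions made for illustration; this is not necessarily how my index calculates the metric.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def build_surprisals(reference_text: str) -> tuple[dict[str, float], float]:
    """Return per-word surprisal in bits, plus the value used for unseen words."""
    counts = Counter(tokenize(reference_text))
    total = sum(counts.values())
    # Add-one smoothing (an assumption) keeps the surprisal of unseen words finite.
    denominator = total + len(counts) + 1
    surprisals = {
        word: -math.log2((count + 1) / denominator)
        for word, count in counts.items()
    }
    unseen = -math.log2(1 / denominator)
    return surprisals, unseen


# A toy stand-in for a large reference corpus.
reference = "the cat sat on the mat and the dog sat on the rug"
surprisals, unseen = build_surprisals(reference)

for word in ["the", "sat", "webmention"]:
    print(word, round(surprisals.get(word, unseen), 2))
```

Common words such as "the" score lowest, and a word that never appears in the reference text scores highest. With a real corpus, the same comparison is what surfaces unusual terms.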
The program that generates my website index uses surprisal, alongside Named Entity Recognition, to find candidates to feature in the index. I take a limited number of surprising words per article, and use that information to generate my index. This approach has worked well. I discuss it more in my website index announcement post. The code for my index is open source if you are interested in reviewing my implementation in detail.
I was thinking earlier today that word surprisal could also be used for automatically linking to entries in the index. For example, if a word is above a certain "surprisal" threshold, one could add a link to the corresponding index entry. Or, if a word appears in the index at all, one could link to the index entry. This creates a bi-directional link between posts and the index. You can find the index through a post. The index will link to related articles. Then, you can click on each article to learn more. An interesting way to explore content!
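A rough sketch of the second option above (link any word that appears in the index) might look like the following. The `INDEX_URLS` mapping, its URL scheme, and the `link_index_terms` function are hypothetical names made up for this example, and the code assumes the post is plain text rather than HTML.

```python
import html
import re

# Hypothetical index: maps a lowercase term to the URL of its index entry.
INDEX_URLS = {
    "surprisal": "/index/#surprisal",
    "webmention": "/index/#webmention",
}


def link_index_terms(text: str, index_urls: dict[str, str]) -> str:
    """Wrap the first occurrence of each indexed term in a link to its entry."""
    already_linked: set[str] = set()

    def replace(match: re.Match) -> str:
        word = match.group(0)
        key = word.lower()
        # Leave the word alone if it is not indexed or was linked earlier.
        if key not in index_urls or key in already_linked:
            return word
        already_linked.add(key)
        url = html.escape(index_urls[key], quote=True)
        return f'<a href="{url}">{word}</a>'

    return re.sub(r"[A-Za-z]+", replace, text)


print(link_index_terms("Surprisal is fun. I like surprisal.", INDEX_URLS))
```

To use the threshold option instead, one could build the mapping from only those index terms whose surprisal is above the chosen cut-off. Linking only the first occurrence is a design choice to keep posts readable.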
There should likely be a limit on the number of words that are linked in a post. For instance, if there are 10 surprising words in a 200-word post, perhaps only the most surprising should be selected (for example, with a hard limit of two linked words or terms per 100 words).
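The cap described above could be sketched like this. The function name, the threshold of 10 bits, and the choice to scale the limit proportionally with post length are assumptions for illustration; the scores in the usage example are made up.

```python
def select_terms_to_link(
    surprisals: dict[str, float],
    word_count: int,
    threshold: float = 10.0,  # hypothetical cut-off, in bits
    max_per_100_words: int = 2,
) -> list[str]:
    """Pick the most surprising candidate terms, capped by post length."""
    # Scale the cap with the length of the post (two per 100 words by default).
    limit = (word_count * max_per_100_words) // 100
    candidates = [word for word, score in surprisals.items() if score >= threshold]
    # Most surprising first, so the cap keeps the strongest candidates.
    candidates.sort(key=lambda word: surprisals[word], reverse=True)
    return candidates[:limit]


# Ten made-up candidate terms in a 200-word post.
scores = {f"term{i}": 10.0 + i for i in range(10)}
print(select_terms_to_link(scores, word_count=200))
```

With ten candidates in a 200-word post, only the four most surprising terms are kept.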
I can see this feature powering a glossary, too. You could use word surprisal to find words that may benefit from a definition or a link to one, such as may be the case when using technical jargon (based on the assumption that the corpus with which you calculate surprisals is not biased toward the jargon on your site). One could then manually add definitions. You can use linguist.link, a web app I maintain for calculating various Natural Language Processing statistics about a web page, to find surprising words (Note: linguist.link only works on English texts).
Furthermore, one could automatically link to definitions, if one has a dictionary that maps words to definitions. In the case of technical jargon, you could use word surprisal to find jargon terms, then build up a personal mapping of those words to relevant terms. This serves as an interesting way to add links to content where such links may not exist but would be beneficial to readers for context. More thought would need to be put into implementation. I am only scratching the surface here!
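As a starting point, the personal mapping could be a plain dictionary from terms to definitions. The sketch below splits a list of jargon terms into those that already have a definition, which could be linked automatically, and those that still need one written by hand. The function name and the sample data are hypothetical.

```python
def split_by_definition(
    jargon_terms: list[str],
    glossary: dict[str, str],
) -> tuple[dict[str, str], list[str]]:
    """Separate terms that already have a definition from those that need one."""
    defined = {term: glossary[term] for term in jargon_terms if term in glossary}
    missing = [term for term in jargon_terms if term not in glossary]
    return defined, missing


# Hypothetical glossary and jargon terms found via surprisal.
glossary = {
    "webmention": "A web standard for notifying a URL that another page links to it.",
}
jargon_terms = ["webmention", "surprisal"]

defined, missing = split_by_definition(jargon_terms, glossary)
print("Can link:", sorted(defined))
print("Needs a definition:", missing)
```

The list of missing terms doubles as a to-do list for writing new glossary entries.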
Word surprisals continue to be of great interest. I wonder what other applications there are! If there are any of which you can think, please send me an email. I would love to chat with more people about this topic.
Have a comment? Email me at readers@jamesg.blog.