This week, I have been tinkering with Natural Language Processing (NLP) to build an index for my personal website. While I was creating this project, I explored various techniques to help me find words relevant enough to be featured in the index.
I came across one metric, word surprisal, that was of particular use in building the index. Surprisal refers to how "surprising" (uncommon) a word is relative to a corpus of text. The metric works well when you compare a word to a large corpus of text; in my case, I used a dataset of New York Times articles. I thought to myself: wouldn't it be cool to calculate surprisal for different articles via a web service? Then my mind jumped to the idea of a website that provides general NLP insights for a web page.
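The idea behind surprisal can be sketched in a few lines of Python. This is a minimal unigram version: a word's surprisal is the negative log of its probability in the reference corpus, so rarer words score higher. The toy corpus and the count-of-1 floor for unseen words are my own simplifications, not linguist.link's actual implementation.

```python
import math
from collections import Counter

def surprisal_scores(words, corpus_counts, corpus_size):
    """Surprisal of each word: -log2 of its unigram probability in the
    reference corpus. Rarer words score higher. Unseen words get a
    count of 1 as a crude smoothing floor."""
    return {
        w: -math.log2(corpus_counts.get(w, 1) / corpus_size)
        for w in set(words)
    }

# Toy reference corpus standing in for the New York Times dataset.
corpus = "the cat sat on the mat the cat slept".split()
counts = Counter(corpus)
scores = surprisal_scores(["the", "mat"], counts, len(corpus))
# "mat" (seen once) is more surprising than "the" (seen three times).
```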
Thus, the idea behind linguist.link was born.
linguist.link is a web application that calculates language statistics and insights using the text on a web page. The statistics supported at the time of writing this post are:
- Most surprising words (a curious way to find interesting words to expand your vocabulary!);
- Average reading time (calculated as the number of words in a post divided by 200 words per minute);
- Reading score (calculated using the Flesch-Kincaid readability formula);
- Most common words on the page;
- Most common bigrams (sequences of two words), trigrams (three words), and quadgrams (four words); and
- All words on a page with a gradient background depending on how surprising they are.
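The bigrams and trigrams in the list above are just sliding windows over the word sequence. Here is a small pure-Python sketch of counting them; linguist.link uses nltk for this, so this snippet is illustrative rather than the site's actual code.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count sliding windows of n consecutive words.
    zip over n staggered copies of the list yields each n-gram."""
    return Counter(zip(*(words[i:] for i in range(n))))

words = "the cat sat on the cat".split()
bigrams = ngram_counts(words, 2)
# ("the", "cat") appears twice; every other bigram appears once.
```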
To use linguist.link, prepend the following text to any URL:
For example, here is the linguist.link breakdown for this web page.
Note: If an article has a paywall, linguist.link will not work.
How it Works
When linguist.link starts, the application calculates word frequencies over a large corpus of New York Times articles, which serve as the reference for surprisal. When a user requests analytics for a URL, linguist.link retrieves the web page, cleans it to keep only the words relevant for analysis (a process referred to as stopword removal in NLP), then calculates the aforementioned statistics.
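The clean-then-count step can be sketched as follows. The tiny stopword set here stands in for nltk's full English stopword list, and the regex tokenizer is a simplification; neither is linguist.link's actual pipeline.

```python
import re
from collections import Counter

# Tiny stand-in for nltk's full English stopword list.
STOPWORDS = {"the", "a", "an", "is", "on", "of", "and", "to", "in"}

def clean(text):
    """Lowercase, tokenize on letter runs, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def most_common_words(text, k=3):
    """Top-k content words after stopword removal."""
    return Counter(clean(text)).most_common(k)

page = "The cat sat on the mat. The cat is a happy cat."
top = most_common_words(page)
# "cat" (3 occurrences) comes first.
```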
To retrieve the text on a web page, I am using a Python wrapper around Mozilla's
readability.js project, a standalone version of the codebase used to power Mozilla's Reader Mode. This mode aims to extract article body text to offer a simpler representation of a web page ideal for reading. This project was a great boon (hooray for open source!). I did not have to write any logic that extracts the body of the page. The Python wrapper for readability.js, maintained by the Alan Turing Institute, meant I could use readability.js in a Python application, my chosen tech stack due to the robust NLP toolset available for the language.
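That Python wrapper is the readabilipy package. A minimal usage sketch looks like the following; the HTML string is illustrative, and note that running readability.js itself requires Node.js, while `use_readability=False` falls back to a pure-Python parser.

```python
# Sketch of extracting article text with readabilipy, the Alan Turing
# Institute's Python wrapper around Mozilla's readability.js.
# Assumes `pip install readabilipy`; the HTML below is illustrative.
from readabilipy import simple_json_from_html_string

html = "<html><body><article><p>Hello world.</p></article></body></html>"

# use_readability=False uses the pure-Python fallback parser;
# True shells out to readability.js and requires Node.js.
article = simple_json_from_html_string(html, use_readability=False)

# The result is a dict with keys such as "title" and "plain_text",
# the latter being a list of {"text": ...} blocks.
for block in article["plain_text"]:
    print(block["text"])
```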
linguist.link makes heavy use of the nltk library for finding n-grams (e.g. bigrams) and the most common words on a page. To calculate the readability score, I make use of a Python library called
cmudict, which exposes the CMU Pronouncing Dictionary; from a word's pronunciation, you can derive its syllable count. This count is then used to implement the Flesch-Kincaid readability score.
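The Flesch-Kincaid grade level combines sentence length and syllables per word. Here is a self-contained sketch of the standard formula; the vowel-group syllable counter is a rough stand-in for the cmudict lookup the site actually uses, so its counts (and therefore scores) are approximate.

```python
import re

def naive_syllables(word):
    """Rough syllable count: contiguous vowel groups. A crude stand-in
    for looking the word up in cmudict, which is far more accurate."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Standard Flesch-Kincaid grade level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    Very simple text can score below zero."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(naive_syllables(w) for w in words)
    return (0.39 * (len(words) / sentences)
            + 11.8 * (syllables / len(words))
            - 15.59)

grade = flesch_kincaid_grade("The cat sat on the mat. The dog ran.")
```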
One of my favourite features on linguist.link is a heatmap that overlays word surprisal on every word on a web page. Each word is given a green background; the darker the background, the more surprising the word.
This section is a work in progress. Some words don't have a background because I need to improve my language preprocessing pipeline. Line breaks are not preserved from the original article.
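Mapping a surprisal score to a background shade can be sketched like this. The colour and the linear scale are illustrative assumptions, not linguist.link's actual CSS.

```python
def surprisal_to_css(score, max_score):
    """Map a surprisal score to a green background whose opacity grows
    with surprisal, so more surprising words get darker shades.
    (The exact colour scale is illustrative, not linguist.link's.)"""
    alpha = min(1.0, max(0.0, score / max_score))
    return f"background-color: rgba(0, 100, 0, {alpha:.2f})"

css = surprisal_to_css(3.2, 8.0)
# A word with surprisal 3.2 out of a max of 8.0 gets 40% opacity.
```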
I have plans to add more statistics to linguist.link. At the top of my mind is listing named entities on a page (i.e. people, organizations). For this, I plan to experiment with both nltk and BERT-based Named Entity Recognition (NER) models. If you have any suggestions for statistics to add to this project, feel free to email me at readers [at] jamesg [dot] blog to let me know. The source code for linguist.link is available on GitHub.
Comment on this post
Respond to this post by sending a Webmention.
Have a comment? Email me at firstname.lastname@example.org.