Creating an index for my personal website using NLP
Published under the IndieWeb category.

At Homebrew Website Club this week, we discussed book indexes (with the Chicago Manual of Style nearby as a reference, of course!). This got me thinking about what a web index might look like for a blog: a page formatted like a book index where you can see the articles that mention each unique concept on your blog. Book indexes are useful tools for navigating information: given a concept, the index tells you where to find out more. I was curious about what this might look like for a blog.
Well-written book indexes take a long time to produce. The Chicago Manual of Style notes that one could expect to spend three weeks or more working on an index for a book. Unlike books, however, materials on the World Wide Web change often. I could add more content to my blog; I could revise content. This makes sitting down and compiling a manual index impractical: I would have to update the index with new posts, and make changes whenever I revise existing content.
Thus, I thought about what an automated solution would look like. This solution should take all of the posts on my blog and produce an index in a similar style to a book. I experimented with this idea after Homebrew Website Club and, after discussions with a few people, I was able to produce an index in which one can look up concepts on my site and find the posts that reference them.
You can view the index on my Index page. This page is a work in progress as I experiment more with the rules that define the index.
Candidate selection
A good index should make it easy to find information associated with a concept in a work. For instance, I can go to the Chicago Manual of Style with a query like "footnote" and find an entry in the index that points me to the right place in the book. With that said, not every word is included in the index. There is a selection process. An index with too many words is unwieldy. An index with too few words leaves one with an incomplete impression about the contents of a work.
To select candidates for my index, I first downloaded a dataset of NYT articles from Kaggle. I then calculate the probability that each word appears in the corpus. From this, I calculate a metric called "surprisal" (self-information), which reflects how "surprising" it is that a given word comes up. I can use this metric to find uncommon words that can be used in the index.
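As a minimal sketch of the idea (not my exact code, and using a toy corpus rather than the NYT dataset), surprisal can be computed as the negative log of each word's corpus probability:

```python
import math
from collections import Counter

def surprisal_scores(corpus_words):
    """Compute surprisal (-log2 of corpus probability) for each word.

    Rarer words have lower probability, so they score as more
    "surprising" and are better index candidates."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}

# Toy corpus: "quartz" is rarer than "the", so it scores higher.
corpus = ["the", "the", "the", "cat", "sat", "quartz", "the"]
scores = surprisal_scores(corpus)
```

With a real corpus, common function words like "the" end up with low surprisal and drop out of consideration automatically.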
Ingesting content
With the surprisal of each word in the NYT corpus ready, I could start ingesting my content. I use the spacy Python library to tokenize content. This process splits sentences into "tokens", which are then used for processing the content. I remove tokens that are exclusively punctuation (e.g. full stops), tokens that start with punctuation (e.g. -ly), non-ASCII characters, and numbers. There are other rules I use, too. The outcome of this process is a set of words that I can pass through the surprisal index I built.
If a token like "-ly" got through, for example, it would probably register as surprising because it is uncommon, but the token isn't useful for inclusion in an index.
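A rough sketch of these filtering rules, written as plain Python over token strings rather than my actual spaCy pipeline, might look like this:

```python
import string

def keep_token(token: str) -> bool:
    """Approximate the filtering rules described above: drop tokens
    that are pure punctuation, start with punctuation, contain
    non-ASCII characters, or contain digits."""
    if not token:
        return False
    if all(ch in string.punctuation for ch in token):
        return False  # e.g. "."
    if token[0] in string.punctuation:
        return False  # e.g. "-ly"
    if not token.isascii():
        return False
    if any(ch.isdigit() for ch in token):
        return False
    return True

tokens = ["index", ".", "-ly", "café", "2024", "surprisal"]
kept = [t for t in tokens if keep_token(t)]
```

In practice spaCy exposes attributes like `is_punct` that make some of these checks one-liners.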
After cleaning content, I take the top 10 most surprising words that have a surprisal of over 8. This threshold is relative to the corpus with which one is working; 8 was a good value for filtering out words that aren't as interesting. I add the first five to the index that will be displayed on the final index web page. The next five are only added if they already appear in the index. I have applied these rules to help ensure the index does not get too big.
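The selection rules above can be sketched as follows (the words and scores here are made up for illustration):

```python
def select_candidates(scored_words, index, threshold=8.0):
    """Of the ten most surprising words above the threshold, keep the
    first five unconditionally; keep the next five only if they
    already appear in the index."""
    above = [(w, s) for w, s in scored_words if s > threshold]
    top10 = sorted(above, key=lambda pair: pair[1], reverse=True)[:10]
    selected = [w for w, _ in top10[:5]]
    selected += [w for w, _ in top10[5:] if w in index]
    return selected

scored = [("quixotic", 12.0), ("webmention", 11.0), ("indieweb", 10.5),
          ("surprisal", 10.0), ("homebrew", 9.5), ("memex", 9.0),
          ("chicago", 8.5), ("the", 2.0)]
picked = select_candidates(scored, index={"memex": []})
```

Here "the" falls below the threshold, and "chicago" is dropped because it is in positions six to ten but not already in the index.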
Aside: Ironically, the surprisal method helped me find typos. For example, indistringuisable was identified as a "surprising" word according to my heuristics, but the word contains a spelling error (an errant "r"). Indeed, I need to run a spell checker over my blog!
I also look for all named entities using nltk's Named Entity Recognition features. I chunk them together so that, for example, Taylor and Swift become Taylor Swift. All named entities are added to the index.
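The chunk-merging step can be sketched in isolation, assuming the NER stage has already labelled each token (the labels and sentence here are hypothetical, and this stands in for nltk's tree-based output):

```python
def merge_entity_chunks(tagged_tokens):
    """Merge consecutive tokens that share an entity label into one
    entity string, e.g. ("Taylor", "PERSON") followed by
    ("Swift", "PERSON") becomes "Taylor Swift". Tokens labelled None
    are not entities."""
    entities, current, current_label = [], [], None
    for word, label in tagged_tokens:
        if label is not None and label == current_label:
            current.append(word)
        else:
            if current:
                entities.append(" ".join(current))
            current = [word] if label is not None else []
            current_label = label
    if current:
        entities.append(" ".join(current))
    return entities

tagged = [("Taylor", "PERSON"), ("Swift", "PERSON"),
          ("visited", None), ("Edinburgh", "GPE")]
entities = merge_entity_chunks(tagged)
```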
The final format for the index is:
{"word": [("article title", "article url"), ...]}
I apply this information to an HTML template to create the index.
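A minimal sketch of that rendering step, using a hypothetical definition-list layout rather than my actual template, could be:

```python
from html import escape

def render_index(index):
    """Render the {"word": [(title, url), ...]} mapping as an HTML
    definition list, escaping text so titles can't break the markup."""
    parts = ["<dl>"]
    for word, articles in index.items():
        parts.append(f"<dt>{escape(word)}</dt>")
        for title, url in articles:
            parts.append(f'<dd><a href="{escape(url)}">{escape(title)}</a></dd>')
    parts.append("</dl>")
    return "\n".join(parts)

html = render_index({"webmention": [("Sending Webmentions", "/webmentions/")]})
```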
Cleaning the index
After calculating an index for all of my posts, I need to clean the index. This involves:
- Deduplicating words with different capitalizations;
- Deduplicating terms with different punctuation (e.g. U.S. vs. US); and
- Ordering the index from A to Z.
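These cleaning steps can be sketched together; this is a simplified version of the idea (merging on a case-folded, period-stripped key) rather than my exact rules:

```python
def clean_index(index):
    """Normalize index keys: strip periods (so "U.S." and "US"
    collapse), merge entries that differ only in capitalization,
    then sort the whole index from A to Z."""
    cleaned = {}
    for word, articles in index.items():
        key = word.replace(".", "")
        # Keep the first spelling seen; match duplicates case-insensitively.
        match = next((k for k in cleaned if k.casefold() == key.casefold()), None)
        if match is None:
            cleaned[key] = list(articles)
        else:
            cleaned[match].extend(a for a in articles if a not in cleaned[match])
    return dict(sorted(cleaned.items(), key=lambda kv: kv[0].casefold()))

cleaned = clean_index({"U.S.": [("A", "/a")], "us": [("B", "/b")],
                       "Apple": [("C", "/c")]})
```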
At the end, I have an index that is ready for display on a web page. I realised that navigating the index is quite difficult because there are a lot of terms, so I added letter landmarks (e.g. A, B, C) and linked to them at the beginning of the index to aid navigation. Thus, if I am looking for a term that begins with "I", I can jump straight to the I section without having to scroll past a lot of text.
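Grouping the sorted terms into those letter sections is straightforward; as a sketch (the "#" bucket for non-letter terms is my own assumption):

```python
def group_by_letter(sorted_terms):
    """Group an A-to-Z-sorted list of terms into letter sections for
    landmark navigation; terms not starting with a letter go under "#"."""
    sections = {}
    for term in sorted_terms:
        letter = term[0].upper() if term[0].isalpha() else "#"
        sections.setdefault(letter, []).append(term)
    return sections

sections = group_by_letter(["apple", "indieweb", "index", "webmention"])
```

Each section heading can then carry an `id` that the landmark links at the top of the page point to.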
The screenshot below shows some of the terms under the "I" section as an example of the index in action:
The titles are linked to the URL of the article, allowing one to navigate directly to the page on which a word was used. Book indexes refer to page numbers instead of section titles, but that was not an option for my blog since I do not have page numbers. Indeed, I could not think of a numeric identifier for each post that would be as elegant as listing the titles themselves.
Showing the titles had a cool side effect of providing some context about how each word is used. I did want my index to disambiguate word usage like a book index does (e.g. Homebrew Website Club might have sub-entries such as "origin of" or "meeting times"), but this proved a difficult challenge. I experimented with Word Sense Disambiguation (WSD), but that was not exactly what I was looking for: WSD tries to find the appropriate definition for a word in context, whereas I wanted to know how the word was used in context.
For this, I experimented with BERT-based models with some success. But running these models on my Mac took some time, and the results were not always of the best quality. I'm sure I could get better results with more experimentation, but after seeing the results of showing article titles, I found that titles were sufficient for my index. (I did try GPT-3.5 on one word in context; it returned a concise, index-like summary. But, again, showing titles worked well for my use case and achieved the outcome of making it easier to understand how a particular reference connects to the term under which it is listed.)
Responses
Comment on this post
Respond to this post by sending a Webmention.
Have a comment? Email me at readers@jamesg.blog.