Weighing search results on my personal search engine
Published under the Blog Search Engine (Series) category.

As you might know, Google weighs many factors, each to a different extent, when ranking a web page. A keyword appearing in a title, for example, is a strong indicator that the article is relevant to that keyword. Thus, that article will be considered more relevant than, say, an article that only mentions the keyword once or twice. I will leave the exact logic behind Google to people who know more about this topic than I do.
Anyway, I was thinking about how my search results were ranked. Truthfully, I could not explain my ranking algorithm very well. I used the "ORDER BY rank" clause that comes with SQLite's full text search extension (FTS5), which lets you order results by relevance, but I hadn't really looked into whether there was a more effective way until now.
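For context, here is a minimal sketch of what ordering by rank looks like with FTS5, using Python's built-in sqlite3 module. The table layout and rows are made up for illustration and are not the real index; it also assumes the SQLite build behind Python has FTS5 enabled.

```python
import sqlite3

# In-memory database; the table and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE posts USING fts5(title, description)")
conn.executemany(
    "INSERT INTO posts (title, description) VALUES (?, ?)",
    [
        ("Brewing coffee at home", "Notes on grinders and brew ratios."),
        ("Weekend notes", "I walked to the market and bought coffee."),
        ("Building a search engine", "How I index my blog posts."),
    ],
)


def search(query):
    """Return the titles of matching posts, ordered by FTS5's built-in rank."""
    rows = conn.execute(
        "SELECT title FROM posts WHERE posts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [title for (title,) in rows]


print(search("coffee"))
print(search("search"))
```

With no extra configuration, every column counts equally towards the rank, which is the limitation the rest of this post works around.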
In this post I will share a few learnings from weighing search results on my personal search engine.
Adding weights for search data
When I started weighing the data I collect from each page, my index included the following details:
- Page title
- Page meta description
- URL
- Category
- Published date
- Keywords (as determined by TextRank, which I discussed in my last blog post)
As you can imagine, these values should be treated differently in terms of their ranking importance. Keywords appearing in URL slugs should not be ranked equally to keywords that appear in title tags. That's why I decided to look into how I could apply weights.
I experimented with a Python library but I then read that SQLite's full text search already implements the BM25 algorithm, a ranking algorithm that tries to determine the "best match" for a given query. I will not attempt to explain it but what matters is that the algorithm helps me rank information by what it deems important. I can also apply a weight to each column in the database.
To do this, I rewrote my search query:
SELECT title,
       highlight(posts, 1, '', '') AS description,
       url,
       category,
       published,
       keywords,
       bm25(posts, 10.0, 5.0, 1.0, 0.0, 0.0, 7.5) AS score
FROM posts
ORDER BY score;
This query is an excerpt of the actual query I use to find a document in my text search. The highlight() function lets me highlight instances of a keyword in a description. Other than that, everything looks like standard SQL until you get to bm25(). The bm25() function is part of FTS5 (SQLite's full text search extension, version 5) and returns a "score" by which you can order posts. Better matches get numerically lower scores, which is why a plain ORDER BY score puts the most relevant posts first.
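To make highlight() a little more concrete, here is a small sketch. The second argument is the zero-based column index (1 is the description in this two-column example table), and the last two arguments are the text placed before and after each match. The "[" and "]" markers, the table, and the row are placeholders chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE posts USING fts5(title, description)")
conn.execute(
    "INSERT INTO posts (title, description) VALUES (?, ?)",
    ("Search weights", "How I weigh search results on my blog."),
)

# Wrap every match found in column 1 (description) in square brackets.
row = conn.execute(
    "SELECT title, highlight(posts, 1, '[', ']') AS description "
    "FROM posts WHERE posts MATCH ?",
    ("search",),
).fetchone()

print(row)
# ('Search weights', 'How I weigh [search] results on my blog.')
```

On a web page the markers would more likely be an opening and closing HTML tag so that matches can be styled.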
I used this function to assign weights to each value, which are:
- Title: 10
- Description: 5
- URL: 1
- Category: 0
- Published date: 0
- Keywords: 7.5
Titles are given more weight than any other ranking factor. Meta descriptions are considered half as important as titles. Categories and published dates are not ranking factors. Keywords are given more weight than a meta description but less than a title. URLs are given a 1 so they are a ranking factor but they are not very influential.
These numbers are arbitrary but they let me implement the rules I want such as "title should be more important than meta description" and "URL should be a ranking factor but not influence results as much as other ranking factors."
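The effect of the weights can be sketched with a toy index. The example below uses the same six columns and the same weights as above, but the rows, URLs, and search term are invented for illustration. Note that bm25() needs a full-text query to score against, so the sketch adds a MATCH clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE posts USING fts5("
    "title, description, url, category, published, keywords)"
)

rows = [
    # "sqlite" appears only in the title (weight 10).
    ("SQLite full text search", "Notes on ranking.",
     "https://example.com/a", "Post", "2021-01-01", "ranking"),
    # "sqlite" appears only in the URL slug (weight 1).
    ("Weekly notes", "Things I read.",
     "https://example.com/sqlite-notes", "Post", "2021-01-02", "reading"),
    # "sqlite" appears only in the category (weight 0).
    ("Coffee log", "Things I brewed.",
     "https://example.com/c", "SQLite", "2021-01-03", "coffee"),
]
# Filler rows that never mention the search term.
rows += [
    (f"Filler {i}", "Nothing relevant here.",
     f"https://example.com/f{i}", "Post", "2021-02-01", "filler")
    for i in range(4)
]
conn.executemany(
    "INSERT INTO posts (title, description, url, category, published, keywords) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

# One weight per column, in the order the columns were declared.
results = conn.execute(
    "SELECT title, bm25(posts, 10.0, 5.0, 1.0, 0.0, 0.0, 7.5) AS score "
    "FROM posts WHERE posts MATCH ? ORDER BY score",
    ("sqlite",),
).fetchall()

for title, score in results:
    print(title, score)
```

The title match should come first, then the URL match. The category-only match is still returned, because a weight of 0 does not filter rows out, but it contributes nothing to the score and so sorts last.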
Improving the weights by saving headings
Up until now, my search engine did not have support for indexing headings. This was to save on space in the database. However, as I have been building this program I have gotten used to its performance and at my scale I am happy with how the indexing algorithm performs. I decided that I would index headings to help make my search results more relevant.
Headings are important because they are a signal as to what could be discussed in an article. "weights for search data" is in the first h2 of this article and conveys some context about what the article is about. Without indexed headings, if I were to search for "weights for search data," I might not get a result unless that specific term was identified as a keyword by TextRank.
I updated my crawler to index all headings that are in article tags (the main container for content on my blog posts) or headings in a div with the ID main. This ID is applied on every page, which means I can grab headings from a page even if an article tag is not present.
Here is the code I use to discover headings:
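from bs4 import BeautifulSoup

# `page` is the response object for the page being crawled (fetched elsewhere)
page_desc_soup = BeautifulSoup(page.content, "lxml")

# Prefer the h-entry article; fall back to the div with the ID "main"
page_text = page_desc_soup.find("article", {"class": "h-entry"})
if page_text is None:
    page_text = page_desc_soup.find("div", {"id": "main"})

# One list of heading texts per heading level
heading_info = {
    "h1": [],
    "h2": [],
    "h3": [],
    "h4": [],
    "h5": [],
    "h6": []
}

# Collect the text of every heading found at each level
for k, v in heading_info.items():
    for h in page_text.find_all(k):
        v.append(h.text)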
In short, this code:
- Gets the main text on a page.
- Creates a dictionary with each heading type as a key and maps a blank list to each key.
- For each key in the dictionary, finds all headings of that type (i.e. "h1" or "h2") in the page text and adds each heading's text to the list mapped to that heading type (so that h1s go in the h1 list and so on).
To support saving these headings in the index, I added new database columns for each type of heading and I updated my INSERT and UPDATE statements to support headings.
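Here is a rough sketch of how a heading dictionary like the one above could be stored and searched. It is not the real schema: the table is cut down to a title, a URL, and one column per heading level, the save_headings() helper is hypothetical, and joining each list with newlines is just one way to fit several headings into a single column.

```python
import sqlite3

HEADING_LEVELS = ["h1", "h2", "h3", "h4", "h5", "h6"]
HEADING_COLUMNS = ", ".join(HEADING_LEVELS)

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE VIRTUAL TABLE posts USING fts5(title, url, {HEADING_COLUMNS})")


def save_headings(title, url, heading_info):
    """Insert one page, storing each heading level as newline-joined text."""
    heading_values = ["\n".join(heading_info[level]) for level in HEADING_LEVELS]
    placeholders = ", ".join("?" for _ in range(2 + len(HEADING_LEVELS)))
    conn.execute(
        f"INSERT INTO posts (title, url, {HEADING_COLUMNS}) VALUES ({placeholders})",
        [title, url, *heading_values],
    )


heading_info = {
    "h1": ["Weighing search results on my personal search engine"],
    "h2": ["Adding weights for search data", "Wrapping up"],
    "h3": [], "h4": [], "h5": [], "h6": [],
}
save_headings("Weighing search results", "https://example.com/weights", heading_info)

# Double quotes inside the query string make FTS5 treat it as an exact phrase.
matches = conn.execute(
    "SELECT title FROM posts WHERE posts MATCH ?",
    ('"weights for search data"',),
).fetchall()

print(matches)
# [('Weighing search results',)]
```

The phrase only appears in an h2, so the page is found even though neither the title nor the URL contains it. Each heading column could then be given its own weight in bm25(), in the same way as the other columns.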
Wrapping up
These are the biggest updates I have made to the search engine ranking logic since the implementation of TextRank. With search weights, I can serve content that is more likely to be relevant to a searcher's intent by weighing various ranking factors differently. By indexing heading values, I have expanded the potential for an article to be found that meets a searcher's intent.
If you haven't already, you can give my search engine a go at search.jamesg.blog. Let me know what you think. Did a query not meet your intent? Tell me about it. Do you see room for improvement in the ranking algorithm? I want to know. I am keen to keep the discussion going about this project.
Tagged in search engine.