Building a search engine for my blog
Published under the Blog Search Engine (Series) category.

TL;DR: I now have a search engine on my blog at search.jamesg.blog. Check it out!
A few months ago, I added a feature that let visitors to this site search for a page on the site. I used some JavaScript to craft a request to Google that would let you search my site using the "site:" search filter. I then started to think about whether this was the best approach, so I moved to DuckDuckGo. Still not fully satisfied with relying on a third party for search, I considered whether to create my own search engine for the blog. This was not a light decision because, as I shall discuss later, it would mean a big infrastructure change. But I made the leap and now I have a search engine for my blog.
The limitations of static sites
This site is built on Jekyll, a static site generator. Static websites such as those built with Jekyll feature pages that have already been written and/or generated. The two big advantages of this approach are speed and simplicity: pages load fast and I have a good understanding of how the site is generated. However, using a static site means that I cannot add dynamic features like search without relying on JavaScript.
Relying on JavaScript was not an option I was willing to explore for a search engine. It could be done, but it would be very hard to engineer an elegant solution that offered what I wanted in a search engine. I instead decided that building a web application hosted on a subdomain would be a better approach. I ended up deciding that search.jamesg.blog would be a dynamic site powered by Python Flask, a back-end web framework, and jamesg.blog would stay as is. I did not want to change my existing site because everything is just the way I like it. Jekyll is perfect for my use case. Using Flask on a server seemed like the best option available to me.
Building the index
All search engines keep an index. This refers to a record of pages that have been tracked. Google tracks pages on millions or billions of websites (most likely) but I don't need to do this for my search engine. I just need to index all of the pages on my site. To do this, I decided to keep things simple and rely on the sitemap.xml file I have on my blog, which is actually one way search engine crawlers—which build an index of a site—discover URLs.
To build an index, I used sqlite3, a simple database system that works well with Python and doesn't require too much work to set up. sqlite3 lets me store all of my index in one .db file that is stored with my project. My site isn't that large so this solution is fine. I then go through every page in my sitemap.xml file using BeautifulSoup so that I can discover all the URLs on my site. I then use the Python requests library to request each URL and get the following pieces of information about each page on my site:
- Meta description
- Title
Altogether, my index stores the URL, meta description, and title associated with each page on my site. This is all I need and is the data through which you can search when using my search engine.
I am using an SQLite extension called FTS5 to build the search index. This is important because it provides full-text search, making it easy for me to search for particular terms that appear in titles, meta descriptions, and URL slugs.
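For example, once the index is built, a full-text query against the posts table (created in the script below) can use FTS5's MATCH operator. This is a minimal sketch of what such a query can look like, with "coffee" standing in for whatever term a visitor types:

import sqlite3

connection = sqlite3.connect("search.db")
cursor = connection.cursor()

# MATCH searches every indexed column (title, description, url) at once
cursor.execute("SELECT title, description, url FROM posts WHERE posts MATCH ?", ("coffee",))

for title, description, url in cursor.fetchall():
    print(title, url)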
After building the first index, I realised that I would need to exclude some URLs (404 page, webmentions, and pagination pages). I wrote a simple "if" statement to make sure that I only indexed URLs that were not my 404 page, a webmention, or a pagination page.
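The exact check isn't reproduced in the script below, but it amounts to a simple test on each URL before it is indexed; the substrings here are illustrative rather than the exact values I match against:

# Skip URLs that should not appear in the index
# (404 page, webmentions, and pagination pages)
excluded_fragments = ("/404", "/webmention", "/page/")

for u in urls:
    if any(fragment in u.text for fragment in excluded_fragments):
        continue
    # otherwise, index the page as normal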
Here is the Python script that builds the index (some code that is not relevant to this post has been removed, but the script will still work):
from bs4 import BeautifulSoup
import requests
import sqlite3
import os

def build_index():
    # Build the new index in a separate file so the live search.db
    # stays available while the index is being rebuilt
    connection = sqlite3.connect("new_search.db")

    with connection:
        cursor = connection.cursor()
        cursor.execute("CREATE VIRTUAL TABLE posts USING FTS5(title, description, url)")

        # Discover every URL listed in the sitemap
        feed = requests.get("https://jamesg.blog/sitemap.xml")
        soup = BeautifulSoup(feed.content, "lxml")
        urls = soup.find_all("loc")

        for u in urls:
            # Request each page and read its title and meta description
            page = requests.get(u.text)
            page_desc_soup = BeautifulSoup(page.content, "lxml")
            meta_description = page_desc_soup.find("meta", {"name": "og:description"})

            try:
                cursor.execute(
                    "INSERT INTO posts VALUES (?, ?, ?)",
                    (page_desc_soup.title.text, meta_description["content"], u.text)
                )
            except Exception:
                # Pages without a title or meta description are skipped
                print("error with {}".format(u.text))

    # Swap the freshly built index into place for the web application
    os.replace("new_search.db", "search.db")
    print("done")

build_index()
This code:
- Imports the relevant libraries.
- Connects to the new_search.db database.
- Creates a new table called posts.
- Gets all URLs in my website sitemap.xml file.
- Iterates over each URL and makes a request to that URL.
- Saves the meta description, page title, and URL of each page in the sitemap into the posts table.
- Replaces the search.db file with the contents of the new_search.db file.
My script doesn't take too long to run but does still take some time because it makes web requests to each page on the site. This is not a problem for me though.
I create the new index in the new_search.db file, which is not consumed by the web application that powers the search engine. I then replace the search.db file, which is consumed by the search application, with the new_search.db file. This means that there should not be any search issues when I rebuild the index.
The index does not include the text from each post at the moment, although it could. I still need to decide whether or not I want to do this but for now you can search through post titles, URLs, and meta descriptions.
Building the web server
With a basic index ready, I started to craft a web server using Python Flask. I actually started with one endpoint that let me submit a form that then sent a POST request to my server and returned all URLs that matched a query in JSON format. This helped me get the logic right before I thought about the most intuitive way for visitors to interact with the site. One key part of this stage was working out how to clean whatever data a visitor submitted in the search form. I decided to filter out all characters aside from letters, numbers, and spaces, which should protect against any security issues with the form.
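Here is a rough sketch of that early prototype. The endpoint name and helper function are illustrative assumptions rather than the exact code I deployed:

import re
import sqlite3

from flask import Flask, jsonify, request

app = Flask(__name__)

def clean_query(raw_query):
    # Keep only letters, numbers, and spaces
    return re.sub(r"[^A-Za-z0-9 ]", "", raw_query)

@app.route("/search", methods=["POST"])
def search():
    query = clean_query(request.form.get("query", ""))

    if not query:
        return jsonify([])

    connection = sqlite3.connect("search.db")
    cursor = connection.cursor()
    cursor.execute("SELECT title, description, url FROM posts WHERE posts MATCH ?", (query,))
    results = [{"title": t, "description": d, "url": u} for t, d, u in cursor.fetchall()]

    return jsonify(results)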
After I had finished the basic form, I created some HTML based on my existing website so that the search engine would fit in. The HTML does not include the navigation bar, announcement bar, or footer from the jamesg.blog website, but it does include a new navigation bar that takes you back to jamesg.blog. I did not want to include the jamesg.blog navigation bar, announcement bar, and footer because they might change; if they did, I would also have to make those changes in the search HTML.
I wrote two pages:
- A homepage with a big search bar.
- A results page that displays results (which also has a search bar so you can write a new search query or edit your existing one).
To get the results page ready, I first worked to get data to show up on the page. Then I added the search bar so you can edit or write a new query, along with some other style rules so that the page looks more visually appealing.
After getting the search functionality ready, I configured my meta tags and ran a few tests to make sure I could retrieve the information I wanted from the search bar. One big limitation to note is that all search results pop up on the same page because I have not implemented pagination. I did not want to think through pagination logic for this project so I decided that showing all of the results from a query on one page would be fine.
You can make a search directly from the search.jamesg.blog homepage or you can make a search using the ?query= parameter on the /results page. Here's an example:
https://search.jamesg.blog/results?query=Coffee
The search bar on the homepage sends a GET request to the /results page with the query formatted in this way.
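Continuing the sketch above (reusing the same app object and clean_query helper, with the template name as an assumption), a simplified /results route might look something like this:

from flask import render_template

@app.route("/results")
def results():
    # The homepage form submits here as /results?query=<search term>
    query = clean_query(request.args.get("query", ""))

    connection = sqlite3.connect("search.db")
    cursor = connection.cursor()
    cursor.execute("SELECT title, description, url FROM posts WHERE posts MATCH ?", (query,))
    posts = cursor.fetchall()

    return render_template("results.html", query=query, posts=posts)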
The final search engine
The final search engine is available at search.jamesg.blog. If you type in "Building a" you should find this blog post, for example. You'll be able to search across the site for most pages, not just blog posts (although, as mentioned above, some pages are excluded, and images and static assets do not currently show up in the search engine). I have set up a scheduled task on PythonAnywhere, which I am using to host this project, so that the index is rebuilt every day. This means that the search index should be no more than one day out of date (which is fine since I usually only post new content once a day at most).
Let me know what you think about this project by sending me a webmention. I had a lot of fun building the search engine! I should close by saying that my interest in building my own search engine stemmed from a discussion we had at the London Homebrew Website Club a few weeks ago where one community member, petermolnar.net, shared his own search solution. Peter's project stayed in the back of my mind and made me think engineering my own search engine would be an option when I realised I didn't want to rely on third party services for search.
Tagged in search engine.
Other posts in this series
Check out the other posts I have written as part of this series.
Responses
Comment on this post
Respond to this post by sending a Webmention.
Have a comment? Email me at readers@jamesg.blog.