Building a search engine for my blog: Part II
Published under the Blog Search Engine (Series) category.

I recently launched a search engine for my blog, allowing visitors to find pages on my website. The search engine lets you search through page "h1"s (a new change, previously the engine supported "title" tags instead), meta descriptions, and URLs to find the resources that you need. For instance, you could search "Aeropress" and get a list of all of my Aeropress-related posts.
The first version of the search engine met a key need: allowing me to find resources on my site. Whereas external search engines crawl and index at their own cadence—over which I have little control—my search engine lets me reindex my blog whenever I want. But then I started to think about how I could extend the project.
I decided on two key features I wanted to add:
- Image search
- Advanced filtration options
In this post, I'll briefly chat through how I built these features and what they mean to me.
Image searching
I have a lot of images on my blog, at least one hundred. I add an image to every blog post I write and have been doing so for months. I sometimes need to retrieve a specific image, for example to reuse it or to share it with someone else. I decided to create an index of images so I could find them easily.
To do this, I had to create a new table in my search index: images. This table stores four values: the URL of the page on which the image is used, the alt text of the image, the image source, and the date on which that page was published (if applicable).
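As a rough sketch, a table holding those four values could be created like this with Python's built-in sqlite3 module. The use of SQLite, the column names, the example.com URLs, and the in-memory database are all assumptions made for illustration; they are not the exact schema of the search index.

```python
import sqlite3

# In-memory database for illustration; a real index would live in a file.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Column names are hypothetical; the post only names the four values.
cursor.execute(
    """
    CREATE TABLE images (
        page_url TEXT,   -- page on which the image is used
        alt_text TEXT,   -- alt text of the image
        image_src TEXT,  -- full URL of the image file
        published TEXT   -- publication date of the page (YYYY-MM-DD), or NULL
    )
    """
)

# Insert one row without a publication date to show the "if applicable" case.
cursor.execute(
    "INSERT INTO images VALUES (?, ?, ?, ?)",
    ("https://example.com/post/", "An Aeropress on a table", "https://example.com/aeropress.jpg", None),
)
connection.commit()

row = cursor.execute("SELECT page_url, alt_text, image_src, published FROM images").fetchone()
```

Storing the date as YYYY-MM-DD text keeps it sortable and comparable as a plain string, which is convenient for date filters.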
I had to consider how I would actually discover images to add to the index. My first thought was to create an image sitemap but I realised that I already request all the pages on my blog anyway during indexing. The images I would want to show up are those that are used on the blog. I decided that I would find all images on every page I index rather than create a new image sitemap on my blog.
(There are advantages to having an image sitemap such as being able to include all images on my blog, even those that are not used, but I did not take that approach for this version of the search engine.)
After writing some code to index the images, I realised that there were issues with duplicate images. Some images appear on multiple pages and I don't want them to appear more than once in the index. For instance, I have Webmention and IndieWeb buttons at the bottom of every page. Without a duplication check, I ended up with hundreds of these images indexed.
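Here is the logic for my image indexing:
images = page_desc_soup.find_all("img")
for i in images:
    try:
        # Only index an image the first time it is seen
        if i["src"] not in image_urls:
            image_urls.append(i["src"])
            if published_on:
                # Keep only the date part (YYYY-MM-DD) of the datetime value
                cursor.execute("INSERT INTO images VALUES (?, ?, ?, ?)", (u.text, i["alt"], "https://jamesg.blog" + i["src"], published_on["datetime"].split("T")[0]))
            else:
                # No published date found: store NULL so all four columns are filled
                cursor.execute("INSERT INTO images VALUES (?, ?, ?, ?)", (u.text, i["alt"], "https://jamesg.blog" + i["src"], None))
    except Exception:
        # i.get() avoids a second error when the src attribute itself is missing
        print("error with processing {} image".format(i.get("src")))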
Briefly, this code:
- Finds all images on a page.
- Iterates over all images and checks if I have already indexed the image. image_urls is a blank list declared earlier in my code that is populated with image URLs as I iterate over them.
- Adds the image to the image_urls list if it is not already there.
- Inserts the image into the images table. A published date is added if one can be found on the page.
- If there is an error, a print() statement is run informing me there is an error.
There is more to the indexing than what is shown above but the code you can see should give you an idea as to how the image indexing works.
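The duplicate check described above can also be sketched without any third-party dependencies. The following is a minimal, self-contained illustration, not the code the search engine runs: it swaps BeautifulSoup for Python's standard html.parser, uses a set rather than a list for the "already seen" check, and uses urljoin to build absolute image URLs. The class name, function name, and example.com base URL are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects (src, alt) pairs for every <img> tag on a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            # Skip images with no usable src attribute
            if attributes.get("src"):
                self.images.append((attributes["src"], attributes.get("alt") or ""))


def new_images(html, base_url, seen):
    """Return images on the page that are not already in `seen`, updating `seen`."""
    collector = ImageCollector()
    collector.feed(html)
    found = []
    for src, alt in collector.images:
        absolute_src = urljoin(base_url, src)
        if absolute_src not in seen:
            seen.add(absolute_src)
            found.append((absolute_src, alt))
    return found


# The same set is shared across pages so repeated images are only indexed once.
seen_images = set()
page_one = '<img src="/coffee.jpg" alt="Coffee"><img src="/button.png" alt="Webmention">'
page_two = '<img src="/button.png" alt="Webmention"><img src="/tea.jpg">'

first = new_images(page_one, "https://example.com", seen_images)
second = new_images(page_two, "https://example.com", seen_images)
```

Here the button image that appears on both pages is returned only for the first page. A set keeps the membership check fast as the number of indexed images grows, and urljoin copes with image sources that are already absolute URLs.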
Then I had to support image searching on the front-end. To do this, I added support for a ?type=image query parameter which, when used, lets you search through images on my blog. I added two tabs to the search engine that let you easily navigate between searching for a post and an image, just like what you would see on Google when you run a search.
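To illustrate the idea, here is a small sketch of how a ?type=image parameter could be read from a URL using only Python's standard library. The function name, the "query" parameter, and the choice to fall back to a post search for any other value are assumptions; the post does not say which web framework handles this on the real site.

```python
from urllib.parse import parse_qs, urlsplit


def search_type(url):
    """Return "image" when ?type=image is present, otherwise "post"."""
    params = parse_qs(urlsplit(url).query)
    # parse_qs returns a list of values per key; take the first one
    requested = params.get("type", ["post"])[0]
    return "image" if requested == "image" else "post"


image_search = search_type("https://example.com/results?query=aeropress&type=image")
post_search = search_type("https://example.com/results?query=aeropress")
unknown_search = search_type("https://example.com/results?query=aeropress&type=video")
```

Falling back to a post search for unknown values means a mistyped parameter still returns results instead of an error page.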
Here is how the search bar appears on a search results page on my search engine:
With all of this code written, I could now do a search like "Aeropress" to find all of the images whose alt text, URL, or image source contain "Aeropress".
[Image: Image searches]
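A search over alt text, page URL, and image source like the one described above could be sketched with a simple LIKE query. This is only an assumption about the approach: the post does not say whether the real search uses LIKE, full-text search, or something else, and the table layout, column names, function name, and sample rows below are invented for the example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE images (page_url TEXT, alt_text TEXT, image_src TEXT, published TEXT)")
cursor.executemany(
    "INSERT INTO images VALUES (?, ?, ?, ?)",
    [
        ("https://example.com/aeropress-recipe/", "Brewing with a plunger", "https://example.com/a.jpg", "2021-06-01"),
        ("https://example.com/tea/", "A cup of tea", "https://example.com/tea.jpg", None),
        ("https://example.com/kit/", "My Aeropress and grinder", "https://example.com/kit.jpg", None),
    ],
)


def search_images(cursor, term):
    """Return rows whose page URL, alt text, or image source contain `term`."""
    pattern = "%" + term + "%"
    return cursor.execute(
        "SELECT page_url, alt_text, image_src FROM images "
        "WHERE alt_text LIKE ? OR page_url LIKE ? OR image_src LIKE ? "
        "ORDER BY image_src",
        (pattern, pattern, pattern),
    ).fetchall()


results = search_images(cursor, "Aeropress")
```

SQLite's LIKE is case-insensitive for ASCII characters by default, so "Aeropress" matches both the alt text of one image and the lowercase page URL of another.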
Advanced filtration options
I do not have category pages for every category on my blog. Instead, I have a list of all my posts, grouped by category. This works best for me because I do not want to generate numerous category pages every time I build my blog. But I also wanted to be able to search for posts in a category through my search engine so that I could find posts in that category that match a particular query.
The goal was to support syntax like this:
aeropress category:"Coffee"
The syntax is similar to that of Google's advanced search operators.
I decided to support three advanced filtration options:
- category: Get posts in a category.
- before: Get posts published on or before a specified date.
- after: Get posts published after a specified date.
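One way to pull these three operators out of a query is with a regular expression. The sketch below is a hypothetical illustration rather than the parser the search engine uses: the pattern, the function name, and the choice to accept both quoted and unquoted values are assumptions.

```python
import re

# Matches category:"Coffee", category:Coffee, before:2021-01-01, after:2021-01-01
FILTER_PATTERN = re.compile(r'\b(category|before|after):(?:"([^"]*)"|(\S+))')


def parse_query(raw_query):
    """Split a raw query into free-text search terms and a dict of filters."""
    filters = {}
    for match in FILTER_PATTERN.finditer(raw_query):
        name = match.group(1)
        # Group 2 holds a quoted value, group 3 an unquoted one
        value = match.group(2) if match.group(2) is not None else match.group(3)
        filters[name] = value
    # Remove the filter expressions, then collapse leftover whitespace
    terms = " ".join(FILTER_PATTERN.sub(" ", raw_query).split())
    return terms, filters


terms, filters = parse_query('aeropress category:"Coffee"')
date_terms, date_filters = parse_query("coffee before:2021-06-01 after:2021-01-01")
```

Separating the filters from the free-text terms early means the rest of the search code only has to deal with a plain string and a small dictionary.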
To add support for these filtration options, I wrote some code that identifies the advanced filtration options I wanted to support. Then I wrote some code that would make the requisite changes to my database query. For instance, if you decided to search for posts published before a date, my database query would include:
AND published <= date(?)
This lets me find posts published before or on a particular date. I wrote the requisite syntax to support finding posts in a particular category and posts published after a particular date too. There is also a lot of cleaning that goes on in the background so that only the characters needed to interpret a query are supported. The entire search query is cleaned before any database queries are made.
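To show how such clauses might be assembled, here is a minimal sketch that builds a parameterised query from a dictionary of filters. Only the published <= date(?) fragment comes from the post. The posts table, its other columns, the title LIKE match, the function name, and the inclusive >= used for the after filter are assumptions for illustration.

```python
import sqlite3


def build_post_query(terms, filters):
    """Build a parameterised SQL query and its arguments from parsed filters."""
    sql = "SELECT url FROM posts WHERE title LIKE ?"
    args = ["%" + terms + "%"]
    if "category" in filters:
        sql += " AND category = ?"
        args.append(filters["category"])
    if "before" in filters:
        sql += " AND published <= date(?)"
        args.append(filters["before"])
    if "after" in filters:
        # Inclusive comparison is an assumption; the post does not specify it
        sql += " AND published >= date(?)"
        args.append(filters["after"])
    return sql + " ORDER BY published", args


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE posts (url TEXT, title TEXT, category TEXT, published TEXT)")
connection.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [
        ("/aeropress-recipe/", "My Aeropress recipe", "Coffee", "2021-03-10"),
        ("/aeropress-travel/", "Travelling with an Aeropress", "Coffee", "2021-08-02"),
        ("/aeropress-code/", "Aeropress timer script", "Coding", "2021-04-01"),
    ],
)

sql, args = build_post_query("aeropress", {"category": "Coffee", "before": "2021-06-30"})
urls = [row[0] for row in connection.execute(sql, args)]
```

Every user-supplied value is passed as a ? parameter rather than concatenated into the SQL string, so only the fixed clause text is ever appended to the query.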
The following line of code lets me remove every character in a search query that is not alphanumeric or a space:
cleaned_value_for_query = ''.join(e for e in query_with_handled_spaces if e.isalnum() or e == " ")
"query_with_handled_spaces" refers to the original query made by a visitor which is handled in a certain way at the start of the search process.
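A quick worked example shows what this cleaning step does to a couple of inputs. The wrapper function name and the sample queries are made up for illustration; the expression inside is the same as the line above.

```python
def clean_query(query):
    """Keep only alphanumeric characters and spaces."""
    return "".join(e for e in query if e.isalnum() or e == " ")


# Punctuation such as semicolons and dashes is dropped
cleaned_plain = clean_query("aeropress; DROP TABLE posts--")

# Colons and quotes are dropped too, so filter syntax does not survive cleaning
cleaned_filter = clean_query('aeropress category:"Coffee"')
```

The first call yields "aeropress DROP TABLE posts" and the second yields "aeropress categoryCoffee". Because the colon, quotes, and the hyphens in dates are all stripped, the filtration options would presumably need to be extracted before this step runs, which fits with the query being handled at the start of the search process.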
During testing, I realised there was an issue with searching for images by category: I do not index the categories of the pages on which images are discovered. This was intentional. As a result, I wrote some code that ensures category filters in image queries are handled gracefully, since they are not supported.
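The post does not say exactly how the unsupported filter is handled, so the following is only one possible approach: drop the category filter for image searches and return a notice that can be shown to the visitor. The function name, the notice text, and the filter dictionary shape are hypothetical.

```python
def filters_for_search(search_type, filters):
    """Drop filters that the given search type cannot honour.

    Returns the usable filters and a list of notices to show the visitor.
    """
    usable = dict(filters)  # copy so the caller's dictionary is not modified
    notices = []
    if search_type == "image" and "category" in usable:
        del usable["category"]
        notices.append("Category filters are not supported for image searches.")
    return usable, notices


image_filters, image_notices = filters_for_search(
    "image", {"category": "Coffee", "before": "2021-06-30"}
)
post_filters, post_notices = filters_for_search("post", {"category": "Coffee"})
```

With this approach an image search with a category filter still runs using the remaining filters, instead of failing or silently returning nothing.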
Here is an example search that uses the advanced filtration options:
All of the above advanced filtration options are documented on a dedicated page on the search engine.
Wrapping up
With all of this code written, I now have a more functional search engine. My search engine lets you find a blog post or page on my site, use advanced filtration options during your search, and search for images. I am happy with how this project has turned out and I can see myself using the search engine quite a bit if I am looking for a post or an image.
Tagged in search engine.