I have been building a search engine for my blog to make my content easier to find. I started with a simple program that indexed content so I could keep track of my posts. Next, I built a web interface to query the index so anyone could search it. Then I added more features to make the search engine more intuitive and better at discovering content. This has been a fun project because it involves two tasks I enjoy: planning technical solutions and then implementing them.
In this guide, I am going to explain, briefly, how my blog search engine works.
I initially got the idea of building my own blog search engine from petermolnar.net at a Homebrew Website Club meeting. Peter showed his text search engine and it got me thinking. The idea went to the back of my mind for a while until I noticed that DuckDuckGo, which I relied on for site search at the time, did not index a piece of my content for which I was searching. I decided to ask whether I could build my own search engine.
The main reasons I have my own blog search engine (versus relying on an existing service) are:
- I can index content on my own schedule so the search engine is more likely to be updated.
- I can present content in exactly the way that I want without relying on a user interface from a site like Google or DuckDuckGo.
There are secondary reasons, too, and ones that are just coming to me now. Having my own blog search engine lets me offer better privacy to those who want to search my site. I don't keep track of queries or any visitor data because that data doesn't play a role in how I rank my content. My content is ranked based on content factors which I discussed in my last blog post in this series.
There are three parts that make up my search engine:
- Crawling algorithm
- Indexing algorithm
- Searching interface and querying
All of these are parts of any search engine. I am not reinventing the wheel. Search engines need to crawl to discover content. They need to index to keep track of content. They need an interface through which people can submit a query. However, I want to talk about them in the context of my search engine, which doesn't rely on nearly as many factors as other search engines to work. I'll discuss the three parts mentioned above individually.
The crawling algorithm is the means through which I find content on my website. The algorithm starts with my robots.txt file. If a Sitemap: directive is found, I will add that sitemap (or those sitemaps) to the list of sitemaps to read. I then read all of the URLs in my sitemap and add them to a queue for indexing. Otherwise, I will go to the homepage of my blog and start crawling. If I have already crawled my site (the usual behavior), I will look in the database first so I can discover URLs I have already visited. This is important because it means that I can keep crawling URLs even if I remove them from my sitemap (unless they are marked as noindex, in which case I don't crawl them).
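As a rough illustration of that starting step (a sketch, not the exact code behind the search engine), the Python below reads Sitemap: directives from a robots.txt file, collects the URLs in each sitemap, and falls back to the homepage when no sitemap is declared. The function names, the `fetch` callable, and the `known_urls` argument are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def find_sitemaps(robots_txt: str) -> list[str]:
    """Return every URL declared with a Sitemap: directive in robots.txt."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Split on the first colon only so the "https://" in the value survives.
        name, _, value = line.partition(":")
        if name.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps


def urls_in_sitemap(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    # Sitemap elements are namespaced, so compare the local tag name only.
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def seed_queue(robots_txt, fetch, homepage, known_urls=()):
    """Build the initial crawl queue.

    fetch is a hypothetical callable that takes a URL and returns its body
    as text; known_urls stands in for URLs already stored in the database.
    """
    queue = list(known_urls)
    sitemaps = find_sitemaps(robots_txt)
    if sitemaps:
        for sitemap_url in sitemaps:
            queue.extend(urls_in_sitemap(fetch(sitemap_url)))
    else:
        queue.append(homepage)
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(queue))
```

Passing `fetch` in as an argument keeps the sketch free of network code, so it can be tried with canned strings before pointing it at a real site.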
The crawling algorithm then searches through pages on my blog. I look for all links and if I find a link that has not already been indexed, I add it to the indexing queue for later processing. I also search through all images. If I find an image, I will create a smaller version of that image (a thumbnail) and save it into a static assets folder. This powers my "image search".
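The link and image discovery described above could look something like the following Python sketch, which uses only the standard library. The class and function names are hypothetical, restricting the crawl to the page's own host is an assumption, and thumbnail generation is left out because it would need an image library.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkAndImageParser(HTMLParser):
    """Collect absolute link and image URLs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links and drop "#fragment" so one page
            # is not queued several times.
            url, _fragment = urldefrag(urljoin(self.base_url, attrs["href"]))
            self.links.append(url)
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def discover(html, page_url, indexed, queue):
    """Queue unseen same-site links and return the image URLs on the page."""
    parser = LinkAndImageParser(page_url)
    parser.feed(html)
    site = urlparse(page_url).netloc
    for url in parser.links:
        # Assumption: only pages on the same host belong in the index.
        if urlparse(url).netloc == site and url not in indexed and url not in queue:
            queue.append(url)
    return parser.images
```

The returned image URLs are where a thumbnailing step would pick up.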
The indexing algorithm is an extension of the crawler. In fact, I keep them in the same program file. The program looks for various pieces of information on a page, from the title to the meta description to all headings, and saves those pieces of information in a database. I spoke about all of my ranking factors in my last blog post about my search engine. I rely heavily on semantic HTML (using the right HTML tags for the job) and microformats, a form of structured data that lets me find information like the category associated with a blog post and the day on which the post was published.
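To make the idea concrete, here is a minimal Python sketch that pulls the title, meta description and headings out of a page using the standard library. It is an illustration under assumptions rather than the indexer itself: the class name, the field names, and the choice of `html.parser` are all hypothetical, and microformats parsing is not shown.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageFieldParser(HTMLParser):
    """Collect the title, meta description and headings of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.headings = []
        self._current = None  # tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()
        elif tag == "title" or tag in HEADINGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace left over from the markup.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        else:
            self.headings.append((tag, text))
        self._current = None


def extract_fields(html):
    """Return the pieces of a page that would be saved to the index."""
    parser = PageFieldParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "description": parser.description,
        "headings": parser.headings,
    }
```

Keeping the heading level alongside the text leaves room to weight an h1 differently from an h3 later on.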
Structured data and semantic HTML make my job as someone making a search engine for their blog a lot easier. Because structured data and semantic HTML are standardised within their respective guidelines, I know how to look for information. I am using semantic HTML and microformats on my blog so it's easy for me to find the content I then want to add to the index (basically a database with information about each page I have discovered). If you are going to build a search engine for your own blog, I'd recommend brushing up on your markup so it is semantically correct. It will make your job so much easier. (As a bonus, other search engines like Google will probably find it easier to crawl your site as a result.)
During indexing, I use TextRank to identify relevant keywords in an article so that a query a visitor submits is more likely to return a result. I wrote more about this in my blog post on TextRank and my search engine.
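For readers who have not met TextRank, the Python below is a heavily simplified sketch of the idea: build a graph of words that appear near each other, then run a PageRank-style iteration over it. It is not the implementation used by the search engine. The stopword list, window size, damping factor and function name are assumptions, and a fuller TextRank would also filter words by part of speech.

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "from", "are", "was"}


def textrank_keywords(text, top_n=5, window=4, damping=0.85, iterations=30):
    """Rank words by a PageRank-style score over a co-occurrence graph."""
    words = [
        w for w in re.findall(r"[a-z']+", text.lower())
        if len(w) > 2 and w not in STOPWORDS
    ]

    # Connect each word to the words that follow it inside the window.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Every node in the graph has at least one neighbour, so the
    # division below is safe.
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            for word in neighbours
        }

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Words that co-occur with many different words end up with the highest scores, which is why they make reasonable candidates for keywords to store alongside a post.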
Once I have all of the information I need, I add it to my database. I have two tables in the database: one for posts and one for images. I store these separately because they each contain different meta information. For instance, I index alt text for images but that is not appropriate for posts. I do save the URL on which an image was discovered so that I can easily run queries between the two tables.
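A layout along these lines could be sketched with SQLite as follows. The choice of SQLite, the table layout and every column name here are assumptions for illustration; the point is only that storing the page URL with each image makes it possible to join images back to the posts they appeared on.

```python
import sqlite3


def create_index(conn):
    """Create a posts table and an images table (hypothetical columns)."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS posts (
            url         TEXT PRIMARY KEY,
            title       TEXT,
            description TEXT,
            headings    TEXT,
            keywords    TEXT
        );
        CREATE TABLE IF NOT EXISTS images (
            src      TEXT PRIMARY KEY,
            alt_text TEXT,
            -- URL of the page on which the image was discovered
            post_url TEXT REFERENCES posts(url)
        );
        """
    )


def images_with_post_titles(conn):
    """Join each image to the post it was found on."""
    return conn.execute(
        """
        SELECT images.src, images.alt_text, posts.title
        FROM images
        JOIN posts ON posts.url = images.post_url
        ORDER BY images.src
        """
    ).fetchall()
```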
The searching interface is the part of the search engine that you will see. The crawling and indexing algorithms run every day (or on demand) behind the scenes. The information these algorithms find is queried by the searching interface. The "searching interface" is really just the search form on my search pages that lets you submit a query. This interface started off very simple, letting you look for a specific term in the index that appeared in the title, URL or meta description of a post. Now, I take into account title tags, headings, meta descriptions, URLs, and keywords when running a query.
The server behind the searching interface turns your text into an SQL query that is run either on my posts or images database table, depending on what you are querying. This involves quite a bit of work (just like everything) but it's all done behind the scenes.
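One simple way to turn text into an SQL query is sketched below in Python with SQLite. It is an assumption-laden example, not the query builder the search engine uses: the column names are hypothetical, every term must match at least one column, and no ranking is applied. Values are passed as parameters rather than pasted into the SQL string, which avoids SQL injection.

```python
import sqlite3

# Hypothetical searchable columns for each table.
SEARCHABLE = {
    "posts": ("title", "description", "headings", "url", "keywords"),
    "images": ("alt_text", "src", "post_url"),
}


def build_query(text, table="posts"):
    """Return (sql, params) matching rows that contain every search term."""
    # Looking the table up in SEARCHABLE means only known table names
    # ever reach the SQL string; anything else raises KeyError.
    columns = SEARCHABLE[table]
    clauses, params = [], []
    for term in text.split():
        clauses.append("(" + " OR ".join(f"{col} LIKE ?" for col in columns) + ")")
        # "%" and "_" inside a term are not escaped in this sketch.
        params.extend([f"%{term}%"] * len(columns))
    # An empty query matches nothing.
    where = " AND ".join(clauses) if clauses else "0 = 1"
    return f"SELECT * FROM {table} WHERE {where}", params


def search(conn, text, table="posts"):
    """Run the generated query against an open SQLite connection."""
    sql, params = build_query(text, table)
    return conn.execute(sql, params).fetchall()
```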
You can run advanced filters on searches like textrank category:"Coffee". At the time of writing, this does not return anything because I don't have any text that matches TextRank in the Coffee category on my blog. But if you changed the category filter to say "IndieWeb", a result would be returned. A list of these filters is available on my search engine homepage. They are designed to make it easier to find information that meets your needs.
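A query like this has to be split into plain search terms and filters before it can be run. The Python sketch below shows one possible way to do that with a regular expression. The pattern and the function name are assumptions, and it treats any word followed by a colon as a filter, so a real parser would probably check the name against a list of supported filters.

```python
import re

# Matches name:"quoted value" or name:value.
FILTER_PATTERN = re.compile(r'(\w+):(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a query into (search terms, filters)."""
    filters = {}
    for match in FILTER_PATTERN.finditer(query):
        name = match.group(1).lower()
        quoted, bare = match.group(2), match.group(3)
        filters[name] = quoted if quoted is not None else bare
    # Whatever is left after removing the filters is the free-text search.
    terms = FILTER_PATTERN.sub(" ", query).split()
    return terms, filters
```

For the example above, `parse_query('textrank category:"Coffee"')` gives the term list `["textrank"]` and the filter `{"category": "Coffee"}`, which could then become an extra condition in the SQL query.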
This post gives you a high-level overview of how my search engine works. I rely mainly on content for ranking because I control the search engine. I understand how the code I have written works (although I am still learning about how some external libraries work), so I am able to customise the search engine to my needs. I am even using the crawling algorithm to help me identify errors on my site, such as links that return 404s.
There is a lot more going on behind the scenes than I have written about in this article. I have written other blog posts on this topic which you can find by searching for search category:"Search Engine" on my blog search engine. I am not going to write about all of the logic because the search engine is quite complicated but I do enjoy sharing some information about how it works.
In short, my blog search engine is content-focused. It was designed to make it easier for you to find content on my website. I hope it achieves that goal. It certainly has for me. I use it quite frequently to find posts.
If you have any comments or feedback, I would love to hear from you. I am keen to improve the search engine so that it is relevant and as easy to use as possible and I'll be able to do that better with help from my readers.
Check out the other posts I have written as part of this series.