NB: Since writing this post, I have moved my blog search engine into IndieWeb Search. The functionality should be similar but not exactly the same. With that said, the logic below is still accurate and I hope will be interesting to you.
I have been working on a feature for my blog search engine that aims to directly answer a question you have. This is a feature that many modern search engines like Google have. Google calls direct answers "featured snippets" wherein the search engine serves you content directly from a page to address your query. This way of rendering search results makes it easier for someone to find exactly what they want to know. If you want to know "What is the capital of Timor Leste?," it's more convenient to see that as the first search result rather than having to click on an article or two to find your answer.
I wanted to add a direct answer feature to my blog so that it is easier to find the content for which you are looking on a blog post. Rather than having to search through a blog post for an answer, you can find the answer to the question you have, the context surrounding that answer, and a link to a page where you can find out more information. In short, I see direct answers as a way to help you find answers to questions you have in less time than you otherwise would. And I see direct answers as a way to help point you to the most relevant article my search engine can serve you.
In this article I am going to briefly explain how my direct answers work.
My search engine crawls (reads) my site every day and saves all web pages into a database. This database is called the "index". I preserve the HTML of each web page, which means I can later read the structure of each page in the index. This is important: without indexing the HTML, I could not use the semantics of a web page to find information. Search engines generally work this way. They use the HTML tag in which information is found to determine what that information means. Information in a <table> tag is a table. Information in an <h1> tag is a main heading. I use all of the information I index to process queries that people make on my search engine.
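To make the idea of reading semantics from stored HTML concrete, here is a minimal sketch. It uses Python's standard-library html.parser so that it runs without extra dependencies; that choice, the class name TagTextCollector, and the function name text_by_tag are all assumptions for illustration, not a description of my actual indexer.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside each tag of interest (hypothetical helper)."""

    def __init__(self, tags_of_interest):
        super().__init__()
        self.tags_of_interest = set(tags_of_interest)
        self.current_tag = None  # tag of interest we are currently inside, if any
        self.buffer = []         # text pieces seen inside the current tag
        self.found = []          # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Nested tags of interest are ignored to keep the sketch simple.
        if self.current_tag is None and tag in self.tags_of_interest:
            self.current_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.found.append((tag, "".join(self.buffer).strip()))
            self.current_tag = None


def text_by_tag(html, tags_of_interest=("h1", "h2", "strong")):
    """Return (tag, text) pairs for every tag of interest in the stored HTML."""
    parser = TagTextCollector(tags_of_interest)
    parser.feed(html)
    parser.close()
    return parser.found


page = "<h1>An example heading</h1><p>Intro. <strong>A question?</strong></p>"
print(text_by_tag(page))
```

Because the tag name travels with the text, later steps can treat a heading differently from a bolded question or an ordinary paragraph.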
Let's say you want to learn about the origin of Little Fitzroy's name. You might do so by searching "Little Fitzroy name origin".
Previously, my search engine would give you a link to my interview with Little Fitzroy, and you would then have to skim the page to find what you were looking for. That's where my direct answering logic comes in.
Now let me explain what happens with my direct answer logic.
NB: The logic in this section is simplified to the main points. I do not explain every part of my direct answer logic because it is quite complicated and I have written this article as a high-level overview.
Suppose you have searched "Little Fitzroy name origin"... Let's talk about what happens when you do this.
My search engine first finds all of the articles that match your query (i.e. those that mention Little Fitzroy) and then orders the results based on factors like where the text for which you are searching appears on the page (is the text in an h1? is it in the page title?). Then, behind the scenes, my search engine removes from your query all of the words that appear in the title of the top-ranking page. I do this because the top-ranking page is likely to be the most relevant to the user, and it's unlikely that every word in a user's query will appear exactly in a document. The phrase "Little Fitzroy name origin" does not appear in my interview with Little Fitzroy, but that by no means says that the article does not include information on this query.
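The ordering step can be sketched roughly as follows. This is only an illustration: the field names (title, h1, body), the weights, and the function names score_page and rank_pages are hypothetical, and the real ranking considers more factors than this.

```python
def score_page(page, query_terms):
    """Score one page: a match in the title or h1 counts for more than one in the body."""
    # Hypothetical weights, chosen only for illustration.
    weights = {"title": 3, "h1": 2, "body": 1}
    score = 0
    for field, weight in weights.items():
        words = set(page.get(field, "").lower().split())
        score += weight * sum(1 for term in query_terms if term in words)
    return score


def rank_pages(pages, query):
    """Return the pages that match at least one query term, best match first."""
    query_terms = query.lower().split()
    scored = [(score_page(page, query_terms), page) for page in pages]
    matching = [pair for pair in scored if pair[0] > 0]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [page for _, page in matching]


# Made-up pages used only to exercise the sketch.
pages = [
    {"url": "/week", "title": "My week", "h1": "My week", "body": "I visited Little Fitzroy"},
    {"url": "/chat", "title": "Coffee Chat with Cathryn from Little Fitzroy",
     "h1": "Coffee Chat with Cathryn from Little Fitzroy", "body": "the origin of the name"},
]
print([page["url"] for page in rank_pages(pages, "Little Fitzroy name origin")])
```

The first page returned by rank_pages plays the role of the "top-ranking page" in the rest of this walkthrough.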
The title of the top-ranking article in this case is "Coffee Chat with Cathryn from Little Fitzroy".
If I remove all of these words from my original question, I am left with "name origin". So I now have two crucial pieces of information: a page to search (my interview article with Little Fitzroy), and the term for which to search ("name origin"). With this information, I can get to work addressing the user's query. I remove all stopwords (e.g. "and", "if", "but") from the query because they give me very little information that can be used to answer a question. Then I search in various places on the top-ranking page for each of the keyword terms. These places include headings and <strong> tags (which I use in my interviews to mark up questions).
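Reducing the query to its remaining keywords can be sketched like this. The stopword list here is a tiny illustrative one and the function name remaining_keywords is hypothetical; a real list of stopwords would be much longer.

```python
# A small illustrative stopword list (an assumption, not the list my search engine uses).
STOPWORDS = {"and", "if", "but", "the", "a", "an", "of", "to", "is"}


def remaining_keywords(query, title, stopwords=STOPWORDS):
    """Drop query words that appear in the page title, then drop stopwords."""
    title_words = {word.lower() for word in title.split()}
    keywords = []
    for word in query.split():
        lowered = word.lower()
        if lowered in title_words or lowered in stopwords:
            continue
        keywords.append(lowered)
    return keywords


print(remaining_keywords(
    "Little Fitzroy name origin",
    "Coffee Chat with Cathryn from Little Fitzroy",
))
# prints ['name', 'origin']
```

Comparing in lowercase means "Fitzroy" in the query still matches "Fitzroy" in the title regardless of how either was typed.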
My search engine looks for the terms "name" and "origin" on the Little Fitzroy interview article and finds a <strong> tag that says "Why did you decide to open Little Fitzroy? What is the origin of the name?" This question contains both "origin" and "name", the keywords for which I am searching.
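Here is a minimal sketch of that matching step. It assumes the text of each heading and strong tag has already been pulled out of the page into a list of strings, and it assumes a candidate only counts if it contains every keyword; both are simplifications, and the name find_matching_text is hypothetical.

```python
import re


def find_matching_text(candidates, keywords):
    """Return the first candidate string that contains every keyword as a whole word."""
    for text in candidates:
        # Split into lowercase words so that "name?" still matches "name".
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        if all(keyword in words for keyword in keywords):
            return text
    return None


candidates = [
    "Coffee Chat with Cathryn from Little Fitzroy",
    "Why did you decide to open Little Fitzroy? What is the origin of the name?",
]
print(find_matching_text(candidates, ["name", "origin"]))
```

Matching on whole words rather than substrings avoids false hits such as "name" inside "tournament".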
My search engine then uses BeautifulSoup, an HTML processing tool built for Python, to find information that is contextually related to the tag in which the intent is addressed. In this case, the information is in a <strong> tag, which means that the relevant context is: (i) the text in the <strong> tag itself; (ii) probably the text below it too. My program gathers both of these pieces of information and leaves me with this text:
Why did you decide to open Little Fitzroy? What is the origin of the name?
Simply put, I wanted there to be somewhere to get a good cup of coffee in my neighbourhood. Not that there weren’t many wonderful cafes in the area with amazing food, drink, and ambience. But I always thought that not being able to get a good cup of coffee early in the morning was always something that was lacking in Scotland. I’m pretty proud to look at my morning custom and think that I’ve provided something that other people were looking for too.
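Gathering the question and the text below it could be sketched as follows. My search engine uses BeautifulSoup for this; the sketch below swaps in Python's standard-library html.parser so that it runs without dependencies, and it assumes the answer is the first run of ordinary text after the matching strong tag. The names SegmentCollector and answer_with_context are hypothetical.

```python
from html.parser import HTMLParser


class SegmentCollector(HTMLParser):
    """Split a page into ordered segments: text inside <strong> and text outside it."""

    def __init__(self):
        super().__init__()
        self.in_strong = False
        self.segments = []  # ordered list of (kind, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        text = data.strip()
        if text:
            kind = "strong" if self.in_strong else "text"
            self.segments.append((kind, text))


def answer_with_context(html, question):
    """Return (question, following text), or None if the question is not on the page."""
    parser = SegmentCollector()
    parser.feed(html)
    parser.close()
    for index, (kind, text) in enumerate(parser.segments):
        if kind == "strong" and text == question:
            following = parser.segments[index + 1:]
            # Only accept ordinary text that comes directly after the question.
            if following and following[0][0] == "text":
                return question, following[0][1]
            return question, None
    return None


# Made-up markup used only to exercise the sketch.
page = (
    "<p><strong>First question?</strong></p><p>First answer.</p>"
    "<p><strong>Second question?</strong></p><p>Second answer.</p>"
)
print(answer_with_context(page, "Second question?"))
```

One limitation of this sketch is that an answer containing inline tags such as links would be split into several segments, so only the text before the first inline tag would be returned.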
My search engine has found an answer to the query. Now that the answer has been found, I serve it up at the top of the search results page. Here is what my search results page looks like for this query:
I cannot guarantee that the search engine will answer your query. My search engine can only be used for searching my site and the content I have here is limited compared to a big search engine. With that said, I do have enough content that the search engine is likely to give you at least one direct answer every so often.
My search engine makes a number of assumptions and is built specifically around how I mark up blog posts. I don't support reading tables because I rarely use them on my blog. I do not process synonyms as doing so would take a large amount of work and I don't think this feature would add much value to my search engine. This is all fine because I built this search engine for me and to help visitors find information on my site.
I would love for you to try out my search engine at search.jamesg.blog and tell me what you think. Did you get a direct answer to your question? Were you satisfied with the answer? Did something appear differently to how you expected? Was your query met with a result that did not match your intent? The feedback you provide will help me make my search engine better for everyone using it.
Check out the other posts I have written as part of this series.