Scaling IndieWeb Search
Published under the IndieWeb category.

IndieWeb Search, a search engine that indexes sites owned by IndieWeb community members and other related sites, has saved over 421,000 web documents. I maintain IndieWeb Search as a passion project. I wanted a place to search content from IndieWeb sites so that I could find articles and guides that were applicable to the community. Using the search engine, I'm able to get direct answers to niche, community-relevant queries such as how to set up webmentions with a static site or what IRC is.
IndieWeb Search has grown a lot since I first started the project. Before the engine even existed, I used the crawler on which IndieWeb Search was based to index my own personal site. At the time, my site only had a few hundred pages. I learned the basics of web crawling and indexing content. I set up a working architecture for the search engine. Then I had to iterate when I realised my little search engine running on SQLite would not scale to indexing multiple websites. I moved to Elasticsearch and spent a lot of time beefing up the search crawler, improving its speed, precision, efficiency, and accuracy.
I am now at another one of those points where I need to ask myself: how can I efficiently scale the search engine?
IndieWeb Search is running on a server where a significant portion of memory is taken up by the search engine and its associated web applications. The crawler runs on the same server as the rest of the search engine. This means that crawling can now affect the stability of the search engine if I don't check in after a crawl has completed. I have a fast crawler, but at the moment I cannot run it as often as I would like, or automate it.
To solve this problem, I likely need to move the crawler to another server. This would give me room to crawl. To do so, I would have to copy the open source code over to a new server and start crawling. I could use scp to transfer search files to the main server. I used a similar setup back when everything was smaller and on two cheaper servers than the one I host on now.
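A minimal sketch of what that transfer step could look like if it were scripted in Python. The host name, paths, and function names here are hypothetical placeholders, not the ones IndieWeb Search uses; the only real tool involved is the standard scp command with its -r (recursive) and -C (compression) flags.

```python
import shlex
import subprocess
from pathlib import Path


def build_scp_command(local_path, remote_host, remote_dir):
    """Return the argv list for copying crawl output to a remote host."""
    # -r copies directories recursively; -C compresses data in transit.
    return ["scp", "-r", "-C", str(Path(local_path)), f"{remote_host}:{remote_dir}"]


def transfer_crawl_output(local_path, remote_host, remote_dir, dry_run=True):
    """Copy crawl output to the search server.

    With dry_run=True (the default) the command is only printed, which
    makes it easy to check the paths before anything is sent.
    """
    command = build_scp_command(local_path, remote_host, remote_dir)
    if dry_run:
        print(shlex.join(command))
        return command
    # check=True raises CalledProcessError if scp exits with a non-zero status.
    subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Hypothetical host and directories, for illustration only.
    transfer_crawl_output("crawl_output", "search@search.example", "/srv/search/incoming")
```

Run from a cron job on the crawl server with dry_run=False, something like this would let a crawl finish and hand its files over without touching the memory of the main search server until the files are imported.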
This does not solve the problem of memory usage. As anyone who has used Elasticsearch likely knows, 8 GB of RAM, the amount available on the IndieWeb Search server, does not go far once you start indexing lots of documents. For now, it is enough. But if a few hundred thousand more pages were added, I would likely need more RAM. Even now, more RAM would be useful. It would increase the reliability of the engine and stop the over-eager OOM killer from interfering if I ran any memory-intensive program on the search server. This is a real-world consideration because the index needs to be read to build the search link graph.
Adding more RAM to the server seems like the best option in the meantime. I don't expect the engine to need more than 16 GB of RAM at any point unless I were to massively expand the scope of the project. I could also set up another server. Then I would have to get into clustering and learn how to make two Elasticsearch instances work together. This is not impossible by any means. It's just a lot of work for this project.
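One way to reduce the risk of the OOM killer stepping in is to check how much memory is available before starting a memory-hungry job such as a link graph build. The sketch below assumes a Linux server, where /proc/meminfo exposes a MemAvailable field measured in kB. The function names and the sample figures are made up for illustration and are not measurements from the IndieWeb Search server.

```python
def parse_mem_available_kb(meminfo_text):
    """Return the MemAvailable value (in kB) from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # The line looks like: "MemAvailable:    2048000 kB"
            return int(line.split()[1])
    return None


def has_headroom(meminfo_text, required_mb):
    """Return True if at least required_mb megabytes are reported as available."""
    available_kb = parse_mem_available_kb(meminfo_text)
    if available_kb is None:
        # Be conservative when the field is missing.
        return False
    return available_kb >= required_mb * 1024


def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory summary (Linux only)."""
    with open(path, encoding="ascii") as handle:
        return handle.read()


if __name__ == "__main__":
    # Illustrative sample text, not real readings from any server.
    sample = (
        "MemTotal:        8000000 kB\n"
        "MemFree:          500000 kB\n"
        "MemAvailable:    2048000 kB\n"
    )
    print(has_headroom(sample, required_mb=1024))
    print(has_headroom(sample, required_mb=4096))
```

On the server itself, has_headroom(read_meminfo(), required_mb=...) could guard the start of the job. The threshold would have to come from observing how much memory the job actually uses.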
Another option is to stop using Elasticsearch in favour of my own index, in the hope that it uses less RAM. I have open sourced a barebones search index that can return search results quickly. I don't know if my search index could compete with Elasticsearch's performance. I am certain that it would take a massive amount of work to implement all of Elasticsearch's features, something I neither need nor want to do. But I think I have a good shot. To test whether my index could help scale the search engine, I would likely export the entirety of the search engine data, import it into my indexer, and record profiles for memory, CPU, and running times.
There are no easy answers to how I can scale IndieWeb Search. At this very minute, I don't need to do anything. The search engine technically works fine. But to achieve greater reliability and to improve the cadence at which new pages are added to the index, I will have to think through all of the points made above in more detail.
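To make the idea of recording profiles concrete, here is a rough sketch of a profiling harness built from Python's standard library. The toy inverted index in it is only a stand-in so the example runs on its own; it is not my open source index, and the names build_index, search, and profile are hypothetical. Note that tracemalloc only counts memory allocated through Python, so memory used by native extensions would need a different tool.

```python
import time
import tracemalloc
from collections import defaultdict


def build_index(documents):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # .get avoids creating empty entries in the defaultdict for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


def profile(func, *args):
    """Run func once and record peak Python memory, CPU time, and wall-clock time."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    stats = {
        "peak_bytes": peak_bytes,
        "cpu_seconds": cpu_seconds,
        "wall_seconds": wall_seconds,
    }
    return result, stats


if __name__ == "__main__":
    documents = {
        1: "webmentions with a static site",
        2: "what is IRC",
        3: "static site generators",
    }
    index, build_stats = profile(build_index, documents)
    hits, query_stats = profile(search, index, "static site")
    print(sorted(hits), build_stats, query_stats)
```

Running the same harness against an export of the real index data, for both the import step and a representative set of queries, would give numbers that could be compared against what Elasticsearch uses on the same machine.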
If you have any ideas on what I could do to scale IndieWeb Search's infrastructure, feel free to let me know by sending me an email at readers@jamesg.blog. If your site supports webmentions, you could even write up a comment or suggestion on your own site and send me a webmention to let me know that you have written a response to this article.
Tagged in search engines.