Parse IndieWeb Search in 15 lines of Python code
Published on under the IndieWeb category.Toggle Memex mode

IndieWeb Search was intentionally designed to be open. All search result pages are structured as a microformats h-feed. This means you can use any microformats parser to read a search results page and extract information about each search result. Every search page also comes with jf2, RSS, and JSON Feed versions. You can also parse these programatically to get information about search results. Or you could even subscribe to a search page in a feed reader that supports any of the aforementioned formats.
Offering programmatic access to the search engine means that anyone can build interfaces and experiences upon the existing index without having to do any web crawling or indexing. For example, I recently used IndieWeb Search to add a "discover" feature to my feed reader. This feature lets me query IndieWeb Search to discover new blogs, all from my feed reader.
I recently asked myself: in how few lines of code can I parse an IndieWeb Search Result? I spent some time in Python thinking about this question and I ended up writing a script that extracts search results in 15 lines of code.
Here is my script:
import mf2py
url = "https://indieweb-search.jamesg.blog/results?query=jekyll webmentions"
data = mf2py.parse(url=url)
h_feed = [item["children"] for item in data['items'] if item.get('type', [])[0] == 'h-feed']
if len(h_feed) > 0:
for item in h_feed[0]:
entry = item["properties"]
message_to_show = f"{entry['name'][0]} - {entry['url'][0]}"
print(message_to_show)
This code:
- Imports mf2py, a Python microformats parser.
- Parses a search results page with mf2py.
- Finds the h-feed item on the search page, if one is available.
- If a search feed is found, the program iterates over every item in the feed.
- Prints a message showing the title of the search result and URL to which the search result points.
Each search result contains more than just the page title and URL. You could get the meta description for a post. You may even get the author of a post if one is provided in the search result.
Here is what the code above returns:
With the code displayed earlier, you can start to get creative. You could integrate the search engine with a feed reader. Or you could create a search engine that shows results in a different way. You can show multiple pages of results. Feel free to play around with the code above and see what you can create.
Wrapping Up
The code above illustrates how powerful open search can be. Search developers can create interfaces that let people programmatically access their content if they choose. I took this path in IndieWeb Search to encourage people to build on top of the engine. I have integrated IndieWeb Search into two applications I have made, independent of the main project codebase.
Of course, there is a limitation to querying the search engine like I have described above: you are relying on the search engine to be active. Making data programmatically accessible does not solve this problem. The solution to this would be some kind of peer-to-peer search index. I am not qualified to write on that topic just yet but peer-to-peer search does interest me.
If you build something on top of IndieWeb Search or have some thoughts on this article, do let me know by emailing me at readers@jamesg.blog.
Tagged in search engines.
Responses
Comment on this post
Respond to this post by sending a Webmention.
Have a comment? Email me at readers@jamesg.blog.