I have worked on numerous projects that involve finding all of the URLs in the sitemaps associated with a website. For example, one of the first steps the IndieWeb Search crawler takes when it starts crawling a website is to find all of the URLs in all of the sitemaps. Separately, I have written scripts that validate the status codes of all the URLs in a sitemap.
As I have been working on projects that need to discover URLs from sitemaps, I have written a lot of duplicate code. This made me wonder: what if I made a Python library with some simple utilities for finding URLs in sitemaps? Today, I brought this idea that has been in my head for months to fruition.
I have launched getsitemap on PyPI. The library is at version 0.1.1. It comes with one function that discovers all of the URLs in the sitemaps it can find. This function:
- Searches for Sitemap: directives in a robots.txt file to find sitemaps to crawl.
- Adds a /sitemap.xml file to the list of sitemaps to crawl, if it is not already in that list.
- Removes duplicate sitemaps.
- Crawls every sitemap recursively.
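The discovery steps above could be sketched roughly as follows. This is an illustrative sketch, not the library's actual code; the function name and its arguments are my own:

```python
from urllib.parse import urljoin

def discover_sitemaps(root_url: str, robots_txt: str) -> list:
    """Collect sitemap URLs from robots.txt text, add the default
    /sitemap.xml, and remove duplicates while preserving order."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # robots.txt sitemap directives look like "Sitemap: <url>"
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
    # always consider the conventional /sitemap.xml location
    sitemaps.append(urljoin(root_url, "/sitemap.xml"))
    # remove duplicates while keeping first-seen order
    seen = set()
    unique = []
    for url in sitemaps:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```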
If a sitemap is an index of multiple sitemaps, the library will crawl each file recursively until all URLs in all sitemap files have been retrieved. If a sitemap is a list of URLs, this recursion does not happen. All discovered URLs are added to a list that is associated with the sitemap in which the URL is found.
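The recursion described above might look something like this. Again, this is a hedged sketch rather than the library's implementation: the fetch function is injected by the caller so the example needs no network, and the element names come from the standard sitemaps.org schema:

```python
import xml.etree.ElementTree as ET

# the namespace used by the sitemaps.org protocol
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def crawl_sitemap(url: str, fetch, results: dict = None) -> dict:
    """Recursively collect URLs per sitemap. `fetch(url)` must return
    the XML body of a sitemap (injected here for testability)."""
    if results is None:
        results = {}
    root = ET.fromstring(fetch(url))
    if root.tag == f"{NS}sitemapindex":
        # an index file: recurse into each child sitemap
        for loc in root.iter(f"{NS}loc"):
            crawl_sitemap(loc.text.strip(), fetch, results)
    else:
        # a urlset: record the page URLs under this sitemap
        results[url] = [loc.text.strip() for loc in root.iter(f"{NS}loc")]
    return results
```

Because results are keyed by the sitemap in which each URL was found, an index of many child sitemaps naturally produces one entry per child.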
The retrieval function returns either a flat list of deduplicated URLs found in sitemaps or a dictionary mapping each sitemap to the URLs it contains. These are useful in different cases. For instance, you may use the dictionary to keep an eye on how many URLs appear in each sitemap, or to perform some kind of validation (e.g., checking for non-200 status codes) on a set of URLs. You might use the flat list to perform an action on all URLs.
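Converting between the two shapes is straightforward. Here is a small illustrative helper (not part of the library) that collapses the per-sitemap dictionary into the deduplicated flat list:

```python
def flatten_urls(urls_by_sitemap: dict) -> list:
    """Collapse a {sitemap: [urls]} mapping into one deduplicated,
    order-preserving list of URLs."""
    seen = set()
    flat = []
    for urls in urls_by_sitemap.values():
        for url in urls:
            # keep only the first occurrence of each URL
            if url not in seen:
                seen.add(url)
                flat.append(url)
    return flat
```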
The number of HTTP requests this library makes grows with the number of sitemaps a site has. Foreseeing this, the library uses Python's concurrent.futures module to make concurrent requests. Concurrent requests are made for each list of sitemaps found in an index file and in the robots.txt file. This concurrency significantly reduces the time it takes for the library to return a result.
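The concurrency pattern looks roughly like this. This is a minimal sketch, assuming a caller-supplied fetch function, not the library's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(sitemap_urls: list, fetch, max_workers: int = 10) -> dict:
    """Fetch every sitemap concurrently with a thread pool.
    Network-bound work like this benefits from threads because
    each worker spends most of its time waiting on I/O."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so we can zip results back
        bodies = list(pool.map(fetch, sitemap_urls))
    return dict(zip(sitemap_urls, bodies))
```

Because the results come back in input order, the caller can associate each response with its sitemap URL without any extra bookkeeping.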
I made this library as a simple solution to a problem that kept coming up for me. Instead of writing similar sitemap processing code in multiple places, I now have a library I can import into all of my projects.
If you have any suggestions on how getsitemap can be improved, feel free to leave a comment on the project GitHub page.