A one-liner to get root paths in my sitemap
Published under the IndieWeb category.
It has been a few weeks since I last blogged. I have been busy learning, working, and travelling. Today I have a blog post to share with you, or at least I will by the time I finish hitting keys in sequence to produce this post.
Today's post is about a simple one-liner I wrote in bash. I have been playing around a lot with bash lately as I try to build a better sense of the right tools for various tasks. Why write a Python script when you can do what you want to do in the command line, using fewer lines of code? Particularly for text processing, the command line and tools like grep and awk save so much time. Over the last few weeks I have run into four or five situations where a few commands and pipes turn an otherwise difficult, or at least tedious, task into something simple.
One such example is that I wanted to get a list of all the root paths in my website sitemap. I could have done this by parsing the XML, writing a script to extract each URL, then using a library to get the root paths. For a full-scale project, such as an expansion to my search engine scripts, I probably would take this approach. But my use case was just to get a list of the root paths for me to look at. The easiest way to do this, I found, was to use a few commands.
I wrote this command:
cat _site/sitemap.xml | grep "https://jamesg.blog" | awk -F/ '{ print $4 }' | uniq
This command:
- Prints the contents of my sitemap file generated by my static site generator.
- Prints out every line that contains my blog domain name.
- Splits each line on / (forward slash) and prints the fourth field.
- Removes duplicate results.
The fourth item in each URL is the text after https://jamesg.blog/ and before the next /. In other words, the fourth item is the root path for a URL (i.e. the /coffee/ part of /coffee/maps/). Note: This script relies on URLs being on their own line for the output to be pretty. Some additional work would be needed if the sitemap URLs were not on their own lines.
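To illustrate, here is the same pipeline run against a made-up sitemap fragment (the sample file path and URLs below are invented for this sketch). One note: uniq only collapses adjacent duplicate lines, so this variant uses sort -u instead, which deduplicates even when a sitemap is not grouped by path.

```shell
# Made-up sitemap fragment, one <loc> per line, mirroring what a static
# site generator might emit.
cat > /tmp/sitemap-sample.xml <<'EOF'
<loc>https://jamesg.blog/coffee/maps/</loc>
<loc>https://jamesg.blog/coffee/</loc>
<loc>https://jamesg.blog/indieweb/webmentions/</loc>
EOF

# Splitting on /, field 1 is "<loc>https:", field 2 is empty, field 3 is
# the domain, so field 4 is the first path segment after the domain.
cat /tmp/sitemap-sample.xml | grep "https://jamesg.blog" | awk -F/ '{ print $4 }' | sort -u
```

Against this sample, the pipeline prints coffee and indieweb, one per line.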
This command gave me a list of a few paths. I then reviewed them manually to see whether any of the root paths did not exist. This was part of my effort to make sure the root of every path had a URL that resolved. I found one or two instances where this was not the case, prompting me to make the requisite changes. After these changes, I feel more comfortable with the URL structure of many paths in my website.
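The manual review could also be sketched as a small check against the built site. This is only an illustration: it assumes the generator writes each page as _site/&lt;path&gt;/index.html, and the directory layout and path list below are made up for the example.

```shell
# Set up a fake built site: "coffee" has an index page, "indieweb" does not.
mkdir -p /tmp/site-check/_site/coffee
touch /tmp/site-check/_site/coffee/index.html
printf 'coffee\nindieweb\n' > /tmp/site-check/root-paths.txt

cd /tmp/site-check
# Report any root path whose index page is missing from the built site.
while read -r path; do
  [ -f "_site/$path/index.html" ] || echo "missing: $path"
done < root-paths.txt
```

Run against the sample layout above, this reports the indieweb path as missing.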
While this command is simple, I am always amazed by what one can do from a terminal directly. This was a fun little command that really helped me out.
Respond to this post by sending a Webmention.
Have a comment? Email me at readers@jamesg.blog.