Saving my blog posts in one single .docx file
Published under the IndieWeb category.

I have been thinking about how best to print my blog. In the first print run, I collated all of my coffee blog posts into a single Google Doc, formatted the doc appropriately, and then used Lulu, an online publishing platform, to print my book. The trouble with this process was that the work was very manual. I had to copy and paste my blog posts from my blog into the Google Doc by hand.
Lately, I have been interested in doing another print run. I started copying my blog posts into a Google Doc last week but I found myself quickly fatigued. I thought there must be an easier way. The advantage to the Google Docs approach over any other is that I could easily hand-select which posts I wanted to include. I had to glance at all of my posts before deciding to put them into the doc. However, I was unable to finish copying all of my blog posts into Google Docs.
I searched for a new solution. I considered whether I could turn all of my markdown files that contain blog posts into a single markdown file. I thought about whether I could restyle Typora, the markdown editor I use, to return a formatted document when I went to export the markdown. Then I found a command line tool that would move all of my files into one and save it as a .docx file. The command line tool is called pandoc.
pandoc lets me take one or more files and save them in a particular format. I chose .docx because I wanted to be able to edit my document later. I needed freedom to change the typeface, add a table of contents, add page numbers, and make any other changes required to format my book. While I have not begun this work just yet, I know I will have to undertake these tasks soon.
I played around a bit with pandoc and realised that saving all of my files into one big markdown file is not enough. All of my blog posts contain YAML front matter and the title of each post is saved in front matter.
Consider the post I wrote about the coffee origins I have tried so far. The start of this document contains the following lines of front matter:
---
layout: post
categories: ["Coffee", "Post"]
title: "Coffee origins I have tried so far"
---
As you can imagine, I did not want this text in my final book. And because my title is in front matter, I knew I would have to extract the title from the front matter before I removed it.
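Here is a minimal sketch of pulling the title out by its key rather than by its line number, in case the front matter keys are not always in the same order. The function name extract_title is hypothetical, and the sketch assumes the title is a single line wrapped in double quotes, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "title:" key found in the
# front matter block at the top of a markdown file.
extract_title() {
    awk '
        # Count the "---" fence lines; stop once the front matter has closed.
        /^---[[:space:]]*$/ { fences++; if (fences == 2) exit; next }
        # Inside the front matter, look for the title key.
        fences == 1 && /^title:/ {
            sub(/^title:[[:space:]]*/, "")   # drop the key and any spaces
            gsub(/^"|"$/, "")                # drop the surrounding double quotes
            print
            exit
        }
    ' "$1"
}
```

Calling extract_title on a post prints the bare title, which can then be placed after "# " to make a heading. Titles that contain escaped quotes or span several lines would need a real YAML parser.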
I came up with this script to prepare my files for pandoc:
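# Create the working directories if they do not exist yet
mkdir -p BookParts/new
for f in Projects/blog/website/_posts/*;
do
cp "$f" BookParts/
done
for f in BookParts/*.md;
do
# File name without the directory
file=$(basename "$f")
# Number of lines of front matter, including both "---" lines
front_matter=$(sed -n '/---/,/---/p' "$f" | wc -l)
# The title is the quoted value on line 4
title=$(sed -n '4 p' "BookParts/$file" | cut -d '"' -f 2)
# Start the new file with the title as a heading
echo "# $title" > "BookParts/new/$file"
# Append the post, starting after the front matter
tail -n +$(($front_matter+3)) "BookParts/$file" >> "BookParts/new/$file"
done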
The first loop runs through all of the blog posts in my Jekyll _posts/ folder and copies each one to the BookParts/ directory, which is where I was going to create all of the markdown files I needed to make my final book.
The second loop does the following:
- Loop through each file in the BookParts/ directory (which is all of my blog posts).
- Get the name of each file.
- Find out how many lines of text there are in front matter (which is contained between "---" tags).
- Retrieve the title from the front matter (which is always on line 4).
- Add the title of the doc to a brand new doc in the BookParts/new/ directory. This new document is where the final blog post will be saved.
- Add 3 to the number of lines of front matter in each file to work out the line where "tail" should start. This skips the blank line after the front matter and the image that follows it, so the copied text starts at the blank line after the image. Now that I come to think about it, some posts do not have images. That means I'll need to revise this line accordingly. If your markdown begins on the line after your front matter ends, you only need to add 1 to the front_matter variable above.
- Append the contents of the file in the BookParts/ directory (excluding front matter) into the new/ directory.
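As an alternative to counting lines, here is a minimal sketch that removes the front matter by tracking the opening and closing "---" lines. It is a different technique from the "sed", "wc" and "tail" steps above, the function name strip_front_matter is hypothetical, and it only strips front matter: any image at the top of the post is kept.

```shell
#!/bin/sh
# Hypothetical helper: print a markdown file without its leading front matter.
# Files that do not start with "---" are printed unchanged.
strip_front_matter() {
    awk '
        NR == 1 && /^---[[:space:]]*$/ { in_fm = 1; next }  # opening fence
        in_fm && /^---[[:space:]]*$/   { in_fm = 0; next }  # closing fence
        !in_fm { print }                                    # body lines only
    ' "$1"
}
```

Because only the first pair of "---" lines is treated as front matter, a "---" horizontal rule later in a post is left in place. The output can be appended to the new file after the heading line, in the same way the "tail" command is used above.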
I added another line of code at the end of the file to clean up the BookParts/ directory to remove the .md files I did not need. Then I added one more line of code that converted the contents of the BookParts/new/ directory into a .docx file:
rm BookParts/*.md
pandoc -s BookParts/new/* -o FinalBook.docx
While I was writing this script, I noticed that the first line of each file was being added immediately at the end of the previous file. This meant that the text that would later become headings was mixed with paragraph text. To solve this problem, I added a new line to the end of each blog post (before I run my program):
for f in Projects/blog/website/_posts/*; do echo >> "$f"; done
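One thing to watch with this loop is that it changes the original posts, and running it twice adds a second blank line. Here is a minimal sketch of a variant that only appends a newline when a file does not already end with one. The function name ensure_trailing_newline is hypothetical, and whether a single trailing newline is enough separation for your version of pandoc is something to check.

```shell
#!/bin/sh
# Hypothetical helper: make sure a file ends with a newline, without
# adding another one if it is already there.
ensure_trailing_newline() {
    # "tail -c 1" prints the last byte of the file. Command substitution
    # strips a trailing newline, so the result is non-empty only when the
    # file ends with some other character.
    if [ -s "$1" ] && [ -n "$(tail -c 1 "$1")" ]; then
        echo >> "$1"
    fi
}

# Example use on the copied posts (path taken from the script above):
# for f in BookParts/*.md; do ensure_trailing_newline "$f"; done
```

Running it on the copies in BookParts/ rather than on _posts/ would leave the original posts untouched.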
I am not done. I need to make sure that no files without images have been cut in a way I did not expect (which is down to the way I calculate front matter / unnecessary text in the "tail" command). I also need to format the final document. And I need to print the document. For now, however, I am very happy with the results. I know how to convert all of my blog posts into a .docx file in one command.
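For the first of those checks, here is a minimal sketch that reports which posts do not open with an image. It assumes images are written in markdown syntax, so the line starts with "![". If your images are HTML tags or Jekyll includes, the pattern would need to change. The function names first_body_line and starts_with_image are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the first non-blank line after the front matter.
first_body_line() {
    awk '
        NR == 1 && /^---[[:space:]]*$/ { in_fm = 1; next }           # opening fence
        in_fm { if ($0 ~ /^---[[:space:]]*$/) in_fm = 0; next }      # skip front matter
        NF { print; exit }                                           # first line with content
    ' "$1"
}

# Hypothetical helper: succeed if the post body opens with a markdown image.
starts_with_image() {
    case "$(first_body_line "$1")" in
        '!['*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Example use (path taken from the script above):
# for f in Projects/blog/website/_posts/*; do
#     starts_with_image "$f" || echo "no leading image: $f"
# done
```

Any post the loop reports is one where skipping the image line would cut the first line of real text instead.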
Tagged in blogging.