The second Build a Website in an Hour event finished this evening. Everyone's projects were amazing! Someone worked on a new website and refined another. Someone else spent time learning more about HTML and CSS. Someone else built a website for a project they are working on. I loved the wide range of projects participants built!
Earlier this week I thought to myself "I don't have an idea!" for the event. As Thursday approached, I spent some time reflecting on what I should build. I decided to build a project that lets you click your fingers then talk to navigate to a web page. I decided to call the project awsnap.js (thank you Tantek for the name idea!).
Ana's Web Speech API talk from the State of the Browser Conference came to mind. I wanted to do something with the transcription APIs available in some browsers. Then I started thinking about tensorflow.js, a machine learning framework you can use in the browser. Charlie Gerard gave a talk at Beyond Tellerrand where she shared her approach of putting ML in the middle of a project to add a new dynamic. I decided to apply that to my general interest in doing a project with transcription.
I arrived at the idea to build a website where you could move through pages using your voice. Instead of clicking a link, I wanted to enable someone to say "go to", followed by a word or phrase that is featured in a link, and then be taken to the corresponding page.
You can see a video of the project in action below:
I envisioned two components: (i) a way to recognize "go to", and (ii) a way to transcribe the audio. Technically, these can be separate, but I really wanted an excuse to learn tensorflow.js. I later realized that the two-step approach is prudent: the "go to" recognition can happen on device but web transcription happens in the cloud. Thus, I could ensure audio is only transcribed when a keyword is said.
We started a timer in the call and I got building.
Starting with Web Speech
To let a user navigate to a link using their speech, I needed a way to know what they said once they had enabled the speaking navigation mode. That is where the Web Speech API comes in. The Web Speech API has two components: speech transcription and speech synthesis. The former lets you turn words said aloud into text; the latter reads out text with a text-to-speech system. The transcription feature in the Web Speech API runs in the cloud. It has limited support across browsers. Safari and Chrome both support the transcription feature. Firefox, unfortunately, does not presently support transcription via the Web Speech API.
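As a rough illustration of the transcription half of the Web Speech API, here is a minimal sketch. It is not the code from the project: the helper name startTranscription, the onTranscript callback, the scope parameter, and the "en-US" language setting are all assumptions made for this example.

```javascript
// Sketch only: `startTranscription`, `onTranscript`, and `scope` are
// hypothetical names, not taken from the project.
function startTranscription(onTranscript, scope = globalThis) {
  // Chrome and Safari expose the constructor with a webkit prefix.
  const Recognition = scope.SpeechRecognition || scope.webkitSpeechRecognition;
  if (!Recognition) {
    // Browsers without transcription support (e.g. Firefox) end up here.
    return null;
  }

  const recognition = new Recognition();
  recognition.lang = "en-US"; // assumption: English speech
  recognition.interimResults = false; // only report final results
  recognition.maxAlternatives = 1;

  recognition.onresult = (event) => {
    // Take the top alternative of the first result.
    const transcript = event.results[0][0].transcript;
    onTranscript(transcript.trim());
  };

  recognition.start();
  return recognition;
}
```

In a browser, calling startTranscription((text) => console.log(text)) would prompt for microphone access and then log whatever was said. Passing the scope as a parameter is only a convenience so the helper can be exercised outside a browser.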
I started the event with a simple web page and code that enables the Web Speech API and transcribes my voice. I knew that having a web app that transcribes my voice would be a cool thing to show off, irrespective of how the machine learning and link clicking part went. I was using new technologies across the project so I was a bit worried I wouldn't finish in the hour.
With the transcription ready, I started work on the next steps: building an audio classifier and highlighting links.
Building the audio classifier
I wanted transcription to begin only when a person said "go to".
To recognize "go to" without using the Web Speech API, I needed an audio classification system. You can build one with tensorflow.js without having to write any model training code! You can record audio in the browser and then train the model by clicking a button using Teachable Machine. This was perfect for me since I have limited experience with TensorFlow and I only had an hour in which to work. Teachable Machine then gives you the model weights created in the training process and some boilerplate code to get started.
I also trained my model to identify finger clicks and claps. The former becomes significant later in this post; the latter has no implemented use right now but would be fun as another control action (e.g. if you clap, you go back a page).
I copied the snippet from Teachable Machine into my code. It took me a few moments to get everything set up. I learned I needed a web server, so I used the python3 -m http.server command to set up a server that served all the files in the folder in which I was working. I configured my code to load the model I downloaded from Teachable Machine. I then tested the results. The model was working well enough for me to continue.
The snippet from Teachable Machine includes code to listen on an infinite loop and classify according to the classes a model was trained to identify. I wrote some logic that triggered the Web Speech API transcription whenever the "go to" class was identified by the model. Then, I had to figure out how to find links based on what I said.
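The trigger logic described above might look something like the sketch below. It is an illustration rather than the project's actual code: the names topClass and makeCueHandler, the class labels, and the 0.75 threshold are all assumptions. It expects each classification result to carry a scores array with one probability per class, in the same order as the labels.

```javascript
// Sketch only: function names, label names, and the threshold are assumptions.

// Return the label with the highest score, along with that score.
function topClass(labels, scores) {
  let best = 0;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] > scores[best]) best = i;
  }
  return { label: labels[best], score: scores[best] };
}

// Build a callback that fires `onCue` when the cue class wins
// with a score at or above the threshold.
function makeCueHandler(labels, cueLabel, onCue, threshold = 0.75) {
  return (result) => {
    // Scores may arrive as a typed array; copy into a plain array.
    const { label, score } = topClass(labels, Array.from(result.scores));
    if (label === cueLabel && score >= threshold) {
      onCue(); // e.g. start Web Speech API transcription here
      return true;
    }
    return false;
  };
}
```

The returned handler could then be passed as the callback to the listening loop in the Teachable Machine boilerplate, with onCue starting the transcription. Requiring a minimum score is one simple way to reduce false triggers from background noise.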
In experimentation, the model struggled to identify the "go to" class accurately. I did not have time to overcome this in the hour, but I made some changes later to address the issue.
Finding links on the page
By now, I had two components:
- A verbal keyword ("go to") to initiate going to a page, and
- The ability to transcribe audio and get the results.
One component was still missing: finding the link that matches what the user asked to click on.
I needed to plan for two cases: one where a user says a single word or two and another where a user says the full title of a post. In the version I worked on during the event, I used a library that calculates Levenshtein distances to find the distance between all of the link anchor text on a page and the result of the audio transcription. I took the link whose anchor text had the lowest Levenshtein distance. Then, I rendered the corresponding page in an iframe on the top of the page.
This approach worked, but it proved to be brittle. Unless the Levenshtein distance was very low, the results were not great. What I needed was a fuzzy search, which provides more intelligent approximate pattern matching. This didn't come to mind until after testing the Levenshtein approach. I didn't have enough time to experiment with fuzzy search in the hour, but I did use it in refining the project.
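To make the Levenshtein approach and its brittleness concrete, here is a small self-contained sketch. The event version used a library for the distance calculation; this example swaps in a hand-written distance function instead, and the names levenshtein and closestLink plus the sample links are hypothetical.

```javascript
// Sketch only: a hand-written stand-in for the distance library used in the
// project. `levenshtein`, `closestLink`, and the link data are hypothetical.

// Classic dynamic-programming edit distance using a single row.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // value for (i-1, j-1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value for (i-1, j)
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

// Pick the link whose anchor text is closest to the transcript.
function closestLink(transcript, links) {
  const query = transcript.toLowerCase();
  let best = null;
  let bestDistance = Infinity;
  for (const link of links) {
    const distance = levenshtein(query, link.text.toLowerCase());
    if (distance < bestDistance) {
      best = link;
      bestDistance = distance;
    }
  }
  return best;
}

const links = [
  { text: "About", href: "/about" },
  { text: "Principles poster", href: "/poster" },
  { text: "Social readers", href: "/readers" },
];
console.log(closestLink("principles poster", links).href); // "/poster"
console.log(closestLink("poster", links).href); // "/about"
```

The second call shows the weakness: saying only "poster" costs one edit for each of the 11 missing characters in "Principles poster", so the much shorter "About" ends up closer even though it is unrelated.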
Refining the project
After the event was over, I continued to refine my project. I decided to remove the Levenshtein code in favour of a fuzzy searching library, fuzzysort. When the page loads, I retrieve the anchor texts as I previously did, then load them into a fuzzysort object. This object can then take a search query and return the closest-matching results. In further testing, this method achieved significantly better results than my previous one.
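A lookup with fuzzysort could be sketched roughly as follows. This is an assumption about how such a lookup might be wired, not the project's code: bestLink is a hypothetical helper, and the library is passed in as a parameter so the function does not depend on how fuzzysort was loaded.

```javascript
// Sketch only: `bestLink` is a hypothetical helper. `fuzzysort` is passed in
// so this works whether the library comes from a script tag or a bundler.
function bestLink(fuzzysort, query, links) {
  // `key` names the property of each object to search;
  // `limit: 1` keeps only the highest-scoring match.
  const results = fuzzysort.go(query, links, { key: "text", limit: 1 });
  // With `key`, each result exposes the original object as `.obj`.
  return results.length > 0 ? results[0].obj : null;
}
```

In the browser, the links array could be built from the page with something like Array.from(document.querySelectorAll("a"), (a) => ({ text: a.textContent.trim(), href: a.href })), and the href of the returned object used for navigation.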
I also decided to stop using the "go to" cue and go for the more fun finger click cue. I had already trained the model to identify finger clicks. Now, you can use my project to click your fingers and say a word or a phrase to navigate to the corresponding link on a page. Want to read my blog post on my IndieWeb principles poster? Click your fingers in the demo then say "principles poster." Curious about my thoughts on social readers? Click your fingers then say "Social readers a new way of thinking" when on the demo page.
With these two changes, the user experience feels more playful, returns higher quality results, and behaves more consistently.
I have more ideas on what I want to do, including adding more specific, visual instructions on how to use the software and exploring whether I can incorporate this tool into a browser extension. I am excited to continue working on it!