How I, Spy Works (My 60 Minute Website)
Published under the IndieWeb category.

During the "Build a Website in an Hour" meetup last weekend, I worked on a project that I named I, Spy: the coolest game on the web.[^1] The game is like the familiar I Spy, with two twists: (i) the game is digital, and (ii) you guess with photos instead of words.
Each day, a prompt image and a label are chosen. For example, the prompt could be a photo of a cat, with the label "cat." Players are invited to take a photo (or upload one). This photo is compared to the prompt photo. If the player's photo is close to the daily prompt, the app says "Warmer"; otherwise, the app says "Colder". A "Warmer" label means you are getting closer to photographing the object in the day's prompt.
You win once you photograph the day's prompt.
If the day's prompt is a photo of a cat, you will win when you take a photo of a cat.
But how does this work? That's a good question! The heart of this application is a tool called CLIP, an open source computer vision technology developed by OpenAI. CLIP is a versatile tool with a range of use cases. In this guide, we'll focus on how I used CLIP to compare images for the I, Spy game.
You can read the full source code for my application in the webispy GitHub repository. You can play the game online, too.
Image Comparison
I, Spy relies on two pieces of information:
- A prompt image, and
- An image from a user.
These two images are "embedded" using CLIP. An embedding is a special numeric representation of an image that contains rich semantic information. Embeddings can be compared using distance algorithms such as cosine similarity to measure the similarity between two images.
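To make the idea of comparing embeddings concrete, here is a minimal sketch of cosine similarity using NumPy and toy four-dimensional vectors (real CLIP ViT-B/32 embeddings have 512 dimensions; the vectors and names below are illustrative, not from the app):

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity = dot product of the vectors,
    # normalized by the product of their magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy "embeddings" (real CLIP embeddings have 512 dimensions)
cat_photo = np.array([0.9, 0.1, 0.3, 0.2])
another_cat = np.array([0.8, 0.2, 0.4, 0.1])
dog_photo = np.array([0.1, 0.9, 0.2, 0.7])

print(cosine_similarity(cat_photo, another_cat))  # close to 1
print(cosine_similarity(cat_photo, dog_photo))    # noticeably lower
```

Two photos of semantically similar scenes produce embeddings that point in roughly the same direction, so their cosine similarity is high.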
I, Spy calculates the embedding for the prompt image when the game starts. A POST API accepts a user image; when this API is called, an embedding for the player's image is calculated. The application then compares the two embeddings, which yields a similarity score.
The similarity number has to be taken with a pinch of salt: even if two photos contain a cat, the score may still be low (e.g. 60%) because the backgrounds are very different. Thus, I, Spy sets a similarity threshold of 65%. If the embeddings for the player image and the prompt image are 65% similar or greater, the player wins.
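In code, the win condition is just a comparison against that threshold (a sketch mirroring the behavior described above; the function name and constant are mine, not the app's actual identifiers):

```python
SIMILARITY_THRESHOLD = 0.65

def is_win(similarity: float) -> bool:
    # the player wins once their photo's embedding is at least
    # 65% similar to the prompt image's embedding
    return similarity >= SIMILARITY_THRESHOLD

print(is_win(0.60))  # False: two cats, but very different backgrounds
print(is_win(0.72))  # True
```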
The full back-end for the web application was under 50 lines of code at the end of the 60 Minute Website meetup. (NB: This number may increase as I add new functionality to the app.)
The Gameplay Experience
I, Spy is a single-page game. The game starts with an introduction, then offers two methods of participating:
- Take a photo (for this option, a player must first give the app consent to access their webcam), or
- Upload a photo.
The first method works using the navigator.mediaDevices.getUserMedia() browser API. You can see my implementation in the application source code. First, the app requests consent to use the webcam, then a video element is set to stream the contents of the webcam:
```
if (navigator.mediaDevices.getUserMedia) {
    navigator.mediaDevices.getUserMedia(constraints)
        .then(function (stream) {
            // stream the webcam into the page's video element
            var video = document.querySelector("video");
            video.srcObject = stream;
            video.onloadedmetadata = function (e) {
                video.play();
            };
            ...
        });
}
```
Then, the button is updated to say "Take Photo". Behind the scenes, a listener is added to the button that creates a canvas and sets the contents of the canvas to a frame from the video element we created earlier. The Canvas API offers a function, canvas.toDataURL("image/png"), that lets you convert the canvas contents to base64 data. This data is then decoded into a byte stream and sent to the server.
The server calculates the similarity between the prompt image (i.e. a cat) and the photo a user has taken. This happens in the following Python code:
```
STARTER_IMAGE = "cats.jpeg"

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open(STARTER_IMAGE)).unsqueeze(0).to(device)

...

user_image = request.files["file"]
user_image = Image.open(user_image)
user_image = preprocess(user_image).unsqueeze(0).to(device)

with torch.no_grad():
    # compare "user image" to "starter image"
    image_features = model.encode_image(image)
    user_image_features = model.encode_image(user_image)

similarity = cosine_similarity(
    image_features.cpu(), user_image_features.cpu()
)
```
Here, we:
- Calculate an embedding for the prompt image (i.e. a cat). This happens once, earlier in the app, before the web server starts accepting requests.
- Retrieve the image the player has uploaded.
- Calculate an embedding for the image the player has uploaded.
- Calculate the cosine similarity between the prompt image and the image the player has uploaded.
This returns a single numeric score. Cosine similarity ranges from -1 to 1; the closer the score is to 1, the more semantically similar the two images are.
Back on the client, logic determines whether a guess is "Warmer" or "Colder" depending on whether the similarity of an image was higher or lower than that of the previous image a user uploaded. You can read this code in full in the project index page. The photo a user has taken or uploaded is displayed in the browser alongside the label "Warmer" or "Colder".
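The warmer/colder decision boils down to comparing the current guess's similarity score to the previous one. A minimal sketch of that logic, written in Python for brevity (the real logic lives in the page's JavaScript, and the function name is illustrative):

```python
def label_guess(current: float, previous: float) -> str:
    # "Warmer" if this photo scored higher against the prompt
    # than the player's previous photo; "Colder" otherwise
    return "Warmer" if current > previous else "Colder"

print(label_guess(0.45, 0.30))  # Warmer
print(label_guess(0.20, 0.30))  # Colder
```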
When a user uploads a photo whose cosine similarity to the prompt is 65% or higher (the threshold we discussed earlier), the player wins. A pop-up appears announcing the win and invites the player to play again.
Conclusion
I had a lot of fun working on this project. My first approach, which involved more convoluted embedding calculations, turned out not to work. After some thought, I arrived at the above solution: comparing the cosine similarity between two images. There are some limitations to this game. For example, CLIP will struggle with small objects that are not prominent in an image unless the similarity threshold is lowered.
You can play the game online or view the source code. If you have any questions about how this project works, feel free to send me an email at readers [at] jamesg [dot] blog.
[^1]: I only had an hour to build a website! Don't judge me on the name.
Comment on this post
Respond to this post by sending a Webmention.
Have a comment? Email me at readers@jamesg.blog.