Shopping with machine learning

8 minute read

Recently I was looking to buy a watch on Amazon, but was rather overwhelmed by the 29 different brands, 22 colors, and 13 overlapping categories I could use to filter the seemingly endless stream of watches. After working out by trial and error which combination of filters somewhat captured what I was looking for, I still ended up looking through a lot of watches that I certainly did not have in mind.

Surely there has to be a more efficient and intuitive way to find a watch that I liked. Why not make a system that, given an item you like, displays to you similarly looking ones?

Recommendation systems kind of does this, but they tend to be based on input from many users more or less explicitly saying “I like this” and generally suffer from the “cold start” problem, meaning that they work poorly without a lot of data. And as it is, I do not have access to a lot of information about which watches people like, and on top of that I am under the impression that one can indeed tell whether two watches look similar or not without the opinion of a lot of people.

So in an attempt to prove my point I spent quite a few hours on putting together, and in the following I will describe the problem of finding visually similar products automatically and one approach to solve it.

Algorithm requirements

In order for an algorithm to be able to tell which items are similar, and by how much, it requires a metric. A metric is a function that takes two items as input and in return gives you a non-negative number describing the distance between the two items. So if it gives a value close to 0, relatively speaking, that means that the two items given to it are very similar.

Given images of, say, watches how does one define such a metric?

We humans are intuitively good at quickly looking at things and telling which ones we think look alike and which ones do not. But taking the approach of building an online Mechanical Turk wouldn’t quite be in line with what this blog is about.

Instead one could perhaps take the Euclidean distance of the high-dimensional vectors that images can be expressed as? It works quite well for computing distances in kilometers and miles in the real world after all.

Here is a short Python script that loads 4 images of watches, computes the pixel-wise Euclidean distance between the first watch and the other three, and saves the resulting “difference” images:

import math
import numpy as np
from PIL import Image

N = 4 # Number of images to load
# Load the images
imgs = ['watch-{0}.jpg'.format(i+1)).convert('RGB') for i in range(N)]

# Bring the pixel values into the range [0,1]
for i in range(N):
	imgs[i] = np.asarray(imgs[i], dtype="float32")
	imgs[i] = imgs[i] / np.max(imgs[i])

# Compute the Euclidean difference between the images
diff_imgs = [imgs[0] - imgs[i+1] for i in range(N-1)]
diff_imgs = [np.square(diff_imgs[i]) for i in range(N-1)]
eucl_diffs = [math.sqrt(np.sum(diff_imgs[i])) for i in range(N-1)]

# Get the pixel values of diff_imgs back into the range (0,255) for plotting
for i in range(N-1):
	diff_imgs[i] = diff_imgs[i] - np.min(diff_imgs[i])
	diff_imgs[i] = diff_imgs[i] / np.max(diff_imgs[i]) * 255.0
	diff_imgs[i] = Image.fromarray(diff_imgs[i].astype('uint8'))
	print('Euclidean distance between images 1 and {0}: {1}'.format(i, int(eucl_diffs[i])))

The result of running the above script can be viewed below (with some formatting). The darker the resulting image is, the more similar the two input images are, according to the Euclidean distance.

difference between images
The Euclidean distance between these two images is 64
difference between images
The Euclidean distance between these two images is 49
difference between images
The Euclidean distance between these two images is 95 even though the watches in them are exactly the same

In case you didn’t notice it when you looked above, the final two images that are compared are actually of the same watch. But even though the images are of the same thing, the Euclidean distance between these two images is the greatest of the thee image pairs compared! So naively using the Euclidean distance on a pixel-per-pixel basis as the metric seems to yield bad results when the images are taken at different zoom-levels.

This issue also arises if the watches are not perfectly centered in the image, are rotated, taken from different angles, or under different lightning conditions.

However, the Euclidean distance is not completely useless for this task, and it certainly has its merits when applied under the right circumstances as we will see shortly.


In general it turns out that it is very hard to explicitly define an algorithm that given thousands of pixels, tells you how similar the two watches the pixels represent actually are. With careful engineering one probably can achieve decent results, but even if one manages to create such an algorithm after hours upon hours of work, what about one that works well for dresses? Or shoes?

Rather than figuring out by oneself what is important in an image to distinguish it from another, this task can be left to the machines to figure out.

The machine learning algorithm I used for this task is the t-distributed stochastic neighbor embedding (t-SNE), developed by Geoffrey Hinton and Laurens van der Maaten, who has kindly made implementations available in more than 8 languages.

Exactly how t-SNE works is it outside the scope if this writing, but in short the algorithm maps data points from a high-dimensional space (such as images) into one of much lower dimensionality. It does this while trying to preserve the structure between data points from the high-dimensional space, as measured by the Euclidean distance between the points (following a linear transformation with principal component analysis), and emphasising low distances between similar points in the lower-dimensional space such that they cluster together.

If you are unfamiliar with principal component analysis, here is a nice visualisation.

As an example I ran t-SNE on the images I have gathered and had each of the watches mapped to a point in a 2-dimensional space. From this space I sampled a grid of points and found the watch with the nearest corresponding point in the space (using Euclidean distance), and this was the result:

plot of watches in a 2-dimensional grid

This small 2D-plot manages to capture a few different characteristics of the watches in it:

  • In the lower left corner there are mostly watches with metal straps and bright dials
  • In the upper left corner the watches also have metal straps but darker dials
  • Around the lower middle there are pocket watches next to very colorful ones with plastic straps


Using t-SNE is far from the only way to create a system that will help you find similarly looking things, and likely not the best way either. Two other approaches that I can think of are:

  • Different variations of deep learning with convolutional neural networks (CNNs), as these are incredibly good at semantic understanding of images; In 2015 the winning entry of the Large Scale Visual Recognition Challenge’s object classification task had an error rate of 3.6%, and was based on an exotic variation of CNNs using so-called residual learning. In 2010 the winning entry for the same task had an error rate of 28%, almost 8 times greater, and it can be discussed whether CNNs have surpassed human performance at this particular task of understanding images.
  • Using feature detectors, which are algorithms that extract and describe the areas around points of interest in the image. These are generally robust against various of the issues that I described such as rotation and scaling. Counting the number of matching feature points in two different images, or computing and comparing histograms of these extracted features, can also be used as a means of telling how similar a pair of images are

These methods can also be combined with t-SNE to potentially yield even better results.

However, it seems to me that t-SNE on its own does a decent job, and you can judge for yourself on