About a year ago I read two blog posts about generating fonts with deep learning: one by Erik Bernhardsson and one by TJ Torres at StitchFix. Inspired by their work, I wanted to give fonts a go as well, so I set up a variational autoencoder* that learned a low-dimensional representation of the word “Endless” from 1,639 different fonts and was capable of generating very smooth interpolations between the different fonts, as can be seen in the animation below.
*A variational autoencoder is, in short, a model that learns to take a high-dimensional input, transform it into a lower-dimensional representation and then transform this back into the original. By doing this the model effectively learns how to boil the input down to its essentials, which in this case allows one to interpolate smoothly between the original fonts, creating completely new ones in the process.
However, generating different interpolations between fonts of a single word is not quite creating an entirely new font. It very much lacks the expressiveness of a full alphabet.
So why not just take what Erik and TJ have made and simply use that to generate new fonts? Because their models are lacking something: Even though they manage to capture the styles of individual characters very well, they do not incorporate the styling found between pairs of characters, namely the intended spacing in between them, known as kerning.
For those that are unfamiliar with the term, kerning is the spacing between specific pairs of characters, which makes fonts look nice and more readable. For instance, in “LEEWAY” there is no overlap between “L” and “E”, nor “E” and “E”, whereas “W” and “A”, and “A” and “Y” are overlapping. Another example, taken directly from Wikipedia, is the four bigrams below, with and without kerning applied:
So I want to build a model which incorporates both intra-character and inter-character styling. But how to do this?
Requirements to get there
I am going to list the requirements in an odd order, leaving the first requirement until the end, as that is actually how I went about thinking about this problem:
The second requirement is how I will make the model learn about kerning. The best way of achieving this I could think of was to use bigrams: If a model learns to reproduce all combinations of two characters, along with the spacing encoded in the original fonts, it has effectively learned how to kern as intended.
However, just producing bigrams is not quite equal to creating a fully fledged font, as these bigrams cannot readily be used to write sentences (e.g. “CO”+“OM”+“MP”+“PU”+“UT”+“TE”+“ER” is quite different from “COMPUTER”). So something needs to automatically overlay these bigrams, which is the third and final requirement of my solution.
Having figured out how I want to make a model that can learn how to kern, I need to have a representation of what font style and which bigram I want the model to generate. This will be the input to the model, denoted X, and needs to be composed of two parts, namely the style and bigram. The encoding for “font style” has to be in a continuous space as I want to be able to interpolate between the styles of the fonts, creating a continuum of different font-styles which I can sample to generate completely new fonts. And the encoding for bigrams needs to enable me to create all combinations of characters, such that I end up with a font that can be used for whatever purpose. This is the first requirement of the solution.
In short, the three requirements are:
- Figure out the encoding of the input, X, such that it incorporates the styles of fonts and bigrams
- Teach a model to generate bigrams of various fonts
- Automatically overlay the generated bigrams to form words and sentences
Encoding of the input
As mentioned, the encoding of X will need to consist of two parts: Bigram and style.
Encoding of the bigram
I want my generated fonts to be able to write all the letters of the English alphabet in upper case (A..Z) as well as a hyphen (-). This adds up to 27 different characters, and we need to be able to represent two times this as we are generating bigrams.
How about simply encoding it as two times a number in the range 1 to 27 where the characters are numbered according to the order I just described? Then 15 and 11 would correspond to “O” and “K”.
Yeah, no. This is a bad idea because even though machine learning algorithms are “smart” enough to be taught certain things, they operate within the world of arithmetic. A model would interpret the characters “A” and “B” as being very similar, but “B” and “P” as being quite dissimilar, because their numerical values in this encoding would be 1 and 2, and 2 and 16, respectively.
Instead I am going with something called one-hot encoding for representing the characters in the bigram. A one-hot encoding is a list of a single one and N-1 zeros, for an alphabet of length N. For our alphabet of 27 characters, each position in the list corresponds to a specific character, as the example here should hopefully show:
00000000 00000000 00000000 100
ABCDEFGH IJKLMNOP QRSTUVWX YZ-
The 25th position in the one-hot vector is a 1, meaning that the vector represents the 25th character of the alphabet, which is “Y”.
So to represent a bigram of an alphabet with 27 characters such as ours, we will have 52 zeroes and 2 ones. For example, one will encode “ET” as follows:
00001000 00000000 00000000 000 00000000 00000000 00010000 000
One-hot encoding ensures that the machine sees absolutely zero correlation between characters, which is what I want.
That being said, one can make the argument that some characters look more alike than others, such as “M” and “N” but certainly not “Q” and “T”, and thus maybe one should use a representation that is a combination of these two types of encoding, allowing for some correlation between the characters.
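To make the bigram encoding concrete, here is a minimal sketch of it in plain Python. It is not taken from the project's code; the names `ALPHABET`, `one_hot`, `encode_bigram` and `squared_distance` are made up for this illustration. It also shows that, unlike the 1 to 27 numbering, one-hot vectors place every pair of distinct characters at the same distance from each other.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ-"  # 26 upper-case letters plus hyphen


def one_hot(char):
    """Return a list with a single 1 at the character's position in ALPHABET."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(char)] = 1
    return vector


def encode_bigram(bigram):
    """Concatenate the one-hot vectors of the two characters (54 values)."""
    first, second = bigram
    return one_hot(first) + one_hot(second)


def squared_distance(u, v):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


encoded = encode_bigram("ET")
print("".join(map(str, encoded[:27])))  # first character, "E"
print("".join(map(str, encoded[27:])))  # second character, "T"

# "A" vs "B" and "B" vs "P" come out equally far apart
print(squared_distance(one_hot("A"), one_hot("B")))
print(squared_distance(one_hot("B"), one_hot("P")))
```

A mixed encoding of the kind suggested above would replace `one_hot` with vectors in which similar-looking characters share some non-zero entries.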
Encoding of the font-style
That was representing the bigrams, but we also need something representing the style of the font we want the model to generate. As I want to be able to interpolate between existing fonts to generate completely new ones, the encoding of a given type of font has to be in a continuous space.
A quick and dirty solution is to use an algorithm called t-Distributed Stochastic Neighbor Embedding (t-SNE), which simply put can take images of the 1,639 fonts and map these to a z-dimensional space, Z, where z is much smaller than the dimensionality of the imagery, while trying to have the proximity of the points in the Z-space correspond to the similarity of the original images of the fonts. When z is set to be 2, one can plot the fonts at their corresponding positions in this space:
The dimensionality of the space Z can theoretically be any integer from 1 to the original number of dimensions in the images (which was 2304 in this case). Setting z closer to 1 means that each dimension encodes more information about the similarity of the fonts, but more information about the font similarities is lost overall. Conversely, with a larger z each axis encodes relatively less information about the similarity of the fonts, but the dimensionality reduction introduces less error. I chose to set z to 10 as a trade-off between the two.
Going back to why I wanted to have this in a continuous space, let us assume that I had chosen z = 2. Using t-SNE I can once again map the 1,639 fonts to a 2-dimensional space. This time I plot only the locations of the fonts in this space rather than the fonts themselves, as it allows for seeing how t-SNE clusters them together:
Each of these 1,639 points corresponds to a specific, original font. Zooming in on two points in this 2-dimensional space, we can see their corresponding fonts:
As Z is a continuous space we can interpolate between the original fonts, and below is an example from the final model where I interpolated between two fonts:
As you can see, this will allow us to create a continuum of fonts which are novel combinations of the original fonts and their looks!
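One simple way to walk between two fonts in Z is linear interpolation between their style vectors. The post does not say which interpolation scheme was used, so treat the following as an assumption; the function name `interpolate` and the two example vectors are invented for the illustration.

```python
def interpolate(style_a, style_b, steps):
    """Return `steps` evenly spaced points from style_a to style_b (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    points = []
    for i in range(steps):
        t = i / (steps - 1)  # goes from 0.0 (style_a) to 1.0 (style_b)
        points.append([(1 - t) * a + t * b for a, b in zip(style_a, style_b)])
    return points


# Two made-up 2-dimensional style vectors, purely for illustration
font_a = [0.0, 10.0]
font_b = [4.0, 2.0]
for point in interpolate(font_a, font_b, steps=5):
    print(point)
```

Each intermediate point is a style vector that belongs to none of the original fonts, and feeding it to the model is what would produce a new font.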
Now we have settled on how to encode the input to capture both style and which characters to produce, and with z = 10 plus two times 27 characters with one-hot encoding, we end up with the input to the model being a 64-dimensional vector. With that in place, it is time to look at the model and how it is taught to generate these bigrams.
The Model and training hereof
In order to teach any kind of model to generate bigrams, we need some training data. For this problem, it comes quite easily, as I happen to already have the 1,639 fonts lying around on a USB-stick from when I wrote “Endless” in “endlessly” many interpolations of fonts (sorry, I couldn’t resist the pun). With this, I can generate both the 64-dimensional input vectors and the corresponding expected output.
For any of the 1,639 fonts, the 10-dimensional style-vector is simply the point in the space Z which t-SNE told me corresponds to the given font, and the bigram is the 54-dimensional one-hot encoding corresponding to the two characters it should output. Concatenate these two vectors and you have the input.
And the outputs are simply 41 by 65 pixel images generated from the 1,639 fonts, with all 27*27 = 729 combinations of the two characters in the bigrams. This yields roughly 1.2 million inputs, x, and corresponding outputs, y, to train the model on, over and over.
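The construction of the inputs and the size of the training set can be checked with a few lines of Python. This is a sketch, not the project's code: `build_input` and the constants are hypothetical names, and the all-zero style vector is only a stand-in for a real t-SNE coordinate.

```python
from itertools import product

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ-"
STYLE_DIMS = 10    # z, the dimensionality of the style space Z
NUM_FONTS = 1639


def one_hot(char):
    """One-hot vector of length 27 for a single character."""
    vector = [0.0] * len(ALPHABET)
    vector[ALPHABET.index(char)] = 1.0
    return vector


def build_input(style, bigram):
    """Concatenate a 10-dimensional style vector with a one-hot bigram."""
    if len(style) != STYLE_DIMS or len(bigram) != 2:
        raise ValueError("expected a 10-dimensional style and two characters")
    return list(style) + one_hot(bigram[0]) + one_hot(bigram[1])


all_bigrams = ["".join(pair) for pair in product(ALPHABET, repeat=2)]
placeholder_style = [0.0] * STYLE_DIMS  # stand-in for a real t-SNE coordinate

x = build_input(placeholder_style, "OK")
print(len(x))                        # dimensionality of one input vector
print(len(all_bigrams))              # bigrams per font
print(NUM_FONTS * len(all_bigrams))  # total number of (x, y) training pairs
```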
A very simplistic view of the model is this:
Given an input vector, x, the model produces an output image, ỹ, of the estimated corresponding bigram.
The model, which happens to be a deep neural network, is trained by having it produce an estimated ỹ from x. ỹ is then subtracted from the expected image, y, and the mean absolute error is then used to update the model such that it produces a ỹ closer to y.
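The error measure itself is easy to write down. The sketch below computes the mean absolute error between an estimated and an expected image, both given as nested lists of pixel intensities. The function name and the tiny 2 by 3 example images are made up for illustration; the real images are 41 by 65 pixels and the real training code runs on a GPU.

```python
def mean_absolute_error(predicted, expected):
    """Average of |predicted - expected| over all pixels of two images."""
    total = 0.0
    count = 0
    for predicted_row, expected_row in zip(predicted, expected):
        for p, e in zip(predicted_row, expected_row):
            total += abs(p - e)
            count += 1
    return total / count


# Tiny stand-in images with pixel intensities between 0 and 1
y = [[0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0]]
y_estimate = [[0.0, 0.5, 1.0],
              [0.5, 0.0, 1.0]]
print(mean_absolute_error(y_estimate, y))
```

Two of the six pixels are off by 0.5 each, so the error here is 1/6. Training nudges the model's weights in the direction that shrinks this number.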
For this particular problem, the model generally seems to converge after having iterated over a few million examples of inputs and expected outputs, which takes roughly 24 hours on a computer with a powerful GPU.
At this point we can create bigrams (and monograms, because I made a model for that too) in a continuum of styles. Here’s a video of interpolation between 20+ different styles for the upper case alphabet + hyphen:
Overlaying bigrams into n-grams
With a model capable of producing bigrams all we need now before we can write full words in the generated fonts is to be able to stitch them together automatically (because we are somewhat lazy and don’t want to do this manually).
Given a word, such as “HELLO”, what I did was to break this down into bigrams: “HE”, “EL”, “LL”, “LO” and then have an algorithm figure out the pairwise overlap between them. E.g. for “HE” and “EL” it should figure out that the two E’s should overlap.
For this purpose I used an algorithm called Simulated Annealing, and programmed it to maximise the overlap between the black parts of the image (corresponding to where the characters actually are and ignoring the white space surrounding them).
In short, simulated annealing works by taking a random action, and if some condition is met it updates the state according to this action. An action in this case is moving (translating) the second bigram along the x and y axes, as well as scaling it up or down along the x and y axes. The state is the information about how much the second bigram has currently been moved and scaled. The condition has two parts: if the random action improves the state with regard to the objective (which is maximising the overlap between the two bigrams), the algorithm always applies the action and updates the state, but if the action worsens the state (i.e. makes the bigrams overlap less), it randomly decides whether to update the state with the action. This decision depends on the “temperature” of the algorithm, which is an exponentially decreasing number, and the algorithm becomes less likely to accept a worsening of the state at lower temperatures.
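The loop described above can be sketched in a few dozen lines. This is a simplified illustration rather than the project's implementation: it only translates the second image (no scaling), represents each image as a set of black pixel coordinates, and uses the common exp(delta / temperature) acceptance rule, which is an assumption since the exact rule is not stated here. All names and parameter values are hypothetical.

```python
import math
import random


def overlap(first, second, dx, dy):
    """Count black pixels shared by `first` and `second` shifted by (dx, dy)."""
    return sum(1 for (x, y) in second if (x + dx, y + dy) in first)


def anneal(first, second, steps=2000, start_temp=2.0, cooling=0.995, seed=0):
    """Search for the translation of `second` that maximises the overlap."""
    rng = random.Random(seed)
    state = (0, 0)  # how far the second image has been moved so far
    score = overlap(first, second, *state)
    best_state, best_score = state, score
    temperature = start_temp
    for _ in range(steps):
        # Random action: nudge the second image by at most one pixel per axis
        candidate = (state[0] + rng.choice((-1, 0, 1)),
                     state[1] + rng.choice((-1, 0, 1)))
        candidate_score = overlap(first, second, *candidate)
        delta = candidate_score - score
        # Always accept improvements; accept a worsening with a probability
        # that shrinks as the temperature drops
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            state, score = candidate, candidate_score
            if score > best_score:
                best_state, best_score = state, score
        temperature *= cooling  # exponentially decreasing temperature
    return best_state, best_score


# Two 6x6 blocks of "black" pixels standing in for the shared letter
first = {(x, y) for x in range(3, 9) for y in range(6)}
second = {(x, y) for x in range(6) for y in range(6)}
best_state, best_score = anneal(first, second)
print(best_state, best_score, "of", len(second), "pixels")
```

In this toy setup the two blocks line up perfectly when the second one is shifted three pixels to the right, so the search should end with all 36 pixels overlapping. Real bigram images have a much bumpier objective, which is where the early tolerance for worse states helps.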
Why am I describing how this particular algorithm works, but not the others? Because I made the visualisation below of the annealing process, which makes little sense without some understanding of how the algorithm works!
With this in place we can now pairwise match the bigrams and chain them together to create a full word or sentence (where hyphens replace spaces).
Breaking the sentence “MACHINE-LEARNING” into bigrams and getting the neural network to produce images of these bigrams in a random font yields this:
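Splitting a word or sentence into the overlapping bigrams that the model has to produce takes only a couple of lines. The helper name `to_bigrams` is made up for this sketch, and upper-casing the text is an assumption based on the model only knowing the letters A to Z and the hyphen.

```python
def to_bigrams(text):
    """Split text into overlapping bigrams, writing spaces as hyphens."""
    normalised = text.upper().replace(" ", "-")
    return [normalised[i:i + 2] for i in range(len(normalised) - 1)]


print(to_bigrams("HELLO"))
print(to_bigrams("machine learning"))
```

Each neighbouring pair in the resulting list shares one character, and that shared character is what the matching step lines up.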
And matching them using simulated annealing we get this:
Et voilà! That’s how you generate almost perfect fonts!
… Just kidding. The annealing process often messes up, especially with hyphens:
It happens less often with characters other than hyphens, but when it does, it’s quite amusing:
This is not a fault of the simulated annealing algorithm itself, but rather of the objective, which I defined as naively maximising the overlap of the fonts.
Here’s an example of successful matching of the bigrams of a long list of interpolated fonts writing the nonsense word “WAVTSRXA”:
While this does a pretty decent job of generating new fonts with proper kerning, there is room for improvement. An obvious improvement would be to map fonts to the low-dimensional style-space, Z, with something other than t-SNE. t-SNE doesn’t allow for adding new fonts without completely remapping all the fonts and retraining the neural network. An autoencoder is likely well-suited for this purpose.
And in order for this to be truly useful on a large scale, the generated images of the fonts have to be converted into a vectorised font-format such as .woff, .otf, .ttf, or you-name-it. Having this in place would actually remove the need for the fuzzy bigram matching with simulated annealing, as the kerning is already incorporated into the bigrams.
Hopefully this piece went to show a framework for generating novel and fully fledged fonts with proper kerning of the characters, and hopefully it was interesting and somewhat enlightening at the same time!
Finally, the code used to do all of this can be found here on GitHub.