An Intuitive Explanation of Connectionist Temporal Classification

Text recognition with the Connectionist Temporal Classification (CTC) loss and decoding operation

Harald Scheidl
Jun 10, 2018 · 8 min read

If you want a computer to recognize text, neural networks (NN) are a good choice as they outperform all other approaches at the moment. The NN for such use-cases usually consists of convolutional layers (CNN) to extract a sequence of features and recurrent layers (RNN) to propagate information through this sequence. It outputs character-scores for each sequence-element, which are simply represented by a matrix. Now, there are two things we want to do with this matrix:

  1. train: calculate the loss value to train the NN
  2. infer: decode the matrix to get the text contained in the input image

Both tasks are achieved by the CTC operation. An overview of the handwriting recognition system is shown in Fig. 1.

Let’s have a closer look at the CTC operation and discuss how it works without hiding the clever ideas it is based on behind complicated formulas. At the end, I will point you to references where you can find Python code and the (not too complicated) formulas, if you are interested.

Fig. 1: Overview of a NN for handwriting recognition.
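To make the pipeline from Fig. 1 a bit more concrete, here is a minimal sketch of such a CNN+RNN network in TensorFlow/Keras. The input size, layer sizes and the alphabet size of 80 (79 characters plus the CTC blank) are assumptions for illustration, not the exact architecture from the figure.

```python
import tensorflow as tf

# Minimal sketch of a CNN+RNN text-recognition backbone (illustrative values,
# not the exact architecture from Fig. 1).
inputs = tf.keras.Input(shape=(32, 128, 1))                    # grayscale text-line image
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)                    # -> (16, 64, 32)
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPooling2D((2, 2))(x)                    # -> (8, 32, 64)
x = tf.keras.layers.Permute((2, 1, 3))(x)                      # width becomes the time axis
x = tf.keras.layers.Reshape((32, 8 * 64))(x)                   # 32 time-steps, 512 features each
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(x)       # propagate info along the sequence
outputs = tf.keras.layers.Dense(80, activation="softmax")(x)   # 80 = alphabet size + CTC blank
model = tf.keras.Model(inputs, outputs)                        # output matrix: 32 x 80 per image
```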

Why we want to use CTC

We could, of course, create a data-set with images of text-lines, and then specify for each horizontal position of the image the corresponding character as shown in Fig. 2. Then, we could train a NN to output a character-score for each horizontal position. However, there are two problems with this naive solution:

  1. it is very time-consuming (and boring) to annotate a data-set at character-level.
  2. we only get character-scores and therefore need some further processing to get the final text from it. A single character can span multiple horizontal positions, e.g. we could get “ttooo” because the “o” is a wide character as shown in Fig. 2. We have to remove all duplicate “t”s and “o”s. But what if the recognized text had been “too”? Then removing all duplicate “o”s gets us the wrong result. How do we handle this?
Fig. 2: Annotation for each horizontal position of the image.

CTC solves both problems for us:

  1. we only have to tell the CTC loss function the text that occurs in the image. We can therefore ignore both the position and width of the characters in the image.
  2. no further processing of the recognized text is needed.

How CTC works

As already discussed, we don’t want to annotate the images at each horizontal position (which we call time-step from now on). The NN-training will be guided by the CTC loss function. We only feed the output matrix of the NN and the corresponding ground-truth (GT) text to the CTC loss function. But how does it know where each character occurs? Well, it does not know. Instead, it tries all possible alignments of the GT text in the image and sums up their scores. This way, the score of a GT text is high if at least some of its alignments score well, no matter where exactly the text appears in the image.

Encoding the text

There was the issue of how to encode duplicate characters (you remember what we said about the word “too”?). It is solved by introducing a pseudo-character (called blank, but don’t confuse it with a “real” blank, i.e. a white-space character). This special character will be denoted as “-” in the following text. We use a clever coding scheme to solve the duplicate-character problem: when encoding a text, we can insert arbitrarily many blanks at any position, which will be removed when decoding it. However, we must insert a blank between duplicate characters like the “l”s in “hello”. Further, we can repeat each character as often as we like.

Let’s look at some examples:

  • “to” → “---ttttttooo”, or “-t-o-”, or “to”
  • “too” → “---ttttto-o”, or “-t-o-o-”, or “to-o”, but not “too”

As you see, this scheme also allows us to easily create different alignments of the same text, e.g. “t-o” and “too” and “-to” all represent the same text (“to”), but with different alignments to the image. The NN is trained to output such an encoded text (encoded in the NN output matrix).
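To make the encoding rule concrete, here is a minimal sketch of the decoding step that undoes it: merge repeated characters first, then remove all blanks. The function name is my own choice for illustration.

```python
# Undo the CTC encoding: merge repeated characters, then drop the blank "-".
def decode_ctc_encoding(encoded):
    out = []
    for ch in encoded:
        if not out or ch != out[-1]:
            out.append(ch)                           # merge repeated characters
    return "".join(c for c in out if c != "-")       # remove blanks

print(decode_ctc_encoding("---ttttttooo"))   # "to"
print(decode_ctc_encoding("-t-o-o-"))        # "too"
print(decode_ctc_encoding("too"))            # "to"  (why "too" must contain a blank)
```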

Loss calculation

We need to calculate the loss value for the training samples (pairs of images and GT texts) to train the NN. You already know that the NN outputs a matrix containing a score for each character at each time-step. A minimalistic matrix is shown in Fig. 3: there are two time-steps (t0, t1) and three characters (“a”, “b” and the blank “-”). The character-scores sum to 1 for each time-step.

Fig. 3: Output matrix of NN. The character-probability is color-coded and is also printed next to each matrix entry. Thin lines are paths representing the text “a”, while the thick dashed line is the only path representing the text “”.

Further, you already know that the loss is calculated by summing up the scores of all possible alignments of the GT text; this way, it does not matter where the text appears in the image.

The score for one alignment (or path, as it is often called in the literature) is calculated by multiplying the corresponding character scores together. In the example shown above, the score for the path “aa” is 0.4·0.4=0.16 while it is 0.4·0.6=0.24 for “a-” and 0.6·0.4=0.24 for “-a”. To get the score for a given GT text, we sum over the scores of all paths corresponding to this text. Let’s assume the GT text is “a” in the example: we have to calculate all possible paths of length 2 (because the matrix has 2 time-steps), which are: “aa”, “a-” and “-a”. We already calculated the scores for these paths, so we just have to sum over them and get 0.4·0.4+0.4·0.6+0.6·0.4=0.64. If the GT text is assumed to be “”, we see that there is only one corresponding path, namely “--”, which yields the overall score of 0.6·0.6=0.36.
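For such a small matrix we can compute these sums by brute force: enumerate every possible path, keep the ones that decode to the GT text, and add up their scores. The following sketch is my own illustration (real implementations use the dynamic-programming algorithm from [1] instead of enumeration, since the number of paths grows exponentially with the number of time-steps); it reproduces the numbers from the example above.

```python
import itertools

def collapse(path):
    """Undo the CTC encoding: merge repeats, then drop blanks ("-")."""
    out = []
    for ch in path:
        if not out or ch != out[-1]:
            out.append(ch)
    return "".join(c for c in out if c != "-")

def ctc_probability(matrix, chars, gt_text):
    """Sum the scores of all paths whose collapsed form equals gt_text."""
    prob = 0.0
    for path in itertools.product(range(len(chars)), repeat=len(matrix)):
        if collapse([chars[i] for i in path]) == gt_text:
            score = 1.0
            for t, i in enumerate(path):
                score *= matrix[t][i]
            prob += score
    return prob

# Fig. 3 matrix: rows are time-steps t0, t1; columns are the scores for "a", "b", "-"
matrix = [[0.4, 0.0, 0.6],
          [0.4, 0.0, 0.6]]
chars = "ab-"
print(ctc_probability(matrix, chars, "a"))   # 0.4*0.4 + 0.4*0.6 + 0.6*0.4 = 0.64
print(ctc_probability(matrix, chars, ""))    # 0.6*0.6 = 0.36
```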

We are now able to compute the probability of the GT text of a training sample, given the output matrix produced by the NN. The goal is to train the NN such that it outputs a high probability (ideally, a value of 1) for correct classifications. Therefore, we maximize the product of probabilities of correct classifications for the training dataset. For technical reasons, we re-formulate into an equivalent problem: minimize the loss of the training dataset, where the loss is the negative sum of log-probabilities. If you need the loss value for a single sample, simply compute the probability, take the logarithm, and put a minus in front of the result. To train the NN, the gradient of the loss with respect to the NN parameters (e.g., weights of convolutional kernels) is computed and used to update the parameters.
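As a tiny numerical check, using the probability 0.64 from the Fig. 3 example (the other per-sample probabilities below are made up):

```python
import math

loss_single = -math.log(0.64)                    # loss for the Fig. 3 sample, ≈ 0.446
probs = [0.64, 0.36, 0.9]                        # hypothetical per-sample probabilities
loss_total = -sum(math.log(p) for p in probs)    # minimizing this maximizes the product
```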

Decoding

When we have a trained NN, we usually want to use it to recognize text in previously unseen images. Or, in more technical terms: we want to calculate the most likely text given the output matrix of the NN. You already know a method to calculate the score of a given text. But this time we are not given any text; in fact, it is exactly this text we are looking for. Trying every possible text would work if there were only a few time-steps and characters, but for practical use-cases this is not feasible.

A simple and very fast algorithm is best path decoding, which consists of two steps:

  1. it calculates the best path by taking the most likely character per time-step.
  2. it undoes the encoding by first removing duplicate characters and then removing all blanks from the path. What remains represents the recognized text.

An example is shown in Fig. 4. The characters are “a”, “b” and “-” (blank). There are 5 time-steps. Let’s apply our best path decoder to this matrix: the most likely character at t0 is “a”, and the same applies for t1 and t2. The blank character has the highest score at t3. Finally, “b” is most likely at t4. This gives us the path “aaa-b”. Removing duplicate characters yields “a-b”, and removing the remaining blanks gives us the text “ab”, which we output as the recognized text. A minimal implementation of this decoder is sketched below.
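Here is a minimal sketch of best path decoding in Python. The matrix values are made up so that the argmax path matches the “aaa-b” example; they are not the actual numbers from Fig. 4.

```python
import numpy as np

def best_path_decoding(matrix, chars):
    """Greedy CTC decoding: argmax per time-step, then undo the encoding."""
    best_path = [chars[int(i)] for i in np.argmax(matrix, axis=1)]
    out = []
    for ch in best_path:
        if not out or ch != out[-1]:
            out.append(ch)                           # step 1: remove duplicates
    return "".join(c for c in out if c != "-")       # step 2: remove blanks

chars = "ab-"
matrix = np.array([[0.8, 0.1, 0.1],   # t0: "a"
                   [0.7, 0.1, 0.2],   # t1: "a"
                   [0.6, 0.1, 0.3],   # t2: "a"
                   [0.2, 0.3, 0.5],   # t3: "-"
                   [0.1, 0.7, 0.2]])  # t4: "b"
print(best_path_decoding(matrix, chars))   # path "aaa-b" -> "ab"
```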

Fig. 4: Output matrix of NN. The thick dashed line represents the best path.

Best path decoding is, of course, only an approximation. It is easy to construct examples for which it gives the wrong result: if you decode the matrix from Fig. 3, you get “” as the recognized text. But we already know that the probability of “” is only 0.36 while it is 0.64 for “a”. However, the approximation algorithm often gives good results in practical situations. There are more advanced decoders such as beam-search decoding, prefix-search decoding or token passing, which also use information about language structure to improve the results.
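You can verify this failure case with a few lines of Python applied to the Fig. 3 matrix:

```python
import numpy as np

# The blank "-" wins at both time-steps, so best path decoding outputs "",
# even though p("a") = 0.64 is larger than p("") = 0.36.
fig3 = np.array([[0.4, 0.0, 0.6],
                 [0.4, 0.0, 0.6]])
chars = "ab-"
best_path = "".join(chars[i] for i in np.argmax(fig3, axis=1))
print(best_path)   # "--", which collapses to the empty text ""
```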

Conclusion and further reading

First, we looked at the problems arising with a naive NN solution. Then, we saw how CTC is able to tackle these problems. We then examined how CTC works by looking at how it encodes text, how loss calculation is done and how it decodes the output of a CTC-trained NN.

This should give you a good understanding of what is happening behind the scenes when you e.g. call functions like ctc_loss or ctc_greedy_decoder in TensorFlow. However, when you want to implement CTC yourself, you need to know some more details, especially to make it run fast. Graves et al. [1] introduce the CTC operation; the paper also shows all the relevant math. If you are interested in how to improve decoding, take a look at the articles about beam search decoding [2][3]. I implemented some decoders and the loss function in Python and C++, which you can find on GitHub [4][5]. Finally, if you want to look at the bigger picture of how to recognize (handwritten) text, look at my article on how to build a handwritten text recognition system [6].
