Beam Search Decoding in CTC-trained Neural Networks

A fast and well-performing algorithm with integrated language model to decode the neural network output in the context of text recognition

Harald Scheidl
Jul 10, 2018

Neural networks (NN) consisting of convolutional NN layers and recurrent NN layers combined with a final connectionist temporal classification (CTC) layer are a good choice for (handwritten) text recognition.

The output of the NN is a matrix containing character-probabilities for each time-step (horizontal position); an example is shown in Fig. 1. This matrix must be decoded to get the final text. One algorithm to achieve this is beam search decoding, which can easily integrate a character-level language model.

Fig. 1: Output matrix of NN consisting of two time-steps (t0, t1) and two characters (“a”, “b”) plus the CTC blank (“-”). The numbers indicate the probability of seeing the character at the given time-step.

We will start our discussion with a recap of CTC and best path decoding. Then we will discuss the building blocks (basic algorithm, CTC scoring, language model) of the CTC beam search decoding algorithm. Finally, I will point you to a Python implementation which you can use to do your own tests and experiments.

A short reminder of how CTC works

Reading the article “An Intuitive Explanation of Connectionist Temporal Classification” helps you to understand the following discussion. Here I will give a short recap.

CTC allows training text recognition systems with pairs of images and ground truth texts. Text is encoded in the NN output matrix by paths, which contain one character per time-step, e.g. ‘ab’ or ‘aa’ are possible paths in Fig. 1. I will show text in double quotes “text” and paths in single quotes ‘path’.

A path encodes text in the following way: each character of a text can be repeated an arbitrary number of times. Further, an arbitrary number of CTC blanks (non-character, not to be confused with a white-space character, denoted as “-” in this article) can be inserted between characters. In the case of repeated characters (e.g. “pizza”), at least one blank must be placed in between these repeated characters on the path (e.g. ‘piz-za’).

Here are examples of texts with corresponding paths:

  • “to” → ‘-t-o---’, ‘tttttttt-ooo-’, ‘to’, …
  • “hello” → ‘h-elllll-ll-ooo’, ‘hel-lo’, …
  • “a” → ‘aa’, ‘a-’, ‘-a’, …
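
To make the encoding rule concrete, here is a tiny Python sketch (my own illustration, not code from the article or the repository) of the collapse operation that maps a path back to its text: first remove repeated characters, then remove the blanks.

```python
def collapse(path, blank='-'):
    """Map a CTC path back to its text: first remove repeated characters,
    then remove all blanks."""
    out = []
    for c in path:
        if not (out and out[-1] == c):  # skip repeated characters
            out.append(c)
    return ''.join(c for c in out if c != blank)

# the examples from above
assert collapse('-t-o---') == 'to'
assert collapse('tttttttt-ooo-') == 'to'
assert collapse('h-elllll-ll-ooo') == 'hello'
assert collapse('piz-za') == 'pizza'
assert collapse('-a') == 'a'
```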

As you see, there may be more than one path corresponding to a text. When we are interested in the probability of a text, we have to sum over the probabilities of all corresponding paths. The probability of a single path is the product of the character-probabilities on this path, e.g. for the path ‘aa’ in Fig. 1 it is 0.2·0.4=0.08.

Best path decoding

Best path decoding is the simplest method to decode the output matrix:

  • Concatenate the most probable character per time-step, which yields the best path.
  • Then, undo the encoding by first removing duplicate characters and then removing all blanks. This gives us the recognized text.

Let’s look at an example: the matrix is shown in Fig. 2. The highest-scoring character is blank for both time-steps t0 and t1. So, the best path is ‘--’. We then undo the encoding and get the text “”. Further, we can calculate the probability of the path by multiplying the character-probabilities, which is 0.8·0.6=0.48 in this example.

Fig. 2: Concatenate most probable characters per time-step to get best path.

Best path decoding is fast: we only have to find the character with the highest score for each time-step. If we have C characters and T time-steps, the algorithm has a running time of O(T·C).
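
A minimal NumPy sketch of best path decoding (my own illustration) could look as follows; I assume the matrix layout from the figures, with one row per time-step and the blank in the last column.

```python
import numpy as np

# toy output matrix: rows are time-steps t0, t1; columns are "a", "b", blank
chars = 'ab'
mat = np.array([[0.2, 0.0, 0.8],
                [0.4, 0.0, 0.6]])

def best_path_decoding(mat, chars):
    """Concatenate the most probable character per time-step, then collapse."""
    blank_idx = len(chars)              # blank occupies the last column
    best_path = np.argmax(mat, axis=1)  # most probable character per time-step
    # undo the encoding: remove duplicates, then remove blanks
    text = []
    prev = None
    for idx in best_path:
        if idx != prev and idx != blank_idx:
            text.append(chars[idx])
        prev = idx
    return ''.join(text)

print(repr(best_path_decoding(mat, chars)))  # '' (the empty text)
print(np.prod(np.max(mat, axis=1)))          # path probability 0.8*0.6 = 0.48
```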

Why best path decoding can fail

Best path decoding is both fast and simple, which are of course nice properties. But it may fail in certain situations like the one shown in Fig. 2. Fig. 3 shows all paths corresponding to the text “a”: ‘aa’, ‘a-’ and ‘-a’. The probability of the text “a” is the sum of the probabilities of these paths: 0.2·0.4+0.2·0.6+0.8·0.4=0.52. So, “a” is more probable than “” (0.52>0.48). We need a better algorithm than best path decoding, one which can handle such situations.

Fig. 3: All paths corresponding to text “a”.
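
We can verify this claim with a small brute-force script (my own sketch, using the probabilities visible in Fig. 2): enumerate all paths of length 2, collapse each one, and sum the path probabilities per resulting text.

```python
import itertools
import numpy as np

chars = 'ab'
blank = '-'
symbols = chars + blank
mat = np.array([[0.2, 0.0, 0.8],   # t0: P(a), P(b), P(blank)
                [0.4, 0.0, 0.6]])  # t1

def collapse(path):
    """Remove repeated characters, then remove blanks."""
    out = []
    for c in path:
        if not (out and out[-1] == c):
            out.append(c)
    return ''.join(c for c in out if c != blank)

# sum path probabilities per collapsed text
text_prob = {}
for path in itertools.product(symbols, repeat=mat.shape[0]):
    p = np.prod([mat[t, symbols.index(c)] for t, c in enumerate(path)])
    text = text_prob.setdefault(collapse(path), 0.0)
    text_prob[collapse(path)] = text + p

print(text_prob)  # "a" gets ≈ 0.52 and beats the empty text (≈ 0.48) of best path decoding
```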

A basic version of beam search decoding

Beam search decoding iteratively creates text candidates (beams) and scores them. Pseudo-code for a basic version is shown in Fig. 4: the list of beams is initialized with an empty beam (line 1) and a corresponding score (2). Then, the algorithm iterates over all time-steps of the NN output matrix (3–15). At each time-step, only the best-scoring beams from the previous time-step are kept (4). The beam width (BW) specifies the number of beams to keep. For each of these beams, the score at the current time-step is calculated (8). Further, each beam is extended by all possible characters from the alphabet (10) and again, a score is calculated (11). After the last time-step, the best beam is returned as the result (16).

Fig 4: Basic version of beam search.

Let’s visualize how the algorithm decodes our example NN output with BW 2 and alphabet {“a”, “b”}. Fig. 5 shows both the NN output to be decoded and the tree of beams. The algorithm starts with an empty beam “”, which corresponds to the root node of the tree. The beam is then both copied and extended by all possible characters from the alphabet. This gives us the beams “a”, “b” and “”. Later, we will take a closer look at how to calculate the beam-scores. For now, we use our intuition and see that there is only one path corresponding to each beam: ‘a’ with probability 0.2, ‘b’ with 0 and ‘-’ with 0.8.

In the next iteration, we keep only the 2 best beams (according to BW) from the previous time-step, i.e. we throw away the beam “b”. Then we again both copy and extend the surviving beams and get “aa”, “ab”, “a”, “a”, “b” and “”. If two beams are equal, as is the case for “a”, we simply merge them: we add up their scores and keep only one of the beams. We again use our intuition to compute the scores. Each beam containing a “b” has a probability of 0. “aa” also has probability 0 because to encode a text with repeated characters, we have to put a blank in between (e.g. ‘a-a’), which is not possible for a path of length 2. Finally, what remains are the beams “a” and “”. We already calculated their probabilities: 0.52 and 0.48.

We finished the last iteration and the final step of the algorithm is to return the beam with the highest score, which is “a” in this example.

Fig. 5: NN output and tree of beams with alphabet = {“a”, “b”} and BW = 2.
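
As the pseudo-code of Fig. 4 is not reproduced here as text, the following sketch (my own) mirrors its structure. To keep it self-contained, a beam is scored by a deliberately naive brute-force enumeration of all paths that collapse to it; the next section shows how to compute these scores efficiently.

```python
import itertools
import numpy as np

# same toy matrix and collapse() as in the earlier brute-force sketch
chars = 'ab'
blank = '-'
symbols = chars + blank
mat = np.array([[0.2, 0.0, 0.8],
                [0.4, 0.0, 0.6]])

def collapse(path):
    out = []
    for c in path:
        if not (out and out[-1] == c):
            out.append(c)
    return ''.join(c for c in out if c != blank)

def score(beam, t):
    """Naive placeholder scoring: sum of the probabilities of all paths of
    length t+1 that collapse to the beam. Exponential in t, for illustration only."""
    total = 0.0
    for path in itertools.product(symbols, repeat=t + 1):
        if collapse(path) == beam:
            total += np.prod([mat[i, symbols.index(c)] for i, c in enumerate(path)])
    return total

def basic_beam_search(mat, chars, beam_width):
    beams = {'': 1.0}                      # start with the empty beam
    for t in range(mat.shape[0]):          # iterate over all time-steps
        # keep only the BW best-scoring beams from the previous time-step
        best = sorted(beams, key=beams.get, reverse=True)[:beam_width]
        beams = {}
        for b in best:
            beams[b] = score(b, t)         # score of the copied beam
            for c in chars:                # extend the beam by all characters
                beams[b + c] = score(b + c, t)
        # equal beams are merged automatically: the score function already
        # sums over all paths corresponding to a beam
    return max(beams, key=beams.get)

print(basic_beam_search(mat, 'ab', beam_width=2))  # "a", as in the walkthrough above
```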

Scoring the beams

We didn’t talk about how to score the beams yet. We split the beam score into the probability of the paths ending with a blank (e.g. ‘aa-’) and the probability of the paths ending with a non-blank (e.g. ‘aaa’). We denote the probability of all paths that correspond to a beam b and end with a blank at time-step t by Pb(b, t), and the probability for the non-blank case by Pnb(b, t). The total probability Ptot(b, t) of a beam b at time-step t is then simply the sum of the two: Ptot(b, t)=Pb(b, t)+Pnb(b, t).

Fig. 6 shows what happens when we extend a path. There are three main cases: extend by blank, extend by repeating last character and extend by some other character. When we collapse the extended paths, we either get the unchanged (copied) beam (“a” → “a”), or we get an extended beam (“a” → “aa” or “ab”). We can use this information the other way round too: if we extend a beam, we know which paths we have to consider to calculate the score.

Fig. 6: The effect of appending a character to paths ending with blank and non-blank.

Let’s look at how to iteratively compute Pb and Pnb. Note that we always add to the computed values instead of assigning them (+= instead of =); this implicitly implements the beam merging discussed earlier. All Pb and Pnb values are initially set to 0.

Copy beam

To copy a beam, we can extend corresponding paths by a blank and get paths ending with a blank: Pb(b, t)+=Ptot(b, t-1)·mat(blank, t).

Further, we may extend paths ending with a non-blank by the last character (if the beam is non-empty): Pnb(b, t)+=Pnb(b, t-1)·mat(b[-1], t), where -1 indexes the last character in the beam.

Extend beam

There are two cases. If we extend the beam by a character c that differs from the last character, there is no need for a separating blank in the paths: Pnb(b+c, t)+=Ptot(b, t-1)·mat(c, t).

If instead the last character b[-1] is repeated (c equals b[-1]), the paths we extend must end with a blank, otherwise the repeated characters would be collapsed into one: Pnb(b+c, t)+=Pb(b, t-1)·mat(c, t).

We don’t have to care about Pb(b+c, t) because we added a non-blank character.
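
The four update rules translate almost line by line into code. The sketch below (my own, not the repository code) stores Pb and Pnb in dictionaries keyed by the beam text and performs the update for a single time-step; mat_t is assumed to be one row of the NN output matrix, with the blank in the last column. Beam-width pruning is left out here and added in the full version later.

```python
from collections import defaultdict

def update_step(Pb_prev, Pnb_prev, mat_t, chars):
    """Apply the copy/extend rules for one time-step.
    Pb_prev / Pnb_prev map a beam to the probability of its paths ending with
    a blank / non-blank at the previous time-step."""
    blank_idx = len(chars)                       # blank assumed in the last column
    Pb, Pnb = defaultdict(float), defaultdict(float)
    for b in set(Pb_prev) | set(Pnb_prev):
        pb, pnb = Pb_prev.get(b, 0.0), Pnb_prev.get(b, 0.0)
        ptot = pb + pnb
        # copy beam: extend its paths by a blank
        Pb[b] += ptot * mat_t[blank_idx]
        # copy beam: repeat the last character of paths ending with a non-blank
        if b:
            Pnb[b] += pnb * mat_t[chars.index(b[-1])]
        # extend beam by every character c of the alphabet
        for i, c in enumerate(chars):
            if b and c == b[-1]:
                # repeated character: only paths ending with a blank may be extended
                Pnb[b + c] += pb * mat_t[i]
            else:
                Pnb[b + c] += ptot * mat_t[i]
    return Pb, Pnb

# two steps on the toy matrix from Fig. 2 reproduce the numbers of the walkthrough
mat = [[0.2, 0.0, 0.8],
       [0.4, 0.0, 0.6]]
Pb, Pnb = {'': 1.0}, {'': 0.0}                   # before t0, only the empty beam exists
for mat_t in mat:
    Pb, Pnb = update_step(Pb, Pnb, mat_t, 'ab')
print(Pb['a'] + Pnb['a'], Pb[''] + Pnb[''])      # ≈ 0.52 and ≈ 0.48
```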

Character-level language model

A character-level language model (LM) scores a sequence of characters. We restrict our LM to score single characters (unigram LM) and pairs of characters (bigram LM). We denote the unigram probability of a character c by P(c) and the bigram probability of the character pair c1, c2 by P(c2|c1). The score of the text “hello” is then the probability of seeing an “h”, times the probability of seeing an “e” after an “h”, times the probability of seeing an “l” after an “e”, and so on.

The probability of a character sequence c1, c2, c3, … is: P(c1, c2, c3, …)=P(c1)·P(c2|c1)·P(c3|c2)·…

Training such an LM from a large text is easy: we simply count how often each character occurs and divide by the total number of characters to get the unigram probabilities, and we count how often each pair of characters occurs and normalize to get the bigram probabilities.
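
As a small sketch (my own, assuming the training text is given as a plain Python string), counting unigrams and bigrams could look like this:

```python
from collections import Counter

def train_char_lm(corpus):
    """Count unigrams and bigrams in the corpus and normalize the counts
    to get P(c) and P(c2|c1)."""
    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))
    total_chars = len(corpus)

    def P_unigram(c):
        return unigram_counts[c] / total_chars

    def P_bigram(c2, c1):
        """P(c2|c1): how often c2 follows c1, relative to all occurrences of c1."""
        if unigram_counts[c1] == 0:
            return 0.0
        return bigram_counts[(c1, c2)] / unigram_counts[c1]

    return P_unigram, P_bigram

P_uni, P_bi = train_char_lm("hello world, hello beam search")
print(P_uni('h'))                   # relative frequency of "h"
print(P_bi('e', 'h'))               # probability of seeing "e" after "h"
print(P_uni('h') * P_bi('e', 'h'))  # chain rule from above: P("he") = P("h")*P("e"|"h")
```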

Putting it all together

The CTC beam search algorithm is shown in Fig. 7. It is similar to the basic version shown earlier, but includes code to score the beams: copied beams (lines 7–10) and extended beams (lines 15–19) are scored. Further, the LM is applied when extending a beam b by a character c (line 14). If the extended beam consists of a single character, we apply the unigram score P(c), while for longer beams we apply the bigram score P(c|b[-1]). The LM score of a beam b is stored in the variable Ptxt(b). When the algorithm looks for the best-scoring beams, it sorts them according to Ptot·Ptxt (line 4) and then takes the BW best ones.

Fig. 7: CTC beam search with character-level LM.

The running time can be derived from the pseudo code: the outer-most loop has T iterations. In each iteration, the N beams are sorted, which accounts for N·log(N). The BW best beams are selected and each of them is extended by C characters. Therefore we have N=BW·C beams and the overall running time is O(T·BW·C·log(BW·C)).
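
To tie the pieces together, here is a compact sketch of the full algorithm (again my own illustration, not the code of Fig. 7 or of the CTCDecoder repository). It combines the Pb/Pnb update rules with beam-width pruning and the LM factor Ptxt; the lm argument is an assumed helper that returns P(c) when the beam is still empty and P(c|b[-1]) otherwise.

```python
from collections import defaultdict

def ctc_beam_search(mat, chars, beam_width, lm=None):
    """mat: one row per time-step, blank assumed in the last column.
    lm(prev_char, c): assumed LM helper returning P(c) if prev_char is None,
    else P(c|prev_char); if lm is None, no language model is applied."""
    blank_idx = len(chars)
    Pb, Pnb = defaultdict(float), defaultdict(float)  # blank / non-blank path probabilities
    Ptxt = defaultdict(lambda: 1.0)                   # LM score per beam
    Pb[''] = 1.0
    beams = ['']
    for mat_t in mat:
        # keep only the BW best beams, sorted by Ptot * Ptxt
        beams = sorted(beams, key=lambda b: (Pb[b] + Pnb[b]) * Ptxt[b], reverse=True)[:beam_width]
        new_Pb, new_Pnb = defaultdict(float), defaultdict(float)
        new_beams = set()
        for b in beams:
            ptot = Pb[b] + Pnb[b]
            # copy beam
            new_Pb[b] += ptot * mat_t[blank_idx]
            if b:
                new_Pnb[b] += Pnb[b] * mat_t[chars.index(b[-1])]
            new_beams.add(b)
            # extend beam by every character of the alphabet
            for i, c in enumerate(chars):
                if b and c == b[-1]:
                    new_Pnb[b + c] += Pb[b] * mat_t[i]  # repeated character needs a blank in between
                else:
                    new_Pnb[b + c] += ptot * mat_t[i]
                if lm is not None and (b + c) not in Ptxt:
                    # unigram P(c) for the first character, bigram P(c|b[-1]) afterwards
                    Ptxt[b + c] = Ptxt[b] * lm(b[-1] if b else None, c)
                new_beams.add(b + c)
        Pb, Pnb, beams = new_Pb, new_Pnb, list(new_beams)
    # return the best beam according to Ptot * Ptxt
    return max(beams, key=lambda b: (Pb[b] + Pnb[b]) * Ptxt[b])

# toy example from the article, without LM: returns "a"
mat = [[0.2, 0.0, 0.8],
       [0.4, 0.0, 0.6]]
print(ctc_beam_search(mat, 'ab', beam_width=2))
# with the bigram LM sketched earlier, one could pass e.g.
# lm=lambda prev, c: P_uni(c) if prev is None else P_bi(c, prev)
```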

Implementation

A Python implementation of beam search decoding (and other decoding algorithms) can be found in the CTCDecoder repository: the relevant code is located in src/BeamSearch.py and src/LanguageModel.py. TensorFlow provides the ctc_beam_search_decoder operation; however, it does not include an LM.

Evaluation

Decoding the NN output for the IAM dataset gives a character error rate of 5.60% with best path decoding and 5.35% with beam search decoding. The running time increases from 12 ms to 56 ms per sample.

Here is a sample from the IAM dataset (see Fig. 8) to get a better feeling for how beam search improves the results. Decoding is done with best path decoding and beam search with and without LM.

Ground truth:         "the fake friend of the family, like the"
Best path decoding:   "the fak friend of the fomly hae tC"
Beam search:          "the fak friend of the fomcly hae tC"
Beam search with LM:  "the fake friend of the family, lie th"
Fig. 8: Sample from IAM dataset.

Conclusion

CTC beam search decoding is a simple and fast algorithm and outperforms best path decoding. A character-level LM can easily be integrated.
