FAQ: Build a Handwritten Text Recognition System using TensorFlow

Harald Scheidl
5 min read · Sep 13, 2018


This article is a follow-up to my article on how to implement a text recognition model using TensorFlow. It is based on an older code version of the SimpleHTR repository.

Some questions came up which I want to discuss here. Let’s have a look at the following three:

  1. How to recognize text in your images/datasets?
  2. How to recognize text contained in lines or full pages?
  3. How to compute a confidence score for the recognized text?

1 How to recognize text in your images/datasets?

The pre-trained model was trained on the IAM dataset. One sample from IAM is shown in Fig. 1. The model not only learns how to read text, it also learns what the dataset samples look like. If you look through the IAM word images, you will notice these patterns:

  • Images have high contrast
  • Words are tightly cropped
  • Bold writing style

Fig. 1: A sample from the IAM dataset.

If you feed an image with a very different style, you might get a bad result. Let’s take the image shown in Fig. 2.

Fig. 2: A sample for which the model recognizes the text “.”.

The model recognizes the text “.” in this image. The reason is that the model has never seen images like this:

  • Low contrast
  • A lot of space around the word
  • Very thin lines

Let’s look at two approaches to improve the recognition result.

1.1 Preprocess images

Let’s start with a hacky way to make images look more like the samples from the IAM dataset. We will try this with the image from above (see Fig. 3). First, let’s crop it. The model still recognizes “.”. Then, we increase the contrast. Now, the model gives a much better result: “tello”. This is almost correct. If we thicken the lines by applying a morphological operation, the model is finally able to recognize the correct text: “Hello”.

Fig. 3: Preprocessing steps and the recognized text for each of them.

The cropping can be done with a word-segmentation algorithm. Increasing the contrast and applying the morphological operation are achieved by the following Python code.
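A minimal sketch of these two operations with OpenCV might look as follows; the filename, the contrast stretch to the full gray-value range, and the 3×3 kernel are assumptions, not the exact code from the original gist:

```python
# Sketch: increase contrast, then thicken the (dark) pen strokes.
import cv2
import numpy as np

# read the (already cropped) word image as grayscale
img = cv2.imread('word.png', cv2.IMREAD_GRAYSCALE)  # assumed filename

# increase contrast: stretch the gray values to the full range [0, 255]
img = img.astype(np.float64)
img = (img - img.min()) / (img.max() - img.min()) * 255
img = img.astype(np.uint8)

# thicken the lines: erosion is a minimum filter, so it grows the dark
# (ink) regions of a grayscale image
kernel = np.ones((3, 3), np.uint8)
img = cv2.erode(img, kernel, iterations=1)

cv2.imwrite('word_preprocessed.png', img)
```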

1.2 Create IAM-compatible dataset and train model

The best way to get good reading results is of course to retrain the model on your data. While creating a dataset can be quite some effort, it definitely is worth it. You need image-text pairs which you have to convert into an IAM-compatible format. The following code shows how to do this conversion. The getNext() method of the DataProvider class returns one sample (text and image) per call. The createIAMCompatibleDataset() function creates the file words.txt and the directory sub, in which all images are put. You have to adapt the getNext() method if you want to convert your dataset (at the moment it simply creates machine-printed text for the provided words to show an example usage).
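A minimal sketch of such a conversion is shown below. The names DataProvider, getNext() and createIAMCompatibleDataset() follow the article; the rendered dummy images, the hasNext() helper and the placeholder values for the IAM fields that are not needed for training are assumptions:

```python
# Sketch: convert (text, image) samples into an IAM-compatible dataset.
import os
import cv2
import numpy as np

class DataProvider:
    """Returns one sample (text and image) per call of getNext().
    Adapt getNext() to read your own dataset; this example simply
    renders machine-printed text for the provided words."""

    def __init__(self, wordList):
        self.wordList = wordList
        self.idx = 0

    def hasNext(self):
        return self.idx < len(self.wordList)

    def getNext(self):
        img = np.ones((32, 128), np.uint8) * 255  # white background
        word = self.wordList[self.idx]
        self.idx += 1
        cv2.putText(img, word, (2, 20), cv2.FONT_HERSHEY_SIMPLEX, 0.6, 0)
        return (word, img)

def createIAMCompatibleDataset(dataProvider):
    """Creates the file words.txt and the directory sub with all images."""
    os.makedirs('sub/sub-000', exist_ok=True)
    with open('words.txt', 'w') as f:
        ctr = 0
        while dataProvider.hasNext():
            word, img = dataProvider.getNext()
            fileId = 'sub-000-%03d' % ctr  # id also encodes the image path
            cv2.imwrite('sub/sub-000/%s.png' % fileId, img)
            # IAM word format: id, ok/err, graylevel, bounding box (x y w h),
            # grammatical tag, transcription; dummy values where possible
            f.write('%s ok 154 1 1 1 1 XX %s\n' % (fileId, word))
            ctr += 1

if __name__ == '__main__':
    words = 'some words for the dataset'.split(' ')
    createIAMCompatibleDataset(DataProvider(words))
```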

After the conversion, copy both the file words.txt and the directory sub into the data directory of the SimpleHTR project. Then you can train the model by executing python main.py --train.

2 How to recognize text in lines or full pages?

The model takes input images of size 128×32 and outputs at most 32 characters. So it is possible to recognize one or maybe two words with it, but longer sentences or even full pages can’t be read directly:

  • Lines: either segment the lines into words, or make the text recognition model larger so that it can handle full lines
  • Full pages: segment pages into the individual words, and then read each of them separately

Let’s look at preprocessing first, which can be used for both line-level and page-level text processing.

2.1 Preprocess images

If the words of the line are easy to segment (large gaps between words, small gaps between characters of a word), then a simple word-segmentation method like the one proposed by R. Manmatha and N. Srimal (see Fig. 4 for an example) can be used. The segmented words are then fed individually into the text recognition model. For more complex documents, a word segmentation approach based on deep-learning might be used instead.

Fig. 4: Word-segmentation on page-level.
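As a rough sketch of the idea behind such a method (an assumed implementation, not the code from the paper): blur the line image with an anisotropic kernel so that the letters of a word merge into one blob, threshold the result, and take the bounding box of each blob as one word. With OpenCV (version 4 API assumed):

```python
# Sketch: simple word segmentation for a single text line.
import cv2

img = cv2.imread('line.png', cv2.IMREAD_GRAYSCALE)  # assumed input file

# anisotropic blur: kernel wider than tall, so the gaps between the
# letters of a word close before the gaps between words do
blurred = cv2.GaussianBlur(img, (25, 11), 0)

# dark ink on light paper -> invert while thresholding (Otsu picks level)
_, binary = cv2.threshold(blurred, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# each remaining blob is (roughly) one word; crop its bounding box
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
contours = sorted(contours, key=lambda c: cv2.boundingRect(c)[0])
for i, c in enumerate(contours):
    x, y, w, h = cv2.boundingRect(c)
    cv2.imwrite('word_%02d.png' % i, img[y:y + h, x:x + w])
```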

2.2 Extend model to fit complete text-lines

If you work on line-level, you can easily extend the model so that larger images can be fed in and longer character strings can be output.

Table 1 shows an architecture I used for text-line recognition. It allows larger input images (800×64) and is able to output longer character strings (up to 100 characters). Additionally, it contains more CNN layers (7) and uses batch normalization in two of them.

Table 1: Architecture for reading on line-level. Use option 1 (LSTM) for the recurrent network. Abbreviations: bidirectional (bidir), batch normalization (BN), convolutional layer (Conv).
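As an illustration of such an extension, here is a rough Keras sketch along the lines of Table 1. The filter counts, the pooling steps and the placement of the batch normalization layers are assumptions, not the original configuration; only the input size (800×64), the output length (100), the 7 CNN layers and the two BN layers follow the table:

```python
# Sketch: CNN + bidirectional LSTM for text-line recognition.
import tensorflow as tf
from tensorflow.keras import layers

num_chars = 80  # size of the character set + 1 for the CTC blank (assumed)

inputs = layers.Input(shape=(800, 64, 1))  # width x height x channels
x = inputs

# 7 conv layers; pooling shrinks width 800 -> 100 and height 64 -> 1
conv_specs = [  # (number of filters, pooling size)
    (64, (2, 2)), (128, (2, 2)), (128, (2, 2)), (256, (1, 2)),
    (256, (1, 2)), (512, (1, 2)), (512, (1, 1)),
]
for i, (filters, pool) in enumerate(conv_specs):
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    if i in (3, 5):  # batch normalization in two of the layers
        x = layers.BatchNormalization()(x)
    if pool != (1, 1):
        x = layers.MaxPooling2D(pool_size=pool)(x)

x = layers.Reshape((100, -1))(x)  # 100 time steps, feature vector per step
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
outputs = layers.Dense(num_chars, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs)  # to be trained with a CTC loss
model.summary()
```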

3 How to compute a confidence score for the recognized text?

The easiest way to get the probability of the recognized text is to use the CTC loss function. The loss function takes the character-probability matrix and the text as input and outputs the loss value L. The loss value L is the negative log-likelihood of seeing the given text, i.e. L=-log(P). If we feed the character-probability matrix and the recognized text to the loss function and afterwards undo the log and the minus, we get the probability P of the recognized text: P=exp(-L).

The following code shows how to compute the probability of the recognized text for a toy example.
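The sketch below assumes two time steps, the characters “a” and “b” plus the CTC blank, and the recognized text “a”; it uses the TF2 tf.nn.ctc_loss API, while the original article was based on an older TF1 code version:

```python
# Sketch: probability of the text "a" given a character-probability matrix.
import numpy as np
import tensorflow as tf

# character-probability matrix, shape [time_steps, batch_size, num_classes];
# classes: 0 -> 'a', 1 -> 'b', 2 -> CTC blank
probs = np.array([[[0.8, 0.1, 0.1]],   # t=0
                  [[0.7, 0.2, 0.1]]],  # t=1
                 dtype=np.float32)
logits = tf.math.log(probs)  # log-probs are valid logits (rows sum to 1)

labels = tf.constant([[0]], dtype=tf.int32)       # recognized text "a"
label_length = tf.constant([1], dtype=tf.int32)
logit_length = tf.constant([2], dtype=tf.int32)

# L = -log(P), so P = exp(-L)
loss = tf.nn.ctc_loss(labels, logits, label_length, logit_length,
                      logits_time_major=True, blank_index=2)
print(float(tf.exp(-loss)[0]))  # 0.71 = 0.8*0.7 + 0.8*0.1 + 0.1*0.7
```

The three summands in the comment correspond to the CTC alignments “aa”, “a-” and “-a” (with “-” denoting the blank), which all collapse to the text “a”.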

Conclusion

This article showed how to handle different datasets, how to read text on line-level or even page-level, and an easy way to compute a confidence score for the recognized text.

References

  • SimpleHTR repository: https://github.com/githubharald/SimpleHTR
  • IAM Handwriting Database
  • R. Manmatha and N. Srimal: Scale Space Technique for Word Segmentation in Handwritten Documents
