How does Tesseract OCR work?

 

Tesseract 4.00 includes a new neural network subsystem configured as a text line recognizer. The neural network system in Tesseract pre-dates TensorFlow but is compatible with it, as there is a network description language called Variable Graph Specification Language (VGSL) that is also available for TensorFlow.
A CNN (Convolutional Neural Network) is typically used to recognize an image containing a single character. Text of arbitrary length, however, is a sequence of characters, and such sequence problems are solved using RNNs (Recurrent Neural Networks). LSTM (Long Short-Term Memory) is a popular form of RNN.


Process of Tesseract OCR

LSTMs are great at learning sequences, but they slow down considerably when the number of states is too large.

Working:

The first step is a connected component analysis in which outlines of the components are stored.

This was a computationally expensive design decision at the time, but it has a significant advantage: by inspecting the nesting of outlines, and the number of child and grandchild outlines, it is simple to detect inverse text and recognize it as easily as black-on-white text.

Tesseract was probably the first OCR engine able to handle white-on-black text so trivially. At this stage, outlines are gathered together, purely by nesting, into Blobs.
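
To make the connected component step concrete, here is a minimal sketch in plain Python. It is not Tesseract's code: Tesseract stores nested outlines, while this simplified version labels pixels with a flood fill and keeps only one bounding box per component. The function name connected_components and the 0/1 grid format are assumptions made for the example.

```python
from collections import deque

def connected_components(grid):
    """Return one bounding box (top, left, bottom, right) per 8-connected
    group of foreground pixels (value 1) in a binary grid."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# Tiny hypothetical "page": three separate ink regions.
page = [
    [0, 1, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 1],
]
boxes = connected_components(page)
for box in boxes:
    print(box)
```

Each box here plays the role of a very rough Blob. Detecting inverse text would additionally need the parent/child nesting of outlines, which this sketch does not track.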

Blobs are organized into text lines, and the lines and regions are analyzed for fixed pitch or proportional text.

Text lines are broken into words differently according to the kind of character spacing. Fixed pitch text is chopped immediately by character cells.

Proportional text is broken into words using definite spaces and fuzzy spaces.
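
The two word-breaking strategies can be sketched roughly as follows. This is an illustration only: the function names and the fixed pixel thresholds (definite, fuzzy) are assumptions, whereas Tesseract derives its spacing decisions from measurements of the actual text line.

```python
def chop_fixed_pitch(line_left, line_right, pitch):
    """Cut a fixed-pitch line into equal-width character cells,
    returned as (left, right) x-ranges."""
    cells = []
    x = line_left
    while x < line_right:
        cells.append((x, min(x + pitch, line_right)))
        x += pitch
    return cells

def classify_gaps(blobs, definite=8, fuzzy=5):
    """Label the gap between each pair of neighbouring blobs in a
    proportional line. blobs is a list of (left, right) x-extents sorted
    left to right; the thresholds are hypothetical pixel values."""
    labels = []
    for (_, right), (next_left, _) in zip(blobs, blobs[1:]):
        gap = next_left - right
        if gap >= definite:
            labels.append("space")   # definite space: a word break
        elif gap >= fuzzy:
            labels.append("fuzzy")   # undecided: resolved later
        else:
            labels.append("none")    # same word
    return labels

cells = chop_fixed_pitch(0, 30, 10)
gaps = classify_gaps([(0, 10), (12, 20), (26, 34), (44, 50)])
print(cells)
print(gaps)
```

The "fuzzy" label marks gaps that cannot be decided from spacing alone, which is why a later phase is needed to resolve them.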

Recognition then proceeds as a two-pass process.

  • In the first pass, an attempt is made to recognize each word in turn. Each word that is satisfactory is passed to an adaptive classifier as training data. The adaptive classifier then gets a chance to more accurately recognize text lower down the page.
  • Since the adaptive classifier may have learned something useful too late to make a contribution near the top of the page, a second pass is run over the page, in which words that were not recognized well enough are recognized again. A final phase resolves fuzzy spaces and checks alternative hypotheses for the x-height to locate small-cap text.
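
The control flow of the two passes can be sketched like this. Everything here is a toy stand-in: recognize_page, toy_classify, the 0.8 threshold, and the (text, clean) word format are hypothetical and only mimic the idea that confident words train an adaptive classifier, which then helps with weak words on a second pass.

```python
def recognize_page(words, classify, threshold=0.8):
    """Two-pass recognition loop.
    classify(word, adapted) must return (text, confidence)."""
    adapted = set()               # stand-in for adaptive classifier training data
    results = [None] * len(words)
    weak = []                     # indexes of words not recognized well enough

    # Pass 1: recognize each word in turn; satisfactory words become training data.
    for i, word in enumerate(words):
        text, conf = classify(word, adapted)
        results[i] = (text, conf)
        if conf >= threshold:
            adapted.add(text)
        else:
            weak.append(i)

    # Pass 2: retry weak words now that the whole page has been seen.
    for i in weak:
        text, conf = classify(words[i], adapted)
        if conf > results[i][1]:
            results[i] = (text, conf)
    return results

def toy_classify(word, adapted):
    """Toy classifier: clean words score high; noisy words score high
    only if the same text has already been learned."""
    text, clean = word
    if clean:
        return text, 0.9
    return text, (0.85 if text in adapted else 0.4)

# A noisy word at the top of the page, with a clean copy further down.
words = [("Tesseract", False), ("reads", True), ("Tesseract", True)]
results = recognize_page(words, toy_classify)
print(results)
```

In this example the first word scores poorly on pass 1 because nothing has been learned yet, and improves on pass 2 once a clean sample from lower down the page is available.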

1. Word Finding
2. Character Finding
3. Character Classification


How does Tesseract use LSTM?
 

The image above shows how Tesseract uses LSTM. The input image is processed line by line, and each line is processed in rectangles (boxes). Feeding these rectangles to the LSTM model produces the output.
Tesseract achieved better performance after a new training tool was added and the model was trained on a large amount of data and many fonts. However, Tesseract still struggles with handwritten text and unusual fonts.
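
The box-by-box feeding described above can be sketched as a sliding window over a text line. This is a simplified illustration, not Tesseract's network: line_to_frames, recognize_line, the window width and stride, and toy_step (which only counts ink pixels) are all assumptions. A real LSTM cell would take the place of toy_step.

```python
def line_to_frames(line, width=2, stride=1):
    """Slice a text-line image (a list of pixel rows) into vertical
    frames of the given width, moving left to right."""
    n_cols = len(line[0])
    frames = []
    for start in range(0, n_cols - width + 1, stride):
        frames.append([row[start:start + width] for row in line])
    return frames

def recognize_line(line, step, state=None):
    """Feed the frames in order to a stateful step function
    step(frame, state) -> (output, new_state), as a recurrent model would."""
    outputs = []
    for frame in line_to_frames(line):
        out, state = step(frame, state)
        outputs.append(out)
    return outputs

def toy_step(frame, state):
    """Hypothetical stand-in for an LSTM cell: outputs the ink count in the
    frame together with a running total carried in the state."""
    ink = sum(sum(row) for row in frame)
    total = (state or 0) + ink
    return (ink, total), total

# A tiny 2x4 hypothetical line image.
line = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
outputs = recognize_line(line, toy_step)
print(outputs)
```

The point of the sketch is the data flow: each frame is seen in order, and the state carried between steps is what lets a recurrent model use context from earlier in the line.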






