How does Tesseract OCR work?
A CNN (Convolutional Neural Network) is typically used to recognize an image containing a single character. Text of arbitrary length, however, is a sequence of characters, and such sequence problems are solved using RNNs (Recurrent Neural Networks). LSTM (Long Short-Term Memory) is a popular form of RNN.
The first step is a connected component analysis in which outlines of the components are stored.
This was a computationally expensive design decision at the time, but it has a significant advantage: by inspecting the nesting of outlines, and the number of child and grandchild outlines, it is simple to detect inverse text and recognize it as easily as black-on-white text.
Tesseract was probably the first OCR engine able to handle white-on-black text so trivially. At this stage, outlines are gathered together, purely by nesting, into Blobs.
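To make the idea of connected component analysis concrete, here is a minimal sketch in Python. It is not Tesseract's implementation: Tesseract traces and nests outlines, while this toy version simply flood-fills foreground pixels and reports one bounding box per component. The function name `connected_components` and the tiny `page` grid are made up for illustration.

```python
from collections import deque

def connected_components(grid):
    """Find 8-connected groups of foreground (1) pixels.

    Returns one bounding box (top, left, bottom, right) per component.
    This is a plain flood fill, not Tesseract's outline tracing.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over one component.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# A tiny binary "page": 1 is ink, 0 is background.
page = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
print(connected_components(page))  # [(0, 0, 1, 1), (1, 3, 2, 4)]
```

Each bounding box here plays the role of a very crude Blob. Detecting inverse text would additionally require tracking which components are nested inside others, which this sketch leaves out.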
Blobs are organized into text lines, and the lines and regions are analyzed for fixed pitch or proportional text.
Text lines are broken into words differently according to the kind of character spacing. Fixed pitch text is chopped immediately by character cells.
Proportional text is broken into words using definite spaces and fuzzy spaces.
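The two word-breaking strategies can be sketched roughly as follows. This is an illustration under assumptions, not Tesseract's code: the functions `chop_fixed_pitch` and `split_proportional`, and the `definite_gap` and `fuzzy_gap` thresholds, are hypothetical names chosen for this example. Blobs are reduced to their horizontal extents `(left, right)`.

```python
def chop_fixed_pitch(line_left, line_right, pitch):
    """Cut a fixed-pitch line into equal-width character cells."""
    cells = []
    x = line_left
    while x < line_right:
        cells.append((x, min(x + pitch, line_right)))
        x += pitch
    return cells

def split_proportional(blobs, definite_gap, fuzzy_gap):
    """Group blob extents (left, right), sorted left to right, into words.

    A gap of at least definite_gap always starts a new word.
    A gap in [fuzzy_gap, definite_gap) is kept inside the word for now
    but recorded as a fuzzy space to be resolved later.
    """
    words = [[blobs[0]]]
    fuzzy = []
    for prev, cur in zip(blobs, blobs[1:]):
        gap = cur[0] - prev[1]
        if gap >= definite_gap:
            words.append([cur])
        else:
            if gap >= fuzzy_gap:
                fuzzy.append((prev[1], cur[0]))
            words[-1].append(cur)
    return words, fuzzy

print(chop_fixed_pitch(0, 30, 10))
# [(0, 10), (10, 20), (20, 30)]

blobs = [(0, 4), (5, 9), (16, 20), (21, 25), (28, 32)]
words, fuzzy = split_proportional(blobs, definite_gap=6, fuzzy_gap=3)
print(words)  # [[(0, 4), (5, 9)], [(16, 20), (21, 25), (28, 32)]]
print(fuzzy)  # [(25, 28)]
```

The gap between x=25 and x=28 is too small to be a definite space, so the decision is deferred, which mirrors the role fuzzy spaces play in the description above.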
Recognition then proceeds as a two-pass process.
- In the first pass, an attempt is made to recognize each word in turn. Each word that is satisfactory is passed to an adaptive classifier as training data. The adaptive classifier then gets a chance to more accurately recognize text lower down the page.
- Since the adaptive classifier may have learned something useful too late to make a contribution near the top of the page, a second pass is run over the page, in which words that were not recognized well enough are recognized again. A final phase resolves fuzzy spaces and checks alternative hypotheses for the x-height to locate small-cap text.
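The two-pass control flow described above can be sketched as a small loop. Everything here is a toy stand-in: `recognize_page`, `toy_classify`, the `good_enough` threshold, and the use of `?` to mark an unreadable character are assumptions made for illustration and do not correspond to Tesseract's actual classifier.

```python
def recognize_page(words, classify, good_enough=0.8):
    """Two-pass recognition. classify(word, adapted) -> (text, confidence)."""
    adapted = []                    # stand-in for adaptive classifier training data
    results = [None] * len(words)
    retry = []
    # Pass 1: recognize each word in turn; confident results train the adapter.
    for i, word in enumerate(words):
        text, confidence = classify(word, adapted)
        results[i] = text
        if confidence >= good_enough:
            adapted.append((word, text))
        else:
            retry.append(i)
    # Pass 2: revisit weak words with everything learned from the whole page.
    for i in retry:
        results[i], _ = classify(words[i], adapted)
    return results

def toy_classify(word, adapted):
    """Toy classifier: '?' marks an unreadable character."""
    if "?" not in word:
        return word, 0.95
    # Try to fill the gap using words the adapter has already seen.
    for _, known in adapted:
        if len(known) == len(word) and all(a in ("?", b) for a, b in zip(word, known)):
            return known, 0.9
    return word, 0.3

print(recognize_page(["c?t", "cat", "dog"], toy_classify))
# ['cat', 'cat', 'dog']
```

The damaged word at the top of the page cannot be resolved in the first pass because nothing has been learned yet. It is only fixed in the second pass, which is the reason the second pass exists.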
1. Word Finding
2. Character Finding
3. Character Classification
The image above shows how Tesseract uses LSTM. The input image is processed line by line, in rectangles (boxes). These rectangles are then fed to the LSTM model, which produces the output.
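As a rough illustration of feeding a text line to a recurrent model piece by piece, the sketch below turns a line image into a sequence of narrow vertical slices and runs them through a step function that carries state from one slice to the next. This is not a real LSTM and not Tesseract's network: `line_to_sequence`, `run_recurrent`, and `toy_step` are hypothetical names, and the one-pixel-wide slicing is an assumption made to keep the example small.

```python
def line_to_sequence(line_image):
    """Turn a text-line image (list of pixel rows) into one column per x position."""
    height = len(line_image)
    width = len(line_image[0])
    return [[line_image[y][x] for y in range(height)] for x in range(width)]

def run_recurrent(sequence, step, state=0.0):
    """Feed the slices in order, carrying state between them as an RNN would."""
    outputs = []
    for column in sequence:
        state, out = step(state, column)
        outputs.append(out)
    return outputs

def toy_step(state, column):
    """Toy recurrence: decay the old state and add the ink in this slice."""
    new_state = 0.5 * state + sum(column)
    return new_state, new_state

line = [
    [0, 1, 0],
    [0, 1, 1],
]
sequence = line_to_sequence(line)
print(sequence)                            # [[0, 0], [1, 1], [0, 1]]
print(run_recurrent(sequence, toy_step))   # [0.0, 2.0, 2.0]
```

The point of the recurrence is that each output depends on the slices seen so far, not only the current one, which is what lets a sequence model read text of arbitrary length.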