Posts

OCR with Pytesseract and OpenCV

Pytesseract is a Python wrapper for the Tesseract OCR engine. It is also useful as a stand-alone invocation script for tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including JPEG, PNG, GIF, BMP, TIFF, and others. Preprocessing for Tesseract: The main objective of the preprocessing phase is to make it as easy as possible for the OCR system to distinguish a character or word from the background. Some of the most basic and important preprocessing techniques are binarization, skew correction, noise removal, and thinning and skeletonization. Binarization: In layman’s terms, binarization means converting a coloured image into one that consists of only black and white pixels (black pixel value = 0, white pixel value = 255). As a basic rule, this can be done by fixing a threshold (normally 127, roughly the midpoint of the pixel range 0–255). If the pixel value is greater than the thresho...
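The fixed-threshold rule from the excerpt can be sketched in a few lines. This is a minimal illustration, assuming the conventional rule that pixels strictly above the threshold become white and all others black; the function name `binarize` and the tiny sample array are hypothetical, not taken from the post.

```python
import numpy as np

def binarize(gray, threshold=127):
    """Return a black-and-white copy of an 8-bit grayscale image."""
    gray = np.asarray(gray, dtype=np.uint8)
    # Assumed rule: pixels strictly above the threshold become white (255),
    # everything else becomes black (0).
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# Tiny synthetic 2x3 "image" standing in for a real grayscale scan.
sample = np.array([[0, 127, 128],
                   [200, 50, 255]], dtype=np.uint8)
binary = binarize(sample)
```

With OpenCV, `cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)` applies the same rule and returns the threshold used together with the binarized image.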

Applications and Limitations of tesseract

Applications of Tesseract. License plate recognition: An optical character recognition (OCR) engine such as Tesseract can automatically recognize the text on vehicle registration plates. The system first detects and localizes the license plate in an input image or frame, then extracts the characters from the plate, and finally applies OCR to recognize the extracted characters. Handwriting recognition: The printed or handwritten document is first scanned and separated into individual characters; OCR then matches each character to the most likely letter in its database. Limitations of Tesseract: Tesseract gives the best results when the foreground text is cleanly segmented from the background. In practice, it can be extremely challenging to guarantee such a setup. There are a variety of reasons you might not get good-quality output from Tesseract, for example if the image has background noise. The bette...

Introduction to tesseract-ocr

OCR: Tesseract is an open-source optical character recognition engine and one of the most popular OCR libraries. An OCR engine transforms a two-dimensional image of text, which may contain machine-printed or handwritten text, from its image representation into electronic text. OCR generally consists of these sub-processes: pre-processing, text detection, text recognition, and post-processing. These sub-processes can of course vary depending on the use case, but they are generally the steps needed to perform optical character recognition. Tesseract: Tesseract was developed at Hewlett-Packard Labs. In 2005, it was open sourced by HP together with the University of Nevada, Las Vegas. Since 2006 it has been actively developed by Google and lots of open source contrib...

Properties In Pytesseract

Getting boxes around text: Using Pytesseract, you can get bounding box information for your OCR results by calling the image_to_boxes() function of the pytesseract library on the preprocessed image; this returns a box for each detected character. To get boxes around each word instead, use the image_to_data() function rather than image_to_boxes(). Page segmentation modes: A page of text can be analysed in several ways, and the Tesseract API provides several page segmentation modes. Supported page segmentation modes, listed by their --psm value: 0. Orientation and script detection (OSD) only. 1. Automatic page segmentation with OSD. 2. Automatic page segmentation, but no OSD, or OCR. 3.  ...
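To make the bounding-box output more concrete, here is a small sketch that parses text in the format returned by `image_to_boxes()`, where each line reads `char left bottom right top page` and y is measured from the bottom edge of the image. The helper name `parse_boxes` and the sample string are hypothetical and exist only for illustration; the helper flips the y-axis so the rectangles use the top-left origin that OpenCV drawing functions expect.

```python
def parse_boxes(box_text, image_height):
    """Convert image_to_boxes()-style text into top-left-origin rectangles."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        char = parts[0]
        left, bottom, right, top = map(int, parts[1:5])
        # Tesseract measures y from the bottom edge; flip it so that
        # (x1, y1) is the top-left corner and (x2, y2) the bottom-right.
        boxes.append((char, left, image_height - top,
                      right, image_height - bottom))
    return boxes

# Hand-written sample in the box format, not real OCR output.
sample = "H 10 20 30 60 0\ni 35 20 40 55 0"
rects = parse_boxes(sample, image_height=100)
```

In a real script the text would come from `pytesseract.image_to_boxes(img)`, and each rectangle could then be drawn with `cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)`.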

How tesseract ocr works?

Tesseract 4.00 includes a new neural network subsystem configured as a text line recognizer. The neural network system in Tesseract pre-dates TensorFlow but is compatible with it, as there is a network description language called Variable-size Graph Specification Language (VGSL). A CNN (convolutional neural network) is used to recognize an image containing a single character. Text of arbitrary length is a sequence of characters, and such problems are solved using RNNs (recurrent neural networks); the LSTM is a popular form of RNN. Process of Tesseract OCR: LSTMs are great at learning sequences but slow down a lot when the number of states is too large. Working: The first step is a connected component analysis in which the outlines of the components are stored. This was a computationally expensive design decision at the time, but it has a significant advantage: by inspection of the nesting of outlines, and the number of child and grandchild outlines, it is simple to detect inverse text and rec...
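As a rough illustration of what a connected component analysis does, the sketch below labels groups of touching foreground pixels in a small binary grid using a breadth-first search. It is not Tesseract's implementation: Tesseract stores component outlines and their nesting, whereas this hypothetical `label_components` helper only assigns an id to each 4-connected pixel group.

```python
from collections import deque

def label_components(grid):
    """Label 4-connected groups of foreground (1) pixels with ids 1, 2, ..."""
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or labels[r][c]:
                continue  # background pixel, or already labelled
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            # Flood outwards from the seed pixel, one neighbour at a time.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count

# Small synthetic binary image: 1 = ink, 0 = background.
image = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
]
labels, count = label_components(image)
```

Here the grid contains four separate components; on a real page each component would roughly correspond to a character, a fragment of one, or a speck of noise.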