Introduction to tesseract-ocr

December 17, 2020

OCR

Tesseract is an open source optical character recognition (OCR) engine and one of the most popular OCR libraries. An OCR engine transforms a two-dimensional image containing machine-printed or handwritten text from its image representation into electronic text.

OCR generally consists of sub-processes:

· Pre-processing

· Text detection

· Text recognition

· Post-processing

These sub-processes can of course vary depending on the use case, but they are generally the steps needed to perform optical character recognition.
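The four stages listed above can be sketched in Python as follows. This is only an illustration, assuming Pillow and pytesseract are installed. The function names, the fixed threshold of 128 and the file name sample.png are assumptions made for the example. Text detection and text recognition are both handled inside the single pytesseract.image_to_string call here.

```python
import re


def preprocess(image, threshold=128):
    """Pre-processing: grayscale, then binarize with a fixed (assumed) threshold."""
    gray = image.convert("L")
    return gray.point(lambda p: 255 if p > threshold else 0)


def recognize(image, lang="eng"):
    """Text detection and recognition, both delegated to Tesseract."""
    import pytesseract  # imported lazily so the other stages work without it

    return pytesseract.image_to_string(image, lang=lang)


def postprocess(text):
    """Post-processing: collapse runs of spaces/tabs and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def ocr_pipeline(path, lang="eng"):
    """Run all stages on the image file at `path`."""
    from PIL import Image

    with Image.open(path) as image:
        return postprocess(recognize(preprocess(image), lang=lang))


if __name__ == "__main__":
    # "sample.png" is a hypothetical file name used only for illustration.
    print(ocr_pipeline("sample.png"))
```

A fixed threshold is the simplest possible binarization; real images often need something adaptive, so treat that step as a placeholder.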

Tesseract

Tesseract was originally developed at Hewlett-Packard Labs. In 2005 it was open sourced by HP together with the University of Nevada, Las Vegas. Since 2006 it has been actively developed by Google and many open source contributors. It can be used directly, or through an API, to extract printed text from images, and it supports a wide variety of languages. It can be used with its existing layout analysis to recognize text within a large document, or in conjunction with an external text detector to recognize text from an image of a single text line.
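The two usage modes mentioned above map onto Tesseract's page segmentation modes: --psm 3 is the default fully automatic page segmentation, and --psm 7 treats the image as a single text line. A minimal sketch through the pytesseract wrapper, assuming pytesseract and Pillow are installed; the helper names and the file names are hypothetical:

```python
def psm_config(single_line=False):
    """Return a Tesseract config string for the chosen page segmentation mode.

    --psm 3: fully automatic page segmentation (Tesseract's default).
    --psm 7: treat the image as a single text line.
    """
    return "--psm 7" if single_line else "--psm 3"


def read_text(path, single_line=False, lang="eng"):
    """OCR the image at `path`, either as a full page or as one text line."""
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=psm_config(single_line)
        )


if __name__ == "__main__":
    # Hypothetical inputs: a scanned page, and a line cropped by an external detector.
    print(read_text("document.png"))
    print(read_text("line_crop.png", single_line=True))
```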

Installing Tesseract

The Tesseract library ships with a handy command-line tool called tesseract. This tool performs OCR on images and stores the output in a text file. To integrate Tesseract into C++ or Python code, use Tesseract's API instead.
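As a rough illustration of the command-line tool, the sketch below builds the usual invocation, tesseract imagename outputbase -l lang, and runs it from Python. With an output base of "out" the text is written to out.txt, and the special output base stdout prints the text instead. The helper names and the file name page.png are assumptions for the example.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base="stdout", lang="eng"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path, lang="eng"):
    """Run the tesseract CLI and return the recognized text from stdout."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, "stdout", lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout


if __name__ == "__main__":
    # "page.png" is a hypothetical input image.
    print(run_tesseract("page.png"))
```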

Steps to install Tesseract on Ubuntu:

· sudo apt install tesseract-ocr

· sudo apt install libtesseract-dev

· sudo pip install pytesseract
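After running the steps above, a quick check from Python can confirm that both the tesseract executable and the pytesseract package are visible. This is a small sketch using only the standard library; the function name check_installation is hypothetical.

```python
import importlib.util
import shutil


def check_installation():
    """Report whether the tesseract CLI and the pytesseract package can be found."""
    return {
        "tesseract_cli": shutil.which("tesseract") is not None,
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
    }


if __name__ == "__main__":
    for name, found in check_installation().items():
        print(f"{name}: {'found' if found else 'missing'}")
```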

Installing Tesseract on Windows is easy with the precompiled binaries. After installation, edit the “path” environment variable and add the Tesseract installation path.
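If editing the system environment variable is not convenient, the path can also be extended for the current Python process only. The sketch below assumes a typical install directory, which may differ on your machine, and the helper name prepend_to_path is hypothetical. pytesseract also lets you point directly at the executable through pytesseract.pytesseract.tesseract_cmd.

```python
import os


def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


if __name__ == "__main__":
    # Assumed install location; adjust to wherever Tesseract was installed.
    prepend_to_path(r"C:\Program Files\Tesseract-OCR")
    # Alternative with pytesseract (also an assumed path):
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```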
