Introduction to tesseract-ocr
Tesseract is an open-source optical character recognition (OCR) engine and one of the most popular OCR libraries. An OCR engine transforms a two-dimensional image of text, which may contain machine-printed or handwritten text, from its image representation into machine-readable electronic text.
OCR generally consists of the following sub-processes:
· Pre-processing
· Text detection
· Text recognition
· Post-processing
These sub-processes can of course vary depending on the use case, but they are generally the steps needed to perform optical character recognition.
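The four stages above can be sketched as a small pipeline. This is only a toy illustration and not how Tesseract works internally: the image is assumed to be a list of rows of grayscale values (0 to 255), detection simply looks for horizontal bands that contain dark pixels, and the recognition step is a caller-supplied function. All function names here are hypothetical.

```python
import re


def preprocess(gray, threshold=128):
    """Pre-processing: binarize a grayscale image; 1 marks a dark (ink) pixel."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def detect_text_lines(binary):
    """Text detection: return (start, end) row ranges that contain ink."""
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new band of text begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the bottom of the image
        lines.append((start, len(binary)))
    return lines


def postprocess(text):
    """Post-processing: collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def run_pipeline(gray, recognize):
    """Chain the four stages; `recognize` maps a binary region to raw text."""
    binary = preprocess(gray)
    regions = detect_text_lines(binary)
    raw_texts = [recognize(binary[top:bottom]) for top, bottom in regions]
    return [postprocess(text) for text in raw_texts]


# Tiny 5x3 "page" with two bands of dark pixels, and a stand-in recognizer.
page = [
    [255, 255, 255],
    [255, 0, 255],
    [255, 255, 255],
    [0, 0, 255],
    [0, 255, 255],
]
results = run_pipeline(page, lambda region: "  rows:  %d " % len(region))
```

In a real system the stand-in recognizer would be replaced by an OCR engine such as Tesseract, and each stage would be considerably more involved.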
Tesseract
Tesseract was originally developed at Hewlett-Packard Labs. In 2005 it was open-sourced by HP together with the University of Nevada, Las Vegas. Since 2006 it has been actively developed by Google and many open-source contributors. It can be used directly, or through an API, to extract printed text from images, and it supports a wide variety of languages. It can be used with its existing layout analysis to recognize text within a large document, or in conjunction with an external text detector to recognize text from an image of a single text line.
Installing Tesseract
The Tesseract library ships with a handy command-line tool called tesseract. Users can run this tool to perform OCR on images, and the output is stored in a text file. Users who want to integrate Tesseract into their C++ or Python code can use Tesseract’s API instead.
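As a rough sketch of the command-line route, the tool can be driven from Python with the standard library. The command has the form tesseract <image> <output base> -l <language>, and Tesseract appends .txt to the output base. The helper names and the file names in the usage note are hypothetical, and the sketch assumes the tesseract executable is on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_to_text_file(image_path, output_base, lang="eng"):
    """Run the CLI and return the text it wrote to <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

For example, ocr_to_text_file("scan.png", "result") would write result.txt next to the script and return its contents, provided scan.png exists and the English language data is installed.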
Steps to install Tesseract on Ubuntu:
· sudo apt install tesseract-ocr
· sudo apt install libtesseract-dev
· sudo pip install pytesseract
Installing Tesseract on Windows is straightforward with the precompiled binaries: install them, then edit the “Path” environment variable and add the Tesseract installation path.
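Once the packages are installed, a minimal Python sketch of the API route might look like the following. It assumes pytesseract and Pillow (installed alongside pytesseract) are available. The helper names are hypothetical, and the explicit executable path is only needed when tesseract is not on the PATH, which is sometimes the case on Windows.

```python
import shutil


def resolve_tesseract_cmd(explicit_path=None):
    """Return the executable to use: an explicit path, or whatever is on PATH."""
    if explicit_path is not None:
        return explicit_path
    return shutil.which("tesseract")  # None if the executable is not on PATH


def ocr_image(image_path, lang="eng", tesseract_cmd=None):
    """Recognize the text in an image file using pytesseract."""
    # Imported lazily so the module still loads if the packages are missing.
    import pytesseract
    from PIL import Image

    cmd = resolve_tesseract_cmd(tesseract_cmd)
    if cmd is None:
        raise RuntimeError("tesseract not found; install it or pass tesseract_cmd")
    # pytesseract reads the executable location from this module-level setting.
    pytesseract.pytesseract.tesseract_cmd = cmd
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)
```

On Windows, the tesseract_cmd argument would point at the installed tesseract.exe if the “Path” variable has not been updated.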