Properties In Pytesseract

Getting boxes around text

With Pytesseract, you can get bounding-box information for your OCR results by calling the image_to_boxes() function of the pytesseract library on the preprocessed image. It returns one box per recognized character, which helps you locate the detected text in the image.
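As a minimal sketch of how that result can be consumed: image_to_boxes() returns a plain string with one line per character, in the form "char left bottom right top page", with the origin at the bottom-left corner of the image. The helper names below (parse_char_boxes, char_boxes_from_image) and the sample string are illustrative assumptions, not part of pytesseract. The parser flips the y values to the top-left origin that drawing libraries usually expect.

```python
def parse_char_boxes(boxes_text, image_height):
    """Parse image_to_boxes() output into (char, left, top, right, bottom) tuples.

    Tesseract measures y from the bottom of the image, so the y values are
    flipped here to a top-left origin.
    """
    boxes = []
    for line in boxes_text.splitlines():
        parts = line.split(" ")
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        char = parts[0]
        left, bottom, right, top = (int(v) for v in parts[1:5])
        boxes.append((char, left, image_height - top, right, image_height - bottom))
    return boxes


def char_boxes_from_image(image):
    """Hypothetical wrapper: OCR a PIL image and return its character boxes."""
    # Imported lazily so the parser above can be used without Tesseract installed.
    import pytesseract

    return parse_char_boxes(pytesseract.image_to_boxes(image), image.height)


# Made-up sample in the image_to_boxes() line format, for a 100-pixel-high image.
sample = "H 10 20 30 50 0\ni 32 20 40 48 0"
print(parse_char_boxes(sample, image_height=100))
```

Each tuple can then be passed to a rectangle-drawing call in the imaging library of your choice.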

For input:

Output would look like:

To get boxes around each word instead of each character, use the image_to_data() function in place of image_to_boxes().
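A rough sketch of the word-level approach, assuming the dictionary form of the result (output_type=pytesseract.Output.DICT), which holds parallel lists such as text, conf, left, top, width and height. The function names, the confidence threshold of 60 and the sample values are assumptions made for illustration.

```python
def word_boxes(data, min_conf=60.0):
    """Return (text, left, top, width, height) for each confidently recognized word.

    `data` is expected to look like the dictionary produced by
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).
    """
    words = []
    for i, text in enumerate(data["text"]):
        # Rows for blocks, paragraphs and lines carry empty text and conf -1.
        # conf may arrive as a string or a number depending on the version.
        if not text.strip() or float(data["conf"][i]) < min_conf:
            continue
        words.append((text, data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i]))
    return words


def word_boxes_from_image(image, min_conf=60.0):
    """Hypothetical wrapper: OCR a PIL image and return its word boxes."""
    # Imported lazily so word_boxes() can be tried without Tesseract installed.
    import pytesseract

    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return word_boxes(data, min_conf)


# Made-up sample shaped like an image_to_data() dictionary.
sample_data = {
    "text": ["", "Invoice", "no", "43876324"],
    "conf": ["-1", "96", "41", "91"],
    "left": [0, 5, 90, 120],
    "top": [0, 10, 10, 10],
    "width": [300, 80, 25, 110],
    "height": [40, 20, 20, 20],
}
print(word_boxes(sample_data))
```

Unlike image_to_boxes(), these boxes are given as left, top, width and height measured from the top-left corner, so no y flip is needed.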

Page segmentation modes

A page of text can be analysed in several ways. The Tesseract API provides several page segmentation modes, selected with the --psm option.

List of the supported page segmentation modes, numbered as the --psm option expects them:

0. Orientation and script detection (OSD) only.

1. Automatic page segmentation with OSD.

2. Automatic page segmentation, but no OSD, or OCR.

3. Fully automatic page segmentation, but no OSD. (Default)

4. Assume a single column of text of variable sizes.

5. Assume a single uniform block of vertically aligned text.

6. Assume a single uniform block of text.

7. Treat the image as a single text line.

8. Treat the image as a single word.

9. Treat the image as a single word in a circle.

10. Treat the image as a single character.

11. Sparse text. Find as much text as possible in no particular order.

12. Sparse text with OSD.

13. Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
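To show how a mode is selected in practice, here is a small sketch that builds the config string passed to pytesseract. The tesseract command line numbers these modes from 0 to 13 and the OCR engine modes from 0 to 3. The helper name build_config is an assumption for illustration, not a pytesseract function.

```python
PSM_MIN, PSM_MAX = 0, 13   # page segmentation modes accepted by --psm
OEM_VALUES = (0, 1, 2, 3)  # OCR engine modes accepted by --oem


def build_config(psm=3, oem=3):
    """Build a config string such as '--oem 3 --psm 6' after validating both values."""
    if not PSM_MIN <= psm <= PSM_MAX:
        raise ValueError(f"psm must be between {PSM_MIN} and {PSM_MAX}, got {psm}")
    if oem not in OEM_VALUES:
        raise ValueError(f"oem must be one of {OEM_VALUES}, got {oem}")
    return f"--oem {oem} --psm {psm}"


# Mode 6: assume a single uniform block of text.
custom_config = build_config(psm=6)
print(custom_config)
```

The resulting string is what you would pass as the config argument, for example pytesseract.image_to_string(image, config=custom_config).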

Detect only digits

On processing this image, the text extracted from it would be:

‘Customer name Hallium Energy services

Project NEHINS-HIB-HSA

lavoice no 43876324

Dated 17%h Nov2018

Pono 76496234

Now, to extract only the digits from it, we change the configuration to:

custom_config = r'--oem 3 --psm 6 outputbase digits'

With this configuration, the output would be:

. 43876324

172018

0 76496234
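The digits configuration restricts what Tesseract recognizes. A different, complementary technique, sketched here as an assumption rather than as what that configuration does internally, is to run OCR normally and pull the digit runs out of the text with a regular expression. The sample string is the full-text output shown earlier, and the function name extract_digit_runs is illustrative.

```python
import re


def extract_digit_runs(text):
    """Return every run of consecutive digits found in OCR text, in reading order."""
    return re.findall(r"\d+", text)


# The full-text OCR output from the example above, OCR mistakes included.
ocr_text = (
    "Customer name Hallium Energy services\n"
    "Project NEHINS-HIB-HSA\n"
    "lavoice no 43876324\n"
    "Dated 17%h Nov2018\n"
    "Pono 76496234"
)
print(extract_digit_runs(ocr_text))
```

This keeps the surrounding words available for context, at the cost of not helping Tesseract during recognition itself.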

Whitelisting characters

If you only want to detect certain characters in the given image and ignore the rest, you can specify a whitelist of characters with the following configuration:

custom_config = r'-c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz --psm 6'

Blacklisting characters

If you are sure that some characters or expressions should never turn up in your text, you can blacklist them with the following config, replacing expression with the characters to exclude. Blacklist only characters that really do not occur in the image, because the OCR will return wrong text in place of any blacklisted character that does.

custom_config = r'-c tessedit_char_blacklist=expression --psm 6'
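Both character filters use the same "-c name=value" pattern, so they can be built by one small helper. This is a sketch under assumptions: the name char_filter_config is made up, and it rejects whitespace and quote characters because they would interfere with how the config string is split into arguments.

```python
import string


def char_filter_config(whitelist=None, blacklist=None, psm=6):
    """Build a config string with optional character whitelist and blacklist."""
    parts = []
    options = (("tessedit_char_whitelist", whitelist),
               ("tessedit_char_blacklist", blacklist))
    for name, chars in options:
        if chars is None:
            continue  # this filter was not requested
        if not chars or any(c.isspace() or c in "'\"" for c in chars):
            raise ValueError(f"{name} needs characters without spaces or quotes")
        parts.append(f"-c {name}={chars}")
    parts.append(f"--psm {psm}")
    return " ".join(parts)


# Recognize lowercase letters only.
print(char_filter_config(whitelist=string.ascii_lowercase))
# Never output digits.
print(char_filter_config(blacklist=string.digits))
```

Either string can be passed as the config argument of pytesseract.image_to_string().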

Detect in multiple languages

You can check the languages available by typing this in the terminal

$ tesseract --list-langs

To download the Tesseract data for a specific language, use

$ sudo apt-get install tesseract-ocr-LANG

where LANG is the three letter code for the language you need.

Tesseract supports only languages for which a .traineddata file is available.

To specify the language you need your OCR output in, use the -l LANG argument in the config:

custom_config = r'-l eng --psm 6'
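For text that mixes languages, Tesseract accepts several language codes joined with "+", and pytesseract also exposes the language through the lang parameter of image_to_string(). The sketch below assumes the listed language data is already installed; the helper names join_languages and ocr_in_languages are illustrative.

```python
def join_languages(codes):
    """Join language codes into the 'eng+fra' form accepted by Tesseract."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_in_languages(image, codes, psm=6):
    """Hypothetical wrapper: OCR a PIL image using one or more languages."""
    # Imported lazily so join_languages() can be tried without Tesseract installed.
    import pytesseract

    return pytesseract.image_to_string(
        image, lang=join_languages(codes), config=f"--psm {psm}"
    )


print(join_languages(["eng", "fra"]))
```

The order of the codes matters to Tesseract, so it is usually worth listing the dominant language first.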

Using tessdata_fast

If you want to speed up the process, you can replace the tessdata language models with tessdata_fast models, which are 8-bit integer versions of the tessdata models.

The tessdata_fast repository contains fast integer versions of trained models for the Tesseract Open Source OCR Engine.
These models only work with the LSTM OCR engine of Tesseract 4.
These are a speed/accuracy compromise as to what offered the best "value for money" in speed vs accuracy.
For some languages, this is still best, but for most not.
The "best value for money" network configuration was then integerized for further speed.
Most users will want to use these traineddata files to do OCR, and these will be shipped as part of Linux distributions, e.g. Ubuntu 18.04.
Fine tuning/incremental training will NOT be possible from these fast models, as they are 8-bit integer.
When using the models in this repository, only the new LSTM-based OCR engine is supported. The legacy tesseract engine is not supported with these files, so Tesseract's oem modes '0' and '2' won't work with them.
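One way to try the fast models without overwriting the installed ones is to download them into a separate folder and point Tesseract at it with the --tessdata-dir option. The sketch below assumes such a folder already exists; the function name fast_model_config and any path you pass in are placeholders.

```python
from pathlib import Path


def fast_model_config(tessdata_dir, lang="eng", psm=6):
    """Build a config string that loads models from a custom tessdata folder."""
    model = Path(tessdata_dir) / f"{lang}.traineddata"
    if not model.is_file():
        raise FileNotFoundError(f"{model} not found")
    # --oem 1 selects the LSTM engine, the only one these models support.
    return f'--tessdata-dir "{Path(tessdata_dir)}" --oem 1 --psm {psm}'
```

The returned string would be passed as the config argument together with the matching language, for example pytesseract.image_to_string(image, lang="eng", config=fast_model_config(folder)), where folder is the directory holding the downloaded .traineddata files.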
