Properties In Pytesseract
Getting boxes around text
Using Pytesseract, you can get bounding box information for your OCR results by calling the image_to_boxes() function of the pytesseract library on the preprocessed image. This helps you locate the text detected in the image.
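As a minimal sketch of how this might look in practice: image_to_boxes() returns one line per recognized character in the form "char left bottom right top page", with the origin at the bottom-left corner of the image. The helper names parse_boxes and boxes_for_image are hypothetical, and the sketch assumes that pytesseract, Pillow, and the Tesseract binary are installed.

```python
def parse_boxes(box_text, image_height):
    """Turn image_to_boxes() text into (char, left, top, right, bottom) tuples.

    Tesseract measures y from the bottom of the image, so the y values are
    flipped here to the top-left origin used by most drawing libraries.
    """
    boxes = []
    for line in box_text.splitlines():
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, x1, y1, x2, y2, _page = parts
        x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
        boxes.append((char, x1, image_height - y2, x2, image_height - y1))
    return boxes


def boxes_for_image(image_path):
    """Run OCR on an image file and return its character boxes."""
    # Imported lazily so parse_boxes() can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    return parse_boxes(pytesseract.image_to_boxes(image), image.height)
```

The returned tuples can be passed directly to a rectangle-drawing call such as OpenCV's or Pillow's to visualize the boxes.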
Page segmentation modes
A page of text can be analysed in several ways. The Tesseract API provides several page segmentation modes, selected with the --psm option.
List of the supported page segmentation modes:
0. Orientation and script detection (OSD) only.
1. Automatic page segmentation with OSD.
2. Automatic page segmentation, but no OSD, or OCR.
3. Fully automatic page segmentation, but no OSD. (Default)
4. Assume a single column of text of variable sizes.
5. Assume a single uniform block of vertically aligned text.
6. Assume a single uniform block of text.
7. Treat the image as a single text line.
8. Treat the image as a single word.
9. Treat the image as a single word in a circle.
10. Treat the image as a single character.
11. Sparse text. Find as much text as possible in no particular order.
12. Sparse text with OSD.
13. Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
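A page segmentation mode is passed to Tesseract through the --psm option, and Tesseract numbers the modes from 0 to 13. The sketch below builds such a config string and rejects values outside that range. The names psm_config and ocr_with_psm are hypothetical helpers, and the OCR call assumes pytesseract, Pillow, and Tesseract are installed.

```python
PSM_MIN, PSM_MAX = 0, 13  # valid range of Tesseract page segmentation modes


def psm_config(psm, oem=3):
    """Build a Tesseract config string for the given page segmentation mode."""
    if not PSM_MIN <= psm <= PSM_MAX:
        raise ValueError(f"psm must be between {PSM_MIN} and {PSM_MAX}, got {psm}")
    return f"--oem {oem} --psm {psm}"


def ocr_with_psm(image_path, psm):
    """Run OCR on an image file with the chosen page segmentation mode."""
    # Imported lazily so psm_config() works without Tesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), config=psm_config(psm))
```

For example, psm_config(6) gives the "single uniform block of text" setting used in the configurations in this post.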
Detect only digits
Processing this image with the default settings extracts the following text:
Customer name Hallium Energy services
Project NEHINS-HIB-HSA
lavoice no 43876324
Dated 17%h Nov2018
Pono 76496234
To extract only the digits, change the configuration to:
custom_config = r'--oem 3 --psm 6 outputbase digits'
With this configuration, the output would be:
. 43876324
172018
0 76496234
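If the digits configuration is not available in your setup, a similar result can be approximated by filtering the OCR text afterwards. The sketch below is a plain-Python post-processing step, not a Tesseract feature, and the function name digits_per_line is hypothetical. Its result is close to the digits-only output above, though not identical, because Tesseract itself may emit stray characters.

```python
import re


def digits_per_line(text):
    """Keep only the digit characters of each line, dropping lines without digits."""
    results = []
    for line in text.splitlines():
        digits = "".join(re.findall(r"\d", line))
        if digits:
            results.append(digits)
    return results


# Sample text taken from the unrestricted OCR output shown above.
sample = "lavoice no 43876324\nDated 17%h Nov2018\nPono 76496234"
print(digits_per_line(sample))
```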
Whitelisting characters
If you only want to detect certain characters in the given image and ignore the rest, you can specify a whitelist of characters. For example, to detect only the lowercase letters a to z, use the configuration:
custom_config = r'-c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz --psm 6'
Blacklisting characters
If you are sure that some characters will never turn up in your text, you can blacklist them. Note that if a blacklisted character does appear in the image, the OCR will return wrong text in its place. Use the following config:
custom_config = r'-c tessedit_char_blacklist=CHARS --psm 6'
where CHARS is the set of characters to exclude.
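The whitelist and blacklist settings follow the same pattern, so the config string can be built by a small helper. This is a sketch under the assumption that the character sets contain no whitespace, since the config string is split on spaces. The name char_filter_config is hypothetical.

```python
import string


def char_filter_config(whitelist="", blacklist="", psm=6):
    """Build a config string that restricts or excludes characters."""
    options = []
    settings = (
        ("tessedit_char_whitelist", whitelist),
        ("tessedit_char_blacklist", blacklist),
    )
    for name, chars in settings:
        if any(ch.isspace() for ch in chars):
            raise ValueError(f"{name} must not contain whitespace")
        if chars:
            options.append(f"-c {name}={chars}")
    options.append(f"--psm {psm}")
    return " ".join(options)


# Only lowercase letters, as in the whitelist example above.
lowercase_only = char_filter_config(whitelist=string.ascii_lowercase)
# Never return digits.
no_digits = char_filter_config(blacklist=string.digits)
```

Either string can then be passed as the config argument of pytesseract.image_to_string().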
Detect in multiple languages
You can check the languages available by typing this in the terminal:
$ tesseract --list-langs
To download the Tesseract data for a specific language, use:
$ sudo apt-get install tesseract-ocr-LANG
where LANG is the three-letter code for the language you need.
Only languages that have a .traineddata file are supported by Tesseract.
To specify the language you need your OCR output in, use the -l LANG argument in the config:
custom_config = r'-l eng --psm 6'
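Tesseract accepts several languages at once when their codes are joined with a plus sign, for example -l eng+hin. The sketch below checks the requested languages against the output of tesseract --list-langs before building the config. It assumes that the first line of that output is a header and that each following line is one language code. The helper names are hypothetical.

```python
def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1:]  # assumption: the first non-empty line is a header


def lang_config(languages, available, psm=6):
    """Build a config string for one or more installed languages."""
    missing = [lang for lang in languages if lang not in available]
    if missing:
        raise ValueError(f"languages not installed: {', '.join(missing)}")
    return f"-l {'+'.join(languages)} --psm {psm}"


# Illustrative output only; the real list depends on your installation.
sample_output = "List of available languages (3):\neng\nhin\nosd\n"
installed = parse_list_langs(sample_output)
print(lang_config(["eng", "hin"], installed))
```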
Using tessdata_fast
If you want to speed up the process, you can replace the tessdata language models with tessdata_fast models, which are 8-bit integer versions of the tessdata models.
- The tessdata_fast repository contains fast integer versions of trained models for the Tesseract Open Source OCR Engine.
- These models only work with the LSTM OCR engine of Tesseract 4.
- These models are a speed/accuracy compromise, chosen for what offered the best "value for money" in speed vs. accuracy.
- For some languages, these models are still the best available, but for most they are not.
- The "best value for money" network configuration was then integerized for further speed.
- Most users will want to use these traineddata files to do OCR, and these will be shipped as part of Linux distributions, e.g. Ubuntu 18.04.
- Fine tuning/incremental training will NOT be possible from these fast models, as they are 8-bit integer.
- When using the models in this repository, only the new LSTM-based OCR engine is supported. The legacy Tesseract engine is not supported with these files, so Tesseract's oem modes '0' and '2' won't work with them.
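One way to try the fast models without replacing the installed ones is to point Tesseract at a separate directory with the --tessdata-dir option. The sketch below builds such a config and refuses the engine modes that do not work with these files. The directory path is a placeholder for wherever you saved the tessdata_fast files, and the name fast_model_config is hypothetical.

```python
import shlex


def fast_model_config(tessdata_dir, oem=1, psm=6):
    """Build a config string that loads models from a tessdata_fast directory."""
    if oem in (0, 2):
        # The fast models contain no legacy engine data.
        raise ValueError("tessdata_fast needs the LSTM engine; oem 0 and 2 will not work")
    # shlex.quote keeps paths with spaces in one piece when the config is split.
    return f"--tessdata-dir {shlex.quote(tessdata_dir)} --oem {oem} --psm {psm}"


# Placeholder path: replace with your own tessdata_fast location.
custom_config = fast_model_config("/opt/tessdata_fast")
print(custom_config)
```

Here --oem 1 selects the LSTM engine only, which matches the restriction described in the list above.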