Tesseract OCR for PHP

A wrapper to work with Tesseract OCR inside PHP.

Installation

Via Composer:

‼️ This library depends on Tesseract OCR, version 3.02 or later.

Note for Windows users

There are many ways to install Tesseract OCR on your system, but if you just want something quick to get up and running, I recommend installing the Capture2Text package with Chocolatey.

⚠️ Recent versions of Capture2Text stopped shipping the

Note for macOS users

With MacPorts you can install support for individual languages, like so:

But that is not possible with Homebrew. It comes with English support only by default, so if you intend to use it for other languages, the quickest solution is to install them all:

Usage

Basic usage

    use thiagoalessio\TesseractOCR\TesseractOCR;
    echo (new TesseractOCR('text.png'))
        ->run();

Other languages

    use thiagoalessio\TesseractOCR\TesseractOCR;
    echo (new TesseractOCR('german.png'))
        ->lang('deu')
        ->run();

Multiple languages

    use thiagoalessio\TesseractOCR\TesseractOCR;
    echo (new TesseractOCR('mixed-languages.png'))
        ->lang('eng', 'jpn', 'spa')
        ->run();

Inducing recognition

    use thiagoalessio\TesseractOCR\TesseractOCR;
    echo (new TesseractOCR('8055.png'))
        ->allowlist(range('A', 'Z'))
        ->run();

Breaking CAPTCHAs

Yes, I know some of you might want to use this library for the noble purpose of breaking CAPTCHAs, so please take a look at this comment: #91 (comment)

API

run

Executes a

    $ocr = new TesseractOCR();
    $ocr->run();

    $ocr = new TesseractOCR();
    $timeout = 500;
    $ocr->run($timeout);

image

Define the path of an image to be recognized by

    $ocr = new TesseractOCR();
    $ocr->image('/path/to/image.png');
    $ocr->run();

imageData

Set the image to be recognized by

    // Using Imagick ($img is an existing Imagick object)
    $data = $img->getImageBlob();
    $size = $img->getImageLength();

    // Using GD ($img is an existing GD image)
    ob_start();
    // Note that you can use any format supported by tesseract
    imagepng($img, null, 0);
    $size = ob_get_length();
    $data = ob_get_clean();

    $ocr = new TesseractOCR();
    $ocr->imageData($data, $size);
    $ocr->run();

executable

Define a custom location of the

    echo (new TesseractOCR('img.png'))
        ->executable('/path/to/tesseract')
        ->run();

version

Returns the current version of

    echo (new TesseractOCR())->version();

availableLanguages

Returns a list of available languages/scripts.

    foreach ((new TesseractOCR())->availableLanguages() as $lang) {
        echo $lang;
    }

More info: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages-and-scripts

tessdataDir

Specify a custom location for the tessdata directory.
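
Before pointing the wrapper at a custom directory, it can help to confirm that the directory actually holds the language data you plan to use. The following is a minimal sketch, assuming Tesseract's usual `<lang>.traineddata` file naming; `missingTrainedData` is a hypothetical helper and not part of this library.

```php
<?php
// Hypothetical helper (not part of the library): return the language codes
// that have no "<code>.traineddata" file inside the given tessdata directory.
// Assumption: language data files are named "<code>.traineddata".
function missingTrainedData(string $tessdataDir, array $langs): array
{
    $base = rtrim($tessdataDir, '/\\');
    $missing = [];
    foreach ($langs as $lang) {
        $file = $base . DIRECTORY_SEPARATOR . $lang . '.traineddata';
        if (!is_file($file)) {
            $missing[] = $lang;
        }
    }
    return $missing;
}

// Possible use before running the OCR:
// if (missingTrainedData('/path', ['eng', 'deu']) === []) {
//     echo (new TesseractOCR('img.png'))
//         ->tessdataDir('/path')
//         ->lang('eng', 'deu')
//         ->run();
// }
```

An empty result means every requested language file was found; anything else lists the codes to install or drop first.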

    echo (new TesseractOCR('img.png'))
        ->tessdataDir('/path')
        ->run();

userWords

Specify the location of the user words file. This is a plain text file containing a list of words that you want to be considered as normal dictionary words by
Useful when dealing with content that contains technical terminology, jargon, etc.
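
When the vocabulary already lives somewhere in your application, the user words file can be generated from PHP. This is a minimal sketch, assuming the file holds one word per line; `writeUserWords` and the sample words in the comment are hypothetical and not part of this library.

```php
<?php
// Hypothetical helper (not part of the library): write a user words file.
// Assumption: the file is plain text with one word per line.
function writeUserWords(string $path, array $words): void
{
    // Trim whitespace, then drop empty entries and duplicates.
    $clean = array_values(array_unique(array_filter(array_map('trim', $words), 'strlen')));

    if (file_put_contents($path, implode("\n", $clean) . "\n") === false) {
        throw new RuntimeException("Could not write user words file: $path");
    }
}

// Possible use:
// writeUserWords('/path/to/user-words.txt', ['kubectl', 'OpenSSL']);
```

The path written here is the same one that is later handed to `userWords()`.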

    echo (new TesseractOCR('img.png'))
        ->userWords('/path/to/user-words.txt')
        ->run();

userPatterns

Specify the location of the user patterns file. If the content you are dealing with has known patterns, this option can greatly improve tesseract's recognition accuracy.
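
To illustrate what such a file might contain: Tesseract's user patterns syntax uses escapes such as `\d` for a single digit. The sketch below assumes that syntax. `digitPattern` is a hypothetical helper, and the `#` placeholder is a convention invented here for illustration only.

```php
<?php
// Hypothetical helper (not part of the library): turn a template in which
// each '#' stands for one digit into a user pattern line.
// Assumption: '\d' is the user-patterns escape for a single digit.
function digitPattern(string $template): string
{
    return str_replace('#', '\d', $template);
}

// Build the file contents, one pattern per line (sample templates are made up).
$patterns = array_map('digitPattern', ['1-###-GOOG-411', 'INV-####']);
$contents = implode("\n", $patterns) . "\n";

// The contents could then be saved where the wrapper can read them, e.g.:
// file_put_contents('/path/to/user-patterns.txt', $contents);
```

Templates that need a literal `#` or a backslash are out of scope for this sketch.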

    echo (new TesseractOCR('img.png'))
        ->userPatterns('/path/to/user-patterns.txt')
        ->run();

lang

Define one or more languages to be used during the recognition. A complete list of available languages can be found at: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages

Tip from @daijiale: Use the combination

    echo (new TesseractOCR('img.png'))
        ->lang('lang1', 'lang2', 'lang3')
        ->run();

psm

Specify the Page Segmentation Method, which instructs

More info: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method

    echo (new TesseractOCR('img.png'))
        ->psm(6)
        ->run();

oem

Specify the OCR Engine Mode. (see

    echo (new TesseractOCR('img.png'))
        ->oem(2)
        ->run();

dpi

Specify the image DPI. It is useful if your image does not contain this information in its metadata.

    echo (new TesseractOCR('img.png'))
        ->dpi(300)
        ->run();

allowlist

This is a shortcut for

    echo (new TesseractOCR('img.png'))
        ->allowlist(range('a', 'z'), range(0, 9), '-_@')
        ->run();

configFile

Specify a config file to be used. It can either be the path to your own config file or the name of one of the predefined config files: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs

    echo (new TesseractOCR('img.png'))
        ->configFile('hocr')
        ->run();

setOutputFile

Specify an output file to be used. Be aware: If you set an output file then the option
In combination with

    echo (new TesseractOCR('img.png'))
        ->configFile('pdf')
        ->setOutputFile('/PATH_TO_MY_OUTPUTFILE/searchable.pdf')
        ->run();

digits

Shortcut for

    echo (new TesseractOCR('img.png'))
        ->digits()
        ->run();

hocr

Shortcut for

    echo (new TesseractOCR('img.png'))
        ->hocr()
        ->run();

pdf

Shortcut for

    echo (new TesseractOCR('img.png'))
        ->pdf()
        ->run();

quiet

Shortcut for

    echo (new TesseractOCR('img.png'))
        ->quiet()
        ->run();

tsv

Shortcut for

    echo (new TesseractOCR('img.png'))
        ->tsv()
        ->run();

txt

Shortcut for

    echo (new TesseractOCR('img.png'))
        ->txt()
        ->run();

tempDir

Define a custom directory to store temporary files generated by tesseract. Make sure the directory actually exists and the user running

    echo (new TesseractOCR('img.png'))
        ->tempDir('./my/custom/temp/dir')
        ->run();

withoutTempFiles

Specify that

    echo (new TesseractOCR('img.png'))
        ->withoutTempFiles()
        ->run();

Other options

Any configuration option offered by Tesseract can be used like this:

    echo (new TesseractOCR('img.png'))
        ->config('config_var', 'value')
        ->config('other_config_var', 'other value')
        ->run();

Or like this:

    echo (new TesseractOCR('img.png'))
        ->configVar('value')
        ->otherConfigVar('other value')
        ->run();

More info: https://github.com/tesseract-ocr/tesseract/wiki/ControlParams

Thread-limit

Sometimes, it may be useful to limit the number of threads that tesseract is allowed to use (e.g. in this case). Set the maximum number of threads as param for the

    echo (new TesseractOCR('img.png'))
        ->threadLimit(1)
        ->run();

How to contribute

You can contribute to this project by:

Just make sure you take a look at our Code of Conduct and Contributing instructions.

License

tesseract-ocr-for-php is released under the MIT License.

Made in Berlin