1. Possibly related issue
#1712.
2. Summary
PDF viewers select words incorrectly in PDFs created by Tesseract.
3. Data
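Example files from my book:
- KiraProcessedTIF.tif: the TIF image
- KiraSuperhero.pdf: the PDF created by Tesseract
- KiraCorrectOCR.pdf: a PDF with correct OCR, for comparison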
4. Steps to reproduce
I downloaded the 64-bit Windows version from here, as described in the official Tesseract wiki → during installation I selected Russian (rus) as an additional language → I installed Tesseract → I added the directory containing tesseract.exe to my user PATH environment variable → I ran the command:
tesseract KiraProcessedTIF.tif KiraSuperhero -l rus pdf
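For anyone who wants to repeat the step above from a script, here is a minimal Python sketch that wraps the same command. The helper names `build_command` and `reproduce` are hypothetical, made up for this illustration. The sketch assumes `tesseract` is on PATH and that the TIF is in the current directory; otherwise it skips the run and returns `None`.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image, output_base, lang="rus"):
    """Return the argument list for the command from this report (hypothetical helper)."""
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def reproduce(image="KiraProcessedTIF.tif", output_base="KiraSuperhero", lang="rus"):
    """Run Tesseract on the image; return the expected PDF path, or None if skipped."""
    # Skip instead of failing when Tesseract or the input image is not available.
    if shutil.which("tesseract") is None or not Path(image).is_file():
        return None
    subprocess.run(build_command(image, output_base, lang), check=True)
    # With the "pdf" config, Tesseract appends ".pdf" to the output base name.
    return Path(f"{output_base}.pdf")


if __name__ == "__main__":
    print(reproduce())
```

Running the script next to KiraProcessedTIF.tif should write KiraSuperhero.pdf, the file that shows the selection problem.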
5. Expected behavior
In KiraCorrectOCR, text is selected correctly in any program.


6. Actual behavior
In KiraSuperhero, the PDF created by Tesseract, selecting a word does not select the whole word.
This reproduces for any word in KiraSuperhero.
7. Did not help
Switching PDF viewers did not help: the actual behavior reproduces for KiraSuperhero in any PDF viewer.


8. Environment
- Windows 10 Enterprise LTSB 64-bit EN
D:\SashaDebugging\KiraGoddess>tesseract --version
tesseract v5.0.0-alpha.20190708
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
Thanks.