bug(pdf): incorrect selected area #20

@Kristinita

1. Possibly related issue

#1712.

2. Summary

PDF viewers select words incorrectly in PDFs created by Tesseract.

3. Data

Example files from my book:

4. Steps to reproduce

I downloaded the 64-bit Windows version from here, as described in the official Tesseract wiki → during installation I selected Russian (rus) as an additional language → I installed Tesseract → I added the folder containing tesseract.exe to my user PATH environment variable → I ran the command:

tesseract KiraProcessedTIF.tif KiraSuperhero -l rus pdf
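
For anyone who wants to script the step above, here is a minimal Python sketch that builds the same command and runs it only when Tesseract and the input image are actually available. The helper names build_tesseract_cmd and reproduce are hypothetical and not part of Tesseract; the file names and the rus language code are taken from the command above.

```python
import os
import shutil
import subprocess


def build_tesseract_cmd(image, output_base, lang="rus"):
    """Return the argument list for the command quoted in this issue."""
    # With the "pdf" config, Tesseract writes <output_base>.pdf.
    return ["tesseract", image, output_base, "-l", lang, "pdf"]


def reproduce(image="KiraProcessedTIF.tif", output_base="KiraSuperhero", lang="rus"):
    """Run the command if possible; return the PDF file name, else None."""
    cmd = build_tesseract_cmd(image, output_base, lang)
    # Skip the run when tesseract is not on PATH or the image is missing.
    if shutil.which(cmd[0]) is None or not os.path.exists(image):
        print("Skipping run; would execute:", " ".join(cmd))
        return None
    subprocess.run(cmd, check=True)
    return output_base + ".pdf"


if __name__ == "__main__":
    reproduce()
```

The sketch only wraps the command line; it does not inspect the text layer of the resulting PDF, so the selection problem still has to be checked in a PDF viewer.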

5. Expected behavior

For KiraCorrectOCR, text is selected correctly in any program:

KiraCorrectOCR Марк

KiraCorrectOCR самоосмысление

6. Actual behavior

For KiraSuperhero, the PDF generated by Tesseract, only part of the word is selected:

KiraSuperhero Марк

KiraSuperhero самоосмысление

This reproduces for any word in KiraSuperhero.

7. Did not help

Switching PDF viewers did not help: the actual behavior for KiraSuperhero reproduces in any PDF viewer.

  • Firefox
  • PDF-XChange Editor

8. Environment

  • Windows 10 Enterprise LTSB 64-bit EN
D:\SashaDebugging\KiraGoddess>tesseract --version
tesseract v5.0.0-alpha.20190708
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5

Thanks.
