Automatically detect PDFs requiring force OCR #1014

Open
Fogapod opened this issue Feb 18, 2025 · 2 comments

Fogapod commented Feb 18, 2025

Question

nocontent.pdf

LibreOffice shows each letter as a separate "bezier curve".
The metadata says it was generated by Microsoft Print to PDF:

/Producer (Microsoft: Print To PDF)
/Title (Large Language Model Market Size And Share Report, 2030)

Docling/easyocr/dlparse_v2 extracts the following text:

<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
1S, often lata Iling of et dynamics

<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->

(I'm unsure whether <!-- missing-text --> is a bug.)

The force_ocr option results in correct parsing.

The question is: what kind of document is this, and is there a way to know in advance that it needs force_ocr?

On a side note, this document also takes unusually long to process, especially with force_ocr.

Docling version

Docling version: 2.22.0
Docling Core version: 2.18.1
Docling IBM Models version: 3.3.0
Docling Parse version: 3.3.1
Python: cpython-312 (3.12.7)
Platform: Linux-6.13.2-arch1-1-x86_64-with-glibc2.41

Python version

3.12.7

@Fogapod Fogapod added the question Further information is requested label Feb 18, 2025
@pavel-denisov-fraunhofer
Contributor

As a very rough workaround, one could get the page's text:

if page._backend is not None:
    bitmap_rects = page._backend.get_bitmap_rects()
    page_text = page._backend.get_text_in_rect(
        BoundingBox(
            l=0,
            t=0,
            r=page.size.width,
            b=page.size.height,
            coord_origin=CoordOrigin.TOPLEFT,
        )
    )

And check if it's too short here:

if self.options.force_full_page_ocr or coverage > max(
    BITMAP_COVERAGE_TRESHOLD, self.options.bitmap_area_threshold
):

For example, len(page_text) < 100.

The downside is that OCR would also run on pages that have no text at all. This could be improved by checking for text only inside the bounding boxes where the layout model detected text, but the layout model currently runs after OCR in the pipeline, so its output is not yet available at that point. I'm not sure whether switching their order would break anything.
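
The check described above can be sketched as a standalone predicate. This is only a rough illustration, not Docling's implementation: `needs_full_page_ocr`, `MIN_TEXT_CHARS`, and `DEFAULT_BITMAP_THRESHOLD` are hypothetical names, the threshold value is an assumption, and the `max(...)` of two thresholds from the snippet above is collapsed into a single parameter. Only the 100-character cutoff comes from the comment.

```python
MIN_TEXT_CHARS = 100  # cutoff suggested in the comment above
DEFAULT_BITMAP_THRESHOLD = 0.05  # assumed value, for illustration only


def needs_full_page_ocr(
    page_text: str,
    bitmap_coverage: float,
    force_full_page_ocr: bool = False,
    bitmap_area_threshold: float = DEFAULT_BITMAP_THRESHOLD,
    min_text_chars: int = MIN_TEXT_CHARS,
) -> bool:
    """Return True if the page should go through full-page OCR."""
    # Existing conditions: explicit user request or large bitmap coverage.
    if force_full_page_ocr or bitmap_coverage > bitmap_area_threshold:
        return True
    # Proposed extra condition: the programmatic text layer is too short.
    # Whitespace is ignored so padding does not count as text.
    visible_chars = sum(1 for ch in page_text if not ch.isspace())
    return visible_chars < min_text_chars
```

Note that an empty page (`page_text == ""`) also triggers OCR here, which is exactly the downside mentioned above.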

@PeterStaar-IBM
Contributor

@Fogapod We are aware of this particular issue with PDFs and are working towards a solution in which we detect text blocks and run OCR if no text cells are detected.
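
One possible reading of that approach: for each text block reported by the layout model, check whether any programmatic text cell falls inside it, and run OCR if some block has none. The sketch below works under that assumption only. The `Box` type, `blocks_missing_text`, `page_needs_ocr`, and the center-point containment test are all hypothetical and are not Docling's API or its planned implementation. Coordinates assume a top-left origin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle with a top-left origin (t <= b)."""
    l: float
    t: float
    r: float
    b: float

    def contains_center_of(self, other: "Box") -> bool:
        # A cell counts as inside a block if its center lies within the block.
        cx = (other.l + other.r) / 2
        cy = (other.t + other.b) / 2
        return self.l <= cx <= self.r and self.t <= cy <= self.b


def blocks_missing_text(text_blocks: list[Box], text_cells: list[Box]) -> list[Box]:
    """Return the detected text blocks that contain no programmatic text cell."""
    return [
        block
        for block in text_blocks
        if not any(block.contains_center_of(cell) for cell in text_cells)
    ]


def page_needs_ocr(text_blocks: list[Box], text_cells: list[Box]) -> bool:
    """OCR is needed when at least one detected text block has no text cell."""
    return bool(blocks_missing_text(text_blocks, text_cells))
```

Unlike a plain text-length check, a page with no detected text blocks does not trigger OCR in this sketch, and `blocks_missing_text` returns the specific regions that would need it.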

@PeterStaar-IBM PeterStaar-IBM added enhancement New feature or request layout and removed question Further information is requested labels Feb 21, 2025