Automatically detect PDFs requiring force OCR #1014

Fogapod · 2025-02-18T18:41:37Z

Question

LibreOffice shows each letter as a separate "bezier curve".
The metadata says it was generated by Microsoft Print to PDF:

/Producer (Microsoft: Print To PDF)
/Title (Large Language Model Market Size And Share Report, 2030)

Docling/easyocr/dlparse_v2 extracts the following text:

<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
1S, often lata Iling of et dynamics

<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->
<!-- missing-text -->

(unsure if <!-- missing-text --> is a bug)

The force_ocr option results in correct parsing.

The question is: what kind of document is this, and is there a way to know in advance that it needs force_ocr?

On a side note, this document also takes unusually long to process, especially with force_ocr.

Docling version

Docling version: 2.22.0
Docling Core version: 2.18.1
Docling IBM Models version: 3.3.0
Docling Parse version: 3.3.1
Python: cpython-312 (3.12.7)
Platform: Linux-6.13.2-arch1-1-x86_64-with-glibc2.41

Python version

3.12.7


pavel-denisov-fraunhofer · 2025-02-20T08:06:31Z

As a very rough workaround, one could get the page's text:

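    # Fragment meant to be placed inside the OCR model, where `page` is available.
    # Assumed import: from docling_core.types.doc import BoundingBox, CoordOrigin
    if page._backend is not None:
        bitmap_rects = page._backend.get_bitmap_rects()
        # Collect the text embedded in the PDF over the full page area.
        page_text = page._backend.get_text_in_rect(
            BoundingBox(
                l=0,
                t=0,
                r=page.size.width,
                b=page.size.height,
                coord_origin=CoordOrigin.TOPLEFT,
            )
        )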

And check if it's too short here:

docling/docling/models/base_ocr_model.py, lines 82 to 84 in dfcc30d:

    if self.options.force_full_page_ocr or coverage > max(
        BITMAP_COVERAGE_TRESHOLD, self.options.bitmap_area_threshold
    ):

For example, len(page_text) < 100.

The downside is that OCR would run even on pages having no text at all. This could be improved by checking text only in the bounding boxes where the layout model detected some text, but currently the layout model runs after the OCR in the pipeline, so its output is not available. I'm not sure if switching their order would break anything.
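The length check above could be sketched as a small standalone helper. This is only an illustration: the function name `needs_force_ocr` and the `min_chars` parameter are hypothetical and not part of Docling, and the sketch counts non-whitespace characters rather than using `len(page_text)` directly, so that whitespace padding does not hide an empty text layer.

```python
def needs_force_ocr(page_text: str, min_chars: int = 100) -> bool:
    """Return True when the embedded text layer looks too short to trust.

    Hypothetical helper: `page_text` is assumed to be the text extracted
    from the whole page, and `min_chars` is an arbitrary threshold.
    """
    # Count only visible characters so whitespace does not inflate the length.
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible < min_chars


if __name__ == "__main__":
    # The fragment Docling extracted from the problematic page is far too short.
    print(needs_force_ocr("1S, often lata Iling of et dynamics"))
```

As noted above, a page with no text at all would also be flagged, so the threshold trades unnecessary OCR runs against missed pages.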

PeterStaar-IBM · 2025-02-21T06:35:31Z

@Fogapod We are aware of this particular issue with PDFs and are working towards a solution in which we detect text blocks and run OCR if no text cells are detected.
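To make the idea concrete, here is a minimal sketch of such a check, not the Docling implementation. The `Box` class and the `blocks_missing_text` function are hypothetical names, and the sketch assumes top-left-origin coordinates and that a text cell belongs to a block when the cell's center lies inside it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle with a top-left origin (hypothetical type)."""
    l: float
    t: float
    r: float
    b: float

    def contains_center_of(self, other: "Box") -> bool:
        # A cell is treated as inside the block if its center falls within it.
        cx = (other.l + other.r) / 2
        cy = (other.t + other.b) / 2
        return self.l <= cx <= self.r and self.t <= cy <= self.b


def blocks_missing_text(text_blocks: List[Box], text_cells: List[Box]) -> List[Box]:
    """Return the detected text blocks that contain no embedded text cell."""
    return [
        block
        for block in text_blocks
        if not any(block.contains_center_of(cell) for cell in text_cells)
    ]


if __name__ == "__main__":
    blocks = [Box(0, 0, 100, 20), Box(0, 30, 100, 50)]
    cells = [Box(10, 5, 40, 15)]  # only the first block has embedded text
    # Blocks returned here would be candidates for OCR.
    print(blocks_missing_text(blocks, cells))
```

Unlike a page-level length check, a page without any detected text block yields an empty list, so no OCR would be triggered for it.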

Fogapod added the question Further information is requested label Feb 18, 2025

PeterStaar-IBM assigned PeterStaar-IBM and cau-git Feb 21, 2025

PeterStaar-IBM added enhancement New feature or request layout and removed question Further information is requested labels Feb 21, 2025
