Text in PDF Recognized as Image Instead of Text During Parsing #573

kurekj · 2024-12-11T12:25:15Z

When parsing CVs with Docling on Ubuntu with Python 3.11, some regions of the PDF that contain text are classified as images instead of being recognized as text. This happens even with OCR enabled, and after trying different OCR engines and settings.

Environment:
Docling version: 2.10.0
Docling Core version: 2.9.0
Docling IBM Models version: 2.0.7
Docling Parse version: 3.0.0

Operating System: Ubuntu
Python version: 3.11

Relevant Code:
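# Imports needed by the active (uncommented) lines below
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Scale factor applied when rendering page and picture images
IMAGE_RESOLUTION_SCALE = 10.0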

pipeline_options = PdfPipelineOptions()
#pipeline_options = PdfPipelineOptions(backend=DoclingParseV2DocumentBackend)
#pipeline_options = PdfPipelineOptions(backend=DoclingParseV2PageBackend)

pipeline_options.do_ocr = True
#pipeline_options.do_table_structure = True
#pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # use more accurate TableFormer model
#pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
#pipeline_options.ocr_options.bitmap_area_threshold=0.05

# Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptions (Mac only), RapidOcrOptions
#ocr_options = EasyOcrOptions(force_full_page_ocr=True)
# ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
# ocr_options = OcrMacOptions(force_full_page_ocr=True)
#ocr_options = RapidOcrOptions(force_full_page_ocr=True)
#ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
#pipeline_options.ocr_options = ocr_options

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
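
To narrow down which regions end up as pictures, it may help to tally the element labels in the converted document. The following is only a minimal sketch: `count_labels` and `report` are hypothetical helper names, and it assumes that `result.document.iterate_items()` yields `(item, level)` pairs whose items carry a `label` attribute, as in docling-core 2.x.

```python
from collections import Counter


def count_labels(document):
    """Tally element labels (e.g. 'text', 'picture') in a converted document.

    Assumes document.iterate_items() yields (item, level) pairs and that each
    item has a `label` attribute (an enum or a plain string).
    """
    counts = Counter()
    for item, _level in document.iterate_items():
        label = getattr(item, "label", None)
        # Enum labels expose .value; plain strings are used as-is.
        counts[str(getattr(label, "value", label))] += 1
    return counts


def report(pdf_path, converter):
    """Convert one PDF with the given converter and return its label counts."""
    result = converter.convert(pdf_path)
    return count_labels(result.document)
```

Calling `report("cv.pdf", doc_converter)` (the file name is a placeholder) and comparing the `picture` count against what is visible on the page should show whether the text regions were labelled as pictures by the layout step, rather than lost during OCR.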

realCommitment · 2025-02-21T07:36:14Z

@kurekj - any luck?
