The downside is that OCR would run even on pages having no text at all. This could be improved by checking text only in the bounding boxes where the layout model detected some text, but currently the layout model runs after the OCR in the pipeline, so its output is not available. I'm not sure if switching their order would break anything.
@Fogapod We are aware of this particular issue with PDFs and are working towards a solution in which we detect text blocks and run OCR if no text cells are detected.
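The idea discussed in the two comments above (run OCR only where the layout model sees text but the PDF parser found none) could look roughly like the sketch below. It is only an illustration under assumptions: `Box`, `page_needs_ocr`, and the input shapes are hypothetical names, not part of Docling's API, and coordinates are assumed to use a top-left origin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned rectangle: left, top, right, bottom (top-left origin)."""
    l: float
    t: float
    r: float
    b: float

    def intersects(self, other: "Box") -> bool:
        # True when the two rectangles share a non-empty area.
        return (self.l < other.r and other.l < self.r
                and self.t < other.b and other.t < self.b)


def page_needs_ocr(text_blocks: list[Box],
                   text_cells: list[tuple[Box, str]]) -> bool:
    """Return True if some detected text block has no non-empty parsed text cell.

    text_blocks: regions a layout model labelled as text (assumed input).
    text_cells:  (box, text) pairs extracted programmatically from the PDF (assumed input).
    """
    for block in text_blocks:
        covered = any(
            text.strip() and block.intersects(cell_box)
            for cell_box, text in text_cells
        )
        if not covered:
            return True
    # No text blocks at all (e.g. a blank page), or every block is covered.
    return False
```

With this check a page with no detected text blocks skips OCR entirely, while a page like the one in nocontent.pdf, where glyphs are drawn as curves and no text cells exist, would be sent to OCR. Note that it presupposes the layout model's output being available before the OCR decision is made.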
Question
nocontent.pdf
LibreOffice shows each letter as a separate "bezier curve".
Metadata says it was generated by Microsoft Print to PDF:
Docling/easyocr/dlparse_v2 extracts the following text:
(unsure if <!-- missing-text --> is a bug)
The force_ocr option results in correct parsing.
The question is: what is this document, and is there a way to know in advance that it needs force_ocr?
On a side note, this document also takes unusually long to process, especially with force_ocr.
Docling version
Python version
3.12.7