I tried loading a PDF file with multiple headings/sections, but docling seems to always export them to Markdown as H2 (##) only. Am I doing something wrong here? I have tried with multiple PDFs.
@nikhildigde Yes, this is a known limitation for now. Basically, we need to infer the table of contents in order to get the right header levels. For the moment, the header level can be inferred for docx, html and md, but not yet for pdf. After we refactor the reading-order model, this is the next issue we want to handle.
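Until header levels are inferred for PDF, one possible stopgap is to post-process the exported Markdown. The following is only a minimal sketch, and it assumes the section headings in the PDF carry numbering such as `1.2 Title`; unnumbered headings are left untouched. The helper name `relevel_headings` and its `base_level` parameter are hypothetical and not part of docling.

```python
import re

# A Markdown ATX heading whose text starts with section numbering,
# e.g. "## 2.1 Background" or "## 3. Results".
_NUMBERED_HEADING = re.compile(r"^(#{1,6})\s+((\d+(?:\.\d+)*)\.?\s+.*)$")


def relevel_headings(markdown: str, base_level: int = 1) -> str:
    """Hypothetical helper: set heading depth from the section numbering.

    "1 Intro" becomes level `base_level`, "1.2 Foo" one level deeper, and so on.
    Headings without leading numbering are returned unchanged.
    """
    out = []
    for line in markdown.splitlines():
        match = _NUMBERED_HEADING.match(line)
        if match:
            # "1.2.3" has two dots, so it sits three levels deep.
            depth = match.group(3).count(".") + 1
            level = min(base_level + depth - 1, 6)  # Markdown stops at H6
            line = "#" * level + " " + match.group(2)
        out.append(line)
    return "\n".join(out)
```

The input would be the string returned by the document's `export_to_markdown()`. Note that this heuristic also matches headings that merely start with a number (for example a year), so it is only suitable for documents with consistent section numbering.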
@PeterStaar-IBM thank you for the response and explanation. Sorry to ask, but do you have any ETA for this fix? Also, will this not work if there is no table of contents? I thought it would need some model training to get this right.
Bug
docling_test.pdf
...
Steps to reproduce
import logging
import time
from pathlib import Path
from docling_core.types.doc import ImageRefMode, PictureItem, TableItem
from docling.datamodel.base_models import FigureElement, InputFormat, Table
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode, EasyOcrOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
_log = logging.getLogger(__name__)
IMAGE_RESOLUTION_SCALE = 2.0

def main():
    logging.basicConfig(level=logging.INFO)
...
Docling version
docling 2.15.1
docling-core 2.15.1
docling-ibm-models 3.2.1
docling-parse 3.1.1
...
Python version
Python 3.11.11
...