Export to markdown only contains H2 headers #1023

nikhildigde · 2025-02-19T16:45:09Z

Bug

I tried loading a PDF file with multiple headings and sections, but Docling seems to always export every heading to Markdown as H2 (##). Am I doing something wrong here? I have tried with multiple PDFs.

docling_test.pdf

...

Steps to reproduce

import logging
import time
from pathlib import Path

from docling_core.types.doc import ImageRefMode, PictureItem, TableItem
from docling.datamodel.base_models import FigureElement, InputFormat, Table
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode, EasyOcrOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

IMAGE_RESOLUTION_SCALE = 2.0

def main():
    logging.basicConfig(level=logging.INFO)

    input_doc_path = Path("/Users/nikhildi/Downloads/solution.pdf")

    # Make sure the output directory exists before writing to it
    output_dir = Path("scratch")
    output_dir.mkdir(parents=True, exist_ok=True)

    # Run OCR with Tesseract (English) and skip picture image generation
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng"])
    pipeline_options.generate_picture_images = False

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    start_time = time.time()

    conv_res = doc_converter.convert(input_doc_path)
    _log.info("Converted in %.2f s", time.time() - start_time)

    # Write the Markdown export to disk
    md_filename = output_dir / "test.md"
    conv_res.document.save_as_markdown(filename=md_filename, image_placeholder="")

if __name__ == "__main__":
    main()
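
To confirm the symptom on the exported file, a small helper can tally how often each heading level occurs. This is only a sketch using the standard library; `heading_level_counts` is a hypothetical name, not a Docling function, and the fence handling is deliberately simplistic (any fence line toggles the "inside code" state).

```python
import re
from collections import Counter

# ATX heading: up to 3 leading spaces, 1-6 '#', then whitespace or end of line.
_HEADING_RE = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+|$)")
# Opening or closing fence of a code block (backticks or tildes).
_FENCE_RE = re.compile(r"^ {0,3}(`{3,}|~{3,})")


def heading_level_counts(markdown_text: str) -> Counter:
    """Count ATX headings per level, ignoring lines inside fenced code."""
    counts = Counter()
    in_fence = False
    for line in markdown_text.splitlines():
        if _FENCE_RE.match(line):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = _HEADING_RE.match(line)
        if match:
            counts[len(match.group(1))] += 1
    return counts
```

Calling it on the contents of `scratch/test.md` (for example via `Path("scratch/test.md").read_text()`) should show whether every heading really landed on level 2.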

...

Docling version

docling 2.15.1
docling-core 2.15.1
docling-ibm-models 3.2.1
docling-parse 3.1.1
...

Python version

Python 3.11.11
...


PeterStaar-IBM · 2025-02-21T06:28:58Z

@nikhildigde Yes, this is a known limitation for now. Basically, we need to infer the table of contents in order to get the right level of the headers. For the moment, the header level can be inferred for docx, html and md, but not yet for pdf. After we refactor the reading-order model, this is the next issue we want to handle.
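
Until heading levels are inferred for PDF, one possible stopgap is to post-process the exported Markdown. The sketch below is not Docling's approach and not part of its API: `relevel_numbered_headings` is a hypothetical helper that assumes the headings carry dotted section numbers (such as "1.2.3 Title") and derives the level from the numbering depth. Headings without such numbering are left untouched.

```python
import re

# An ATX heading whose title starts with dotted section numbering,
# e.g. "## 2.3.1 Results" or "## 2. Methods".
_NUMBERED_HEADING_RE = re.compile(
    r"^(#{1,6})[ \t]+((\d+(?:\.\d+)*)\.?[ \t]+.*)$"
)


def relevel_numbered_headings(markdown_text: str, base_level: int = 1) -> str:
    """Set each numbered heading's level from the depth of its numbering.

    "1" maps to base_level, "1.1" to base_level + 1, and so on (capped at 6).
    Limitations: lines inside code fences are not skipped, and a title that
    merely starts with a number (e.g. "2024 Report") is treated as depth 1.
    """
    out = []
    for line in markdown_text.splitlines():
        match = _NUMBERED_HEADING_RE.match(line)
        if match:
            depth = match.group(3).count(".") + 1
            level = min(base_level + depth - 1, 6)
            line = "#" * level + " " + match.group(2)
        out.append(line)
    result = "\n".join(out)
    # splitlines() drops a trailing newline; restore it if it was present.
    if markdown_text.endswith("\n"):
        result += "\n"
    return result
```

Deriving the level from numbering depth rather than font size keeps the helper independent of the PDF layout, at the cost of doing nothing for documents with unnumbered headings.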

nikhildigde · 2025-02-21T06:53:04Z

@PeterStaar-IBM thank you for the response and explanation. Sorry to ask, but do you have any ETA for this fix? Also, if there is no table of contents will this not work? I thought it would need some model training to get this right?

PeterStaar-IBM · 2025-02-21T07:05:29Z

@nikhildigde As soon as we can. We will probably start with an approximate solution and then gradually improve it.

nikhildigde · 2025-02-21T07:09:09Z

Ok. Thank you for the great work. Appreciate it!

nikhildigde added the bug label Feb 19, 2025

PeterStaar-IBM added the enhancement label and removed the bug label Feb 21, 2025

PeterStaar-IBM self-assigned this Feb 21, 2025
