Export to markdown only contains H2 headers #1023

Open
nikhildigde opened this issue Feb 19, 2025 · 4 comments
@nikhildigde

Bug

I tried loading a PDF file with multiple headings and sections, but docling always exports it to Markdown with H2 (##) headers only. Am I doing something wrong here? I have tried this with multiple PDFs.

docling_test.pdf

...

Steps to reproduce

import logging
import time
from pathlib import Path

from docling_core.types.doc import ImageRefMode, PictureItem, TableItem
from docling.datamodel.base_models import FigureElement, InputFormat, Table
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode, EasyOcrOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

IMAGE_RESOLUTION_SCALE = 2.0

def main():
    logging.basicConfig(level=logging.INFO)

    input_doc_path = Path("/Users/nikhildi/Downloads/solution.pdf")

    output_dir = Path("scratch")
    # Make sure the output directory exists before writing to it.
    output_dir.mkdir(parents=True, exist_ok=True)

    # Enable OCR with Tesseract (English) and skip picture image generation.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng"])
    pipeline_options.generate_picture_images = False

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    start_time = time.time()

    # Convert the PDF and export the result to Markdown.
    conv_res = doc_converter.convert(input_doc_path)
    md_filename = output_dir / "test.md"

    print(conv_res.document.save_as_markdown(filename=md_filename, image_placeholder=""))

if __name__ == "__main__":
    main()

...

Docling version

docling 2.15.1
docling-core 2.15.1
docling-ibm-models 3.2.1
docling-parse 3.1.1
...

Python version

Python 3.11.11
...

@nikhildigde nikhildigde added the bug Something isn't working label Feb 19, 2025
@PeterStaar-IBM PeterStaar-IBM added enhancement New feature or request and removed bug Something isn't working labels Feb 21, 2025
@PeterStaar-IBM PeterStaar-IBM self-assigned this Feb 21, 2025
@PeterStaar-IBM
Contributor

@nikhildigde Yes, this is a known limitation for now. Basically, we need to infer the table of contents in order to get the right level for each header. For the moment, the header level can be inferred for DOCX, HTML, and Markdown input, but not yet for PDF. After we refactor the reading-order model, this is the next issue we want to handle.
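Until header levels are inferred for PDF, one possible stopgap is to post-process the exported Markdown. The sketch below is not docling's approach and does not use any docling API: it assumes the PDF's headings carry dotted section numbers (such as "1.2 Scope") and derives the level from the numbering depth. The helper name `relevel_headings` and its `base_level` parameter are hypothetical, and headings without a section number are left unchanged.

```python
import re

# An ATX heading whose text starts with a dotted section number,
# e.g. "## 2.1.3 Results" or "## 4. Discussion".
# Group 1 is the full heading text, group 2 is the section number.
_NUMBERED_HEADING = re.compile(r"^#{1,6}\s+((\d+(?:\.\d+)*)\.?\s+.*)$")


def relevel_headings(markdown: str, base_level: int = 2) -> str:
    """Set each numbered heading's level from its numbering depth.

    Hypothetical helper: "1" maps to base_level, "1.1" to base_level + 1,
    and so on, capped at 6 (the deepest Markdown heading level).
    """
    out = []
    in_fence = False
    for line in markdown.splitlines():
        # Leave anything inside fenced code blocks untouched.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else _NUMBERED_HEADING.match(line)
        if match:
            text, number = match.groups()
            depth = number.count(".") + 1
            level = min(base_level + depth - 1, 6)
            out.append("#" * level + " " + text)
        else:
            out.append(line)
    return "\n".join(out)
```

It could be applied to the text of the exported file, for example `relevel_headings(Path("scratch/test.md").read_text())`. The default `base_level=2` keeps top-level sections at the `##` level the export already produces. A heading that merely starts with a number (say, a year) would be treated as a section number, so the result is worth checking by eye.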

@nikhildigde
Author

@PeterStaar-IBM Thank you for the response and explanation. Sorry to ask, but do you have an ETA for this fix? Also, will this not work if the document has no table of contents? I thought it would need some model training to get this right.

@PeterStaar-IBM
Contributor

@nikhildigde As soon as we can. We will probably start with an approximate solution and then gradually improve it.

@nikhildigde
Author

Ok. Thank you for the great work. Appreciate it!
