-
I'm referring to this GitHub repo: https://github.com/Sims2k/pdf_data_pipeline.git

When running our PDF extraction and subsequent chunking process using Docling, we expect each document item to include provenance metadata with page numbers (under the “page_no” property) extracted from the PDF. However, the final chunks and embedding metadata show that the page number information is missing or empty. This issue prevents us from accurately associating chunks with their corresponding source pages. We expect an output similar to:
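Roughly, an illustrative (assumed, simplified) shape of what we mean, with field names following Docling's document model:

# Hypothetical excerpt (assumed shape, not actual output from our pipeline):
# each doc item in a chunk's metadata should carry provenance entries
# that include a page_no field.
expected_chunk_meta = {
    "doc_items": [
        {
            "label": "text",
            "prov": [{"page_no": 2, "charspan": [0, 120]}],
        }
    ],
}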
Here, the “prov” field of each doc item should contain the “page_no” attribute. Unfortunately, in our current output the prov lists are empty, even though the pipeline is explicitly configured to capture page numbers.

Reproduction Steps
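PDF Extraction: a minimal sketch of a typical Docling converter setup of this kind (illustrative only; not the exact code from pdf_extraction.py, and the option values and file name are assumptions):

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Assumed option values; the real pdf_extraction.py may differ.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
doc = converter.convert("sample.pdf").document  # "sample.pdf" is a placeholder path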
This setup is meant to enable extraction of metadata (including page numbers) from PDFs.

Chunking Process:
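A sketch of the kind of chunking loop meant here (illustrative only; assumes doc is the converted DoclingDocument from the previous step):

from docling.chunking import HybridChunker

chunker = HybridChunker()
for chunk in chunker.chunk(dl_doc=doc):
    # Collect the page numbers referenced by the items that make up this chunk.
    page_numbers = sorted(
        {
            prov.page_no
            for item in chunk.meta.doc_items
            for prov in item.prov
        }
    )
    print(page_numbers)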
However, despite this logic, the page_numbers printed are empty.
Impact

Without page number information, it becomes challenging to trace each chunk back to its original location in the source document. This negatively impacts our ability to debug issues related to specific sections of PDFs and hinders downstream processing or display of contextual information.

pdf_extraction.py: Sets up the PDF extraction with options for capturing metadata.

It appears that either the extraction process is not embedding the page numbers into the provenance (prov) objects, or they are being lost during chunking. Any assistance in resolving or clarifying how the page numbers should be captured in the processed documents would be greatly appreciated. Thank you!
-
Hi @Sims2k, as you see below [1], we tried with a test document on our side but we cannot directly reproduce the reported behavior. Could you provide a specific document and a minimal snippet for reproducing? (Side remark, unrelated to the original topic: consider using …)

[1] The following snippet returns page numbers with no problems:

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
DOC_SOURCE = "https://arxiv.org/pdf/2206.01062"
converter = DocumentConverter()
doc = converter.convert(source=DOC_SOURCE).document
chunker = HybridChunker()
chunk_iter = chunker.chunk(dl_doc=doc)
for i, chunk in enumerate(chunk_iter):
    if i < 10:
        page_numbers = sorted(
            set(
                prov.page_no
                for item in chunk.meta.doc_items
                for prov in item.prov
                if hasattr(prov, "page_no")
            )
        )
        print(f"Chunk {i}, text: {repr(chunker.serialize(chunk)[:40])}…, Page Numbers: {page_numbers}")

Output:

Chunk 0, text: 'DocLayNet: A Large Human-Annotated Datas'…, Page Numbers: [1]
Chunk 1, text: 'ABSTRACT\nAccurate document layout analys'…, Page Numbers: [1]
Chunk 2, text: 'CCS CONCEPTS\n· Informationsystems → Docu'…, Page Numbers: [1]
Chunk 3, text: 'KEYWORDS\nPDF document conversion, layout'…, Page Numbers: [1]
Chunk 4, text: 'ACMReference Format:\nBirgit Pfitzmann, C'…, Page Numbers: [1]
Chunk 5, text: '1 INTRODUCTION\nDespite the substantial i'…, Page Numbers: [2]
Chunk 6, text: '1 INTRODUCTION\nIn this paper, we present'…, Page Numbers: [2]
Chunk 7, text: '2 RELATED WORK\nWhile early approaches in'…, Page Numbers: [2]
Chunk 8, text: '3 THE DOCLAYNET DATASET\nDocLayNet contai'…, Page Numbers: [2, 3]
Chunk 9, text: '3 THE DOCLAYNET DATASET\nWe did not contr'…, Page Numbers: [3]
@Sims2k
Bingo, yes, it definitely does — in a nutshell, starting from a DoclingDocument: