-
I'm referring to this GitHub repo: https://github.com/Sims2k/pdf_data_pipeline.git

When running our PDF extraction and subsequent chunking process using Docling, we expect each document item to include provenance metadata with page numbers (under the “page_no” property) extracted from the PDF. However, the final chunks and embedding metadata show that the page number information is missing or empty. This issue prevents us from accurately associating chunks with their corresponding source pages. We expect an output similar to:
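Roughly, an illustrative (assumed, simplified) shape of what we mean, with field names following Docling's document model:

# Hypothetical excerpt (assumed shape, not actual output from our pipeline):
# each doc item in a chunk's metadata should carry provenance entries
# that include a page_no field.
expected_chunk_meta = {
    "doc_items": [
        {
            "label": "text",
            "prov": [{"page_no": 2, "charspan": [0, 120]}],
        }
    ],
}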
Here, the “prov” field of each doc item should contain the “page_no” attribute. Unfortunately, in our current output the prov lists are empty, even though the pipeline is explicitly configured to capture page numbers.

Reproduction Steps
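PDF Extraction: a minimal sketch of a typical Docling converter setup of this kind (illustrative only; not the exact code from pdf_extraction.py, and the option values and file name are assumptions):

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Assumed option values; the real pdf_extraction.py may differ.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
doc = converter.convert("sample.pdf").document  # "sample.pdf" is a placeholder path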
This setup is meant to enable extraction of metadata (including page numbers) from PDFs.

Chunking Process:
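A sketch of the kind of chunking loop meant here (illustrative only; assumes doc is the converted DoclingDocument from the previous step):

from docling.chunking import HybridChunker

chunker = HybridChunker()
for chunk in chunker.chunk(dl_doc=doc):
    # Collect the page numbers referenced by the items that make up this chunk.
    page_numbers = sorted(
        {
            prov.page_no
            for item in chunk.meta.doc_items
            for prov in item.prov
        }
    )
    print(page_numbers)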
However, despite this logic, the page_numbers printed are empty.
Impact

Without page number information, it becomes challenging to trace each chunk back to its original location in the source document. This negatively impacts our ability to debug issues related to specific sections of PDFs and hinders downstream processing or display of contextual information.

pdf_extraction.py: Sets up the PDF extraction with options for capturing metadata.

It appears that either the extraction process is not embedding the page numbers into the provenance (prov) objects, or they are being lost during chunking. Any assistance in resolving or clarifying how the page numbers should be captured in the processed documents would be greatly appreciated. Thank you!
-
Hi @Sims2k, as you see below [1], we tried with a test document on our side but we cannot directly reproduce the reported behavior. Could you provide a specific document and a minimal snippet for reproducing? (Side remark, unrelated to the original topic: consider using …)

[1] The following snippet returns page numbers with no problems:

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
DOC_SOURCE = "https://arxiv.org/pdf/2206.01062"
converter = DocumentConverter()
doc = converter.convert(source=DOC_SOURCE).document
chunker = HybridChunker()
chunk_iter = chunker.chunk(dl_doc=doc)
for i, chunk in enumerate(chunk_iter):
    if i < 10:
        page_numbers = sorted(
            set(
                prov.page_no
                for item in chunk.meta.doc_items
                for prov in item.prov
                if hasattr(prov, "page_no")
            )
        )
        print(f"Chunk {i}, text: {repr(chunker.serialize(chunk)[:40])}…, Page Numbers: {page_numbers}")

Output:

Chunk 0, text: 'DocLayNet: A Large Human-Annotated Datas'…, Page Numbers: [1]
Chunk 1, text: 'ABSTRACT\nAccurate document layout analys'…, Page Numbers: [1]
Chunk 2, text: 'CCS CONCEPTS\n· Informationsystems → Docu'…, Page Numbers: [1]
Chunk 3, text: 'KEYWORDS\nPDF document conversion, layout'…, Page Numbers: [1]
Chunk 4, text: 'ACMReference Format:\nBirgit Pfitzmann, C'…, Page Numbers: [1]
Chunk 5, text: '1 INTRODUCTION\nDespite the substantial i'…, Page Numbers: [2]
Chunk 6, text: '1 INTRODUCTION\nIn this paper, we present'…, Page Numbers: [2]
Chunk 7, text: '2 RELATED WORK\nWhile early approaches in'…, Page Numbers: [2]
Chunk 8, text: '3 THE DOCLAYNET DATASET\nDocLayNet contai'…, Page Numbers: [2, 3]
Chunk 9, text: '3 THE DOCLAYNET DATASET\nWe did not contr'…, Page Numbers: [3]
@Sims2k
Bingo, yes, it definitely does — in a nutshell, starting from a DoclingDocument: