Page Numbers Not Appearing Correctly in Provenance Metadata #1012

Answered by vagenas
Sims2k asked this question in Q&A

@Sims2k

I’m also wondering if the process of converting the original PDFs to Markdown files and storing them in a directory as an intermediate step could be affecting this metadata. Our extraction pipeline converts PDFs to Markdown (using export_to_markdown) and saves them in the “data-pipeline/extracted-pdfs” directory. Later, the chunking process reads these Markdown files to produce chunks. Could this intermediate conversion and storage step be contributing to the loss of the page number metadata?

Bingo, yes, it definitely does — in a nutshell, starting from a DoclingDocument:

  1. exporting to file formats like Markdown or HTML is naturally lossy, in that only certain layout information…
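
To illustrate the lossy-export point, here is a minimal toy sketch in plain Python. It does not use Docling's API: `Item`, `to_markdown`, `from_markdown`, `to_json`, and `from_json` are hypothetical stand-ins, and the JSON route is only an assumed example of an intermediate format that keeps every field. The idea is simply that a Markdown file has no place to store a page number, so anything read back from it cannot carry one.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a document element with provenance."""
    text: str
    page_no: Optional[int]


def to_markdown(items: List[Item]) -> str:
    # Markdown keeps only the text; the page number is dropped here.
    return "\n\n".join(item.text for item in items)


def from_markdown(md: str) -> List[Item]:
    # Reading the Markdown back cannot recover the page numbers.
    blocks = [b.strip() for b in md.split("\n\n") if b.strip()]
    return [Item(text=b, page_no=None) for b in blocks]


def to_json(items: List[Item]) -> str:
    # A structured format can serialize every field, provenance included.
    return json.dumps([asdict(item) for item in items])


def from_json(payload: str) -> List[Item]:
    return [Item(**d) for d in json.loads(payload)]


items = [Item("Introduction", 1), Item("Results", 7)]

via_md = from_markdown(to_markdown(items))
via_json = from_json(to_json(items))

print([i.page_no for i in via_md])    # page numbers are gone
print([i.page_no for i in via_json])  # page numbers survive
```

In this toy model, the chunks built from the Markdown intermediate have no page information at all, while the structured round trip returns items equal to the originals. The same reasoning suggests feeding the chunker the in-memory document, or a structured serialization of it, rather than re-reading exported Markdown.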

Answer selected by Sims2k