Add pydantic base type support for page and table metadata #1005

ScottHMcKean · 2025-02-18T05:25:36Z

Requested feature

A lot of research is coming out on document hierarchy analysis and graph network analysis of documents (e.g. DocHieNet - https://aclanthology.org/2024.emnlp-main.65.pdf and Docs2KG - https://arxiv.org/html/2406.02962v1). IMO, Docling is the best-positioned library to implement some of these ideas at scale, but it needs a bit more support for page and table metadata, as well as hierarchy. Implementing any of this today seems to require a lot of extending and monkey patching.

For a good example, see https://github.com/AI4WA/Docs2KG/blob/develop/docs/Tutorial/2.HowToUseDocs2KGPackage.md
There, Docling is dropped entirely once the conversion is done.

Alternatives

Extend and monkey patch the existing implementation: generally unstable, since changes to the base classes can quickly break the patches.
Use Docling outputs in downstream pipelines like Docs2KG: fine, but it throws away many of the capabilities already in Docling, such as PictureClassificationData.
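
To make the first alternative concrete, here is a minimal sketch of a subclass-style extension. It uses stand-in models, not Docling's real classes: `PageItem`, `Document`, `PageWithDescription` and the `llm_description` field are all hypothetical. It assumes Pydantic v2 with default settings, where a subclass instance stored in a field annotated with the base type is serialized using only the base type's fields. Under that assumption the extra metadata is available in memory but may not survive a save and reload, which is one way this kind of extension can turn out to be fragile.

```python
from typing import Dict, Optional

from pydantic import BaseModel


class PageItem(BaseModel):
    """Stand-in for a library-owned page model (not Docling's real class)."""

    page_no: int


class Document(BaseModel):
    """Stand-in for a library-owned document that stores pages by number."""

    pages: Dict[int, PageItem] = {}


class PageWithDescription(PageItem):
    """Hypothetical user-side extension that adds an LLM description."""

    llm_description: Optional[str] = None


doc = Document(
    pages={1: PageWithDescription(page_no=1, llm_description="Title page")}
)

# In memory the subclass instance is kept, so the extra field is reachable.
in_memory = doc.pages[1]

# Round-trip through JSON, as a downstream pipeline typically would.
reloaded = Document.model_validate_json(doc.model_dump_json())

# The reloaded page is validated against the annotated base type.
restored = reloaded.pages[1]
```

A metadata field declared on the base classes themselves would avoid this, because the data would then be part of the schema that serialization and validation already use.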

Proposed Solution

Extend the page and table base classes with a common vocabulary for additional metadata, intentionally designed to be flexible enough to accommodate things like page-level LLM descriptions, named entity identification, etc.

I would suggest doing this for pages and floating items to unify the API. This protects the nice, tight pydantic interface while allowing natural extensions, with the addition sitting at roughly the same level of abstraction as the image capture. Metadata would be similar to annotations and may even replace them as a more general abstraction.

e.g.
MetaDataType = Annotated[
    Union[
        ItemClassificationData,
        ItemDescriptionData,
        ItemEntityData,
        ItemRelationshipData,
    ],
    Field(discriminator="kind"),
]

class PageItem(BaseModel):
    """PageItem."""

    # A page carries separate root items for furniture and body,
    # only referencing items on the page
    size: Size
    image: Optional[ImageRef] = None
    page_no: int
    metadata: Optional[List[MetaDataType]] = None

class FloatingItem(DocItem):
    """FloatingItem."""

    captions: List[RefItem] = []
    references: List[RefItem] = []
    footnotes: List[RefItem] = []
    image: Optional[ImageRef] = None
    metadata: Optional[List[MetaDataType]] = None
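
As a rough, self-contained illustration of how the `kind` discriminator could behave, the sketch below defines two of the proposed payload types and a reduced `PageItem`. The fields on `ItemDescriptionData` and `ItemEntityData` (`text`, `provenance`, `label`), the `kind` values, and the sample data are assumptions made up for this example, and `size` and `image` are left out so it runs without Docling installed. It is not an existing Docling API.

```python
from typing import Annotated, List, Literal, Optional, Union

from pydantic import BaseModel, Field


class ItemDescriptionData(BaseModel):
    """Hypothetical payload: a free-text description, e.g. from an LLM."""

    kind: Literal["description"] = "description"
    text: str
    provenance: str


class ItemEntityData(BaseModel):
    """Hypothetical payload: one named entity found on the item."""

    kind: Literal["entity"] = "entity"
    text: str
    label: str


# The "kind" literal on each model tells pydantic which class to build.
MetaDataType = Annotated[
    Union[ItemDescriptionData, ItemEntityData],
    Field(discriminator="kind"),
]


class PageItem(BaseModel):
    """Reduced stand-in for the proposed PageItem (size and image omitted)."""

    page_no: int
    metadata: Optional[List[MetaDataType]] = None


page = PageItem(
    page_no=1,
    metadata=[
        ItemDescriptionData(text="Title page with abstract", provenance="some-llm"),
        ItemEntityData(text="Docling", label="SOFTWARE"),
    ],
)

# Round-trip through JSON; each entry is rebuilt as its concrete type.
restored = PageItem.model_validate_json(page.model_dump_json())

# Downstream code can then select metadata by type.
entities = [m for m in restored.metadata or [] if isinstance(m, ItemEntityData)]
```

Because the union is discriminated, an entry with an unrecognized `kind` fails validation instead of being silently coerced into the wrong type, and new metadata kinds can be added by extending the union.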


ScottHMcKean added the enhancement New feature or request label Feb 18, 2025
