Add pydantic base type support for page and table metadata #1005

Open
ScottHMcKean opened this issue Feb 18, 2025 · 0 comments
Labels
enhancement New feature or request

Requested feature

There is a lot of research coming out on document hierarchy analysis and graph-based analysis of documents (e.g. DocHieNet - https://aclanthology.org/2024.emnlp-main.65.pdf and Docs2KG - https://arxiv.org/html/2406.02962v1). IMO, Docling is the best-positioned library to implement some of these ideas at scale, but it needs more support for page and table metadata, as well as for hierarchy. Trying to implement some of this today takes a lot of extending and monkey patching.

For a good example, see https://github.com/AI4WA/Docs2KG/blob/develop/docs/Tutorial/2.HowToUseDocs2KGPackage.md
In that workflow, Docling is dropped entirely once the conversion is done.

Alternatives

  1. Extend and monkey-patch the existing implementation - generally unstable, since changes to the base classes can quickly break the patches.
  2. Use Docling outputs in downstream pipelines like Docs2KG - fine, but this throws away many of the capabilities already in Docling, such as PictureClassificationData.
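
To make the fragility of alternative 1 concrete, here is a minimal sketch of one way a subclass-based extension can go wrong. `PageItem` below is a simplified stand-in rather than Docling's real class, and `AnnotatedPageItem` and `llm_description` are hypothetical names. It assumes pydantic v2 with its default behaviour of ignoring unknown fields; the real classes may be configured differently.

```python
from typing import Optional

from pydantic import BaseModel


# Simplified stand-in for the library's page model (assumption, not the real class).
class PageItem(BaseModel):
    page_no: int


# Hypothetical user-side extension that bolts on extra metadata.
class AnnotatedPageItem(PageItem):
    llm_description: Optional[str] = None


page = AnnotatedPageItem(page_no=1, llm_description="Title page")
payload = page.model_dump()

# Any code path that reloads the data through the base class drops the
# extra field without raising, because unknown keys are ignored by default.
restored = PageItem.model_validate(payload)
restored_fields = restored.model_dump()
```

The extension works only as long as every consumer knows about the subclass, which is the kind of coupling a first-class metadata field would avoid.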

Proposed Solution

Extend the page and table base classes with a common vocabulary for additional metadata, intentionally designed to be flexible enough to accommodate things like page-level LLM descriptions, named entity identification, etc.

I would suggest doing this for both pages and floating items. That unifies the API and protects the nice, tight pydantic interface, while allowing natural extensions at roughly the same level of abstraction as the existing image field. Metadata would be similar to annotations and might even replace them as a more general abstraction.

e.g.
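from typing import Annotated, List, Optional, Union

from pydantic import BaseModel, Field

# Size, ImageRef, RefItem and DocItem are the existing docling types.
# The Item*Data classes are the proposed metadata models; each carries
# a "kind" field that the discriminated union dispatches on.
MetaDataType = Annotated[
    Union[
        ItemClassificationData,
        ItemDescriptionData,
        ItemEntityData,
        ItemRelationshipData,
    ],
    Field(discriminator="kind"),
]


class PageItem(BaseModel):
    """PageItem."""

    # A page carries separate root items for furniture and body,
    # only referencing items on the page
    size: Size
    image: Optional[ImageRef] = None
    page_no: int
    metadata: Optional[List[MetaDataType]] = None  # proposed addition


class FloatingItem(DocItem):
    """FloatingItem."""

    captions: List[RefItem] = []
    references: List[RefItem] = []
    footnotes: List[RefItem] = []
    image: Optional[ImageRef] = None
    metadata: Optional[List[MetaDataType]] = None  # proposed addition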
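
As a rough, self-contained illustration of how the discriminated union above could behave, the sketch below uses pydantic v2 with stand-in models. The two metadata classes reuse names from the proposal, but their fields (`label`, `confidence`, `text`) and the `kind` values are assumptions, and `PageItem` is trimmed down to the fields needed for the demo.

```python
from typing import Annotated, List, Literal, Optional, Union

from pydantic import BaseModel, Field


# Hypothetical metadata payloads: names follow the proposal, fields are assumed.
class ItemClassificationData(BaseModel):
    kind: Literal["classification"] = "classification"
    label: str
    confidence: float


class ItemDescriptionData(BaseModel):
    kind: Literal["description"] = "description"
    text: str


# The "kind" field tells pydantic which model to build for each entry.
MetaDataType = Annotated[
    Union[ItemClassificationData, ItemDescriptionData],
    Field(discriminator="kind"),
]


# Simplified stand-in for PageItem (size and image omitted for brevity).
class PageItem(BaseModel):
    page_no: int
    metadata: Optional[List[MetaDataType]] = None


raw = {
    "page_no": 1,
    "metadata": [
        {"kind": "description", "text": "Title page with abstract."},
        {"kind": "classification", "label": "title_page", "confidence": 0.93},
    ],
}

page = PageItem.model_validate(raw)

# Each entry is parsed into its concrete type based on "kind".
kinds = [type(m).__name__ for m in page.metadata]

# The typed metadata survives a JSON round trip unchanged.
round_tripped = PageItem.model_validate_json(page.model_dump_json())
```

Because the union is discriminated, adding a new metadata type would only mean adding one more model with its own `kind` literal, and existing serialized documents would keep validating.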

@ScottHMcKean added the enhancement (New feature or request) label Feb 18, 2025