Requested feature
There is a lot of research coming out on document hierarchy analysis and graph network analysis of documents (e.g. DocHieNet - https://aclanthology.org/2024.emnlp-main.65.pdf and Docs2KG - https://arxiv.org/html/2406.02962v1). IMO, Docling is the best-positioned library to implement some of these ideas at scale, but it needs a bit more support for page and table metadata, as well as hierarchy. Trying to implement some of this today seems to require a lot of extending and monkey patching.
For a good example, see here: https://github.com/AI4WA/Docs2KG/blob/develop/docs/Tutorial/2.HowToUseDocs2KGPackage.md
There, Docling is completely abandoned once the conversion is done.
Alternatives
Extend and monkey-patch the existing implementation: generally unstable, as changes to the base classes can quickly break the patches.
Use Docling outputs in downstream pipelines like Docs2KG: fine, but it throws away many of the capabilities already in Docling, like the PictureClassificationData etc.
Proposed Solution
Extend the page and table base classes with a common vocabulary for additional metadata, intentionally designed to be flexible enough to accommodate things like page-level LLM descriptions, named entity identification, etc.
Would suggest doing this for both pages and floating items to unify the API. This protects the nice, tight pydantic interface while allowing natural extensions, with the new field added at roughly the same level of abstraction as the image capture. Metadata would be similar to annotations and might even replace them as a more general abstraction.
e.g.

    MetaDataType = Annotated[
        Union[
            ItemClassificationData,
            ItemDescriptionData,
            ItemEntityData,
            ItemRelationshipData,
        ],
        Field(discriminator="kind"),
    ]


    class PageItem(BaseModel):
        """PageItem."""

        # A page carries separate root items for furniture and body,
        # only referencing items on the page
        size: Size
        image: Optional[ImageRef] = None
        page_no: int
        metadata: Optional[List[MetaDataType]] = None


    class FloatingItem(DocItem):
        """FloatingItem."""
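For illustration, here is a minimal, self-contained sketch of how a `metadata` field discriminated on `kind` could behave, assuming pydantic v2. The proposal only names the four `Item*Data` classes, so their fields and the `kind` values below are assumptions made for the example, and `Size` and `PageItem` are trimmed-down stand-ins rather than Docling's actual classes.

```python
from typing import Annotated, List, Literal, Optional, Union

from pydantic import BaseModel, Field, ValidationError


# Hypothetical payloads: only the class names come from the proposal;
# every field and every "kind" value here is an illustrative assumption.
class ItemClassificationData(BaseModel):
    kind: Literal["classification"] = "classification"
    label: str
    confidence: float


class ItemDescriptionData(BaseModel):
    kind: Literal["description"] = "description"
    text: str


class ItemEntityData(BaseModel):
    kind: Literal["entity"] = "entity"
    text: str
    entity_type: str


class ItemRelationshipData(BaseModel):
    kind: Literal["relationship"] = "relationship"
    source: str
    target: str
    relation: str


# pydantic reads "kind" to pick which member of the union to validate against.
MetaDataType = Annotated[
    Union[
        ItemClassificationData,
        ItemDescriptionData,
        ItemEntityData,
        ItemRelationshipData,
    ],
    Field(discriminator="kind"),
]


class Size(BaseModel):
    """Stand-in for Docling's Size type (assumption)."""

    width: float = 0.0
    height: float = 0.0


class PageItem(BaseModel):
    """Simplified stand-in for the proposed PageItem (image field omitted)."""

    size: Size
    page_no: int
    metadata: Optional[List[MetaDataType]] = None


raw = {
    "size": {"width": 612.0, "height": 792.0},
    "page_no": 1,
    "metadata": [
        {"kind": "description", "text": "Title page with abstract."},
        {"kind": "entity", "text": "Docling", "entity_type": "SOFTWARE"},
    ],
}

# Each metadata dict is parsed into the class matching its "kind".
page = PageItem.model_validate(raw)
kinds = [type(m).__name__ for m in page.metadata]

# An unknown "kind" is rejected instead of being silently accepted.
try:
    PageItem.model_validate({**raw, "metadata": [{"kind": "unknown"}]})
    rejected_unknown = False
except ValidationError:
    rejected_unknown = True
```

The point of the discriminator is that new metadata kinds can be added by extending the union, while serialized documents still round-trip into typed objects rather than free-form dicts.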