Convert raising 'xlsx' type is not supported. #1037

Open
MatheusAbdias opened this issue Feb 22, 2025 · 2 comments
Labels
bug Something isn't working mimetype

Comments

@MatheusAbdias
Contributor

Bug

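The _guess_format method in the _DocumentConversionInput class incorrectly identifies XLSX files as "application/zip". An XLSX file is a ZIP container, so content sniffing that stops at the ZIP signature reports the generic archive type instead of the spreadsheet type.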

Steps to Reproduce

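  1. Create a DocumentConverter restricted to XLSX input and convert a file (path points to any .xlsx file):
from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter
converter = DocumentConverter(allowed_formats=[InputFormat.XLSX])
result = converter.convert(path)
result.document.save_as_markdown(Path("./output.md"))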

The _guess_format method in _DocumentConversionInput returns "application/zip" for this file:

class _DocumentConversionInput:
    def _guess_format(self, obj: Path | DocumentStream) -> InputFormat | None: ...

Inside _guess_format, the call to filetype.guess_mime returns application/zip.
...
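To illustrate why a signature-only check lands on application/zip, and one way to narrow it down, here is a minimal sketch that looks inside the archive for the members an XLSX workbook normally contains. The helper name refine_zip_mime is hypothetical and is not part of Docling or filetype; the check on "[Content_Types].xml" and "xl/workbook.xml" is an assumption about typical XLSX layout, not Docling's actual logic.

```python
import io
import zipfile

XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def refine_zip_mime(data: bytes) -> str:
    """Hypothetical helper: return a more specific MIME type for ZIP payloads."""
    # Anything that is not a readable ZIP archive is reported as opaque bytes.
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "application/octet-stream"

    # List the archive members without extracting them.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())

    # Assumption: an XLSX workbook carries both of these members.
    if "[Content_Types].xml" in names and "xl/workbook.xml" in names:
        return XLSX_MIME

    # A plain archive keeps the generic type.
    return "application/zip"
```

A caller could apply this only when the first guess is application/zip, leaving every other detected type untouched.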

Docling version

Docling version: 2.24.0
Docling Core version: 2.20.0
Docling IBM Models version: 3.4.0
Docling Parse version: 3.4.0
Python: cpython-312 (3.12.8)
Platform: Linux-6.6.75-2-MANJARO-x86_64-with-glibc2.41
...

Python version

Python 3.12.8
...

@MatheusAbdias MatheusAbdias added the bug Something isn't working label Feb 22, 2025
@MatheusAbdias
Contributor Author

MatheusAbdias commented Feb 23, 2025

While investigating this issue, I found the python-magic library, which correctly identifies the XLSX file type. I tested it with the same file and it returns the proper MIME type. However, this solution requires installing libmagic as a system dependency. Would it be acceptable to add this dependency to the project?

@cau-git
Contributor

cau-git commented Feb 25, 2025

@MatheusAbdias Thanks for your suggestion. This comes back to #802, where we track this issue more broadly.

We want to avoid working with libmagic because it is GPL-licensed, so we cannot distribute it. More generally, we try to avoid system library dependencies that we cannot bundle.
