
Calling HTTPFileSystem.get on large files looks problematic #1766

Open · Koncopd opened this issue Dec 30, 2024 · 4 comments

@Koncopd commented Dec 30, 2024

The problem is that when HTTPFileSystem.get is called, it first checks whether the path is a directory here; to make that check, it downloads the whole body of the file here, and then streams the file again for the actual download.
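For illustration, a minimal call that goes through this code path (the URL is a placeholder):

```python
import fsspec

fs = fsspec.filesystem("http")

# get() first asks whether the remote path is a directory; per the
# report above, that check reads the entire response body, after
# which the file is streamed a second time for the actual download.
fs.get("https://example.com/large-file.bin", "large-file.bin")
```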

@martindurant (Member)

If you know for sure it is a file, you can always use get_file() instead, or pass a one-item list with recursive=False to get(); I think that should work. Regardless, you make a good point that checking for directory-ness like this can be pathological: the .read() should be limited to a reasonable size, and perhaps only attempted on responses that claim to be HTML or at least text-like.

Does this sound like something you'd like to fix?
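A sketch of the two workarounds described above, assuming an fsspec HTTP filesystem (the URL and local paths are placeholders):

```python
import fsspec

fs = fsspec.filesystem("http")
url = "https://example.com/large-file.bin"  # placeholder

# Option 1: copy a single known file, skipping the directory check.
fs.get_file(url, "large-file.bin")

# Option 2: a one-item list with recursive=False should avoid the listing.
fs.get([url], "./", recursive=False)
```

And a hypothetical sketch of the bounded directory check suggested here; the helper name, size cap, and heuristic are illustrative, not fsspec's actual code:

```python
async def looks_like_listing(session, url, max_bytes=64 * 1024):
    # `session` is assumed to be an aiohttp.ClientSession.
    async with session.get(url) as r:
        ctype = r.headers.get("Content-Type", "")
        if not ctype.startswith("text/"):
            return False  # binary payloads cannot be HTML listings
        body = await r.content.read(max_bytes)  # bounded read, not the whole body
        return b"<a " in body.lower()  # crude link heuristic
```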

@Koncopd (Author) commented Jan 6, 2025

Thank you. get_file() is inconvenient for a number of reasons, but passing a one-item list with recursive=True sounds like a good solution for now.
Yes, I can look into that when I have time.

By the way, I also see a separate problem: the default aiohttp timeout of 5 minutes makes it impossible to download large files.

@martindurant (Member)

Should be recursive=False, I think, to prevent the listing.

> I also see a separate problem: the default aiohttp timeout of 5 minutes makes it impossible to download large files.

If you manage to trigger this, do let me know. You can configure aiohttp via the HTTPFileSystem constructor, of course, but large files should be downloaded in blocks anyway and should not hit the timeout except on problematic connections.
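For reference, a sketch of relaxing the timeout through the constructor; client_kwargs is forwarded to aiohttp.ClientSession, and total=None disables the 5-minute default (adjust to taste):

```python
import aiohttp
import fsspec

# client_kwargs is passed through to aiohttp.ClientSession, so the
# default 5-minute ClientTimeout can be raised or disabled here.
fs = fsspec.filesystem(
    "http",
    client_kwargs={"timeout": aiohttp.ClientTimeout(total=None)},
)
```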

@martindurant (Member)

(also: setting good defaults that work for everyone is hard!)
