
Calling HTTPFileSystem.get on large files looks problematic #1766

Open · Koncopd opened this issue Dec 30, 2024 · 4 comments

@Koncopd commented Dec 30, 2024

The problem is that when HTTPFileSystem.get is called, it first checks whether the path is a directory here; to make that check, it downloads the whole body of the file here, and then streams the file again for the actual download.
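For illustration, a minimal call that goes through this code path (the URL is a placeholder):

```python
import fsspec

fs = fsspec.filesystem("http")

# get() first asks whether the remote path is a directory; per the
# report above, that check reads the entire response body, after
# which the file is streamed a second time for the actual download.
fs.get("https://example.com/large-file.bin", "large-file.bin")
```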

@martindurant (Member)

If you know for sure it is a file, you can always use get_file() instead, or pass a one-item list with recursive=False to get(); I think that should work. Regardless, you make a good point that checking for directory-ness like this can be pathological: the .read() should be limited to a reasonable size, and perhaps only attempted on responses that claim to be HTML or at least text-like.

Does this sound like something you'd like to fix?
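A sketch of the two workarounds described above, assuming an fsspec HTTP filesystem (the URL and local paths are placeholders):

```python
import fsspec

fs = fsspec.filesystem("http")
url = "https://example.com/large-file.bin"  # placeholder

# Option 1: copy a single known file, skipping the directory check.
fs.get_file(url, "large-file.bin")

# Option 2: a one-item list with recursive=False should avoid the listing.
fs.get([url], "./", recursive=False)
```

And a hypothetical sketch of the bounded directory check suggested here; the helper name, size cap, and heuristic are illustrative, not fsspec's actual code:

```python
async def looks_like_listing(session, url, max_bytes=64 * 1024):
    # `session` is assumed to be an aiohttp.ClientSession.
    async with session.get(url) as r:
        ctype = r.headers.get("Content-Type", "")
        if not ctype.startswith("text/"):
            return False  # binary payloads cannot be HTML listings
        body = await r.content.read(max_bytes)  # bounded read, not the whole body
        return b"<a " in body.lower()  # crude link heuristic
```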

@Koncopd (Author) commented Jan 6, 2025

Thank you. get_file() is inconvenient for a number of reasons, but passing a one-item list with recursive=True sounds like a good solution for now.
Yes, I can look into that when I have time.

By the way, I also see a separate problem: the default aiohttp timeout of 5 minutes makes it impossible to download large files.

@martindurant (Member)

Should be recursive=False, I think, to prevent the listing.

> I also see a separate problem: the default aiohttp timeout of 5 minutes makes it impossible to download large files.

If you manage to trigger this, do let me know. You can configure aiohttp via the HTTPFileSystem constructor, of course, but large files should be downloaded in blocks anyway and should not hit the timeout except on problematic connections.
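For reference, a sketch of relaxing the timeout through the constructor; client_kwargs is forwarded to aiohttp.ClientSession, and total=None disables the 5-minute default (adjust to taste):

```python
import aiohttp
import fsspec

# client_kwargs is passed through to aiohttp.ClientSession, so the
# default 5-minute ClientTimeout can be raised or disabled here.
fs = fsspec.filesystem(
    "http",
    client_kwargs={"timeout": aiohttp.ClientTimeout(total=None)},
)
```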

@martindurant (Member)

(also: setting good defaults that work for everyone is hard!)
