Seeking on Async FS is bugged / not-working #1772
We should fix this - async/streaming files should not be seekable at all.
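For comparison, this is the usual io convention for a stream that is not seekable at all (a generic sketch, not fsspec code): seekable() reports False and seek() raises, so callers fail fast instead of silently reading from the wrong position.

```python
import io

class StreamingOnlyFile(io.RawIOBase):
    """Generic sketch of an unseekable stream (hypothetical, not fsspec code)."""

    def seekable(self) -> bool:
        # Advertise that random access is not supported.
        return False

    def seek(self, pos, whence=io.SEEK_SET):
        # Refuse loudly rather than pretending the position moved.
        raise io.UnsupportedOperation("seek")
```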
Hmm... is "asynchronous" flag on the filesystem meant to mean 1:1 with streaming? That feels like an over-specification / a mismatch from what I would expect when trying to use an async interface. My goal when using async is to prevent IO bound operations from blocking. (I'm trying to read multiple files (from multiple urls / s3 keys) at once). Is there another way I can achieve this without going down the "streaming" path? This is sort of what I originally thought I could do (just open with
but this gets an error
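Roughly the pattern I had in mind, as a sketch (not my exact code; it assumes an HTTP filesystem created with asynchronous=True):

```python
import asyncio
import fsspec

async def read_one(fs, url):
    # What felt natural to me: a seekable file per URL, used from a coroutine.
    # This is the call that errors on an asynchronous filesystem instance.
    with fs.open(url, mode="rb") as f:
        f.seek(1024)
        return f.read(4096)

async def main(urls):
    fs = fsspec.filesystem("http", asynchronous=True)
    # Read several files/ranges at once without blocking each other.
    return await asyncio.gather(*(read_one(fs, u) for u in urls))
```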
Digging around, I found the
No, that specifies the intent on whether the filesystem instance is meant to be used from async code (i.e., its methods called as coroutines inside a running event loop). However, the sync file-like from open() doesn't make sense in an async context, but it's probably a bad idea to overload open() to produce the streaming async variant when asynchronous=True.
The sync file-like object does, of course, call down into the async code, so it is possible to get a true async, random-access file, but this is not exposed. We don't know what the API should look like, since IOBase is certainly sync; furthermore, a file-like is stateful (the file position), so races are very possible.
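As a sketch of what you can already do today (not a stable file API; the underscore-prefixed coroutines are the async methods the sync wrappers call into, and exact signatures may vary between versions), random-access reads can be awaited on the filesystem itself, with no stateful file object involved:

```python
import fsspec

async def fetch_range(url: str, start: int, end: int) -> bytes:
    fs = fsspec.filesystem("http", asynchronous=True)
    # One byte-range request, awaited directly; no file object and no stored
    # position, hence none of the seek/race concerns above.
    return await fs._cat_file(url, start=start, end=end)
```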
Here's a version where I don't set the
This runs, but outputs:
By specifying the loop, I avoid the coroutine error (which I believe is due to fsspec creating its own event loop if one is not specified). Note that all the reads are identical (ignoring seek) and also not the full size they are supposed to be (16126 bytes). In the code above, if I remove the async behavior from fsspec, things instead start to block (the underlying sync -> async wrapping doesn't prevent the sync calls from blocking each other).

The calls block each other and there are no parallel requests happening. Am I missing something obvious about the API / how to use fsspec?
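For reference, the kind of parallelism I'm after can be expressed with the filesystem's coroutine methods instead of a file object (a sketch only; it skips the block-cache behaviour I want from cache_type='background', and the underscore methods are internal):

```python
import asyncio
import fsspec

async def read_ranges(url: str, offsets: list[int], length: int) -> list[bytes]:
    fs = fsspec.filesystem("http", asynchronous=True)
    # Each range request is awaited concurrently, so the reads do not block
    # one another the way the sync file-like calls do.
    return await asyncio.gather(
        *(fs._cat_file(url, start=o, end=o + length) for o in offsets)
    )
```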
Based on your comment:
I decided to try to see if I could get the behavior I wanted by digging in a bit more. I got something that now works, but it took a lot of monkey-patching. Specifically, I ended up monkey-patching fsspec.spec.AbstractBufferedFile (adding read_async), fsspec.caching.BackgroundBlockCache (adding _fetch_block_async and _fetch_async), and fsspec.caching.UpdatableLRU (adding acall).
I also created a cleanup method (_cleanup_async_future) for the background fetch future, and, to side-step the sync open() path, I construct the HTTPFile directly with my own aiohttp session. Example code here:

import fsspec
import asyncio
import random
import aiohttp
import logging
logger = logging.getLogger("fsspec")
async def read_async(self, length=-1):
    """
    Return data from cache, or fetch pieces as necessary
    Parameters
    ----------
    length: int (-1)
        Number of bytes to read; if <0, all remaining bytes.
    """
    length = -1 if length is None else int(length)
    if self.mode != "rb":
        raise ValueError("File not in read mode")
    if length < 0:
        length = self.size - self.loc
    if self.closed:
        raise ValueError("I/O operation on closed file.")
    if length == 0:
        # don't even bother calling fetch
        return b""
    out = await self.cache._fetch_async(self.loc, self.loc + length)
    logger.debug(
        "%s read: %i - %i %s",
        self,
        self.loc,
        self.loc + length,
        self.cache._log_stats(),
    )
    self.loc += len(out)
    return out

fsspec.spec.AbstractBufferedFile.read_async = read_async
async def _fetch_block_async(self, block_number: int, log_info: str = "sync") -> bytes:
    """
    Fetch the block of data for `block_number`.
    """
    if block_number > self.nblocks:
        raise ValueError(
            f"'block_number={block_number}' is greater than "
            f"the number of blocks ({self.nblocks})"
        )
    start = block_number * self.blocksize
    end = start + self.blocksize
    logger.info("BlockCache fetching block (%s) %d", log_info, block_number)
    self.total_requested_bytes += end - start
    self.miss_count += 1
    async_fetcher = self.fetcher.__self__.async_fetch_range
    block_contents = await async_fetcher(start, end)
    return block_contents

fsspec.caching.BackgroundBlockCache._fetch_block_async = _fetch_block_async
async def acall(self, asyncfunc, *args, **kwargs):
    if kwargs:
        raise TypeError(f"Got unexpected keyword argument {kwargs.keys()}")
    with self._lock:
        if args in self._cache:
            self._cache.move_to_end(args)
            self._hits += 1
            return self._cache[args]
    result = await asyncfunc(*args, **kwargs)
    with self._lock:
        self._cache[args] = result
        self._misses += 1
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)
    return result

fsspec.caching.UpdatableLRU.acall = acall
async def _fetch_async(self, start: int, end: int) -> bytes:
    if start is None:
        start = 0
    if end is None:
        end = self.size
    if start >= self.size or start >= end:
        return b""
    # byte position -> block numbers
    start_block_number = start // self.blocksize
    end_block_number = end // self.blocksize

    fetch_future_block_number = None
    fetch_future = None
    with self._fetch_future_lock:
        # Background thread is running. Check whether we can or must join it.
        if self._fetch_future is not None:
            assert self._fetch_future_block_number is not None
            if self._fetch_future.done():
                logger.info("BlockCache joined background fetch without waiting.")
                self._fetch_block_cached.add_key(
                    await self._fetch_future, self._fetch_future_block_number
                )
                # Cleanup the fetch variables. Done with fetching the block.
                self._fetch_future_block_number = None
                self._fetch_future = None
            else:
                # Must join if we need the block for the current fetch
                must_join = bool(
                    start_block_number
                    <= self._fetch_future_block_number
                    <= end_block_number
                )
                if must_join:
                    # Copy to the local variables to release lock
                    # before waiting for result
                    fetch_future_block_number = self._fetch_future_block_number
                    fetch_future = self._fetch_future
                    # Cleanup the fetch variables. Have a local copy.
                    self._fetch_future_block_number = None
                    self._fetch_future = None

    # Need to wait for the future for the current read
    if fetch_future is not None:
        logger.info("BlockCache waiting for background fetch.")
        # Wait until result and put it in cache
        self._fetch_block_cached.add_key(
            await fetch_future, fetch_future_block_number
        )

    # these are cached, so safe to do multiple calls for the same start and end.
    for block_number in range(start_block_number, end_block_number + 1):
        # self._fetch_block_cached(block_number)
        await self._fetch_block_cached.acall(self._fetch_block_async, block_number)

    # fetch next block in the background if nothing is running in the background,
    # the block is within file and it is not already cached
    end_block_plus_1 = end_block_number + 1
    with self._fetch_future_lock:
        if (
            self._fetch_future is None
            and end_block_plus_1 <= self.nblocks
            and not self._fetch_block_cached.is_key_cached(end_block_plus_1)
        ):
            self._fetch_future_block_number = end_block_plus_1
            self._fetch_future = asyncio.ensure_future(
                self._fetch_block_async(end_block_plus_1, "async")
            )

    return self._read_cache(
        start,
        end,
        start_block_number=start_block_number,
        end_block_number=end_block_number,
    )

fsspec.caching.BackgroundBlockCache._fetch_async = _fetch_async
async def _cleanup_async_future(self):
    with self._fetch_future_lock:
        if self._fetch_future is not None:
            self._fetch_future.cancel()
            self._fetch_future = None

fsspec.caching.BackgroundBlockCache._cleanup_async_future = _cleanup_async_future
async def read_bytes(url, start, end):
    my_id = random.randint(0, 100000)
    print(f"[{my_id:05}] Starting")
    print(f"[{my_id:05}] Starting read from {start} to {end}")
    try:
        async with aiohttp.ClientSession() as session:
            fs = fsspec.filesystem("http")
            async with session.get(url, headers={"Range": "bytes=0-0"}) as response:
                size = int(response.headers.get("Content-Range", "bytes 0-0/0").split("/")[-1])
            f = fsspec.implementations.http.HTTPFile(
                fs,
                url,
                session=session,
                size=size,
                mode="rb",
                block_size=1024*1024*50,
                cache_type='background'
            )
            f.seek(start)
            data = await f.read_async(end - start)
            print(f"[{my_id:05}] Read {len(data)} bytes")
            await f.cache._cleanup_async_future()
            return data
    finally:
        print(f"[{my_id:05}] Done")

async def batch_read():
    url = "https://ash-speed.hetzner.com/1GB.bin"
    offsets = [random.randint(0, 1024 * 1024 * 1024) for _ in range(5)]
    tasks = [read_bytes(url, o, o + 50 * 1024 * 1024) for o in offsets]
    results = await asyncio.gather(*tasks)
    print("Results:", [r[:15] for r in results])
    print([len(r) for r in results])

if __name__ == '__main__':
    asyncio.run(batch_read())

In terms of APIs, I think the most confusing thing (that this thread highlighted) is that the asynchronous flag and the streaming file behaviour end up coupled together.
You are quite right, there are two distinct things going on: whether the filesystem instance itself is asynchronous (its methods are coroutines to be awaited from a running loop), and whether the file-like object it hands out is a streaming/async one rather than a random-access sync one.

It is possible that you would want both, which is what open_async currently does, but each is independently useful in some cases. There are some implementations of async file-like objects (aiofiles and such), but nothing standard. You couldn't pass this thing to anything expecting a standard IOBase object. Is a file an iterator of chunks, or lines, or something else? Can multiple coroutines wait on the same file? There is exactly one sync streaming file, in HTTPFileSystem, for the case that the size cannot be determined and/or byte-range requests are not allowed.
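For discussion only, here is a rough sketch of what such an object might look like (purely hypothetical, not an fsspec API): a random-access async "file" built on the filesystem's range-read coroutine, with a lock so the stateful position cannot race between coroutines.

```python
import asyncio

class AsyncRangeFile:
    """Hypothetical sketch, not part of fsspec."""

    def __init__(self, fs, path: str, size: int):
        self.fs, self.path, self.size = fs, path, size
        self.loc = 0
        self._lock = asyncio.Lock()  # serialise seek/read across coroutines

    async def seek(self, loc: int) -> int:
        async with self._lock:
            self.loc = loc
            return self.loc

    async def read(self, length: int = -1) -> bytes:
        async with self._lock:
            end = self.size if length < 0 else min(self.size, self.loc + length)
            out = await self.fs._cat_file(self.path, start=self.loc, end=end)
            self.loc += len(out)
            return out
```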
Here's a minimal example:
This outputs
Note that the async version respects seek and tell, and even updates the .loc after a read, so .tell() correctly reports the .loc, but the actual bytes that .read() operates on are wrong. The document starts with <!doctype, so we can see that the two .read() calls are just reading sequentially, and the seek operation in the async implementation is not affecting the returned bytes (despite updating the .loc).

I actually originally found this via an s3 filesystem with cache_type='background', but as I removed things I eventually got all the way down to pure http and found it still is not working.
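The minimal example itself is not shown above; the following is only a sketch of the kind of reproduction being described, assuming it used open_async on the http filesystem (details may differ from the original):

```python
import asyncio
import fsspec

URL = "https://example.com/"  # placeholder URL serving an HTML document

async def main():
    fs = fsspec.filesystem("http", asynchronous=True)
    f = await fs.open_async(URL)    # async streaming file
    f.seek(100)                     # .loc is updated ...
    first = await f.read(20)        # ... but bytes still come from the start
    second = await f.read(20)       # and reads continue sequentially
    print(f.tell(), first, second)  # first begins with b"<!doctype" for an HTML page
    await f.close()

asyncio.run(main())
```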