Support limiting the frame size via zstd command line tool #2121
@pizzard, you may want to look at the seekable format (spec, code), although that is not a command-line tool. Alternatively, could you use the …
@felixhandte, very valid question; let me describe the layout of our algorithm a bit better.

When reading the file, I mmap it into memory. Then I build a header index table by skipping through the file once and decoding all the frame headers with zstd's header-reading function. Since the compressed size of each frame can be determined from its header, I can simply skip from header to header and read them all in. Because no decompression happens, this pass is very fast.

When reading sequentially, I just decompress the file frame by frame as needed. When a jump to a random position is needed, I use the lookup table to find the right frame (via the accumulated uncompressed sizes), decompress it, and continue from there.

The problem is that when someone uses the command-line tool, it creates one frame with all the contents in it. Then the frame-by-frame loading doesn't work; everything gets loaded at once. My intuition was that the command-line tool does this because it is more space-efficient, but at least on our log files, limiting the frame size to 16 MB counter-intuitively compressed the files better than producing one big frame. This persisted across different binary file layouts and different compression levels. The change was only a few percent, but I'll take it.
I think adding seekable format support via the CLI would be great: essentially, an option to set the seekable format's "Maximum Frame Size" parameter.
Is your feature request related to a problem? Please describe.
We use zstd to compress log files which are later read and replayed. Because the files are quite large (possibly larger than the replay system's available memory when replaying several of them), our writer currently flushes (forces the end of a frame) via the zstd API after reaching a fixed amount of uncompressed input. This partitions our files into frames of 16 MB uncompressed size. The loader can then read the file frame by frame, dropping the data from the previous frame once it is no longer needed. It can also seek to a given point in time in our ordered input by running a binary search over the uncompressed frames: it finds the next location to jump to, locates the frame containing it, decompresses that frame, and so on. At the cost of only a small loss in compression ratio, the memory consumption of random-access jumps is reduced to a fixed amount and the speed is increased drastically.
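The random-access jump described above reduces to finding the frame whose uncompressed range contains a target offset. A minimal sketch of that lookup, assuming an index of per-frame offsets has already been built (the `FrameEntry` struct and `find_frame` name are hypothetical, not part of zstd's API):

```c
#include <stddef.h>

/* Hypothetical index entry: entry i covers the uncompressed range
 * [index[i].ucomp_off, index[i+1].ucomp_off). */
typedef struct {
    size_t comp_off;              /* frame's offset in the compressed file */
    unsigned long long ucomp_off; /* uncompressed offset it starts at      */
} FrameEntry;

/* Binary search: return the index of the frame containing `target`,
 * given `n >= 1` entries sorted by ucomp_off. */
size_t find_frame(const FrameEntry *index, size_t n, unsigned long long target)
{
    size_t lo = 0, hi = n; /* invariant: index[lo].ucomp_off <= target */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (index[mid].ucomp_off <= target) lo = mid;
        else hi = mid;
    }
    return lo;
}
```

Only the frame returned by `find_frame` needs to be decompressed, which is what caps the memory use of a jump at one frame's uncompressed size.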
There is one problem, though. Sometimes data is written in uncompressed form and one wants to compress it afterwards. This is conveniently done with the zstd command-line tool, which unfortunately compresses all the data into one huge frame, rendering my binary search useless.
Describe the solution you'd like
An option for the command-line tool, --max-frame-size=X, which limits the output frame size to X. I don't care whether the limit applies to the compressed or the uncompressed frame size, as I can adjust X accordingly.
Describe alternatives you've considered
We could write our own zstd command-line tool that does this, which we would rather avoid. I tried different tricks with the streaming API, but none of them worked.
Additional context
The zstd API allows forcing the end of a frame; my application uses this to write frames of a certain size by counting the uncompressed input and ending the frame once the threshold is reached.