Optimized spill file format #14078
Comments
As a data point, @totoroyyb reports 100x faster reading of Arrow IPC data without validation on apache/arrow-rs#6933
Although we're currently spilling column-wise record batches, I think this will change to row-wise batches in the future. It would be better to benchmark and optimize spilling the Arrow Row format in this issue as well. The reason is that the spilling operation involves sorting, spilling sorted runs, and reading back those runs for merging; both sorting and merging benefit from the row format. The current implementation performs several unnecessary conversions between row and column formats, which could become inefficient.
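To make the conversion-avoidance point concrete, here is a minimal sketch (not DataFusion's actual spill path) using arrow-rs's `arrow::row` API: rows compare as plain bytes, so the column/row conversion happens only at the boundaries rather than per spill round-trip.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use arrow::row::{RowConverter, SortField};

fn main() -> Result<(), ArrowError> {
    // Column -> Row: done once, before sorting/spilling.
    let converter = RowConverter::new(vec![SortField::new(DataType::Int64)])?;
    let col: ArrayRef = Arc::new(Int64Array::from(vec![3, 1, 2]));
    let rows = converter.convert_columns(&[col])?;

    // Rows compare byte-wise, so sorting (and k-way merging of spilled
    // sorted runs) needs no per-row decoding back to columns.
    let mut sorted: Vec<_> = rows.iter().collect();
    sorted.sort_unstable();

    // Row -> Column: done once, after the final merge.
    let _columns = converter.convert_rows(sorted)?;
    Ok(())
}
```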
Spilling the row format makes some sense to me, although I suspect IPC will outperform it, presuming a fast enough disk. I should point out, though, that for it to be sound to read a file without validation, DF needs to be sure nobody else could have written or modified it. This may be possible on Unix OSes using some shenanigans with unlinked file descriptors, but I suspect it isn't generally possible. I should also point out that the mmap use case is slightly different: there the data is effectively in memory already, so validation dominates the cost; the performance benefit of skipping validation may be lessened when there are other overheads, e.g. reading the data from disk.
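For reference, a minimal sketch of the unlinked-descriptor idea on Unix, here via the `tempfile` crate (whose `tempfile()` creates and immediately unlinks the file, so no other process can reach it by path):

```rust
use std::io::{Read, Seek, SeekFrom, Write};

fn main() -> std::io::Result<()> {
    // tempfile() unlinks the file right after creation on Unix, so only
    // holders of this descriptor can read or modify its contents.
    let mut spill = tempfile::tempfile()?;
    spill.write_all(b"spilled bytes")?;
    spill.seek(SeekFrom::Start(0))?;
    let mut buf = Vec::new();
    spill.read_to_end(&mut buf)?;
    assert_eq!(buf, b"spilled bytes");
    Ok(())
}
```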
In my opinion, DataFusion itself should not go to any significant trouble to protect against the threat model of someone having enough control over the local file system to make arbitrary changes to spill files. If an adversary already has that level of access, the extra safety gained by validating spill files is pretty small.
@2010YOUY01 the idea of avoiding Row->Column->Row conversions is a (very) good one
Agreed, the more reasonable scenario would be a bug in DF that causes it to trample its own files, perhaps some sort of concurrency bug where it opens the spill file before it finishes writing it. To be clear, I am not making a judgement on whether DF should care about this; someone would need to collect concrete benchmarks of what portion of the cost of spilling is actually validation as a starting point. I am merely stating that what is being proposed is unsound.
It would be nice to have the option (likely not enabled by default) to compress spill files. I think it's almost trivial with the current implementation.
We support this in Comet with the Arrow IPC Writer and it is trivial. We support LZ4, Snappy, and ZSTD. |
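For illustration, a minimal sketch of writing a compressed IPC file with arrow-rs (the IPC format itself defines LZ4_FRAME and ZSTD codecs; ZSTD needs the arrow `ipc_compression` feature enabled). The file path is made up:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::{FileWriter, IpcWriteOptions};
use arrow::ipc::CompressionType;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;

    // Each encoded buffer is compressed; readers detect the codec from the
    // IPC message metadata and decompress transparently.
    let options =
        IpcWriteOptions::default().try_with_compression(Some(CompressionType::ZSTD))?;
    let file = File::create("/tmp/spill.arrow")?; // illustrative path
    let mut writer = FileWriter::try_new_with_options(file, &schema, options)?;
    writer.write(&batch)?;
    writer.finish()?;
    Ok(())
}
```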
FYI I am working with @totoroyyb on the Arrow IPC work, in case anyone is interested or has time to help:
Is your feature request related to a problem or challenge?
DataFusion spills data to local disk for processing datasets that do not fit in available memory, as illustrated in this comment:
See datafusion/datafusion/physical-plan/src/sorts/sort.rs, lines 88 to 205 (at 918ac1b): the code that handles spilling in sort and hash aggregates.
The current version of DataFusion spills data to disk using the Arrow IPC format, which is correct and was easy to get working, but comes with non-trivial overhead, as @andygrove found in Comet:
One potential source of overhead is the validation applied when reading Arrow IPC files (which, in general, may come from untrusted sources); this validation is unnecessary if the data was valid when DataFusion wrote it.
Describe the solution you'd like
As part of improving DataFusion's performance on larger-than-memory datasets, I would like to consider adding a new optimized serialization format for use in spill files. This is similar to how we have added optimized (non-Arrow) in-memory storage formats for intermediate results in hash aggregation and elsewhere.
At a high level this would look like:
Describe alternatives you've considered
Add a customized Reader/Writer
@andygrove has a PR to add a customized BatchReader to Comet that seems to offer significant performance improvements for shuffle reading/writing:
We could potentially upstream this code into DataFusion
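As a strawman for what an upstreamed, pluggable spill format could look like (the names here are entirely hypothetical, not Comet's or DataFusion's actual API), the writer side might hide behind a small trait so IPC and an optimized format can coexist:

```rust
use arrow::record_batch::RecordBatch;
use datafusion_common::Result;

/// Hypothetical pluggable spill serializer. A default implementation would
/// keep today's Arrow IPC behavior; an optimized one could skip validation
/// and redundant conversions for trusted, self-written files.
trait SpillWriter: Send {
    /// Append one batch to the spill file.
    fn write_batch(&mut self, batch: &RecordBatch) -> Result<()>;

    /// Flush and finalize the file, returning the number of bytes written.
    fn finish(&mut self) -> Result<u64>;
}
```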
Optimize the Arrow IPC reader
Another option would be to continue using the Arrow IPC format, but disable validation:
@totoroyyb actually has a recent PR to arrow-rs proposing this approach:
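A rough sketch of how that could look from DataFusion's side, assuming the opt-out lands roughly as proposed there (an unsafe `with_skip_validation` toggle; the exact name and shape of the merged API may differ):

```rust
use std::fs::File;

use arrow::ipc::reader::FileReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("/tmp/spill.arrow")?; // illustrative path
    // SAFETY: skipping validation is only sound if this process wrote the
    // file and nothing else could have modified it since.
    // `with_skip_validation` is the knob proposed in apache/arrow-rs#6933.
    let builder = unsafe { FileReaderBuilder::new().with_skip_validation(true) };
    let reader = builder.build(file)?;
    for batch in reader {
        let _batch = batch?; // decoded without per-buffer validation
    }
    Ok(())
}
```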
Additional context
No response