-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some basic questions about binsparse specification #55
Comments
|
This would require some kind of binary operator to apply to the duplicate entries, such as "plus" or "second" (the latter would mean to take the last entry for any duplicates). But that makes a lot of assumptions about what kind of data types and operators we would need to support. With user-defined data types in the future, this gets very complicated. GraphBLAS can handle duplicates but it takes a lot of work in terms of the spec and the implementation to get this right. Since binsparse is a file format, not a computational platform like GraphBLAS or the sparse BLAS, we think this kind of duplicate removal should be left to the application, not the file format. I don't think Matrix Market allows for duplicates either (the spec doesn't mention anything about them).
Yes, for the next version of binsparse. We are looking at allowing any user-defined data type, which could be defined by the end user to be any fixed-size type they like. So even if bfloat16 or TF32 are not in the binsparse spec, they could be added by defining them at the end-user level. |
I think for a lot of these questions, there's not necessarily a solid reason for going with one choice over the other. We were instead motivated by having a spec. that was short, opinionated, and relatively easy to implement. I think it could be expanded to handle a lot of these scenarios in the future.
As Willow mentioned, I think we could support this by adding a switch if there's a big demand. The current state does require parsers for 1-based environments to perform an offset (as Matrix Market currently requires of parsers for 0-based environments).
I think this is a really interesting problem. In terms of actually writing a spec., though, I think we need more implementation experience. Currently none of us are working on distributed parsers, although I have in the past. In my view this would be best served by having a bunch of Binsparse files and then some descriptor that describes how they glue together to form a distributed matrix. This could either be a ScaLAPACK-style tiling or an arbitrary offset tiling like Willow mentions.
Yeah, I've recently found myself using
This was added by Willow---I think in the TACO/Julia world there's always some implied fill value, so this allows those libraries to write to disk without losing that fill information. A GraphBLAS reader that doesn't support a fill value would ignore this, though. (And none of the SuiteSparse matrices have fill values.)
This is a good thing to think about. You definitely see views like this in binary, so it makes sense you might want to write them to disk in a zero-copy way.
I agree with Willow---I think this would be pretty simple to add.
We could allow different varieties of sorted/unsortedness, but we wanted to keep things simple to start with. Currently this does require that a user (or parser, depending on the parser semantics) sort matrices before writing, if they are unsorted. I think we could add this if there's demand, but we might also want to do it in a modular way that allows describing sorted-ness for multiple formats.
Again this is just to keep things simple. We could allow duplicates with a switch (similar to sorted/unsorted-ness), but at the cost of more complexity for parsers. I think we could add this if there's big demand. A nice parser might allow you to write matrices with duplicates, de-duplicating them for you, but the Binsparse spec. itself is purely about the on-disk format, so it's up to the parser implementer.
Absolutely. I would love to do this, and it would make Pradeep happy. At the moment it seems a bit janky to write alternate floating point formats with HDF5, but there's nothing stopping us adding them to the spec., I believe.
I agree it could cause confusion. However, there's a long legacy of using "structure" to mean precisely this in Matrix Market. (Matrix Market supports |
Hi all,
I was just reading through the specification and had several comments and question to provide. Some are related and others are not, we can split up into separate questions and discussions ... I will go back through and try to link to places in the spec repo as well
isn't much different than the always 0 index version:
What might it look like to have a distributed range of matrix stored in separate files on separate systems possibly ? Is there a way to encode that this file contains CSR data for rows between 90 and 180 or a larger matrix with nRows 10000 … ?
number_of_stored_values is semantically cleaner than number_of_non_zeros or nnz, but is there a shortened version of the acronym?
"Fill" is fascinating, so it allows you to store a matrix like
very efficiently eh :)
Do DMATR/DMATC take a leading dimension taht is different from the nrows/ncols? Or does it allow to introduce a leading dimension when reading it in from file ?
Would you ever support reading in or writing out a CSR4 representation ? Is it possible as a custom format? CSR4 technically allows for rearranging rows simply or using submatrices … honestly it isn't the most important format, but does show up in some libraries impls … you have two pointers for start and end instead of the one … and each are length nrows instead of nrows+1 …
I am fascinated by the idea of having COOC format as different from COOR (aka COO), but it makes sense :) do you allow unsorted at all for COO ? I don't want you to have to implement matrix sorting in an impl, but is it something to consider, and COOR vs COOC is a nice way to have unique sorted state clear whereas COO doesn't have one …
Why are we requiring that "elements must not be duplicated, even for COO, it says "Pairs must not be duplicated" … will you provide a functionality to compress duplication in some way ?
Data Types: have you considered adding bfloat16 or TF32 as supported real types ?
The word "structure" in sparse contexts often means something a little differently to me than the pattern of symmetric_lower or skew_symmetric_upper … to me it refers to everything but the values … would you consider changing the name "structure" to "pattern" in the specification? I was thinking the structure section was going to be about storing a matrix without values, but I was mistaken :) as in "structure_only" …
On second thought, I suppose it isn't the worst name :D, so keeping it as is, is fine, but it is just slightly unexpected to me… is there a way to mark there are no corresponding values at all ?
Best Regards,
Spencer
The text was updated successfully, but these errors were encountered: