Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some basic questions about binsparse specification #55

Open
srobertp opened this issue Sep 16, 2024 · 3 comments
Open

Some basic questions about binsparse specification #55

srobertp opened this issue Sep 16, 2024 · 3 comments

Comments

@srobertp
Copy link

Hi all,

I was just reading through the specification and had several comments and question to provide. Some are related and others are not, we can split up into separate questions and discussions ... I will go back through and try to link to places in the spec repo as well

  1. I understand why we use 0 based for C/C++, but the format may be implemented in Fortran where 1-base is normal, right ? Doesn't matlab also use 1-base sometimes? Would it be better to have index be part of the definition ? Or is there a reason you want it always 0-base ? I suppose it is simpler, but using CSR format with ind as 0 or 1 index (or for distributed matrices, possibly as large k>0 value for start of local rows ? )
auto row_st = pointers_to_1[row] - ind;
auto row_en = pointers_to_1[row+1] - ind;
for (int j = row_st; j < row_en; ++j) {
    auto col = indices_1[j] - ind;
    auto val = values[j];
}

isn't much different than the always 0 index version:

auto row_st = pointers_to_1[row];
auto row_en = pointers_to_1[row+1];
for (int j = row_st; j < row_en; ++j) {
    auto col = indices_1[j];
    auto val = values[j];
}
  1. What might it look like to have a distributed range of matrix stored in separate files on separate systems possibly ? Is there a way to encode that this file contains CSR data for rows between 90 and 180 or a larger matrix with nRows 10000 … ?

  2. number_of_stored_values is semantically cleaner than number_of_non_zeros or nnz, but is there a shortened version of the acronym?

  3. "Fill" is fascinating, so it allows you to store a matrix like

[ D       ones ]
[ ones  D      ]

very efficiently eh :)

  1. Do DMATR/DMATC take a leading dimension taht is different from the nrows/ncols? Or does it allow to introduce a leading dimension when reading it in from file ?

  2. Would you ever support reading in or writing out a CSR4 representation ? Is it possible as a custom format? CSR4 technically allows for rearranging rows simply or using submatrices … honestly it isn't the most important format, but does show up in some libraries impls … you have two pointers for start and end instead of the one … and each are length nrows instead of nrows+1 …

auto row_st = pointers_to_1_st[row] - ind;   // start of pointers_to_1
auto row_en = pointers_to_1_en[row] - ind; // end of pointers_to_1
for (int j = row_st; j < row_en; ++j) {
    auto col = indices_1[j] - ind;
    auto val = values[j];
}
  1. I am fascinated by the idea of having COOC format as different from COOR (aka COO), but it makes sense :) do you allow unsorted at all for COO ? I don't want you to have to implement matrix sorting in an impl, but is it something to consider, and COOR vs COOC is a nice way to have unique sorted state clear whereas COO doesn't have one …

  2. Why are we requiring that "elements must not be duplicated, even for COO, it says "Pairs must not be duplicated" … will you provide a functionality to compress duplication in some way ?

  3. Data Types: have you considered adding bfloat16 or TF32 as supported real types ?

  4. The word "structure" in sparse contexts often means something a little differently to me than the pattern of symmetric_lower or skew_symmetric_upper … to me it refers to everything but the values … would you consider changing the name "structure" to "pattern" in the specification? I was thinking the structure section was going to be about storing a matrix without values, but I was mistaken :) as in "structure_only" …
    On second thought, I suppose it isn't the worst name :D, so keeping it as is, is fine, but it is just slightly unexpected to me… is there a way to mark there are no corresponding values at all ?

Best Regards,
Spencer

@willow-ahrens
Copy link
Collaborator

  1. I agree, it would be nice to add 1-based indexing. I use it in Julia. Of course, this makes all of the parsers just a little more complicated if they want to accept a 1-based file. However, it would be useful as an in-memory format for 1-based systems.

  2. We've been asked before whether we could encode matrices that represent the parts in a partition. One solution that was floated was to include a starting offset for the matrix, so we would specify not just the size of the matrix, but also the origin. This would allow us to store a matrix of values in e.g. 5:10 x 3:6. However, many distributed systems use local (starting at 0) coordinates to store the local problems, so in the absense of actual applications we were unsure what the right extensions to the format should be.

  3. a leading dimension would be a great addition. Is there a sparse analogue to the leading dimension? Is CSR4 the sparse analogue?

  4. CSR4 should be a straightforward addition, we could add it as a new level type in the custom formats.

  5. We don't currently allow unsorted COO.

  6. because this is a file format and not a tensor processing framework, we don't typically provide routines to e.g. compress duplicates or reformat matrices. These functionalities are nice to have, but they would be the responsibility of the tensor processing framework which uses binsparse to store tensors.

@DrTimothyAldenDavis
Copy link
Member

Why are we requiring that "elements must not be duplicated, even for COO, it says "Pairs must not be duplicated" … will you provide a functionality to compress duplication in some way ?

This would require some kind of binary operator to apply to the duplicate entries, such as "plus" or "second" (the latter would mean to take the last entry for any duplicates). But that makes a lot of assumptions about what kind of data types and operators we would need to support. With user-defined data types in the future, this gets very complicated. GraphBLAS can handle duplicates but it takes a lot of work in terms of the spec and the implementation to get this right. Since binsparse is a file format, not a computational platform like GraphBLAS or the sparse BLAS, we think this kind of duplicate removal should be left to the application, not the file format. I don't think Matrix Market allows for duplicates either (the spec doesn't mention anything about them).

Data Types: have you considered adding bfloat16 or TF32 as supported real types ?

Yes, for the next version of binsparse. We are looking at allowing any user-defined data type, which could be defined by the end user to be any fixed-size type they like. So even if bfloat16 or TF32 are not in the binsparse spec, they could be added by defining them at the end-user level.

@BenBrock
Copy link
Contributor

BenBrock commented Sep 18, 2024

I think for a lot of these questions, there's not necessarily a solid reason for going with one choice over the other. We were instead motivated by having a spec. that was short, opinionated, and relatively easy to implement. I think it could be expanded to handle a lot of these scenarios in the future.

  1. What about 1-based indices?

As Willow mentioned, I think we could support this by adding a switch if there's a big demand. The current state does require parsers for 1-based environments to perform an offset (as Matrix Market currently requires of parsers for 0-based environments).

  1. Distributed blocked formats?

I think this is a really interesting problem. In terms of actually writing a spec., though, I think we need more implementation experience. Currently none of us are working on distributed parsers, although I have in the past.

In my view this would be best served by having a bunch of Binsparse files and then some descriptor that describes how they glue together to form a distributed matrix. This could either be a ScaLAPACK-style tiling or an arbitrary offset tiling like Willow mentions.

  1. number_of_stored_values is long?

Yeah, I've recently found myself using number_of_entries as well in the paper. We wanted something that meant nnz but without the implied zero. One option might be number_of_entries, although I'm a little ambiguous about changing keyword names at this point. At a certain point they are arbitrary.

  1. fill is interesting

This was added by Willow---I think in the TACO/Julia world there's always some implied fill value, so this allows those libraries to write to disk without losing that fill information. A GraphBLAS reader that doesn't support a fill value would ignore this, though. (And none of the SuiteSparse matrices have fill values.)

  1. Implied leading dimension?

This is a good thing to think about. You definitely see views like this in binary, so it makes sense you might want to write them to disk in a zero-copy way.

  1. CSR4?

I agree with Willow---I think this would be pretty simple to add.

  1. Unsorted COO?

We could allow different varieties of sorted/unsortedness, but we wanted to keep things simple to start with. Currently this does require that a user (or parser, depending on the parser semantics) sort matrices before writing, if they are unsorted. I think we could add this if there's demand, but we might also want to do it in a modular way that allows describing sorted-ness for multiple formats.

  1. Not allowing duplicates?

Again this is just to keep things simple. We could allow duplicates with a switch (similar to sorted/unsorted-ness), but at the cost of more complexity for parsers. I think we could add this if there's big demand. A nice parser might allow you to write matrices with duplicates, de-duplicating them for you, but the Binsparse spec. itself is purely about the on-disk format, so it's up to the parser implementer.

  1. BF16/TF16?

Absolutely. I would love to do this, and it would make Pradeep happy. At the moment it seems a bit janky to write alternate floating point formats with HDF5, but there's nothing stopping us adding them to the spec., I believe.

  1. Structure is a weird name?

I agree it could cause confusion. However, there's a long legacy of using "structure" to mean precisely this in Matrix Market. (Matrix Market supports general, symmetric, hermitian, and skew-symmetric structures, and virtually all parsers use "structure" to refer to this keyword, as defined in the Matrix Market spec.) "Pattern" is, I think, more ripe for confusion: Matrix Market supports a pattern monotype data type, which means there is no data type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants