additional_scores as Record in parquet instead of list of String #52
Comments
Can you elaborate on filtering/ordering? Which scores do you usually use to order by? I only read the last part of Chad Gippity's output but a hybrid approach of main scores (FDR, qvalue) as columns and secondary scores as a list seems reasonable. |
I put some of Chad Gippity's ideas for those new to parquet. I have moved the global values and PEP out as columns because most of today's software produces them, including our pipeline (they can be null if unavailable). However, search engines produce a lot of per-PSM/feature scores that would be good to capture, for example from Comet or MSGF+, and in DIA-NN you have scores like the similarity match. I'm almost 100% sure we need to use one column for all the scores; the question is whether we make it a list of key-value strings or a nested structure of keys and values. The trade-off is that nested types can be incompatible with some platforms, while the list of strings is really slow for filtering and ordering based on scores: every operation on the list has to parse the string first. This is relevant, for example, in the ID workflow developed recently by @daichengxin, where we have the signal-to-noise ratio (bigbio/quantms#410). Imagine you want to sort your PSMs by that score to apply filters or inspect the data: you would need to go inside the string, whereas with nested structures in pyarrow you can at least "unfold" them and sort easily (a small sketch of the string-parsing cost follows below). |
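A minimal sketch of that string-parsing cost, assuming a hypothetical additional_scores column stored as "key:value" strings (column and score names are invented for illustration): to filter on a single score you have to parse every string of every row in Python, losing the columnar advantages of parquet.

import pyarrow as pa

# hypothetical table where every score is packed into a "key:value" string
table = pa.table({
    "psm_name": ["psm1", "psm2"],
    "additional_scores": [["comet:xcorr:0.15", "snr:12.3"], ["comet:xcorr:0.35", "snr:8.1"]],
})

def get_snr(entries):
    # every lookup has to split each string just to reach one numeric score
    for entry in entries:
        key, _, value = entry.rpartition(":")
        if key == "snr":
            return float(value)
    return None

snr_values = [get_snr(row.as_py()) for row in table["additional_scores"]]
keep = [i for i, v in enumerate(snr_values) if v is not None and v > 10]
print(table.take(keep))  # only psm1 has snr > 10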
Maybe you can store in the parquet metadata what each of the scores in the list means. Then you should be able to a) use plain floats and b) jump to an index based on the metadata. It's basically like creating multiple columns but always adhering to the same overall schema. |
I think benchmarking the alternatives on different pipeline variant outputs would be beneficial |
I like the idea of having metadata in the parquet, where we write something like:
Because people may not know the description for one particular term, within the specification we could release a big file with all the "known" terms and their corresponding description and type. In the specification I have already added this idea of metadata at the top of the file for tsv-based formats such as the differential and absolute expression files:
This is similar to other genomics files. |
I actually mean embedding it in the parquet metadata for the "additional_scores" column:

my_schema = pa.schema([
    pa.field("psm_name", "string"),
    pa.field("psm_main_score_1", "float"),
    pa.field("psm_main_score_2", "float"),
    # field metadata values must be strings, so the score names go into one comma-separated string
    pa.field("psm_additional_scores", pa.list_(pa.float32()), metadata={"score_names": "comet:xcorr,msgf:eval"}),
]) |
I think this approach is not desirable because every score becomes a new column, which makes it harder to merge and handle datasets across experiments. For example, if one experiment was analyzed only with Comet and another one with Comet and MSGF+, then in one dataset you will have: Dataset 1:
Dataset 2:
This is challenging to analyze if you want to perform analyses like querying all the PSMs for a given peptide sequence across all the experiments. My approach was more in this direction:
or
|
With main scores I mean scores that should always be there, like a general q-value or an overall combined PEP score. According to your comment:
Additional scores are scores that can differ between programs.

my_schema = pa.schema([
    pa.field("psm_name", "string"),
    pa.field("global_value", "float32"),
    pa.field("PEP", "float32"),
    # score names are kept as a comma-separated string in the field metadata
    pa.field("IdScores", pa.list_(pa.float32()), metadata={"scorenames": "comet:xcorr,msgf:eval"}),
]) |
Interesting. In that representation you can't easily filter all the peptides by a given MSGF+ score or sort by it, because metadata is not easy to incorporate into the query mechanism. For example, how would you filter all PSMs with comet:xcorr higher than a given value? |
This kind of representation is more stable and provides consistent output.
|
import pyarrow as pa
import pyarrow.compute as pc

# create a schema; score names live in the column metadata, values in a fixed-size list of floats
my_schema = pa.schema([
    pa.field("psm_name", "string"),
    pa.field("global_value", "float32"),
    pa.field("PEP", "float32"),
    pa.field("IdScores", type=pa.list_(pa.float32(), 2), metadata={"scorenames": "comet:xcorr, msgf:eval"}),
])
# create a table (types are given explicitly so the arrays match the schema)
my_data = [
    pa.array(["psm1", "psm2", "psm3"]),
    pa.array([1.0, 2.0, 3.0], type=pa.float32()),
    pa.array([0.1, 0.2, 0.3], type=pa.float32()),
    pa.array([[0.1, 0.2], [0.2, 0.3], [0.3, 0.4]], type=pa.list_(pa.float32(), 2)),
]
my_table = pa.table(my_data, schema=my_schema)
print(my_table)
# parse the metadata from the IdScores column, split by comma and map each score name to its list index
metadata = my_table.schema.field("IdScores").metadata
scorenames = [name.strip() for name in metadata[b"scorenames"].decode().split(",")]
scorename_to_index = {name: i for i, name in enumerate(scorenames)}
# filter table for rows where the xcorr score is greater than 0.2
xcorr_index = scorename_to_index["comet:xcorr"]
filter_mask = pc.greater(pc.list_element(my_table["IdScores"], xcorr_index), 0.2)
filtered_table = my_table.filter(filter_mask)
print(filtered_table) |
Maybe pyarrow does something like this under the hood when using structs but I have the feeling the list version with metadata would be faster/smaller than the lists of structs. |
The idea is that this parquet is not only used by pyarrow but also by other technologies like DuckDB or the Rust parquet crate. To decide between the struct and the list-plus-metadata approach, apart from performance it would be good to know how much support each has in other languages. @lazear do you know if Rust parquet has support for these two approaches? @zprobot can you do a simple test on your side comparing your approach and @jpfeuffer's with a big "fake" file and measure the performance for sorting, filtering and lookups? |
The test was conducted on one million rows. @jpfeuffer's idea performs better. |
@zprobot can you check the support of @jpfeuffer's idea in DuckDB and other platforms that use parquet, like pyspark? Thanks a lot for the benchmark. @zprobot I guess the size is also really small because you don't have to repeat the field names? |
So I checked, and pyarrow is not very good at writing parquet-compliant metadata for columns, so it might be a bit of a hassle to extract it with DuckDB SQL only. Alternatives:
I have not tried pyspark |
Also, merging datasets with different scores, for example one with Comet and another with MSGF+, will be a nightmare? I think struct is well known, and in any case we probably will not do hard filtering on IdScores? |
What about using a Map type field then? |
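A minimal sketch of what such a Map-typed column could look like in pyarrow (the column and score names are hypothetical, and pa.map_ is only one possible encoding):

import pyarrow as pa

# map<string, float32>: one key/value entry per search-engine score
map_schema = pa.schema([
    pa.field("psm_name", pa.string()),
    pa.field("additional_scores", pa.map_(pa.string(), pa.float32())),
])

map_table = pa.table(
    {
        "psm_name": ["psm1", "psm2"],
        # map values can be built from lists of (key, value) tuples
        "additional_scores": [
            [("comet:xcorr", 0.15), ("msgf:eval", 0.9)],
            [("comet:xcorr", 0.35)],
        ],
    },
    schema=map_schema,
)
print(map_table)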
The Rust parquet implementation is very low level, so both approaches are likely supported. I like the idea of using a Map type field. It makes the most sense from a type-system perspective. Ideally there are no variable columns that are sometimes present and sometimes not. Struct doesn't make sense from a type-system perspective, since you need a fixed set of fields in each record (a list of structs works, though). For the first version of the mzparquet format I was working with, I used a list of structs for modelling CV params:
|
In DuckDB, they are all well supported.
|
This means that a list of structs works perfectly well in Rust parquet, @lazear? |
The Rust parquet package exposes full low-level functionality, so anything goes (including writing invalid parquet files 😄). I would be more worried about what the various popular query engines/packages support: for instance, last time I checked, polars didn't support the map type, hence why I used a list of KV structs. |
BTW, I was checking now and MAPs are not well supported by BigQuery either. I think the most stable and well-supported option across other engines/packages is STRUCT/RECORD. |
If we want better compatibility, we can consider these two representations.
or
like: |
This representation can achieve faster query performance compared to before.
or
|
my_schema = pa.schema([
    pa.field("IdScores", type=pa.struct([
        ("name", pa.list_(pa.string())),
        ("score", pa.list_(pa.float32()))  # assuming the score values are float
    ]))
])

seems like it should produce very similar column chunks to

my_schema = pa.schema([
    pa.field("IdScores", type=pa.list_(pa.struct([
        ("name", pa.string()),
        ("score", pa.float32())  # assuming the score values are float
    ]), 2))
])

I'm still learning the |
@mobiusklein Yes, using such a representation, I have written code like this to retrieve it.
This retrieval speed is really slow. |
I would again urge you to consider the semantics of the types rather than query speed. How often are users going to be filtering by score type? {name: [str], score: [f32]} has very different semantics (and potential for writing meaningless values, e.g. unequal list lengths) than [{name: str, score: f32}] |
Completely agree @lazear, probably for this type of score it is better to have STRUCT, which is a more well-known and solid structure. mzTab introduced something called |
Agreed, that is why I was puzzled by the statement that the split lists were faster when, by my understanding of record shredding, the two should produce identical column groups (I'm probably missing something about offsets for lists). Otherwise the list of (score name (string/dict-encoded), score value (float)) pair structs is the ideal compromise. It won't be possible for the page index to help, but it should still be a target for predicate pushdown, if my reading is correct. I'll keep poking away at the |
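A rough sketch, assuming the column is stored as a list of (name, score) structs as discussed above, of how such a column can be filtered in pyarrow without per-row string parsing (table contents and score names are invented for illustration):

import pyarrow as pa
import pyarrow.compute as pc

# hypothetical table with a list-of-structs score column
table = pa.table({
    "psm_name": ["psm1", "psm2", "psm3"],
    "additional_scores": pa.array(
        [
            [{"name": "comet:xcorr", "score": 0.15}, {"name": "msgf:eval", "score": 0.9}],
            [{"name": "comet:xcorr", "score": 0.35}],
            [{"name": "msgf:eval", "score": 0.5}],
        ],
        type=pa.list_(pa.struct([("name", pa.string()), ("score", pa.float32())])),
    ),
})

scores = table["additional_scores"].combine_chunks()
pairs = scores.flatten()                 # one entry per (name, score) pair
rows = pc.list_parent_indices(scores)    # which PSM each pair belongs to
hit = pc.and_(
    pc.equal(pairs.field("name"), "comet:xcorr"),
    pc.greater(pairs.field("score"), 0.2),
)
matching_rows = pc.unique(pc.filter(rows, hit))
print(table.take(matching_rows))         # only psm2 has comet:xcorr > 0.2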
For now, as suggested by @lazear and based on our own tests, we will use a list of structs, which is basically a list of keys and values. This is the representation in Avro:

{
  "name": "additional_scores",
  "type": {
    "type": "array",
    "items": {
      "type": "record",
      "name": "score",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "value", "type": "float"}
      ]
    }
  }
}

This is more widely supported. |
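As a rough illustration that this layout still queries reasonably well from other engines, here is a sketch of filtering such a list-of-structs column from Python with DuckDB; the table, column and score names are hypothetical, and it relies on DuckDB being able to scan a pyarrow table in scope by its variable name:

import duckdb
import pyarrow as pa

# hypothetical pyarrow table using the adopted list-of-structs layout
psms = pa.table({
    "psm_name": ["psm1", "psm2"],
    "additional_scores": pa.array(
        [
            [{"name": "comet:xcorr", "value": 0.15}, {"name": "msgf:eval", "value": 0.9}],
            [{"name": "comet:xcorr", "value": 0.35}],
        ],
        type=pa.list_(pa.struct([("name", pa.string()), ("value", pa.float32())])),
    ),
})

# unnest the list of structs and filter on one score name/value pair
result = duckdb.sql("""
    SELECT psm_name, score.name AS score_name, score.value AS score_value
    FROM (SELECT psm_name, UNNEST(additional_scores) AS score FROM psms)
    WHERE score.name = 'comet:xcorr' AND score.value > 0.2
""")
print(result)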
Currently, the IdScores are modelled as a list of key-value strings: ["score1:value1", "score2:value2"]. This is good for compatibility across different parquet implementations (DuckDB, Rust parquet, etc.); however, it is really bad for filtering, ordering, etc. We have two other possibilities: nested records, or adding every score as a new column. Some ideas from AI on the three approaches here:
1. Current Approach: List of Key-Value Pairs
Representation: IdScores are stored as a list containing strings formatted as "key:value", e.g., ["score1:value1", "score2:value2"].
Advantages:
Disadvantages:
2. Alternative 1: Nested Records (Struct/Map)
Representation: IdScores are modelled as a struct or map, where each key is associated with a value, e.g., a map like {"score1": value1, "score2": value2}.
Advantages:
Disadvantages:
3. Alternative 2: Separate Columns (Wide Table)
Representation: each score gets its own column, e.g., score1, score2, score3, etc.
Advantages:
Disadvantages:
Comparison Summary
Choosing the Best Option
1. If Query Performance and Efficient Filtering/Sorting are Priorities:
2. If Schema Flexibility and Dynamic Scores are Essential:
3. If Maximum Compatibility and Simple Schema are Crucial, with Acceptable Query Performance:
4. Hybrid Approach:
Recommendation
Evaluate Based on Use Case Requirements:
Stable Schema with Frequent Queries on Specific Scores:
Dynamic or Variable Scores with Moderate Query Needs:
High Flexibility with Low Query Performance Needs:
Complex Requirements Needing Both Flexibility and Performance:
Additional Considerations: