additional_scores as Record in parquet instead of list of String #52

Closed
ypriverol opened this issue Sep 5, 2024 · 31 comments · Fixed by #50 or #81

Comments

@ypriverol
Member

Currently, the IdScores are modelled as a list of key-value strings: ["score1:value1", "score2:value2"]. This is good for compatibility across different Parquet implementations (DuckDB, Rust parquet, etc.), but it is really bad for filtering, ordering, etc. We have two other possibilities: nested records, or adding every score as a new column. Some AI-generated notes on the three approaches:


1. Current Approach: List of Key-Value Pairs

Representation:

  • IdScores are stored as a list containing strings formatted as "key:value", e.g., ["score1:value1", "score2:value2"].

Advantages:

  • High Compatibility: Simple list structure ensures broad compatibility across various Parquet implementations.
  • Schema Simplicity: Maintains a flat schema without needing complex nested structures or numerous columns.
  • Flexibility: Easily accommodates dynamic or varying keys without altering the schema.

Disadvantages:

  • Inefficient Filtering and Querying: Filtering, sorting, or aggregating based on individual scores requires parsing the strings (see the sketch at the end of this section).
  • Limited Type Safety: All data is stored as strings, which can lead to issues with type-specific operations and validations.
  • Complex Query Logic: Queries need to extract and possibly cast values from strings, complicating SQL expressions and increasing the likelihood of errors.
  • Storage Overhead: Storing keys and values together as strings can consume more storage compared to native data types.
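
As an illustration of that parsing cost, here is a minimal pyarrow sketch (hypothetical column, score names and values) of what filtering on a single score looks like with the current string layout:

import pyarrow as pa
import pyarrow.compute as pc

# Hypothetical rows in the current layout: every score is a "key:value" string.
table = pa.table({
    "psm_name": ["psm1", "psm2", "psm3"],
    "additional_scores": [
        ["xcorr:2.10", "spectral_angle:0.81"],
        ["xcorr:1.35", "spectral_angle:0.40"],
        ["xcorr:3.02", "spectral_angle:0.93"],
    ],
})

# Filtering on xcorr > 2.0 means exploding the lists, splitting every string and
# casting the value part before any comparison is possible (real score names such
# as "comet:xcorr" contain a colon themselves, making the parsing even messier).
scores = table["additional_scores"].combine_chunks()
flat = pc.list_flatten(scores)               # all "key:value" strings
parents = pc.list_parent_indices(scores)     # row index each string came from
parts = pc.split_pattern(flat, pattern=":")
keys = pc.list_element(parts, 0)
values = pc.list_element(parts, 1).cast(pa.float64())
hits = pc.and_(pc.equal(keys, "xcorr"), pc.greater(values, 2.0))
rows = pc.unique(pc.filter(parents, hits))
print(table.take(rows))                      # psm1 and psm3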

2. Alternative 1: Nested Records (Struct/Map)

Representation:

  • IdScores are modelled as a struct or map, where each key is associated with a value, e.g., a map like {"score1": value1, "score2": value2} (see the pyarrow sketch at the end of this section).

Advantages:

  • Structured Data: Maintains a clear association between keys and their respective values without string parsing.
  • Schema Flexibility: Supports dynamic keys without requiring schema changes, similar to the current approach.
  • Improved Query Performance: Easier to access individual scores compared to parsing strings, as data is already structured.
  • Type Safety: Values can be stored using appropriate data types (e.g., integers, floats), enabling type-specific operations and validations.
  • Better Compression: Similar or better compression compared to key-value strings, as keys can be stored once per record or dictionary-encoded.

Disadvantages:

  • Complexity in Filtering and Sorting: While better than string parsing, filtering and sorting based on individual scores can still be less efficient than separate columns, especially if using maps where keys are dynamic.
  • Limited Indexing: Most Parquet-based systems don’t support indexing on individual map or struct fields, potentially leading to slower query performance for specific keys.
  • Compatibility Concerns: Although widely supported, some tools or systems may have limited or varying support for nested structures, potentially affecting interoperability.
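
A minimal pyarrow sketch of this nested layout (hypothetical field names): the same key/value pairing can be declared either as a map with dynamic keys or as a list of name/value structs, which is what the compatibility point above is about:

import pyarrow as pa

# Map variant: dynamic keys, values stored with a real numeric type.
map_schema = pa.schema([
    pa.field("psm_name", pa.string()),
    pa.field("additional_scores", pa.map_(pa.string(), pa.float32())),
])

# List-of-structs variant: same pairing, more widely supported than the map logical type.
struct_schema = pa.schema([
    pa.field("psm_name", pa.string()),
    pa.field("additional_scores", pa.list_(pa.struct([
        ("name", pa.string()),
        ("value", pa.float32()),
    ]))),
])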

3. Alternative 2: Separate Columns (Wide Table)

Representation:

  • Each score is represented as a distinct column in the table, e.g., score1, score2, score3, etc. (see the schema sketch at the end of this section).

Advantages:

  • Optimal Query Performance: Enables efficient filtering, sorting, and aggregation on individual scores since each is a separate column.
  • Indexing Capabilities: Allows the use of indexes on specific score columns, significantly enhancing query speed for those fields.
  • Type Safety and Optimization: Each column can have a defined data type, facilitating better storage optimization and type-specific operations.
  • Simpler Queries: Easier to write SQL queries as there’s no need for parsing or navigating nested structures.

Disadvantages:

  • Schema Rigidity: Adding or removing scores requires schema changes, which can be cumbersome if the number of scores is large or frequently changing.
  • Wide Schema Issues: A table with many score columns can become unwieldy, making management and maintenance more difficult.
  • Increased Storage for Sparse Data: If many score columns contain nulls (i.e., not all scores apply to every record), it can lead to inefficient storage utilization.
  • Potential for Column Explosion: With a vast number of possible scores, the number of columns can grow excessively, complicating both storage and query logic.
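
A sketch of the wide-table layout in pyarrow (hypothetical column names; the full set of columns has to be fixed up front and extended via schema changes):

import pyarrow as pa

# Each score is its own typed, nullable column.
wide_schema = pa.schema([
    pa.field("psm_name", pa.string()),
    pa.field("comet_xcorr", pa.float32()),
    pa.field("msgf_evalue", pa.float32()),
    pa.field("spectrum_similarity", pa.float32()),
])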

Comparison Summary

| Feature | List of Key-Value Pairs | Nested Records (Struct/Map) | Separate Columns (Wide Table) |
| --- | --- | --- | --- |
| Compatibility | High | High (but some tool limitations) | High |
| Schema Flexibility | Very High (dynamic keys) | High (dynamic keys with maps) | Low (fixed schema) |
| Query Performance | Low (requires string parsing) | Moderate (better than list, but not optimal) | High |
| Filtering & Sorting | Inefficient | Moderate | Efficient |
| Type Safety | Low (all strings) | High (typed values) | High (typed columns) |
| Ease of Querying | Complex | Moderate | Simple |
| Storage Efficiency | Lower (string overhead) | Better (typed and potentially compressed) | Variable (can be efficient or wasteful) |
| Schema Maintenance | Easy | Moderate | Difficult if dynamic |
| Indexing Capabilities | Limited | Limited | Extensive |
| Handling of Dynamic Scores | Excellent | Good (with maps) | Poor (requires schema changes) |

Choosing the Best Option

1. If Query Performance and Efficient Filtering/Sorting are Priorities:

  • Use Separate Columns.
    • Best when the number of scores is relatively fixed and not excessively large.
    • Suitable for scenarios where queries frequently target specific scores.
    • Example Use Case: Analytical dashboards that need to filter or sort based on specific score metrics.

2. If Schema Flexibility and Dynamic Scores are Essential:

  • Use Nested Records (Map).
    • Ideal when scores are dynamic, with keys that can vary between records.
    • Maintains a balance between structure and flexibility without requiring frequent schema changes.
    • Example Use Case: Systems where different entities might have different sets of scores or where scores can be added dynamically.

3. If Maximum Compatibility and Simple Schema are Crucial, with Acceptable Query Performance:

  • Continue with List of Key-Value Pairs.
    • Best when interoperability across diverse Parquet implementations is a top priority and query performance is less critical.
    • Useful for storage or archival purposes where data is primarily accessed in bulk rather than through complex queries.
    • Example Use Case: Data lakes where data is ingested and stored for long-term use with occasional processing.

4. Hybrid Approach:

  • Combine Separate Columns with a Nested Structure.
    • Define commonly queried scores as separate columns for performance.
    • Store less frequent or dynamic scores within a nested struct or map.
    • This approach leverages the strengths of both methods, optimizing for both performance and flexibility (sketched below).
    • Example Use Case: A dataset with a core set of mandatory scores and additional optional or custom scores.
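
A sketch of the hybrid layout in pyarrow (hypothetical field names): the mandatory, frequently queried scores as typed columns, everything engine-specific in a nested key/value list:

import pyarrow as pa

hybrid_schema = pa.schema([
    pa.field("psm_name", pa.string()),
    pa.field("global_qvalue", pa.float32()),
    pa.field("posterior_error_probability", pa.float32()),
    pa.field("additional_scores", pa.list_(pa.struct([
        ("name", pa.string()),
        ("value", pa.float32()),
    ]))),
])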

Recommendation

Evaluate Based on Use Case Requirements:

  1. Stable Schema with Frequent Queries on Specific Scores:

    • Use Separate Columns.
    • Benefits from optimized query performance and simpler SQL operations.
  2. Dynamic or Variable Scores with Moderate Query Needs:

    • Use Nested Records (Map).
    • Balances flexibility with improved query capabilities over simple key-value lists.
  3. High Flexibility with Low Query Performance Needs:

    • Retain List of Key-Value Pairs.
    • Maintains simplicity and maximum compatibility, suitable for storage or infrequent processing.
  4. Complex Requirements Needing Both Flexibility and Performance:

    • Adopt a Hybrid Approach.
    • Leverage separate columns for key scores while storing additional scores in a nested structure.

Additional Considerations:

  • Schema Evolution: If your application frequently introduces new score types, favour approaches that minimize schema changes, such as nested records or key-value lists.
  • Data Volume and Sparsity: For wide tables with many columns, consider the storage implications of sparsity and the potential impact on performance.
  • Tooling and Ecosystem Support: Ensure that your data processing tools and downstream systems effectively support the chosen representation, especially for nested structures or wide schemas.

@jpfeuffer
Contributor

Can you elaborate on filtering/ordering? Which scores do you usually use to order by?

I only read the last part of Chad Gippity's output but a hybrid approach of main scores (FDR, qvalue) as columns and secondary scores as a list seems reasonable.

@ypriverol
Member Author

ypriverol commented Sep 6, 2024

I put some of Chad Gippity's ideas there for those new to Parquet. I have moved the global values and PEP out as columns because most software nowadays produces them, including our pipeline (they can be null if unavailable).

However, search engines produce a lot of scores for every PSM/feature that would be good to capture, for example from Comet or MSGF+, but also from DIA-NN, where you have scores like Spectrum.Similarity or Fragment.Correlations.

I'm almost 100% sure we need to use one column for all the scores; the question is whether we make it a list of key-value strings or a nested structure of keys and values. The trade-off is that nested structures could be incompatible with some platforms, while the list of strings is really slow for filtering and ordering based on scores: every operation on the list has to parse the string first.

This is relevant, for example, in the ID workflow recently developed by @daichengxin, where we have the signal-to-noise ratio (bigbio/quantms#410). Imagine you want to sort your PSMs by that score to apply filters or to inspect the data. With the string layout you need to go inside the string; with nested structures in pyarrow you can at least "unfold" them and sort easily, as sketched below.
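
As a concrete illustration of that "unfolding", here is a minimal pyarrow sketch (hypothetical names; "snr" stands in for the signal-to-noise ratio) that sorts PSMs by one named score stored in a list of structs:

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({
    "psm_name": ["psm1", "psm2", "psm3"],
    "additional_scores": pa.array(
        [
            [{"name": "xcorr", "value": 2.1}, {"name": "snr", "value": 7.5}],
            [{"name": "xcorr", "value": 1.3}, {"name": "snr", "value": 12.0}],
            [{"name": "xcorr", "value": 3.0}, {"name": "snr", "value": 4.2}],
        ],
        type=pa.list_(pa.struct([("name", pa.string()), ("value", pa.float64())])),
    ),
})

# Unfold the list of structs, keep only the snr entries (assumed present in every
# row for this sketch), and sort the parent rows by the extracted value.
scores = table["additional_scores"].combine_chunks()
flat = pc.list_flatten(scores)
parents = pc.list_parent_indices(scores)
is_snr = pc.equal(flat.field("name"), "snr")
snr_values = pc.filter(flat.field("value"), is_snr)
snr_rows = pc.filter(parents, is_snr)
order = pc.array_sort_indices(snr_values, order="descending")
print(table.take(pc.take(snr_rows, order)))  # psm2, psm1, psm3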

@jpfeuffer
Contributor

jpfeuffer commented Sep 6, 2024

Maybe you can store in the parquet metadata what each of the scores in the list mean. Then you should be able to use a) floats b) jump to an index based on the metadata.
Disadvantage: you have to fill scores inside one file with null if they don't exist to keep the list length the same.
But this should not be a problem, since usually if a certain score is output by a program, most rows should have a value.

It's basically like creating multiple columns but always adhering to the same overall schema.
However it might not compress as nicely as multiple columns.

@jpfeuffer
Contributor

I think benchmarking the alternatives on different pipeline variant outputs would be beneficial

@ypriverol
Member Author

ypriverol commented Sep 6, 2024

Maybe you can store in the parquet metadata what each of the scores in the list mean. Then you should be able to use a) floats b) jump to an index based on the metadata. Disadvantage: you have to fill scores inside one file with null if they don't exist to keep the list length the same. But this should not be a problem, since usually if a certain score is output by a program, most rows should have a value.

It's basically like creating multiple columns but always adhering to the same overall schema. However, it might not compress as nicely as multiple columns.

I like the idea of having metadata in the parquet file, where we write something like:

Field Name | Description | Value Type

People may not know the description for one particular term, so within the specification we could release a big file with all the "known" terms and their corresponding descriptions and types?

In the specification, I have already added this idea of metadata at the top of the file for TSV-based file formats such as differential and absolute expression files:

   #INFO=<ID=Protein, Number=inf, Type=String, Description="Protein Accession">
   #INFO=<ID=SampleAccession, Number=1, Type=String, Description="Sample Accession in the SDRF">
   #INFO=<ID=Condition, Number=1, Type=String, Description="Value of the factor value">
   #INFO=<ID=Ibaq, Number=1, Type=Float, Description="Intensity based absolute quantification">
   #INFO=<ID=IbaqNormalized, Number=1, Type=Float, Description="normalized iBAQ">
   #INFO=<ID=QuantmsioVersion, Number=1, Type=String, Description="Version of the quantms.io">

This is similar to other genomics files.

@jpfeuffer
Contributor

jpfeuffer commented Sep 6, 2024

I actually mean embedding it in the parquet metadata for the "additional_scores" column:

my_schema = pa.schema([
    pa.field("psm_name", "string"),
    pa.field("psm_main_score_1", "float"),
    pa.field("psm_main_score_2", "float"),
    pa.field("psm_additional_scores", pa.list_(pa.float32()), metadata={"score_names": ["comet:xcorr", "msgf:eval"]})
])

@ypriverol
Member Author

I think this approach is not desirable because every score effectively becomes a new column, which makes it harder to merge and handle datasets across experiments. For example, if one experiment was analyzed only with Comet and another with Comet and MSGF+, then you will have:

Dataset 1:

my_schema = pa.schema([
    pa.field("psm_name", "string"),
    pa.field("psm_main_score_1", "float"),
    pa.field("psm_additional_scores", pa.list_(pa.float32()), metadata={"score_names": ["comet:xcorr"]})
])

Dataset2:

my_schema = pa.schema([
    pa.field("psm_name", "string"),
    pa.field("psm_main_score_1", "float"),
    pa.field("psm_main_score_2", "float"),
    pa.field("psm_additional_scores", pa.list_(pa.float32()), metadata={"score_names": ["comet:xcorr", "msgf:eval"]})
])

This makes it challenging to analyze, for example if you want to query all the PSMs for a given peptide sequence across all the experiments.

My approach was more in this direction:

my_schema = pa.schema([
    pa.field("psm_name", "string"),
    pa.field("IdScores", "list[string]"),
])

or

pa.field("IdScores", pa.list_(
    pa.struct([
        ("name", pa.string()),
        ("value", pa.float64())  # assuming the score values are float
    ])
))

@jpfeuffer
Contributor

jpfeuffer commented Sep 6, 2024

With main scores I mean scores that should always be there, like a general q-value or an overall combined PEP score. According to your comment:

I have moved out as column the Global values and PEP because most of the software nowadays produces them, including our pipeline (it could be null if unavailable).

Additional scores are scores that can differ between programs.

my_schema = pa.schema([
    pa.field("psm_name", "string"),
    pa.field("global_value", "float32"),
    pa.field("PEP", "float32"),
    pa.field("IdScores", "list[float32]", metadata={"scorenames": ["comet:xcorr", "msgf:eval"]}),
])

@ypriverol
Member Author

Interesting. In that representation, you can't easily filter all the peptides by a given MSGF+ score, or sort by it, because metadata is not easy to incorporate into the query mechanism. For example, how would you filter all comet:xcorr values higher than a threshold?

@zprobot
Collaborator

zprobot commented Sep 6, 2024

This kind of representation is more stable and provides consistent output.

pa.field("IdScores", pa.list_(
    pa.struct([
        ("name", pa.string()),
        ("value", pa.float64())  # assuming the score values are float
    ])
))

@jpfeuffer
Contributor

Interesting. In that representation, you can't easily filter all the peptides by a given MSGF+ score, or sort by it, because metadata is not easy to incorporate into the query mechanism. For example, how would you filter all comet:xcorr values higher than a threshold?

import pyarrow as pa
import pyarrow.compute as pc
# create a schema
my_schema = pa.schema([
    pa.field("psm_name", "string"),
    pa.field("global_value", "float32"),
    pa.field("PEP", "float32"),
    pa.field("IdScores", type=pa.list_(pa.float32(),2), metadata={"scorenames": "comet:xcorr, msgf:eval"}),
])

# create a table
my_data = [
    pa.array(["psm1", "psm2", "psm3"]),
    pa.array([1.0, 2.0, 3.0]),
    pa.array([0.1, 0.2, 0.3]),
    pa.array([[0.1, 0.2], [0.2, 0.3], [0.3, 0.4]]),
]

my_table = pa.table(my_data, schema=my_schema)
print(my_table)

# parse the metadata from the Idscores column, split by comma and create a dict from scorename to index
metadata = my_table.schema.field_by_name("IdScores").metadata
scorenames = metadata[b'scorenames'].decode().split(",")
scorename_to_index = {name: i for i, name in enumerate(scorenames)}

# filter table for rows where the xcorr score is greater than 0.2
xcorr_index = scorename_to_index["comet:xcorr"]
filter_mask = pc.greater(pc.list_element(my_table["IdScores"], xcorr_index), 0.2)
filtered_table = my_table.filter(filter_mask)
print(filtered_table)

@jpfeuffer
Contributor

Maybe pyarrow does something like this under the hood when using structs but I have the feeling the list version with metadata would be faster/smaller than the lists of structs.

@ypriverol
Member Author

The idea is that this parquet is not only used by pyarrow but also by other technologies like DuckDB or the Rust parquet crate. To decide between the struct and the list-plus-metadata approaches, apart from performance, it would be good to know how much support each has in other languages.

@lazear do you know if Rust parquet has support for these two approaches?

@zprobot Can you do a simple test on your side comparing your approach and @jpfeuffer's with a big "fake" file, and check the performance for sorting, filtering and finding?

@zprobot
Collaborator

zprobot commented Sep 6, 2024

import pyarrow as pa
import pyarrow.compute as pc
import time
import numpy as np
my_schema = pa.schema([
    pa.field("IdScores", type=pa.list_(pa.float32(),2), metadata={"scorenames": "comet:xcorr, msgf:eval"}),
])
scores = np.random.random((10000000,2))
my_data = [
    pa.array(scores.tolist()),
]
my_table = pa.table(my_data, schema=my_schema)
metadata = my_table.schema.field_by_name("IdScores").metadata
scorenames = metadata[b'scorenames'].decode().split(",")
scorename_to_index = {name: i for i, name in enumerate(scorenames)}
xcorr_index = scorename_to_index["comet:xcorr"]
pre = time.time()
filter_mask = pc.greater(pc.list_element(my_table["IdScores"], xcorr_index), 0.2)
filtered_table = my_table.filter(filter_mask)
time.time() - pre
# 0.047595977783203125
scores_map = [[{'name':'comet','score':score[0]},{'name':'msgf','score':score[1]}] for score in scores]
my_schema = pa.schema([
    pa.field("IdScores",type=pa.list_(pa.struct([
        ("name", pa.string()),
        ("score", pa.float32())  # assuming the score values are float
    ]),2))
])
my_data = [
    pa.array(scores_map),
]

my_table = pa.table(my_data, schema=my_schema)

pre = time.time()
filter_mask = pc.greater(pc.list_element(my_table["IdScores"], 0).chunk(0).field('score'), 0.2)
filtered_table = pc.filter(my_table, filter_mask)
time.time() - pre
# 0.5332629680633545

The test was conducted on ten million rows. @jpfeuffer's approach performs better;
there is a significant difference in retrieval speed between the two.

@ypriverol
Member Author

ypriverol commented Sep 6, 2024

@zprobot can you check the support for @jpfeuffer's idea in DuckDB and other platforms that use parquet, like pyspark? Thanks a lot for the benchmark.

@zprobot I guess the size is also really small because you don't have to repeat the field names?

@jpfeuffer
Contributor

So I checked, and pyarrow is not very good at writing Parquet-compliant metadata for columns, so it might be a bit of a hassle to extract it with DuckDB SQL only.
(You have to get the ARROW:schema metadata, base64-decode it, etc.)

Alternatives:

  • Use duckdb in python together with pyarrow to decode metadata first
  • Write it into the file metadata (not in the schema.field metadata). There it is readable by duckdb SQL's parquet_metadata function easily.

I have not tried pyspark
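
Following up on the second alternative, a minimal sketch (hypothetical key and file names) of putting the score names into the file-level key/value metadata with pyarrow, so that any Parquet reader can pick them up without decoding the ARROW:schema blob:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "psm_name": ["psm1", "psm2"],
    "additional_scores": [[2.1, 0.001], [1.3, 0.020]],
})

# Merge the score names into the schema metadata; pq.write_table stores this
# as key/value metadata in the Parquet file footer.
existing = table.schema.metadata or {}
table = table.replace_schema_metadata({**existing, b"scorenames": b"comet:xcorr,msgf:eval"})
pq.write_table(table, "psms.parquet")

# Readable from the footer without touching the Arrow schema blob.
print(pq.read_metadata("psms.parquet").metadata[b"scorenames"])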

@ypriverol
Member Author

ypriverol commented Sep 6, 2024

Also, merging datasets with different scores, for example one with Comet and another with MSGF+, will be a nightmare?

I think struct is well known, and in any case we will probably not do hard filtering on IdScores?

@jpfeuffer
Contributor

What about using a Map type field then?
With string to float. Key is score name and value is value.
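
A sketch of what such a map-typed column could look like in pyarrow (hypothetical names and values); how individual keys are then addressed depends on the engine, which is exactly the support question discussed below:

import pyarrow as pa
import pyarrow.parquet as pq

# Score name (string) -> score value (float).
schema = pa.schema([
    pa.field("psm_name", pa.string()),
    pa.field("additional_scores", pa.map_(pa.string(), pa.float32())),
])
table = pa.table(
    {
        "psm_name": ["psm1", "psm2"],
        "additional_scores": [
            [("comet:xcorr", 2.1), ("msgf:eval", 0.001)],
            [("comet:xcorr", 1.3)],
        ],
    },
    schema=schema,
)
pq.write_table(table, "map.parquet")
print(pq.read_schema("map.parquet"))  # map<string, float> column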

@lazear
Collaborator

lazear commented Sep 7, 2024

The idea is that this parquet is not only used by pyarrow but also by other technologies like DuckDB or the Rust parquet crate. To decide between the struct and the list-plus-metadata approaches, apart from performance, it would be good to know how much support each has in other languages.

@lazear do you know if Rust parquet has support for these two approaches?

@zprobot Can you do a simple test on your side comparing your approach and @jpfeuffer's with a big "fake" file, and check the performance for sorting, filtering and finding?

The rust parquet implementation is very low level, so both approaches are likely supported. I like the idea of using a Map type field. It makes the most sense from a type-system perspective. Ideally there are not variable columns that are sometimes present and sometimes not present. Struct doesn't make sense from a type-system perspective, since you need a fixed set of fields in each record (list of structs works, though).

For the first version of the mzparquet format I was working with, I used a list of structs for modelling CV params:

    optional group cv_params (list) {
        repeated group list {
            required group element {
                required byte_array accession (utf8);
                required byte_array value (utf8);
            }
        }
    }

@zprobot
Collaborator

zprobot commented Sep 7, 2024

In DuckDB, they are all well supported.

# struct
duckdb.sql(
        """
            SELECT *
            FROM read_parquet('struct.parquet')
            WHERE array_extract(IdScores, 1).score > 0.99
        """
        )
# array
duckdb.sql(
        """
            SELECT *
            FROM read_parquet('array.parquet')
            WHERE array_extract(IdScores, 1) > 0.99
        """
        )

@ypriverol
Member Author

ypriverol commented Sep 7, 2024

The idea is that this parquet is not only used by pyarrow but also by other technologies like DuckDB or the Rust parquet crate. To decide between the struct and the list-plus-metadata approaches, apart from performance, it would be good to know how much support each has in other languages.
@lazear do you know if Rust parquet has support for these two approaches?
@zprobot Can you do a simple test on your side comparing your approach and @jpfeuffer's with a big "fake" file, and check the performance for sorting, filtering and finding?

The rust parquet implementation is very low level, so both approaches are likely supported. I like the idea of using a Map type field. It makes the most sense from a type-system perspective. Ideally there are not variable columns that are sometimes present and sometimes not present. Struct doesn't make sense from a type-system perspective, since you need a fixed set of fields in each record (list of structs works, though).

For the first version of the mzparquet format I was working with, I used a list of structs for modelling CV params:

    optional group cv_params (list) {
        repeated group list {
            required group element {
                required byte_array accession (utf8);
                required byte_array value (utf8);
            }
        }
    }

This means that a list of structs works perfectly well in Rust parquet, @lazear?

@lazear
Collaborator

lazear commented Sep 7, 2024

The Rust parquet package exposes full low-level functionality, so anything goes (including writing invalid parquet files 😄). I would be more worried about what the various popular query engines/packages support: for instance, last time I checked, polars didn't support the map type, which is why I used a list of KV structs.

@ypriverol
Member Author

BTW, I just checked and MAPs are not well supported by BigQuery either. I think the most stable and best-supported structure across engines/packages is STRUCT/RECORD.

@zprobot
Collaborator

zprobot commented Sep 8, 2024

If we want better compatibility, we can consider these two representations.

my_schema = pa.schema([
    pa.field("Software",type=pa.list_(pa.string())),
    pa.field("IdScores", type=pa.list_(pa.float32())),
])

or

my_schema = pa.schema([
    pa.field("IdScores", type=pa.struct([
        ("name", pa.list_(pa.string())),
        ("score", pa.list_(pa.float32()))  # assuming the score values are float
    ]))
])

like:
[comet, msgf] [0.99435776, 0.21769278]
or
{'name': [comet, msgf], 'score': [0.99435776, 0.21769278]}

@zprobot
Collaborator

zprobot commented Sep 8, 2024

This representation can achieve faster query performance than before.
For example, in DuckDB:

duckdb.sql("""
       SELECT *
        FROM read_parquet('list.parquet')
        WHERE IdScores[array_position(Software, 'comet')] > 0.99;
        """)

or

duckdb.sql("""
       SELECT *
       FROM read_parquet('s.parquet')
       WHERE IdScores.score[array_position(IdScores.name, 'comet')] > 0.99;
        """)

@mobiusklein
Contributor

my_schema = pa.schema([
    pa.field("IdScores", type=pa.struct([
        ("name", pa.list_(pa.string())),
        ("score", pa.list_(pa.float32()))  # assuming the score values are float
    ]))
])

seems like it should produce very similar column chunks to

my_schema = pa.schema([
    pa.field("IdScores",type=pa.list_(pa.struct([
        ("name", pa.string()),
        ("score", pa.float32())  # assuming the score values are float
    ]),2))
])

I'm still learning the pyarrow.compute layer to replicate the comparison @zprobot did in-memory. Are there ways that this layout might matter?

@zprobot
Collaborator

zprobot commented Sep 9, 2024

@mobiusklein Yes. For such a representation, I have written code like this to retrieve it.

my_schema = pa.schema([
    pa.field("IdScores",type=pa.list_(pa.struct([
        ("name", pa.string()),
        ("score", pa.float32())  # assuming the score values are float
    ]),2))
])
duckdb.sql(
        """
        SELECT *
        FROM read_parquet('struct.parquet') AS t, 
        LATERAL unnest(t.IdScores) AS s
        WHERE (s.IdScores.name = 'comet' AND s.IdScores.score > 0.99)
        """
        )

This retrieval speed is really slow.

@lazear
Collaborator

lazear commented Sep 9, 2024

@mobiusklein Yes. For such a representation, I have written code like this to retrieve it.

my_schema = pa.schema([
    pa.field("IdScores",type=pa.list_(pa.struct([
        ("name", pa.string()),
        ("score", pa.float32())  # assuming the score values are float
    ]),2))
])
duckdb.sql(
        """
        SELECT *
        FROM read_parquet('struct.parquet') AS t, 
        LATERAL unnest(t.IdScores) AS s
        WHERE (s.IdScores.name = 'comet' AND s.IdScores.score > 0.99)
        """
        )

This retrieval speed is really slow.

I would again urge you to consider the semantics of the types rather than query speed. How often are users going to be filtering by score type?

{name: [str], score: [f32]} has very different semantics (and potential for writing meaningless values, e.g. unequal list lengths) than [{name: str, score: f32}].

@ypriverol
Member Author

ypriverol commented Sep 9, 2024

Completely agree, @lazear. For this type of score it is probably better to have a STRUCT, which is a more well-known and solid structure.

mzTab introduced something called best_search_engine_name and best_search_engine_score, which could be used to store the main score of the search engine; that was my intention for PEP and q-value, which search engines normally provide.

@mobiusklein
Contributor

Agreed, that is why I was puzzled by the statement that the split lists were faster when, by my understanding of record shredding, the two should produce identical column chunks (I'm probably missing something about offsets for lists). Otherwise the list of (score name (string/dict-encoded), score value (float)) pair structs is the ideal compromise.

It won't be possible for the page index to help, but it should still be a target for predicate pushdown, if my reading is correct. I'll keep poking away at the pyarrow.compute API or just dump to disk and use duckdb/datafusion.
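
One rough way to check this empirically (synthetic data, hypothetical file names): write both layouts to Parquet and compare the leaf column chunks that actually get stored; after record shredding, both layouts come down to one string leaf column and one float leaf column, so any size or speed difference should come from repetition/definition levels and encoding rather than the logical shape.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n = 10_000
vals = np.random.random((n, 2))

# Layout A: one struct of two parallel lists per row.
struct_of_lists = pa.table({
    "IdScores": pa.StructArray.from_arrays(
        [
            pa.array([["comet", "msgf"]] * n, type=pa.list_(pa.string())),
            pa.array(vals.tolist(), type=pa.list_(pa.float32())),
        ],
        names=["name", "score"],
    )
})

# Layout B: one list of {name, score} structs per row.
list_of_structs = pa.table({
    "IdScores": pa.array(
        [
            [{"name": "comet", "score": float(a)}, {"name": "msgf", "score": float(b)}]
            for a, b in vals
        ],
        type=pa.list_(pa.struct([("name", pa.string()), ("score", pa.float32())])),
    )
})

# Compare the physical leaf columns of the first row group of each file.
for tag, tbl in [("struct_of_lists", struct_of_lists), ("list_of_structs", list_of_structs)]:
    path = f"{tag}.parquet"
    pq.write_table(tbl, path)
    rg = pq.ParquetFile(path).metadata.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(tag, col.path_in_schema, col.total_compressed_size)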

@ypriverol ypriverol changed the title IdScores as Record in parquet instead of list of String additional_scores as Record in parquet instead of list of String Sep 10, 2024
@ypriverol ypriverol linked a pull request Sep 11, 2024 that will close this issue
@ypriverol
Member Author

For now, as suggested by @lazear and based on our own tests, we will use a list of structs, which is basically a list of keys and values. This is the representation in Avro:

{
  "name": "additional_scores",
  "type": {
    "type": "array",
    "items": {
      "type": "struct",
      "field": {
        "name": "string",
        "value": "float"
      }
    }
  }
}

This is better supported across implementations.
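
For reference, a minimal pyarrow sketch of the adopted layout (field names follow the Avro snippet above; the production quantms.io schema may differ in naming or value type):

import pyarrow as pa

additional_scores = pa.field(
    "additional_scores",
    pa.list_(pa.struct([
        ("name", pa.string()),    # score name, e.g. "comet:xcorr"
        ("value", pa.float32()),  # score value
    ])),
)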

@ypriverol ypriverol linked a pull request Nov 7, 2024 that will close this issue