Options to avoid losing values in collect_map #5576

Open
philrz opened this issue Jan 14, 2025 · 0 comments

philrz commented Jan 14, 2025

tl;dr

The docs for the collect_map aggregate function disclose:

If collect_map receives multiple values for the same key, the last value received is retained.

This may be fine for many use cases, but we may want to offer options that preserve all values.

Details

At the time this issue is being opened, super is at commit 55d99d3.

To illustrate, we'll use the following example input, prices.json, which is similar to the data used in the collect_map docs.

{"stock":"IBM","price":46.67}
{"stock":"APPL","price":150.13}
{"stock":"GOOG","price":87.07}
{"stock":"APPL","price":150.13}
{"stock":"GOOG","price":89.15}

By simply applying collect_map, we get the final price value seen in the input stream for each unique stock ticker.

$ super -version
Version: v1.18.0-222-g55d99d3b

$ super -Z -c 'collect_map(|{stock:price}|)' prices.json 
|{
    "IBM": 46.67,
    "APPL": 150.13,
    "GOOG": 89.15
}|

This "last value wins" behavior may be familiar to many users. For instance, ChatGPT suggested the following jq command line to combine the same stream of objects into a single JSON object.

$ jq -s 'reduce .[] as $item ({}; .[$item.stock] = $item.price)' prices.json 
{
  "IBM": 46.67,
  "APPL": 150.13,
  "GOOG": 89.15
}
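
For comparison, the same "last value wins" outcome falls out of plain dictionary assignment in most languages. Below is a minimal Python sketch over the same records. It is not part of super, and `last_value_wins` is a hypothetical helper name used only for illustration.

```python
import json

# Same records as prices.json, inlined so the sketch is self-contained.
PRICES = """\
{"stock":"IBM","price":46.67}
{"stock":"APPL","price":150.13}
{"stock":"GOOG","price":87.07}
{"stock":"APPL","price":150.13}
{"stock":"GOOG","price":89.15}
"""

def last_value_wins(lines):
    """Build {stock: price}; a later record silently overwrites an earlier one."""
    result = {}
    for line in lines.splitlines():
        rec = json.loads(line)
        # Plain assignment: any earlier price for this stock is dropped here.
        result[rec["stock"]] = rec["price"]
    return result

print(last_value_wins(PRICES))
```

The earlier GOOG price of 87.07 never appears in the result, and nothing signals that it was discarded.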

However, @mccanne recently pointed out that silently dropping values may not be ideal, especially since SuperDB provides complex data types that could easily hold all of them: a set if the user wants to keep each unique value, or an array if they want to keep every value (including repeats) in the order encountered in the stream.

This can already be achieved using existing building blocks, e.g., first invoking union in a separate step to create sets:

$ super -Z -c 'price:=union(price) by stock | collect_map(|{stock:price}|)' prices.json  
|{
    "IBM": |[
        46.67
    ]|,
    "APPL": |[
        150.13
    ]|,
    "GOOG": |[
        87.07,
        89.15
    ]|
}|

Or collect to create arrays:

$ super -Z -c 'price:=collect(price) by stock | collect_map(|{stock:price}|)' prices.json 
|{
    "IBM": [
        46.67
    ],
    "APPL": [
        150.13,
        150.13
    ],
    "GOOG": [
        87.07,
        89.15
    ]
}|
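
To make the difference between the two value-preserving variants concrete, here is a rough Python model of the grouping. It only approximates the semantics described above and says nothing about how super implements union or collect. The `collect_all` name and its `unique` parameter are hypothetical.

```python
from collections import defaultdict

# (stock, price) pairs in the same order as prices.json.
RECORDS = [
    ("IBM", 46.67),
    ("APPL", 150.13),
    ("GOOG", 87.07),
    ("APPL", 150.13),
    ("GOOG", 89.15),
]

def collect_all(records, unique=False):
    """Group every value under its key instead of overwriting.

    unique=False keeps repeats in arrival order (array-like).
    unique=True keeps each distinct value once (set-like; a list is
    used here only so the printed order is predictable).
    """
    grouped = defaultdict(list)
    for key, value in records:
        if unique and value in grouped[key]:
            continue  # skip a value already recorded for this key
        grouped[key].append(value)
    return dict(grouped)

print(collect_all(RECORDS))               # array-like: APPL keeps both 150.13 entries
print(collect_all(RECORDS, unique=True))  # set-like: APPL keeps a single 150.13
```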

If we add this functionality to collect_map, whether as a change in default behavior or as new options, we'd also need to decide whether values are always wrapped in the complex type, even for keys with a single value (as in the examples above that use existing building blocks), or only when multiple values are observed for a key.
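
The wrapping question can be sketched in Python to show what each choice would produce. This is purely illustrative: `collect_map_wrapping` and its `always_wrap` flag are hypothetical names, not a proposed collect_map option or syntax.

```python
# (stock, price) pairs in the same order as prices.json.
RECORDS = [
    ("IBM", 46.67),
    ("APPL", 150.13),
    ("GOOG", 87.07),
    ("APPL", 150.13),
    ("GOOG", 89.15),
]

def collect_map_wrapping(records, always_wrap):
    """Collect all values per key, optionally unwrapping single values."""
    grouped = {}
    for key, value in records:
        grouped.setdefault(key, []).append(value)
    if always_wrap:
        # Every value is a list, so all values share one type.
        return grouped
    # Wrap only when a key saw more than one value; single values stay bare.
    return {k: (v if len(v) > 1 else v[0]) for k, v in grouped.items()}

print(collect_map_wrapping(RECORDS, always_wrap=True))
print(collect_map_wrapping(RECORDS, always_wrap=False))
```

With `always_wrap=False`, IBM maps to a bare number while APPL and GOOG map to lists, so consumers of the result have to handle two value shapes. Always wrapping avoids that at the cost of one-element containers.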
