[FEATURE]PPL aggregation Performance enhancement using sampling #790

YANG-DB · 2024-10-18T16:24:42Z

Is your feature request related to a problem?
Currently, PPL's top and rare commands perform a full scan followed by an ORDER BY to output the most common or the rarest values:

SELECT 
    status_code AS Field, 
    COUNT(*) AS Frequency
FROM 
    table
GROUP BY 
    status_code
ORDER BY 
    Frequency DESC
LIMIT 5

This query represents the logical plan to be executed by the engine.
It has the inherent flaw of having to scan the entire table and sort all the grouped results only to return the top 5 elements, which is very costly.

In general, any aggregation in a SELECT statement could benefit from a sampling strategy ...

What solution would you like?
In many cases the overall cardinality of a column and the relative frequency of its values can be estimated from a small sample of the dataset.
Spark offers the following syntax for sampling a table:

TABLESAMPLE ({ integer_expression | decimal_expression } PERCENT)
    | TABLESAMPLE ( integer_expression ROWS )
    | TABLESAMPLE ( BUCKET integer_expression OUT OF integer_expression )

And in our case:

SELECT 
    status_code AS Field, 
    COUNT(*) AS Frequency
FROM 
    t TABLESAMPLE (10 PERCENT) -- Get approximately 10% of the rows
GROUP BY 
    status_code
ORDER BY 
    Frequency DESC
LIMIT 5;
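
To illustrate why a sample is often enough for a top-N query, here is a minimal sketch in plain Python (not Spark). The status_code distribution is made up, and the helper names `top_n` and `sampled_top_n` are hypothetical. It mimics a percent-based row sample, and it scales the sampled counts back up by the sampling fraction, since the Frequency returned by a sampled query counts only the sampled rows.

```python
import random
from collections import Counter


def top_n(values, n):
    """Full scan: equivalent of GROUP BY ... ORDER BY COUNT(*) DESC LIMIT n."""
    return Counter(values).most_common(n)


def sampled_top_n(values, n, percent, seed=0):
    """Approximate top-n: keep each row with probability percent/100,
    count the sample, then scale the counts back to the full table size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    fraction = percent / 100.0
    sample = [v for v in values if rng.random() < fraction]
    return [
        (value, round(count / fraction))
        for value, count in Counter(sample).most_common(n)
    ]


# Synthetic, skewed status_code column (assumed distribution, for illustration only).
row_counts = {200: 70_000, 404: 15_000, 500: 8_000, 301: 4_000, 403: 2_000, 502: 1_000}
table = [code for code, count in row_counts.items() for _ in range(count)]

exact = top_n(table, 5)
approx = sampled_top_n(table, 5, percent=10)

print("full scan :", exact)
print("10% sample:", approx)
```

On skewed data like this the sampled ranking matches the full-scan ranking, while the scaled frequencies are only estimates. Values with nearly equal frequencies, or very rare values, are where a sample is most likely to change the result.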

The new top and rare API will look as follows:

top [N] <field-list> [by-clause] [TABLESAMPLE ({ integer_expression | decimal_expression } PERCENT | integer_expression ROWS | BUCKET integer_expression OUT OF integer_expression)]

Examples:

source=accounts | top 5 age by gender tablesample (10 percent)
source=accounts | rare 5 age by nationality tablesample (500 rows)
source=accounts | rare 10 nationality tablesample (bucket 4 out of 10)
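
As a rough sketch of how such a command could map onto the sampled SQL shown earlier, the Python helper below builds the query text for the simple case without a by-clause. The function `top_rare_to_sql` and its parameters are hypothetical and are not the project's actual translation; it only assumes that top orders by descending frequency and rare by ascending frequency.

```python
def top_rare_to_sql(command, n, field, table, sample=None):
    """Build a Spark-style SQL string for `top`/`rare` on a single field.

    `sample` is the text inside TABLESAMPLE (...), e.g. "10 PERCENT",
    "500 ROWS" or "BUCKET 4 OUT OF 10". The by-clause is not handled here.
    """
    if command not in ("top", "rare"):
        raise ValueError(f"unsupported command: {command}")
    # top -> most frequent first, rare -> least frequent first
    order = "DESC" if command == "top" else "ASC"
    source = f"{table} TABLESAMPLE ({sample})" if sample else table
    return (
        f"SELECT {field} AS Field, COUNT(*) AS Frequency "
        f"FROM {source} "
        f"GROUP BY {field} "
        f"ORDER BY Frequency {order} "
        f"LIMIT {n}"
    )


# Corresponds to: source=accounts | rare 10 nationality tablesample (bucket 4 out of 10)
sql = top_rare_to_sql("rare", 10, "nationality", "accounts", "BUCKET 4 OUT OF 10")
print(sql)
```

Leaving `sample` unset yields the current full-scan query, so the sampling clause stays strictly opt-in.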

Do you have any additional context?


YANG-DB added enhancement New feature or request Lang:PPL Pipe Processing Language support labels Oct 18, 2024

YANG-DB self-assigned this Oct 18, 2024

github-actions bot added the untriaged label Oct 18, 2024

YANG-DB removed the untriaged label Oct 18, 2024

YANG-DB mentioned this issue Oct 18, 2024

PPL fieldsummary command #766

Merged

YANG-DB changed the title ~~[FEATURE]PPL Top/Rare Performance enhancement~~ [FEATURE]PPL aggregation Performance enhancement using sampling Oct 21, 2024

YANG-DB added the 0.6 label Oct 21, 2024
