Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE]PPL aggregation Performance enhancement using sampling #790

Open
YANG-DB opened this issue Oct 18, 2024 · 0 comments
Open

[FEATURE]PPL aggregation Performance enhancement using sampling #790

YANG-DB opened this issue Oct 18, 2024 · 0 comments
Assignees
Labels
0.6 enhancement New feature or request Lang:PPL Pipe Processing Language support

Comments

@YANG-DB
Copy link
Member

YANG-DB commented Oct 18, 2024

Is your feature request related to a problem?
Currently PPL's Top / Rare perform a full scan & order by to output the most common or rare values:

SELECT 
    status_code AS Field, 
    COUNT(*) AS Frequency
FROM 
    table
GROUP BY 
    status_code
ORDER BY 
    Frequency DESC
LIMIT 5

This query represents the logical plan to be executed by the engine.
It has the inherited flaw of having to scan the entire table and order the results only to get the top 5 elements - this is very costly .

In general any aggregation of select statement could benefit using a sampling strategy ...

What solution would you like?
In many cases the overall cardinality of a column and its values can be determined using a small sample of the dataset.
spark offers the next syntax for sampling the table:

TABLESAMPLE ({ integer_expression | decimal_expression } PERCENT)
    | TABLESAMPLE ( integer_expression ROWS )
    | TABLESAMPLE ( BUCKET integer_expression OUT OF integer_expression )

And in our case:

SELECT 
    status_code AS Field, 
    COUNT(*) AS Frequency
FROM 
    t TABLESAMPLE (10 precent) -- Get approximately 10% of the rows
GROUP BY 
    status_code
ORDER BY 
    Frequency DESC
LIMIT 5;

The new top and rare api will look as follows:

top [N] <field-list> [by-clause] [TABLESAMPLE ({ integer_expression | decimal_expression } PERCENT)]

Examples:

source=accounts | top 5 age by gender tablesample (10 precent)
source=accounts | rare 5 age by nationality tablesample (500 rows)
source=accounts | rare 10 nationality tablesample(bucket 4 out of 10);

Do you have any additional context?

@YANG-DB YANG-DB added enhancement New feature or request Lang:PPL Pipe Processing Language support labels Oct 18, 2024
@YANG-DB YANG-DB self-assigned this Oct 18, 2024
@YANG-DB YANG-DB removed the untriaged label Oct 18, 2024
@YANG-DB YANG-DB changed the title [FEATURE]PPL Top/Rare Performance enhancement [FEATURE]PPL aggregation Performance enhancement using sampling Oct 21, 2024
@YANG-DB YANG-DB added the 0.6 label Oct 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.6 enhancement New feature or request Lang:PPL Pipe Processing Language support
Projects
Status: InReview
Development

No branches or pull requests

1 participant