Is your feature request related to a problem?
Currently PPL's Top / Rare perform a full scan & order by to output the most common or rare values:
SELECT
    status_code AS Field,
    COUNT(*) AS Frequency
FROM
    table
GROUP BY
    status_code
ORDER BY
    Frequency DESC
LIMIT 5
This query represents the logical plan to be executed by the engine.
It has the inherent flaw of having to scan the entire table and order the results only to get the top 5 elements - this is very costly.
In general, any aggregation in a SELECT statement could benefit from using a sampling strategy.
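For illustration, here is a minimal Spark sketch of the exact plan next to a sampled variant (the http_logs table name, status_code column, and the 10% fraction are hypothetical stand-ins for the PPL source):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.desc

    val spark = SparkSession.builder().appName("top-sampling-sketch").getOrCreate()
    val logs = spark.table("http_logs") // hypothetical table standing in for the PPL source

    // Exact: scans the whole table, shuffles every group, sorts all groups just to emit 5 rows.
    val exactTop5 = logs.groupBy("status_code").count()
      .orderBy(desc("count")).limit(5)

    // Approximate: aggregate only a ~10% sample, so far less data is scanned, shuffled, and sorted.
    val sampledTop5 = logs.sample(withReplacement = false, fraction = 0.10)
      .groupBy("status_code").count()
      .orderBy(desc("count")).limit(5)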
What solution would you like?
In many cases the overall cardinality of a column and its values can be determined using a small sample of the dataset.
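As a rough sketch of that idea (same hypothetical table and column as above, and the 10% fraction is only an example), the cardinality of a column can be estimated from a sample rather than from the full dataset:

    import org.apache.spark.sql.functions.approx_count_distinct

    // Estimate how many distinct status_code values exist using only a 10% sample.
    val approxCardinality = spark.table("http_logs")
      .sample(fraction = 0.10)
      .agg(approx_count_distinct("status_code").as("approx_distinct_status_codes"))

    approxCardinality.show()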
Spark offers the following syntax for sampling a table:
SELECT
    status_code AS Field,
    COUNT(*) AS Frequency
FROM
    t TABLESAMPLE (10 PERCENT) -- Get approximately 10% of the rows
GROUP BY
    status_code
ORDER BY
    Frequency DESC
LIMIT 5;
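Spark SQL also accepts the row-count and bucket forms of TABLESAMPLE, which the PPL examples further down mirror; for instance, against the same table t:

    // Sample a fixed number of rows.
    val byRows = spark.sql(
      "SELECT status_code AS Field, COUNT(*) AS Frequency FROM t TABLESAMPLE (500 ROWS) GROUP BY status_code")

    // Sample roughly 4/10 of the rows via buckets.
    val byBucket = spark.sql(
      "SELECT status_code AS Field, COUNT(*) AS Frequency FROM t TABLESAMPLE (BUCKET 4 OUT OF 10) GROUP BY status_code")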
And in our case, the new top and rare API will look as follows:

top [N] <field-list> [by-clause] [TABLESAMPLE ({ integer_expression | decimal_expression } PERCENT)]

Examples:

source=accounts | top 5 age by gender tablesample (10 percent)
source=accounts | rare 5 age by nationality tablesample (500 rows)
source=accounts | rare 10 nationality tablesample(bucket 4 out of 10);
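For illustration only, and assuming the sample is applied before the aggregation (this is a sketch of intent, not the implementation), the first example above could lower to roughly this Spark plan:

    import org.apache.spark.sql.functions.desc

    // Hypothetical lowering of: source=accounts | top 5 age by gender tablesample (10 percent)
    val top5AgeByGender = spark.table("accounts")
      .sample(fraction = 0.10)          // tablesample (10 percent)
      .groupBy("gender", "age")         // by gender, field age
      .count()
      .orderBy(desc("count"))           // most frequent values first (top)
      .limit(5)                         // top 5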
Do you have any additional context?
top by a high cardinality field will cause the writing job to fail with "size exceed limitation" (#740).