[BUG][SanityTest] stats by a high cardinality field will cause writing job fail with "size exceed limitation" #740

Open
LantaoJin opened this issue Oct 3, 2024 · 2 comments
Labels: 0.6, bug, Lang:PPL

Comments

@LantaoJin

What is the bug?
When grouping by a high-cardinality field such as clientip, the Spark write job fails with "Request size exceeded 10485760 bytes".

[Screenshots: 2024-10-01 at 17:33:09 and 2024-10-01 at 15:12:07]

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Go to the Workbench page, for example https://search-flint05-sanity-h5if7yelmxws5hc35cap2lauf4.us-east-1.es-integ.amazonaws.com/_dashboards/app/opensearch-query-workbench#/
  2. Execute the PPL query: source = myglue_test.default.http_logs | stats avg(size) by clientip
  3. Open the EMR-S job in the Spark UI.
  4. See the error.
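
For illustration only, here is a minimal Python sketch of one way a writer could stay under the 10485760-byte limit quoted in the error: splitting the aggregated rows into size-bounded batches. It assumes the failure comes from the whole result being sent as one request, which the issue does not confirm. The function name `batch_by_size` and the JSON-lines encoding are hypothetical and are not taken from the project's code.

```python
import json
from typing import Iterable, Iterator, List

# 10 MiB, the limit quoted in the error message.
MAX_REQUEST_BYTES = 10_485_760


def batch_by_size(rows: Iterable[dict],
                  max_bytes: int = MAX_REQUEST_BYTES) -> Iterator[List[str]]:
    """Yield lists of JSON-encoded rows whose total size stays within max_bytes.

    Hypothetical helper: each row is counted as its UTF-8 JSON encoding
    plus one byte for a newline separator.
    """
    batch: List[str] = []
    batch_bytes = 0
    for row in rows:
        line = json.dumps(row)
        line_bytes = len(line.encode("utf-8")) + 1
        if line_bytes > max_bytes:
            raise ValueError("a single row exceeds the request size limit")
        # Start a new batch when adding this row would cross the limit.
        if batch and batch_bytes + line_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(line)
        batch_bytes += line_bytes
    if batch:
        yield batch
```

With one row per distinct clientip, the number of batches grows with the cardinality of the field, but no single batch exceeds the limit.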

LantaoJin added the bug and untriaged labels Oct 3, 2024

YANG-DB commented Oct 3, 2024

@LantaoJin what do you suggest here?
Should we rewrite the query in such high-cardinality cases?

YANG-DB added the Lang:PPL label Oct 3, 2024
YANG-DB self-assigned this Oct 7, 2024

YANG-DB commented Oct 7, 2024

fieldsummary should provide statistical information about fields, including cardinality, to assist in estimating the cost of a query.
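
As a rough illustration of how a cardinality figure could feed such a cost estimate, the Python sketch below multiplies a distinct-group count by an average row size and compares the result to the 10485760-byte limit from the error. The function names and the per-row size are assumptions made for the example. This is not the output format or behavior of fieldsummary.

```python
# 10 MiB, the limit quoted in the error message.
MAX_REQUEST_BYTES = 10_485_760


def estimated_result_bytes(distinct_groups: int, avg_row_bytes: int) -> int:
    """Rough result size: one output row per distinct group-by value."""
    return distinct_groups * avg_row_bytes


def exceeds_request_limit(distinct_groups: int, avg_row_bytes: int,
                          max_bytes: int = MAX_REQUEST_BYTES) -> bool:
    """True when the estimated result would not fit in a single request."""
    return estimated_result_bytes(distinct_groups, avg_row_bytes) > max_bytes


def max_groups_per_request(avg_row_bytes: int,
                           max_bytes: int = MAX_REQUEST_BYTES) -> int:
    """Largest group count whose estimated result still fits in one request."""
    return max_bytes // avg_row_bytes
```

For example, at an assumed 64 bytes per output row, `max_groups_per_request(64)` gives 163840, so a group-by field with more distinct values than that would be flagged before the write is attempted.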
