Incorrect avg aggregation result when querying CSUP data (regression at 66b20d0) #5593

Open
philrz opened this issue Jan 22, 2025 · 0 comments
Labels
bug Something isn't working

philrz commented Jan 22, 2025

tl;dr

Running ClickBench query 2 against CSUP data produces an incorrect avg() aggregation result. The incorrect result appears in both the sequential and vector runtimes, so the problem is presumably in the conversion to the CSUP format rather than in either runtime. I've confirmed that the problem first appears at commit 66b20d0, which is associated with the changes in #5577.

Details

Repro is with super commit 04b7efc using the hits.parquet test data from ClickBench.

As a baseline, here's the presumed-correct avg() result of 1513.4879349030107, which is the same whether the original Parquet data is queried from DuckDB or from SuperDB using either the sequential or vector runtime.

$ duckdb -version
v1.1.3 19864453f7

$ duckdb -c "SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits.parquet;"
┌──────────────────┬──────────────┬──────────────────────┐
│ sum(AdvEngineID) │ count_star() │ avg(ResolutionWidth) │
│      int128      │    int64     │        double        │
├──────────────────┼──────────────┼──────────────────────┤
│          7280088 │     99997497 │   1513.4879349030107 │
└──────────────────┴──────────────┴──────────────────────┘

$ super -version
Version: v1.18.0-237-g04b7efce

$ SUPER_VAM=1 super -c "SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits.parquet;"
{sum:7280088,count:99997497(uint64),avg:1513.4879349030107}

$ super -c "SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits.parquet;"
{sum:7280088,count:99997497(uint64),avg:1513.4879349030107}

Once we convert the Parquet file to CSUP, the avg result becomes 7224.820559888614 instead of 1513.4879349030107 in both the sequential and vector runtimes, while the sum and count results are unchanged.

$ super -f csup -o hits.csup hits.parquet

$ SUPER_VAM=1 super -c "SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits.csup;"
{sum:7280088,count:99997497(uint64),avg:7224.820559888614}

$ super -c "SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits.csup;"
{sum:7280088,count:99997497(uint64),avg:7224.820559888614}
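The comparison between the Parquet and CSUP results can also be checked mechanically. The following is a minimal Python sketch, not part of super or of the original report: the helper names `parse_result` and `mismatched_fields` and the regular expression are hypothetical, and the parser only handles flat numeric records like the ones shown above, not the full output syntax.

```python
import math
import re

# Result lines copied verbatim from the transcripts above.
PARQUET_LINE = "{sum:7280088,count:99997497(uint64),avg:1513.4879349030107}"
CSUP_LINE = "{sum:7280088,count:99997497(uint64),avg:7224.820559888614}"

# Hypothetical pattern: matches `name:number` pairs and ignores a trailing
# type decoration such as `(uint64)`.
FIELD_RE = re.compile(r"(\w+):(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)")


def parse_result(line):
    """Return a dict mapping field name to float for one result line."""
    return {name: float(value) for name, value in FIELD_RE.findall(line)}


def mismatched_fields(expected, actual, rel_tol=1e-9):
    """List the fields of `expected` that are missing or differ in `actual`."""
    return [
        name
        for name in expected
        if name not in actual
        or not math.isclose(expected[name], actual[name], rel_tol=rel_tol)
    ]


parquet = parse_result(PARQUET_LINE)
csup = parse_result(CSUP_LINE)
print(mismatched_fields(parquet, csup))  # prints ['avg']
```

Only avg is reported as differing; sum and count agree between the two inputs, which matches what the transcripts show.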

Tests indicate this started happening at commit 66b20d0, which is associated with the changes in #5577. The result was correct at the commit just before:

$ super -version && super -f csup -o hits.csup hits.parquet && SUPER_VAM=1 super -c "SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits.csup;"
Version: v1.18.0-226-g02d23319
{sum:7280088,count:99997497(uint64),avg:1513.4879349030107}

The result is then incorrect at 66b20d0:

$ super -version && super -f csup -o hits.csup hits.parquet && SUPER_VAM=1 super -c "SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits.csup;"
Version: v1.18.0-227-g66b20d0f
{sum:7280088,count:99997497(uint64),avg:7224.820559888614}
philrz added the bug label on Jan 22, 2025
philrz changed the title from "Incorrect avg aggregation result when querying CSUP data" to "Incorrect avg aggregation result when querying CSUP data (regression at 66b20d0)" on Jan 22, 2025