Is DuckDB using the geometry_bbox struct as a sudo spatial index #396
Replies: 1 comment 2 replies
-
@TimJMartin I'm glad you've took the time to have a go at this and shared your experince as i was about to test the RTree index myself. Those speed differences are pretty amazing when using the RTree index! I think your process here is the way I would I would go about it. Frustrating how long it takes a .db file to use an index on.. With regards to the struct being used for a de-facto index, I think this is very much case and has been available in DuckDB since v0.10 DuckDB General discussion here on how Overture made use of this: OvertureMaps/data#91 So I would say the best route to go down for now is to use native geoparquet files that have a bbox struct. |
Beta Was this translation helpful? Give feedback.
-
Hi
So with the latest release - awesome job btw - I thought I would try a few things to see how best to apply it to some of the data I have.
Normally we use PostGIS and have a dataset with 130million polygon features. We spatial index it and queries similar to this
Take around 0.6 secs
So I used GDAL to convert the PostGIS table to a Geoparquet (this took a while) which is about a 32Gb file, and then loaded this into a persistent DuckDB database. This took several hours and resulted in a 102Gb DB file.
I then ran the query and it took 26 mins
Then added the new RTreee index and rerun the query and it took 0.4 seconds
Shows the power of the index - however the overhead for loading into a DuckDB specific DB and then indexing makes it not interopable with other systems or use cases, eg leaving as a geoparquet means I can use the same source data in QGIS.
So it got me wondering what the raw query straight from the geoparquet would be and it was about 1 minute. Have played around with some partitioning as I saw that a parquet file greater than 10gb isn't advised, the best I can get is about 40 seconds.
Back to my question - is DuckDB using the geometry_bbox struct in the geoparquet as a sudo spatial index when doing a spatial query?
Reading this post by Chris Holmes - https://medium.com/radiant-earth-insights/geoparquet-1-1-coming-soon-9b72c900fbf2 it would seem they are using the struct and that will be considered in the in format "index".
So what is the benefit of the RTree index over using the struct (obviously there is a 0.4 second to 40second difference - but the hours spent loading out way the use benefit?
Probably going about this all wrong, so thought I would post this as a discussion and get some thoughts from people.
Beta Was this translation helpful? Give feedback.
All reactions