
DuckDB process killed when creating an R-Tree index #410

Closed
CGenie opened this issue Sep 25, 2024 · 11 comments · Fixed by #420

Comments

CGenie commented Sep 25, 2024

I downloaded the OSM PBF file for Poland from here: https://download.geofabrik.de/europe.html

I created a duckdb table like this:

INSTALL spatial;
LOAD spatial;

CREATE TABLE nodes AS SELECT * FROM ST_ReadOSM('poland.osm.pbf') WHERE kind = 'node';
ALTER TABLE nodes ADD COLUMN pt GEOMETRY;
UPDATE nodes SET pt = ST_Point2D(lat, lon);

My DB size is around 6.5 GB. Now I try to do:

CREATE INDEX nodes_pt_idx ON nodes USING RTREE (pt);

I see a progress bar; it goes up to around 28% and then I get a "Killed" message, probably because the Linux OOM killer decided the process was using too much memory. Indeed, watching htop at the same time, the process's memory rises to 18 GB (I have 32 GB of RAM). I thought DuckDB was able to handle tasks larger than memory?

I'm using DuckDB v1.1.1.

BTW, it seems I get the same thing when creating a full-text-search index (https://duckdb.org/docs/extensions/full_text_search.html), so maybe it's an issue with index creation in general?
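
For the FTS case, this is roughly what I mean (a minimal sketch following the FTS extension docs; docs, id and body are just placeholder names, not my actual schema):

INSTALL fts;
LOAD fts;
-- Placeholder table and columns; the point is only that building the index runs out of memory.
PRAGMA create_fts_index('docs', 'id', 'body');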

Maxxen (Member) commented Sep 25, 2024

Hello! Have you tried setting a memory limit? All memory allocated by the R-Tree should be tracked by DuckDB and respect the memory limit parameter.

Are you running with a disk-backed database file or entirely in memory? You may want to have a look at the spilling-to-disk section as well.
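
For example, something like this (the values here are just placeholders, adjust them to your machine):

-- Cap DuckDB's tracked memory and give it somewhere to spill temporary data.
SET memory_limit = '16GB';
SET temp_directory = '/path/to/scratch';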

CGenie (Author) commented Sep 25, 2024

Well, with memory_limit set to 10G I get:

D set memory_limit = '10G';
D load spatial;
D CREATE INDEX nodes_pt_idx ON nodes USING RTREE (pt);
 33% ▕███████████████████▊                                        ▏ INTERNAL Error: Attempted to dereference unique_ptr that is NULL!
This error signals an assertion failure within DuckDB. This usually occurs due to unexpected conditions or errors in the program's logic.
For more information, see https://duckdb.org/docs/dev/internal_errors

I'll try disk spilling. The DB is a file, not in memory.

CGenie (Author) commented Sep 25, 2024

OK, I get the same error with disk spilling (i.e. `SET temp_directory = '...'`).

Maxxen (Member) commented Sep 25, 2024

I'm unable to reproduce this on DuckDB v1.1.0 on macOS in disk-backed mode (although without a memory limit).

D load spatial;
D CREATE TABLE nodes AS SELECT * FROM ST_ReadOSM('poland-latest.osm.pbf') WHERE kind = 'node';
100% ▕████████████████████████████████████████████████████████████▏
D ALTER TABLE nodes ADD COLUMN pt GEOMETRY;
D UPDATE nodes SET pt = ST_Point2D(lat, lon);
100% ▕████████████████████████████████████████████████████████████▏
D CREATE INDEX nodes_pt_idx ON nodes USING RTREE (pt);
100% ▕████████████████████████████████████████████████████████████▏
D

I'll see if I can mimic your setup and reproduce the error.

CGenie (Author) commented Sep 25, 2024

Hm, OK, it seems to have finished just fine with the memory limit set to 15G.

CGenie (Author) commented Sep 25, 2024

I'm on Linux btw.

Maxxen (Member) commented Sep 25, 2024

If I set a 10gb limit I get a proper out-of-memory error before I even get to the index creation.

D SET memory_limit = '10gb';
D load spatial;
D CREATE TABLE nodes AS SELECT * FROM ST_ReadOSM('poland-latest.osm.pbf') WHERE kind = 'node';
100% ▕████████████████████████████████████████████████████████████▏
D ALTER TABLE nodes ADD COLUMN pt GEOMETRY;
D UPDATE nodes SET pt = ST_Point2D(lat, lon);
 73% ▕███████████████████████████████████████████▊                ▏ Out of Memory Error: failed to pin block of size 256.0 KiB (9.3 GiB/9.3 GiB used)

Nonetheless, an INTERNAL Error is never good; I'll look into setting up a Linux box to dive into this further tomorrow.

CGenie (Author) commented Sep 26, 2024

Yes, I'd guess that even with memory_limit = '1G' it should still finish, though it might take more time.

Maxxen (Member) commented Sep 30, 2024

Not necessarily: the index itself must still fit in memory. With the 217,708,491 nodes in the Poland extract you're looking at 217,708,491 × 40 bytes ≈ 8.7 GB of raw index buffers, not including any additional bookkeeping structures in DuckDB.
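
As a rough sanity check you can compute that directly (assuming ~40 bytes per R-Tree entry, which is only an approximation):

-- Approximate raw R-Tree buffer size in (decimal) GB, at ~40 bytes per entry.
SELECT count(*) * 40.0 / 1e9 AS approx_index_gb FROM nodes;
-- For the Poland extract (~217.7 million nodes) this comes out to roughly 8.7 GB.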

CGenie (Author) commented Sep 30, 2024

Hm, OK, but shouldn't it handle data larger than RAM by using temporary disk space?

Maxxen (Member) commented Sep 30, 2024

Unfortunately, this is currently a limitation of all indexes in DuckDB. While they are lazily loaded from disk once you open a database, they will not be unloaded from memory, which is why the entire index needs to fit in memory when you first create it. That said, their memory is still tracked by DuckDB, meaning they should respect the `SET memory_limit` option, and DuckDB will attempt to unload other temporary memory to make space for them.
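
If you want to see what DuckDB is tracking, something like this should show per-component memory usage, including index buffers (assuming a DuckDB version that ships the duckdb_memory() table function):

-- Lists memory tracked per component (buffer pool, indexes, temporary data, ...).
SELECT * FROM duckdb_memory();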

I've pushed a fix for the "INTERNAL Error: Attempted to dereference unique_ptr that is NULL" error in #420.
