-
Hey @ieure. Thanks for coming here. I don't see a problem with your code and I don't think Datahike is supposed to behave like that. Maybe @whilo has an idea on what's going on.
-
@ieure Interesting! I think you are running into this because Datahike does a lot of caching and does not yet enforce bounds on stored Datoms (neither for byte arrays nor for strings). Since many of your Datoms have large blobs assigned to them as byte arrays, even a single index segment of 512 Datoms will effectively hog a big part of memory all the time (unless you disconnect). Datomic opted to bound all inputs, which is good for predictable performance but forces people to use it in only one way. The solution in this case is to store the blobs outside of the index, e.g. by writing them directly into the underlying konserve store instead of into the index (you can grab the store reference from your connection with …)
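A minimal sketch of that side-channel approach (not from the thread): it assumes konserve.core's channel-returning `assoc`/`get`, and that `store` is the konserve store reference mentioned above; `store-blob!`, `fetch-blob`, and `:file/blob-id` are illustrative names, not Datahike API.

```clojure
;; Sketch: keep blobs out of the Datahike index by writing them straight
;; into the konserve store and transacting only a small reference.
(require '[konserve.core :as k]
         '[clojure.core.async :refer [<!!]])

(defn store-blob!
  "Writes the byte array under a fresh UUID key in the konserve store
  and returns the key."
  [store ^bytes blob]
  (let [blob-id (java.util.UUID/randomUUID)]
    (<!! (k/assoc store blob-id blob))
    blob-id))

(defn fetch-blob
  "Reads a byte array back by its key."
  [store blob-id]
  (<!! (k/get store blob-id)))

;; Datahike then only ever sees the small reference:
;; (d/transact conn [{:file/name "photo.png"
;;                    :file/blob-id (store-blob! store bytes)}])
```

konserve also has dedicated binary helpers (`bassoc`/`bget`) that may be a better fit for raw byte arrays than serialized values.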
-
I'm using Datahike for a Clojure (on the JVM) project where I store binary data as well as metadata about it. I'm running into issues with memory usage -- after storing around a few hundred megs of data, the Java heap is exhausted and I start getting `OutOfMemoryError`s. Java has 8gb of heap space.

I fired up jvisualvm to see what was eating up memory, and it shows that there's 6.4gb of `byte[]` objects on the heap, almost all (6.3gb) of which are rooted in `datahike.connector.Connection`.

My application is using a single Datahike connection, which is opened during startup and released during shutdown. Because of the issue I'm experiencing, I'm wondering: is this the correct way to use Datahike? It's definitely unusual for other datastores I've used, but the "Usage" section of the README strongly implies that this is the expected interaction model. There is a note that "you might need to release the connection for specific stores," but it doesn't mention which stores, or if there are cases where this is needed, or whether the release should happen at shutdown, around every operation, periodically, etc. I have concurrent writers, and I'm not sure if it's safe to open multiple connections to the same underlying store or not.
Basically, I don't understand if I'm using Datahike wrong, or if it has a memory leak.
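For concreteness, a sketch of the lifecycle being described, using the public `datahike.api` (the `cfg` value and the `:file` backend here are placeholders, not the actual application config):

```clojure
;; Single connection, opened at startup and released at shutdown.
(require '[datahike.api :as d])

(def cfg {:store {:backend :file :path "/var/data/app-db"}}) ; placeholder

(defn start! []
  (when-not (d/database-exists? cfg)
    (d/create-database cfg))
  (d/connect cfg))

(defn stop! [conn]
  ;; per the README note: "you might need to release the connection
  ;; for specific stores"
  (d/release conn))
```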
I wrote up a testcase that reproduces this. I'm using OpenJDK 17, Datahike 0.6.1558, and Clojure 1.11. This mimics my real application behavior. A single run will result in 4-5gb rooted in the `Connection`, and a second run will (on my machine with 32gb RAM / 8gb heap for the JVM) trigger OOMEs. I'm not sure why there's 10x more heap used than data being stored. Triggering a manual GC will free up a good chunk of this, but without doing that, it OOMEs, which is surprising. After the manual GC, there's still ~1gb of `byte[]` retained by the `Connection` -- releasing the connection frees that.