-
Hey @ieure. Thanks for coming here. I don't see a problem with your code and I don't think Datahike is supposed to behave like that. Maybe @whilo has an idea on what's going on.
-
@ieure Interesting! I think you are running into this because Datahike does a lot of caching and does not yet enforce bounds on stored Datoms (neither for byte arrays nor for strings). Since many of your Datoms have large blobs assigned to them as byte arrays, even a single index segment of 512 Datoms will effectively hog a big part of memory all the time (unless you disconnect). Datomic opted to bound all inputs, which is good for predictable performance but forces people to use it in only one way. The solution in this case is to store the blobs outside of the index, e.g. by writing them directly into the underlying konserve store instead of into the index (you can grab the store reference from your connection with …)
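A minimal sketch of that side-channel approach (not from the thread): it assumes konserve.core's channel-returning `assoc`/`get`, and that `store` is the konserve store reference mentioned above; `store-blob!`, `fetch-blob`, and `:file/blob-id` are illustrative names, not Datahike API.

```clojure
;; Sketch: keep blobs out of the Datahike index by writing them straight
;; into the konserve store and transacting only a small reference.
(require '[konserve.core :as k]
         '[clojure.core.async :refer [<!!]])

(defn store-blob!
  "Writes the byte array under a fresh UUID key in the konserve store
  and returns the key."
  [store ^bytes blob]
  (let [blob-id (java.util.UUID/randomUUID)]
    (<!! (k/assoc store blob-id blob))
    blob-id))

(defn fetch-blob
  "Reads a byte array back by its key."
  [store blob-id]
  (<!! (k/get store blob-id)))

;; Datahike then only ever sees the small reference:
;; (d/transact conn [{:file/name "photo.png"
;;                    :file/blob-id (store-blob! store bytes)}])
```

konserve also has dedicated binary helpers (`bassoc`/`bget`) that may be a better fit for raw byte arrays than serialized values.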
-
I'm using Datahike for a Clojure (on the JVM) project where I store binary data as well as metadata about it. I'm running into issues with memory usage -- after storing around a few hundred megs of data, the Java heap is exhausted and I start getting `OutOfMemoryError`s. Java has 8gb of heap space.

I fired up jvisualvm to see what was eating up memory, and it shows that there's 6.4gb of `byte[]` objects on the heap, almost all (6.3gb) of which are rooted in `datahike.connector.Connection`.

My application is using a single Datahike connection, which is opened during startup and released during shutdown. Because of the issue I'm experiencing, I'm wondering: is this the correct way to use Datahike? It's definitely unusual for other datastores I've used, but the "Usage" section of the README strongly implies that this is the expected interaction model. There is a note that "you might need to release the connection for specific stores," but it doesn't mention which stores, or if there are cases where this is needed, or whether the release should happen at shutdown, around every operation, periodically, etc. I have concurrent writers, and I'm not sure if it's safe to open multiple connections to the same underlying store or not.
Basically, I don't understand if I'm using Datahike wrong, or if it has a memory leak.
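For concreteness, a sketch of the lifecycle being described, using the public `datahike.api` (the `cfg` value and the `:file` backend here are placeholders, not the actual application config):

```clojure
;; Single connection, opened at startup and released at shutdown.
(require '[datahike.api :as d])

(def cfg {:store {:backend :file :path "/var/data/app-db"}}) ; placeholder

(defn start! []
  (when-not (d/database-exists? cfg)
    (d/create-database cfg))
  (d/connect cfg))

(defn stop! [conn]
  ;; per the README note: "you might need to release the connection
  ;; for specific stores"
  (d/release conn))
```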
I wrote up a testcase that reproduces this. I'm using OpenJDK 17, Datahike 0.6.1558, and Clojure 1.11. This mimics my real application behavior. A single run will result in 4-5gb rooted in the `Connection`, and a second run will (on my machine with 32gb RAM / 8gb heap for the JVM) trigger OOMEs. I'm not sure why there's 10x more heap used than data being stored. Triggering a manual GC will free up a good chunk of this, but without doing that, it OOMEs, which is surprising. After the manual GC, there's still ~1gb of `byte[]` retained by the `Connection` -- releasing the connection frees that.