In-memory caching for indexes #505
-
Since #503 is working now, we can think about different loading and caching strategies inside the persistent set. We have not been able to get there yet for the hitchhiker-tree (we just cached the most-used tree fragments), so hopefully we can significantly reduce this access time. @MrEbbinghaus, can you isolate which index access calls cause the latency? Ideally, for many databases you want to keep everything in memory and only keep a copy on disk for cold reboots, but it will really depend on the use case. I think it would be good to use this opportunity to discuss a few caching policies and how to expose them to users.
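For example, a policy could live next to the store settings in the database configuration. This is only a hypothetical shape to make the discussion concrete; none of the `:cache` keys below exist today:

```clojure
;; Hypothetical configuration sketch, not an existing Datahike option.
;; :policy would pick the caching strategy, :threshold would bound the
;; number of index segments kept in memory.
{:store {:backend :file
         :path    "/var/lib/datahike/example"}
 :cache {:policy    :lru        ; alternatives: :none, :keep-all-in-memory
         :threshold 100000}}
```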
-
@whilo I would also like to know the reason behind the layout change with konserve 0.6.
-
I recently noticed that the latency of my pulls isn't that great on my NVMe-SSD-equipped machine with the file backend. A `pull-many` with 9 entities and 8 attributes (one join) takes 40 ms. I investigated and found that roughly half of that time is spent on resolving a lookup ref to the `eid`.

I propose adding more in-memory caching to Datahike, at least for the `aevt` index, but I think in many cases Datahike could benefit from memory caches for other data as well. In my case (a small web application), I am not so much interested in batch write performance as in retrieval latency.
Tests with a simple cache:
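As a rough illustration of what I mean, here is a minimal sketch using clojure.core.cache; the `cached-eid` helper, the cache key shape, and the 1024-entry threshold are placeholders, not an existing Datahike API:

```clojure
(ns example.cache
  (:require [clojure.core.cache.wrapped :as cw]
            [datahike.api :as d]))

;; LRU cache keyed by lookup ref, e.g. [:user/email "a@b.c"].
;; The 1024-entry threshold is an arbitrary placeholder.
(defonce lookup-ref-cache
  (cw/lru-cache-factory {} :threshold 1024))

(defn cached-eid
  "Resolve a lookup ref to an entity id, hitting the in-memory cache
  first and falling back to a pull of :db/id on a miss. Eids are stable,
  so entries only need invalidation when the entity is retracted."
  [db lookup-ref]
  (cw/lookup-or-miss lookup-ref-cache
                     lookup-ref
                     (fn [lref] (:db/id (d/pull db [:db/id] lref)))))

;; Usage: resolve lookup refs once, then pull by eid.
;; (d/pull-many db '[*] (map #(cached-eid db [:user/email %]) user-emails))
```

With something like this in front of the lookup-ref resolution, repeated pulls only pay the disk round trip on the first access.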