Make cache configuration available from cpp api. #946
Comments
Note that a lot of this has been discussed in #311. This issue is mainly intended to answer one thing:
I'm not in favor of moving this issue forward until we have confirmed that tuning all these settings would help. I have two main use-cases in mind:
We could maybe also have a look at other readers, but I'm less experienced with them. To me, the main problem currently is not really that these settings are unavailable from the cpp API, but rather that most tool creators (scrapers, library generation, probably many readers...) and ops do not know how to tweak these settings properly. So I would prefer to start from the end:
I thought this was already the case.
Without knowing the use case and without some investigation, it is difficult to provide a specific configuration for each use case. However, if you want to reduce the cache as much as possible, you can set
Then again, I thought you had already done it.
The only thing we've done so far is set
Then @kelson42 decided to set up some kind of task force / hackathon to better understand what could be done but, unless I missed it, it never happened.
If neither the software developer nor the user knows how to configure a memory limit, it will be difficult to design a smart system.
To be clear: moving configuration from env vars to cpp vars will NOT remove complexity.
Why? That a cache must be configurable for specific use cases, yes.
Designing this kind of API seems a bit premature as we still don't know what we want.
Yes, this is clearly stated in my proposition. Limiting the cache by memory is a whole different project (and far more complex):
For now, all caches (different types and different zims) are independent. When we want to add an item to a cache that is full, we simply drop the least recently used item. If we reason in terms of memory, dropping that single item may not be enough: we may have to drop several of the least recently used items to free enough memory. Maybe it is better to drop a "big but not least recently used" item than several "least recently used but small" ones. I'm not against implementing such a system, but it should probably be a project of its own and not just a small issue.
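To make the difference with the current count-based eviction concrete, here is a minimal sketch of a cache bounded by memory instead of by item count. It is not libzim code: the class name `MemoryBoundedLru`, its methods, and the choice of "value size in bytes" as the cost are all assumptions made for illustration. It only shows the simple policy (drop least recently used items until the new one fits), not the "drop one big item instead" variant.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch, not the libzim cache: an LRU cache whose limit is a
// byte budget rather than a number of items.
class MemoryBoundedLru {
public:
  explicit MemoryBoundedLru(std::size_t maxBytes) : maxBytes_(maxBytes) {}

  // Insert or replace an entry. Evicts as many least recently used entries
  // as needed; an entry larger than the whole budget is simply not cached.
  void put(const std::string& key, std::string value) {
    erase(key);
    const std::size_t cost = value.size();
    if (cost > maxBytes_) return;
    while (usedBytes_ + cost > maxBytes_) {
      evictLeastRecentlyUsed();
    }
    entries_.emplace_front(key, std::move(value));
    index_[key] = entries_.begin();
    usedBytes_ += cost;
  }

  // Returns the cached value (or nullptr) and marks it most recently used.
  const std::string* get(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return nullptr;
    entries_.splice(entries_.begin(), entries_, it->second);
    return &it->second->second;
  }

  std::size_t usedBytes() const { return usedBytes_; }
  std::size_t size() const { return entries_.size(); }

private:
  using Entry = std::pair<std::string, std::string>;

  void erase(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return;
    usedBytes_ -= it->second->second.size();
    entries_.erase(it->second);
    index_.erase(it);
  }

  void evictLeastRecentlyUsed() {
    const Entry& victim = entries_.back();
    usedBytes_ -= victim.second.size();
    index_.erase(victim.first);
    entries_.pop_back();
  }

  std::size_t maxBytes_;
  std::size_t usedBytes_ = 0;
  std::list<Entry> entries_;  // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

With a 10-byte budget holding two 4-byte entries, inserting one 8-byte entry evicts both older entries, which is exactly the "several items for one" behaviour a count-based cache never has to deal with.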
Requesting this information from the user is a no-go, so we will have to be smart. If we look at similar software, this is what a web browser manages to do, AFAIK. Seems doable to me.
Whatever is needed, we obviously need tight control over cache memory consumption. This is something I expect to be treated now, and not in an unclear future.
All of these questions seem legitimate, but if you think twice, none of them seems that hard to answer. I don't think that designing a first version of such a system takes more than a day... then of course a bit of time will be needed for the implementation. Remember, I'm not asking to build the smartest cache system. I'm just asking for a caching system which does not run out of memory and works in a reasonable manner.
I have created #947 to discuss limiting cache memory. I don't think there is an issue dedicated to this, and it is a different (though related) subject from cache configuration.
To be clear about one point: preloading of (1) the Xapian index and (2) dirents via the fast lookup cache should be options of the open-ZIM primitive(s). The limits should be set via dedicated methods (I have already written this), so nothing is complicated to handle or understand.
After discussion with @kelson42 and @rgaudin about caching issues, it appears that:
As a reminder, here are the cache strategies in use for now:
This number is controlled by the env var ZIM_CLUSTERCACHE. The memory used by this cache is not obvious, as we do partial decompression: on top of the decompressed data, we also store the decompressor stream/context, which itself stores some data.
This number is controlled by the env var ZIM_DIRENTCACHE. The memory used by this cache is not really known (mainly because each dirent has a variable size, depending on its url/title size) but can be "easily" calculated at runtime.
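As a rough illustration of what "calculated at runtime" could mean for the dirent cache: sum a fixed part per dirent plus the length of its variable-size strings. The `DirentSketch` struct below is a hypothetical stand-in (the real libzim dirent class has different fields), and the estimate ignores allocator overhead and small-string optimisation, so treat it as an approximation only.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a cached dirent; the real libzim class differs.
struct DirentSketch {
  std::string url;
  std::string title;
  std::uint32_t clusterNumber = 0;
  std::uint32_t blobNumber = 0;
};

// Fixed part of the object plus the variable-length url/title payload.
std::size_t estimateDirentBytes(const DirentSketch& d) {
  return sizeof(DirentSketch) + d.url.size() + d.title.size();
}

// Approximate footprint of everything currently held in the cache.
std::size_t estimateCacheBytes(const std::vector<DirentSketch>& cached) {
  std::size_t total = 0;
  for (const auto& d : cached) {
    total += estimateDirentBytes(d);
  }
  return total;
}
```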
This number of ranges is controlled by the env var ZIM_DIRENTLOOKUPCACHE. The question of memory usage is the same as for the dirent cache, but less important.
Contrary to the other caches, this cache has a fixed size and is fully populated at first access.
This cache speeds up subsequent reads, but if only a few reads follow, populating the cache may slow down the whole process.
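The trade-off can be seen on a toy model. The sketch below is not the libzim `FastDirentLookup` implementation; `CountingStore` and `RangeLookupSketch` are hypothetical names. It assumes the lookup cache works by sampling evenly spaced keys from a sorted store once, then using them to narrow the binary search range without touching the store. Every store access is counted as one read.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for on-disk dirents sorted by path; each access counts as a read.
class CountingStore {
public:
  explicit CountingStore(std::vector<std::string> sortedKeys)
      : keys_(std::move(sortedKeys)) {}
  const std::string& read(std::size_t i) const { ++reads_; return keys_[i]; }
  std::size_t size() const { return keys_.size(); }
  std::size_t reads() const { return reads_; }
private:
  std::vector<std::string> keys_;
  mutable std::size_t reads_ = 0;
};

// Hypothetical range cache: fully populated in the constructor.
class RangeLookupSketch {
public:
  RangeLookupSketch(const CountingStore& store, std::size_t rangeCount)
      : store_(store) {
    if (rangeCount < 2 || store.size() < 2) return;  // cache stays disabled
    for (std::size_t r = 0; r <= rangeCount; ++r) {
      const std::size_t idx = r * (store.size() - 1) / rangeCount;
      if (!boundaries_.empty() && boundaries_.back().first == idx) continue;
      boundaries_.emplace_back(idx, store.read(idx));  // one read per boundary
    }
  }

  // Returns the index of `key`, or store.size() when it is absent.
  std::size_t find(const std::string& key) const {
    std::size_t lo = 0, hi = store_.size();  // half-open range [lo, hi)
    for (const auto& b : boundaries_) {      // in-memory only, no store reads
      if (b.second <= key) { lo = b.first; } else { hi = b.first; break; }
    }
    while (lo < hi) {                        // binary search, reads the store
      const std::size_t mid = lo + (hi - lo) / 2;
      const std::string& k = store_.read(mid);
      if (k == key) return mid;
      if (k < key) lo = mid + 1; else hi = mid;
    }
    return store_.size();
  }

private:
  const CountingStore& store_;
  std::vector<std::pair<std::size_t, std::string>> boundaries_;
};
```

In this model, building a 100-range cache over 1000 keys costs 101 reads up front, while a plain binary search over the same store needs about 10 reads per lookup. The cache therefore only pays off after a fair number of lookups.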
Since the default value for ZIM_DIRENTLOOKUPCACHE is 1024, the cache has to avoid at least 1024 dirent reads before it pays off. This is almost impossible when doing only a few reads (such as getting the metadata from the zim file).
Proposition:
- CacheConfig structure which contains information about the cache strategy (for now, the size of the different caches)
- FastDirentLookup "deactivates" itself if the value is zero or one.
- If a tool (zim-tools, kiwix-desktop, ...) still wants to use env vars to control caches, this should be implemented there.
Tools would have to be adapted to use this new feature. As we keep a compatible API, there is no need to adapt them right now and so they will not be adapted in this run.
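For a tool that wants to keep honouring the env vars, the tool-side code could look roughly like the sketch below. Only the three env var names come from this issue. The `CacheConfig` field names, `sizeFromEnv` and `cacheConfigFromEnv` are hypothetical, and how the resulting structure would be handed to libzim is deliberately left out since that API does not exist yet.

```cpp
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdlib>

// Hypothetical shape of the proposed structure; field names are illustrative.
struct CacheConfig {
  std::size_t clusterCacheSize;
  std::size_t direntCacheSize;
  std::size_t direntLookupCacheSize;
};

// Parse a non-negative integer from an env var. Falls back to `fallback`
// when the variable is unset, empty, or not a plain decimal number.
std::size_t sizeFromEnv(const char* name, std::size_t fallback) {
  const char* raw = std::getenv(name);
  if (raw == nullptr || !std::isdigit(static_cast<unsigned char>(raw[0]))) {
    return fallback;
  }
  char* end = nullptr;
  errno = 0;
  const unsigned long long value = std::strtoull(raw, &end, 10);
  if (errno != 0 || *end != '\0') return fallback;
  return static_cast<std::size_t>(value);
}

// Tool-side helper: start from the tool's own defaults, then let the
// env vars named in this issue override them.
CacheConfig cacheConfigFromEnv(const CacheConfig& defaults) {
  return CacheConfig{
      sizeFromEnv("ZIM_CLUSTERCACHE", defaults.clusterCacheSize),
      sizeFromEnv("ZIM_DIRENTCACHE", defaults.direntCacheSize),
      sizeFromEnv("ZIM_DIRENTLOOKUPCACHE", defaults.direntLookupCacheSize)};
}
```

Taking the defaults as a parameter keeps the policy in the tool: a metadata-only tool could pass a lookup cache size of 0 or 1, while a server could pass something larger.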
Limiting the cache memory size is a much more complex matter, as we would need to make the cache global to all opened zims (so loading a dirent/cluster in one zim may imply dropping a cached dirent/cluster of another zim). I consider this out of scope for this issue.
Testing:
Automatic testing is a bit complex here. We would have to mock the cache system to get information about what is cached or not, and test that.
I will simply test the functional part and check that we get the same results whatever the cache config is.
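The functional check could be as simple as the helper below: run the same reads under several cache sizes and compare the results against the first run. This is a generic sketch and not an existing libzim test. The name `sameResultsForAllConfigs` is hypothetical, and the `readAllWith` callback stands for "open the zim with this cache size and return what was read".

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns true when every cache size yields exactly the same read results
// as the first one. `readAllWith` is supplied by the test.
bool sameResultsForAllConfigs(
    const std::vector<std::size_t>& cacheSizes,
    const std::function<std::vector<std::string>(std::size_t)>& readAllWith) {
  if (cacheSizes.empty()) return true;
  const std::vector<std::string> reference = readAllWith(cacheSizes.front());
  for (std::size_t i = 1; i < cacheSizes.size(); ++i) {
    if (readAllWith(cacheSizes[i]) != reference) return false;
  }
  return true;
}
```

Including the sizes 0 and 1 in `cacheSizes` would also cover the case where the lookup cache deactivates itself.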