mv flexcache

Matthew Von-Maszewski edited this page Sep 24, 2013 · 29 revisions

Status

  • merged to master
  • code complete September 22, 2013
  • development started

History / Context

leveldb contains two caches per database (Riak vnode): file cache and block cache. The user establishes the size of each cache via the Options structure passed to leveldb upon database open. The two cache sizes are then fixed while the database is open. With Riak, the two cache sizes are actually fixed across all databases until Riak is restarted.

Riak is a very dynamic user of leveldb. Riak can have a highly variable number of databases open. And Riak can run for long periods of time, using hundreds to thousands of leveldb .sst table files. Both of these dynamics lead to suboptimal sizing of the two caches since sizes must conform to the worst case not the most likely, or to the mathematically estimated scenarios not the actual runtime usage.

This branch automates the sizing / resizing of the file cache and block cache during operation while maximizing total memory usage across a dynamic number of databases. The variables considered:

  • varying number of open databases (vnodes) as Riak performs handoffs
  • prioritized memory demands of file versus block cache since cost of a miss in file cache is much greater than cost of block cache miss
  • file cache entries can become stale (not actively accessed) and therefore reduce the block cache size available to active files
  • Riak has two tiers of databases, user databases (vnodes) and internal databases (active anti-entropy management, AAE), that should get different allocations of memory
  • the total memory available to virtual servers can change dynamically in some VM environments and it would be helpful if leveldb could adjust without total restart

Branch description

The mv-flexcache branch manages the memory allocated to each database, and the split of that allocation between the database's file and block caches. Each database's file and block cache are independent of every other database's. This design eliminates the need for thread synchronization across accesses to different databases, and therefore maintains the current performance characteristics. A single, unified file and/or block cache would suffer many spinlock or mutex collisions under the Riak usage pattern, slowing all accesses.

This branch adds code to keep track of the number of open databases (vnodes) and whether those databases hold user data or internal data (active anti-entropy feature). Each Open() operation registers the new database with the global database list and informs the master cache object, FlexCache, of the total memory allocation available to all databases. The FlexCache object then calls every existing database object to inform it that the per-database allocation has changed. If internal databases exist, they split a 20% allocation of all memory. User databases split the remaining 80% if internal databases exist, or 100% if no internal databases exist (AAE disabled).
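The tiered split above can be sketched as follows. This is a minimal illustration of the arithmetic described in the text, not the branch's actual code; the function name and parameters are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// Hypothetical sketch: internal (AAE) databases share 20% of
// total_leveldb_mem when any exist; user databases share the
// remaining 80%, or 100% when AAE is disabled.
uint64_t PerDatabaseAllocation(uint64_t total_mem,
                               bool internal_tier,
                               size_t user_count,
                               size_t internal_count) {
    if (internal_tier) {
        // 20% share, divided evenly among internal databases
        return internal_count ? (total_mem / 5) / internal_count : 0;
    }
    // 80% if internal databases exist, otherwise the full amount
    uint64_t tier_total = internal_count ? total_mem - total_mem / 5
                                         : total_mem;
    return user_count ? tier_total / user_count : 0;
}
```

For example, with 1 GB of total memory, 4 user databases, and 2 internal databases, each user database receives 200 MB and each internal database 100 MB; disabling AAE raises each user database's share to 250 MB.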

Each database object DBImpl contains a master cache object DoubleCache. DoubleCache manages both the file cache and the block cache. DoubleCache gives memory priority to the file cache, but all memory not used by the file cache is available to the block cache. The block cache is guaranteed a 2 Mbyte minimum size.

The Open API can fail if the new database object would receive less than 18 Mbytes of shared cache.
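The two sizing rules above can be sketched as simple helpers. The constants come from the text (2 MB block cache floor, 18 MB Open() threshold); the function names are illustrative, not the branch's actual API.

```cpp
#include <cassert>
#include <cstdint>

const uint64_t kMinBlockCache = 2 * 1024 * 1024;   // guaranteed block cache floor
const uint64_t kMinTotalAlloc = 18 * 1024 * 1024;  // Open() failure threshold

// The file cache takes priority; the block cache receives whatever
// remains of the database's allocation, but never less than 2 MB.
uint64_t BlockCacheBudget(uint64_t db_allocation, uint64_t file_cache_used) {
    uint64_t remainder = (file_cache_used < db_allocation)
                             ? db_allocation - file_cache_used
                             : 0;
    return remainder > kMinBlockCache ? remainder : kMinBlockCache;
}

// Open() refuses allocations below the 18 MB minimum.
bool OpenWouldSucceed(uint64_t db_allocation) {
    return db_allocation >= kMinTotalAlloc;
}
```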

Two new unit tests support the branch's changes: cache2_test and flexcache_test.

leveldb::Options changes

The Options structure contains two new option values: is_internal_db (bool) and total_leveldb_mem (uint64_t). Two old values remain but are now deprecated / unused: max_open_files (int) and block_cache (Cache *). (Actually, block_cache is still actively used in some unit tests, but will likely go away once the unit tests are updated.)

  • is_internal_db is a flag used by Riak to mark Riak's internally generated databases. Internal databases get only a 20% share of the total cache memory allocation.

  • total_leveldb_mem is the total memory to be used across all open databases and their file/block caches. The number is internally adjusted down to also account for memory used by write buffers and log files per database. The adjustment is necessary because the database count can swing widely in Riak environments (for example, 32 vnodes becoming 64 during handoff) and those per-database objects consume significant memory.
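The downward adjustment described above amounts to subtracting an estimate of per-database overhead before splitting the remainder among caches. This is a hypothetical sketch of that arithmetic; the function name, parameters, and the idea of a single flat per-database overhead figure are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// Hypothetical sketch: reduce total_leveldb_mem by the estimated
// write-buffer and log-file overhead of each open database before
// any cache allocation occurs. Clamps to zero rather than underflowing.
uint64_t CacheMemAfterOverhead(uint64_t total_leveldb_mem,
                               size_t open_db_count,
                               uint64_t per_db_overhead) {
    uint64_t overhead = per_db_overhead * open_db_count;
    return (overhead < total_leveldb_mem) ? total_leveldb_mem - overhead : 0;
}
```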

leveldb::DBList()

Google's leveldb implementation did not contain a mechanism for intercommunication between database objects and/or background activities. This branch adds a global DBList() interface that keeps a list of all active databases. It also contains simple methods that iterate the database list, calling the same function on each database.

The current design creates two std::sets: one for user databases and one for internal databases. The use of two std::sets seemed intuitively obvious during the design and early coding, but the final usage patterns suggest a single unified set would suffice. That potential cleanup is left to the future.

DBList() tracks current DBImpl objects, so the list is maintained in an obvious fashion: the DBImpl constructor calls DBList()->AddDB, and the DBImpl destructor calls DBList()->ReleaseDB. There is no other maintenance.
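The constructor/destructor registration pattern can be sketched as below. This is a simplified stand-in for the branch's classes, keeping only the AddDB/ReleaseDB names from the text; everything else (the accessor shape, the mutex in place of the spinlock, the count method) is illustrative.

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <cstddef>

class DBImpl;  // forward declaration

// Simplified registry: two sets, one per database tier, guarded by a
// mutex standing in for the DBList spinlock.
class DBListImpl {
public:
    void AddDB(DBImpl* db, bool is_internal) {
        std::lock_guard<std::mutex> guard(mu_);
        (is_internal ? internal_ : user_).insert(db);
    }
    void ReleaseDB(DBImpl* db, bool is_internal) {
        std::lock_guard<std::mutex> guard(mu_);
        (is_internal ? internal_ : user_).erase(db);
    }
    size_t UserCount() {
        std::lock_guard<std::mutex> guard(mu_);
        return user_.size();
    }
private:
    std::mutex mu_;
    std::set<DBImpl*> user_, internal_;
};

// Global accessor, mirroring the DBList() interface described above.
DBListImpl* DBList() {
    static DBListImpl list;
    return &list;
}

// Registration happens only in the constructor and destructor:
// there is no other list maintenance.
class DBImpl {
public:
    explicit DBImpl(bool is_internal) : internal_(is_internal) {
        DBList()->AddDB(this, internal_);
    }
    ~DBImpl() {
        DBList()->ReleaseDB(this, internal_);
    }
private:
    bool internal_;
};
```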

DBList()->ScanDBs is a method that walks all user or internal databases and calls a DBImpl method on each. The loop is written out by hand instead of using std::for_each so that the DBList spinlock can be released before each database method call. Failing to release the spinlock could result in a deadlock scenario.
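One way to realize the lock-release discipline above is to snapshot the registered pointers under the lock and invoke each method with the lock dropped, so a callback that re-enters DBList() cannot deadlock. This is a hedged sketch with simplified types (a std::mutex in place of the spinlock, a toy Db struct), not the branch's actual ScanDBs implementation.

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <vector>

// Toy stand-in for DBImpl, counting how often its method runs.
struct Db {
    int calls = 0;
    void Touch() { ++calls; }
};

std::mutex g_mu;       // stands in for the DBList spinlock
std::set<Db*> g_dbs;   // stands in for one of the two database sets

// Copy the pointers while holding the lock, then call each database's
// method with the lock released, so re-entrant callbacks cannot deadlock.
void ScanDBs(void (Db::*method)()) {
    std::vector<Db*> snapshot;
    {
        std::lock_guard<std::mutex> guard(g_mu);  // held only for the copy
        snapshot.assign(g_dbs.begin(), g_dbs.end());
    }
    for (Db* db : snapshot)
        (db->*method)();                          // lock is released here
}
```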

DBList() is implicitly tested via the routines in cache2_test and flexcache_test.