We adopt four real datasets from SOSD [1]. Specifically:
- fb: A set of user IDs randomly sampled from Facebook [2].
- wiki: A set of timestamps of edits committed to Wikipedia [3].
- books: A dataset of book popularity from Amazon.
- osm_cellids: A set of cell IDs from OpenStreetMap [4].
We also generate 3 synthetic datasets by sampling from uniform, normal, and log-normal distributions, following a process similar to [1, 5].
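As a rough illustration of this step, the sketch below draws keys from each of the three distributions, then sorts and de-duplicates them to obtain a sorted key set. The distribution parameters (means, scales) and the fixed seed are our own assumptions for illustration, not the exact values used to build the benchmark datasets.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <string>
#include <vector>

// Draw n 64-bit keys from the requested distribution, then sort and
// de-duplicate them so they can serve as a sorted key set.
std::vector<uint64_t> generate_keys(std::size_t n, const std::string& dist) {
    std::mt19937_64 gen(42);  // arbitrary fixed seed for reproducibility
    std::uniform_int_distribution<uint64_t> uni(0, std::numeric_limits<uint64_t>::max());
    std::normal_distribution<double> norm(1e18, 1e17);      // assumed mean/stddev
    std::lognormal_distribution<double> lognorm(0.0, 2.0);  // assumed parameters
    std::vector<uint64_t> keys;
    keys.reserve(n);
    while (keys.size() < n) {
        if (dist == "uniform") { keys.push_back(uni(gen)); continue; }
        double v = (dist == "normal") ? norm(gen) : lognorm(gen) * 1e9;  // scale into uint64 range
        if (v >= 0.0 && v < 1.8e19) keys.push_back(static_cast<uint64_t>(v));
    }
    std::sort(keys.begin(), keys.end());
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
    return keys;  // may end up slightly smaller than n after de-duplication
}
```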
All keys are stored as 64-bit unsigned integers (uint64_t in C++). Table 1 summarizes the dataset statistics.
Table 1: Statistics of benchmark datasets.
| Dataset | Category | Keys | Raw Size | | |
|---|---|---|---|---|---|
| fb | Real | 200 M | 1.6 GB | 3.88 | 94 |
| wiki | Real | 200 M | 1.6 GB | 1.77 | 877 |
| books | Real | 800 M | 6.4 GB | 5.39 | 101 |
| osm | Real | 800 M | 6.4 GB | 1.91 | 129 |
| uniform | Synthetic | 400 M | 3.2 GB | Varied | N.A. |
| normal | Synthetic | 400 M | 3.2 GB | Varied | N.A. |
| lognormal | Synthetic | 400 M | 3.2 GB | Varied | N.A. |
References:
[1] Marcus, et al. SOSD: A Benchmark Suite for Similarity Search over Sorted Data. PVLDB, 2020.
[2] Sandt, et al. Facebook's Data Infrastructure at Scale. SIGMOD, 2019.
[3] Wikidata. https://www.wikidata.org/wiki/Wikidata:Main_Page
[4] OpenStreetMap. https://www.openstreetmap.org/
[5] Zhang, et al. Making Learned Indexes Practical: A Comprehensive Study on Data Distribution and Model Selection. PVLDB, 2024.
RMI code: https://github.com/learnedsystems/RMI/tree/master
This is a reference implementation of recursive model indexes (RMIs). A prototype RMI was first described in "The Case for Learned Index Structures" (https://arxiv.org/abs/1712.01208) by Kraska et al. in 2017.
To build the RMIs (using the wiki dataset as an example), first run the optimizer to produce a parameter grid, then train the models it suggests:

```sh
# Step 1: search for RMI configurations and write them to JSON_PATH (here optimizer_out_wiki.json).
cargo run --release -- --optimize optimizer_out_wiki.json wiki_200M_uint64

# Step 2: train the RMIs from that parameter grid and save them.
cargo run --release -- wiki_200M_uint64 --param-grid optimizer_out_wiki.json -d YOUR_RMI_SAVE_FOLDER --threads 8 --zero-build-time
```
The RMI tool will then generate 9 models in YOUR_RMI_SAVE_FOLDER. For the fb dataset, for example, this includes 9 RMI parameter files (fb_200M_uint64_i_L1_PARAMETERS, i = 0, ..., 8) and 27 RMI code files: each model produces 3 code files, named fb_200M_uint64_i_data.h, fb_200M_uint64_i.cpp, and fb_200M_uint64_i.h (i = 0, ..., 8).
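To give a sense of how these generated files are consumed, here is a minimal, hypothetical lookup sketch. The namespace and function names (load, lookup with an error bound, cleanup) follow the convention we have seen in RMI-generated headers, but the exact signatures depend on the RMI version; treat them as assumptions and check the generated fb_200M_uint64_0.h.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

#include "fb_200M_uint64_0.h"  // one of the generated models (API below is assumed)

// Predict a position with the RMI, then do a "last-mile" binary search
// restricted to [guess - err, guess + err] in the sorted key array.
std::size_t rmi_lookup(const std::vector<uint64_t>& keys, uint64_t key) {
    std::size_t err = 0;
    uint64_t guess = fb_200M_uint64_0::lookup(key, &err);  // assumed signature
    std::size_t lo = guess > err ? static_cast<std::size_t>(guess - err) : 0;
    lo = std::min(lo, keys.size());
    std::size_t hi = std::min<std::size_t>(guess + err + 1, keys.size());
    auto it = std::lower_bound(keys.begin() + lo, keys.begin() + hi, key);
    return static_cast<std::size_t>(it - keys.begin());
}

int main() {
    std::vector<uint64_t> keys; /* load the sorted fb keys into memory here */
    fb_200M_uint64_0::load("YOUR_RMI_SAVE_FOLDER");  // reads the *_L1_PARAMETERS file (assumed)
    std::size_t pos = rmi_lookup(keys, 123456789ULL);
    fb_200M_uint64_0::cleanup();
    (void)pos;
    return 0;
}
```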
Our work focuses on in-memory, read-heavy workloads. Given a key set K, we generate the query workload by randomly sampling S keys (S = 5,000 by default) from K. To simulate different access patterns, we sample lookup keys from two distributions (a minimal sampling sketch is given after this list):
- Uniform: Every key in K has an equal likelihood of being sampled.
- Zipfian: The probability of sampling the i-th key in K is p(i) = i^{-α} / ∑_{j=1}^{N} j^{-α}, where N = |K|. For the Zipfian workload we set α = 1.3 by default, so that over 90% of index accesses fall on keys with rank in (0, 10^3].
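The snippet below is a minimal sketch of these two samplers, not the actual scripts shipped in the repository; the fixed seed is arbitrary, and for the very large key sets a more memory-frugal Zipfian sampler would be preferable to std::discrete_distribution.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Uniform workload: every key in K is equally likely to be sampled.
std::vector<uint64_t> sample_uniform(const std::vector<uint64_t>& K, std::size_t S) {
    std::mt19937_64 gen(7);  // arbitrary seed
    std::uniform_int_distribution<std::size_t> idx(0, K.size() - 1);
    std::vector<uint64_t> queries(S);
    for (auto& q : queries) q = K[idx(gen)];
    return queries;
}

// Zipfian workload: p(i) proportional to i^{-alpha}, so low-rank (hot) keys dominate.
std::vector<uint64_t> sample_zipfian(const std::vector<uint64_t>& K, std::size_t S,
                                     double alpha = 1.3) {
    std::vector<double> weights(K.size());
    for (std::size_t i = 0; i < K.size(); ++i)
        weights[i] = std::pow(static_cast<double>(i + 1), -alpha);
    std::discrete_distribution<std::size_t> idx(weights.begin(), weights.end());  // normalizes weights
    std::mt19937_64 gen(7);
    std::vector<uint64_t> queries(S);
    for (auto& q : queries) q = K[idx(gen)];
    return queries;
}
```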
The detailed sampling scripts for these two query workloads are in the main_<dataset> folder, where <dataset> is one of fb, wiki, books, osm, uniform, normal, or lognormal.
To run the benchmarks:
```sh
make -f Makefile_all run_all
```