We adopt four real datasets from SOSD [1]. Specifically:
- fb: A set of user IDs randomly sampled from Facebook [2].
- wiki: A set of timestamps of edits committed to Wikipedia [3].
- books: A dataset of book popularity from Amazon.
- osm_cellids: A set of cell IDs from OpenStreetMap [4].
We also generate 3 synthetic datasets by sampling from uniform, normal, and log-normal distributions, following a process similar to [1, 5].
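As a rough illustration of this step, the sketch below draws keys from each of the three distributions, then sorts and de-duplicates them to obtain a sorted key set. The distribution parameters (means, scales) and the fixed seed are our own assumptions for illustration, not the exact values used to build the benchmark datasets.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <string>
#include <vector>

// Draw n 64-bit keys from the requested distribution, then sort and
// de-duplicate them so they can serve as a sorted key set.
std::vector<uint64_t> generate_keys(std::size_t n, const std::string& dist) {
    std::mt19937_64 gen(42);  // arbitrary fixed seed for reproducibility
    std::uniform_int_distribution<uint64_t> uni(0, std::numeric_limits<uint64_t>::max());
    std::normal_distribution<double> norm(1e18, 1e17);      // assumed mean/stddev
    std::lognormal_distribution<double> lognorm(0.0, 2.0);  // assumed parameters
    std::vector<uint64_t> keys;
    keys.reserve(n);
    while (keys.size() < n) {
        if (dist == "uniform") { keys.push_back(uni(gen)); continue; }
        double v = (dist == "normal") ? norm(gen) : lognorm(gen) * 1e9;  // scale into uint64 range
        if (v >= 0.0 && v < 1.8e19) keys.push_back(static_cast<uint64_t>(v));
    }
    std::sort(keys.begin(), keys.end());
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
    return keys;  // may end up slightly smaller than n after de-duplication
}
```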
All keys are stored as 64-bit unsigned integers (uint64_t in C++). Table 1 summarizes the dataset statistics.
Table 1: Statistics of benchmark datasets.
| Dataset | Category | Keys | Raw Size | | |
|---|---|---|---|---|---|
| fb | Real | 200 M | 1.6 GB | 3.88 | 94 |
| wiki | Real | 200 M | 1.6 GB | 1.77 | 877 |
| books | Real | 800 M | 6.4 GB | 5.39 | 101 |
| osm | Real | 800 M | 6.4 GB | 1.91 | 129 |
| uniform | Synthetic | 400 M | 3.2 GB | Varied | N.A. |
| normal | Synthetic | 400 M | 3.2 GB | Varied | N.A. |
| lognormal | Synthetic | 400 M | 3.2 GB | Varied | N.A. |
References:
[1] Marcus, et al. SOSD: A Benchmark Suite for Similarity Search over Sorted Data. PVLDB, 2020.
[2] Sandt, et al. Facebook's Data Infrastructure at Scale. SIGMOD, 2019.
[3] Wikidata. https://www.wikidata.org/wiki/Wikidata:Main_Page
[4] OpenStreetMap. https://www.openstreetmap.org/
[5] Zhang, et al. Making Learned Indexes Practical: A Comprehensive Study on Data Distribution and Model Selection. PVLDB, 2024.
RMI code: https://github.com/learnedsystems/RMI/tree/master
This is a reference implementation of recursive model indexes (RMIs). A prototype RMI was first described in "The Case for Learned Index Structures" (https://arxiv.org/abs/1712.01208) by Kraska et al. in 2017.
To build the RMIs (using the wiki dataset as an example), first run the optimizer to produce a parameter grid, then train the models it suggests:

```sh
# Step 1: search for RMI configurations and write them to JSON_PATH (here optimizer_out_wiki.json).
cargo run --release -- --optimize optimizer_out_wiki.json wiki_200M_uint64

# Step 2: train the RMIs from that parameter grid and save them.
cargo run --release -- wiki_200M_uint64 --param-grid optimizer_out_wiki.json -d YOUR_RMI_SAVE_FOLDER --threads 8 --zero-build-time
```
The RMI tool will then generate 9 models in YOUR_RMI_SAVE_FOLDER. For the fb dataset, for example, this includes 9 RMI parameter files (fb_200M_uint64_i_L1_PARAMETERS, i = 0, ..., 8) and 27 RMI code files: each model produces 3 code files, named fb_200M_uint64_i_data.h, fb_200M_uint64_i.cpp, and fb_200M_uint64_i.h (i = 0, ..., 8).
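To give a sense of how these generated files are consumed, here is a minimal, hypothetical lookup sketch. The namespace and function names (load, lookup with an error bound, cleanup) follow the convention we have seen in RMI-generated headers, but the exact signatures depend on the RMI version; treat them as assumptions and check the generated fb_200M_uint64_0.h.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

#include "fb_200M_uint64_0.h"  // one of the generated models (API below is assumed)

// Predict a position with the RMI, then do a "last-mile" binary search
// restricted to [guess - err, guess + err] in the sorted key array.
std::size_t rmi_lookup(const std::vector<uint64_t>& keys, uint64_t key) {
    std::size_t err = 0;
    uint64_t guess = fb_200M_uint64_0::lookup(key, &err);  // assumed signature
    std::size_t lo = guess > err ? static_cast<std::size_t>(guess - err) : 0;
    lo = std::min(lo, keys.size());
    std::size_t hi = std::min<std::size_t>(guess + err + 1, keys.size());
    auto it = std::lower_bound(keys.begin() + lo, keys.begin() + hi, key);
    return static_cast<std::size_t>(it - keys.begin());
}

int main() {
    std::vector<uint64_t> keys; /* load the sorted fb keys into memory here */
    fb_200M_uint64_0::load("YOUR_RMI_SAVE_FOLDER");  // reads the *_L1_PARAMETERS file (assumed)
    std::size_t pos = rmi_lookup(keys, 123456789ULL);
    fb_200M_uint64_0::cleanup();
    (void)pos;
    return 0;
}
```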
Our work focuses on in-memory, read-heavy workloads. Given a key set K, we generate the query workload by randomly sampling S keys (S = 5,000 by default) from K. To simulate different access patterns, we sample lookup keys from two distributions (a minimal sampling sketch is given after this list):
- Uniform: Every key in K has an equal likelihood of being sampled.
- Zipfian: The probability of sampling the i-th key in K is p(i) = i^{-α} / ∑_{j=1}^{N} j^{-α}, where N = |K|. For the Zipfian workload we set α = 1.3 by default, so that over 90% of index accesses fall on keys with rank in (0, 10^3].
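The snippet below is a minimal sketch of these two samplers, not the actual scripts shipped in the repository; the fixed seed is arbitrary, and for the very large key sets a more memory-frugal Zipfian sampler would be preferable to std::discrete_distribution.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Uniform workload: every key in K is equally likely to be sampled.
std::vector<uint64_t> sample_uniform(const std::vector<uint64_t>& K, std::size_t S) {
    std::mt19937_64 gen(7);  // arbitrary seed
    std::uniform_int_distribution<std::size_t> idx(0, K.size() - 1);
    std::vector<uint64_t> queries(S);
    for (auto& q : queries) q = K[idx(gen)];
    return queries;
}

// Zipfian workload: p(i) proportional to i^{-alpha}, so low-rank (hot) keys dominate.
std::vector<uint64_t> sample_zipfian(const std::vector<uint64_t>& K, std::size_t S,
                                     double alpha = 1.3) {
    std::vector<double> weights(K.size());
    for (std::size_t i = 0; i < K.size(); ++i)
        weights[i] = std::pow(static_cast<double>(i + 1), -alpha);
    std::discrete_distribution<std::size_t> idx(weights.begin(), weights.end());  // normalizes weights
    std::mt19937_64 gen(7);
    std::vector<uint64_t> queries(S);
    for (auto& q : queries) q = K[idx(gen)];
    return queries;
}
```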
The detailed sampling scripts for these two query workloads are in the main_<dataset> folder, where <dataset> is one of fb, wiki, books, osm, uniform, normal, or lognormal.
To run the benchmarks:
```sh
make -f Makefile_all run_all
```