The Wikipedia Goggle is a search engine for the English Wikipedia, using a trigram index and a ranking algorithm similar to Google's original PageRank, implemented in modern C++.
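To give a feel for what a trigram index does, here is a minimal sketch. It is not Goggle's actual implementation: the `TrigramIndex` class and its `Add` and `Candidates` methods are hypothetical names chosen for illustration, and it assumes byte-level trigrams and documents added in increasing id order.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy trigram index; names are illustrative, not Goggle's API.
class TrigramIndex {
 public:
  // Registers every 3-byte window of `text` under document `id`.
  // Assumes ids are added in increasing order, which keeps posting lists sorted.
  void Add(std::uint32_t id, std::string_view text) {
    for (std::size_t i = 0; i + 3 <= text.size(); ++i) {
      auto& postings = postings_[std::string(text.substr(i, 3))];
      // Skip the id if this document already recorded the same trigram.
      if (postings.empty() || postings.back() != id) postings.push_back(id);
    }
  }

  // Returns the ids of documents that contain every trigram of `query`.
  // These are only candidates: a document can contain all trigrams without
  // containing the query as a contiguous substring.
  std::vector<std::uint32_t> Candidates(std::string_view query) const {
    if (query.size() < 3) return {};  // too short to form a trigram
    std::vector<std::uint32_t> result;
    for (std::size_t i = 0; i + 3 <= query.size(); ++i) {
      const auto it = postings_.find(std::string(query.substr(i, 3)));
      if (it == postings_.end()) return {};  // unseen trigram: no match
      if (i == 0) {
        result = it->second;
        continue;
      }
      // Intersect the running result with this trigram's sorted posting list.
      std::vector<std::uint32_t> merged;
      std::set_intersection(result.begin(), result.end(),
                            it->second.begin(), it->second.end(),
                            std::back_inserter(merged));
      result = std::move(merged);
    }
    return result;
  }

 private:
  std::unordered_map<std::string, std::vector<std::uint32_t>> postings_;
};
```

With documents `1: "finland"`, `2: "finnish sauna"` and `3: "iceland"` added, `Candidates("land")` yields ids 1 and 3. A ranking step (such as a PageRank-style score) would then order those candidates.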
- Install Bazel from https://bazel.build/install (you can do `brew install bazel` if you have Homebrew)
- Clone the git repo
git clone https://github.com/aapeliv/goggle.git
- Download files from the Wikipedia data dump. For example, to get a sample dump from April 1st 2022, go to https://dumps.wikimedia.org/enwiki/20220401/, download a partial dump together with its index file, and extract the index file.
cd data
# download the data dump
wget https://dumps.wikimedia.org/enwiki/20220401/enwiki-20220401-pages-articles-multistream1.xml-p1p41242.bz2
# download the data index file
wget https://dumps.wikimedia.org/enwiki/20220401/enwiki-20220401-pages-articles-multistream-index1.txt-p1p41242.bz2
# extract the index
bunzip2 enwiki-20220401-pages-articles-multistream-index1.txt-p1p41242.bz2
- Run the tests
bazel test //...
- Start the indexer and backend with
bazel run //src:goggle
- Once the backend comes up with the message `Serving on 8080.`, you can test it with a query such as
curl "http://localhost:8080/query?q=finland"
- Build an optimized binary
bazel build --config=optz //src:goggle
- Build an optimized frontend
cd frontend/
npm run build
- Get a TLS certificate and key and place them in the working directory
- Download the full Wikipedia dump and index
- Run the full thing
./bazel-bin/src/goggle \
--db_dir=prod_db/ \
--dump_file path/to/articles-multistream.xml.bz2 \
--index_file path/to/articles-multistream-index.txt \
--enable_tls \
--server_cert path/to/cert.pem \
--server_key path/to/key.pem \
--frontend_server_dir frontend/build/
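
For reference, the extracted multistream index passed via `--index_file` is a plain text file with one `offset:page_id:title` entry per line, where the offset is the byte position of the compressed stream holding that page. The sketch below shows one way such a line could be parsed. The `IndexEntry` struct and `ParseIndexLine` function are hypothetical names, and Goggle's own parsing may differ.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical record for one line of the multistream index file.
struct IndexEntry {
  std::uint64_t stream_offset;  // byte offset of the bz2 stream in the dump
  std::uint64_t page_id;        // Wikipedia page id
  std::string title;            // page title (may itself contain ':')
};

// Parses a single "offset:page_id:title" line.
// Returns std::nullopt if the line is malformed.
std::optional<IndexEntry> ParseIndexLine(std::string_view line) {
  // Only the first two colons are separators; the title keeps any others.
  const std::size_t first = line.find(':');
  if (first == std::string_view::npos) return std::nullopt;
  const std::size_t second = line.find(':', first + 1);
  if (second == std::string_view::npos) return std::nullopt;

  // Accepts a field only if it is entirely a decimal number.
  const auto parse_number = [](std::string_view field, std::uint64_t& out) {
    const char* end = field.data() + field.size();
    const auto [ptr, ec] = std::from_chars(field.data(), end, out);
    return ec == std::errc() && ptr == end;
  };

  IndexEntry entry{};
  if (!parse_number(line.substr(0, first), entry.stream_offset)) {
    return std::nullopt;
  }
  if (!parse_number(line.substr(first + 1, second - first - 1),
                    entry.page_id)) {
    return std::nullopt;
  }
  entry.title = std::string(line.substr(second + 1));
  return entry;
}
```

For a made-up line such as `100:7:Example: Title`, this returns offset 100, page id 7 and the title `Example: Title`. Splitting on only the first two colons matters because page titles can contain colons themselves.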