Vector Engine creates embeddings from Daily Maverick articles, stores them in a database, and provides a REST API for performing comparison operations on the embeddings.
It consists of three primary components:
- The bulk vectorizer exports all the articles from Daily Maverick and runs them through the vectorization process.
- The article vectorizer takes a single article ID and runs it through the vectorization process. If the article is not in the database, it will be added. If it is already in the database, it will be updated.
- The REST API provides endpoints to find similar articles, to perform natural-language searches, and to trigger vectorizing individual articles.
The article vectorization process is as follows:
- The article is extracted from the RevEngine API.
- The article is cleaned. We remove HTML tags, scripts and WordPress shortcodes. We also normalize paragraph breaks.
- The article is chunked if necessary. We prefer full articles, but if the article is too long, we will split it into chunks. We add a header to the start of each chunk so that they relate to one another.
- Each chunk is vectorized using the all-minilm model.
- The vectors are stored with metadata in a Qdrant database.
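The cleaning and chunking steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names cleanArticle and chunkArticle, the regular expressions, the 1000-character limit and the "part N of M" header format are all assumptions. Fetching from RevEngine, embedding with all-minilm and writing to Qdrant are left out.

```javascript
// Hypothetical sketch: strip scripts, HTML tags and shortcodes, then normalize paragraph breaks.
function cleanArticle(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop script blocks, including their content
    .replace(/<\/p>|<br\s*\/?>/gi, "\n\n")      // treat paragraph ends and line breaks as paragraph breaks
    .replace(/<[^>]+>/g, "")                    // strip any remaining HTML tags
    .replace(/\[\/?[a-zA-Z][^\]]*\]/g, "")      // strip [shortcode attr="x"] and [/shortcode]
    .replace(/[ \t]+/g, " ")                    // collapse runs of spaces and tabs
    .replace(/\s*\n\s*\n\s*/g, "\n\n")          // normalize paragraph breaks to one blank line
    .trim();
}

// Hypothetical sketch: keep the full article when it fits, otherwise split on
// paragraph boundaries and prefix each chunk with a header that ties the chunks together.
function chunkArticle(title, text, maxChars = 1000) {
  if (text.length <= maxChars) return [text]; // prefer the full article
  const chunks = [];
  let current = "";
  for (const paragraph of text.split("\n\n")) {
    // Start a new chunk when adding this paragraph would exceed the limit.
    if (current && current.length + paragraph.length + 2 > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? current + "\n\n" + paragraph : paragraph;
  }
  if (current) chunks.push(current);
  return chunks.map(
    (chunk, i) => `${title} (part ${i + 1} of ${chunks.length})\n\n${chunk}`
  );
}
```

Note that in this sketch a single paragraph longer than maxChars is kept as one oversized chunk rather than being split further.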
- Qdrant
- Redis
- Node.js
- Clone the repository.
- Run npm install.
- Create a .env file in the root directory of the project.
- Add the following variables to the .env file:
  - JXP_SERVER: The URL of the JXP server
  - JXP_API_KEY: The API key for the JXP server
  - MONGO_URI: The URI of the MongoDB server (default: mongodb://localhost:27017)
  - MONGO_DB_NAME: The name of the MongoDB database (default: dm)
  - REDIS_HOST: The host of the Redis server (default: localhost)
  - REDIS_PORT: The port of the Redis server (default: 6379)
  - REDIS_PASSWORD: The password for the Redis server (default: no password)
  - HOST: The host of the server (default: 127.0.0.1)
  - PORT: The port of the server (default: 8001)
- Start the server with npm start.
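To make the defaults concrete, here is a small sketch of how the variables listed above could be read with their documented fallbacks. The loadConfig function and its property names are hypothetical, and how the project actually loads its .env file is not shown here.

```javascript
// Hypothetical sketch: read the documented variables and apply the documented defaults.
// Pass a plain object instead of process.env to try it without touching the environment.
function loadConfig(env = process.env) {
  return {
    jxpServer: env.JXP_SERVER,                 // no default documented
    jxpApiKey: env.JXP_API_KEY,                // no default documented
    mongoUri: env.MONGO_URI || "mongodb://localhost:27017",
    mongoDbName: env.MONGO_DB_NAME || "dm",
    redisHost: env.REDIS_HOST || "localhost",
    redisPort: Number(env.REDIS_PORT || 6379),
    redisPassword: env.REDIS_PASSWORD || undefined, // no password by default
    host: env.HOST || "127.0.0.1",
    port: Number(env.PORT || 8001),
  };
}
```

Calling loadConfig({}) returns only the defaults, which is a quick way to check what the server would use when the .env file leaves a variable out.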
The project includes a Dockerfile and a docker-compose.yml file. To run the project in Docker, run the following commands:
docker-compose build
docker-compose up -d
If you want to import the articles into the Docker MongoDB instance, first copy your articles.bson and articles.metadata.json files to ./mongodb_dump. Then run the following command:
docker exec -i vectorizer-mongodb-1 /usr/bin/mongorestore --uri "mongodb://mongodb" -d dm -c articles /data/mongodb_dump/articles.bson
To bulk vectorize all the articles, run the following command:
node src/bin/vectorize.js -a
The process should take a few hours to complete. It creates a folder called "articles", which contains the articles at each step of the process. If you want to resume the process from where it stopped, do not delete this folder; the command will then skip most of the work. If you want to start over, delete the folder and rerun the command.
You can run just one step of the process at a time. Run the following to get all the options:
node src/bin/vectorize.js -h
The API server will be available at http://localhost:8001, or whatever host and port you have set in the .env file.
The API provides the following endpoints:
ID can be either a WordPress post ID or a RevEngine article ID. The endpoint returns the 5 articles from the last 30 days that are most similar to the article with the provided ID.
Note that the article must have already been vectorized and stored in the DB.
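A minimal sketch of calling this endpoint from Node.js, assuming it is served as GET /similar/:id on the default host and port. The helper names similarUrl and getSimilar are hypothetical, the global fetch requires Node.js 18 or later, and the shape of the JSON response is not documented here, so it is returned as-is.

```javascript
// Hypothetical helper: build the URL for the similar-articles endpoint.
function similarUrl(id, baseUrl = "http://localhost:8001") {
  return `${baseUrl}/similar/${encodeURIComponent(id)}`;
}

// Hypothetical helper: request similar articles and return the parsed JSON body.
async function getSimilar(id, baseUrl) {
  const response = await fetch(similarUrl(id, baseUrl));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Usage (assumes the server is running and article 1234 has been vectorized):
// getSimilar(1234).then((articles) => console.log(articles));
```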
This is a more advanced endpoint that can include recommendations based on reading history. It can also be used to find articles from a specific date range, section and/or tag.
Body:
{
"post_id": 1234, # Either a WordPress post ID
"revengine_id": 5678, # Or a RevEngine article ID
"limit": 5,
"history": [],
"previous_days": 30,
"section": "section name",
"tag": "tag name",
"date_start": "2024-01-01",
"date_end": "2024-01-31"
}
The body of the request should be a JSON object with the following properties:
{
"query": "search query",
"limit": 5,
"previous_days": 30,
"section": "section name",
"tag": "tag name",
"date_start": "2024-01-01",
"date_end": "2024-01-31",
"author": "author name"
}
The endpoint will return the 5 most similar articles to the provided search query.
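The request body above can be assembled and sent from Node.js along these lines. This is a sketch under assumptions: buildSearchBody and searchArticles are hypothetical helper names, the base URL is the documented default, and the response is returned without interpreting its structure.

```javascript
// Hypothetical helper: combine the query with optional filters, dropping unset ones
// so that only the properties the caller actually provided are sent.
function buildSearchBody(query, filters = {}) {
  const provided = Object.entries(filters).filter(
    ([, value]) => value !== null && value !== undefined
  );
  return { query, ...Object.fromEntries(provided) };
}

// Hypothetical helper: POST the body to /search and return the parsed JSON response.
async function searchArticles(query, filters, baseUrl = "http://localhost:8001") {
  const response = await fetch(`${baseUrl}/search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildSearchBody(query, filters)),
  });
  if (!response.ok) {
    throw new Error(`Search failed with status ${response.status}`);
  }
  return response.json();
}

// Usage (assumes the server is running):
// searchArticles("load shedding", { limit: 10, section: "South Africa" }).then(console.log);
```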
Works the same as the POST /search endpoint, but the query parameters are passed in the URL.
For example: /search/who%20won%20the%20trump%20harris%20debate?limit=10&section=South%20Africa
Parameters:
{
"limit": 5,
"previous_days": 30,
"section": null,
"tag": null,
"date_start": null,
"date_end": null,
"author": null
}
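Because the query text goes into the URL path, it has to be percent-encoded. The sketch below shows one way to build such a URL. The function name buildSearchUrl is hypothetical, and parameters left as null are simply omitted on the assumption that the server then falls back to the values listed above.

```javascript
// Hypothetical helper: build a /search/:query URL with optional query-string parameters.
function buildSearchUrl(query, params = {}, baseUrl = "http://localhost:8001") {
  const queryString = Object.entries(params)
    .filter(([, value]) => value !== null && value !== undefined) // skip unset parameters
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  const path = `${baseUrl}/search/${encodeURIComponent(query)}`;
  return queryString ? `${path}?${queryString}` : path;
}

// Usage:
// buildSearchUrl("who won the trump harris debate", { limit: 10, section: "South Africa" });
```

encodeURIComponent is used for both the path segment and the parameter values so that spaces become %20, matching the example URL style used in this README.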
ID can be either a WordPress post ID or a RevEngine article ID. The endpoint will vectorize the article with the provided ID.
Body:
{
"post_id": 1234, # Either a WordPress post ID
"revengine_id": 5678 # Or a RevEngine article ID
}
This endpoint will clear the Redis cache.
Cached responses expire after 24 hours. Response time for the /similar/:id endpoint can be as high as ~200ms for an uncached request, but should drop to ~10ms for a cached request.
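The caching behaviour described above (entries that expire after 24 hours, plus a way to clear everything) can be illustrated with a small in-memory sketch. This is not the project's Redis code: a Map stands in for Redis, and createTtlCache, the key format and the injectable clock are all assumptions made for illustration.

```javascript
// Hypothetical sketch: a time-limited cache with the same expiry idea as the Redis cache.
// `now` is injectable so that expiry can be demonstrated without waiting 24 hours.
function createTtlCache(ttlMs = 24 * 60 * 60 * 1000, now = Date.now) {
  const store = new Map();
  return {
    get(key) {
      const entry = store.get(key);
      if (!entry) return undefined;     // never cached
      if (entry.expiresAt <= now()) {   // cached, but older than the TTL
        store.delete(key);
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      store.set(key, { value, expiresAt: now() + ttlMs });
    },
    clear() {
      store.clear();                    // comparable in effect to clearing the Redis cache
    },
  };
}

// Usage:
// const cache = createTtlCache();
// cache.set("similar:1234", results);
// cache.get("similar:1234"); // returns results until 24 hours have passed
```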
This project is licensed under the MIT License - see the LICENSE.md file for details.