This is a crypto news sentiment prediction application: cryptobarometer.org.
Below is an architecture sketch of the application's backend.
The general idea of the workflow is the following (bolded verbs below correspond to arrows in the diagram above):
- Database (PostgreSQL) **stores** 3 tables: one for raw news titles (`title_id`, `title`, `source`, timestamps), one more for model predictions (`title_id`, `negative`, `neutral`, `positive`, `predicted_class`, `entropy`), and a third one for labeled news titles (`title_id`, `label`, timestamps);
- Crawler periodically **scrapes** news from RSS feeds, filters them, and **writes** this data to the Database;
- ML model API service hosts the ML model inference API (model training is not covered here);
- the Model Scorer service periodically **reads** those news titles from the Database that lack model predictions, **invokes** the ML model API for these fresh titles, and **writes** the result to the Database;
- Data Provider service **reads** news titles from the Database with their model predictions and **provides** them to the Frontend via an API;
- Frontend **reads** a metric (the average sentiment score for all news titles over the last 24 hours) from the Data Provider and visualizes it as a barometer. It also **gets** a list of the latest news from the Data Provider, as well as historical news sentiment values, and depicts them;
- Scheduler **launches** Crawler, Model Scorer, and Label Studio on a schedule, e.g. 4 times a day;
- MLflow allows tracking ML experiments and provides a model registry interface, with models being stored in Minio;
- Label Studio **reads** unlabeled data from the Database based on an active learning criterion (e.g. entropy) stored in the `model_predictions` table and **imports** this data into the Label Studio project. Annotators **annotate** the tasks, and then Label Studio **writes** labeled data to the Database (`labeled_news_titles` table).
All components except for the database are packed together and managed by `docker-compose`. See `docker-compose.yml`, which lists all services and associated commands. At the moment, the database is spun up separately, manually.
Preparation:
- install `docker` and `docker-compose`. Tested with Docker version 20.10.14 and docker-compose v2.4.1 (make sure it's docker-compose v2, not v1; we had to add the path to docker-compose to the PATH env variable: `export PATH=/usr/libexec/docker/cli-plugins/:$PATH`);
- put variables from this Notion page (section "Project env variables", limited to project contributors) into the `.env` file, see `.env.example`;
- check `volumes/pgadmin` and `volumes/postgres` permissions; if they are root-only, run `sudo chown -R <username> <folder_name>; sudo chmod -R 777 <folder_name>` for both folders (otherwise, they won't be accessible inside Docker);
- if you have trained a model with the `train` service before, you need to specify the model name and model version in the configs. Otherwise, either run the `train` service first or put the model file into `static/models`; at the moment, the model file `/artifacts/models.logit_tfidf_btc_sentiment.onnx` is stored on the Hostkey machine (limited to project contributors).
To launch the whole application:

```bash
docker compose -f docker-compose.yml --profile production up --build
```

This will launch a React app at `http://<hostname>:3000`; see a screenshot below in the Frontend section.
To set up labeling with Label Studio:

There's one non-automated step: to label data, one needs a Label Studio access token. This token cannot be read via the Label Studio API but rather needs to be communicated to the app manually. To provide this token:
- go to Label Studio (`http://<server-ip>:8080`);
- create an account (it needs to be created from scratch each time `docker-compose` is relaunched);
- go to Account & Settings (upper-right corner in the Label Studio interface) and copy the access token;
- put the access token into the `.env` file as `LABEL_STUDIO_ACCESS_TOKEN`.

For more details, see the Label Studio section.
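For illustration, here is a minimal sketch of how the token can be used against the Label Studio REST API, assuming the `requests` library and that the token is already exported from `.env`; the server address is a placeholder:

```python
import os

import requests

# Read the token the same way the app would (set via the .env file).
token = os.environ["LABEL_STUDIO_ACCESS_TOKEN"]

# Label Studio expects the token in an "Authorization: Token <...>" header.
response = requests.get(
    "http://<server-ip>:8080/api/projects",
    headers={"Authorization": f"Token {token}"},
)
response.raise_for_status()
print(response.json())
```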
To train a model:
- select a model config in the `conf/config.yaml` file. Available configs can be found in the `conf/models` folder;
- place data in the `data` folder and specify the path to the data in the config (`path_to_data` key). This will be replaced with reading from a database;
- make sure that other services are stopped: `docker compose stop` and `docker compose rm`;
- run the following command:

```bash
USER=$(id -u) GROUP=$(id -g) docker compose -f docker-compose.yml --profile train up --build
```
The model checkpoint will be saved at the path specified by the `checkpoint_path` key in the model config. The ONNX model will be saved to the `Minio` service and all metrics will be logged to `MLflow`; another copy of the ONNX model will be saved locally at the path specified by the `path_to_model` key in the model config. If you would like to train a model while other services are running, add the `-d` option to the command shown above.
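As an illustration of this flow, here is a minimal sketch of how a trained model might be logged to MLflow and kept locally; the experiment name, metric value, and tracking URI are assumptions, the config keys are the ones described above:

```python
import mlflow
import mlflow.onnx
import onnx

# Assumed values; in the real service they come from the model config.
path_to_model = "static/models/logit_tfidf_btc_sentiment.onnx"  # path_to_model key
tracking_uri = "http://<server-ip>:8500"  # MLflow tracking server

mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment("btc_sentiment")  # hypothetical experiment name

onnx_model = onnx.load(path_to_model)  # the local copy saved by the trainer

with mlflow.start_run():
    mlflow.log_metric("val_accuracy", 0.9)  # placeholder metric
    # Log the model; with a Minio-backed artifact store, the file lands in Minio.
    mlflow.onnx.log_model(onnx_model, artifact_path="model")
```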
Using GPU to train a model:
- enable GPU access with Compose: https://docs.docker.com/compose/gpu-support/
- set `device: CUDA` in the model's config
- run the following command:

```bash
USER=$(id -u) GROUP=$(id -g) docker compose -f docker-compose.yml -f docker-compose.gpu.yml --profile train up --build
```
The app includes the following components:
| Component | Source | Docker-compose service (`docker-compose.yml`) |
|---|---|---|
| PostgreSQL database | --- | `db` |
| PGAdmin | --- | `pgadmin` |
| Crawler | `crypto_sentiment_demo_app/crawler/` | `crawler` |
| Model API endpoint | `crypto_sentiment_demo_app/model_inference_api/` | `model_inference_api` |
| Model scoring the news | `crypto_sentiment_demo_app/model_scorer/` | `model_scorer` |
| Data provider | `crypto_sentiment_demo_app/data_provider/` | `data_provider` |
| Frontend | `crypto_sentiment_demo_app/frontend/` | `frontend` |
| Scheduler | --- | `scheduler` |
| Label Studio | `crypto_sentiment_demo_app/label_studio/` | `label_studio` |
| MLflow | --- | `mlflow` |
| Model trainer | `crypto_sentiment_demo_app/train/` | `train` |
Below, we go through each one individually.
Source: `db` service defined in `docker-compose.yml`

A PostgreSQL database is set up as a `db` service, see `docker-compose.yml`, with initialization scripts provided in the `db_setup` folder. (Previously, it was also configured without Docker on the machine provided by Hostkey, see this Notion page, limited to project contributors.)
At the moment, there are 3 tables in the `cryptotitles_db` database:

- `news_titles` – for raw news: `(title_id BIGINT PRIMARY KEY, title VARCHAR(511) NOT NULL, source VARCHAR(72), pub_time TIMESTAMP)`;
- `model_predictions` – for model scores produced for each news title: `(title_id BIGINT PRIMARY KEY, negative FLOAT, neutral FLOAT, positive FLOAT, predicted_class INTEGER, entropy FLOAT)`;
- `labeled_news_titles` – for labeled news: `(title_id BIGINT PRIMARY KEY, label FLOAT, pub_time TIMESTAMP)`.

To run the Postgres interactive terminal: `psql -U mlooops -d cryptotitles_db -W` (the password is also mentioned on this Notion page).
Some commands are:

- `\dt` to list tables
- `select * from news_titles`
- `select count(*) from model_predictions`
- etc., see psql shortcuts here.
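The same queries can also be run programmatically; below is a minimal sketch using `psycopg2` (the user and database names are from above, while the host and password placeholders are assumptions):

```python
import psycopg2

# user/dbname are from this README; host/password are placeholders.
conn = psycopg2.connect(
    host="<server-ip>",
    dbname="cryptotitles_db",
    user="mlooops",
    password="<password from the Notion page>",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM model_predictions;")
    print(cur.fetchone()[0])

conn.close()
```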
The data can be ingested from CSV files into Postgres tables with a custom script `crypto_sentiment_demo_app/database/write_df_to_db.sh`. In particular, ~4750 labeled titles are populated this way into the tables `news_titles` and `labeled_news_titles`:

```bash
sh crypto_sentiment_demo_app/database/write_df_to_db.sh -p data/20220606_news_titles_to_import.csv -t news_titles
sh crypto_sentiment_demo_app/database/write_df_to_db.sh -p data/20220606_labeled_news_titles_to_import.csv -t labeled_news_titles
```

The files `data/20220606_*_news_titles_to_import.csv` are produced here – those are basically all the labeled data we have, with only the fields matching those in the database kept.
Source: `pgadmin` service defined in `docker-compose.yml`

Once the app is running (`docker compose -f docker-compose.yml --profile production up --build`), visit `http://<IP_ADDRESS>:8050/browser/` to launch PGAdmin, which is the administration and development platform for PostgreSQL.
Source: `crypto_sentiment_demo_app/crawler/`

Crawler reads ~100 RSS feeds defined in `data/crypto_rss_feeds.txt`, and further filters out non-English text, news without a verb, questions, and short news (refer to the analysis performed here). Then it puts the data (title IDs, titles, sources, publication timestamps) into the `news_titles` table, and puts title IDs into the `model_predictions` table to be later picked up by the `model_scorer` service.
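As a rough illustration of this pipeline (not the actual crawler code), here is a sketch that pulls one feed and applies similar filters; the feed URL, the length threshold, and the use of `feedparser`/`langdetect` are assumptions:

```python
import feedparser
from langdetect import detect

# Hypothetical feed URL; the real list lives in data/crypto_rss_feeds.txt.
feed = feedparser.parse("https://example.com/crypto/rss")

titles = []
for entry in feed.entries:
    title = entry.title.strip()
    if len(title.split()) < 4:   # drop short news (threshold is a guess)
        continue
    if title.endswith("?"):      # drop questions
        continue
    if detect(title) != "en":    # drop non-English text
        continue
    # The "news without a verb" check needs POS tagging (e.g. nltk); omitted here.
    titles.append(title)

print(titles)
```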
Source: `crypto_sentiment_demo_app/model_inference_api/`

At the moment, we are running a FastAPI service with a tf-idf & logreg model based on this repo. Model training is not covered here, hence the model file needs to be put into the `static/models` folder prior to spinning up the API, or be stored in `Minio` and have been logged previously to `MLflow`.
To be superseded by a more advanced BERT model (Notion ticket).
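For reference, a client might call such an inference API as sketched below; the `/predict` route, the port, and the payload shape are assumptions, not the documented contract (check the service's FastAPI docs page for the actual schema):

```python
import requests

# Hypothetical endpoint and payload.
response = requests.post(
    "http://<server-ip>:8001/predict",
    json={"title": "Bitcoin rallies as ETF hopes grow"},
)
response.raise_for_status()
# Expected shape (assumption): scores for the negative/neutral/positive classes.
print(response.json())
```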
Source: `crypto_sentiment_demo_app/model_scorer/`

The model scorer service takes those title IDs from the `model_predictions` table that don't yet have predictions (score for `negative`, score for `neutral`, score for `positive`, and `predicted_class`) and calls the Model inference API to run the model against the corresponding titles. It then updates records in the `model_predictions` table to write model predictions into it.
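Schematically, one scoring pass could look like the sketch below — an assumption-laden illustration (simplified SQL, `psycopg2` for DB access, the hypothetical `/predict` route from the previous section, and the `entropy` column update omitted), not the service's actual code:

```python
import psycopg2
import requests

conn = psycopg2.connect(host="<server-ip>", dbname="cryptotitles_db",
                        user="mlooops", password="<password>")

with conn, conn.cursor() as cur:
    # Titles whose rows in model_predictions have no scores yet.
    cur.execute("""
        SELECT n.title_id, n.title
        FROM news_titles n
        JOIN model_predictions p USING (title_id)
        WHERE p.predicted_class IS NULL;
    """)
    for title_id, title in cur.fetchall():
        scores = requests.post("http://<server-ip>:8001/predict",
                               json={"title": title}).json()
        cur.execute("""
            UPDATE model_predictions
            SET negative = %s, neutral = %s, positive = %s, predicted_class = %s
            WHERE title_id = %s;
        """, (scores["negative"], scores["neutral"], scores["positive"],
              scores["predicted_class"], title_id))

conn.close()
```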
Source: `crypto_sentiment_demo_app/data_provider/`

A FastAPI service which aggregates the necessary data for our frontend from the database. Run it and check its documentation for a description of the endpoints and examples.
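To give an idea of the shape of such a service, here is a sketch of one possible endpoint computing the 24-hour average sentiment; the route name, the SQL, and DB access via `psycopg2` are assumptions:

```python
import psycopg2
from fastapi import FastAPI

app = FastAPI()

@app.get("/metrics/average_sentiment")
def average_sentiment():
    # Average positive score over news published in the last 24 hours.
    conn = psycopg2.connect(host="<server-ip>", dbname="cryptotitles_db",
                            user="mlooops", password="<password>")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT AVG(p.positive)
            FROM model_predictions p
            JOIN news_titles n USING (title_id)
            WHERE n.pub_time > NOW() - INTERVAL '24 hours';
        """)
        (value,) = cur.fetchone()
    conn.close()
    return {"average_sentiment": value}
```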
Source: `crypto_sentiment_demo_app/frontend/`

Currently, the React app looks like this:

The frontend itself talks to the database to get the average `positive` score for today's news titles (this will be changed in the future; TODO: describe how the Frontend talks to the Data Provider service to get metrics to visualize).
Source: `scheduler` service defined in `docker-compose.yml`

We use the Ofelia job scheduler, which is specifically designed for Docker environments and is easy to use with docker-compose. The scheduler manages `crawler` and `model_scorer`.

To see it in action:

- launch the app: `docker compose -f docker-compose.yml --profile production up --build`
- check scheduler logs: `docker-compose logs scheduler`
- additionally, you can check the number of records in the `news_titles` and `model_predictions` tables (they will be growing over time). For that, launch PGAdmin, navigate to Servers -> <SERVER_NAME> (e.g. "Docker Compose") -> Databases -> <DB_NAME> (e.g. "cryptotitles_db"), then select Tools -> Query Tool and type your SQL: `select count(*) from news_titles`.
You can check your ML experiments and models by visiting `http://<server-ip>:8500/`.
Source: `crypto_sentiment_demo_app/label_studio/`

The Label Studio service allows us to annotate additional data.

To launch the service:

- run the app: `docker compose -f docker-compose.yml --profile production up --build`
- this will launch Label Studio at `http://<server-ip>:8080/`

Further, the scheduler picks it up, creates a Label Studio project, and performs data imports into it and exports from it on schedule. See the "label_studio" service definition in the `docker-compose.yml` file.

It's also possible to import unlabeled data into Label Studio and to export labeled data from Label Studio manually. For that, you need to log into the running container with Label Studio: `docker ps` shows running containers; copy the Label Studio container ID from there and run `docker exec -it <container_id> /bin/bash`.
Importing data into Label Studio

From inside the running container,

```bash
bash /home/crypto_sentiment_demo_app/crypto_sentiment_demo_app/label_studio/modify_tasks.sh \
    -p <project name> \
    -m import \
    -c <active learning sampling strategy> \
    -n <number of items to sample>
```

creates a new Label Studio project, loads samples from the `model_predictions` table, and also creates annotation tasks. Visit `http://<server-ip>:8080/` to find your new project there.
Two sampling strategies are available: `least_confidence` and `entropy`. Also, the `model_predictions` table will be modified so that the `is_annotation` flag is set to `True` for the imported samples.
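Both criteria can be expressed in a few lines; below is a sketch of how such sampling scores are typically computed from the stored class probabilities (the function names and example values are illustrative, not the project's API):

```python
import numpy as np

def entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy per row; higher = the model is less certain."""
    eps = 1e-12  # avoid log(0)
    return -(probs * np.log(probs + eps)).sum(axis=1)

def least_confidence(probs: np.ndarray) -> np.ndarray:
    """1 - max class probability per row; higher = less confident."""
    return 1.0 - probs.max(axis=1)

# Rows are (negative, neutral, positive) scores, as stored in model_predictions.
scores = np.array([[0.10, 0.20, 0.70],
                   [0.33, 0.33, 0.34]])

n = 1  # number of items to sample
top_by_entropy = np.argsort(-entropy(scores))[:n]
print(top_by_entropy)  # the most uncertain rows get annotated first
```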
Exporting data from Label Studio

From inside the running container,

```bash
bash /home/crypto_sentiment_demo_app/crypto_sentiment_demo_app/label_studio/modify_tasks.sh \
    -p <project_name> \
    -m export
```

exports the annotated and submitted tasks from the Label Studio project and writes them to the `labeled_news_titles` table.
To add a self-hosted GitHub runner:

- Navigate to the main page of the repository
- Settings -> Actions -> Runners
- Click "New self-hosted runner" and follow the instructions to download and configure the GitHub Actions Runner

We are using the setup-python action to set up a Python environment. The following steps are required for this action to work with a self-hosted runner:

- Put the variable `AGENT_TOOLSDIRECTORY=/opt/hostedtoolcache` into the `.env` file.
- Create a directory called `hostedtoolcache` inside `/opt`.
- Run: `sudo chown <user name>:<user group> /opt/hostedtoolcache/`

To launch the runner:

```bash
cd actions-runner
sudo ./svc.sh install root
sudo ./svc.sh start
```

To stop the runner:

```bash
sudo ./svc.sh stop
```
We are very grateful to Hostkey and dstack.ai for providing computational resources, both GPU and CPU, for the project.