A system that consumes data from the Wikipedia EventStreams service and exposes an analytics dashboard with insights into activity on different Wikipedia instances. See:
- Architecture section for tech details,
- Deployment for instructions on running the code,
- Screenshot for an overview of the finished dashboard.
Project submitted for Data Engineering Zoomcamp 2024.
Guiding principles behind the tech choices:
- avoiding vendor lock-in,
- preferring lightweight tools without redundant features,
- keeping configuration in the repo.
This repo contains an example Terraform config for Deployment using Hetzner Cloud, but in principle everything can run on any Linux server, hosted anywhere. No proprietary tools are used.
Component | Description |
---|---|
wiki_sse_reader | Python service reading Server-Sent Events (SSE) from the Wikipedia EventStreams source |
wiki_dbt | Models (i.e. SQL code) for transforming data within the database |
wiki_dash | BI dashboard app defined in Python (Dash) |
RabbitMQ | Message queue that handles events |
ClickHouse | Main OLAP database for storing data and running analytics queries |
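
To illustrate the first hop in the table above (SSE source to message queue), here is a minimal sketch of how a reader might split an SSE line stream into JSON events and hand each one to a publisher. It is not the code in `wiki_sse_reader`: the function names `iter_sse_events` and `forward` are hypothetical, and `publish` stands in for whatever sends a message to RabbitMQ.

```python
import json
from typing import Callable, Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each event in an SSE line stream."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # SSE comment, typically a keep-alive
        if line == "":
            # A blank line terminates the current event.
            if not data_parts:
                continue
            payload = "\n".join(data_parts)
            data_parts = []
            try:
                event = json.loads(payload)
            except json.JSONDecodeError:
                continue  # skip malformed payloads instead of crashing
            yield event
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # Per the SSE format, one leading space after the colon is dropped.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event, id, retry) are ignored in this sketch.


def forward(lines: Iterable[str], publish: Callable[[str], None]) -> int:
    """Publish every parsed event as a JSON string; return how many were sent."""
    count = 0
    for event in iter_sse_events(lines):
        publish(json.dumps(event))
        count += 1
    return count
```

In a real deployment `lines` would come from a streaming HTTP response and `publish` from a RabbitMQ client; both are left abstract here so the parsing logic can be tested without network access.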

Table | Level (medallion architecture) | Description
---|---|---|
wiki_raw | bronze | Raw ingested wiki data |
wiki | silver | Parsed and filtered wiki data |
wiki_minutely_summary | gold | Event count by minute (total) |
wiki_hourly_summary | gold | Event count by hour (total) |
wiki_minutely_bywiki_summary | gold | Event count by minute (by wiki) |
wiki_hourly_bywiki_summary | gold | Event count by hour (by wiki) |
wiki_weekdays_summary | gold | Average event count for specific times of the week |
wiki_bywiki_summary | gold | Total event counts by wiki |
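
The gold-level tables above are built by the SQL models in `wiki_dbt`. Purely to illustrate what the minutely summaries contain, the sketch below computes the same kind of counts in plain Python. The field names `timestamp` (Unix seconds) and `wiki` are assumptions about the parsed event shape, and `minute_bucket` and `summarize` are hypothetical helpers, not part of the repo.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, Tuple


def minute_bucket(ts: int) -> datetime:
    """Truncate a Unix timestamp (seconds) to the start of its minute, in UTC."""
    return datetime.fromtimestamp(ts - ts % 60, tz=timezone.utc)


def summarize(events: Iterable[dict]) -> Tuple[Counter, Counter]:
    """Return (events per minute, events per (minute, wiki))."""
    total = Counter()
    by_wiki = Counter()
    for event in events:
        bucket = minute_bucket(event["timestamp"])  # assumed field name
        total[bucket] += 1
        by_wiki[(bucket, event["wiki"])] += 1  # assumed field name
    return total, by_wiki
```

The hourly tables follow the same pattern with a 3600-second bucket. In the actual system this aggregation runs inside the database rather than in application code.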