This project is intended to help data engineers quickly set up POCs on a Spark/Kafka infrastructure (or Spark/Kafka/MySQL), for two main purposes:
- Evaluating whether Spark/Kafka is a good fit to solve business challenges
- Learning how to work with a Spark/Kafka infrastructure
All of the pieces in this repo can be assembled individually; however, when first putting the stack together it is very easy to spend more time on Docker setup, troubleshooting network issues, or figuring out exactly how to set up a producer than on actually using Spark and Kafka.
My hope is that this project will help curious engineers jump right in to learning Spark and to showing how it can add value to their team.
A HUGE credit for this goes to wurstmeister (https://github.com/wurstmeister) for his work on setting up Kafka + Zookeeper in a no-hassle way for Docker Compose.
Note: NOT to be used in production environments!
- Zookeeper
- https://zookeeper.apache.org/
- An open source server for distributed coordination; Kafka runs on this and, in production environments, uses it to ensure proper replication across instances
- Most users of this project will not have to directly interact with Zookeeper
- Kafka
- https://kafka.apache.org/
- Open source distributed streaming platform
- Users of this project will interact with Kafka in several ways:
- Writing to Kafka using a simple producer program written in Python
- Reading from Kafka using a simple consumer program written in Python
- Reading from Kafka using Apache Spark
- Kafka data is not persisted outside of the container, so each restart of the container will result in a fresh data set
- Zeppelin
- https://zeppelin.apache.org/
- Open source web-based computational notebook platform
- Includes a Spark interpreter, making it very easy to write, test, and demo code
- Because a Spark interpreter is packaged with Zeppelin, you do not need a separate Spark cluster running on your machine, though connecting to one is configurable as well
- The majority of your interaction with this project will happen here
- Configuration and notebooks are persisted outside of the container on mounted volumes, to avoid tedious rework
- A persistent directory for Spark outputs (such as Parquet) is provided
- MySQL (Optional)
- Some teams will want to see that they can write outputs to a more traditional data store; an optional MySQL container is included in the Docker network to make this easier
- Data written to MySQL is persisted
./compose-files
- Docker Compose service definitions
./python-client
- producer.py
    - Basic producer which sends test data to Kafka; for a POC, sub in a custom data set
- consumer.py
    - Basic consumer to quickly validate that data is flowing to Kafka
setup_dir.sh
- Sets up the local persistent directories which will be mounted
requirements.txt
- Python package requirements
run_compose.sh
- Runs Zookeeper, Kafka, and Zeppelin in silent mode
run_compose_mysql.sh
- Runs Zookeeper, Kafka, MySQL, and Zeppelin in silent mode
- docker-compose
- Python (Tested with 3.5; should be compatible with 2.7)
- virtualenv
Created on Ubuntu 16.04
- Clone this repo locally:
git clone https://github.com/colemanja91/docker-kafka-spark-poc.git
cd ./docker-kafka-spark-poc
- Set up directories for mount volumes
sudo bash setup_dir.sh
- If you want to change the directory locations, you will need to update them in the docker-compose files as well
- This will also copy the default Zeppelin config file from this project, which has the necessary connection information for Kafka
- Run docker-compose
bash ./run_compose.sh
- (with MySQL):
bash ./run_compose_mysql.sh
- Verify that Zeppelin is running via browser:
- localhost:8080
To kill the containers:
cd ./compose-files
docker-compose down --remove-orphans
Now that our Kafka instance is running, we need to send it some data to set up our POC.
- Set up a virtual environment:
virtualenv venv -p python3.5
source venv/bin/activate
pip install -r requirements.txt
- To send default test data:
python ./python-client/producer.py
- This will send one of three simple test records, chosen at random, at intervals between 0.5 and 5 seconds
- A POC is often more helpful if it uses data you are familiar with; it may be helpful to load a JSON or CSV file of sample data to send (see the sketch below)
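For example, here is a minimal sketch of streaming rows from a local CSV file into Kafka. It assumes the kafka-python package from requirements.txt, a broker reachable at localhost:9092, a topic named test, and a hypothetical sample_data.csv file; check producer.py for the broker and topic values this repo actually uses:

```python
# Minimal sketch: send each row of a local CSV file to Kafka as JSON.
# Assumptions: kafka-python is installed, the broker is reachable at
# localhost:9092, and the topic is named "test" -- match producer.py.
import csv
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open("sample_data.csv") as f:        # hypothetical sample data file
    for row in csv.DictReader(f):
        producer.send("test", value=row)  # one record per CSV row
        time.sleep(1)                     # pace the stream for demo purposes

producer.flush()
```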
To ensure the data from the above producer is being captured by Kafka, we have a simple script which reads data from our test topic and prints it to the console:
python ./python-client/consumer.py
This is most helpful if trying to troubleshoot connection issues.
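If you need to adapt the consumer (for example, to point it at a different topic), a minimal console consumer looks like the sketch below. The broker address localhost:9092 and the topic name test are assumptions; check consumer.py for the values this repo actually uses:

```python
# Minimal troubleshooting consumer: print every message on the topic.
# Assumptions: kafka-python, broker at localhost:9092, topic "test".
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "test",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # replay the topic from the beginning
)

for message in consumer:
    print(message.value)
```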
The artifact required to read from Kafka in Spark is included in the Zeppelin config:
org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
Also included is the MySQL JDBC driver:
mysql:mysql-connector-java:5.1.38
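For teams taking the MySQL option, a JDBC write from Spark looks roughly like the following PySpark sketch (runnable in a %pyspark Zeppelin paragraph). The hostname mysql, database poc, table events, and credentials are all assumptions; match them to whatever the MySQL service in your compose file defines:

```python
# Minimal sketch (PySpark): write a small DataFrame to MySQL over JDBC.
# "spark" is the SparkSession Zeppelin provides; the URL, table name, and
# credentials below are placeholders -- match them to your compose file.
df = spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "message"])

(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://mysql:3306/poc")
   .option("driver", "com.mysql.jdbc.Driver")
   .option("dbtable", "events")
   .option("user", "root")
   .option("password", "example")
   .mode("append")
   .save())
```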
Here are some basic code snippets for Spark (of course, it's more fun to figure them out on your own ;) ):
// Read in a data stream from Kafka
// Write Parquet table to our persistent volume
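For reference, one way to fill those two snippets in is the PySpark sketch below (runnable in a %pyspark Zeppelin paragraph; the Scala equivalents are nearly identical). The bootstrap server kafka:9092, the topic test, and the output paths are assumptions; use the hostnames from your compose file and the persistent output directory created by setup_dir.sh:

```python
# Minimal sketch: read a stream from Kafka and write it out as Parquet.
# Broker address, topic, and paths are placeholders -- adjust to your setup.

# Read in a data stream from Kafka
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "test")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))

# Write Parquet table to our persistent volume
query = (stream.writeStream
    .format("parquet")
    .option("path", "/data/spark-output/test_parquet")
    .option("checkpointLocation", "/data/spark-output/test_checkpoint")
    .start())
```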