This project is intended to help data engineers quickly set up POCs on a Spark/Kafka infrastructure (or Spark/Kafka/MySQL), for two main purposes:
- Evaluating whether Spark/Kafka is a good fit to solve business challenges
- Learning how to work with a Spark/Kafka infrastructure
All of the pieces in this repo can be assembled individually; however, when first putting the stack together it is very easy to spend more time on Docker setup, troubleshooting network issues, or figuring out exactly how to set up a producer than on actually using Spark and Kafka.
My hope is that this project will help curious engineers jump right in to learning Spark and to showing how it can add value to their team.
A HUGE credit for this goes to wurstmeister (https://github.com/wurstmeister) for his work on setting up Kafka + Zookeeper in a no-hassle way for Docker Compose.
Note: NOT to be used in production environments!
- Zookeeper
- https://zookeeper.apache.org/
- An open source server for distributed coordination; Kafka runs on this and, in production environments, uses it to ensure proper replication across instances
- Most users of this project will not have to directly interact with Zookeeper
- Kafka
- https://kafka.apache.org/
- Open source distributed streaming platform
- Users of this project will interact with Kafka in several ways:
- Writing to Kafka using a simple producer program written in Python
- Reading from Kafka using a simple consumer program written in Python
- Reading from Kafka using Apache Spark
- Kafka data is not persisted outside of the container, so each restart of the container will result in a fresh data set
- Zeppelin
- https://zeppelin.apache.org/
- Open source web-based computational notebook platform
- Includes a Spark interpreter, making it very easy to write, test, and demo code
- Because a Spark interpreter is packaged with Zeppelin, you do not need a separate Spark cluster running on your machine, though connecting to one is configurable as well
- The majority of your interaction with this project will happen here
- Configuration and notebooks are persisted outside of the container on mounted volumes, to avoid tedious rework
- A persistent directory for Spark outputs (such as Parquet) is provided
- MySQL (Optional)
- Some teams will want to see that they can write outputs to a more traditional data store; an optional MySQL container is included in the Docker network to make this easier
- Data written to MySQL is persisted
./compose-files
- Docker Compose service definitions
./python-client
- producer.py
    - Basic producer which sends test data to Kafka; for a POC, sub in a custom data set
- consumer.py
    - Basic consumer to quickly validate that data is flowing to Kafka
setup_dir.sh
- Sets up the local persistent directories which will be mounted
requirements.txt
- Python package requirements
run_compose.sh
- Runs Zookeeper, Kafka, and Zeppelin in silent mode
run_compose_mysql.sh
- Runs Zookeeper, Kafka, MySQL, and Zeppelin in silent mode
- docker-compose
- Python (Tested with 3.5; should be compatible with 2.7)
- virtualenv
Created on Ubuntu 16.04
- Clone this repo locally:
git clone https://github.com/colemanja91/docker-kafka-spark-poc.git
cd ./docker-kafka-spark-poc
- Set up directories for mount volumes
sudo bash setup_dir.sh
- If you want to change the directory locations, you will need to update them in the docker-compose files as well
- This will also copy the default Zeppelin config file from this project, which has the necessary connection information for Kafka
- Run docker-compose
bash ./run_compose.sh
- (with MySQL):
bash ./run_compose_mysql.sh
- Verify that Zeppelin is running via browser:
- localhost:8080
To kill the containers:
cd ./compose-files
docker-compose down --remove-orphans
Now that our Kafka instance is running, we need to send it some data to set up our POC.
- Set up a virtual environment:
virtualenv venv -p python3.5
source venv/bin/activate
pip install -r requirements.txt
- To send default test data:
python ./python-client/producer.py
- This will send one of three simple test records, chosen at random, at intervals between 0.5 and 5 seconds
- A POC is often more helpful if it uses data you are familiar with; it may be helpful to load a JSON or CSV file of sample data to send (see the sketch below)
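For example, here is a minimal sketch of streaming rows from a local CSV file into Kafka. It assumes the kafka-python package from requirements.txt, a broker reachable at localhost:9092, a topic named test, and a hypothetical sample_data.csv file; check producer.py for the broker and topic values this repo actually uses:

```python
# Minimal sketch: send each row of a local CSV file to Kafka as JSON.
# Assumptions: kafka-python is installed, the broker is reachable at
# localhost:9092, and the topic is named "test" -- match producer.py.
import csv
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with open("sample_data.csv") as f:        # hypothetical sample data file
    for row in csv.DictReader(f):
        producer.send("test", value=row)  # one record per CSV row
        time.sleep(1)                     # pace the stream for demo purposes

producer.flush()
```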
To ensure the data from the above producer is being captured by Kafka, we have a simple script which reads data from our test topic and prints it to the console:
python ./python-client/consumer.py
This is most helpful if trying to troubleshoot connection issues.
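If you need to adapt the consumer (for example, to point it at a different topic), a minimal console consumer looks like the sketch below. The broker address localhost:9092 and the topic name test are assumptions; check consumer.py for the values this repo actually uses:

```python
# Minimal troubleshooting consumer: print every message on the topic.
# Assumptions: kafka-python, broker at localhost:9092, topic "test".
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "test",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # replay the topic from the beginning
)

for message in consumer:
    print(message.value)
```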
The artifact required to read from Kafka in Spark is included in the Zeppelin config:
org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
Also included is the MySQL JDBC driver:
mysql:mysql-connector-java:5.1.38
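For teams taking the MySQL option, a JDBC write from Spark looks roughly like the following PySpark sketch (runnable in a %pyspark Zeppelin paragraph). The hostname mysql, database poc, table events, and credentials are all assumptions; match them to whatever the MySQL service in your compose file defines:

```python
# Minimal sketch (PySpark): write a small DataFrame to MySQL over JDBC.
# "spark" is the SparkSession Zeppelin provides; the URL, table name, and
# credentials below are placeholders -- match them to your compose file.
df = spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "message"])

(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://mysql:3306/poc")
   .option("driver", "com.mysql.jdbc.Driver")
   .option("dbtable", "events")
   .option("user", "root")
   .option("password", "example")
   .mode("append")
   .save())
```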
Here are some basic code snippets for Spark (of course, it's more fun to figure them out on your own ;) ):
// Read in a data stream from Kafka
// Write Parquet table to our persistent volume
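For reference, one way to fill those two snippets in is the PySpark sketch below (runnable in a %pyspark Zeppelin paragraph; the Scala equivalents are nearly identical). The bootstrap server kafka:9092, the topic test, and the output paths are assumptions; use the hostnames from your compose file and the persistent output directory created by setup_dir.sh:

```python
# Minimal sketch: read a stream from Kafka and write it out as Parquet.
# Broker address, topic, and paths are placeholders -- adjust to your setup.

# Read in a data stream from Kafka
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "test")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))

# Write Parquet table to our persistent volume
query = (stream.writeStream
    .format("parquet")
    .option("path", "/data/spark-output/test_parquet")
    .option("checkpointLocation", "/data/spark-output/test_checkpoint")
    .start())
```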