AWS Glue Streaming uses Glue Connections to connect to different sources and targets. One of these connections is Kafka. However, this connection does not support SASL/PLAIN, a common authentication mechanism used by vanilla Kafka and Confluent. This limitation means that Glue Streaming does not support Confluent out of the box.
An alternative is to use native Spark APIs to integrate AWS Glue Streaming with Confluent. This repository provides a simple demo and boilerplate code for Glue Streaming. It reads data from a Confluent Cloud topic and writes that data to another topic, with the only transformation being the removal of a specific column. The code serves as a foundation that you can build on by adding your own custom transformations as needed.
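As a rough illustration of the native Spark approach, the sketch below builds the Kafka source options that SASL/PLAIN over TLS typically requires and passes them to Spark Structured Streaming. The function names `confluent_kafka_options` and `read_topic`, the `startingOffsets` choice, and the placeholder broker address are assumptions made for this example; the repository's `streaming.py` may be organized differently.

```python
def confluent_kafka_options(bootstrap_servers: str, api_key: str, api_secret: str) -> dict:
    """Build Spark Kafka options for SASL/PLAIN authentication over TLS.

    Hypothetical helper: the API key and secret are used as the SASL
    username and password.
    """
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{api_key}" password="{api_secret}";'
    )
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
    }


def read_topic(spark, options: dict, topic: str):
    """Return a streaming DataFrame for `topic`.

    `spark` is an existing SparkSession (for example, the one a Glue job
    already provides); starting from the latest offsets is an assumption.
    """
    return (
        spark.readStream.format("kafka")
        .options(**options)
        .option("subscribe", topic)
        .option("startingOffsets", "latest")
        .load()
    )


if __name__ == "__main__":
    # Placeholder values; real ones would come from job parameters or a secret store.
    opts = confluent_kafka_options("broker.example:9092", "API_KEY", "API_SECRET")
    print(opts["kafka.sasl.mechanism"])
```

Keeping the option dictionary separate from the read call means the same settings can be reused for the sink side with `writeStream.format("kafka")`.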
We use Terraform to deploy all the necessary resources. The script deploys the following:
- Confluent Cloud environment
- Confluent Cloud Cluster
- Confluent Cloud source and target topics
- API Keys with read/write permissions on the source and target topics
- Datagen Connector to generate mock data for the demo
- Glue Streaming Python code
- S3 Bucket to upload the code
├── assets                  <-- Directory that holds demo assets
│   └── architecture.png    <-- Demo architecture diagram
├── Terraform               <-- Demo Terraform scripts and artifacts
│   ├── aws.tf              <-- Terraform for AWS resources
│   ├── main.tf             <-- Terraform for Confluent resources
│   ├── outputs.tf          <-- Terraform output file
│   ├── providors.tf        <-- Terraform providers file
│   ├── streaming.py        <-- Glue Streaming code
│   └── variables.tf        <-- Terraform variables file
└── README.md
The demo uses Glue Streaming to read raw messages generated by the Datagen connector. It removes the itemid field and subsequently publishes the modified data back to Confluent.
Note: This is a basic example transformation. You can add any transformation supported by Spark Structured Streaming.
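To make the transformation concrete, here is a minimal, Spark-free sketch of the per-message logic: parse the message value, remove `itemid`, and re-serialize. It assumes the message values are JSON-encoded, which depends on how the Datagen connector is configured; `drop_field` is a hypothetical name. In the actual job the same effect would more likely be achieved with DataFrame operations such as `from_json` followed by `drop`.

```python
import json


def drop_field(value: bytes, field: str = "itemid") -> bytes:
    """Remove `field` from a JSON-encoded message value (assumed format)."""
    record = json.loads(value)
    record.pop(field, None)  # leave the record unchanged if the field is absent
    return json.dumps(record).encode("utf-8")


if __name__ == "__main__":
    # Made-up sample payload, not actual Datagen output.
    sample = b'{"orderid": 1, "itemid": "Item_1"}'
    print(drop_field(sample))
```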
- Confluent Cloud API Keys - [Cloud API Keys](https://docs.confluent.io/cloud/current/access-management/authenticate/api-keys/api-keys.html#cloud-cloud-api-keys) with Organization Admin permissions are needed to deploy the necessary Confluent resources.
- Terraform (0.14+) - The application is created automatically using Terraform. Besides having Terraform installed locally, you will need to provide your cloud provider credentials so Terraform can create and manage the resources for you.
- AWS account - This demo runs on AWS.
- AWS CLI - The Terraform script uses the AWS CLI to manage AWS resources.
- Clone the repo onto your local development machine:
    git clone <repo url>
- Change directory to the demo repository's Terraform directory:
    cd stream-processing-with-confluent-and-glue-streaming/Terraform
- Use the Terraform CLI to deploy the solution:
    terraform init
    terraform plan
    terraform apply
- Go to the Confluent Cloud Topics UI, then choose the newly created environment and cluster.
- Browse to source-topic and view the raw messages.
- Navigate to target-topic and view the post-processed messages. Notice that the output messages are missing one column, which was dropped by the Glue Streaming job.
- Experiment with the Glue Streaming code to add any transformations you need.
The great thing about cloud resources is that you can spin them up and down with a few commands. Once you are finished with this demo, remember to destroy the resources you created to avoid incurring charges. You can always spin them up again whenever you want.
Note: When you are done with the demo, you can automatically destroy all the resources created using the command below:
terraform destroy