Big Work :flex:
Got the front page implemented. Added more diagrams to the different sections of the case study. Fixed styling issues. Added the `Team` section to the bottom of the home page.

Co-authored-by: Paco <[email protected]>
Co-authored-by: Esther Kim <[email protected]>
Co-authored-by: Cruz <[email protected]>
4 people committed Dec 7, 2024
1 parent 9373978 commit dde915c
Showing 29 changed files with 2,397 additions and 325 deletions.
2 changes: 1 addition & 1 deletion docs/Introduction.md
@@ -4,7 +4,7 @@ sidebar_label: Introduction
sidebar_position: 1
---

# 1. Tumbleweed
# 1. Introduction

![Tumbleweed Logo](/img/tumbleweed_logo.png)

2 changes: 2 additions & 0 deletions docs/existing_solutions.md
@@ -13,12 +13,14 @@ There are several enterprise solutions which implement CDC systems for microserv
Most enterprise solutions focus on broader CDC implementations and are tailored to traditional source-to-sink replication between data stores. Several, such as Confluent[^1] and Striim[^2], do allow for managed log-based CDC pipelines that replicate data between microservices. These solutions typically function well; however, they come with tradeoffs. They are expensive and carry recurring costs, and hosting the pipeline on a managed service means decreased data privacy and less control over infrastructure. Moreover, managed CDC solutions, while usable, are rarely designed specifically for outbox-pattern data syncing between microservices.

![Striim Logo and Confluent Logo](/img/striim_confluent.png "Striim and Confluent Logos")
<figcaption>Figure 1: Striim and Confluent.</figcaption>

## 3.2 DIY Solutions

An alternative to enterprise solutions is building your own. DIY solutions can be built using open-source tools such as Debezium[^3] and Apache Kafka[^4], which offer extensive flexibility for data customization, including but not limited to schema evolution, data transformation, and topic customization. Building a DIY solution with these tools may be a good fit for teams that prefer more control over their infrastructure and the option of extensive customization in their CDC pipeline, while avoiding the recurring costs of an enterprise solution. These benefits come at the cost of managing the complex configurations of these tools, which may hinder a team's ability to deploy a CDC pipeline quickly. Without extensive research or experience in the problem domain and these technologies, even experienced developers will require considerable time to build a production-ready DIY system.

![Kafka and Debezium Logos](/img/debezium_kafka.png "Kafka and Debezium Logos")
<figcaption>Figure 2: Apache Kafka and Debezium.</figcaption>

[^1]: [Confluent Developer: Your Apache Kafka® Journey begins here. (n.d.). Confluent.](https://developer.confluent.io/)
[^2]: [Striim. (2024, October 28). Real-time data integration and streaming platform.](https://www.striim.com/1)
17 changes: 17 additions & 0 deletions docs/problem_domain.md
@@ -20,6 +20,9 @@ One solution for this problem is the use of Event Driven Architecture (EDA). In

In event stream processing, an event is a record or "...a small, self-contained, immutable object containing the details of something that happened at some point in time..."[^6] and an event stream is therefore an "unbounded, incrementally processed"[^6] stream of such data. Many event stream processing frameworks can also be described as asynchronous message-passing systems or message brokers.[^6] Generally speaking, message brokers allow producing processes to send messages or records to a topic or queue; the broker then facilitates the delivery of that data to subscribed consumers.

![Event Streaming](/img/event_streaming.svg "Event streaming")
<figcaption>Figure 2: Event Stream Processing.</figcaption>

The use of a message broker has a number of advantages over direct messaging between services. It can act as a buffer when consumers are unavailable and more easily allows a single message to be sent to multiple consumers. This approach also promotes greater decoupling between producer and consumer services, allowing microservices to be designed agnostic as to where the event data is sent or how it is consumed.[^7] Consumers, likewise, subscribe to only the types of events that concern their business logic and receive these events for processing from the streaming platform.
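
To make this relationship concrete, the sketch below shows a producer and a consumer communicating through a broker using the `kafkajs` client. The broker address, topic name, and payload are illustrative assumptions, not details from any particular deployment.

```ts
import { Kafka } from "kafkajs";

// Hypothetical broker address and client names, for illustration only.
const kafka = new Kafka({ clientId: "orders-service", brokers: ["localhost:9092"] });

async function produce(): Promise<void> {
  const producer = kafka.producer();
  await producer.connect();
  // The producer publishes to a topic without knowing who will consume it.
  await producer.send({
    topic: "orders",
    messages: [{ key: "order-42", value: JSON.stringify({ status: "created" }) }],
  });
  await producer.disconnect();
}

async function consume(): Promise<void> {
  const consumer = kafka.consumer({ groupId: "shipping-service" });
  await consumer.connect();
  // The consumer subscribes only to the topics its business logic cares about.
  await consumer.subscribe({ topic: "orders", fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      console.log(`received: ${message.value?.toString()}`);
    },
  });
}

produce().then(consume).catch(console.error);
```

Because the broker sits in the middle, either side can be taken down, scaled, or replaced without the other needing to know.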

## 2.3 The Dual-Write Problem
@@ -29,8 +32,14 @@ A dual-write may occur when data needs to be written to different systems. For e

This process of writing changes to separate systems is where problems can arise and create data inconsistencies between services. If the data successfully writes to the source database but fails to be sent to a message broker due to some kind of application or network issue, the source database will have a record of the change even if the destination never receives it.

![Dual Write 1](/img/dual-write_1.svg "Dual Write Problem 1")
<figcaption>Figure 3: Fails to write to the message broker.</figcaption>

On the other hand, if the data is successfully written to the message broker but fails to write to the source database, the destination service receives the message while the source database has no record of it. Either scenario can result in errors or data inconsistencies, complicating operations between services.

![Dual Write 2](/img/dual-write_2.svg "Dual Write Problem 2")
<figcaption>Figure 4: Fails to write to the source database.</figcaption>
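
A minimal sketch of the dual-write anti-pattern, assuming a PostgreSQL table written with the `pg` driver and a Kafka topic written with `kafkajs`; the connection details and names are hypothetical. The comments mark the failure windows described above.

```ts
import { Pool } from "pg";
import { Kafka } from "kafkajs";

// Hypothetical connection details, for illustration only.
const pool = new Pool({ connectionString: "postgres://localhost/orders" });
const producer = new Kafka({ clientId: "orders-service", brokers: ["localhost:9092"] }).producer();

// ANTI-PATTERN: two independent writes, with no transaction spanning both.
async function createOrder(order: { id: string; total: number }): Promise<void> {
  await producer.connect();

  // Write #1: the source database.
  await pool.query("INSERT INTO orders (id, total) VALUES ($1, $2)", [order.id, order.total]);

  // If the process crashes or the network fails here, the database has the
  // order but the broker never hears about it.

  // Write #2: the message broker.
  await producer.send({
    topic: "orders",
    messages: [{ value: JSON.stringify({ type: "order_created", payload: order }) }],
  });

  // Reversing the order of the writes only swaps the failure mode: the broker
  // would get the event while the database write could still fail.
}
```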

One solution is to only write changes once. If we chose to write changes only to the broker, both the source and destination services would listen for new messages. When a change occurs in the source service, the message is first sent to the broker and then consumed by both the source and destination services.

This scenario comes with its own drawbacks. While data may eventually be consistent, a source service writing to the broker before updating its own database can introduce latency and create delays, especially if that data needs to be immediately queried from the source database. This has the potential to negatively impact user experience. The handling of delivery failures would also need to be considered. For example, if a message was successfully sent to the message broker but failed to write to the source database for some reason, additional retry logic may be required to address this failure.
@@ -41,6 +50,9 @@ Instead, we could write changes to the source database before pushing messages t

When using the transactional outbox pattern, database changes are recorded locally to a specially created “outbox” table within the same transaction as the original operation. Transactions in a database allow multiple actions to be carried out as a single logical operation. The outbox table stores metadata about the changes, such as the operation type, and a data payload. Separate processes should then monitor the outbox table for new entries and update the necessary microservices accordingly.

![Outbox Pattern](/img/outbox_pattern.svg "Outbox Pattern")
<figcaption>Figure 5: The outbox pattern.</figcaption>
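
A minimal sketch of the outbox write using the `pg` driver. The table and column names follow the common outbox convention described below; they are assumptions for illustration, not Tumbleweed's exact schema.

```ts
import { Pool } from "pg";

// Hypothetical connection details, for illustration only.
const pool = new Pool({ connectionString: "postgres://localhost/orders" });

// The business write and the outbox write share a single transaction,
// so they either both commit or both roll back.
async function createOrderWithOutbox(order: { id: string; total: number }): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("INSERT INTO orders (id, total) VALUES ($1, $2)", [
      order.id,
      order.total,
    ]);
    await client.query(
      `INSERT INTO outbox (aggregatetype, aggregateid, type, payload)
       VALUES ($1, $2, $3, $4)`,
      ["orders", order.id, "order_created", JSON.stringify(order)]
    );
    await client.query("COMMIT"); // both rows persist atomically
  } catch (err) {
    await client.query("ROLLBACK"); // neither row persists
    throw err;
  } finally {
    client.release();
  }
}
```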

Outbox table schemas can vary but typically include the following columns:
- `id`: A unique identifier for each outbox event
- `aggregatetype`: An event descriptor, often called a topic, which can be used for categorized routing of event records via a messaging system. For example, in an order-propagating service, a change to an orders database might have the aggregate type “orders”.
@@ -59,15 +71,20 @@ Data has become a fundamental component of our world and when it comes to choosi
Change Data Capture (CDC) is the process of monitoring a source, capturing data changes in near real-time, and propagating those changes to a variety of downstream consumers, which may include other databases, caches, applications, and more. There are three primary CDC methods in common usage: time-based, trigger-based, and log-based.
Time-based CDC requires semi-invasive database schema changes: a timestamp column is added to each table to track when each row was last modified. While somewhat straightforward to implement, time-based CDC cannot track delete operations, and every row in a table must be scanned to find the latest updated values, making it suitable only when a small percentage of data changes.

![Time Based CDC](/img/timestamp-based_CDC.svg "Timestamp Based CDC")
<figcaption>Figure 6: Time-based CDC.</figcaption>
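
A sketch of the polling loop this implies, assuming a hypothetical `updated_at` timestamp column and the `pg` driver.

```ts
import { Pool } from "pg";

// Hypothetical connection details, for illustration only.
const pool = new Pool({ connectionString: "postgres://localhost/orders" });

// High-water mark: the most recent timestamp we have already propagated.
let lastSeen = new Date(0);

async function pollChanges(): Promise<void> {
  // Deletes never show up here: a deleted row has no updated_at to read.
  const { rows } = await pool.query(
    "SELECT * FROM orders WHERE updated_at > $1 ORDER BY updated_at",
    [lastSeen]
  );
  for (const row of rows) {
    lastSeen = row.updated_at; // advance the high-water mark
    // ...propagate `row` to downstream consumers...
  }
}

// Poll on an interval; shorter intervals mean fresher data but more scans.
setInterval(() => pollChanges().catch(console.error), 5_000);
```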

Trigger-based CDC uses built-in database functions that fire automatically whenever an insert, update, or delete operation occurs on a table. These changes are stored in what is often called a “shadow table”, which can then be used to propagate data changes to downstream systems. While triggers are supported by most databases, this method requires multiple writes for every change, which impacts source database performance, and managing a large number of triggers can become cumbersome.

![Trigger Based CDC](/img/Trigger-based_CDC.png "Trigger Based CDC")
<figcaption>Figure 7: Trigger-based CDC.</figcaption>
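
A sketch of what such a trigger might look like on a hypothetical `orders` table, with the DDL issued through the `pg` driver; the shadow table layout is an assumption for illustration.

```ts
import { Pool } from "pg";

// Hypothetical connection details, for illustration only.
const pool = new Pool({ connectionString: "postgres://localhost/orders" });

// A trigger function copies every change into a shadow table: note the
// extra write incurred on every insert, update, and delete.
const ddl = `
  CREATE TABLE IF NOT EXISTS orders_shadow (
    id         BIGSERIAL PRIMARY KEY,
    operation  TEXT NOT NULL,
    row_data   JSONB NOT NULL,
    changed_at TIMESTAMPTZ NOT NULL DEFAULT now()
  );

  CREATE OR REPLACE FUNCTION capture_orders_change() RETURNS trigger AS $$
  BEGIN
    IF (TG_OP = 'DELETE') THEN
      INSERT INTO orders_shadow (operation, row_data) VALUES (TG_OP, to_jsonb(OLD));
      RETURN OLD;
    ELSE
      INSERT INTO orders_shadow (operation, row_data) VALUES (TG_OP, to_jsonb(NEW));
      RETURN NEW;
    END IF;
  END;
  $$ LANGUAGE plpgsql;

  DROP TRIGGER IF EXISTS orders_cdc ON orders;
  CREATE TRIGGER orders_cdc
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE FUNCTION capture_orders_change();
`;

pool.query(ddl).catch(console.error);
```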

Although both time-based and trigger-based CDC still remain in use, log-based CDC has emerged as a generally more efficient and less invasive technique by capturing changes directly from database transaction logs.

### 2.5.1 Log-based CDC

![Log Based CDC](/img/log-based-cdc.png)
<figcaption>Figure 8: Log-based CDC.</figcaption>

For applications that need to access data in near real time, log-based CDC is the most widely used of the various CDC methods. When changes happen to a database via create, update, or delete operations, the database records these changes in its transaction log before they are applied to the database's data files. In PostgreSQL, the transaction log is known as the Write-Ahead Log (WAL). Transaction logs exist primarily for backup and recovery, but various CDC tools can read from them in order to replicate changes and send them to other systems.
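
The WAL can be sampled directly from SQL, which is a quick way to see what log-based CDC tools consume. A sketch assuming a PostgreSQL server running with `wal_level = logical` and the built-in `test_decoding` output plugin; connection details and names are illustrative.

```ts
import { Pool } from "pg";

// Hypothetical connection details; assumes the server runs with wal_level = logical.
const pool = new Pool({ connectionString: "postgres://localhost/orders" });

async function peekAtWal(): Promise<void> {
  // Create a logical replication slot backed by the built-in test_decoding plugin.
  await pool.query(
    "SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding')"
  );

  // Any committed change is now visible as decoded WAL content.
  await pool.query("UPDATE orders SET total = total + 1 WHERE id = $1", ["order-42"]);

  // Read (and consume) the decoded changes from the slot.
  const { rows } = await pool.query(
    "SELECT lsn, data FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL)"
  );
  rows.forEach((r) => console.log(r.lsn, r.data));
}

peekAtWal().catch(console.error);
```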

4 changes: 4 additions & 0 deletions docs/tumbleweed.md
@@ -18,6 +18,7 @@ Tumbleweed is an open-source, user-friendly framework designed specifically for
Tumbleweed integrates and containerizes a variety of open-source frameworks and tools, along with a custom TypeScript backend API, into an AWS-deployable cluster.

![Tumbleweed Architecture](/img/tumbleweed_simplified.png)
<figcaption>Figure 1: Tumbleweed Architecture.</figcaption>

In order to understand how Tumbleweed functions, it's necessary to examine some of these technologies in more depth.

@@ -37,6 +38,9 @@ Kafka Connect allows for transmission between Kafka and external systems via sel

Connect provides the connection to Kafka, but we still needed a source connector. In our research we came across Debezium, an open-source distributed platform that implements log-based Change Data Capture. Debezium offers a number of well-maintained and documented source connectors for use with Kafka Connect. These connectors can be used to capture and create event records with a consistent structure, regardless of the source database. We found that the Debezium PostgreSQL connector was well-suited for our purposes.

![Kafka Connect with Debezium](/img/kafka_connect_debezium.png "Connect and Debezium")
<figcaption>Figure 2: Kafka Connect with Debezium.</figcaption>

Thus, Tumbleweed uses a Kafka Connect instance with Debezium PostgreSQL source connectors to listen to producer service outbox tables and capture every create, update, and delete transaction, passing them to Kafka. From there, events are passed to our custom-built TypeScript backend sink, which streams the data to the appropriately subscribed consumer services.
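
Connectors are registered by POSTing a JSON document to Kafka Connect's REST API. The sketch below shows what registering a Debezium PostgreSQL connector generally looks like; the hostnames, credentials, and table names are illustrative assumptions, not Tumbleweed's actual configuration.

```ts
// Hypothetical connector registration, for illustration only.
const connector = {
  name: "outbox-connector",
  config: {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "producer-db",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "orders",
    "topic.prefix": "producer",
    // Capture only the outbox table, per the pattern described earlier.
    "table.include.list": "public.outbox",
  },
};

async function registerConnector(): Promise<void> {
  // Kafka Connect's REST API conventionally listens on port 8083.
  const res = await fetch("http://connect:8083/connectors", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(connector),
  });
  if (!res.ok) throw new Error(`Connect returned ${res.status}`);
}

registerConnector().catch(console.error);
```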

### 4.2.4 Apicurio Schema Registry
110 changes: 37 additions & 73 deletions docusaurus.config.ts
@@ -10,15 +10,15 @@ const config: Config = {
favicon: 'img/tumbleweed_favicon.ico',

// Set the production url of your site here
url: 'https://your-docusaurus-site.example.com',
url: 'https://tumbleweed-cdc.github.io',
// Set the /<baseUrl>/ pathname under which your site is served
// For GitHub pages deployment, it is often '/<projectName>/'
baseUrl: '/',

// GitHub pages deployment config.
// If you aren't using GitHub pages, you don't need these.
// organizationName: 'facebook', // Usually your GitHub org/user name.
projectName: 'Tumbleweed', // Usually your repo name.
projectName: 'tumbleweed-cdc.github.io', // Usually your repo name.

onBrokenLinks: 'throw',
onBrokenMarkdownLinks: 'warn',
@@ -37,102 +37,66 @@
{
docs: {
sidebarPath: './sidebars.ts',
// Please change this to your repo.
// Remove this to remove the "edit this page" links.
// editUrl:
// 'https://github.com/facebook/docusaurus/tree/main/packages/create-docusaurus/templates/shared/',
},
// blog: {
// showReadingTime: true,
// feedOptions: {
// type: ['rss', 'atom'],
// xslt: true,
// },
// // Please change this to your repo.
// // Remove this to remove the "edit this page" links.
// editUrl:
// 'https://github.com/facebook/docusaurus/tree/main/packages/create-docusaurus/templates/shared/',
// // Useful options to enforce blogging best practices
// onInlineTags: 'warn',
// onInlineAuthors: 'warn',
// onUntruncatedBlogPosts: 'warn',
// },
theme: {
customCss: './src/css/custom.css',
customCss: require.resolve('./src/css/custom.css'),
},
} satisfies Preset.Options,
],
],

plugins: [
async function tailwindPlugin(context, options) {
return {
name: "docusaurus-tailwindcss",
configurePostCss(postcssOptions) {
postcssOptions.plugins.push(require("tailwindcss"));
postcssOptions.plugins.push(require("autoprefixer"));
return postcssOptions;
},
};
},
],

themeConfig: {
colorMode: {
defaultMode: 'light',
disableSwitch: true
},
// Replace with your project's social card
image: 'img/FaviconTumbleweedTransparent.ico',
navbar: {
title: 'Tumbleweed',
logo: {
alt: 'My Site Logo',
alt: 'Tumbleweed Logo',
src: 'img/FaviconTumbleweedTransparent.ico',
},
items: [
{ to: '/docs/introduction',
label: 'Case Study',
position: 'right'},
{
type: 'docSidebar',
sidebarId: 'tutorialSidebar',
position: 'right',
label: 'Case Study',
to: "/#team",
label: "Team",
position: "right",
activeBasePath: "never-active",
},
{to: '/team', label: 'Team', position: 'right'},
{
href: 'https://github.com/tumbleweed-cdc',
label: 'GitHub',
position: 'right',
},
],
},
footer: {
style: 'dark',
links: [
{
title: 'Docs',
items: [
{
label: 'Tutorial',
to: '/docs/intro',
},
],
},
{
title: 'Community',
items: [
{
label: 'Stack Overflow',
href: 'https://stackoverflow.com/questions/tagged/docusaurus',
},
{
label: 'Discord',
href: 'https://discordapp.com/invite/docusaurus',
},
{
label: 'X',
href: 'https://x.com/docusaurus',
},
],
},
{
title: 'More',
items: [
{
label: 'Blog',
to: '/blog',
},
{
label: 'GitHub',
href: 'https://github.com/tumbleweed-cdc',
},
],
},
],
copyright: `Copyright © ${new Date().getFullYear()} My Project, Inc. Built with Docusaurus.`,
},
// footer: {
// style: 'light',
// logo: {
// alt: "Tumbleweed Logo",
// src: "img/transparent_tumbleweed_logo.png",
// width: 200,
// },
// copyright: `Copyright © ${new Date().getFullYear()} Tumbleweed`,
// },
prism: {
theme: prismThemes.github,
darkTheme: prismThemes.dracula,
