From 34695b7ced5c068fe1e8765ee3863db2d4683269 Mon Sep 17 00:00:00 2001
From: Adrian
Date: Mon, 22 Apr 2024 14:38:58 +0200
Subject: [PATCH 1/5] blog post

---
 ...24-04-23-replacing-saas-with-python-etl.md | 110 ++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md

diff --git a/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md b/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md
new file mode 100644
index 0000000000..1e9a32ffd9
--- /dev/null
+++ b/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md
@@ -0,0 +1,110 @@
---
slug: replacing-saas-elt
title: "Replacing SaaS ETL with Python dlt: A painless experience for Yummy.eu"
image: https://storage.googleapis.com/dlt-blog-images/embeddable-etl.png
authors:
  name: Adrian Brudaru
  title: Open source Data Engineer
  url: https://github.com/adrianbr
  image_url: https://avatars.githubusercontent.com/u/5762770?v=4
tags: [full code etl, yes code etl, etl, python elt]
---

About [Yummy.eu](https://about.yummy.eu/)

Yummy is a Lean-ops meal-kit company that streamlines the entire food preparation process for customers in emerging markets by providing personalized recipes,
nutritional guidance, and even shopping services. Their innovative approach ensures a hassle-free, nutritionally optimized meal experience,
making daily cooking convenient and enjoyable.

Yummy is a food box business. At the intersection of gastronomy and logistics, this market is very competitive. To make it in this market, Yummy needs to be fast and informed in its operations.

### Pipelines are not yet a commodity.

At Yummy, time is the most valuable resource. When presented with the option to buy vs. build, it's a no-brainer - BUY!

Yummy's use case for data pipelining was copying SQL data with low latency and a high SLA, for operational as well as analytical usage.

Unfortunately, money does not buy ~~happiness~~ fast, reliable pipelines.

## A rant about how SaaS vendors will rip you off if they can: A common experience of opaque pricing and black hat practices

There is plenty of literature about how SaaS vendors take money where they shouldn't, for example:

- **Pricing transparency**: By billing for things you cannot measure, it is difficult to plan cost, and you might be surprised by a 2-10x higher cost than you expected (never pleasantly - that would be advertised upfront).
- **Default to max payment:** New table? No problem, vendors will make sure to replicate it by default at max cost to you (full refresh). Imagine if your waiter decided to bring you dessert and bill you for it - you'd be outraged, no?
- **Customization options**: While the automation of processes simplifies data integration significantly, it also limits customization. Greater flexibility in customizing ETL processes would benefit those with complex data workflows and specific business requirements.
- **Connector dependency**: The extensive range of connectors greatly facilitates data integration across various sources. However, the functionality heavily relies on the availability and maintenance of these connectors. More frequent updates and expansions to the connector library would ensure robust data integration capabilities for all users.
- **Feature access across different plans**: Many essential features are accessible only at higher subscription tiers.
Providing a broader range of critical features across all plans would make the platform more competitive and accessible to a wider range of businesses.
- **Data sync frequencies**: Limitations on data sync frequencies in lower-tier plans can hinder businesses that require more frequent updates. Offering more flexibility with sync frequencies across all plans would better support the needs of businesses with dynamic data requirements.

In Martin's case, cost was not even the main factor - the delays were a pain, and so was the lack of reliability. And when Martin added a state log table to the production database and it ended up fully replicated over and over by default, generating absurdly high bills, he had enough.

This is a very common experience. And you won't get your money back by complaining either, even though the actual cost of delivering the service to you was at most 1% of your bill. Because it was no accident!

Suppose you set up a tool for the first time. It's the vendor's responsibility to walk you through all the considerations they created, and to present that information with clarity.

If I were in the shoes of the person who had to explain this bill to finance, I would avoid those tools like the plague. To be overcharged, and dependent on poor-quality service? I'll take a hard pass.

### Back to what's important: Velocity, Reliability, Speed, Time, Money

Cost aside, the main requirements for Yummy were:

- Velocity: it should be fast to set up and not take much of our time.
- Reliability: the data should not stop flowing, as it's used for operations.
- Speed: it should be fast to run.
- Ideally, it would also be cheap.
- Ideally, it would be able to extract once and write to multiple destinations.

Besides the velocity of setup, none of these requirements were met by the vendor.

## 10x faster, 182x cheaper with dlt + async + modal

Fast to build, fast & cheap to run.

- dlt lets you go from idea to pipeline in minutes due to its simplicity.
- async lets you leverage parallelism.
- [Modal](https://modal.com/) gives you orchestrated parallelism at low cost.

By building a dlt Postgres source with async functionality and running it on Modal, the important work was done.

- Velocity: easy to build with dlt.
- Reliability: both dlt and Modal are reliable.
- Speed: 10x gains over the SaaS tool.
- Cost: 182x cheaper than the SaaS vendor (it doesn't help their case to have costly defaults).
- Extract once, write to multiple destinations: dlt can do this; with a SaaS vendor you would need to chain pipelines, multiplying costs and delays.

### The build

Yummy needed the data to stay competitive, so Martin wasted no more time and set off to create one of the cooler pipelines I've seen - a Postgres async source that extracts data asynchronously by slicing tables and passing the data to dlt, which can also process it in parallel.

https://gist.github.com/salomartin/c0d4b0b5510feb0894da9369b5e649ff

### The outcome

ETL costs went down 182x per month and sync time improved 10x, using Modal Labs and dlt and dropping Fivetran.

[![salo-martin-tweet](https://storage.googleapis.com/dlt-blog-images/martin_salo_tweet.png)](https://twitter.com/salomartin/status/1755146404773658660)

## Closing thoughts

Taking back control of your data stack has never been easier, given the ample open source options.

SQL copy pipelines **are** a commodity.
They are mostly interchangeable: there is no special transformation or schema difference between them,
and you can find many of them for free online. Aside from the rarer ones that have special requirements, there's little uniqueness
distinguishing the various SQL copy pipelines from each other.

SQL to SQL copy pipelines remain one of the most common use cases in the industry. Despite the low complexity of such pipelines,
this is where many vendors make a fortune, charging thousands of euros monthly for something that could be run for the cost of a few coffees.

At dltHub, we encourage you to use simple, free resources to take back control of your data deliverability and budgets.

From our experience, setting up a SQL pipeline takes only minutes.

Try these pipelines:

- [30+ SQL database sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database)
- [Martin's async Postgres source](https://gist.github.com/salomartin/c0d4b0b5510feb0894da9369b5e649ff)
- [Arrow + connectorx](https://www.notion.so/Martin-Salo-Yummy-2061c3139e8e4b7fa355255cc994bba5?pvs=21) for up to 30x faster copies

Need help? [Join our community!](https://dlthub.com/community)
\ No newline at end of file

From 478e650c6831bf2b0567e94017b99074e6dd05b0 Mon Sep 17 00:00:00 2001
From: Adrian
Date: Mon, 22 Apr 2024 14:39:37 +0200
Subject: [PATCH 2/5] blog post

---
 docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md b/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md
index 1e9a32ffd9..926277e89d 100644
--- a/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md
+++ b/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md
@@ -1,7 +1,7 @@
---
slug: replacing-saas-elt
title: "Replacing SaaS ETL with Python dlt: A painless experience for Yummy.eu"
-image: https://storage.googleapis.com/dlt-blog-images/embeddable-etl.png
+image: https://storage.googleapis.com/dlt-blog-images/martin_salo_tweet.png
authors:
  name: Adrian Brudaru
  title: Open source Data Engineer

From 1a6310635a2cc8925f4dac7b9146c14e0d383c00 Mon Sep 17 00:00:00 2001
From: Adrian
Date: Mon, 22 Apr 2024 15:09:00 +0200
Subject: [PATCH 3/5] blog post

---
 ...24-04-23-replacing-saas-with-python-etl.md | 45 +++++++++----------
 1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md b/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md
index 926277e89d..4fba8457a1 100644
--- a/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md
+++ b/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md
@@ -16,47 +16,45 @@ Yummy is a Lean-ops meal-kit company streamlines the entire food preparation pro
nutritional guidance, and even shopping services. Their innovative approach ensures a hassle-free, nutritionally optimized meal experience,
making daily cooking convenient and enjoyable.

Yummy is a food box business. At the intersection of gastronomy and logistics, this market is very competitive.
To make it in this market, Yummy needs to be fast and informed in its operations.

### Pipelines are not yet a commodity.

At Yummy, time is the most valuable resource.
When presented with the option to buy vs build, it’s a no brainer - BUY! +At Yummy, time is the most valuable resource. When presented with the option to buy vs build, Yummy's CTO Martin thought not to waste time and BUY! Yummy’s use cases for data pipelining was copying sql data with low latency and high SLA for operational usage as well as analytical. Unfortunately, money does not buy ~~happiness~~ fast, reliable pipelines. -## A rant about how SaaS vendors will rip you off if they can: A common experience of opaque pricing and black hat practices +### What’s important: Velocity, Reliability, Speed, time. Money is secondary. + +Cost aside, the main requirement for Yummy was as below: -There is lots of literature about how saas vendors take money where they shouldn’t, for example: +- Velocity: it should not take much of our time, fast to set up +- Reliabilty: the data should not stop flowing as it’s used for operations. +- Speed: It should load data with low latency +- ideally, it would also be cheap +- ideally, it would be able to extract once and write to multiple destinations. -- **Pricing transparency**: By billing for things you cannot measure, it is difficult to plan cost and you might be surprised by 2-10x cost compared to what you thought (never pleasantly - that would be advertised upfront). -- **Default to max payment:** New table? no problem, vendors will make sure to replicate it by default at max cost to you (full refresh). Imagine if your waiter decided to bring you desert and bill you for it, you’d be outraged, no? -- **Customization options**: While the automation of processes simplifies data integration significantly, it also limits customization. Greater flexibility in customizing ETL processes would benefit those with complex data workflows and specific business requirements. -- **Connector dependency**: The extensive range of connectors greatly facilitates data integration across various sources. However, the functionality heavily relies on the availability and maintenance of these connectors. More frequent updates and expansions to the connector library would ensure robust data integration capabilities for all users. -- **Feature access across different plans**: Many essential features are accessible only at higher subscription tiers. Providing a broader range of critical features across all plans would make the platform more competitive and accessible to a wider range of businesses. -- **Data sync frequencies**: Limitations on data sync frequencies in lower-tier plans can hinder businesses that require more frequent updates. Offering more flexibility with sync frequencies across all plans would better support the needs of businesses with dynamic data requirements. +Martin found the velocity to set up to be good, but everything else lacking. -In the case of Martin, cost was not even the main factor - the delays were a pain and so was the lack of reliability. And when Martin added a state log table to the production database and that ended up full replicated over and over by default, generating stupid high bills, he had enough. +## Enough is enough! Black hat practices of vendors drove Martin away. -This is a very common experience. And you won’t get your money back by complaining either, despite the actual cost of delivering you the service was only 1% of your bill at most. Because, it was no accident! +Martin, like many others, was very open to using the saas service and was happy to not do it himself. -Suppose you set up a tool for the first time. 
It’s the vendor’s responsibility to take your attention through all the considerations they created, and give you the information with clarity. +However, he quickly ran into the dark side of saas vendor service. Initially, the latency and reliability were annoying, but not enough reason to move. -If I were in the shoes of the person who had to explain this bill to finance, I would avoid those tools like the plague. To be overcharged, and dependent on poor quality service? I’ll take a hard pass. +Martin's patience ran out when a state log table added to his production database was automatically replicated in full, repeatedly. -### Back to what’s important: Velocity, Reliability, Speed, time, money +This default setting led to exorbitantly high charges that were neither justified nor sustainable, pushing him to seek better solutions. -Cost aside, the main requirement for Yummy was +This is a common issue which is "by design" as people complain about it for over a decade. But the majority of customers were "born yesterday" into the etl marketplace and the vendors are ready to take them for a ride. -- Velocity: it should not take much of our time, fast to set up -- Reliabilty: the data should not stop flowing as it’s used for operations. -- Speed: It should be fast to run -- ideally, it would also be cheap -- ideally, it would be able to extract once and write to multiple destinations. + -Besides the velocity to set up, none of the other requirements were met by the vendor. ## 10x faster, 182x cheaper with dlt + async + modal @@ -83,6 +81,7 @@ https://gist.github.com/salomartin/c0d4b0b5510feb0894da9369b5e649ff ### The outcome ETL cost down 182x per month, sync time improved 10x using Modal labs and dlt and dropping fivetran. +Martin was happy enough that he agreed to go on a call and tell us about it :) [![salo-martin-tweet](https://storage.googleapis.com/dlt-blog-images/martin_salo_tweet.png)](https://twitter.com/salomartin/status/1755146404773658660) From 2e69e2f63a5b87897871c03632a7ddec6146e077 Mon Sep 17 00:00:00 2001 From: Adrian Date: Mon, 22 Apr 2024 15:10:37 +0200 Subject: [PATCH 4/5] format --- .../blog/2024-04-23-replacing-saas-with-python-etl.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md b/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md index 4fba8457a1..4e58849665 100644 --- a/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md +++ b/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md @@ -74,9 +74,10 @@ By building a dlt postgres source with async functionality and running it on Mod ### The build -Yummy needed the data to stay competitive, so Martin wasted no more time, and set of to create one of the cooler pipelines i’ve seen - a postgres async source that will extract data async by slicing tables and passing the data to dlt, which can also process it in paralel. +Yummy needed the data to stay competitive, so Martin wasted no more time, and set of to create one of the cooler pipelines I’ve seen - a postgres async +source that will extract data async by slicing tables and passing the data to dlt, which can also process it in parallel. -https://gist.github.com/salomartin/c0d4b0b5510feb0894da9369b5e649ff +Check out [Martin's async postgres source in this gist here](https://gist.github.com/salomartin/c0d4b0b5510feb0894da9369b5e649ff). 
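To give a feel for the pattern without reproducing Martin's code, here is a simplified sketch of an async, sliced Postgres source for dlt. It is only an illustration under stated assumptions: asyncpg as the driver, an `orders` table with a numeric `id` column to slice on, placeholder credentials, and a dlt version recent enough to accept async generators as resources.

```python
# Simplified sketch of an async, sliced Postgres source for dlt.
# NOT Martin's actual gist: asyncpg, the `orders` table, its numeric
# `id` column, and the DSN are all illustrative assumptions.
import asyncio

import asyncpg
import dlt

DSN = "postgresql://user:password@host:5432/db"  # placeholder

async def fetch_slice(pool, lo, hi):
    # Each slice covers an independent id range, so slices run concurrently.
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            "SELECT * FROM orders WHERE id >= $1 AND id < $2", lo, hi
        )
    return [dict(r) for r in rows]

@dlt.resource(name="orders", write_disposition="replace")
async def orders(slice_size: int = 100_000):
    pool = await asyncpg.create_pool(DSN)
    try:
        max_id = await pool.fetchval("SELECT max(id) FROM orders") or 0
        slices = [
            fetch_slice(pool, lo, lo + slice_size)
            for lo in range(0, max_id + 1, slice_size)
        ]
        # Yield slices as they complete; dlt normalizes and loads them.
        for coro in asyncio.as_completed(slices):
            yield await coro
    finally:
        await pool.close()

pipeline = dlt.pipeline(
    pipeline_name="pg_async_copy", destination="duckdb", dataset_name="raw"
)
print(pipeline.run(orders()))
```

A script like this can then be wrapped in a scheduled function on Modal (or any orchestrator), which is what keeps the orchestrated parallelism cheap to operate.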
### The outcome From a0f0a80a17785a788d4e71abc250b4bb5841b056 Mon Sep 17 00:00:00 2001 From: Adrian Date: Tue, 23 Apr 2024 14:50:55 +0200 Subject: [PATCH 5/5] format --- ...24-04-23-replacing-saas-with-python-etl.md | 87 +++++-------------- 1 file changed, 23 insertions(+), 64 deletions(-) diff --git a/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md b/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md index 4e58849665..eac2d6908f 100644 --- a/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md +++ b/docs/website/blog/2024-04-23-replacing-saas-with-python-etl.md @@ -21,90 +21,49 @@ To make it in this market, Yummy needs to be fast and informed in their operatio ### Pipelines are not yet a commodity. -At Yummy, time is the most valuable resource. When presented with the option to buy vs build, Yummy's CTO Martin thought not to waste time and BUY! - -Yummy’s use cases for data pipelining was copying sql data with low latency and high SLA for operational usage as well as analytical. - -Unfortunately, money does not buy ~~happiness~~ fast, reliable pipelines. +At Yummy, efficiency and timeliness are paramount. Initially, Martin, Yummy’s CTO, chose to purchase data pipelining tools for their operational and analytical +needs, aiming to maximize time efficiency. However, the real-world performance of these purchased solutions did not meet expectations, which +led to a reassessment of their approach. ### What’s important: Velocity, Reliability, Speed, time. Money is secondary. -Cost aside, the main requirement for Yummy was as below: - -- Velocity: it should not take much of our time, fast to set up -- Reliabilty: the data should not stop flowing as it’s used for operations. -- Speed: It should load data with low latency -- ideally, it would also be cheap -- ideally, it would be able to extract once and write to multiple destinations. - -Martin found the velocity to set up to be good, but everything else lacking. - -## Enough is enough! Black hat practices of vendors drove Martin away. - -Martin, like many others, was very open to using the saas service and was happy to not do it himself. - -However, he quickly ran into the dark side of saas vendor service. Initially, the latency and reliability were annoying, but not enough reason to move. - -Martin's patience ran out when a state log table added to his production database was automatically replicated in full, repeatedly. +Martin was initially satisfied with the ease of setup provided by the SaaS services. -This default setting led to exorbitantly high charges that were neither justified nor sustainable, pushing him to seek better solutions. - -This is a common issue which is "by design" as people complain about it for over a decade. But the majority of customers were "born yesterday" into the etl marketplace and the vendors are ready to take them for a ride. +The tipping point came when an update to Yummy’s database introduced a new log table, leading to unexpectedly high fees due to the vendor’s default settings that automatically replicated new tables fully on every refresh. This situation highlighted the need for greater control over data management processes and prompted a shift towards more transparent and cost-effective solutions. ## 10x faster, 182x cheaper with dlt + async + modal -Fast to build, fast & cheap to run. 
- -- dlt lets you go from idea to pipeline in minutes due to its simplicity -- async lets you leverage parallelism -- [Modal](https://modal.com/) give you orchestrated parallelism at low cost - -By building a dlt postgres source with async functionality and running it on Modal, the important stuff was done. - -- Velocity: easy to build with dlt. -- Reliabilty: Both dlt and modal are reliable. -- Speed: 10x gains over the saas tool -- cost: 183x cheaper than the saas vendor (it doesn’t help their case to have costly defaults) -- ideally, it would be able to extract once and write to multiple destinations. Dlt can, to do this with a saas vendor you would need to chain pipelines and thus also multiply costs and delays. +Motivated to find a solution that balanced cost with performance, Martin explored using dlt, a tool known for its simplicity in building data pipelines. +By combining dlt with asynchronous operations and using [Modal](https://modal.com/) for managed execution, the improvements were substantial: -### The build +* Data processing speed increased tenfold. +* Cost reduced by 182 times compared to the traditional SaaS tool. +* The new system supports extracting data once and writing to multiple destinations without additional costs. -Yummy needed the data to stay competitive, so Martin wasted no more time, and set of to create one of the cooler pipelines I’ve seen - a postgres async -source that will extract data async by slicing tables and passing the data to dlt, which can also process it in parallel. +For a peek into on how Martin implemented this solution, [please see Martin's async Postgres source on GitHub.](https://gist.github.com/salomartin/c0d4b0b5510feb0894da9369b5e649ff). -Check out [Martin's async postgres source in this gist here](https://gist.github.com/salomartin/c0d4b0b5510feb0894da9369b5e649ff). - -### The outcome - -ETL cost down 182x per month, sync time improved 10x using Modal labs and dlt and dropping fivetran. -Martin was happy enough that he agreed to go on a call and tell us about it :) [![salo-martin-tweet](https://storage.googleapis.com/dlt-blog-images/martin_salo_tweet.png)](https://twitter.com/salomartin/status/1755146404773658660) -## Closing thoughts - -Taking back control of your data stack has never been easier, given the ample open source options. - -SQL copy pipelines **are** a commodity. They are mostly interchangeable, there is no special transformation or schema difference between varieties of them, -and you can find many of them for free online. Besides the rarer ones that have special requirements, there' little uniqueness -distinguishing various sql copy pipelines from each other. +## Taking back control with open source has never been easier -SQL to SQL copy pipelines remain one of the most common use cases in the industry. Despite the low complexity of such pipelines, -this is where many vendors make a fortune, charging thousands of euros monthly for something that could be run for the cost of a few coffees. +Taking control of your data stack is more accessible than ever with the broad array of open-source tools available. SQL copy pipelines, often seen as a basic utility in data management, do not generally differ significantly between platforms. They perform similar transformations and schema management, making them a commodity available at minimal cost. -At dltHub, we encourage you to use simple, free resources to take back control of your data deliverability and budgets. 
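To show how little code such a pipeline needs, here is a minimal sketch built on dlt's `sql_database` verified source. The connection string and table names are placeholders, and it assumes the source was scaffolded into the project with `dlt init sql_database duckdb` first.

```python
# Minimal SQL-to-SQL copy sketch using the sql_database verified source.
# Assumes `dlt init sql_database duckdb` was run in this folder first;
# the connection string and table names are placeholders.
import dlt
from sql_database import sql_database

source = sql_database(
    credentials="postgresql://user:password@host:5432/db",
    table_names=["orders", "customers"],
)

pipeline = dlt.pipeline(
    pipeline_name="sql_copy",
    destination="duckdb",  # swap for bigquery, snowflake, etc.
    dataset_name="raw",
)

# dlt infers the schema and handles typing, normalization and loading.
print(pipeline.run(source))
```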
+SQL to SQL copy pipelines are widespread, yet many service providers charge exorbitant fees for these simple tasks. In contrast, these pipelines can often be set up and run at a fraction of the cost—sometimes just the price of a few coffees. -From our experience, the cost of setting up a SQL pipeline can be minutes. +At dltHub, we advocate for leveraging straightforward, freely available resources to regain control over your data processes and budget effectively. -Try these pipelines: +Setting up a SQL pipeline can take just a few minutes with the right tools. Explore these resources to enhance your data operations: -- [30+ sql database sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database) -- [Martin’s async postgres source](https://gist.github.com/salomartin/c0d4b0b5510feb0894da9369b5e649ff) -- [Arrow + connectorx](https://www.notion.so/Martin-Salo-Yummy-2061c3139e8e4b7fa355255cc994bba5?pvs=21) for up to 30x faster copies +- [30+ SQL database sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database) +- [Martin’s async PostgreSQL source](https://gist.github.com/salomartin/c0d4b0b5510feb0894da9369b5e649ff) +- [Arrow + connectorx](https://www.notion.so/Martin-Salo-Yummy-2061c3139e8e4b7fa355255cc994bba5?pvs=21) for up to 30x faster data transfers -Need help? [Join our community!](https://dlthub.com/community) \ No newline at end of file +For additional support or to connect with fellow data professionals, [join our community](https://dlthub.com/community).