
TheVentureCity

Data Pipeline Toolkit for Early-Stage Startups

A powerful compilation of free tools and custom code that any startup can adapt to turn raw data into visual insights

There are lots of reasons why the data your startup is collecting is important: managing your team, impressing investors, delighting customers, planning for the future, etc. But which data? Where do you get it? How do you turn it from raw information into a coherent story? And how can you do that on a tight budget? To answer those questions, TheVentureCity has developed this toolkit for startup founders who want to supplement their gut and intuition with data-driven insights.

Make Your Own Copy & Learn by Doing

The purpose of the toolkit is twofold:

  1. To allow any startup to deploy an Extract-Transform-Load (ETL) pipeline that takes raw event log data as its input and feeds visual dashboards as its output. Our intent is that you will make your own copy of these tools and use them. If you need our help, just ask.
  2. To supply context about what is happening behind the scenes by walking you through the code via Jupyter Python notebooks.

TVC ETL Data Analysis Pipeline

The main engine orchestrating the ETL pipeline is Python code. It is available in two forms:

  • Notebooks using Google Colaboratory's cloud Jupyter runtime environment (see below for more notes); and
  • .py files in our GitHub repository

Each notebook contains raw Python code and/or imports our .py files to illustrate a specific type of analysis as listed below. By combining working Python code with a discussion of how it works and why it is important, these notebooks help you learn on your own by analyzing your startup's data. Once the data is extracted and transformed in memory (the "E" and "T" in "ETL"), the Python code loads it into Google Sheets (the "L" step). From there, Looker Studio (formerly known as "Google Data Studio") connects to Google Sheets to enable visualization and dissemination of the transformed data.
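
To make the flow concrete, here is a minimal sketch of the Extract, Transform, and Load steps using pandas, gspread, and gspread-dataframe. It is an illustration rather than the toolkit's actual code: the file name, workbook and worksheet titles, and event-log column names (user_id, event, timestamp) are placeholder assumptions.

```python
# Minimal E-T-L sketch (placeholder names; see the notebooks for the real pipeline)
import pandas as pd
import gspread
from gspread_dataframe import set_with_dataframe

# Extract: read raw event log data (assumed columns: user_id, event, timestamp)
events = pd.read_csv("events.csv", parse_dates=["timestamp"])

# Transform: count distinct daily active users per calendar day
events["date"] = events["timestamp"].dt.date
dau = (events.groupby("date")["user_id"]
             .nunique()
             .reset_index(name="dau"))

# Load: overwrite a Google Sheets worksheet with the transformed table;
# the workbook must be shared with the service account's email address.
# Looker Studio then reads this sheet to render the dashboard.
gc = gspread.service_account(filename="service_account.json")
worksheet = gc.open("My Startup Metrics").worksheet("DAU")  # placeholder titles
worksheet.clear()
set_with_dataframe(worksheet, dau)
```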

Even if you do not have a dedicated data analyst at this time—and most early-stage teams of 5-10 people don't have that role—make sure there is somebody on your engineering team who is tasked with instrumentation and analysis. That person is the one who most needs to review this toolkit. The Python code is fully commented and ready to run as-is in the cloud notebook environment. When you have adapted it to your business and are ready to automate your pipeline, you'll need to convert the notebooks to .py files scheduled to run at regular intervals.

Be sure to bookmark this page so you can stay up-to-date as we continue to deploy new features.

Toolkit Menu

0. Introductions to Notebooks & Google Tools

1. Data Analysis Building Blocks — Before you can start analyzing the data, you need to understand raw event log data and how to access it. Then the raw data needs some pre-processing to convert it into a “DAU Decorated” data set, which serves as the jumping-off point for the rest of the analysis. Inspecting the Google Sheets and Looker Studio pieces of the puzzle will help you understand these critical components as well.

  • Understanding Event Logs (GitHub | Colab)
  • Create the “DAU Decorated” data set (GitHub | Colab); a rough sketch of this transformation appears after this list
  • Explore the Google Sheets workbook these pipelines use to store the data after it is transformed (the "Load" step). It is read-only, so to use this pipeline on your own you need to create your own copy of the workbook in your Google Drive account.
  • Explore the Looker Studio dashboard that reads from the Google Sheets workbook to create the visualizations. It is also read-only, so create your own copy under your Google Drive account.
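
As a rough illustration of what “DAU Decorated” means, the sketch below builds a per-user, per-day table from raw events and decorates each row with the user's first active date, from which cohorts, growth accounting, and engagement metrics can all be derived. The file name and column names (user_id, timestamp, event, revenue) are assumptions; the notebook above defines the real schema.

```python
# Rough approximation of a "DAU Decorated"-style table (see the notebook for the real definition)
import pandas as pd

events = pd.read_csv("events.csv", parse_dates=["timestamp"])  # placeholder file name
events["dt"] = events["timestamp"].dt.normalize()

# One row per user per active day, with an activity count and (optional) revenue summed
dau = (events.groupby(["user_id", "dt"])
             .agg(event_count=("event", "size"),
                  inc_amt=("revenue", "sum"))  # omit this aggregation if you do not track revenue
             .reset_index())

# "Decorate" each row with the user's first active date
dau["first_dt"] = dau.groupby("user_id")["dt"].transform("min")
```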

2. Mini-Pipeline notebooks are stand-alone, "full stack" pipelines designed to teach the specifics of a particular subset of startup data analytics, carrying out each step of the Extract-Transform-Load-Visualize process along the way. In particular, each "Transform" step contains verbose, commented code and an explanation of the data transformation taking place. We suggest you review these Mini-Pipelines first before trying to implement the Full Pipeline below.

Note: the embedded iFrame visualizations from Looker Studio do not render in the GitHub version of the notebooks. To see the visuals, open the Colab version of the notebook or visit the Looker Studio dashboard.

  • The Mini-Pipeline: MAU Growth Accounting notebook (GitHub | Colab) aggregates DAU Decorated at a monthly level, categorizes the different types of users in each month, and then uses that information to arrive at a measure of growth efficiency called the Quick Ratio. Be sure to check out our post introducing this concept, Quick Ratio as a Shortcut to Understand Product Growth. (A sketch of the Quick Ratio calculation appears after this list.)

  • The Mini-Pipeline: Cohort Analysis notebook (GitHub | Colab) transforms the DAU Decorated data set into a cohort analysis dataframe to examine monthly user retention and cohort revenue LTV. Cohort retention metrics help us see how long users continue to use the product after the first time they use it. Good retention makes growth so much easier and more efficient: newly-acquired users count toward user growth rather than merely replacing lost users. (A sketch of the cohort retention calculation appears after this list.)

  • The Mini-Pipeline: Engagement notebook (GitHub | Colab) shows how to transform the DAU Decorated data set into engagement dataframes to analyze a DAU Histogram, Active Days per Month over Time, and Multi-Day Users Ratio over Time. Engagement metrics gauge the extent to which users find value in the product by measuring the frequency with which they use it. In this way, we can use data to assess and track product-market fit, an important but tricky concept for which data helps supplement gut feel. Solid engagement sets the stage for retention over a long period of time.
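
To show what the MAU Growth Accounting transformation is doing, here is a hedged sketch that classifies each month's users as new, retained, resurrected, or churned and computes the Quick Ratio from a DAU Decorated-style table. The file and column names are assumptions, not the toolkit's exact schema.

```python
# Sketch of monthly growth accounting and the Quick Ratio
import pandas as pd

dau = pd.read_csv("dau_decorated.csv", parse_dates=["dt", "first_dt"])  # placeholder file
dau["month"] = dau["dt"].dt.to_period("M")
dau["first_month"] = dau["first_dt"].dt.to_period("M")

# One row per user per active month
mau = dau[["user_id", "month", "first_month"]].drop_duplicates(["user_id", "month"])

rows = []
months = sorted(mau["month"].unique())
for prev, curr in zip(months, months[1:]):
    prev_users = set(mau.loc[mau["month"] == prev, "user_id"])
    curr_rows = mau[mau["month"] == curr]
    curr_users = set(curr_rows["user_id"])
    new = set(curr_rows.loc[curr_rows["first_month"] == curr, "user_id"])
    retained = curr_users & prev_users
    resurrected = curr_users - prev_users - new
    churned = prev_users - curr_users
    quick_ratio = (len(new) + len(resurrected)) / max(len(churned), 1)
    rows.append({"month": str(curr), "new": len(new), "retained": len(retained),
                 "resurrected": len(resurrected), "churned": len(churned),
                 "quick_ratio": round(quick_ratio, 2)})

growth_accounting = pd.DataFrame(rows)
```

A Quick Ratio above 1 means a month added more new and resurrected users than it lost to churn.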
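
Along the same lines, here is a rough sketch of the cohort retention matrix computed in the Cohort Analysis notebook, again with assumed file and column names rather than the toolkit's exact ones.

```python
# Sketch of a monthly cohort retention matrix from a DAU Decorated-style table
import pandas as pd

dau = pd.read_csv("dau_decorated.csv", parse_dates=["dt", "first_dt"])  # placeholder file
dau["month"] = dau["dt"].dt.to_period("M")
dau["cohort"] = dau["first_dt"].dt.to_period("M")
dau["months_since_first"] = ((dau["month"].dt.year - dau["cohort"].dt.year) * 12
                             + (dau["month"].dt.month - dau["cohort"].dt.month))

# Distinct active users per cohort per month of life
cohort_counts = (dau.drop_duplicates(["user_id", "month"])
                    .pivot_table(index="cohort", columns="months_since_first",
                                 values="user_id", aggfunc="nunique"))

# Retention = share of each cohort still active N months after their first month
retention = cohort_counts.div(cohort_counts[0], axis=0).round(3)
```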
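
And for the Engagement notebook, a brief sketch of two of its measures, active days per month and the multi-day users ratio, under the same assumed schema.

```python
# Sketch of two engagement measures from a DAU Decorated-style table
import pandas as pd

dau = pd.read_csv("dau_decorated.csv", parse_dates=["dt"])  # placeholder file
dau["month"] = dau["dt"].dt.to_period("M")

# Active days per user per month (feeds the DAU histogram and its trend over time)
active_days = (dau.drop_duplicates(["user_id", "dt"])
                  .groupby(["month", "user_id"])["dt"].nunique()
                  .rename("active_days")
                  .reset_index())

avg_active_days = active_days.groupby("month")["active_days"].mean()

# Multi-day users ratio: share of each month's users active on more than one day
multi_day_ratio = (active_days.assign(multi=active_days["active_days"] > 1)
                              .groupby("month")["multi"].mean())
```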

3. Full Pipeline -- This notebook (GitHub | Colab) combines the logic from each of the mini-pipelines into one. Instead of using verbose, inline code, it leverages TheVentureCity's Python libraries to perform the Transform step.

3A. Full Pipeline for Daily Use -- For a Python script that uses Google Service Account authentication, have a look at servbiz_example_pipeline.py, which loops to handle both an unsegmented and a segmented scenario. If you want to run a complete data pipeline for your business:

  1. Set up your own runtime Python environment, being sure to run pip install -r requirements.txt so that these scripts have the right versions of Pandas and the other libraries they need
  2. Create a copy of this script and config.ini in your runtime environment
  3. Configure it to point to...
    1. Your raw data source (you may need to write a SQL query)
    2. Your Google Sheets workbook, using the Service Account method. For more on how to do that, see these instructions: Google Service Account Setup
  4. Configure Looker Studio to point to the Google Sheets workbook
  5. Schedule it to run every day or week to fill your dashboard with the most up-to-date data
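
As one possible shape for steps 2, 3, and 5 (an assumption, not the repo's prescribed setup), the sketch below reads placeholder settings from config.ini, authenticates with a Google Service Account, and notes a cron entry for scheduling; the section and key names are illustrative, so follow the actual config.ini in the repo and the Google Service Account Setup instructions.

```python
# Sketch of how a daily pipeline script might read its settings and authenticate
# (section and key names below are placeholders; use the ones in the repo's config.ini)
import configparser
import gspread

config = configparser.ConfigParser()
config.read("config.ini")

events_path = config["source"]["events_csv_path"]      # placeholder key
workbook_name = config["google"]["sheets_workbook"]     # placeholder key
keyfile = config["google"]["service_account_json"]      # placeholder key

gc = gspread.service_account(filename=keyfile)
workbook = gc.open(workbook_name)  # share the workbook with the service account email

# For step 5, a daily cron entry on the machine hosting the script could look like:
#   0 6 * * * cd /path/to/data-toolkit && python servbiz_example_pipeline.py
```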

Meme: "Isn't data analytics just analyzing data?"

Credits

  • This toolkit builds upon the fantastic work done by Jonathan Hsu and his team at Social Capital
  • Thanks to Analytics Vidhya for the meme.

Notes

With the Google Colaboratory option there is no need to install any software. The other option is to install Jupyter (Python 3.6) and the relevant libraries on your local machine. Whatever your comfort level with Python, we encourage you to learn by doing: hit Shift-Enter to run each cell and see what happens. If you want exposure to some Python basics to supplement this toolkit, we recommend DataCamp, which has some excellent free courses, including a tutorial on Jupyter notebooks.