refactors google sheets #232

Merged
merged 8 commits on Aug 7, 2023
156 changes: 77 additions & 79 deletions poetry.lock

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -10,7 +10,7 @@ packages = [{include = "sources"}]
[tool.poetry.dependencies]
python = "^3.8.1"
black = "^23.3.0"
dlt = {version = "^0.3.5a", allow-prereleases = true, extras = ["redshift", "bigquery", "postgres", "duckdb"]}
dlt = {version = "^0.3.8", allow-prereleases = true, extras = ["redshift", "bigquery", "postgres", "duckdb"]}

[tool.poetry.group.dev.dependencies]
mypy = "^0.991"
138 changes: 84 additions & 54 deletions sources/google_sheets/README.md
@@ -1,62 +1,92 @@
# Google Sheets

This verified source can be used to load data from a [Google Sheets](https://www.google.com/sheets/about/) workspace onto a destination of your choice.
## Prepare your data

| Endpoints | Description |
| --- | --- |
| Tables | Tables of the spreadsheet; each table has the same name as its sheet |
| Named ranges | loaded as a separate column with an automatically generated header |
| Merged cells | retains only the cell value that was taken during the merge (e.g., top-leftmost), and every other cell in the merge is given a null value |
We recommend using [Named Ranges](link to gsheets) to indicate which data should be extracted from a particular spreadsheet, and this is how the source works by default, when called without any other options. All named ranges will be converted into tables, named after them, and stored in the destination.
* Spreadsheet users can add and remove tables by simply adding or removing ranges; you do not need to reconfigure the pipeline.
* You indicate exactly the fragments of interest, and only that data is retrieved, so it is the fastest option.
* You can name the database tables by changing the range names.

If you are not happy with the workflow above, you can (see the sketch below):
* Disable it by setting the `get_named_ranges` option to `False`
* Enable retrieving all sheets/tabs with the `get_sheets` option set to `True`
* Pass a list of ranges (in any format supported by Google Sheets) in `range_names`
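
A minimal sketch of these options (the option names are taken from the list above; the spreadsheet id is the example id used later in this README and the range is a placeholder). In practice you would typically pick one of the strategies rather than combine them:

```python
from google_sheets import google_spreadsheet

source = google_spreadsheet(
    "1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4",  # spreadsheet id or full url
    get_named_ranges=False,          # disable the default named-ranges workflow
    get_sheets=True,                 # load all sheets/tabs instead
    range_names=["Sheet 1!A1:B7"],   # or request explicit ranges (placeholder range)
)
```

A source built this way is passed to `pipeline.run(source)` as usual; credentials are read from `.dlt/secrets.toml` as shown further down.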

Initialize a `dlt` project with the following command:
```bash
dlt init google_sheets bigquery
```

### Make sure your data has headers and is a proper table
**The first row of any extracted range should contain headers**. Please make sure:
1. The header names are strings and are unique.
2. All the columns that you intend to extract have a header.
3. The data starts exactly at the origin of the range - otherwise the source will remove the padding, but that is a waste of resources!

When the source detects any problems with the headers or the table layout, **it will issue a WARNING in the log**, so it makes sense to run your pipeline script manually/locally and fix all the problems (see the sketch after this list).
1. Columns without headers will be removed and not extracted!
2. Columns that have a header but contain no data will be removed.
3. If there are any problems reading the headers (i.e. a header is not a string, is empty, or is not unique), **the header row will be extracted as data** and automatic header names will be used.
4. Empty rows are ignored.
5. `dlt` will normalize range names and headers into table and column names, so they may differ in the database from what you see in Google Sheets. Prefer lowercase names without special characters!
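
Below is a minimal local-run sketch that surfaces those WARNINGs before deploying; the pipeline and dataset names are placeholders and duckdb is used as a throwaway destination:

```python
import dlt
from google_sheets import google_spreadsheet

# quick local run: check the log output for header/layout warnings
pipeline = dlt.pipeline(
    pipeline_name="sheets_layout_check",  # placeholder name
    destination="duckdb",
    dataset_name="sheets_layout_check",
)
load_info = pipeline.run(google_spreadsheet("1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4"))
print(load_info)
```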

### Data Types
The `dlt` normalizer uses the first row of data to infer types and tries to coerce the following rows, creating variant columns if that is not possible. This is standard behavior.
**datetime** and **date** types are also recognized; this happens via additional metadata that is retrieved for the first row.
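
A tiny self-contained illustration of this rule (plain Python rows, not tied to Google Sheets; the pipeline name, the duckdb destination and the schema-inspection call are assumptions for the sketch, not part of this source):

```python
import dlt

# the first row types `amount` as an integer column; "N/A" in the second row
# cannot be coerced, so the normalizer creates a variant column for it
rows = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": "N/A"},
]

pipeline = dlt.pipeline(pipeline_name="type_demo", destination="duckdb", dataset_name="type_demo")
pipeline.run(rows, table_name="amounts")
print(pipeline.default_schema.to_pretty_yaml())  # inspect the inferred column types
```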

## Passing the spreadsheet id/url and explicit range names
You can use either the full URL of your spreadsheet, copied from the browser, e.g.
```
https://docs.google.com/spreadsheets/d/1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4/edit?usp=sharing
```
or the spreadsheet id (which is part of the URL)
```
1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4
```
Typically, you pass it directly to the `google_spreadsheet` function.

**Passing ranges**

You can pass explicit ranges to `google_spreadsheet` (see the sketch below):
1. sheet names
2. named ranges
3. any range in Google Sheets format, e.g. **sheet 1!A1:B7**
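
A short sketch of both forms (the ranges below are placeholders): the full URL or the bare spreadsheet id goes in as the first argument, and `range_names` accepts any mix of the formats above:

```python
from google_sheets import google_spreadsheet

# both calls target the same spreadsheet
by_url = google_spreadsheet(
    "https://docs.google.com/spreadsheets/d/1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4/edit?usp=sharing",
    range_names=["sheet 1!A1:B7"],
)
by_id = google_spreadsheet(
    "1VTtCiYgxjAwcIw7UM1_BSaxC3rzIpr0HwXZwd2OlPD4",
    range_names=["my_named_range", "Sheet 2"],
)
```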


## The `spreadsheet_info` table
This table is repopulated after every load and keeps information on the loaded ranges (you can query it like any other table, see the sketch below):
* the id of the spreadsheet
* the name of the range as passed to the source
* a string representation of the loaded range
* the parsed representation of the range above
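
Since it is loaded like any other table, you can inspect it with dlt's SQL client. A sketch below; the pipeline/dataset names are placeholders and the exact column names should be checked in your destination schema:

```python
import dlt

# attach to the pipeline that performed the load (names are placeholders)
pipeline = dlt.pipeline(
    pipeline_name="google_sheets_pipeline",
    destination="duckdb",
    dataset_name="sheets_data",
)

with pipeline.sql_client() as client:
    for row in client.execute_sql("SELECT * FROM spreadsheet_info"):
        print(row)
```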

## Running on Airflow (and some under the hood information)
Internally, the source loads all the data immediately when `google_spreadsheet` is called, before the pipeline executes it in `run`. No matter how many ranges you request, only two calls to the API are made to retrieve the data. This works very well with typical scripts that create a dlt source with `google_spreadsheet` and then run it with `pipeline.run`.

In the case of Airflow, the source is created and executed separately. In a typical configuration where the runner is a separate machine, **this will load the data twice**.

**Moreover, you should not use `scc` decomposition in our Airflow helper**. It would create an instance of the source for each requested range in order to run the task that corresponds to it! Following our [Airflow deployment guide](https://dlthub.com/docs/walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer#2-modify-dag-file), this is how you should use `tasks.add_run` on `PipelineTasksGroup`:
```python
import dlt
import pendulum
from airflow.decorators import dag
from dlt.helpers.airflow_helper import PipelineTasksGroup

# placeholder task defaults - adjust for your deployment
default_task_args = {
    'owner': 'airflow',
    'retries': 0,
}


@dag(
    schedule_interval='@daily',
    start_date=pendulum.datetime(2023, 2, 1),
    catchup=False,
    max_active_runs=1,
    default_args=default_task_args
)
def get_named_ranges():
    tasks = PipelineTasksGroup("get_named_ranges", use_data_folder=False, wipe_local_data=True)

    # import your source from the pipeline script
    from google_sheets import google_spreadsheet

    pipeline = dlt.pipeline(
        pipeline_name="get_named_ranges",
        dataset_name="named_ranges_data",
        destination='bigquery',
    )

    # decompose="none" runs `google_spreadsheet` in a single task
    tasks.add_run(pipeline, google_spreadsheet("1HhWHjqouQnnCIZAFa2rL6vT91YRN8aIhts22SUUR580"), decompose="none", trigger_rule="all_done", retries=0, provide_context=True)


get_named_ranges()
```

Here, we chose BigQuery as the destination. To choose a different destination, replace `bigquery` with your [destination](https://dlthub.com/docs/dlt-ecosystem/destinations) of choice.

## Grab Google Sheets credentials

To read about grabbing the Google Sheets credentials and configuring the verified source, please refer to the [full documentation](https://dlthub.com/docs/dlt-ecosystem/verified-sources/google_sheets#google-sheets-api-authentication).

## Add credentials

1. Open `.dlt/secrets.toml`
2. From the .json that you downloaded earlier, copy “project_id”, “private_key”, and “client_email” as follows:

```toml
[sources.google_spreadsheet.credentials]
project_id = "set me up" # GCP project ID of the source
private_key = "set me up" # Unique private key (must be copied fully, including BEGIN and END PRIVATE KEY)
client_email = "set me up" # Email of the source service account
location = "set me up" # Project location, e.g. "US"
```

3. Enter credentials for your chosen destination as per the [docs](https://dlthub.com/docs/dlt-ecosystem/destinations/).

## Run the pipeline

1. Install the requirements by using the following command:

```bash
pip install -r requirements.txt
```

2. Run the pipeline by using the following command:

```bash
python3 google_sheets_pipeline.py
```

3. Use the following command to make sure that everything loaded as expected:

```bash
dlt pipeline google_sheets_pipeline show
```



💡 To explore additional customizations for this pipeline, we recommend referring to the official dlt Google Sheets documentation. It provides comprehensive information and guidance on how to further customize and tailor the pipeline to suit your specific needs. You can find it in the [Setup Guide: Google Sheets](https://dlthub.com/docs/dlt-ecosystem/verified-sources/google_sheets).
## Setup credentials
[We recommend using a service account for any production deployment](https://dlthub.com/docs/dlt-ecosystem/verified-sources/google_sheets#google-sheets-api-authentication).
