basic databricks support #152

Open
wants to merge 12 commits into
base: main
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -178,3 +178,6 @@ carto_credentials.json
.idea/codeStyles/codeStyleConfig.xml
.idea/codeStyles/Project.xml
.idea/.gitignore

# Vim
*.swp
2 changes: 1 addition & 1 deletion Makefile
Expand Up @@ -10,7 +10,7 @@ init:
	[ -d $(VENV) ] || python3 -m venv $(VENV)
	$(BIN)/pip install -r requirements-dev.txt
	$(BIN)/pre-commit install
	$(BIN)/pip install -e .[snowflake,bigquery]
	$(BIN)/pip install -e .[all]

lint:
	$(BIN)/black raster_loader setup.py
15 changes: 15 additions & 0 deletions README.md
Expand Up @@ -21,6 +21,7 @@ pip install -U raster-loader

pip install -U raster-loader"[bigquery]"
pip install -U raster-loader"[snowflake]"
pip install -U raster-loader"[databricks]"
```

### Installing from source
Expand All @@ -31,6 +32,7 @@ cd raster-loader
pip install .
```


## Usage

There are two ways you can use Raster Loader:
Expand Down Expand Up @@ -150,6 +152,19 @@ project.
[ROADMAP.md](ROADMAP.md) contains a list of features and improvements planned for future
versions of Raster Loader.

### Installing for Development

```
make init
source env/bin/activate
```

Doing `which carto` should return something like `/my/local/filesystem/raster-loader/env/bin/carto` instead of the system-wide installation.

The `-e` flag passed to `pip install` installs the project in development (editable) mode. Changes to the project files
are reflected in the `carto` command immediately, without re-running any setup steps.


## Releasing

### 1. Create and merge a release PR updating the CHANGELOG
31 changes: 30 additions & 1 deletion docs/source/user_guide/cli.rst
Expand Up @@ -42,10 +42,25 @@ Snowflake:
To use the snowflake utilities, use the ``carto snowflake`` command. This command has
several subcommands, which are described below.

Using the Raster Loader with Databricks
-----------------------------------------

Before you can upload a raster file, you need to have set up the following in
Databricks:

#. A Databricks instance host, e.g. ``https://dbc-abcde12345-678f.cloud.databricks.com``
#. A cluster ID (the cluster MUST be running)
#. A Personal Access Token (PAT). See `Databricks PAT Docs <https://docs.databricks.com/en/dev-tools/auth/pat.html>`_.
#. A catalog
#. A schema (in the same catalog)

To use the databricks utilities, use the ``carto databricks`` command. This command has
several subcommands, which are described below.

Uploading a raster layer
------------------------

To upload a raster file, use the ``carto [bigquery|snowflake] upload`` command.
To upload a raster file, use the ``carto [bigquery|snowflake|databricks] upload`` command.

The input raster must be a ``GoogleMapsCompatible`` raster. You can make your raster compatible
by converting it with the following GDAL command:
Expand Down Expand Up @@ -98,6 +113,20 @@ The same operation, performed with Snowflake, would be:
Authentication parameters are explicitly required in this case for Snowflake, since they
are not set up in the environment.

The same operation, performed with Databricks, would be:

.. code-block:: bash

   carto databricks upload \
     --host 'https://dbc-12345abc-123f.cloud.databricks.com' \
     --token <token> \
     --cluster-id '0123-456789-abc12345xyz' \
     --catalog 'main' \
     --schema default \
     --file_path \
       /path/to/my/raster/file.tif \
     --table mydatabrickstable

If no band is specified, the first band of the raster will be uploaded. If the
``--band`` flag is set, the specified band will be uploaded. For example, the following
command uploads the second band of the raster:
3 changes: 2 additions & 1 deletion docs/source/user_guide/installation.rst
Expand Up @@ -22,13 +22,14 @@ To install from source:
In most cases, it is recommended to install Raster Loader in a virtual environment.
Use venv_ to create and manage your virtual environment.

The above will install the dependencies required to work with both BigQuery and Snowflake and. In case you only want to work with one of them, you can install the
The above will install the dependencies required to work with BigQuery, Snowflake, and Databricks. If you only want to work with one of them, you can install its
dependencies separately:

.. code-block:: bash

   pip install -U raster-loader"[bigquery]"
   pip install -U raster-loader"[snowflake]"
   pip install -U raster-loader"[databricks]"

After installing the Raster Loader package, you will have access to the
:ref:`carto CLI <cli>`. To make sure the installation was successful, run the
8 changes: 7 additions & 1 deletion docs/source/user_guide/use_with_python.rst
Expand Up @@ -18,6 +18,12 @@ For BigQuery, use ``BigQueryConnection``:

   from raster_loader import BigQueryConnection

For Databricks, use ``DatabricksConnection``:

.. code-block:: python

   from raster_loader import DatabricksConnection

Then, create a connection object with the appropriate parameters.

For Snowflake:
Expand Down Expand Up @@ -48,7 +54,7 @@ For example:

.. code-block:: python

connector.upload_raster(
connection.upload_raster(
file_path = 'path/to/raster.tif',
fqn = 'database.schema.tablename',
)
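
The docs hunk above adds the ``DatabricksConnection`` import but no example of creating the connection. A minimal sketch follows, assuming the constructor accepts the same keyword arguments that ``raster_loader/cli/databricks.py`` in this PR passes to it (``host``, ``token``, ``cluster_id``, ``catalog``, ``schema``). The environment variable names and the helpers ``databricks_settings_from_env`` and ``connect`` are hypothetical and used only for illustration.

```python
import os

# Hypothetical environment variable names for this sketch; raster-loader
# itself does not define them.
ENV_TO_KWARG = {
    "DATABRICKS_HOST": "host",
    "DATABRICKS_TOKEN": "token",
    "DATABRICKS_CLUSTER_ID": "cluster_id",
    "DATABRICKS_CATALOG": "catalog",
    "DATABRICKS_SCHEMA": "schema",
}


def databricks_settings_from_env(env=None):
    """Collect connection keyword arguments, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(sorted(missing)))
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items()}


def connect(env=None):
    """Build a DatabricksConnection (assumed signature, see the note above)."""
    # Imported lazily so the helper above works without the optional extra.
    from raster_loader import DatabricksConnection

    return DatabricksConnection(**databricks_settings_from_env(env))
```

Reading the token from the environment keeps it out of scripts and shell history. Calling ``connect()`` requires the ``databricks`` extra to be installed.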
4 changes: 4 additions & 0 deletions raster_loader/__init__.py
Expand Up @@ -6,9 +6,13 @@
from raster_loader.io.snowflake import (
    SnowflakeConnection,
)
from raster_loader.io.databricks import (
    DatabricksConnection,
)

__all__ = [
    "__version__",
    "BigQueryConnection",
    "SnowflakeConnection",
    "DatabricksConnection",
]
172 changes: 172 additions & 0 deletions raster_loader/cli/databricks.py
@@ -0,0 +1,172 @@
import click
from functools import wraps, partial

from raster_loader.utils import get_default_table_name
from raster_loader.io.databricks import DatabricksConnection


def catch_exception(func=None, *, handle=Exception):
    if not func:
        return partial(catch_exception, handle=handle)

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except handle as e:
            raise click.ClickException(str(e))

    return wrapper


@click.group(context_settings=dict(help_option_names=["-h", "--help"]))
def databricks(args=None):
    """
    Manage Databricks resources.
    """
    pass


@databricks.command(help="Upload a raster file to Databricks.")
@click.option("--host", help="The Databricks host URL.", required=True)
@click.option("--token", help="The Databricks access token.", required=True)
@click.option(
    "--cluster-id", help="The Databricks cluster ID.", required=True
)  # New option
@click.option(
    "--file_path", help="The path to the raster file.", required=False, default=None
)
@click.option(
    "--file_url", help="The URL to the raster file.", required=False, default=None
)
@click.option("--catalog", help="The name of the catalog.", required=True)
@click.option("--schema", help="The name of the schema.", required=True)
@click.option("--table", help="The name of the table.", default=None)
@click.option(
    "--band",
    help="Band(s) within raster to upload. "
    "Could repeat --band to specify multiple bands.",
    default=[1],
    multiple=True,
)
@click.option(
    "--band_name",
    help="Column name(s) used to store band (Default: band_<band_num>). "
    "Could repeat --band_name to specify multiple bands column names. "
    "List of column names HAVE to pair with --band list in the same order.",
    default=[None],
    multiple=True,
)
@click.option(
    "--chunk_size", help="The number of blocks to upload in each chunk.", default=400
)
@click.option(
    "--overwrite",
    help="Overwrite existing data in the table if it already exists.",
    default=False,
    is_flag=True,
)
@click.option(
    "--append",
    help="Append records into a table if it already exists.",
    default=False,
    is_flag=True,
)
@click.option(
    "--cleanup-on-failure",
    help="Clean up resources if the upload fails. Useful for non-interactive scripts.",
    default=False,
    is_flag=True,
)
@catch_exception()
def upload(
    host,
    token,
    cluster_id,  # Accept cluster ID
    file_path,
    file_url,
    catalog,
    schema,
    table,
    band,
    band_name,
    chunk_size,
    overwrite=False,
    append=False,
    cleanup_on_failure=False,
):
    from raster_loader.io.common import (
        get_number_of_blocks,
        print_band_information,
        get_block_dims,
    )
    import os
    from urllib.parse import urlparse

    if file_path is None and file_url is None:
        raise ValueError("Need either a --file_path or --file_url")

    if file_path and file_url:
        raise ValueError("Only one of --file_path or --file_url must be provided.")

    is_local_file = file_path is not None

    # Check that band and band_name are the same length if band_name provided
    if band_name != (None,):
        if len(band) != len(band_name):
            raise ValueError("Must supply the same number of band_names as bands")
    else:
        band_name = [None] * len(band)

    # Pair band and band_name in a list of tuples
    bands_info = list(zip(band, band_name))

    # Create default table name if not provided
    if table is None:
        table = get_default_table_name(
            file_path if is_local_file else urlparse(file_url).path, band
        )

    connector = DatabricksConnection(
        host=host,
        token=token,
        cluster_id=cluster_id,  # Pass cluster_id to DatabricksConnection
        catalog=catalog,
        schema=schema,
    )

    source = file_path if is_local_file else file_url

    # Introspect raster file
    num_blocks = get_number_of_blocks(source)
    file_size_mb = 0
    if is_local_file:
        file_size_mb = os.path.getsize(file_path) / 1024 / 1024

    click.echo("Preparing to upload raster file to Databricks...")
    click.echo(f"File Path: {source}")
    click.echo(f"File Size: {file_size_mb} MB")
    print_band_information(source)
    click.echo(f"Source Band(s): {band}")
    click.echo(f"Band Name(s): {band_name}")
    click.echo(f"Number of Blocks: {num_blocks}")
    click.echo(f"Block Dimensions: {get_block_dims(source)}")
    click.echo(f"Catalog: {catalog}")
    click.echo(f"Schema: {schema}")
    click.echo(f"Table: {table}")
    click.echo(f"Number of Records Per Batch: {chunk_size}")

    click.echo("Uploading Raster to Databricks")

    connector.upload_raster(
        source,
        table,
        bands_info,
        chunk_size,
        overwrite=overwrite,
        append=append,
        cleanup_on_failure=cleanup_on_failure,
    )

    click.echo("Raster file uploaded to Databricks")
    exit(0)
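
Two pieces of the file above can be tried in isolation: the `catch_exception` decorator, which works both bare and with arguments thanks to `functools.partial`, and the band/band-name pairing done inside `upload`. The sketch below copies that logic into a stdlib-only form. `CliError` is a hypothetical stand-in for `click.ClickException`, and `pair_bands` is a hypothetical helper name; neither exists in the PR.

```python
from functools import partial, wraps


class CliError(Exception):
    """Stand-in for click.ClickException so the sketch needs only the stdlib."""


def catch_exception(func=None, *, handle=Exception):
    # Used as @catch_exception(...): no function yet, so return a decorator
    # that remembers which exception type to handle.
    if func is None:
        return partial(catch_exception, handle=handle)

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except handle as e:
            # Re-raise as a user-facing CLI error carrying the same message.
            raise CliError(str(e)) from e

    return wrapper


@catch_exception(handle=ValueError)
def pair_bands(band, band_name=(None,)):
    """Pair band numbers with column names, mirroring the checks in upload()."""
    if band_name != (None,):
        if len(band) != len(band_name):
            raise ValueError("Must supply the same number of band_names as bands")
    else:
        # No names given: every band gets the default column name later on.
        band_name = [None] * len(band)
    return list(zip(band, band_name))
```

With this, `pair_bands((1, 2), ("red", "nir"))` yields `[(1, "red"), (2, "nir")]`, while a length mismatch surfaces as a `CliError` rather than a raw traceback, which is the behaviour the decorator gives the real `upload` command.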
9 changes: 9 additions & 0 deletions raster_loader/errors.py
Expand Up @@ -16,6 +16,15 @@ def import_error_snowflake(): # pragma: no cover
    raise ImportError(msg)


def import_error_databricks():  # pragma: no cover
    msg = (
        "Databricks client is not installed.\n"
        "Please install Databricks dependencies to use this function.\n"
        'run `pip install -U raster-loader"[databricks]"` to install from pypi.'
    )
    raise ImportError(msg)
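
The code that calls `import_error_databricks` (presumably in `raster_loader/io/databricks.py`) is not shown in this diff. The sketch below shows one common way such a helper is used to guard an optional dependency. The `require_module` function is hypothetical, and the module names passed to it are placeholders rather than the actual Databricks client package.

```python
import importlib.util


def import_error_databricks():
    # Same message as the helper added in raster_loader/errors.py.
    msg = (
        "Databricks client is not installed.\n"
        "Please install Databricks dependencies to use this function.\n"
        'run `pip install -U raster-loader"[databricks]"` to install from pypi.'
    )
    raise ImportError(msg)


def require_module(module_name, on_missing):
    """Hypothetical guard: call on_missing() if a top-level module is absent."""
    if importlib.util.find_spec(module_name) is None:
        on_missing()
    return True
```

A connection class could call `require_module(...)` in its constructor, so users who installed only the `bigquery` or `snowflake` extra see the install hint rather than an error from deep inside an import chain.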


class IncompatibleRasterException(Exception):
    def __init__(self):
        self.message = (
2 changes: 2 additions & 0 deletions raster_loader/io/common.py
Expand Up @@ -242,6 +242,8 @@ def rasterio_metadata(
    metadata["num_blocks"] = int(width * height / block_width / block_height)
    metadata["num_pixels"] = width * height
    metadata["pixel_resolution"] = pixel_resolution
    metadata["crs"] = raster_crs
    metadata["transform"] = raster_dataset.transform

    return metadata
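
For reference, the `num_blocks` arithmetic in the context lines above can be checked by hand. The sketch below reproduces it in a standalone function and compares it with a per-axis ceiling count. The function names are hypothetical, and the comparison is only an observation about the formula, not a claim about how the loader tiles rasters.

```python
import math


def num_blocks_as_in_metadata(width, height, block_width, block_height):
    """Same expression as metadata["num_blocks"] in rasterio_metadata."""
    return int(width * height / block_width / block_height)


def num_blocks_by_ceiling(width, height, block_width, block_height):
    """Count blocks per axis, rounding partial edge blocks up."""
    return math.ceil(width / block_width) * math.ceil(height / block_height)


# Dimensions that are exact multiples of the block size agree: 4 * 2 blocks.
print(num_blocks_as_in_metadata(1024, 512, 256, 256))  # 8
print(num_blocks_by_ceiling(1024, 512, 256, 256))  # 8

# With partial edge blocks, the truncating formula reports fewer blocks.
print(num_blocks_as_in_metadata(1000, 500, 256, 256))  # 7
print(num_blocks_by_ceiling(1000, 500, 256, 256))  # 8
```

The two counts match whenever the raster dimensions are multiples of the block dimensions.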
