Cloud API: Demonstrate cluster mgmt. on behalf of a Jupyter Notebook
amotl committed Nov 17, 2023
1 parent ddf5303 commit 15797d4
Showing 4 changed files with 246 additions and 4 deletions.
203 changes: 203 additions & 0 deletions examples/cloud_import.ipynb
@@ -0,0 +1,203 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# CrateDB Cloud Import\n",
"\n",
"This is an example notebook demonstrating how to load data from\n",
"files using the [Import API] interface of [CrateDB Cloud] into\n",
"a [CrateDB Cloud Cluster].\n",
"\n",
"The supported file types are CSV, JSON, Parquet, optionally with\n",
"gzip compression. They can be acquired from the local filesystem,\n",
"or from remote HTTP and AWS S3 resources.\n",
"\n",
"[CrateDB Cloud]: https://cratedb.com/docs/cloud/\n",
"[CrateDB Cloud Cluster]: https://cratedb.com/docs/cloud/en/latest/reference/services.html\n",
"[Import API]: https://community.cratedb.com/t/importing-data-to-cratedb-cloud-clusters/1467\n",
"\n",
"## Setup\n",
"\n",
"To install the client SDK, use `pip`."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"#!pip install 'cratedb-toolkit'"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## Configuration\n",
"\n",
"The notebook assumes you are appropriately authenticated to the CrateDB Cloud\n",
"platform, for example using `croud login --idp azuread`. To inspect the list\n",
"of available clusters, run `croud clusters list`.\n",
"\n",
"For addressing a database cluster, and obtaining corresponding credentials,\n",
"the program uses environment variables, which you can define interactively,\n",
"or store them within a `.env` file.\n",
"\n",
"You can use those configuration snippet as a blueprint. Please adjust the\n",
"individual settings accordingly.\n",
"```shell\n",
"CRATEDB_CLOUD_CLUSTER_NAME=Hotzenplotz\n",
"CRATEDB_USERNAME='admin'\n",
"CRATEDB_PASSWORD='H3IgNXNvQBJM3CiElOiVHuSp6CjXMCiQYhB4I9dLccVHGvvvitPSYr1vTpt4'\n",
"```"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## Acquire Database Cluster\n",
"\n",
"As a first measure, acquire a resource handle, which manages a CrateDB Cloud\n",
"cluster instance.\n",
"\n",
"For effortless configuration, it will obtain configuration settings from\n",
"environment variables as defined above."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from cratedb_toolkit import ManagedCluster, InputOutputResource\n",
"\n",
"cluster = ManagedCluster.from_env().start()"
]
},
{
"cell_type": "markdown",
"source": [
"## Import Data\n",
"\n",
"From the [NAB Data Corpus], import the \"realKnownCause\" dataset. The dataset includes\n",
"temperature sensor data of an internal component of an industrial machine, with known\n",
"anomaly causes. The first anomaly is a planned shutdown of the machine. The second\n",
"anomaly is difficult to detect and directly led to the third anomaly, a catastrophic\n",
"failure of the machine.\n",
"\n",
"On this topic, we also recommend the notebook about [MLflow and CrateDB], where the\n",
"same dataset is used for time series anomaly detection and forecasting.\n",
"\n",
"[NAB Data Corpus]: https://github.com/numenta/NAB/tree/master/data\n",
"[MLflow and CrateDB]: https://github.com/crate/cratedb-examples/tree/main/topic/machine-learning/mlops-mlflow"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 4,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001B[36m==> Info: \u001B[0mStatus: REGISTERED (Your import job was received and is pending processing.)\n",
"\u001B[36m==> Info: \u001B[0mDone importing 22.70K records\n",
"\u001B[32m==> Success: \u001B[0mOperation completed.\n"
]
},
{
"data": {
"text/plain": "CloudJob(info={'cluster_id': '09be10b6-7d78-497b-842e-fbb47642d398', 'compression': 'none', 'dc': {'created': '2023-11-17T18:54:04.070000+00:00', 'modified': '2023-11-17T18:54:04.070000+00:00'}, 'destination': {'create_table': True, 'table': 'nab-machine-failure'}, 'file': None, 'format': 'csv', 'id': '56051dc3-ee8e-4a38-9066-73bcd427d05a', 'progress': {'bytes': 0, 'details': {'create_table_sql': None}, 'failed_files': 0, 'failed_records': 0, 'message': 'Import succeeded', 'percent': 100.0, 'processed_files': 1, 'records': 22695, 'total_files': 1, 'total_records': 22695}, 'schema': {'type': 'csv'}, 'status': 'SUCCEEDED', 'type': 'url', 'url': {'url': 'https://github.com/crate/cratedb-datasets/raw/main/machine-learning/timeseries/nab-machine-failure.csv'}}, found=True, _custom_status=None, _custom_message=None)"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Define data source.\n",
"url = \"https://github.com/crate/cratedb-datasets/raw/main/machine-learning/timeseries/nab-machine-failure.csv\"\n",
"source = InputOutputResource(url=url)\n",
"\n",
"# Invoke import job. Without `target` argument, the destination\n",
"# table name will be derived from the input file name.\n",
"cluster.load_table(source=source)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## Query Data\n",
"\n",
"In order to inspect if the dataset has been imported successfully, run an SQL\n",
"command sampling a few records."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 5,
"outputs": [
{
"data": {
"text/plain": "[{'timestamp': 1386021900000, 'value': 80.78327674},\n {'timestamp': 1386024000000, 'value': 81.37357535},\n {'timestamp': 1386024600000, 'value': 80.18124978},\n {'timestamp': 1386030300000, 'value': 82.88189183},\n {'timestamp': 1386030600000, 'value': 83.57965349}]"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Query data.\n",
"cluster.query('SELECT * FROM \"nab-machine-failure\" LIMIT 5;')"
],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
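The notebook notes that when `load_table` is invoked without a `target` argument, the destination table name is derived from the input file name. A minimal stdlib-only sketch of such a derivation, assuming a simple strip-the-extension rule; the helper name and the exact rule are hypothetical, and the actual logic in cratedb-toolkit may differ:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def derive_table_name(url: str) -> str:
    """Derive a destination table name from the input file name.

    Hypothetical sketch: keep only the final path component and strip
    known file extensions, including double extensions like ".csv.gz".
    """
    name = PurePosixPath(urlparse(url).path).name
    for suffix in (".gz", ".csv", ".json", ".parquet"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name


url = (
    "https://github.com/crate/cratedb-datasets/raw/main/"
    "machine-learning/timeseries/nab-machine-failure.csv"
)
print(derive_table_name(url))  # nab-machine-failure
```

Under this rule, the CSV file imported above would land in a table named `nab-machine-failure`, matching the `destination` entry visible in the `CloudJob` output.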
10 changes: 7 additions & 3 deletions pyproject.toml
@@ -101,13 +101,17 @@ all = [
"cratedb-toolkit[influxdb,io,mongodb]",
]
develop = [
"black<24",
"black[jupyter]<24",
"mypy==1.6.1",
"poethepoet<0.25",
"pyproject-fmt<1.4",
"ruff==0.1.3",
"validate-pyproject<0.16",
]
examples = [
"ipywidgets<9",
"notebook<8",
]
influxdb = [
"influxio==0.1.1",
]
@@ -128,7 +132,7 @@ release = [
"twine<5",
]
test = [
"pueblo[dataframe]@ git+https://github.com/pyveci/pueblo.git@develop",
"pueblo[dataframe,testing]@ git+https://github.com/pyveci/pueblo.git@develop",
"pytest<8",
"pytest-cov<5",
"pytest-mock<4",
@@ -176,7 +180,7 @@ non_interactive = true
[tool.pytest.ini_options]
addopts = """
-rfEXs -p pytester --strict-markers --verbosity=3
---cov --cov-report=term-missing --cov-report=xml
+--cov=. --cov-report=term-missing --cov-report=xml
"""
minversion = "2.0"
log_level = "DEBUG"
35 changes: 34 additions & 1 deletion tests/cluster/test_examples.py
@@ -1,6 +1,10 @@
# Copyright (c) 2023, Crate.io Inc.
# Distributed under the terms of the AGPLv3 license, see LICENSE.
from pathlib import Path

import pytest
import responses
from pytest_notebook.nb_regression import NBRegressionFixture

import cratedb_toolkit

@@ -55,7 +59,7 @@ def test_example_cloud_cluster_with_deploy(mocker, mock_cloud_cluster_deploy):


@responses.activate
-def test_example_cloud_import(mocker, mock_cloud_import):
+def test_example_cloud_import_python(mocker, mock_cloud_import):
"""
Verify that the program `examples/cloud_import.py` works.
"""
@@ -73,3 +77,32 @@ def test_example_cloud_import(mocker, mock_cloud_import):
from examples.cloud_import import main

main()


@pytest.mark.skip(
"Does not work: Apparently, the 'responses' mockery " "is not properly activated when evaluating the notebook"
)
@responses.activate
def test_example_cloud_import_notebook(mocker, mock_cloud_cluster_exists):
"""
Verify the Jupyter Notebook example works.
"""

# Synthesize a valid environment.
mocker.patch.dict(
"os.environ",
{
"CRATEDB_CLOUD_SUBSCRIPTION_ID": "f33a2f55-17d1-4f21-8130-b6595d7c52db",
# "CRATEDB_CLOUD_CLUSTER_ID": "e1e38d92-a650-48f1-8a70-8133f2d5c400", # noqa: ERA001
"CRATEDB_CLOUD_CLUSTER_NAME": "testcluster",
"CRATEDB_USERNAME": "crate",
},
)

# Exercise Notebook.
fixture = NBRegressionFixture(
diff_ignore=("/metadata/language_info", "/metadata/widgets", "/cells/*/execution_count"),
)
here = Path(__file__).parent.parent.parent
notebook = here / "examples" / "cloud_import.ipynb"
fixture.check(str(notebook))
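The new notebook test synthesizes a valid environment with `mocker.patch.dict`, which wraps the standard library's `unittest.mock.patch.dict`. A stdlib-only sketch of that isolation pattern, using one of the variable names from the test above (the value is a placeholder):

```python
import os
from unittest import mock

# Remember whatever value the variable had before the patch.
before = os.environ.get("CRATEDB_CLOUD_CLUSTER_NAME")

with mock.patch.dict(os.environ, {"CRATEDB_CLOUD_CLUSTER_NAME": "testcluster"}):
    # Inside the block, the synthesized variable is visible to any
    # code that reads the process environment.
    assert os.environ["CRATEDB_CLOUD_CLUSTER_NAME"] == "testcluster"

# On exit, patch.dict restores the previous environment verbatim,
# so tests cannot leak configuration into each other.
assert os.environ.get("CRATEDB_CLOUD_CLUSTER_NAME") == before
```

This restore-on-exit behavior is why the test can hard-code cluster names and usernames without affecting other tests in the suite.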
2 changes: 2 additions & 0 deletions tests/io/test_import.py
@@ -28,6 +28,7 @@ def test_import_csv_dask(cratedb, dummy_csv):
"""
Invoke convenience function `import_csv_dask`, and verify database content.
"""
pytest.importorskip("dask")
result = cratedb.database.import_csv_dask(filepath=dummy_csv, tablename="foobar")
assert result is None

@@ -42,6 +43,7 @@ def test_import_csv_dask_with_progressbar(cratedb, dummy_csv):
This time, use `progress=True` to make Dask display a progress bar.
However, the code does not verify it.
"""
pytest.importorskip("dask")
result = cratedb.database.import_csv_dask(filepath=dummy_csv, tablename="foobar", progress=True)
assert result is None

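The `pytest.importorskip("dask")` guards added above skip the Dask import tests whenever the optional dependency is absent, instead of failing with an `ImportError`. The underlying availability check can be sketched with the standard library alone, without pytest:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True when the named top-level module can be imported."""
    return importlib.util.find_spec(name) is not None


# The stdlib "json" module is always importable; a made-up name is not.
assert has_module("json") is True
assert has_module("no_such_module_hopefully") is False
```

In a test suite, `pytest.importorskip` is preferable because it marks the test as skipped with a clear reason, rather than silently branching.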