Cloud API: Demonstrate cluster mgmt. on behalf of a Jupyter Notebook
amotl committed Nov 17, 2023
1 parent ddf5303 commit 15797d4
Showing 4 changed files with 246 additions and 4 deletions.
203 changes: 203 additions & 0 deletions examples/cloud_import.ipynb
@@ -0,0 +1,203 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# CrateDB Cloud Import\n",
"\n",
"This is an example notebook demonstrating how to load data from\n",
"files using the [Import API] interface of [CrateDB Cloud] into\n",
"a [CrateDB Cloud Cluster].\n",
"\n",
"The supported file types are CSV, JSON, Parquet, optionally with\n",
"gzip compression. They can be acquired from the local filesystem,\n",
"or from remote HTTP and AWS S3 resources.\n",
"\n",
"[CrateDB Cloud]: https://cratedb.com/docs/cloud/\n",
"[CrateDB Cloud Cluster]: https://cratedb.com/docs/cloud/en/latest/reference/services.html\n",
"[Import API]: https://community.cratedb.com/t/importing-data-to-cratedb-cloud-clusters/1467\n",
"\n",
"## Setup\n",
"\n",
"To install the client SDK, use `pip`."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"#!pip install 'cratedb-toolkit'"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## Configuration\n",
"\n",
"The notebook assumes you are appropriately authenticated to the CrateDB Cloud\n",
"platform, for example using `croud login --idp azuread`. To inspect the list\n",
"of available clusters, run `croud clusters list`.\n",
"\n",
"For addressing a database cluster, and obtaining corresponding credentials,\n",
"the program uses environment variables, which you can define interactively,\n",
"or store them within a `.env` file.\n",
"\n",
"You can use those configuration snippet as a blueprint. Please adjust the\n",
"individual settings accordingly.\n",
"```shell\n",
"CRATEDB_CLOUD_CLUSTER_NAME=Hotzenplotz\n",
"CRATEDB_USERNAME='admin'\n",
"CRATEDB_PASSWORD='H3IgNXNvQBJM3CiElOiVHuSp6CjXMCiQYhB4I9dLccVHGvvvitPSYr1vTpt4'\n",
"```"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## Acquire Database Cluster\n",
"\n",
"As a first measure, acquire a resource handle, which manages a CrateDB Cloud\n",
"cluster instance.\n",
"\n",
"For effortless configuration, it will obtain configuration settings from\n",
"environment variables as defined above."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from cratedb_toolkit import ManagedCluster, InputOutputResource\n",
"\n",
"cluster = ManagedCluster.from_env().start()"
]
},
{
"cell_type": "markdown",
"source": [
"## Import Data\n",
"\n",
"From the [NAB Data Corpus], import the \"realKnownCause\" dataset. The dataset includes\n",
"temperature sensor data of an internal component of an industrial machine, with known\n",
"anomaly causes. The first anomaly is a planned shutdown of the machine. The second\n",
"anomaly is difficult to detect and directly led to the third anomaly, a catastrophic\n",
"failure of the machine.\n",
"\n",
"On this topic, we also recommend the notebook about [MLflow and CrateDB], where the\n",
"same dataset is used for time series anomaly detection and forecasting.\n",
"\n",
"[NAB Data Corpus]: https://github.com/numenta/NAB/tree/master/data\n",
"[MLflow and CrateDB]: https://github.com/crate/cratedb-examples/tree/main/topic/machine-learning/mlops-mlflow"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 4,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001B[36m==> Info: \u001B[0mStatus: REGISTERED (Your import job was received and is pending processing.)\n",
"\u001B[36m==> Info: \u001B[0mDone importing 22.70K records\n",
"\u001B[32m==> Success: \u001B[0mOperation completed.\n"
]
},
{
"data": {
"text/plain": "CloudJob(info={'cluster_id': '09be10b6-7d78-497b-842e-fbb47642d398', 'compression': 'none', 'dc': {'created': '2023-11-17T18:54:04.070000+00:00', 'modified': '2023-11-17T18:54:04.070000+00:00'}, 'destination': {'create_table': True, 'table': 'nab-machine-failure'}, 'file': None, 'format': 'csv', 'id': '56051dc3-ee8e-4a38-9066-73bcd427d05a', 'progress': {'bytes': 0, 'details': {'create_table_sql': None}, 'failed_files': 0, 'failed_records': 0, 'message': 'Import succeeded', 'percent': 100.0, 'processed_files': 1, 'records': 22695, 'total_files': 1, 'total_records': 22695}, 'schema': {'type': 'csv'}, 'status': 'SUCCEEDED', 'type': 'url', 'url': {'url': 'https://github.com/crate/cratedb-datasets/raw/main/machine-learning/timeseries/nab-machine-failure.csv'}}, found=True, _custom_status=None, _custom_message=None)"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Define data source.\n",
"url = \"https://github.com/crate/cratedb-datasets/raw/main/machine-learning/timeseries/nab-machine-failure.csv\"\n",
"source = InputOutputResource(url=url)\n",
"\n",
"# Invoke import job. Without `target` argument, the destination\n",
"# table name will be derived from the input file name.\n",
"cluster.load_table(source=source)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## Query Data\n",
"\n",
"In order to inspect if the dataset has been imported successfully, run an SQL\n",
"command sampling a few records."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 5,
"outputs": [
{
"data": {
"text/plain": "[{'timestamp': 1386021900000, 'value': 80.78327674},\n {'timestamp': 1386024000000, 'value': 81.37357535},\n {'timestamp': 1386024600000, 'value': 80.18124978},\n {'timestamp': 1386030300000, 'value': 82.88189183},\n {'timestamp': 1386030600000, 'value': 83.57965349}]"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Query data.\n",
"cluster.query('SELECT * FROM \"nab-machine-failure\" LIMIT 5;')"
],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
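The notebook notes that when `load_table` is invoked without a `target` argument, the destination table name is derived from the input file name. A minimal stdlib-only sketch of such a derivation, assuming a simple strip-the-extension rule; the helper name and the exact rule are hypothetical, and the actual logic in cratedb-toolkit may differ:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def derive_table_name(url: str) -> str:
    """Derive a destination table name from the input file name.

    Hypothetical sketch: keep only the final path component and strip
    known file extensions, including double extensions like ".csv.gz".
    """
    name = PurePosixPath(urlparse(url).path).name
    for suffix in (".gz", ".csv", ".json", ".parquet"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name


url = (
    "https://github.com/crate/cratedb-datasets/raw/main/"
    "machine-learning/timeseries/nab-machine-failure.csv"
)
print(derive_table_name(url))  # nab-machine-failure
```

Under this rule, the CSV file imported above would land in a table named `nab-machine-failure`, matching the `destination` entry visible in the `CloudJob` output.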
10 changes: 7 additions & 3 deletions pyproject.toml
@@ -101,13 +101,17 @@ all = [
"cratedb-toolkit[influxdb,io,mongodb]",
]
develop = [
"black<24",
"black[jupyter]<24",
"mypy==1.6.1",
"poethepoet<0.25",
"pyproject-fmt<1.4",
"ruff==0.1.3",
"validate-pyproject<0.16",
]
examples = [
"ipywidgets<9",
"notebook<8",
]
influxdb = [
"influxio==0.1.1",
]
@@ -128,7 +132,7 @@ release = [
"twine<5",
]
test = [
"pueblo[dataframe]@ git+https://github.com/pyveci/pueblo.git@develop",
"pueblo[dataframe,testing]@ git+https://github.com/pyveci/pueblo.git@develop",
"pytest<8",
"pytest-cov<5",
"pytest-mock<4",
@@ -176,7 +180,7 @@ non_interactive = true
[tool.pytest.ini_options]
addopts = """
-rfEXs -p pytester --strict-markers --verbosity=3
---cov --cov-report=term-missing --cov-report=xml
+--cov=. --cov-report=term-missing --cov-report=xml
"""
minversion = "2.0"
log_level = "DEBUG"
35 changes: 34 additions & 1 deletion tests/cluster/test_examples.py
@@ -1,6 +1,10 @@
# Copyright (c) 2023, Crate.io Inc.
# Distributed under the terms of the AGPLv3 license, see LICENSE.
from pathlib import Path

import pytest
import responses
from pytest_notebook.nb_regression import NBRegressionFixture

import cratedb_toolkit

@@ -55,7 +59,7 @@ def test_example_cloud_cluster_with_deploy(mocker, mock_cloud_cluster_deploy):


@responses.activate
-def test_example_cloud_import(mocker, mock_cloud_import):
+def test_example_cloud_import_python(mocker, mock_cloud_import):
"""
Verify that the program `examples/cloud_import.py` works.
"""
@@ -73,3 +77,32 @@ def test_example_cloud_import(mocker, mock_cloud_import):
from examples.cloud_import import main

main()


@pytest.mark.skip(
"Does not work: Apparently, the 'responses' mockery " "is not properly activated when evaluating the notebook"
)
@responses.activate
def test_example_cloud_import_notebook(mocker, mock_cloud_cluster_exists):
"""
Verify the Jupyter Notebook example works.
"""

# Synthesize a valid environment.
mocker.patch.dict(
"os.environ",
{
"CRATEDB_CLOUD_SUBSCRIPTION_ID": "f33a2f55-17d1-4f21-8130-b6595d7c52db",
# "CRATEDB_CLOUD_CLUSTER_ID": "e1e38d92-a650-48f1-8a70-8133f2d5c400", # noqa: ERA001
"CRATEDB_CLOUD_CLUSTER_NAME": "testcluster",
"CRATEDB_USERNAME": "crate",
},
)

# Exercise Notebook.
fixture = NBRegressionFixture(
diff_ignore=("/metadata/language_info", "/metadata/widgets", "/cells/*/execution_count"),
)
here = Path(__file__).parent.parent.parent
notebook = here / "examples" / "cloud_import.ipynb"
fixture.check(str(notebook))
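The new notebook test synthesizes a valid environment with `mocker.patch.dict`, which wraps the standard library's `unittest.mock.patch.dict`. A stdlib-only sketch of that isolation pattern, using one of the variable names from the test above (the value is a placeholder):

```python
import os
from unittest import mock

# Remember whatever value the variable had before the patch.
before = os.environ.get("CRATEDB_CLOUD_CLUSTER_NAME")

with mock.patch.dict(os.environ, {"CRATEDB_CLOUD_CLUSTER_NAME": "testcluster"}):
    # Inside the block, the synthesized variable is visible to any
    # code that reads the process environment.
    assert os.environ["CRATEDB_CLOUD_CLUSTER_NAME"] == "testcluster"

# On exit, patch.dict restores the previous environment verbatim,
# so tests cannot leak configuration into each other.
assert os.environ.get("CRATEDB_CLOUD_CLUSTER_NAME") == before
```

This restore-on-exit behavior is why the test can hard-code cluster names and usernames without affecting other tests in the suite.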
2 changes: 2 additions & 0 deletions tests/io/test_import.py
@@ -28,6 +28,7 @@ def test_import_csv_dask(cratedb, dummy_csv):
"""
Invoke convenience function `import_csv_dask`, and verify database content.
"""
pytest.importorskip("dask")
result = cratedb.database.import_csv_dask(filepath=dummy_csv, tablename="foobar")
assert result is None

@@ -42,6 +43,7 @@ def test_import_csv_dask_with_progressbar(cratedb, dummy_csv):
This time, use `progress=True` to make Dask display a progress bar.
However, the code does not verify it.
"""
pytest.importorskip("dask")
result = cratedb.database.import_csv_dask(filepath=dummy_csv, tablename="foobar", progress=True)
assert result is None

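The `pytest.importorskip("dask")` guards added above skip the Dask import tests whenever the optional dependency is absent, instead of failing with an `ImportError`. The underlying availability check can be sketched with the standard library alone, without pytest:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True when the named top-level module can be imported."""
    return importlib.util.find_spec(name) is not None


# The stdlib "json" module is always importable; a made-up name is not.
assert has_module("json") is True
assert has_module("no_such_module_hopefully") is False
```

In a test suite, `pytest.importorskip` is preferable because it marks the test as skipped with a clear reason, rather than silently branching.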