#26: Added user guide #67

Merged: 31 commits from `doc/#26-add-user-guide` into `main` on Oct 15, 2024

Commits
65893a2
Initiated the user guide
umitbuyuksahin Sep 29, 2022
d7a6fd6
Added a sample execution figure and explanations
umitbuyuksahin Sep 30, 2022
b665fc7
Merge branch 'main' into doc/#26-add-user-guide
ckunki Oct 7, 2024
1ce9d39
Initial review and update of existing PR
ckunki Oct 7, 2024
e861ab4
Described BucketFS connection in the User Guide
ckunki Oct 7, 2024
8c5899e
renamed CustomQueryHandler to ExampleQueryHandler
ckunki Oct 7, 2024
cf2872e
Renamed variable in user guide sample query
ckunki Oct 8, 2024
7655552
Merge branch 'main' into doc/#26-add-user-guide
ckunki Oct 8, 2024
80495a7
Updated user guide
ckunki Oct 9, 2024
da6c79e
Fixed bug in query_handler_runner_udf.py
ckunki Oct 9, 2024
378b1f8
Updated user guide based on empiric results
ckunki Oct 9, 2024
ced8c82
Fixed unit tests
ckunki Oct 9, 2024
df56d51
Apply suggestions from code review
ckunki Oct 10, 2024
82250db
Renamed CustomQueryHandler to ExampleQueryHandler
ckunki Oct 10, 2024
b61626c
Moved instructions for building the SLC to the developer guide
ckunki Oct 10, 2024
8ae62ad
Fixed review findings
ckunki Oct 10, 2024
6242c37
Added sample file for current experiments
ckunki Oct 10, 2024
37dc979
Fixed some review findings in the User Guide
ckunki Oct 11, 2024
81be6df
Merge branch 'main' into doc/#26-add-user-guide
ckunki Oct 11, 2024
20b3d8a
Apply suggestions from code review
ckunki Oct 11, 2024
58e717b
Updated images with sample execution
ckunki Oct 11, 2024
09a0584
Removed sample file udf-6.sql
ckunki Oct 11, 2024
e1843bd
Formatted paragraphs and remove comments
ckunki Oct 11, 2024
b466809
Remove comment in user guide
ckunki Oct 11, 2024
b335b1d
Apply suggestions from code review
ckunki Oct 14, 2024
14860aa
Fixed review findings
ckunki Oct 14, 2024
cd47569
Apply suggestions from code review
ckunki Oct 14, 2024
5ed9bba
Fixed review findings
ckunki Oct 14, 2024
616a9f8
Use python version 3.10 for version check
ckunki Oct 14, 2024
2d968f9
Update doc/user_guide/user_guide.md
ckunki Oct 15, 2024
b1372f3
Added separate section for example
ckunki Oct 15, 2024
1 change: 1 addition & 0 deletions doc/changes/changes_0.1.0.md
@@ -55,6 +55,7 @@ Code name:
### Documentation

* #9: Added README file
* #26: Added user guide

## Dependency Updates

64 changes: 61 additions & 3 deletions doc/developer_guide/developer_guide.md
@@ -1,8 +1,66 @@
# Developer Guide


The developer guide explains how to maintain and develop the Advanced Analytics Framework (AAF).

* [developer_environment](developer_environment.md)
* [building_documentation](building_documentation.md)

## Building and Installing the AAF Script Language Container (SLC)

The following command builds the SLC for the AAF:

```shell
poetry run nox -s build_language_container
```

Installing the SLC is described in the [AAF User Guide](../user_guide/user_guide.md#script-language-container-slc).

## Running Tests

The AAF comes with automated tests implemented in different programming languages, each requiring a specific environment:

| Language | Category | Database | Environment |
|----------|---------------------------------|----------|-------------|
| Python | Unit tests | no | poetry |
| Python | Integration tests with database | yes | _dev_env_ |
| Python | Integration tests w/o database | no | _dev_env_ |
| Lua | Unit tests | no | _dev_env_ |

### The Special _Development Environment_

For tests marked with environment _dev_env_ you need to:
* Install the Lua environment
* Install the AAF into the _Development Environment_
* Run the tests in the _Development Environment_

The Development Environment:
* Activates the AAF's conda environment
* Sets the environment variables `LUA_PATH`, `LUA_CPATH`, and `PATH` for executing Lua scripts

The following commands install the Lua environment and the AAF within the _Development Environment_:
```shell
poetry run -- nox -s install_lua_environment
poetry run -- nox -s run_in_dev_env -- poetry install
```

### Python Unit Tests

You can execute the unit tests without special preparation in the regular poetry environment:

```shell
poetry run pytest tests/unit_tests
```

### Python Integration Tests with and without Database

The following commands run the integration tests without and with a database:
```shell
poetry run -- nox -s run_python_test -- -- tests/integration_tests/without_db/
poetry run -- nox -s run_python_test -- -- --backend=onprem tests/integration_tests/with_db/
```

### Lua Unit Tests

The following command executes the Lua Unit Tests:
```shell
poetry run nox -s run_lua_unit_tests
```
Binary file added doc/images/sample_execution.png
1 change: 0 additions & 1 deletion doc/images/system_design_diagram.drawio

This file was deleted.

280 changes: 280 additions & 0 deletions doc/user_guide/user_guide.md
@@ -0,0 +1,280 @@
# Advanced Analytics Framework User Guide

The Advanced Analytics Framework (AAF) enables implementing complex data analysis algorithms with Exasol. Users can build on the features of the AAF in their own implementations.

## Table of Contents

* [Setup](#setup)
* [Usage](#usage)
* [Custom Algorithms](#custom-algorithms)

## Setup

### Exasol database

* The Exasol cluster must already be running with version 7.1 or later.
* Database connection information and credentials are needed for the database itself as well as for the BucketFS.

### BucketFS Connection

The AAF employs Lua scripts and User Defined Functions (UDFs). The Lua scripts orchestrate the UDFs, while the UDFs perform the actual analytic functions.

AAF keeps a common state of execution and passes input data and results between Lua and UDFs via files in the Bucket File System (BucketFS) of the Exasol database.

The following SQL statement creates such a connection to the BucketFS:

```sql
CREATE OR REPLACE CONNECTION '<CONNECTION_NAME>'
TO '{
"backend": "<BACKEND>",
"url": "<HOST>:<PORT>",
"service_name": "<SERVICE_NAME>",
"bucket_name": "<BUCKET_NAME>",
"path": "<PATH>",
"verify": <VERIFY>,
"host": "<SAAS_HOST>",
"account_id": "<SAAS_ACCOUNT_ID>",
"database_id": "<SAAS_DATABASE_ID>",
"pat": "<SAAS_PAT>"
}'
USER '{"username": "<USER_NAME>"}'
IDENTIFIED BY '{"password": "<PASSWORD>"}' ;
```

The list of elements in the connection's `TO` parameter depends on the backend you want to use. There are two different backends: `onprem` and `saas`.

The following table shows all elements for each of the backends.

| Backend  | Parameter            | Required? | Default value  | Description                                                        |
|----------|----------------------|-----------|----------------|--------------------------------------------------------------------|
| (any)    | `<CONNECTION_NAME>`  | yes       | -              | Name of the connection                                             |
| (any)    | `<USER_NAME>`        | yes       | -              | Name of the user accessing the Bucket (requires write permissions) |
| (any)    | `<PASSWORD>`         | yes       | -              | Password for accessing the Bucket (requires write permissions)     |
| (any)    | `<BACKEND>`          | yes       | -              | Which backend to use, must be either `onprem` or `saas`            |
| `onprem` | `<HOST>`             | yes       | -              | Fully qualified hostname or IP address                             |
| `onprem` | `<PORT>`             | -         | `2580`         | Port of the BucketFS service                                       |
| `onprem` | `<SERVICE_NAME>`     | -         | `bfsdefault`   | Name of the BucketFS service                                       |
| `onprem` | `<BUCKET_NAME>`      | -         | `default`      | Name of the Bucket                                                 |
| `onprem` | `<PATH>`             | -         | (empty / root) | Path inside the Bucket                                             |
| `onprem` | `<VERIFY>`           | -         | `true`         | Whether to apply TLS security to the connection                    |
| `saas`   | `<SAAS_ACCOUNT_ID>`  | yes       | -              | Account ID for accessing a SaaS database instance                  |
| `saas`   | `<SAAS_DATABASE_ID>` | yes       | -              | Database ID of an Exasol SaaS database instance                    |
| `saas`   | `<SAAS_PAT>`         | yes       | -              | Personal access token for accessing a SaaS database instance      |
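
For illustration, here is a minimal sketch (not part of the original guide) that creates an `onprem` connection via the [pyexasol](https://github.com/exasol/pyexasol) package; all host names, ports, and credentials are placeholder values, and the connection name `BFS_CON` matches the one used in the example later in this guide:

```python
# A hedged sketch: create the BucketFS connection from Python via pyexasol.
# All host names, ports, and credentials below are example values only.
import pyexasol

conn = pyexasol.connect(dsn="exasol-host:8563", user="sys", password="<DB_PASSWORD>")
conn.execute("""
    CREATE OR REPLACE CONNECTION BFS_CON
    TO '{
        "backend": "onprem",
        "url": "exasol-host:2580",
        "service_name": "bfsdefault",
        "bucket_name": "default",
        "path": "aaf",
        "verify": false
    }'
    USER '{"username": "w"}'
    IDENTIFIED BY '{"password": "<BUCKETFS_WRITE_PASSWORD>"}'
""")
```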

### AAF Python Package

The latest version of the AAF can be obtained from [PyPI](https://pypi.org); see also the [releases on GitHub](https://github.com/exasol/advanced-analytics-framework/releases):

```bash
pip install exasol-advanced-analytics-framework
```

### Script Language Container (SLC)

Exasol executes User Defined Functions (UDFs) in an isolated container whose root filesystem is derived from a Script Language Container (SLC).

Running the AAF requires an SLC. The following command
* downloads the specified version `<VERSION>` (preferably the latest) of a prebuilt AAF SLC from the [AAF releases](https://github.com/exasol/advanced-analytics-framework/releases/latest) on GitHub,
* uploads the file into the BucketFS,
* and registers it to the database.

The variable `$LANGUAGE_ALIAS` will be reused in [Additional Scripts](#additional-scripts).

```shell
LANGUAGE_ALIAS=PYTHON3_AAF
python -m exasol_advanced_analytics_framework.deploy language-container \
--dsn "$DB_HOST:$DB_PORT" \
--db-user "$DB_USER" \
--db-pass "$DB_PASSWORD" \
--bucketfs-name "$BUCKETFS_NAME" \
--bucketfs-host "$BUCKETFS_HOST" \
--bucketfs-port "$BUCKETFS_PORT" \
--bucketfs-user "$BUCKETFS_USER" \
--bucketfs-password "$BUCKETFS_PASSWORD" \
--bucket "$BUCKET_NAME" \
--path-in-bucket "$PATH_IN_BUCKET" \
--version "$VERSION" \
--language-alias "$LANGUAGE_ALIAS"
```

### Additional Scripts

Besides the BucketFS connection, the SLC, and the Python package, the AAF also requires some additional Lua scripts to be created in the Exasol database.

The following command deploys the additional scripts to the specified `DB_SCHEMA` using the `LANGUAGE_ALIAS` of the SLC:

```shell
python -m exasol_advanced_analytics_framework.deploy scripts \
--dsn "$DB_HOST:$DB_PORT" \
--db-user "$DB_USER" \
--db-pass "$DB_PASSWORD" \
--schema "$DB_SCHEMA" \
--language-alias "$LANGUAGE_ALIAS"
```

## Usage

The entry point of this framework is the `AAF_RUN_QUERY_HANDLER` script. The script is essentially a query loop that is responsible for executing the implemented algorithm.
> **Review comment (Collaborator):** We might need to explain that an algorithm is implemented as a QueryHandler and what a QueryHandler is.
>
> **@ckunki (Contributor, Oct 11, 2024):** Moved to separate ticket #195

The script takes the parameters needed to execute the desired algorithm as a JSON string. The JSON input includes two main parts:

* `query_handler`: Details of the algorithm implemented by the user.
* `temporary_output`: Information about where the temporary outputs (such as tables or BucketFS files) of the query handler are kept. These temporary outputs are removed after the execution of the query handler.

The following SQL statement shows how to call an AAF query handler:

```sql
EXECUTE SCRIPT AAF_RUN_QUERY_HANDLER('{
"query_handler": {
"factory_class": {
"module": "<CLASS_MODULE>",
"name": "<CLASS_NAME>"
},
"parameter": "<CLASS_PARAMETERS>",
"udf": {
"schema": "<UDF_DB_SCHEMA>",
"name": "<UDF_NAME>"
}
},
"temporary_output": {
"bucketfs_location": {
"connection_name": "<BUCKETFS_CONNECTION_NAME>",
"directory": "<BUCKETFS_DIRECTORY>"
},
"schema_name": "<TEMP_DB_SCHEMA>"
}
}');
```

See [Implementation of Custom Algorithms](#implementation-of-custom-algorithms) for a complete example.

### Parameters

| Parameter                    | Required? | Description                                                                        |
|------------------------------|-----------|------------------------------------------------------------------------------------|
| `<CLASS_NAME>`               | yes       | Name of the query handler class                                                    |
| `<CLASS_MODULE>`             | yes       | Module name of the query handler class                                             |
| `<CLASS_PARAMETERS>`         | yes       | Parameters of the query handler class encoded as a string                          |
| `<UDF_NAME>`                 | -         | Name of the Python UDF script that contains the algorithm implemented by the user |
| `<UDF_DB_SCHEMA>`            | -         | Schema name where the UDF script is deployed                                      |
| `<BUCKETFS_CONNECTION_NAME>` | yes       | BucketFS connection name which is used to create temporary BucketFS files         |
| `<BUCKETFS_DIRECTORY>`       | yes       | Directory in BucketFS for the temporary BucketFS files                             |
| `<TEMP_DB_SCHEMA>`           | yes       | Database schema for temporary database objects, e.g. tables                        |

Please take care to provide a string value for `<CLASS_PARAMETERS>`. Simple data types like `float`, `int`, or `bool` will be converted to a string, while a JSON object or an array is represented as a string containing an unusable reference, e.g. `table: 0x14823bd38580`.
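
As a sketch of how to satisfy this requirement from the caller's side in Python (the parameter names `epochs` and `seed` are invented for the example; `BFS_CON`, `MY_SCHEMA`, and the UDF name are taken from the example below):

```python
# A minimal sketch: always serialize structured class parameters to a string
# yourself before embedding them in the JSON argument.
import json

class_parameters = json.dumps({"epochs": 10, "seed": 42})  # -> a plain string

payload = {
    "query_handler": {
        "factory_class": {
            "module": "builtins",
            "name": "ExampleQueryHandlerFactory",
        },
        "parameter": class_parameters,  # a string survives the Lua layer intact
        "udf": {"schema": "MY_SCHEMA", "name": "MY_QUERY_HANDLER_UDF"},
    },
    "temporary_output": {
        "bucketfs_location": {"connection_name": "BFS_CON", "directory": "temp"},
        "schema_name": "TEMP_SCHEMA",
    },
}

# json.dumps emits double quotes only, so the result can be embedded in a
# single-quoted SQL string literal.
sql = f"EXECUTE SCRIPT MY_SCHEMA.AAF_RUN_QUERY_HANDLER('{json.dumps(payload)}')"
```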

## Custom Algorithms

### Deployment Options

Using the AAF requires implementing a custom algorithm in one of the following ways:
* Ad-hoc implementation within a UDF
* Building a custom extension

#### Building a custom extension

* Create a Python package that depends on the AAF's Python package and implements the query handler of the custom algorithm and its factory class, as shown in the sketch after this list.
* Create an associated SLC which has the Python package installed.
* The GitHub repository [python-extension-common](https://github.com/exasol/python-extension-common/) provides more detailed documentation and automation.
* Omit the entry `udf` from the JSON input to use the default UDF.
* The values `<CLASS_MODULE>` and `<CLASS_NAME>` must reflect the _module_ and _class name_ of the `QueryHandler` implemented in the custom SLC.
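
A minimal sketch of such a package module follows; the package `my_extension`, the module `my_extension.handler`, and the class names are hypothetical and only illustrate how `<CLASS_MODULE>` and `<CLASS_NAME>` map to the code:

```python
# my_extension/handler.py -- a hypothetical module in a custom Python package.
# With this package installed in the custom SLC, the JSON input would use
# "module": "my_extension.handler" and "name": "MyAlgorithmQueryHandlerFactory".
from typing import Union

from exasol_advanced_analytics_framework.udf_framework.udf_query_handler import UDFQueryHandler
from exasol_advanced_analytics_framework.query_handler.context.query_handler_context import QueryHandlerContext
from exasol_advanced_analytics_framework.query_handler.result import Continue, Finish
from exasol_advanced_analytics_framework.query_result.query_result import QueryResult


class MyAlgorithmQueryHandler(UDFQueryHandler):
    def __init__(self, parameter: str, query_handler_context: QueryHandlerContext):
        super().__init__(parameter, query_handler_context)
        self.parameter = parameter

    def start(self) -> Union[Continue, Finish[str]]:
        # A real algorithm would typically return Continue(...) with a query list.
        return Finish(result=f"echo: {self.parameter}")

    def handle_query_result(self, query_result: QueryResult) -> Union[Continue, Finish[str]]:
        return Finish(result="done")


class MyAlgorithmQueryHandlerFactory:
    def create(self, parameter: str, query_handler_context: QueryHandlerContext):
        return MyAlgorithmQueryHandler(parameter, query_handler_context)
```

Because the module is importable by name inside the custom SLC, the `builtins` workaround used in the ad-hoc example below is not needed.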


### Implementation of the Custom Algorithm

Each algorithm should extend the `UDFQueryHandler` abstract class and implement the following methods:
* `start()`: This method is called on the first execution of the query handler, that is, in the first iteration. It returns a result object: either _Finish_ or _Continue_.
  * The _Finish_ result object contains the final result of the implemented algorithm.
  * The _Continue_ object contains the list of queries that will be executed before the next iteration and whose results are used as input for the next iteration.
* `handle_query_result()`: This method is called in the subsequent iterations to handle the result of the queries from the previous iteration.

Here is an example class definition using an ad-hoc implementation within the UDF. The example uses the module `builtins` and dynamically adds `ExampleQueryHandler` and `ExampleQueryHandlerFactory` to it.
> **Review comment (Collaborator):** Add section Example

```python
--/
CREATE OR REPLACE PYTHON3_AAF SET SCRIPT "MY_SCHEMA"."MY_QUERY_HANDLER_UDF"(...)
EMITS (outputs VARCHAR(2000000)) AS

from typing import Union
from exasol_advanced_analytics_framework.udf_framework.udf_query_handler import UDFQueryHandler
from exasol_advanced_analytics_framework.query_handler.context.query_handler_context import QueryHandlerContext
from exasol_advanced_analytics_framework.query_result.query_result import QueryResult
from exasol_advanced_analytics_framework.query_handler.result import Result, Continue, Finish
from exasol_advanced_analytics_framework.query_handler.query.select_query import SelectQuery, SelectQueryWithColumnDefinition
from exasol_data_science_utils_python.schema.column import Column
from exasol_data_science_utils_python.schema.column_name import ColumnName
from exasol_data_science_utils_python.schema.column_type import ColumnType


class ExampleQueryHandler(UDFQueryHandler):
    def __init__(self, parameter: str, query_handler_context: QueryHandlerContext):
        super().__init__(parameter, query_handler_context)
        self.parameter = parameter
        self.query_handler_context = query_handler_context

    def start(self) -> Union[Continue, Finish[str]]:
        query_list = [
            SelectQuery("SELECT 1 FROM DUAL"),
            SelectQuery("SELECT 2 FROM DUAL")]
        query_handler_return_query = SelectQueryWithColumnDefinition(
            query_string="SELECT 5 AS 'return_column' FROM DUAL",
            output_columns=[
                Column(ColumnName("return_column"), ColumnType("INTEGER"))])

        return Continue(
            query_list=query_list,
            input_query=query_handler_return_query)

    def handle_query_result(self, query_result: QueryResult) -> Union[Continue, Finish[str]]:
        return_value = query_result.return_column
        result = 2 ** return_value
        return Finish(result=result)


import builtins
builtins.ExampleQueryHandler = ExampleQueryHandler  # required for pickle


class ExampleQueryHandlerFactory:
    def create(self, parameter: str, query_handler_context: QueryHandlerContext):
        return builtins.ExampleQueryHandler(parameter, query_handler_context)


builtins.ExampleQueryHandlerFactory = ExampleQueryHandlerFactory

from exasol_advanced_analytics_framework.udf_framework.query_handler_runner_udf \
    import QueryHandlerRunnerUDF

udf = QueryHandlerRunnerUDF(exa)


def run(ctx):
    return udf.run(ctx)
/

EXECUTE SCRIPT MY_SCHEMA.AAF_RUN_QUERY_HANDLER('{
    "query_handler": {
        "factory_class": {
            "module": "builtins",
            "name": "ExampleQueryHandlerFactory"
        },
        "parameter": "bla-bla",
        "udf": {
            "schema": "MY_SCHEMA",
            "name": "MY_QUERY_HANDLER_UDF"
        }
    },
    "temporary_output": {
        "bucketfs_location": {
            "connection_name": "BFS_CON",
            "directory": "temp"
        },
        "schema_name": "TEMP_SCHEMA"
    }
}');
```

The figure below illustrates the execution of this algorithm implemented in class `ExampleQueryHandler`.
* When the method `start()` is called, it executes two queries and an additional `input_query` to obtain the input for the next iteration.
* After the first iteration is completed, the framework calls the method `handle_query_result` with the `query_result` of the `input_query` from the previous iteration.

In this example, the algorithm finishes in this second iteration and returns 2<sup>_return value_</sup> as the final result, i.e. 2⁵ = 32, since the `input_query` selected the value 5.

![Sample Execution](../images/sample_execution.png "Sample Execution")
2 changes: 1 addition & 1 deletion query_handler_runner_udf.py
@@ -250,7 +250,7 @@ def _wrap_return_query(self,
        temporary_view_name = query_handler_context.get_temporary_view_name()
        query_handler_udf_name = \
            UDFNameBuilder.create(
                name=self.exa.meta.script_name,
                schema=SchemaName(self.exa.meta.script_schema)
            )
        query_create_view = \