#190 enabled to generate a dynamic module for custom udf #202

Merged

Changes from 43 commits (50 commits total)

Commits
a7902de
Added support to create dynamic modules
ckunki Oct 18, 2024
9d05c4c
Updated dynamic_modules, preferring functions over a class
ckunki Oct 18, 2024
e231ac1
#190: Enabled to generate a dynamic module for custom UDF
ckunki Oct 21, 2024
69ad70d
Experiment for GH workflow
ckunki Oct 21, 2024
237a0cb
Experiment for GH workflow 2
ckunki Oct 21, 2024
e94b29a
Experiment for GH workflow 3
ckunki Oct 21, 2024
24df407
Experiment for GH workflow 4
ckunki Oct 21, 2024
c1b7068
Reset changes to file create_query_loop.sql
ckunki Oct 21, 2024
0b7a4cd
Experiment for GH workflow 5
ckunki Oct 21, 2024
3e60314
Experiment for GH workflow 6
ckunki Oct 21, 2024
73db337
Updated pre-commit hook to use nox task
ckunki Oct 21, 2024
4c8bdbb
Updated create_query_loop.sql
ckunki Oct 21, 2024
dcb07f1
Updated create_query_loop.sql
ckunki Oct 21, 2024
b5314ab
Experiment for GH workflow 7
ckunki Oct 21, 2024
2d8ac83
Experiment for GH workflow 8
ckunki Oct 21, 2024
1a8e0c4
Removed dead code from pre-commit
ckunki Oct 21, 2024
658d90b
Added check for user guide up-to-date
ckunki Oct 21, 2024
d03f8ee
Experiment for GH workflow 9
ckunki Oct 21, 2024
7c1b470
Updated user guide
ckunki Oct 21, 2024
5a0c798
Updated developer guide
ckunki Oct 21, 2024
88daa71
Additional changes to example
ckunki Oct 21, 2024
dfe2319
Merge branch 'main' into refactoring/#190-Enabled_to_generate_a_dynam…
ckunki Oct 21, 2024
443bde6
Adde integration test
ckunki Oct 21, 2024
3384a78
Fixed integration test
ckunki Oct 21, 2024
ee0b129
Fixed integration test 2
ckunki Oct 21, 2024
f39824f
Fixed integration test 3
ckunki Oct 22, 2024
9baf2b1
Cleanup, removed SQL template
ckunki Oct 22, 2024
647dfc1
Fixed integration test 4
ckunki Oct 22, 2024
4d97386
Updated user guide
ckunki Oct 22, 2024
9502fff
Fixed types in user guide and added some additional sentences.
ckunki Oct 22, 2024
645b8a0
Updated example in user guide
ckunki Oct 22, 2024
ff940bb
delete file example/__init__.py
ckunki Oct 22, 2024
76f0bf8
Simplified example generator
ckunki Oct 22, 2024
209352b
Removed experimental local test
ckunki Oct 22, 2024
670c77a
Small refactoring
ckunki Oct 22, 2024
4fdd329
Apply suggestions from code review
ckunki Oct 22, 2024
35932a6
fixed review findings
ckunki Oct 22, 2024
ff55086
Fixed changelog entry
ckunki Oct 22, 2024
ffd56f6
Fixed review findings
ckunki Oct 23, 2024
62a089c
Replaced updating the user guide by simple references to example files
ckunki Oct 24, 2024
2cdeb0d
Fixed lua unit tests
ckunki Oct 24, 2024
7045dff
Fixed review findings
ckunki Oct 24, 2024
8656120
Update doc/user_guide/user_guide.md
ckunki Oct 24, 2024
0877c1e
Fixed review findings 2
ckunki Oct 24, 2024
bfa6c44
Updated the user guide to following the review findings
ckunki Oct 24, 2024
75cd145
Fixed review findings
ckunki Oct 24, 2024
408ac95
Fixed character case of example_module in the user guide
ckunki Oct 24, 2024
008854a
Added setting the language alias to the user guide
ckunki Oct 25, 2024
cf8bd2b
Shortened user guide
ckunki Oct 25, 2024
9d94ad3
Fixed integration test
ckunki Oct 25, 2024
35 changes: 35 additions & 0 deletions .github/workflows/check-code-generation.yml
@@ -0,0 +1,35 @@
name: Check Code Generation

on:
  push:
    branches-ignore:
      - main

jobs:
  check_code_generation:
    name: Lua Amalgate and Example in User Guide
    strategy:
      fail-fast: false
      matrix:
        python-version: [ "3.10" ]
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Setup Python & Poetry Environment
        uses: exasol/python-toolbox/.github/actions/[email protected]
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install Development Environment
        run: poetry run nox -s install_dev_env

      - name: Poetry install
        run: poetry run -- nox -s run_in_dev_env -- poetry install

      - name: Amalgate Lua Scripts
        run: poetry run nox -s amalgate_lua_scripts

      - name: Check if re-generated files differ from commit
        run: git diff --exit-code
45 changes: 0 additions & 45 deletions .github/workflows/check-packaging.yml

This file was deleted.

1 change: 1 addition & 0 deletions doc/changes/changes_0.1.0.md
@@ -52,6 +52,7 @@ Code name:
* #176: Updated usage of `exasol-bucketfs` to new API
* #185: Removed directory and script for building SLC AAF
* #191: Renamed UDF json element "parameters" to "parameter"
* #190: Added dynamic module generation and used it in the example UDF in the user guide
* #178: Fixed names of mock objects:
* Renamed `testing.mock_query_handler_runner.MockQueryHandlerRunner` to `query_handler.python_query_handler_runner.PythonQueryHandlerRunner`
* Renamed method `PythonQueryHandlerRunner.execute_query()` to `execute_queries()`
14 changes: 14 additions & 0 deletions doc/developer_guide/developer_guide.md
@@ -14,6 +14,20 @@ poetry run nox -s build_language_container

Installing the SLC is described in the [AAF User Guide](../user_guide/user_guide.md#script-language-container-slc).

## Update Generated Files

AAF contains some generated files that are committed to git, including:
* The amalgamated Lua script [create_query_loop.sql](https://github.com/exasol/advanced-analytics-framework/blob/main/exasol_advanced_analytics_framework/resources/outputs/create_query_loop.sql)
* The examples in the user guide
Review comment (Collaborator): is this still true

Reply (@ckunki, Contributor Author, Oct 24, 2024): Thanks - I removed the main instructions but forgot to update the bullet list. See next push fixing the bullet list, too.

The amalgamated Lua script originates from the files in the directory [exasol_advanced_analytics_framework/lua/src](https://github.com/exasol/advanced-analytics-framework/blob/main/exasol_advanced_analytics_framework/lua/src/).

The following command updates the amalgamated script:

```shell
poetry run nox -s amalgate_lua_scripts
```

## Running Tests

AAF comes with different automated tests implemented in different programming languages and requiring different environments:
88 changes: 88 additions & 0 deletions doc/user_guide/example-udf-script/create.sql
@@ -0,0 +1,88 @@
--/
CREATE OR REPLACE PYTHON3_AAF SET SCRIPT "EXAMPLE_SCHEMA"."MY_QUERY_HANDLER_UDF"(...)
EMITS (outputs VARCHAR(2000000)) AS

from typing import Union
from exasol_advanced_analytics_framework.udf_framework.udf_query_handler import UDFQueryHandler
from exasol_advanced_analytics_framework.udf_framework.dynamic_modules import create_module
from exasol_advanced_analytics_framework.query_handler.context.query_handler_context import QueryHandlerContext
from exasol_advanced_analytics_framework.query_result.query_result import QueryResult
from exasol_advanced_analytics_framework.query_handler.result import Result, Continue, Finish
from exasol_advanced_analytics_framework.query_handler.query.select_query import SelectQuery, SelectQueryWithColumnDefinition
from exasol_advanced_analytics_framework.query_handler.context.proxy.bucketfs_location_proxy import \
    BucketFSLocationProxy
from exasol_data_science_utils_python.schema.column import Column
from exasol_data_science_utils_python.schema.column_name import ColumnName
from exasol_data_science_utils_python.schema.column_type import ColumnType
from datetime import datetime
from exasol.bucketfs import as_string


example_module = create_module("example_module")

class ExampleQueryHandler(UDFQueryHandler):

    def __init__(self, parameter: str, query_handler_context: QueryHandlerContext):
        super().__init__(parameter, query_handler_context)
        self.parameter = parameter
        self.query_handler_context = query_handler_context
        self.bfs_proxy = None
        self.db_table_proxy = None

    def _bfs_file(self, proxy: BucketFSLocationProxy):
        return proxy.bucketfs_location() / "temp_file.txt"

    def start(self) -> Union[Continue, Finish[str]]:
        def sample_content(key: str) -> str:
            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            return f"{timestamp} {key} {self.parameter}"

        def table_query_string(statement: str, **kwargs):
            table_name = self.db_table_proxy._db_object_name.fully_qualified
            return statement.format(table_name=table_name, **kwargs)

        def table_query(statement: str, **kwargs):
            return SelectQuery(table_query_string(statement, **kwargs))

        self.bfs_proxy = self.query_handler_context.get_temporary_bucketfs_location()
        self._bfs_file(self.bfs_proxy).write(sample_content("bucketfs"))
        self.db_table_proxy = self.query_handler_context.get_temporary_table_name()
        query_list = [
            table_query('CREATE TABLE {table_name} ("c1" VARCHAR(100), "c2" INTEGER)'),
            table_query("INSERT INTO {table_name} VALUES ('{value}', 4)",
                        value=sample_content("table-insert")),
        ]
        query_handler_return_query = SelectQueryWithColumnDefinition(
            query_string=table_query_string('SELECT "c1", "c2" from {table_name}'),
            output_columns=[
                Column(ColumnName("c1"), ColumnType("VARCHAR(100)")),
                Column(ColumnName("c2"), ColumnType("INTEGER")),
            ])
        return Continue(
            query_list=query_list,
            input_query=query_handler_return_query)

    def handle_query_result(self, query_result: QueryResult) -> Union[Continue, Finish[str]]:
        c1 = query_result.c1
        c2 = query_result.c2
        bfs_content = as_string(self._bfs_file(self.bfs_proxy).read())
        return Finish(result=f"Final result: from query '{c1}', {c2} and bucketfs: '{bfs_content}'")


example_module.add_to_module(ExampleQueryHandler)

class ExampleQueryHandlerFactory:
    def create(self, parameter: str, query_handler_context: QueryHandlerContext):
        return example_module.ExampleQueryHandler(parameter, query_handler_context)

example_module.add_to_module(ExampleQueryHandlerFactory)

from exasol_advanced_analytics_framework.udf_framework.query_handler_runner_udf \
    import QueryHandlerRunnerUDF

udf = QueryHandlerRunnerUDF(exa)

def run(ctx):
    return udf.run(ctx)

/
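As a side note (not part of the committed file): the script above relies on the new `create_module`/`add_to_module` helpers. Below is a minimal sketch of that pattern in isolation, assuming the AAF package is installed; the placeholder class is purely illustrative.

```python
from exasol_advanced_analytics_framework.udf_framework.dynamic_modules import create_module

# Create a Python module at run time. Classes registered on it become
# reachable under the module's name, which is what allows the framework to
# pickle handler objects (the previous example attached classes to `builtins`
# with the comment "required for pickle").
example_module = create_module("example_module")


class ExampleQueryHandlerFactory:
    """Placeholder for the factory class; see create.sql for the real one."""


example_module.add_to_module(ExampleQueryHandlerFactory)

# The class can now be referenced through the generated module, exactly as
# create.sql does inside the factory's create() method.
factory = example_module.ExampleQueryHandlerFactory()
```

Registering classes on a generated module replaces the earlier workaround of attaching them to `builtins`, which the previous example needed so that handler objects could be pickled.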
20 changes: 20 additions & 0 deletions doc/user_guide/example-udf-script/execute.sql
@@ -0,0 +1,20 @@
EXECUTE SCRIPT "AAF_DB_SCHEMA"."AAF_RUN_QUERY_HANDLER"('{
"query_handler": {
"factory_class": {
"module": "example_module",
"name": "ExampleQueryHandlerFactory"
},
"parameter": "bla-bla",
"udf": {
"schema": "EXAMPLE_SCHEMA",
"name": "MY_QUERY_HANDLER_UDF"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we already run with the word example, should we use it everywhere (half a joke)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, during my experiments, I switched from MY_SCHEMA to MY_EXAMPLE to avoid false positive test results due to my test database being already partially initialized.

Should I replace the name MY_QUERY_HANDLER_UDF by EXAMPLE_QUERY_HANDLER_UDF?

}
},
"temporary_output": {
"bucketfs_location": {
"connection_name": "BFS_CON",
"directory": "temp"
},
"schema_name": "EXAMPLE_TEMP_SCHEMA"
}
}')
13 changes: 13 additions & 0 deletions doc/user_guide/proxies.md
@@ -0,0 +1,13 @@
## AAF Proxies

The Advanced Analytics Framework (AAF) uses _Object Proxies_ to manage temporary objects.

An _Object Proxy_
* Encapsulates a temporary object
* Provides a reference for using the object, i.e. its name including the database schema, or its path in the BucketFS
* Ensures the object is removed when leaving the current scope, e.g. the Query Handler.

All Object Proxies are derived from class `exasol_advanced_analytics_framework.query_handler.context.proxy.object_proxy.ObjectProxy`:
* `BucketFSLocationProxy` encapsulates a location in the BucketFS
* `DBObjectNameProxy` encapsulates a database object, e.g. a table
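
A short illustrative sketch (not part of the committed file) of how a query handler obtains such proxies from its context; the calls mirror the example UDF script `create.sql` above, while the helper function name is made up for illustration.

```python
from exasol_advanced_analytics_framework.query_handler.context.query_handler_context import QueryHandlerContext


def write_temporary_data(context: QueryHandlerContext) -> str:
    # BucketFSLocationProxy: temporary location in the BucketFS.
    bfs_proxy = context.get_temporary_bucketfs_location()
    (bfs_proxy.bucketfs_location() / "temp_file.txt").write("example content")

    # DBObjectNameProxy: temporary table name; create.sql reads its fully
    # qualified name via the (private) attribute shown below.
    table_proxy = context.get_temporary_table_name()
    table_name = table_proxy._db_object_name.fully_qualified

    # Both temporary objects are removed automatically when the current scope ends.
    return table_name
```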

104 changes: 27 additions & 77 deletions doc/user_guide/user_guide.md
@@ -108,10 +108,12 @@ python -m exasol_advanced_analytics_framework.deploy scripts \
--dsn "$DB_HOST:DB_PORT" \
--db-user "$DB_USER" \
--db-pass "$DB_PASSWORD" \
--schema "$DB_SCHEMA" \
--schema "$AAF_DB_SCHEMA" \
--language-alias "$LANGUAGE_ALIAS"
```

The database schema specified here must be the same schema that is used later when executing the `AAF_RUN_QUERY_HANDLER` script.
Review comment (Collaborator): this is unclear

Reply (Contributor Author): See next push.

## Usage

The entry point of this framework is the `AAF_RUN_QUERY_HANDLER` script. This script is simply a query loop responsible for executing the implemented algorithm.
@@ -124,7 +126,7 @@ This script takes the necessary parameters to execute the desired algorithm in s
The following SQL statement shows how to call an AAF query handler:

```sql
EXECUTE SCRIPT AAF_RUN_QUERY_HANDLER('{
EXECUTE SCRIPT <AAF_DB_SCHEMA>.AAF_RUN_QUERY_HANDLER('{
"query_handler": {
"factory_class": {
"module": "<CLASS_MODULE>",
@@ -152,6 +154,7 @@ See [Implementing a Custom Algorithm as Example Query Handler](#implementing-a-c

| Parameter | Required? | Description |
|------------------------------|-----------|-------------------------------------------------------------------------------|
| `<AAF_DB_SCHEMA>` | yes | Name of the database schema containing the default Query Handler; see [Additional Scripts](#additional-scripts) |
| `<CLASS_NAME>` | yes | Name of the query handler class |
| `<CLASS_MODULE>` | yes | Module name of the query handler class |
| `<CLASS_PARAMETERS>` | yes | Parameters of the query handler class encoded as string |
@@ -188,88 +191,31 @@ Each algorithm should extend the `UDFQueryHandler` abstract class and then imple

### Concrete Example Using an Adhoc Implementation Within the UDF

The example uses the module `builtins` and dynamically adds `ExampleQueryHandler` and `ExampleQueryHandlerFactory` to it.

```python
--/
CREATE OR REPLACE PYTHON3_AAF SET SCRIPT "MY_SCHEMA"."MY_QUERY_HANDLER_UDF"(...)
EMITS (outputs VARCHAR(2000000)) AS

from typing import Union
from exasol_advanced_analytics_framework.udf_framework.udf_query_handler import UDFQueryHandler
from exasol_advanced_analytics_framework.query_handler.context.query_handler_context import QueryHandlerContext
from exasol_advanced_analytics_framework.query_result.query_result import QueryResult
from exasol_advanced_analytics_framework.query_handler.result import Result, Continue, Finish
from exasol_advanced_analytics_framework.query_handler.query.select_query import SelectQuery, SelectQueryWithColumnDefinition
from exasol_data_science_utils_python.schema.column import Column
from exasol_data_science_utils_python.schema.column_name import ColumnName
from exasol_data_science_utils_python.schema.column_type import ColumnType


class ExampleQueryHandler(UDFQueryHandler):
def __init__(self, parameter: str, query_handler_context: QueryHandlerContext):
super().__init__(parameter, query_handler_context)
self.parameter = parameter
self.query_handler_context = query_handler_context

def start(self) -> Union[Continue, Finish[str]]:
query_list = [
SelectQuery("SELECT 1 FROM DUAL"),
SelectQuery("SELECT 2 FROM DUAL")]
query_handler_return_query = SelectQueryWithColumnDefinition(
query_string="SELECT 5 AS 'return_column' FROM DUAL",
output_columns=[
Column(ColumnName("return_column"), ColumnType("INTEGER"))])

return Continue(
query_list=query_list,
input_query=query_handler_return_query)
The example dynamically creates a python module `xyz` and adds classes `ExampleQueryHandler` and `ExampleQueryHandlerFactory` to it.
Review comment (Contributor): I would give the module a more meaningful name than "xyz". If you were doing this manually you would probably call the module "example_query_handler.py" or "example_qh.py", or something like that.

Reply (Contributor Author): How about example_module?

def handle_query_result(self, query_result: QueryResult) -> Union[Continue, Finish[str]]:
return_value = query_result.return_column
result = 2 ** return_value
return Finish(result=result)
In order to execute the example successfully you need to
1. [Create a BucketFS connection](#bucketfs-connection)
2. Activate the AAF's SLC
3. Make sure the database schemas used in the example exist.

import builtins
builtins.ExampleQueryHandler=ExampleQueryHandler # required for pickle
The example assumes
* the name of the BucketFS connection `<CONNECTION_NAME>` to be `BFS_CON`
* the name of the AAF database schema `<AAF_DB_SCHEMA>` to be `AAF_DB_SCHEMA`, see [Additional Scripts](#additional-scripts)

class ExampleQueryHandlerFactory:
def create(self, parameter: str, query_handler_context: QueryHandlerContext):
return builtins.ExampleQueryHandler(parameter, query_handler_context)
The following SQL statements activate the AAF's SLC and create the required database schemas unless they already exist:

builtins.ExampleQueryHandlerFactory=ExampleQueryHandlerFactory

from exasol_advanced_analytics_framework.udf_framework.query_handler_runner_udf \
import QueryHandlerRunnerUDF
```sql
ALTER SESSION SET SCRIPT_LANGUAGES='R=builtin_r JAVA=builtin_java PYTHON3=builtin_python3 PYTHON3_AAF=localzmq+protobuf:///bfsdefault/default/temp/exasol_advanced_analytics_framework_container_release?lang=python#/buckets/bfsdefault/default/temp/exasol_advanced_analytics_framework_container_release/exaudf/exaudfclient_py3';
Comment (Contributor Author): I applied your proposal, @tkilias. How about the assumptions? Maybe we first try to identify all assumptions?
Thinking about it: maybe we can omit the ALTER SESSION statement, as all this is already done by the AAF deploy command?

Reply (Contributor Author): See next push without ALTER SESSION command.

Reply (Collaborator): lets remove it

Reply (Contributor Author): 👍


udf = QueryHandlerRunnerUDF(exa)
create schema IF NOT EXISTS "EXAMPLE_SCHEMA";
create schema IF NOT EXISTS "EXAMPLE_TEMP_SCHEMA";
```

def run(ctx):
return udf.run(ctx)
/
The following files contain the SQL statements for creating and executing the UDF script:
* [example-udf-script/create.sql](example-udf-script/create.sql)
* [example-udf-script/execute.sql](example-udf-script/execute.sql)


EXECUTE SCRIPT MY_SCHEMA.AAF_RUN_QUERY_HANDLER('{
"query_handler": {
"factory_class": {
"module": "builtins",
"name": "ExampleQueryHandlerFactory"
},
"parameter": "bla-bla",
"udf": {
"schema": "MY_SCHEMA",
"name": "MY_QUERY_HANDLER_UDF"
}
},
"temporary_output": {
"bucketfs_location": {
"connection_name": "BFS_CON",
"directory": "temp"
},
"schema_name": "TEMP_SCHEMA"
}
}');
```
### Sequence Diagram

The figure below illustrates the execution of this algorithm implemented in class `ExampleQueryHandler`.
* When method `start()` is called, it executes two queries and an additional `input_query` to obtain the input for the next iteration.
@@ -278,3 +224,7 @@
In this example, the algorithm is finished at this iteration and returns 2<sup>_return value_</sup> as final result.

![Sample Execution](../images/sample_execution.png "Sample Execution")
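
For reference, here is a condensed sketch of this two-step flow, mirroring the previous inline example that this PR replaces; it is not part of the diff itself.

```python
from typing import Union

from exasol_advanced_analytics_framework.udf_framework.udf_query_handler import UDFQueryHandler
from exasol_advanced_analytics_framework.query_handler.result import Continue, Finish
from exasol_advanced_analytics_framework.query_handler.query.select_query import SelectQuery, SelectQueryWithColumnDefinition
from exasol_data_science_utils_python.schema.column import Column
from exasol_data_science_utils_python.schema.column_name import ColumnName
from exasol_data_science_utils_python.schema.column_type import ColumnType


class SketchQueryHandler(UDFQueryHandler):
    def start(self) -> Union[Continue, Finish[str]]:
        # First iteration: run two queries and declare the input query whose
        # result is handed to handle_query_result().
        return Continue(
            query_list=[SelectQuery("SELECT 1 FROM DUAL"),
                        SelectQuery("SELECT 2 FROM DUAL")],
            input_query=SelectQueryWithColumnDefinition(
                query_string="SELECT 5 AS 'return_column' FROM DUAL",
                output_columns=[Column(ColumnName("return_column"), ColumnType("INTEGER"))]))

    def handle_query_result(self, query_result) -> Union[Continue, Finish[str]]:
        # Second iteration: no further queries; finish with 2 ** return_value.
        return Finish(result=2 ** query_result.return_column)
```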

## Additional Information

* [Object Proxies](proxies.md) for managing temporary locations in the database and BucketFS
2 changes: 1 addition & 1 deletion exasol_advanced_analytics_framework/lua/src/query_loop.lua
@@ -55,7 +55,7 @@ function M.prepare_init_query(arguments, meta)
local udf_schema <const> = udf['schema']
local udf_name <const> = udf['name']

local full_qualified_udf_name <const> = string.format("%s.%s", udf_schema, udf_name)
local full_qualified_udf_name <const> = string.format("\"%s\".\"%s\"", udf_schema, udf_name)
local udf_args <const> = string.format("(%d,'%s','%s','%s','%s','%s','%s','%s')",
iter_num,
temporary_bfs_location_conn,
@@ -24,7 +24,7 @@ test_query_handler_runner = {
parameter = "param"
},
},
query = "SELECT UDF_SCHEMA.UDF_NAME(" ..
query = "SELECT \"UDF_SCHEMA\".\"UDF_NAME\"(" ..
"0,'bfs_conn','directory','db_name_1122334455_1','temp_schema'," ..
"'cls_name','package.module','param')",
return_query_result = {