feat: add basic validation
PaulKalho committed Nov 10, 2024
1 parent c543401 commit 771b5b1
Showing 7 changed files with 202 additions and 41 deletions.
123 changes: 92 additions & 31 deletions README.md
@@ -8,40 +8,27 @@ You can install the package via pip once it's published:
pip install scystream-sdk
```

## Usage

```python3
from scystream.sdk.core import entrypoint
from scystream.sdk.scheduler import Scheduler


@entrypoint
def example_task():
    print("Executing example_task...")
### Compute Blocks and their configs
One of the central concepts of scystream is the so-called **Compute Block**.

A Compute Block describes an independent program that acts as a worker
which will be scheduled using the scystream-core application.
This worker executes a task (e.g. an NLP task or a crawling task).

@entrypoint
def another_task(task_name):
    print(f"Executing another_task with task name: {task_name}")
Each worker can have multiple entrypoints, each aiming to solve one task.
These entrypoints can be configured from the outside using the **Settings**.
These are basically ENV-Variables, which will be parsed & validated using pydantic.

This SDK aims to implement helper functions and other requirements we expect each
Compute Block to have.

def main():
    Scheduler.list_entrypoints()
    Scheduler.execute_function("example_task")
    Scheduler.execute_function("another_task", "ScheduledTask")
To understand the concept of such a Compute Block better, take a look at the
config below.


if __name__ == "__main__":
    main()

```

### Compute Block Config Files
We expect every repository which will be used within the scystream application
to contain a `Compute Block Config File`, the `cbc.yaml`, within the root directory.

This yaml-file describes the compute block itself.
It shows the entrypoints, their inputs and outputs.
to contain a **Compute Block Config File**, the `cbc.yaml`, within the root directory.
This `cbc.yaml` defines the entrypoints as well as the inputs & outputs each
Compute Block offers, which the scystream-frontend needs in order to understand the block.

This is an example `cbc.yaml`:

@@ -85,7 +72,7 @@ entrypoints:
    description: "Analyze the runtimes"
    inputs:
      run_durations:
        description: "Teble that contains all runtimes and dates"
        description: "Table that contains all runtimes and dates"
        type: "db_table"
        config:
          RUN_DURATIONS_TABLE_NAME: "run_durations_nlp"
@@ -97,7 +84,10 @@ entrypoints:
          CSV_OUTPUT_PATH: "outputs/statistics.csv"
```
To read and validate such a config file u can proceed as follows:
For now, you have to write this config file on your own. However, at some
point you will be able to generate this config from your code.
To read and validate such a config file you can proceed as follows:
```python3
from scystream.sdk.config.config_loader import load_config
@@ -121,15 +111,86 @@ load_config(config_file_name="test.yaml", config_path="configs/")

The `config_path` is the path relative to your root directory.

## Basic Usage of the SDK

```python3
from scystream.sdk.core import entrypoint
from scystream.sdk.scheduler import Scheduler


@entrypoint()  # parentheses are required: entrypoint is a decorator factory
def example_task():
    print("Executing example_task...")


@entrypoint()
def another_task(task_name):
    print(f"Executing another_task with task name: {task_name}")


def main():
    Scheduler.list_entrypoints()
    Scheduler.execute_function("example_task")
    Scheduler.execute_function("another_task", "ScheduledTask")


if __name__ == "__main__":
    main()
```

## Defining Settings and Using Them

Earlier, we already mentioned **Settings**.
Each input & output can be configured using these settings.
There are also global settings, referred to as `envs` in the `cbc.yaml`.

Below you can find a simple example of how to define & validate these settings.
To do so, use the `BaseENVSettings` class.

```python3
from scystream.sdk.core import entrypoint
from scystream.sdk.env.settings import BaseENVSettings


class GlobalSettings(BaseENVSettings):
    LANGUAGE: str = "de"


class TopicModellingEntrypointSettings(BaseENVSettings):
    TXT_SRC_PATH: str  # no default provided: setting this ENV manually is a MUST


@entrypoint(TopicModellingEntrypointSettings)  # pass the settings class to the entrypoint
def topic_modelling(settings):
    print(f"Running topic modelling, using file: {settings.TXT_SRC_PATH}")


@entrypoint()  # parentheses are required, even without settings
def test_entrypoint():
    print("This entrypoint does not have any configs.")
```
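
Since `load_settings` reads a file named `.env` by default, the settings from the example above could be supplied through a file like the following sketch. The values are placeholders chosen for illustration and are not part of the commit.

```
# .env (hypothetical values)
LANGUAGE=en
TXT_SRC_PATH=data/input.txt
```

The variable names must match the field names exactly, because `BaseENVSettings` sets `case_sensitive=True`.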

We recommend defining your `GlobalSettings` in an extra file and "exporting" the loaded
Settings to make them accessible to other files.
See an example below:

```python3
from scystream.sdk.env.settings import BaseENVSettings

class GlobalSettings(BaseENVSettings):
    LANGUAGE: str = "de"

GLOBAL_SETTINGS = GlobalSettings.load_settings()
```

You can then use the loaded `GLOBAL_SETTINGS` in your other files by importing it.

## Development of the SDK

### Installation

1. Create a venv
1. Create a venv and use it

```bash
python3 -m venv .venv
source .venv/bin/activate
```

2. Install the package within the venv
33 changes: 26 additions & 7 deletions scystream/sdk/core.py
@@ -1,15 +1,34 @@
import functools

from typing import Callable, Type, Optional
from .env.settings import BaseENVSettings
from pydantic import ValidationError

_registered_functions = {}


def entrypoint(func):
    """Decorator to mark a function as an entrypoint."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    _registered_functions[func.__name__] = func
    return wrapper
def entrypoint(settings_class: Optional[Type[BaseENVSettings]] = None):
    """
    Decorator to mark a function as an entrypoint.
    It also loads and injects the settings of the entrypoint.
    """
    def decorator(func: Callable):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if settings_class is not None:
                # Load settings
                try:
                    settings = settings_class.load_settings()
                except ValidationError as e:
                    raise ValueError(f"Invalid environment configuration: {e}")

                return func(settings, *args, **kwargs)
            else:
                return func(*args, **kwargs)

        _registered_functions[func.__name__] = wrapper
        return wrapper
    return decorator


def get_registered_functions():
30 changes: 30 additions & 0 deletions scystream/sdk/env/settings.py
@@ -0,0 +1,30 @@
from pydantic_settings import BaseSettings, SettingsConfigDict
from typing import Type

ENV_FILE_ENCODING = "utf-8"


class BaseENVSettings(BaseSettings):
    """
    This class acts as the BaseClass which can be used to define custom
    ENV-Variables which can be used across the ComputeBlock & for entrypoints
    This definition, and pydantic, will then take care of validating the envs
    """

    model_config = SettingsConfigDict(
        env_file_encoding=ENV_FILE_ENCODING,
        case_sensitive=True,
        extra="ignore"
    )

    @classmethod
    def load_settings(
        cls: Type["BaseENVSettings"],
        env_file: str = ".env"
    ) -> "BaseENVSettings":
        """
        load_settings loads the env file. The name of the env_file can be
        passed as an argument.
        Returns the parsed ENVs
        """
        return cls(_env_file=env_file, _env_file_encoding=ENV_FILE_ENCODING)
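
Review note: `load_settings` delegates value resolution to pydantic-settings, which prefers a process environment variable over a value from the dotenv file and falls back to the field default last. The sketch below imitates that order with the standard library only, as a rough mental model. `parse_dotenv` and `resolve` are hypothetical helpers that are not part of the SDK, and the parser is deliberately simplified (no quoting, no `export` prefix).

```python
import os
from typing import Dict, Optional


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; comments and blank lines are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve(name: str, dotenv: Dict[str, str], default: Optional[str] = None) -> str:
    """Environment variable first, then the .env file, then the default."""
    if name in os.environ:
        return os.environ[name]
    if name in dotenv:
        return dotenv[name]
    if default is not None:
        return default
    # Mirrors a required field without a default: resolution fails.
    raise ValueError(f"{name} is required but not set")
```

With this model, a field such as `LANGUAGE: str = "de"` yields the exported value if one exists, otherwise the `.env` value, otherwise `"de"`, while a field without a default fails when neither source provides it.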
3 changes: 2 additions & 1 deletion setup.py
@@ -13,7 +13,8 @@
    packages=find_packages(),
    install_requires=[
        "pydantic>=2.9.2",
        "PyYAML>=6.0.2"
        "PyYAML>=6.0.2",
        "pydantic-settings>=2.6.1"
    ],
    classifiers=[
        "Programming Language :: Python :: 3",
2 changes: 1 addition & 1 deletion tests/test_config_files/valid_config.yaml
@@ -39,7 +39,7 @@ entrypoints:
    description: "Analyze the runtimes"
    inputs:
      run_durations:
        description: "Teble that contains all runtimes and dates"
        description: "Table that contains all runtimes and dates"
        type: "db_table"
        config:
          RUN_DURATIONS_TABLE_NAME: "run_durations_nlp"
2 changes: 1 addition & 1 deletion tests/test_core.py
@@ -4,7 +4,7 @@

class TestEntrypoint(unittest.TestCase):
    def test_entrypoint_registration(self):
        @entrypoint
        @entrypoint()
        def dummy_function():
            return "Hello"

50 changes: 50 additions & 0 deletions tests/test_settings.py
@@ -0,0 +1,50 @@
import unittest
import os
from scystream.sdk.core import entrypoint
from scystream.sdk.env.settings import BaseENVSettings


class WithDefaultSettings(BaseENVSettings):
    DUMMY_SETTING: str = "this is a dummy setting"


class NoDefaultSetting(BaseENVSettings):
    DUMMY_SETTING: str


class TestSettings(unittest.TestCase):
    def test_entrypoint_with_setting_default(self):
        @entrypoint(WithDefaultSettings)
        def with_default_settings(settings):
            return settings.DUMMY_SETTING

        result = with_default_settings()
        self.assertEqual(result, "this is a dummy setting")

        """
        environment is set
        """
        os.environ["DUMMY_SETTING"] = "overridden setting"
        result = with_default_settings()
        self.assertEqual(result, "overridden setting")
        del os.environ["DUMMY_SETTING"]

    def test_entrypoint_with_no_setting_default(self):
        @entrypoint(NoDefaultSetting)
        def with_no_default_settings(settings):
            return settings.DUMMY_SETTING

        with self.assertRaises(ValueError):
            with_no_default_settings()

        """
        environment is set
        """
        os.environ["DUMMY_SETTING"] = "required setting"
        result = with_no_default_settings()
        self.assertEqual(result, "required setting")
        del os.environ["DUMMY_SETTING"]


if __name__ == "__main__":
    unittest.main()
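
Review note: the tests above set `os.environ["DUMMY_SETTING"]` and delete it by hand, so a failing assertion in between would leave the variable set for later tests. One possible alternative, sketched here with the standard library only, is `unittest.mock.patch.dict`, which restores the environment when the block exits, even on failure. `read_dummy_setting` is a hypothetical stand-in for an entrypoint and is not part of the commit.

```python
import os
import unittest
from unittest.mock import patch


def read_dummy_setting() -> str:
    """Hypothetical stand-in for an entrypoint that reads DUMMY_SETTING."""
    return os.environ.get("DUMMY_SETTING", "this is a dummy setting")


class TestEnvIsolation(unittest.TestCase):
    def test_override_is_scoped(self):
        # The override only exists inside the with-block.
        with patch.dict(os.environ, {"DUMMY_SETTING": "overridden setting"}):
            self.assertEqual(read_dummy_setting(), "overridden setting")
        # Outside the block the environment is back to its previous state.
        self.assertEqual(read_dummy_setting(), "this is a dummy setting")
```

The same context manager could wrap the calls to `with_default_settings()` and `with_no_default_settings()` in `tests/test_settings.py`, removing the need for the explicit `del`.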
