[FSTORE-1090] Concepts & Guides for helper columns and on-demand features #333

Merged · 11 commits · Dec 19, 2023
12 changes: 12 additions & 0 deletions docs/concepts/fs/feature_group/on_demand_feature.md
@@ -0,0 +1,12 @@
---
description: On-demand feature computation.
---

# On-demand features

Features are defined as on-demand when their values cannot be pre-computed; instead, they need to be computed in real time during inference. This is achieved by implementing the on-demand feature as a Python function in a Python module. Also ensure that the same version of the Python module is installed in both the feature and inference pipelines.

The image below shows an example of a housing price model that demonstrates how to implement an on-demand feature: a zip code (or postal code) that is computed from longitude/latitude parameters. In your online application, longitude and latitude are provided as parameters, and the same Python function used to compute the zip code in the feature pipeline is used to compute the zip code in the online inference pipeline.

<img src="../../../../assets/images/concepts/fs/on-demand-feature.png">
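A minimal sketch of the pattern (the module name `feature_functions.py` and the `get_zipcode` function with its toy lookup logic are hypothetical, for illustration only):

```python
# feature_functions.py — install the same version of this module
# in both the feature pipeline and the inference pipeline.

def get_zipcode(longitude: float, latitude: float) -> str:
    """Hypothetical on-demand feature: derive a zip code from coordinates.

    A real implementation would use a reverse-geocoding library or a
    spatial index; this toy version buckets coordinates into a grid.
    """
    cell = (round(latitude, 1), round(longitude, 1))
    zipcode_by_cell = {(59.3, 18.1): "11120"}  # hypothetical lookup table
    return zipcode_by_cell.get(cell, "unknown")


# The feature pipeline and the online inference pipeline call the same function:
print(get_zipcode(18.07, 59.33))  # -> "11120"
```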

20 changes: 20 additions & 0 deletions docs/user_guides/fs/feature_view/batch-data.md
@@ -16,6 +16,26 @@ It is very common that ML models are deployed in a "batch" setting where ML pipe
Dataset<Row> ds = featureView.getBatchData("20220620", "20220627");
```

## Retrieve batch data with primary keys and event time
In certain scenarios, for example for time series models, the input data needs to be sorted according to a combination of primary key(s) and event time.
One might also want to merge predictions back into the original input data for post-mortem analysis. However, primary key(s) and event time are usually not included in the feature view query, as
they are not features used for training. To retrieve them, pass the following arguments to the `get_batch_data` method:
`primary_keys=True` and/or `event_time=True`.

=== "Python"
```python
# get batch data
df = feature_view.get_batch_data(
start_time = "20220620",
end_time = "20220627",
primary_keys=True,
event_time=True
) # returns a dataframe with primary keys and event time
```
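With the keys and event time present, predictions can be merged back onto the batch input for post-mortem analysis. A minimal sketch (the fitted `model`, the primary key `cc_num`, and the event time column `datetime` are assumptions for illustration):

```python
# sort chronologically per entity, then attach predictions for later analysis
feature_cols = [c for c in df.columns if c not in ("cc_num", "datetime")]
df = df.sort_values(["cc_num", "datetime"])
df["prediction"] = model.predict(df[feature_cols])
```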
!!! note
If the event time column has the same name in the feature groups included in the feature view query, then the event time of the left-most feature group in the query will be returned. If the names differ, then
all of them will be returned. Join prefixes have no influence on this behaviour.

For Python clients handling small or moderately sized data, we recommend enabling the [ArrowFlight Server with DuckDB](../../../setup_installation/common/arrow_flight_duckdb.md), which will provide significant speedups over Spark/Hive for reading batch data.
If the service is enabled, and you want to read this particular batch data with Hive instead, you can set `read_options` to `{"use_hive": True}`.
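For example (a minimal sketch reusing the date range from above):

```python
# read this particular batch with Hive even though ArrowFlight/DuckDB is enabled
df = feature_view.get_batch_data(
    start_time="20220620",
    end_time="20220627",
    read_options={"use_hive": True},
)
```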
170 changes: 170 additions & 0 deletions docs/user_guides/fs/feature_view/helper-columns.md
@@ -0,0 +1,170 @@
---
description: Using helper columns in feature view queries for online/batch inference and training datasets.
---

# Helper columns
Hopsworks Feature Store provides functionality to define two types of helper columns, `inference_helper_columns` and `training_helper_columns`, for [feature views](./overview.md).

!!! note
Both inference and training helper column name(s) must be part of the `Query` object. If a helper column name belongs to a feature group that is part of a `Join` with a `prefix` defined, then this prefix needs to be prepended
to the original column name when defining the helper column list.
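For example (a sketch; the prefix `t_` and the feature view name are hypothetical):

```python
# trans_fg is joined with the prefix "t_", so helper columns coming
# from that feature group must carry the prefix in the helper column list
query = label_fg.select("fraud_label")\
    .join(trans_fg.select(["amount", "longitude", "latitude"]), prefix="t_")

feature_view = fs.get_or_create_feature_view(
    name="fv_with_prefixed_helper_col",
    version=1,
    query=query,
    labels=["fraud_label"],
    inference_helper_columns=["t_longitude", "t_latitude"],
)
```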

## Inference Helper columns

`inference_helper_columns` are a list of feature names that are not used for training the model itself but provide extra information during online or batch inference.
An example is computing an [on-demand feature](../../../concepts/fs/feature_group/on_demand_feature.md) such as `loc_delta_t_minus_1`, the distance between the previous and the current transaction location in a credit card fraud detection system.
The feature `loc_delta_t_minus_1` is computed from the previous transaction coordinates `longitude` and `latitude`, which need to be fetched from the feature store and compared to the new transaction coordinates that arrive at the inference application.
In this use case `longitude` and `latitude` are `inference_helper_columns`: they are not used for training but are necessary for computing the [on-demand feature](../../../concepts/fs/feature_group/on_demand_feature.md) `loc_delta_t_minus_1`.

=== "Python"

!!! example "Define inference columns for feature views."
```python
# define query object
query = label_fg.select("fraud_label")\
.join(trans_fg.select(["amount", "loc_delta_t_minus_1", "longitude", "latitude", "category"]))

# define feature view with helper columns
feature_view = fs.get_or_create_feature_view(
name='fv_with_helper_col',
version=1,
query=query,
labels=["fraud_label"],
transformation_functions=transformation_functions,
inference_helper_columns=["longitude", "latitude"],
)
```

### Retrieval
When retrieving data for model inference, helper columns will be omitted. However, they can optionally be fetched together with inference or training data.

#### Batch inference


=== "Python"

!!! example "Fetch inference helper column values and compute on-demand features during batch inference."
```python

# import feature functions
from feature_functions import location_delta, time_delta

# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Fetch feature data for batch inference with helper columns
df = feature_view.get_batch_data(start_time=start_time, end_time=end_time, inference_helpers=True)
df['longitude_prev'] = df['longitude'].shift(-1)
df['latitude_prev'] = df['latitude'].shift(-1)

# compute location delta
df['loc_delta_t_minus_1'] = df.apply(lambda row: location_delta(row['longitude'],
row['latitude'],
row['longitude_prev'],
row['latitude_prev']), axis=1)

# prepare dataframe for prediction
df = df[[f.name for f in feature_view.features if not (f.label or f.inference_helper_column or f.training_helper_column)]]
```

#### Online inference


=== "Python"

!!! example "Fetch inference helper column values and compute on-demand features during online inference."
```python

from feature_functions import location_delta, time_delta

# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Fetch feature data for batch inference without helper columns
df_without_inference_helpers = feature_view.get_batch_data()

# Fetch feature data for batch inference with helper columns
df_with_inference_helpers = feature_view.get_batch_data(inference_helpers=True)

# here cc_num, longitude and latitude are provided as parameters to the application
cc_num = ...
longitude = ...
latitude = ...

# get the previous transaction location of this credit card
inference_helper = feature_view.get_inference_helper({"cc_num": cc_num}, return_type="dict")

# compute location delta
loc_delta_t_minus_1 = location_delta(longitude,
latitude,
inference_helper['longitude'],
inference_helper['latitude'])


# Now get assembled feature vector for prediction
feature_vector = feature_view.get_feature_vector({"cc_num": cc_num},
passed_features={"loc_delta_t_minus_1": loc_delta_t_minus_1}
)
```
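The assembled vector can then be fed to the model (a sketch; the `model` object is an assumption, not part of the feature view API):

```python
# `model` is any estimator deployed alongside the online application
prediction = model.predict([feature_vector])
```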


## Training Helper columns
`training_helper_columns` are a list of feature names that are not part of the model schema itself but are used for extra information during training.
For example, one might want to use a feature like the `category` of the purchased product to assign different sample weights (see the sketch at the end of this section).

=== "Python"

!!! example "Define training helper columns for feature views."
```python
# define query object
query = label_fg.select("fraud_label")\
.join(trans_fg.select(["amount", "loc_delta_t_minus_1", "longitude", "latitude", "category"]))

# define feature view with helper columns
feature_view = fs.get_or_create_feature_view(
name='fv_with_helper_col',
version=1,
query=query,
labels=["fraud_label"],
transformation_functions=transformation_functions,
training_helper_columns=["category"]
)
```

### Retrieval
When retrieving training data, helper columns will be omitted. However, they can optionally be fetched.

=== "Python"

!!! example "Fetch training data with or without inference helper column values."
```python

# import feature functions
from feature_functions import location_delta, time_delta

# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Create training data with training helper columns
TEST_SIZE = 0.2
X_train, X_test, y_train, y_test = feature_view.train_test_split(
description='transactions fraud training dataset',
test_size=TEST_SIZE,
training_helper_columns=True
)

# Get existing training data with training helper columns
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(
training_dataset_version=1,
training_helper_columns=True
)
```

!!! note
To use helper columns with a materialized training dataset, it needs to be created with `training_helper_columns=True`.
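Once fetched, a training helper column can be used, for example, to weight samples during training. A minimal sketch (assuming pandas dataframes and scikit-learn; the weight table and the model are hypothetical, not part of the Hopsworks API):

```python
from sklearn.ensemble import RandomForestClassifier

# hypothetical per-category weights; unseen categories default to 1.0
weight_by_category = {"electronics": 2.0, "groceries": 1.0}
sample_weight = X_train["category"].map(weight_by_category).fillna(1.0)

# drop the helper column before fitting: it is not part of the model schema
model = RandomForestClassifier()
model.fit(X_train.drop(columns=["category"]), y_train.values.ravel(),
          sample_weight=sample_weight)
```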
19 changes: 19 additions & 0 deletions docs/user_guides/fs/feature_view/training-data.md
@@ -94,6 +94,25 @@ X_train, X_test, y_train, y_test = feature_view.get_train_test_split(training_da
X_train, X_val, X_test, y_train, y_val, y_test = feature_view.get_train_validation_test_split(training_dataset_version=1)
```

## Read training data with primary key(s) and event time
In certain scenarios, for example for time series analysis, training data needs to be sorted according to a combination of primary key(s) and event time.
However, these columns are usually not included in the feature view query, as they are not features used for training. To retrieve them, pass the following arguments:
`primary_keys=True` and/or `event_time=True`.


```python
# get a training dataset
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(training_dataset_version=1,
primary_keys=True,
event_time=True)
```
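The keys can then be used to order the data chronologically per entity. A sketch (the primary key `cc_num` and the event time column `datetime` are assumptions for illustration):

```python
# sort each split by primary key and event time
X_train = X_train.sort_values(["cc_num", "datetime"])
X_test = X_test.sort_values(["cc_num", "datetime"])
```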

!!! note
If the event time column has the same name in the feature groups included in the parent feature view query, then the event time of the left-most feature group in the query will be returned. If the names differ, then
all of them will be returned. Join prefixes have no influence on this behaviour.

To use primary key(s) and the event time column with a materialized training dataset, it needs to be created with `primary_keys=True` and/or `with_event_time=True`.
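For example (a sketch, assuming the training dataset creation API accepts the same flags):

```python
# materialize a training dataset that keeps primary keys and event time
version, job = feature_view.create_train_test_split(
    test_size=0.2,
    primary_keys=True,
    with_event_time=True,
)
```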

## Deletion
To clean up unused training data, you can delete all training data or a particular version. Note that all training data metadata and materialized files stored in HopsFS will be deleted and cannot be recreated.
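A sketch of the two options (assuming the `delete_training_dataset` and `delete_all_training_datasets` methods of the feature view object):

```python
# delete a particular training dataset version
feature_view.delete_training_dataset(training_dataset_version=1)

# delete all training datasets of this feature view
feature_view.delete_all_training_datasets()
```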
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -28,6 +28,7 @@ nav:
- Spine Group: concepts/fs/feature_group/spine_group.md
- Data Validation/Stats/Alerts: concepts/fs/feature_group/fg_statistics.md
- Versioning: concepts/fs/feature_group/versioning.md
- On-Demand Feature: concepts/fs/feature_group/on_demand_feature.md
- Feature Views:
- Overview: concepts/fs/feature_view/fv_overview.md
- Offline API: concepts/fs/feature_view/offline_api.md
@@ -84,6 +85,7 @@
- Feature vectors: user_guides/fs/feature_view/feature-vectors.md
- Feature server: user_guides/fs/feature_view/feature-server.md
- Query: user_guides/fs/feature_view/query.md
- Helper Columns: user_guides/fs/feature_view/helper-columns.md
- Transformation Functions: user_guides/fs/feature_view/transformation-function.md
- Spines: user_guides/fs/feature_view/spine-query.md
- Compute Engines: user_guides/fs/compute_engines.md