Commit d9bf08e

[FSTORE-1090] Concepts & Guides for helper columns and on-demand features (#333)

Co-authored-by: Fabio Buso <[email protected]>
Co-authored-by: davitbzh <[email protected]>
3 people committed Dec 19, 2023
1 parent 8d18206 commit d9bf08e
Showing 6 changed files with 220 additions and 0 deletions.
12 changes: 12 additions & 0 deletions docs/concepts/fs/feature_group/on_demand_feature.md
---
description: On-demand feature computation.
---

# On-demand features

Features are defined as on-demand when their values cannot be pre-computed; instead, they must be computed in real time during inference. This is achieved by implementing the on-demand feature as a Python function in a Python module. Make sure that the same version of the Python module is installed in both the feature and inference pipelines.

The image below shows an example of a housing price model that demonstrates how to implement an on-demand feature: a zip code (or postal code) computed from longitude/latitude parameters. In your online application, longitude and latitude are provided as parameters, and the same Python function used to compute the zip code in the feature pipeline is used to compute it in the online inference pipeline.

<img src="../../../../assets/images/concepts/fs/on-demand-feature.png">
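A minimal sketch of such a shared module is shown below. The module layout, function name, and toy coordinate lookup are assumptions for illustration; a real implementation would typically call a reverse-geocoding service:

```python
# feature_functions.py -- hypothetical shared module, installed (at the same
# version) in both the feature pipeline and the online inference pipeline.

# Toy lookup table standing in for a real reverse-geocoding service.
_ZIP_BY_CELL = {
    (40.7, -74.0): "10001",   # New York
    (34.0, -118.2): "90012",  # Los Angeles
}

def zip_code(latitude: float, longitude: float) -> str:
    """Compute the zip code on demand from coordinates."""
    # Round coordinates to a coarse grid cell and look it up.
    cell = (round(latitude, 1), round(longitude, 1))
    return _ZIP_BY_CELL.get(cell, "unknown")
```

Because both pipelines import the same function, the zip code computed at training time and at inference time is guaranteed to match.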

18 changes: 18 additions & 0 deletions docs/user_guides/fs/feature_view/batch-data.md
```java
Dataset<Row> ds = featureView.getBatchData("20220620", "20220627");
```

## Retrieve batch data with primary keys and event time
For certain use cases, e.g. time series models, the input data needs to be sorted according to the combination of primary key(s) and event time. One might also want to merge predictions back with the original input data for post-mortem analysis.
Primary key(s) and event time are usually not included in the feature view query, as they are not features used for training.
To retrieve the primary key(s) and/or event time when retrieving batch data for inference, set the parameters `primary_keys=True` and/or `event_time=True`.

=== "Python"
```python
# get batch data
df = feature_view.get_batch_data(
start_time = "20220620",
end_time = "20220627",
primary_keys=True,
event_time=True
) # return a dataframe with primary keys and event time
```
!!! note
    If the event time columns have the same name across all the feature groups included in the feature view, only the event time of the label feature group (the left-most feature group in the query) will be returned. If they have different names, all of them will be returned. The join prefix has no influence on this behaviour.
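To illustrate why the primary key(s) and event time matter here, the rows returned for a time series model can be ordered by the (primary key, event time) combination. A minimal sketch in plain Python, where the column names `cc_num` and `event_time` are assumptions for illustration:

```python
# Hypothetical batch-data rows, keyed by primary key `cc_num` and `event_time`.
rows = [
    {"cc_num": 111, "event_time": "20220622", "amount": 12.0},
    {"cc_num": 111, "event_time": "20220620", "amount": 7.5},
    {"cc_num": 222, "event_time": "20220621", "amount": 3.2},
]

# Sort by the (primary key, event time) combination, as a time series
# model expects its input to be ordered.
rows_sorted = sorted(rows, key=lambda r: (r["cc_num"], r["event_time"]))
```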

For Python clients handling small or moderately sized data, we recommend enabling the [ArrowFlight Server with DuckDB](../../../setup_installation/common/arrow_flight_duckdb.md), which provides significant speedups over Spark/Hive for reading batch data.
If the service is enabled and you want to read this particular batch data with Hive instead, you can set the `read_options` to `{"use_hive": True}`.
170 changes: 170 additions & 0 deletions docs/user_guides/fs/feature_view/helper-columns.md
---
description: Using Helper columns in Feature View queries for online/batch inference and training dataset.
---

# Helper columns
Hopsworks Feature Store lets you define two types of helper columns for [feature views](./overview.md): `inference_helper_columns` and `training_helper_columns`.

!!! note
    Both inference and training helper column name(s) must be part of the `Query` object. If a helper column name belongs to a feature group that is part of a `Join` with a `prefix` defined, then this prefix needs to be prepended to the original column name when defining the helper column list.
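As a small sketch of the prefix rule above, the names passed in the helper column list are the prefixed names, not the original feature group column names (the prefix and column names here are assumptions for illustration):

```python
# Hypothetical join prefix and helper columns.
prefix = "trans_"           # prefix given to the Join in the feature view query
helper_cols = ["longitude", "latitude"]

# The helper column list must use the prefixed names.
prefixed = [f"{prefix}{col}" for col in helper_cols]
# These are the names you would pass as inference_helper_columns
# when the feature group is joined with prefix="trans_".
```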

## Inference Helper columns
`inference_helper_columns` are a list of feature names that are not used for training the model itself but provide extra information during online or batch inference.
An example is computing an [on-demand feature](../../../concepts/fs/feature_group/on_demand_feature.md) such as `loc_delta_t_minus_1`, the distance between the previous and current transaction locations in a credit card fraud detection system.
The feature `loc_delta_t_minus_1` is computed from the previous transaction's coordinates `longitude` and `latitude`, which need to be fetched from the feature store and compared to the coordinates of the new transaction arriving at the inference application.
In this use case, `longitude` and `latitude` are `inference_helper_columns`: they are not used for training but are necessary for computing the [on-demand feature](../../../concepts/fs/feature_group/on_demand_feature.md) `loc_delta_t_minus_1`.

=== "Python"

!!! example "Define inference helper columns for feature views."
```python
# define query object
query = label_fg.select("fraud_label")\
.join(trans_fg.select(["amount", "loc_delta_t_minus_1", "longitude", "latitude", "category"]))
# define feature view with helper columns
feature_view = fs.get_or_create_feature_view(
name='fv_with_helper_col',
version=1,
query=query,
labels=["fraud_label"],
transformation_functions=transformation_functions,
inference_helper_columns=["longitude", "latitude"],
)
```

### Retrieval
When retrieving data for model inference, helper columns are omitted by default. However, they can optionally be fetched together with inference or training data.

#### Batch inference

=== "Python"

!!! example "Fetch inference helper column values and compute on-demand features during batch inference."
```python

# import feature functions
from feature_functions import location_delta, time_delta
# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Fetch feature data for batch inference with helper columns
df = feature_view.get_batch_data(start_time=start_time, end_time=end_time, inference_helpers=True)
# previous transaction coordinates (assumes rows are sorted in ascending event-time order)
df['longitude_prev'] = df['longitude'].shift(1)
df['latitude_prev'] = df['latitude'].shift(1)

# compute location delta between the previous and current transaction
df['loc_delta_t_minus_1'] = df.apply(lambda row: location_delta(row['longitude'],
row['latitude'],
row['longitude_prev'],
row['latitude_prev']), axis=1)

# prepare dataframe for prediction
df = df[[f.name for f in feature_view.features if not (f.label or f.inference_helper_column or f.training_helper_column)]]
```
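The `location_delta` helper imported from `feature_functions` above is not shown in this guide. A minimal sketch, assuming it returns the haversine (great-circle) distance in kilometres between two coordinate pairs, could look like this; the real module may differ:

```python
import math

def location_delta(long1, lat1, long2, lat2):
    """Haversine distance in km between two (longitude, latitude) pairs.

    Sketch of the helper imported from `feature_functions`; the real
    implementation may differ.
    """
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(long2 - long1)
    # haversine formula
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```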

#### Online inference

=== "Python"

!!! example "Fetch inference helper column values and compute on-demand features during online inference."
```python

from feature_functions import location_delta, time_delta
# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Fetch feature data for batch inference without helper columns
df_without_inference_helpers = feature_view.get_batch_data()

# Fetch feature data for batch inference with helper columns
df_with_inference_helpers = feature_view.get_batch_data(inference_helpers=True)

# here cc_num, longitude and latitude are provided as parameters to the application
cc_num = ...
longitude = ...
latitude = ...
# get previous transaction location of this credit card
inference_helper = feature_view.get_inference_helper({"cc_num": cc_num}, return_type="dict")

# compute location delta
loc_delta_t_minus_1 = location_delta(longitude,
latitude,
inference_helper['longitude'],
inference_helper['latitude'])


# Now get assembled feature vector for prediction
feature_vector = feature_view.get_feature_vector({"cc_num": cc_num},
passed_features={"loc_delta_t_minus_1": loc_delta_t_minus_1}
)
```


## Training Helper columns
`training_helper_columns` are a list of feature names that are not part of the model schema itself but provide extra information during training.
For example, one might want to use a feature like the `category` of the purchased product to assign different sample weights.

=== "Python"

!!! example "Define training helper columns for feature views."
```python
# define query object
query = label_fg.select("fraud_label")\
.join(trans_fg.select(["amount", "loc_delta_t_minus_1", "longitude", "latitude", "category"]))
# define feature view with helper columns
feature_view = fs.get_or_create_feature_view(
name='fv_with_helper_col',
version=1,
query=query,
labels=["fraud_label"],
transformation_functions=transformation_functions,
training_helper_columns=["category"]
)
```

### Retrieval
When retrieving training data, helper columns are omitted by default. However, they can optionally be fetched.

=== "Python"

!!! example "Fetch training data with or without training helper column values."
```python

# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Create training data with training helper columns
TEST_SIZE = 0.2
X_train, X_test, y_train, y_test = feature_view.train_test_split(
description='transactions fraud training dataset',
test_size=TEST_SIZE,
training_helper_columns=True
)

# Get existing training data with training helper columns
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(
training_dataset_version=1,
training_helper_columns=True
)
```
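Once fetched, the `category` training helper column can be turned into per-row sample weights before being dropped from the training features. A sketch under assumed category names and weights (the weighting scheme is an illustration, not part of the Hopsworks API):

```python
# Hypothetical class-weighting scheme based on the `category` training
# helper column; category names and weights are assumptions.
CATEGORY_WEIGHTS = {"electronics": 2.0, "groceries": 1.0}

def sample_weights(categories):
    """Map each row's category to a training sample weight (default 1.0)."""
    return [CATEGORY_WEIGHTS.get(c, 1.0) for c in categories]

# e.g. weights to pass to model.fit(..., sample_weight=...) after
# dropping the helper column from the training features
weights = sample_weights(["electronics", "groceries", "travel"])
```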

!!! note
    To use helper columns with a materialized training dataset, the dataset needs to be created with `training_helper_columns=True`.
18 changes: 18 additions & 0 deletions docs/user_guides/fs/feature_view/training-data.md
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(training_dataset_version=1)
X_train, X_val, X_test, y_train, y_val, y_test = feature_view.get_train_validation_test_split(training_dataset_version=1)
```

## Read training data with primary key(s) and event time
For certain use cases, e.g. time series models, the input data needs to be sorted according to the combination of primary key(s) and event time.
Primary key(s) and event time are usually not included in the feature view query, as they are not features used for training.
To retrieve the primary key(s) and/or event time when retrieving training data, set the parameters `primary_keys=True` and/or `event_time=True`.


```python
# get a training dataset
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(training_dataset_version=1,
primary_keys=True,
event_time=True)
```

!!! note
    If the event time columns have the same name across all the feature groups included in the feature view, only the event time of the label feature group (the left-most feature group in the query) will be returned. If they have different names, all of them will be returned. The join prefix has no influence on this behaviour.

To use primary key(s) and event time columns with materialized training datasets, the dataset needs to be created with `primary_keys=True` and/or `with_event_time=True`.

## Deletion
To clean up unused training data, you can delete all training data or a particular version. Note that all training data metadata and materialized files stored in HopsFS will be deleted and cannot be recovered.
2 changes: 2 additions & 0 deletions mkdocs.yml
nav:
- Spine Group: concepts/fs/feature_group/spine_group.md
- Data Validation/Stats/Alerts: concepts/fs/feature_group/fg_statistics.md
- Versioning: concepts/fs/feature_group/versioning.md
- On-Demand Feature: concepts/fs/feature_group/on_demand_feature.md
- Feature Views:
- Overview: concepts/fs/feature_view/fv_overview.md
- Offline API: concepts/fs/feature_view/offline_api.md
- Feature vectors: user_guides/fs/feature_view/feature-vectors.md
- Feature server: user_guides/fs/feature_view/feature-server.md
- Query: user_guides/fs/feature_view/query.md
- Helper Columns: user_guides/fs/feature_view/helper-columns.md
- Transformation Functions: user_guides/fs/feature_view/transformation-function.md
- Spines: user_guides/fs/feature_view/spine-query.md
- Compute Engines: user_guides/fs/compute_engines.md