Commit d9bf08e

[FSTORE-1090] Concepts & Guides for helper columns and on-demand features (#333)

Co-authored-by: Fabio Buso <[email protected]>
Co-authored-by: davitbzh <[email protected]>
3 people committed Dec 19, 2023
1 parent 8d18206 commit d9bf08e
Showing 6 changed files with 220 additions and 0 deletions.
12 changes: 12 additions & 0 deletions docs/concepts/fs/feature_group/on_demand_feature.md
---
description: On-demand feature computation.
---

# On-demand features

Features are defined as on-demand when their values cannot be pre-computed; instead, they must be computed in real time during inference. This is achieved by implementing the on-demand feature as a Python function in a Python module. Make sure that the same version of the Python module is installed in both the feature and inference pipelines.

The image below shows an example of a housing price model that demonstrates how to implement an on-demand feature: a zip code (or postal code) computed from longitude/latitude parameters. In your online application, longitude and latitude are provided as parameters, and the same Python function used to compute the zip code in the feature pipeline is used to compute it in the online inference pipeline.

<img src="../../../../assets/images/concepts/fs/on-demand-feature.png">
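A minimal sketch of such a shared module is shown below. The module layout, function name, and toy coordinate lookup are assumptions for illustration; a real implementation would typically call a reverse-geocoding service:

```python
# feature_functions.py -- hypothetical shared module, installed (at the same
# version) in both the feature pipeline and the online inference pipeline.

# Toy lookup table standing in for a real reverse-geocoding service.
_ZIP_BY_CELL = {
    (40.7, -74.0): "10001",   # New York
    (34.0, -118.2): "90012",  # Los Angeles
}

def zip_code(latitude: float, longitude: float) -> str:
    """Compute the zip code on demand from coordinates."""
    # Round coordinates to a coarse grid cell and look it up.
    cell = (round(latitude, 1), round(longitude, 1))
    return _ZIP_BY_CELL.get(cell, "unknown")
```

Because both pipelines import the same function, the zip code computed at training time and at inference time is guaranteed to match.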

18 changes: 18 additions & 0 deletions docs/user_guides/fs/feature_view/batch-data.md
```java
Dataset<Row> ds = featureView.getBatchData("20220620", "20220627");
```

## Retrieve batch data with primary keys and event time
For certain use cases, e.g. time series models, the input data needs to be sorted according to the combination of primary key(s) and event time. One might also want to merge predictions back with the original input data for post-mortem analysis.
Primary key(s) and event time are usually not included in the feature view query, as they are not features used for training.
To retrieve the primary key(s) and/or event time when retrieving batch data for inference, set the parameters `primary_keys=True` and/or `event_time=True`.

=== "Python"
```python
# get batch data
df = feature_view.get_batch_data(
start_time = "20220620",
end_time = "20220627",
primary_keys=True,
event_time=True
) # return a dataframe with primary keys and event time
```
!!! note
    If the event time columns have the same name across all the feature groups included in the feature view, only the event time of the label feature group (the left-most feature group in the query) will be returned. If they have different names, all of them will be returned. The join prefix has no influence on this behaviour.
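To illustrate why the primary key(s) and event time matter here, the rows returned for a time series model can be ordered by the (primary key, event time) combination. A minimal sketch in plain Python, where the column names `cc_num` and `event_time` are assumptions for illustration:

```python
# Hypothetical batch-data rows, keyed by primary key `cc_num` and `event_time`.
rows = [
    {"cc_num": 111, "event_time": "20220622", "amount": 12.0},
    {"cc_num": 111, "event_time": "20220620", "amount": 7.5},
    {"cc_num": 222, "event_time": "20220621", "amount": 3.2},
]

# Sort by the (primary key, event time) combination, as a time series
# model expects its input to be ordered.
rows_sorted = sorted(rows, key=lambda r: (r["cc_num"], r["event_time"]))
```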

For Python clients handling small or moderately sized data, we recommend enabling the [ArrowFlight Server with DuckDB](../../../setup_installation/common/arrow_flight_duckdb.md), which provides significant speedups over Spark/Hive for reading batch data.
If the service is enabled and you want to read this particular batch data with Hive instead, you can set the `read_options` to `{"use_hive": True}`.
170 changes: 170 additions & 0 deletions docs/user_guides/fs/feature_view/helper-columns.md
---
description: Using Helper columns in Feature View queries for online/batch inference and training dataset.
---

# Helper columns
Hopsworks Feature Store lets you define two types of helper columns for [feature views](./overview.md): `inference_helper_columns` and `training_helper_columns`.

!!! note
    Both inference and training helper column name(s) must be part of the `Query` object. If a helper column name belongs to a feature group that is part of a `Join` with a `prefix` defined, then this prefix needs to be prepended to the original column name when defining the helper column list.
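As a small sketch of the prefix rule above, the names passed in the helper column list are the prefixed names, not the original feature group column names (the prefix and column names here are assumptions for illustration):

```python
# Hypothetical join prefix and helper columns.
prefix = "trans_"           # prefix given to the Join in the feature view query
helper_cols = ["longitude", "latitude"]

# The helper column list must use the prefixed names.
prefixed = [f"{prefix}{col}" for col in helper_cols]
# These are the names you would pass as inference_helper_columns
# when the feature group is joined with prefix="trans_".
```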

## Inference Helper columns
`inference_helper_columns` are a list of feature names that are not used for training the model itself but provide extra information during online or batch inference.
An example is computing an [on-demand feature](../../../concepts/fs/feature_group/on_demand_feature.md) such as `loc_delta_t_minus_1`, the distance between the previous and current transaction locations in a credit card fraud detection system.
The feature `loc_delta_t_minus_1` is computed from the previous transaction's coordinates `longitude` and `latitude`, which need to be fetched from the feature store and compared to the coordinates of the new transaction arriving at the inference application.
In this use case, `longitude` and `latitude` are `inference_helper_columns`: they are not used for training but are necessary for computing the [on-demand feature](../../../concepts/fs/feature_group/on_demand_feature.md) `loc_delta_t_minus_1`.

=== "Python"

!!! example "Define inference helper columns for feature views."
```python
# define query object
query = label_fg.select("fraud_label")\
.join(trans_fg.select(["amount", "loc_delta_t_minus_1", "longitude", "latitude", "category"]))
# define feature view with helper columns
feature_view = fs.get_or_create_feature_view(
name='fv_with_helper_col',
version=1,
query=query,
labels=["fraud_label"],
transformation_functions=transformation_functions,
inference_helper_columns=["longitude", "latitude"],
)
```

### Retrieval
When retrieving data for model inference, helper columns are omitted by default. However, they can optionally be fetched together with inference or training data.

#### Batch inference

=== "Python"

!!! example "Fetch inference helper column values and compute on-demand features during batch inference."
```python

# import feature functions
from feature_functions import location_delta, time_delta
# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Fetch feature data for batch inference with helper columns
df = feature_view.get_batch_data(start_time=start_time, end_time=end_time, inference_helpers=True)
# previous transaction coordinates (assumes rows are sorted in ascending event-time order)
df['longitude_prev'] = df['longitude'].shift(1)
df['latitude_prev'] = df['latitude'].shift(1)

# compute location delta between the previous and current transaction
df['loc_delta_t_minus_1'] = df.apply(lambda row: location_delta(row['longitude'],
row['latitude'],
row['longitude_prev'],
row['latitude_prev']), axis=1)

# prepare dataframe for prediction
df = df[[f.name for f in feature_view.features if not (f.label or f.inference_helper_column or f.training_helper_column)]]
```
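The `location_delta` helper imported from `feature_functions` above is not shown in this guide. A minimal sketch, assuming it returns the haversine (great-circle) distance in kilometres between two coordinate pairs, could look like this; the real module may differ:

```python
import math

def location_delta(long1, lat1, long2, lat2):
    """Haversine distance in km between two (longitude, latitude) pairs.

    Sketch of the helper imported from `feature_functions`; the real
    implementation may differ.
    """
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(long2 - long1)
    # haversine formula
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```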

#### Online inference

=== "Python"

!!! example "Fetch inference helper column values and compute on-demand features during online inference."
```python

from feature_functions import location_delta, time_delta
# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Fetch feature data for batch inference without helper columns
df_without_inference_helpers = feature_view.get_batch_data()

# Fetch feature data for batch inference with helper columns
df_with_inference_helpers = feature_view.get_batch_data(inference_helpers=True)

# here cc_num, longitude and latitude are provided as parameters to the application
cc_num = ...
longitude = ...
latitude = ...
# get previous transaction location of this credit card
inference_helper = feature_view.get_inference_helper({"cc_num": cc_num}, return_type="dict")

# compute location delta
loc_delta_t_minus_1 = location_delta(longitude,
latitude,
inference_helper['longitude'],
inference_helper['latitude'])


# Now get assembled feature vector for prediction
feature_vector = feature_view.get_feature_vector({"cc_num": cc_num},
passed_features={"loc_delta_t_minus_1": loc_delta_t_minus_1}
)
```


## Training Helper columns
`training_helper_columns` are a list of feature names that are not part of the model schema itself but provide extra information during training.
For example, one might want to use a feature like the `category` of the purchased product to assign different sample weights.

=== "Python"

!!! example "Define training helper columns for feature views."
```python
# define query object
query = label_fg.select("fraud_label")\
.join(trans_fg.select(["amount", "loc_delta_t_minus_1", "longitude", "latitude", "category"]))
# define feature view with helper columns
feature_view = fs.get_or_create_feature_view(
name='fv_with_helper_col',
version=1,
query=query,
labels=["fraud_label"],
transformation_functions=transformation_functions,
training_helper_columns=["category"]
)
```

### Retrieval
When retrieving training data, helper columns are omitted by default. However, they can optionally be fetched.

=== "Python"

!!! example "Fetch training data with or without training helper column values."
```python

# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Create training data with training helper columns
TEST_SIZE = 0.2
X_train, X_test, y_train, y_test = feature_view.train_test_split(
description='transactions fraud training dataset',
test_size=TEST_SIZE,
training_helper_columns=True
)

# Get existing training data with training helper columns
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(
training_dataset_version=1,
training_helper_columns=True
)
```
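Once fetched, the `category` training helper column can be turned into per-row sample weights before being dropped from the training features. A sketch under assumed category names and weights (the weighting scheme is an illustration, not part of the Hopsworks API):

```python
# Hypothetical class-weighting scheme based on the `category` training
# helper column; category names and weights are assumptions.
CATEGORY_WEIGHTS = {"electronics": 2.0, "groceries": 1.0}

def sample_weights(categories):
    """Map each row's category to a training sample weight (default 1.0)."""
    return [CATEGORY_WEIGHTS.get(c, 1.0) for c in categories]

# e.g. weights to pass to model.fit(..., sample_weight=...) after
# dropping the helper column from the training features
weights = sample_weights(["electronics", "groceries", "travel"])
```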

!!! note
    To use helper columns with a materialized training dataset, the dataset needs to be created with `training_helper_columns=True`.
18 changes: 18 additions & 0 deletions docs/user_guides/fs/feature_view/training-data.md
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(training_dataset_version=1)
X_train, X_val, X_test, y_train, y_val, y_test = feature_view.get_train_validation_test_split(training_dataset_version=1)
```

## Read training data with primary key(s) and event time
For certain use cases, e.g. time series models, the input data needs to be sorted according to the combination of primary key(s) and event time.
Primary key(s) and event time are usually not included in the feature view query, as they are not features used for training.
To retrieve the primary key(s) and/or event time when retrieving training data, set the parameters `primary_keys=True` and/or `event_time=True`.


```python
# get a training dataset
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(training_dataset_version=1,
primary_keys=True,
event_time=True)
```

!!! note
    If the event time columns have the same name across all the feature groups included in the feature view, only the event time of the label feature group (the left-most feature group in the query) will be returned. If they have different names, all of them will be returned. The join prefix has no influence on this behaviour.

To use primary key(s) and event time columns with materialized training datasets, the dataset needs to be created with `primary_keys=True` and/or `with_event_time=True`.

## Deletion
To clean up unused training data, you can delete all training data or a particular version. Note that all training data metadata and materialized files stored in HopsFS will be deleted and cannot be recovered.
2 changes: 2 additions & 0 deletions mkdocs.yml
nav:
- Spine Group: concepts/fs/feature_group/spine_group.md
- Data Validation/Stats/Alerts: concepts/fs/feature_group/fg_statistics.md
- Versioning: concepts/fs/feature_group/versioning.md
- On-Demand Feature: concepts/fs/feature_group/on_demand_feature.md
- Feature Views:
- Overview: concepts/fs/feature_view/fv_overview.md
- Offline API: concepts/fs/feature_view/offline_api.md
- Feature vectors: user_guides/fs/feature_view/feature-vectors.md
- Feature server: user_guides/fs/feature_view/feature-server.md
- Query: user_guides/fs/feature_view/query.md
- Helper Columns: user_guides/fs/feature_view/helper-columns.md
- Transformation Functions: user_guides/fs/feature_view/transformation-function.md
- Spines: user_guides/fs/feature_view/spine-query.md
- Compute Engines: user_guides/fs/compute_engines.md