[FSTORE-1090] Concepts & Guides for helper columns and on-demand features #333

Merged · 11 commits · Dec 19, 2023
12 changes: 12 additions & 0 deletions docs/concepts/fs/feature_group/on_demand_feature.md
@@ -0,0 +1,12 @@
---
description: On-demand feature computation.
---

# On-demand features

Features are defined as on-demand when their values cannot be pre-computed; instead, they need to be computed in real time during inference. This is achieved by implementing the on-demand feature as a Python function in a Python module. Also ensure that the same version of the Python module is installed in both the feature and inference pipelines.

The image below shows an example of a housing price model that demonstrates how to implement an on-demand feature: a zip code (or postal code) that is computed from longitude/latitude parameters. In your online application, longitude and latitude are provided as parameters, and the same Python function used to compute the zip code in the feature pipeline is used to compute the zip code in the online inference pipeline.

<img src="../../../../assets/images/concepts/fs/on-demand-feature.png">
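A minimal sketch of the pattern (the module name `feature_functions.py` and the `get_zipcode` function with its toy lookup logic are hypothetical, for illustration only):

```python
# feature_functions.py — install the same version of this module
# in both the feature pipeline and the inference pipeline.

def get_zipcode(longitude: float, latitude: float) -> str:
    """Hypothetical on-demand feature: derive a zip code from coordinates.

    A real implementation would use a reverse-geocoding library or a
    spatial index; this toy version buckets coordinates into a grid.
    """
    cell = (round(latitude, 1), round(longitude, 1))
    zipcode_by_cell = {(59.3, 18.1): "11120"}  # hypothetical lookup table
    return zipcode_by_cell.get(cell, "unknown")


# The feature pipeline and the online inference pipeline call the same function:
print(get_zipcode(18.07, 59.33))  # -> "11120"
```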

20 changes: 20 additions & 0 deletions docs/user_guides/fs/feature_view/batch-data.md
@@ -16,6 +16,26 @@ It is very common that ML models are deployed in a "batch" setting where ML pipe
Dataset<Row> ds = featureView.getBatchData("20220620", "20220627");
```

## Retrieve batch data with primary keys and event time
In certain scenarios, for example for time series models, the input data needs to be sorted according to a combination of primary key(s) and event time.
One might also want to merge predictions back into the original input data for post-mortem analysis. However, primary key(s) and event time are usually not included in the feature view query, as
they are not features used for training. To retrieve them, pass the following arguments to the `get_batch_data` method:
`primary_keys=True` and/or `event_time=True`.

=== "Python"
```python
# get batch data
df = feature_view.get_batch_data(
start_time = "20220620",
end_time = "20220627",
primary_keys=True,
event_time=True
) # returns a dataframe with primary keys and event time
```
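With the keys and event time present, predictions can be merged back onto the batch input for post-mortem analysis. A minimal sketch (the fitted `model`, the primary key `cc_num`, and the event time column `datetime` are assumptions for illustration):

```python
# sort chronologically per entity, then attach predictions for later analysis
feature_cols = [c for c in df.columns if c not in ("cc_num", "datetime")]
df = df.sort_values(["cc_num", "datetime"])
df["prediction"] = model.predict(df[feature_cols])
```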
!!! note
If the event time column has the same name in the feature groups included in the feature view query, then the event time of the left-most feature group in the query will be returned. If the names differ, then
all of them will be returned. Join prefixes have no influence on this behaviour.

For Python clients handling small or moderately sized data, we recommend enabling the [ArrowFlight Server with DuckDB](../../../setup_installation/common/arrow_flight_duckdb.md), which will provide significant speedups over Spark/Hive for reading batch data.
If the service is enabled, and you want to read this particular batch data with Hive instead, you can set `read_options` to `{"use_hive": True}`.
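For example (a minimal sketch reusing the date range from above):

```python
# read this particular batch with Hive even though ArrowFlight/DuckDB is enabled
df = feature_view.get_batch_data(
    start_time="20220620",
    end_time="20220627",
    read_options={"use_hive": True},
)
```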
170 changes: 170 additions & 0 deletions docs/user_guides/fs/feature_view/helper-columns.md
@@ -0,0 +1,170 @@
---
description: Using helper columns in feature view queries for online/batch inference and training datasets.
---

# Helper columns
Hopsworks Feature Store provides functionality to define two types of helper columns, `inference_helper_columns` and `training_helper_columns`, for [feature views](./overview.md).

!!! note
Both inference and training helper column name(s) must be part of the `Query` object. If a helper column name belongs to a feature group that is part of a `Join` with a `prefix` defined, then this prefix needs to be prepended
to the original column name when defining the helper column list.
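For example (a sketch; the prefix `t_` and the feature view name are hypothetical):

```python
# trans_fg is joined with the prefix "t_", so helper columns coming
# from that feature group must carry the prefix in the helper column list
query = label_fg.select("fraud_label")\
    .join(trans_fg.select(["amount", "longitude", "latitude"]), prefix="t_")

feature_view = fs.get_or_create_feature_view(
    name="fv_with_prefixed_helper_col",
    version=1,
    query=query,
    labels=["fraud_label"],
    inference_helper_columns=["t_longitude", "t_latitude"],
)
```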

## Inference Helper columns

`inference_helper_columns` are a list of feature names that are not used for training the model itself but provide extra information during online or batch inference.
An example is computing an [on-demand feature](../../../concepts/fs/feature_group/on_demand_feature.md) such as `loc_delta_t_minus_1`, the distance between the previous and the current transaction location in a credit card fraud detection system.
The feature `loc_delta_t_minus_1` is computed from the previous transaction coordinates `longitude` and `latitude`, which need to be fetched from the feature store and compared to the new transaction coordinates that arrive at the inference application.
In this use case `longitude` and `latitude` are `inference_helper_columns`: they are not used for training but are necessary for computing the [on-demand feature](../../../concepts/fs/feature_group/on_demand_feature.md) `loc_delta_t_minus_1`.

=== "Python"

!!! example "Define inference columns for feature views."
```python
# define query object
query = label_fg.select("fraud_label")\
.join(trans_fg.select(["amount", "loc_delta_t_minus_1", "longitude", "latitude", "category"]))

# define feature view with helper columns
feature_view = fs.get_or_create_feature_view(
name='fv_with_helper_col',
version=1,
query=query,
labels=["fraud_label"],
transformation_functions=transformation_functions,
inference_helper_columns=["longitude", "latitude"],
)
```

### Retrieval
When retrieving data for model inference, helper columns will be omitted. However, they can optionally be fetched together with inference or training data.

#### Batch inference


=== "Python"

!!! example "Fetch inference helper column values and compute on-demand features during batch inference."
```python

# import feature functions
from feature_functions import location_delta, time_delta

# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Fetch feature data for batch inference with helper columns
df = feature_view.get_batch_data(start_time=start_time, end_time=end_time, inference_helpers=True)
df['longitude_prev'] = df['longitude'].shift(-1)
df['latitude_prev'] = df['latitude'].shift(-1)

# compute location delta
df['loc_delta_t_minus_1'] = df.apply(lambda row: location_delta(row['longitude'],
row['latitude'],
row['longitude_prev'],
row['latitude_prev']), axis=1)

# prepare dataframe for prediction
df = df[[f.name for f in feature_view.features if not (f.label or f.inference_helper_column or f.training_helper_column)]]
```

#### Online inference


=== "Python"

!!! example "Fetch inference helper column values and compute on-demand features during online inference."
```python

from feature_functions import location_delta, time_delta

# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Fetch feature data for batch inference without helper columns
df_without_inference_helpers = feature_view.get_batch_data()

# Fetch feature data for batch inference with helper columns
df_with_inference_helpers = feature_view.get_batch_data(inference_helpers=True)

# here cc_num, longitude and latitude are provided as parameters to the application
cc_num = ...
longitude = ...
latitude = ...

# get the previous transaction location of this credit card
inference_helper = feature_view.get_inference_helper({"cc_num": cc_num}, return_type="dict")

# compute location delta
loc_delta_t_minus_1 = location_delta(longitude,
latitude,
inference_helper['longitude'],
inference_helper['latitude'])


# Now get assembled feature vector for prediction
feature_vector = feature_view.get_feature_vector({"cc_num": cc_num},
passed_features={"loc_delta_t_minus_1": loc_delta_t_minus_1}
)
```
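The assembled vector can then be fed to the model (a sketch; the `model` object is an assumption, not part of the feature view API):

```python
# `model` is any estimator deployed alongside the online application
prediction = model.predict([feature_vector])
```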


## Training Helper columns
`training_helper_columns` are a list of feature names that are not part of the model schema itself but are used for extra information during training.
For example, one might want to use a feature like the `category` of the purchased product to assign different sample weights (see the sketch at the end of this section).

=== "Python"

!!! example "Define training helper columns for feature views."
```python
# define query object
query = label_fg.select("fraud_label")\
.join(trans_fg.select(["amount", "loc_delta_t_minus_1", "longitude", "latitude", "category"]))

# define feature view with helper columns
feature_view = fs.get_or_create_feature_view(
name='fv_with_helper_col',
version=1,
query=query,
labels=["fraud_label"],
transformation_functions=transformation_functions,
training_helper_columns=["category"]
)
```

### Retrieval
When retrieving training data, helper columns will be omitted. However, they can optionally be fetched.

=== "Python"

!!! example "Fetch training data with or without inference helper column values."
```python

# import feature functions
from feature_functions import location_delta, time_delta

# Fetch feature view object
feature_view = fs.get_feature_view(
name='fv_with_helper_col',
version=1,
)

# Create training data with training helper columns
TEST_SIZE = 0.2
X_train, X_test, y_train, y_test = feature_view.train_test_split(
description='transactions fraud training dataset',
test_size=TEST_SIZE,
training_helper_columns=True
)

# Get existing training data with training helper columns
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(
training_dataset_version=1,
training_helper_columns=True
)
```

!!! note
To use helper columns with a materialized training dataset, it needs to be created with `training_helper_columns=True`.
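Once fetched, a training helper column can be used, for example, to weight samples during training. A minimal sketch (assuming pandas dataframes and scikit-learn; the weight table and the model are hypothetical, not part of the Hopsworks API):

```python
from sklearn.ensemble import RandomForestClassifier

# hypothetical per-category weights; unseen categories default to 1.0
weight_by_category = {"electronics": 2.0, "groceries": 1.0}
sample_weight = X_train["category"].map(weight_by_category).fillna(1.0)

# drop the helper column before fitting: it is not part of the model schema
model = RandomForestClassifier()
model.fit(X_train.drop(columns=["category"]), y_train.values.ravel(),
          sample_weight=sample_weight)
```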
19 changes: 19 additions & 0 deletions docs/user_guides/fs/feature_view/training-data.md
@@ -94,6 +94,25 @@ X_train, X_test, y_train, y_test = feature_view.get_train_test_split(training_da
X_train, X_val, X_test, y_train, y_val, y_test = feature_view.get_train_validation_test_split(training_dataset_version=1)
```

## Read training data with primary key(s) and event time
In certain scenarios, for example for time series analysis, training data needs to be sorted according to a combination of primary key(s) and event time.
However, these columns are usually not included in the feature view query, as they are not features used for training. To retrieve them, pass the following arguments:
`primary_keys=True` and/or `event_time=True`.


```python
# get a training dataset
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(training_dataset_version=1,
primary_keys=True,
event_time=True)
```
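The keys can then be used to order the data chronologically per entity. A sketch (the primary key `cc_num` and the event time column `datetime` are assumptions for illustration):

```python
# sort each split by primary key and event time
X_train = X_train.sort_values(["cc_num", "datetime"])
X_test = X_test.sort_values(["cc_num", "datetime"])
```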

!!! note
If the event time column has the same name in the feature groups included in the parent feature view query, then the event time of the left-most feature group in the query will be returned. If the names differ, then
all of them will be returned. Join prefixes have no influence on this behaviour.

To use primary key(s) and the event time column with a materialized training dataset, it needs to be created with `primary_keys=True` and/or `with_event_time=True`.
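For example (a sketch, assuming the training dataset creation API accepts the same flags):

```python
# materialize a training dataset that keeps primary keys and event time
version, job = feature_view.create_train_test_split(
    test_size=0.2,
    primary_keys=True,
    with_event_time=True,
)
```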

## Deletion
To clean up unused training data, you can delete all training data or a particular version. Note that all training data metadata and materialized files stored in HopsFS will be deleted and cannot be recreated.
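A sketch of the two options (assuming the `delete_training_dataset` and `delete_all_training_datasets` methods of the feature view object):

```python
# delete a particular training dataset version
feature_view.delete_training_dataset(training_dataset_version=1)

# delete all training datasets of this feature view
feature_view.delete_all_training_datasets()
```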
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -28,6 +28,7 @@ nav:
- Spine Group: concepts/fs/feature_group/spine_group.md
- Data Validation/Stats/Alerts: concepts/fs/feature_group/fg_statistics.md
- Versioning: concepts/fs/feature_group/versioning.md
- On-Demand Feature: concepts/fs/feature_group/on_demand_feature.md
- Feature Views:
- Overview: concepts/fs/feature_view/fv_overview.md
- Offline API: concepts/fs/feature_view/offline_api.md
@@ -84,6 +85,7 @@
- Feature vectors: user_guides/fs/feature_view/feature-vectors.md
- Feature server: user_guides/fs/feature_view/feature-server.md
- Query: user_guides/fs/feature_view/query.md
- Helper Columns: user_guides/fs/feature_view/helper-columns.md
- Transformation Functions: user_guides/fs/feature_view/transformation-function.md
- Spines: user_guides/fs/feature_view/spine-query.md
- Compute Engines: user_guides/fs/compute_engines.md