Hopsworks Python library installation documentation improvements (#431)
* Hopsworks Python library installation documentation improvements

- Remove references to pip install hsfs and hsfs.connection()
- Improve the documentation for the installation of the Python library (Including profiles)
- Add documentation for the installation of the Java library

* Typo

* Fix for review
SirOibaf committed Jan 3, 2025
1 parent 3ea3f5d commit 24a0de2
Showing 20 changed files with 223 additions and 468 deletions.
6 binary files changed (contents not shown).
115 changes: 87 additions & 28 deletions docs/user_guides/client_installation/index.md
@@ -1,56 +1,115 @@
---
description: Documentation on how to install the Hopsworks and HSFS Python libraries, including the specific requirements for Mac OSX and Windows.
description: Documentation on how to install the Hopsworks Python and Java libraries.
---
# Client Installation Guide

## Hopsworks (including Feature Store and MLOps)
The Hopsworks client library is required to connect to the Hopsworks Feature Store and MLOps services from your local machine or any other Python environment such as Google Colab or AWS Sagemaker. Execute the following command to install the full Hopsworks client library in your Python environment:
## Hopsworks Python library

The Hopsworks Python client library is required to connect to Hopsworks from your local machine or any other Python environment such as Google Colab or AWS SageMaker. Execute the following command to install the Hopsworks client library in your Python environment:

!!! note "Virtual environment"
It is recommended to use a virtual Python environment instead of your operating system's environment, in order to avoid side effects from interfering dependencies.

```bash
pip install hopsworks
```
Supported versions of Python: 3.8, 3.9, 3.10, 3.11, 3.12 ([PyPI ↗](https://pypi.org/project/hopsworks/))
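Once installed, a quick way to verify the setup is to connect from Python; a minimal sketch, assuming you already have a reachable Hopsworks instance and an API key (the host, project name and key below are placeholders):

```python
import hopsworks

# Placeholders: replace with your own instance DNS, project name and API key
project = hopsworks.login(
    host="my_instance",
    project="my_project",
    api_key_value="apikey",
)
fs = project.get_feature_store()  # Handle to the project's default feature store
print(fs.name)
```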

!!! attention "OSX Installation"
The latest version of Hopsworks should work on OSX systems without any additional requirements. However, if you are installing an older version of the Hopsworks SDK, you might need to install `librdkafka` manually. Check out the documentation for the specific version you are installing.

!!! attention "Windows/Conda Installation"

On Windows systems, if you don't have the Microsoft Visual C++ Build Tools installed, you might need to install twofish manually before installing hopsworks. In that case, it is recommended to use a conda environment and run the following commands:

```bash
conda install twofish
pip install hopsworks
pip install hopsworks[python]
```

## Feature Store only
To only install the Hopsworks Feature Store client library, execute the following command:
```bash
pip install hopsworks[python]
```
Supported versions of Python: 3.8, 3.9, 3.10, 3.11, 3.12 ([PyPI ↗](https://pypi.org/project/hopsworks/))

### Profiles

The Hopsworks library has several profiles that bring additional dependencies and enable additional functionalities:

| Profile Name | Description |
| ------------------ | ------------- |
| No Profile | This is the base installation. Supports interacting with the feature store metadata, model registry and deployments. It also supports reading and writing from the feature store from PySpark environments. |
| `python` | This profile enables reading and writing from/to the feature store from a Python environment |
| `great-expectations` | This profile installs the [Great Expectations](https://greatexpectations.io/) Python library and enables data validation on feature pipelines |
| `polars` | This profile installs the [Polars](https://pola.rs/) library and enables reading and writing Polars DataFrames |

You can install all the above profiles with the following command:

```bash
pip install hsfs[python]
# or if using zsh
pip install 'hsfs[python]'
pip install hopsworks[python,great-expectations,polars]
```
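As a rough illustration of what the profiles enable, the sketch below reads a feature group into pandas (requires the `python` profile) and, optionally, into Polars (requires the `polars` profile); the feature group name and version are hypothetical:

```python
import hopsworks

project = hopsworks.login()  # Prompts for connection details / API key if not configured
fs = project.get_feature_store()

# Hypothetical feature group; replace with one that exists in your project
fg = fs.get_feature_group("transactions", version=1)

df = fg.read()  # pandas DataFrame (needs the `python` profile)
# df_polars = fg.read(dataframe_type="polars")  # Polars DataFrame (needs the `polars` profile)
```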
Supported versions of Python: 3.8, 3.9, 3.10, 3.11, 3.12 ([PyPI ↗](https://pypi.org/project/hsfs/))

!!! attention "OSX Installation"
The latest version of Hopsworks should work on OSX systems without any additional requirements. However, if you are installing an older version of the Hopsworks SDK, you might need to install `librdkafka` manually. Check out the documentation for the specific version you are installing.
## HSFS Java Library

!!! attention "Windows/Conda Installation"
If you want to interact with the Hopsworks Feature Store from environments such as Spark, Flink or Beam, you can use the Hopsworks Feature Store (HSFS) Java library.

On Windows systems, if you don't have the Microsoft Visual C++ Build Tools installed, you might need to install twofish manually before installing hsfs. In that case, it is recommended to use a conda environment and run the following commands:

```bash
conda install twofish
pip install hsfs[python]
```
!!! note "Feature Store Only"

The Java library only allows interaction with the Feature Store component of the Hopsworks platform. Additionally, each environment might restrict the supported API operations. You can see which API operations are supported by which environment [here](../fs/compute_engines).

The HSFS library is available in the Hopsworks Maven repository. If you are using Maven as your build tool, you can add the following to your `pom.xml` file:

```xml
<repositories>
  <repository>
    <id>Hops</id>
    <name>Hops Repository</name>
    <url>https://archiva.hops.works/repository/Hops/</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
```

The library has different builds targeting different environments:

### Spark

The `artifactId` for the Spark build is `hsfs-spark-spark{spark.version}`. If you are using Maven as your build tool, you can add the following dependency:

```xml
<dependency>
  <groupId>com.logicalclocks</groupId>
  <artifactId>hsfs-spark-spark3.1</artifactId>
  <version>${hsfs.version}</version>
</dependency>
```

Hopsworks provides builds for Spark 3.1, 3.3 and 3.5. The builds are also provided as JAR files, which can be downloaded from the [Hopsworks repository](https://repo.hops.works/master/hsfs).

### Flink

The `artifactId` for the Flink build is `hsfs-flink`. If you are using Maven as your build tool, you can add the following dependency:

```xml
<dependency>
  <groupId>com.logicalclocks</groupId>
  <artifactId>hsfs-flink</artifactId>
  <version>${hsfs.version}</version>
</dependency>
```

### Beam

The `artifactId` for the Beam build is `hsfs-beam`. If you are using Maven as your build tool, you can add the following dependency:

```xml
<dependency>
  <groupId>com.logicalclocks</groupId>
  <artifactId>hsfs-beam</artifactId>
  <version>${hsfs.version}</version>
</dependency>
```

## Next Steps

If you are using a local python environment and want to connect to the Hopsworks Feature Store, you can follow the [Python Guide](../integrations/python.md#generate-an-api-key) section to create an API Key and to get started.
If you are using a local Python environment and want to connect to Hopsworks, you can follow the [Python Guide](../integrations/python.md#generate-an-api-key) section to create an API key and get started.

## Other environments

8 changes: 4 additions & 4 deletions docs/user_guides/fs/sharing/sharing.md
@@ -64,12 +64,12 @@ To access features from a shared feature store you need to first retrieve the ha
To retrieve the handle, use the get_feature_store() method and provide the name of the shared feature store:

```python
import hsfs
import hopsworks

connection = hsfs.connection()
project = hopsworks.login()

project_feature_store = connection.get_feature_store()
shared_feature_store = connection.get_feature_store(name="name_of_shared_feature_store")
project_feature_store = project.get_feature_store()
shared_feature_store = project.get_feature_store(name="name_of_shared_feature_store")
```

### Step 2: Fetch feature groups
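Continuing from the snippet above, fetching a feature group from the shared handle works the same way as from the project's own feature store; a minimal sketch with a hypothetical feature group name and version:

```python
# Hypothetical feature group living in the shared feature store
shared_fg = shared_feature_store.get_feature_group("shared_transactions", version=1)
df = shared_fg.read()
```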
7 changes: 3 additions & 4 deletions docs/user_guides/fs/storage_connector/usage.md
@@ -14,11 +14,10 @@ We retrieve a storage connector simply by its unique name.

=== "PySpark"
```python
import hsfs
import hopsworks
# Connect to the Hopsworks feature store
hsfs_connection = hsfs.connection()
# Retrieve the metadata handle
feature_store = hsfs_connection.get_feature_store()
project = hopsworks.login()
feature_store = project.get_feature_store()
# Retrieve storage connector
connector = feature_store.get_storage_connector('connector_name')
```
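Once retrieved, the connector can, for example, be used to read data directly; a rough sketch, where the data format and path are hypothetical and the exact arguments depend on the connector type:

```python
# Read data through the connector; arguments depend on the connector type
df = connector.read(data_format="parquet", path="path/to/data")
```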
128 changes: 10 additions & 118 deletions docs/user_guides/integrations/databricks/api_key.md
@@ -1,6 +1,6 @@
# Hopsworks API key

In order for the Databricks cluster to be able to communicate with the Hopsworks Feature Store, the clients running on Databricks need to be able to access a Hopsworks API key.
In order for the Databricks cluster to be able to communicate with Hopsworks, clients running on Databricks need to be able to access a Hopsworks API key.

## Generate an API key

@@ -15,127 +15,19 @@ For instructions on how to generate an API key follow this [user guide](../../pr

!!! hint "API key as Argument"
To get started quickly, without saving the Hopsworks API key in a secret storage, you can simply supply it as an argument when instantiating a connection:
```python hl_lines="6"
import hopsworks
project = hopsworks.login(
    host='my_instance',          # DNS of your Feature Store instance
    port=443,                    # Port to reach your Hopsworks instance, defaults to 443
    project='my_project',        # Name of your Hopsworks Feature Store project
    api_key_value='apikey',      # The API key to authenticate with Hopsworks
    hostname_verification=True   # Disable for self-signed certificates
)
fs = project.get_feature_store()  # Get the project's default feature store
```

## Store the API key

### AWS

#### Step 1: Create an instance profile to attach to your Databricks clusters

Go to *AWS IAM*, choose *Roles*, and click on *Create Role*. Select *AWS Service* as the type of trusted entity and *EC2* as the use case, as shown below:

<p align="center">
<figure>
<img src="../../../../assets/images/guides/integrations/create-instance-profile.png" alt="Create an instance profile">
<figcaption>Create an instance profile</figcaption>
</figure>
</p>

Click on *Next: Permissions*, *Next: Tags*, and then *Next: Review*. Name the instance profile role and then click *Create role*.

#### Step 2: Storing the API Key

**Option 1: Using the AWS Systems Manager Parameter Store**

In the AWS Management Console, ensure that your active region is the region you use for Databricks.
Go to the *AWS Systems Manager*, choose *Parameter Store*, and select *Create Parameter*.
As the name, enter `/hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key`, replacing `[MY_DATABRICKS_ROLE]` with the name of the AWS role you created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters). Select *Secure String* as the type and create the parameter.

<p align="center">
<figure>
<img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_parameter_store.png" alt="Storing the Feature Store API key in the Parameter Store">
<figcaption>Storing the Feature Store API key in the Parameter Store</figcaption>
</figure>
</p>


Once the API Key is stored, you need to grant access to it from the AWS role that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
In the AWS Management Console, go to *IAM*, select *Roles* and then search for the role that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
Select *Add inline policy*. Choose *Systems Manager* as service, expand the *Read* access level and check *GetParameter*.
Expand Resources and select *Add ARN*.
Enter the region of the *Systems Manager* as well as the name of the parameter **WITHOUT the leading slash** e.g. *hopsworks/role/[MY_DATABRICKS_ROLE]/type/api-key* and click *Add*.
Click on *Review*, give the policy a name and click on *Create policy*.

<p align="center">
<figure>
<img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_parameter_store_policy.png" alt="Configuring the access policy for the Parameter Store">
<figcaption>Configuring the access policy for the Parameter Store</figcaption>
</figure>
</p>


**Option 2: Using the AWS Secrets Manager**

In the AWS Management Console, ensure that your active region is the region you use for Databricks.
Go to the *AWS Secrets Manager* and select *Store new secret*. Select *Other type of secrets*, add *api-key*
as the key, and paste the API key created in the previous step as the value. Click next.

<p align="center">
<figure>
<img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_secrets_manager_step_1.png" alt="Storing a Feature Store API key in the Secrets Manager Step 1">
<figcaption>Storing a Feature Store API key in the Secrets Manager Step 1</figcaption>
</figure>
</p>

As secret name, enter *hopsworks/role/[MY_DATABRICKS_ROLE]* replacing [MY_DATABRICKS_ROLE] with the AWS role you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters). Select next twice and finally store the secret.
Then click on the secret in the secrets list and take note of the *Secret ARN*.

<p align="center">
<figure>
<img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_secrets_manager_step_2.png" alt="Storing a Feature Store API key in the Secrets Manager Step 2">
<figcaption>Storing a Feature Store API key in the Secrets Manager Step 2</figcaption>
</figure>
</p>

Once the API Key is stored, you need to grant access to it from the AWS role that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
In the AWS Management Console, go to *IAM*, select *Roles* and then the role that you have created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).
Select *Add inline policy*. Choose *Secrets Manager* as service, expand the *Read* access level and check *GetSecretValue*.
Expand Resources and select *Add ARN*. Paste the ARN of the secret created in the previous step.
Click on *Review*, give the policy a name and click on *Create policy*.

<p align="center">
<figure>
<img src="../../../../assets/images/guides/integrations/databricks/aws/databricks_secrets_manager_policy.png" alt="Configuring the access policy for the Secrets Manager">
<figcaption>Configuring the access policy for the Secrets Manager</figcaption>
</figure>
</p>

#### Step 3: Allow Databricks to use the AWS role created in Step 1

First you need to get the AWS role used by Databricks for deployments as described in [this step](https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html#step-3-note-the-iam-role-used-to-create-the-databricks-deployment). Once you get the role name, go to *AWS IAM*, search for the role, and click on it. Then, select the *Permissions* tab, click on *Add inline policy*, select the *JSON* tab, and paste the following snippet. Replace *[ACCOUNT_ID]* with your AWS account id, and *[MY_DATABRICKS_ROLE]* with the AWS role name created in [Step 1](#step-1-create-an-instance-profile-to-attach-to-your-databricks-clusters).

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PassRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::[ACCOUNT_ID]:role/[MY_DATABRICKS_ROLE]"
    }
  ]
}
```
```python hl_lines="6"
import hopsworks
project = hopsworks.login(
    host='my_instance',        # DNS of your Feature Store instance
    port=443,                  # Port to reach your Hopsworks instance, defaults to 443
    project='my_project',      # Name of your Hopsworks Feature Store project
    api_key_value='apikey',    # The API key to authenticate with Hopsworks
)
fs = project.get_feature_store()  # Get the project's default feature store
```
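On Databricks, rather than hard-coding the key, one option is to keep it in a Databricks secret scope and pass it at login; a sketch, where the scope and key names are hypothetical:

```python
import hopsworks

# Hypothetical secret scope and key; dbutils is available in Databricks notebooks
api_key = dbutils.secrets.get(scope="hopsworks", key="api-key")

project = hopsworks.login(
    host="my_instance",
    project="my_project",
    api_key_value=api_key,
)
```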

Click *Review Policy*, name the policy, and click *Create Policy*. Then, go to your Databricks workspace and follow [this step](https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html#step-5-add-the-instance-profile-to-databricks) to add the instance profile to your workspace. Finally, when launching Databricks clusters, select *Advanced* settings and choose the instance profile you have just added.


### Azure

On Azure we currently do not support storing the API key in a secret storage. Instead, just store the API key in a file in your Databricks workspace so you can access it when connecting to the Feature Store.
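For example, if the key file is available on the cluster, it can be passed via `api_key_file` when logging in; a minimal sketch with a hypothetical file path:

```python
import hopsworks

project = hopsworks.login(
    host="my_instance",
    project="my_project",
    api_key_file="featurestore.key",  # Path to the file containing your API key
)
```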

## Next Steps

Continue with the [configuration guide](configuration.md) to finalize the configuration of the Databricks Cluster to communicate with the Hopsworks Feature Store.
42 changes: 10 additions & 32 deletions docs/user_guides/integrations/databricks/configuration.md
@@ -90,38 +90,16 @@ When a cluster is configured for a specific project user, all the operations wit
At the end of the configuration, Hopsworks will start the cluster.
Once the cluster is running users can establish a connection to the Hopsworks Feature Store from Databricks:

!!! note "API key on Azure"
Please note that for Azure it is necessary to store the Hopsworks API key locally on the cluster as a file, as we currently do not support storing the API key in an Azure secret management service as we do for AWS. Consult the [API key guide for Azure](api_key.md#azure) for more information.

=== "AWS"

```python
import hsfs
conn = hsfs.connection(
    'my_instance',                   # DNS of your Feature Store instance
    443,                             # Port to reach your Hopsworks instance, defaults to 443
    'my_project',                    # Name of your Hopsworks Feature Store project
    secrets_store='secretsmanager',  # Either parameterstore or secretsmanager
    hostname_verification=True       # Disable for self-signed certificates
)
fs = conn.get_feature_store()        # Get the project's default feature store
```

=== "Azure"

```python
import hsfs
conn = hsfs.connection(
    'my_instance',                    # DNS of your Feature Store instance
    443,                              # Port to reach your Hopsworks instance, defaults to 443
    'my_project',                     # Name of your Hopsworks Feature Store project
    api_key_file="featurestore.key",  # For Azure, store the API key locally
    secrets_store="local",
    hostname_verification=True        # Disable for self-signed certificates
)
fs = conn.get_feature_store()         # Get the project's default feature store
```
```python
import hopsworks
project = hopsworks.login(
    host='my_instance',      # DNS of your Hopsworks instance
    port=443,                # Port to reach your Hopsworks instance, defaults to 443
    project='my_project',    # Name of your Hopsworks project
    api_key_value='apikey',  # The API key to authenticate with Hopsworks
)
fs = project.get_feature_store()  # Get the project's default feature store
```
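Once connected from Databricks, feature data can be read straight into Spark DataFrames; a minimal sketch with a hypothetical feature group:

```python
# Hypothetical feature group; in a Spark environment, read() returns a Spark DataFrame
fg = fs.get_feature_group("transactions", version=1)
df = fg.read()
df.show(5)
```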

## Next Steps
