Start by ⭐️ starring the lakeFS open source project.
This repository includes a Jupyter Notebook which you can run on your local machine. The notebook demonstrates how to integrate AWS Trino, using the Glue Catalog, with lakeFS.
- lakeFS installed and running in your AWS environment or in lakeFS Cloud. If you don't have lakeFS running yet, either use lakeFS Cloud, which provides a free lakeFS server on demand with a single click, or deploy lakeFS on AWS.
- Start by cloning this repository:

      git clone https://github.com/treeverse/lakeFS-samples && cd lakeFS-samples/01_standalone_examples/aws-glue-trino
- Change `lakeFS Endpoint URL`, `Access Key` and `Secret Key` in the `trino_configurations.json` file included in the Git repo in the `lakeFS-samples/01_standalone_examples/aws-glue-trino` folder.
- Run the following AWS CLI command to create an EMR cluster. Change the AWS `region_name`, `log-uri` and `ec2_subnet_name` before running the command. The lakeFS Python SDK requires Python v3.9 or above, and Python v3.9 is supported starting with EMR v7.0.0.

      aws emr create-cluster \
        --release-label emr-7.0.0 \
        --applications Name=Trino Name=JupyterEnterpriseGateway Name=Spark \
        --configurations file://trino_configurations.json \
        --region region_name \
        --name lakefs_glue_trino_demo_cluster \
        --log-uri s3://bucket-name/trino/logs/ \
        --instance-type m5.xlarge \
        --instance-count 1 \
        --service-role EMR_DefaultRole \
        --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetId=ec2_subnet_name
- Set up an AWS EMR Studio. Use the same subnet that you used to create the EMR cluster.
- Create and launch an AWS EMR Studio Workspace.
- Attach the AWS EMR cluster to the Workspace.
Upload `trino-glue-demo-notebook`, included in the Git repo in the `lakeFS-samples/01_standalone_examples/aws-glue-trino` folder, to the AWS EMR Studio Workspace. Open the notebook, select the PySpark kernel to run it, and follow the instructions in the notebook.
- Run the following command to terminate the EMR cluster once you finish the demo. Replace `cluster_id` with the ID returned by the `aws emr create-cluster` command, or look up the cluster ID in the EMR UI.

      aws emr terminate-clusters --cluster-ids cluster_id
- Stop or delete the AWS EMR Studio Workspace once you finish the demo.
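
The cluster steps above can be wrapped in a small script so that the cluster ID is captured for the terminate step. This is only a sketch, not part of the sample repository. The variable names `REGION`, `LOG_URI`, `SUBNET_ID` and `DRY_RUN` and the helper functions `run` and `create_cluster` are hypothetical, and the defaults are the placeholders from the command above. The `--query ClusterId --output text` options and the `aws emr wait cluster-running` call are additions to the original command. By default the script only prints the command for review.

```shell
#!/bin/sh
# Sketch: create the demo EMR cluster and remember its ID.
# REGION, LOG_URI, SUBNET_ID and DRY_RUN are hypothetical names; replace the placeholder defaults.
set -eu

REGION="${REGION:-region_name}"
LOG_URI="${LOG_URI:-s3://bucket-name/trino/logs/}"
SUBNET_ID="${SUBNET_ID:-ec2_subnet_name}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Same command as in the steps above, plus --query/--output so that
# only the cluster ID is written to stdout.
create_cluster() {
  run aws emr create-cluster \
    --release-label emr-7.0.0 \
    --applications Name=Trino Name=JupyterEnterpriseGateway Name=Spark \
    --configurations file://trino_configurations.json \
    --region "$REGION" \
    --name lakefs_glue_trino_demo_cluster \
    --log-uri "$LOG_URI" \
    --instance-type m5.xlarge \
    --instance-count 1 \
    --service-role EMR_DefaultRole \
    --ec2-attributes "InstanceProfile=EMR_EC2_DefaultRole,SubnetId=$SUBNET_ID" \
    --query ClusterId --output text
}

result="$(create_cluster)"
if [ "$DRY_RUN" = "1" ]; then
  # Nothing was created; show the command that would run.
  echo "$result"
else
  echo "Created cluster: $result"
  # Block until the cluster reports that it is running.
  aws emr wait cluster-running --region "$REGION" --cluster-id "$result"
  echo "When finished, run: aws emr terminate-clusters --region $REGION --cluster-ids $result"
fi
```

To create the cluster for real, run the script from the `aws-glue-trino` folder with `DRY_RUN=0` and the three variables set, for example `DRY_RUN=0 REGION=... LOG_URI=... SUBNET_ID=... sh create_cluster.sh` (the script file name is also hypothetical).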