- Tutorial Overview
- Prerequisites
- Create and Configure Azure Cosmos DB
- Update Azure Key Vault
- Update Synapse Workspace
- Execute Pipeline
- Query Data in Cosmos DB
This tutorial shows how to load Microsoft Graph Data Connect (GDC) data into an Azure Cosmos DB Gremlin API graph database so that you can gain insights from it. Along the way, you will learn the key steps and Azure technologies required to build your own GDC-based graph database.
You will learn how to:
- Take GDC data already loaded into Azure Synapse, model it as a graph, and load it into a Cosmos DB Gremlin API graph database
To complete this lab, you need the following:
- Microsoft Azure subscription
- If you do not have one, you can obtain one (for free) here: https://azure.microsoft.com/free
- The account used to perform the setup must be granted the Contributor role on the subscription so that it can create the infrastructure components described below
- The Azure subscription must be in the same tenant as the Office 365 tenant, as Graph Data Connect will only export data to an Azure subscription in the same tenant, not across tenants.
- Office 365 tenancy
- If you do not have one, you can obtain one (for free) by signing up to the Office 365 Developer Program.
- Multiple Office 365 users with emails sent & received
- Access to at least two accounts that meet the following requirements:
- One of the two accounts must be a global tenant administrator
- That same account must have the global administrator role granted
- Workplace Analytics licenses
- Access to the Microsoft Graph Data Connect toolset is available through Workplace Analytics, which is licensed on a per-user, per-month basis.
- To learn more, see Microsoft Graph data connect policies and licensing.
NOTE: The screenshots and examples used in this lab are from an Office 365 test tenant with fake email data from test users. You can use your own Office 365 tenant to perform the same steps. No data is written to Office 365.
This tutorial assumes that you already have Graph Data Connect data in Azure Synapse. For an example of how to load that data into Azure Synapse, refer to the Conversation Lineage Tutorial.
- Open a browser and navigate to your Azure Portal at https://portal.azure.com
- In the search bar, type Azure Cosmos DB and then click on Azure Cosmos DB in the Services list.
- Click on Create, then click on the Create button in the section labeled Gremlin (Graph).
A. Select your preferred Subscription, Resource Group and Location.
B. Type in the Account Name you'd like to use for your Cosmos DB instance. We will refer to this name later in the tutorial as cosmos-name.
C. Choose your preferred pricing option, then click on Review + Create and proceed to Create.
D. The Cosmos DB instance will take a little time to deploy.
- From the Overview page of the Azure Cosmos DB instance you just created, click on Data Explorer.
- Click on New Graph, then select the option to create a New Graph.
A. Enter a Database id and a Graph id and make a record of the values. We will reference them later in the tutorial as database-id and graph-id, respectively.
B. Choose Database throughput that will meet your needs.
C. For the Partition key, enter /pk.
D. Click OK.
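Because the graph is partitioned on /pk, every vertex written to it needs a pk property. The following is a minimal sketch of how such a Gremlin statement could be built in Python. The function name, the label, and the example values are hypothetical and are not taken from the tutorial's notebook, which may model the data differently.

```python
def add_vertex_query(label: str, vertex_id: str, partition_value: str) -> str:
    """Build a Gremlin addV statement that also sets the 'pk' partition key."""
    # Keep the sketch simple: reject values that would break the quoted literals.
    for value in (label, vertex_id, partition_value):
        if "'" in value:
            raise ValueError("values must not contain single quotes")
    return (
        f"g.addV('{label}')"
        f".property('id', '{vertex_id}')"
        f".property('pk', '{partition_value}')"
    )


# Hypothetical vertex, for illustration only.
print(add_vertex_query("user", "user@example.com", "user@example.com"))
```

Using the same value for id and pk is only one possible choice; a real model would pick a partition value that spreads the data evenly.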
In your Azure Key Vault, you will need to add the following secrets:
- gremlinEndpoint with value wss://<cosmos-name>.gremlin.cosmos.azure.com:443/, where cosmos-name is the name of the Cosmos DB instance you created earlier.
- gremlinUsername with value /dbs/<database-id>/colls/<graph-id>, where database-id and graph-id are the Database id and Graph id you entered in the steps above.
- gremlinPassword with your Cosmos DB Primary Key as the value.
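The three values above follow a fixed pattern, so they can be derived from the names you recorded earlier. The sketch below assembles them in Python; the function name and the example inputs are assumptions made for illustration, and the primary key shown is a placeholder rather than a real credential.

```python
def build_gremlin_secrets(cosmos_name: str, database_id: str,
                          graph_id: str, primary_key: str) -> dict:
    """Return the Key Vault secret names mapped to the values described above."""
    return {
        "gremlinEndpoint": f"wss://{cosmos_name}.gremlin.cosmos.azure.com:443/",
        "gremlinUsername": f"/dbs/{database_id}/colls/{graph_id}",
        "gremlinPassword": primary_key,
    }


# Hypothetical names; substitute the values you recorded.
secrets = build_gremlin_secrets("my-cosmos", "my-database", "my-graph", "<primary-key>")
for name, value in secrets.items():
    print(f"{name} = {value}")
```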
- Download the file: gremlinpython-3.5.1-py2.py3-none-any.whl
- In your Synapse Workspace, in the Manage hub, click on Workspace packages.
- Click on Upload near the top of the window to open the Upload packages dialog box.
A. In the dialog box, click on the folder icon.
B. Navigate to the gremlinpython-3.5.1-py2.py3-none-any.whl file you downloaded and click open.
C. Click on the Upload button at the bottom of the dialog box.
- In the Azure portal, navigate to the Overview page for your Synapse Workspace.
- Click Apache Spark pools in the left menu bar, then click on the Spark pool you've previously created.
- Click on Packages in the left menu bar.
- Click on Select from workspace packages.
A. Check the box next to gremlinpython-3.5.1-py2.py3-none-any.whl, then click Select.
- Click Save at the top of the window.
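Once the pool has picked up the package, a notebook cell attached to that pool can confirm that it is visible to Python. This is a minimal sketch using only the standard library; the helper name is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("gremlinpython")
if version is None:
    print("gremlinpython is not installed in this environment")
else:
    print(f"gremlinpython {version} is installed")
```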
- Download the file: MGDCToCosmosDB.ipynb
- Inside your Azure Synapse workspace, navigate to the Develop hub.
- Click on the + symbol, then select Import.
- Navigate to the downloaded notebook file, select it, then click Open.
- Attach your Spark pool to the notebook.
- Download the file: PL_MGDC_CosmosDB.zip
- Inside your Azure Synapse workspace, navigate to the Integrate hub.
- Click on the + symbol, then select Import from pipeline template.
- Navigate to and select PL_MGDC_CosmosDB.zip, then click Open.
- Open the PL_MGDC_CosmosDB pipeline and update the following pipeline parameters:
a. sql_database_name - Set this to the name of your dedicated SQL pool.
b. sql_server_name - Set this to the name of your Azure Synapse workspace.
c. keyvault_name - Set this to the name of your Key Vault.
- Click the notebook card in the pipeline and navigate to the Settings tab.
a. Attach the MGDCToCosmosDB notebook here.
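The keyvault_name parameter is what ties the pipeline to the secrets created earlier. How the MGDCToCosmosDB notebook reads them is not shown in this tutorial, so the sketch below is only one way it could be done: a helper (hypothetical name) that fetches the three secrets through a lookup function passed in by the caller.

```python
from typing import Callable, Dict

# Secret names defined in the Key Vault section of this tutorial.
SECRET_NAMES = ("gremlinEndpoint", "gremlinUsername", "gremlinPassword")


def load_gremlin_settings(keyvault_name: str,
                          get_secret: Callable[[str, str], str]) -> Dict[str, str]:
    """Fetch each Gremlin secret from the named vault via the supplied lookup."""
    return {name: get_secret(keyvault_name, name) for name in SECRET_NAMES}


# Stand-in lookup so the sketch runs anywhere; it returns placeholder text only.
def fake_get_secret(vault: str, name: str) -> str:
    return f"<{name} from {vault}>"


print(load_gremlin_settings("my-keyvault", fake_get_secret))
```

Inside a Synapse notebook, `mssparkutils.credentials.getSecret` accepts a vault name and a secret name, so it could be passed as `get_secret` in place of the stand-in.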
- In the Synapse workspace, navigate to the Integrate hub, then select the PL_MGDC_CosmosDB pipeline.
- Click on Trigger then New/Edit.
- In the Choose trigger... dropdown, select New.
- Fill out the fields in the trigger with your preferred values, then click OK.
- Publish the changes to the workspace.
- In the Synapse workspace, navigate to the Integrate hub, then click on PL_MGDC_CosmosDB pipeline.
- Click on Trigger, then Trigger Now.
With this step complete, our Cosmos DB instance should be populated with data.
- In the Azure portal, from the Overview page of the Azure Cosmos DB instance you created, click on Data Explorer.
- Expand your database and graph dropdowns, then click on the Graph option.
- Click the Load Graph button near the window center (or Execute Gremlin Query on the right) to do some basic exploration of the data.
NOTE: You may execute different queries to investigate more specific data. For example:
- g.E() to view the edge data.
- g.V('<office365-user>').as('b').bothE().as('e').select('b', 'e'), where office365-user is a user that's available in your MGDC data set.
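The same queries can also be submitted from Python with the gremlinpython package installed earlier. The sketch below is an illustration under assumptions: the helper names are hypothetical, and `gremlin_client` is assumed to be a `gremlin_python.driver.client.Client` already connected with the endpoint, username, and password stored in Key Vault.

```python
def edges_for_user_query(user_id: str) -> str:
    """Return the Gremlin query that pairs a user vertex with each of its edges."""
    # Keep the sketch simple: reject ids that would break the quoted literal.
    if "'" in user_id:
        raise ValueError("user_id must not contain single quotes")
    return f"g.V('{user_id}').as('b').bothE().as('e').select('b', 'e')"


def run_query(gremlin_client, query: str) -> list:
    """Submit a query and block until all results have arrived."""
    # Client.submit returns a ResultSet; all() yields a future for the full list.
    return gremlin_client.submit(query).all().result()


# Assumed connection setup, shown as comments because it needs a live account:
# from gremlin_python.driver import client, serializer
# gremlin_client = client.Client(
#     endpoint, "g", username=username, password=password,
#     message_serializer=serializer.GraphSONSerializersV2d0())
# edges = run_query(gremlin_client, "g.E()")

# Hypothetical user id, for illustration only.
print(edges_for_user_query("user@example.com"))
```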
It is highly recommended that you use a robust graph visualization tool such as Linkurious for navigating the data. You can find instructions for setting up Linkurious for Cosmos DB here. You will need the information you recorded earlier in the tutorial to configure Linkurious.
There are also other recommended graph visualization tools aside from Linkurious.