Orchestra is not an official Google Product
- Overview
- Setting up your Orchestra environment in GCP
- Service Accounts
- Configuring Orchestra
- Additional info
Composer is a Google Cloud managed version of Apache Airflow, an open source project for managing ETL workflows. We use it for this solution because you can deploy your code to production simply by moving files to Google Cloud Storage. It also provides monitoring and logging, and software installation, updates and bug fixes for Airflow are fully managed.
It is recommended that you install this solution through the Google Cloud Platform UI.
We recommend familiarising yourself with Composer here.
Orchestra is an open source project, built on top of Composer, that provides custom operators for Airflow designed to solve the needs of advertisers.
- Orchestra lets enterprise clients build their advertising data lake out of the box and customize it to their needs.
- Orchestra lets sophisticated clients automate workflows at scale for huge efficiency gains.
- Orchestra is a fully open sourced solution toolkit for building enterprise data solutions on Airflow.
Composer and BigQuery - two of the main Google Cloud Platform tools on which Orchestra is based - require a GCP Project with a valid billing account.
For more information, see this article on Google Cloud Billing.
In your GCP Project menu (or directly through this link), access the API Library so that you can enable the following APIs:
- Cloud Composer
- Cloud Dataproc
- Cloud Storage APIs
- BigQuery
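If you prefer to script this step rather than use the API Library UI, the sketch below enables the same APIs through the Service Usage API with the google-api-python-client library. It is not part of Orchestra: the project ID is a placeholder, Application Default Credentials are assumed, and the service identifiers should be double-checked for your project.

```python
# Sketch: enable the APIs required by Orchestra via the Service Usage API.
# Assumes google-api-python-client is installed and Application Default
# Credentials are available (e.g. via `gcloud auth application-default login`).
from googleapiclient.discovery import build

PROJECT_ID = "your-gcp-project-id"  # placeholder - use your own project ID

# Usual API identifiers for the services listed above; verify them for your project.
SERVICES = [
    "composer.googleapis.com",
    "dataproc.googleapis.com",
    "storage-component.googleapis.com",
    "bigquery.googleapis.com",
]

serviceusage = build("serviceusage", "v1")
for service_name in SERVICES:
    # Each call returns a long-running operation; poll it if you need to wait.
    serviceusage.services().enable(
        name=f"projects/{PROJECT_ID}/services/{service_name}", body={}
    ).execute()
```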
Follow these steps to create a Composer environment in Google Cloud Platform - please note that creating the environment can take 20 to 30 minutes.
Environment Variables, Tags and Configuration Properties (airflow.cfg) can all be left as standard and you can use the default values for number of nodes, machine types and disk size (you can use a smaller disk size if you want to save some costs).
Google Cloud uses service accounts to automate tasks between services. This includes other Google services such as DV360 and CM.
You can see full documentation for Service Accounts here:
https://cloud.google.com/iam/docs/service-accounts
By default, the IAM section of your Project will show a default service account for Composer ("Cloud Composer Service Agent") and a default service account for Compute Engine ("Compute Engine default service account"), with their respective email addresses.
These service accounts have access to all Cloud APIs enabled for your project, making them a good fit for Orchestra. In particular, we recommend using the Compute Engine default service account as the main "Orchestra" service account, because it is the account used by the individual Compute Engine virtual machines that will run your tasks.
If you wish to use another account, you will have to give it access to BigQuery and full permissions for the Storage APIs.
Your Service Account will need to be set up as a DV360 user so that it can access the required data from your DV360 account.
You need to have partner-level access to your DV360 account to be able to add a new user; follow the simple steps to create a new user in DV360, using this configuration:
- Give this user the email of the service account you wish to use.
- Select all the advertisers you want to be able to access
- Give **Read & Write** permissions
- Save!
You have now set up the Composer environment in GCP and granted the proper permissions to its default Service Account.
You're ready to configure Orchestra!
The Orchestra project will require several variables to run.
These can be set via the Admin section in the Airflow UI (accessible from the list of Composer Environments by clicking on the corresponding link under "Airflow Web server"); a short sketch of reading these variables from a DAG follows the table below.
| Area | Variable Name | Value | Needed For |
| --- | --- | --- | --- |
| Cloud Project | gce_zone | Your Google Compute Engine Zone (you can find it under "Location" in the list of Composer Environments) | All |
| Cloud Project | gcs_bucket | The Cloud Storage bucket for your Airflow DAGs (you can find a link to the bucket in the Environments page - see Image1) | All |
| Cloud Project | cloud_project_id | The Project ID you can find in your GCP console homepage. | All |
| BigQuery | erf_bq_dataset | The name of the BigQuery dataset you wish to use - see Image2 and documentation here. | ERFs |
| DV360 | partner_ids | The comma-separated list of partner IDs from DV360, used for Entity Read Files. | All |
| DV360 | private_entity_types | A comma-separated list of Private Entity Read Files you would like to import. | ERFs |
| DV360 | sequential_erf_dag_name | The name of your DAG as it will show up in the UI. Name it whatever makes sense for you (alphanumeric characters, dashes, dots and underscores exclusively). | ERFs |
| DV360 | dv360_sdf_advertisers | Dictionary of partners (keys) and advertisers (values) which will be used to download SDFs. Initially you can set the value to {"partner_id": ["advertiser_id1", "advertiser_id2"]} and use the dv360_get_sdf_advertisers_from_report_dag DAG to update it programmatically. | SDFs |
| DV360 | dv360_sdf_advertisers_report_id | DV360 report ID which will be used to get a list of all active partners and advertisers. Initially, you can set the value to 1 and use the dv360_create_sdf_advertisers_report_dag DAG to update it programmatically. | SDFs, Reports |
| DV360 | number_of_advertisers_per_sdf_api_call | Number of advertiser IDs which will be included in each call to the DV360 API to retrieve SDFs. Set the value to 1. | SDFs |
| DV360 | sdf_api_version | SDF version (column names, types, order) in which the entities will be returned. Set the value to 4.2 (no other versions are currently supported). | SDFs |
| BigQuery | sdf_bq_dataset | The name of the BigQuery dataset you wish to use to store SDFs. | SDFs |
| BigQuery | sdf_file_types | Comma-separated list of SDF types that will be returned (e.g. LINE_ITEM, AD_GROUP). Currently, this solution supports: LINE_ITEM, AD_GROUP, AD, INSERTION_ORDER and CAMPAIGN. | SDFs |
Image1:
Image2:
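As an illustration only (the Orchestra DAGs already handle this for you), the snippet below shows how a DAG could read some of the variables listed above with Airflow's Variable API, assuming they have been created under Admin > Variables in the Airflow UI:

```python
# Sketch: reading Orchestra's Airflow variables from inside a DAG file.
from airflow.models import Variable

# Simple string variables
cloud_project_id = Variable.get("cloud_project_id")
partner_ids = Variable.get("partner_ids").split(",")

# dv360_sdf_advertisers is stored as JSON, so deserialize it into a dict
sdf_advertisers = Variable.get("dv360_sdf_advertisers", deserialize_json=True)
for partner, advertisers in sdf_advertisers.items():
    print(partner, advertisers)
```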
As with any other Airflow deployment, you will need DAG files describing your Workflows to schedule and run your tasks; plus, you'll need hooks, operators and other libraries to help build those tasks.
You can find the core files for Orchestra in our GitHub repository: clone the repo (or directly download the files).
You can then design the DAGs you wish to run and add them to the dags folder.
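For reference, a minimal, hypothetical DAG skeleton (the real Orchestra DAGs use the hooks and operators from the repository, and the DAG ID below is just an example) looks roughly like this:

```python
# Sketch: the general shape of a DAG file placed in the dags folder.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "owner": "orchestra",
    "start_date": datetime(2019, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_orchestra_dag",  # alphanumerics, dashes, dots and underscores only
    default_args=default_args,
    schedule_interval="@daily",
) as dag:
    # Placeholder tasks; real workflows chain Orchestra's custom operators here.
    start = DummyOperator(task_id="start")
    end = DummyOperator(task_id="end")
    start >> end
```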
Upload all the DAGs and other required files to the DAGs Storage Folder that you can access from the Airflow UI.
This will automatically generate the DAGs and schedule them to run (you will be able to see them in the Airflow UI).
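If you prefer to script the upload instead of using the Cloud Console, a sketch with the google-cloud-storage client is shown below; the project, bucket and file names are placeholders (the bucket is the one referenced by the gcs_bucket variable).

```python
# Sketch: copying a DAG file into the Composer DAGs bucket programmatically.
from google.cloud import storage

client = storage.Client(project="your-gcp-project-id")
bucket = client.bucket("your-composer-dags-bucket")

# Composer picks up DAG files from the "dags/" prefix of its bucket.
blob = bucket.blob("dags/example_orchestra_dag.py")
blob.upload_from_filename("example_orchestra_dag.py")
```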
From now on, you can use (the Composer-managed instance of) Airflow as you normally would, including the various available functionalities for scheduling, troubleshooting, and so on.
Full details can be found here. Please note that files created by Composer are not automatically deleted, so you will need to remove them manually or they will keep incurring storage costs. The same applies to the BigQuery datasets.
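As an example of that clean-up, the sketch below removes a BigQuery dataset with the google-cloud-bigquery client once you no longer need it; the project and dataset names are placeholders.

```python
# Sketch: deleting a BigQuery dataset (e.g. an ERF or SDF dataset) you no longer need.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project-id")
client.delete_dataset(
    "your_erf_or_sdf_dataset",
    delete_contents=True,   # also drop the tables inside the dataset
    not_found_ok=True,      # don't fail if it was already removed
)
```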
Orchestra is a framework that allows powerful API access to your data. Liability for how you use that data is your own. It is important that all data you keep is secure and that you have legal permission to work with and transfer all data you use. Orchestra can operate across multiple partners; please be sure that this access is covered by legal agreements with your clients before implementing Orchestra. This project is covered by the Apache License.