This project implements an Airflow DAG that extracts data from Google Analytics 4 and loads it into BigQuery.
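At a high level, the DAG pulls daily metrics from the GA4 Data API and writes them to a BigQuery table. The helper below is a minimal sketch of the transform step, assuming a `runReport`-style response shape; the dimension/metric names mirror the verification query later in this README but are illustrative, not the project's actual DAG code:

```python
def ga4_rows_to_bq_records(report: dict) -> list[dict]:
    """Flatten a GA4 runReport-style response into BigQuery row dicts.

    Assumes one dimension ("date", formatted YYYYMMDD) and one metric
    ("activeUsers"); these names match the verification query in this
    README but are not taken from the project's DAG code.
    """
    records = []
    for row in report.get("rows", []):
        records.append({
            "date": int(row["dimensionValues"][0]["value"]),  # e.g. 20241216
            "activeUsers": int(row["metricValues"][0]["value"]),
        })
    return records


if __name__ == "__main__":
    # Hypothetical response fragment for demonstration
    sample = {
        "rows": [
            {"dimensionValues": [{"value": "20241216"}],
             "metricValues": [{"value": "1"}]},
            {"dimensionValues": [{"value": "20241218"}],
             "metricValues": [{"value": "2"}]},
        ]
    }
    print(ga4_rows_to_bq_records(sample))
```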
## Prerequisites

- Python and pip installed
- A Google Cloud Platform account
- Access to a Google Analytics 4 property
- Git
## Local Airflow Setup

```shell
# Install Apache Airflow
pip install apache-airflow

# Set the Airflow home directory
export AIRFLOW_HOME=$(pwd)/airflow

# Start Airflow in standalone mode
airflow standalone

# Copy DAGs to the Airflow directory
cp -r dags/* airflow/dags/

# Reserialize DAGs
airflow dags reserialize

# Stop Airflow when needed
pkill -f airflow
```
## GCP Setup

- Create a new GCP project
- Create a service account
- Execute `gcp_commands.sh` to:
  - Set up service account permissions
  - Configure the Cloud Composer environment
  - Create the Cloud Composer (Airflow) environment
- Store the service account key JSON in the `secrets` folder
## OAuth Setup

- Visit the Google Cloud Console Credentials page
- Create an OAuth 2.0 Client ID with the following scopes:
  - `https://www.googleapis.com/auth/analytics.readonly`
  - `https://www.googleapis.com/auth/cloud-platform`
  - `https://www.googleapis.com/auth/analytics`
- Store the OAuth 2.0 Client ID JSON in the `secrets` folder
## Google Analytics Setup

- Link your Analytics account to the GCP project
- Note down the property ID
- Add the service account email to the Analytics property (Property Access Management settings)
- Run the test script to verify the connection
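The connection test boils down to a `runReport` call against the GA4 Data API. The snippet below sketches the request body such a test would send to `https://analyticsdata.googleapis.com/v1beta/properties/{PROPERTY_ID}:runReport`; the date range and metric match the verification output later in this README, but the function itself is an illustration, not the project's test script:

```python
import json


def build_run_report_body(start_date: str, end_date: str) -> dict:
    """Build a GA4 Data API runReport request body asking for daily
    active users (one "date" dimension, one "activeUsers" metric).
    """
    return {
        "dateRanges": [{"startDate": start_date, "endDate": end_date}],
        "dimensions": [{"name": "date"}],
        "metrics": [{"name": "activeUsers"}],
    }


if __name__ == "__main__":
    print(json.dumps(build_run_report_body("2024-12-16", "2024-12-31"), indent=2))
```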
## Airflow Connection

- Configure the Google Cloud connection:
  - Navigate to Admin > Connections > google_cloud_default
  - Add the service account key JSON
  - Add the Analytics scopes in the Extra field:
    `https://www.googleapis.com/auth/analytics.readonly,https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/analytics`
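For reference, the Extra field of the connection might look like the JSON below. The exact key names depend on your `apache-airflow-providers-google` version (older releases expect an `extra__google_cloud_platform__` prefix on each key), and the key path and project ID here are assumptions based on this README, so treat this as an illustrative sketch:

```json
{
  "key_path": "/path/to/secrets/service-account-key.json",
  "scope": "https://www.googleapis.com/auth/analytics.readonly,https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/analytics",
  "project": "dag-task"
}
```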
## Deployment

A GitHub Actions workflow automatically pushes DAGs to the Cloud Composer environment's storage bucket.

(Screenshots: Cloud Composer environment, Airflow interface, GCP storage buckets.)
## Verification

Run `test_google_analytics.py` to verify data consistency. Output:

```
         date  activeUsers
0  2024-12-16            1
1  2024-12-18            2
2  2024-12-19            4
3  2024-12-21            1
4  2024-12-22            2
5  2024-12-23            1
6  2024-12-24            2
7  2024-12-26            6
8  2024-12-30            1
```
Cross-verify with BigQuery using:

```sql
SELECT
  PARSE_DATE('%Y%m%d', CAST(date AS STRING)) AS date,
  SUM(activeUsers) AS daily_active_users
FROM
  `dag-task.custom_analytics_data.ga4_data`
WHERE
  PARSE_DATE('%Y%m%d', CAST(date AS STRING))
    BETWEEN PARSE_DATE('%Y%m%d', '20241216')
    AND PARSE_DATE('%Y%m%d', '20241231')
GROUP BY
  date
ORDER BY
  date;
```
Output:

```
date,daily_active_users
2024-12-16,1
2024-12-18,2
2024-12-19,4
2024-12-21,1
2024-12-22,2
2024-12-23,1
2024-12-24,2
2024-12-26,6
2024-12-30,1
```
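The two outputs above can also be compared mechanically. A minimal sketch, using the values pasted from the outputs above:

```python
import csv
import io

# BigQuery export (from the query output above)
bq_csv = """date,daily_active_users
2024-12-16,1
2024-12-18,2
2024-12-19,4
2024-12-21,1
2024-12-22,2
2024-12-23,1
2024-12-24,2
2024-12-26,6
2024-12-30,1
"""

# Values reported by test_google_analytics.py (from the output above)
api_counts = {
    "2024-12-16": 1, "2024-12-18": 2, "2024-12-19": 4,
    "2024-12-21": 1, "2024-12-22": 2, "2024-12-23": 1,
    "2024-12-24": 2, "2024-12-26": 6, "2024-12-30": 1,
}

# Parse the CSV into the same {date: count} shape
bq_counts = {
    row["date"]: int(row["daily_active_users"])
    for row in csv.DictReader(io.StringIO(bq_csv))
}

# The pipeline is consistent if both sources agree date-by-date
assert bq_counts == api_counts
print("BigQuery and GA4 API outputs match")
```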
## Known Limitations

The Google Analytics demo account cannot be used with the Analytics Data API due to permission limitations.