Bug 1570368 - Update airflow from python2 to python3 (#1008)
* Upgrade to 1.10.10 and pip-compile using python3.8

* pip-compile with python3.7

* Update Dockerfile from 2.7-slim to 3.7-slim

* Remove deprecated option from config

* Update relative imports in backported operators

* Update urlparse import to its new location

* Replace iteritems with items

* Add instructions for pip-compile and cut out old material

* Update unittests for plugins

* Update README.md

Co-authored-by: Daniel Thorn <[email protected]>

* Update README.md

Co-authored-by: Daniel Thorn <[email protected]>

* Compile requirements using explicit source

* Add CI for checking requirements are generated correctly

* Leave off requirements.in from pip-compile call

* Recompile requirements.txt

* Pin kombu to 4.6.3

Co-authored-by: Daniel Thorn <[email protected]>
acmiyaguchi and relud authored Jul 22, 2020
1 parent b7ed03f commit 4fb86a1
Showing 13 changed files with 241 additions and 108 deletions.
16 changes: 14 additions & 2 deletions .circleci/config.yml
@@ -24,14 +24,26 @@ jobs:

test:
docker:
- image: python:2.7
- image: python:3.7
working_directory: ~/mozilla/telemetry-airflow
steps:
- checkout
- run: pip install tox
- run: python -m py_compile dags/*.py
- run: find . -name *.pyc -delete
- run: tox -e py27
- run: tox -e py37

verify-requirements:
docker:
- image: python:3.7
steps:
- checkout
- run:
name: Verify that requirements.txt contains the right dependencies for this python version
command: |
pip install pip-tools
pip-compile --quiet
git diff --exit-code requirements.txt
test-environment:
machine:
4 changes: 3 additions & 1 deletion Dockerfile
@@ -1,6 +1,8 @@
FROM python:2.7-slim
FROM python:3.7-slim
MAINTAINER Jannis Leidel <[email protected]>

# Due to AIRFLOW-6854, Python 3.7 is chosen as the base python version.

# add a non-privileged user for installing and running the application
RUN mkdir /app && \
chown 10001:10001 /app && \
2 changes: 1 addition & 1 deletion Dockerfile.dev
@@ -1,4 +1,4 @@
FROM python:2.7-slim
FROM python:3.7-slim
MAINTAINER Jannis Leidel <[email protected]>

# add a non-privileged user for installing and running the application
94 changes: 22 additions & 72 deletions README.md
@@ -1,6 +1,7 @@
# Telemetry-Airflow

[![CircleCi](https://circleci.com/gh/mozilla/telemetry-airflow.svg?style=shield&circle-token=62f4c1be98e5c9f36bd667edb7545fa736eed3ae)](https://circleci.com/gh/mozilla/telemetry-airflow)

# Telemetry-Airflow
Airflow is a platform to programmatically author, schedule and monitor workflows.

When workflows are defined as code, they become more maintainable, versionable,
@@ -13,23 +14,22 @@ surgeries on DAGs a snap. The rich user interface makes it easy to visualize
pipelines running in production, monitor progress, and troubleshoot issues when
needed.

### Prerequisites
## Prerequisites

This app is built and deployed with
[docker](https://docs.docker.com/) and
[docker-compose](https://docs.docker.com/compose/).

### Dependencies
### Updating Python dependencies

Most Airflow jobs are thin wrappers that spin up an EMR cluster for running
the job. Be aware that the configuration of the created EMR clusters depends
on finding scripts in an S3 location configured by the `SPARK_BUCKET` variable.
Those scripts are maintained in
[emr-bootstrap-spark](https://github.com/mozilla/emr-bootstrap-spark/)
and are deployed independently of this repository.
Changes in behavior of Airflow jobs not explained by changes in the source of the
Spark jobs or by changes in this repository
could be due to changes in the bootstrap scripts.
Add new Python dependencies into `requirements.in`. Run the following commands with the same Python
version specified by the Dockerfile.

```bash
# As of the time of writing, python3.7
pip install pip-tools
pip-compile
```

### Build Container

@@ -42,6 +42,7 @@ make build
### Export Credentials

For now, DAGs that use the Databricks operator won't parse until the following environment variables are set (see issue #501):

```
AWS_SECRET_ACCESS_KEY
AWS_ACCESS_KEY_ID
@@ -55,57 +56,15 @@ Airflow database migration is no longer a separate step for dev but is run by th
## Testing

A single task, e.g. `spark`, of an Airflow dag, e.g. `example`, can be run with an execution date, e.g. `2018-01-01`, in the `dev` environment with:

```bash
export DEV_USERNAME=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_ACCESS_KEY_ID=...
make run COMMAND="test example spark 20180101"
```

The `DEV_USERNAME` is a short string used to identify your EMR instances.
This should be set to something like your IRC or Slack handle.

The container will run the desired task to completion (or failure).
Note that if the container is stopped during the execution of a task,
the task will be aborted. In the example's case, the Spark job will be
terminated.

The logs of the task can be inspected in real-time with:
```bash
docker logs -f telemetryairflow_scheduler_1
```

You can see task logs and see cluster status on
[the EMR console](https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2)

By default, the results will end up in the `telemetry-test-bucket` in S3.
If your desired task depends on other views, it will expect to be able to find those results
in `telemetry-test-bucket` too. It's your responsibility to run the tasks in correct
order of their dependencies.

**CAVEAT**: Running `make run` multiple times can spin up multiple versions
of the `web` container. It can also fail if you've never run `make up` to
initialize the database. An alternative form of the above is to
launch the containers and shell into the `web` container to run the `airflow
test` command.

In one terminal launch the docker containers:
```bash
make up
```

Note: initializing the web container will run the airflow initdb/upgradedb

In another terminal shell into the `web` container, making sure to also supply
the environment variables, then run the `airflow test` command:
```bash
export DEV_USERNAME=...
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
docker exec -ti -e DEV_USERNAME -e AWS_SECRET_ACCESS_KEY -e AWS_ACCESS_KEY_ID telemetry-airflow_web_1 /bin/bash
airflow test example spark 20180101
```

### Adding dummy credentials

Tasks often require credentials to access external services. For example, one may choose to store
@@ -125,6 +84,7 @@ click the docker icon in the menu bar, click on preferences and change the
available memory to 4GB.

To deploy the Airflow container on the docker engine, with its required dependencies, run:

```bash
make up
```
@@ -136,7 +96,6 @@ All DAGs are paused by default for local instances and our staging instance of A
In order to submit a DAG via the UI, you'll need to toggle the DAG from "Off" to "On".
You'll likely want to toggle the DAG back to "Off" as soon as your desired task starts running.


#### Workaround for permission issues

Users on Linux distributions will encounter permission issues with `docker-compose`.
@@ -168,24 +127,12 @@ Finally, run the testing command using docker-compose directly:
docker-compose exec web airflow test example spark 20180101
```

### Testing Dev Changes

*Note: This only works for `telemetry-batch-view` jobs*

Dev changes can be run by simply changing the `DEPLOY_TAG` environment variable
to whichever upstream branch you've pushed your local changes to.

Afterwards, you'll need to run `make clean`, `make build`, and `nohup make up &`

From there, you can either set the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in the
Dockerfile and run `make up` to get a local UI and run from there, or you can follow the
testing instructions above and use `make run`.

### Testing GKE Jobs (including BigQuery-etl changes)

For now, follow the steps outlined here to create a service account: https://bugzilla.mozilla.org/show_bug.cgi?id=1553559#c1.

Enable that service account in Airflow with the following:

```
make build && make up
./bin/add_gcp_creds $GOOGLE_APPLICATION_CREDENTIALS
@@ -208,7 +155,8 @@ Set the project in the DAG entry to be configured based on development environme
see the `ltv.py` job for an example of that.

From there, run the following:
```

```bash
make build && make up
./bin/add_gcp_creds $GOOGLE_APPLICATION_CREDENTIALS google_cloud_airflow_dataproc
```
@@ -248,7 +196,7 @@ variables:
- `AIRFLOW_SMTP_HOST` -- The SMTP server to use to send emails e.g.
`email-smtp.us-west-2.amazonaws.com`
- `AIRFLOW_SMTP_USER` -- The SMTP user name
- `AIRFLOW_SMTP_PASSWORD` -- The SMTP password
- `AIRFLOW_SMTP_PASSWORD` -- The SMTP password
- `AIRFLOW_SMTP_FROM` -- The email address to send emails from e.g.
`[email protected]`
- `URL` -- The base URL of the website e.g.
@@ -270,7 +218,9 @@ Also, please set
Both values should be set by using the cryptography module's fernet tool that
we've wrapped in a docker-compose call:

make secret
```bash
make secret
```

Run this for each key config variable, and **don't use the same for both!**

3 changes: 0 additions & 3 deletions airflow.cfg
@@ -3,9 +3,6 @@
default_timezone = utc
log_filename_template = {{ ti.dag_id }}/{{ ti.task_id }}/{{ execution_date.strftime("%%Y-%%m-%%dT%%H:%%M:%%S") }}/{{ try_number }}.log

# The home folder for airflow, default is ~/airflow
airflow_home = $AIRFLOW_HOME

# The folder where your airflow pipelines live, most likely a
# subfolder in a code repository
dags_folder = $AIRFLOW_HOME/dags
2 changes: 1 addition & 1 deletion dags/operators/backport/gcp_container_operator_1_10_7.py
@@ -27,7 +27,7 @@
from airflow.contrib.hooks.gcp_container_hook import GKEClusterHook

# Modified to import KubernetesPodOperator which imports 1.10.2 kube_client
from kubernetes_pod_operator_1_10_7 import KubernetesPodOperator
from .kubernetes_pod_operator_1_10_7 import KubernetesPodOperator

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
2 changes: 1 addition & 1 deletion dags/operators/backport/kubernetes_pod_operator_1_10_7.py
@@ -24,7 +24,7 @@
from airflow.contrib.kubernetes import pod_generator, pod_launcher

# import our own kube_client from 1.10.2. We also add pod name label to the pod.
from kube_client_1_10_2 import get_kube_client
from .kube_client_1_10_2 import get_kube_client

from airflow.contrib.kubernetes.pod import Resources
from airflow.utils.helpers import validate_key
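
Python 3 dropped the implicit relative imports that Python 2 allowed, which is why the backported operators switch to the explicit `from .module import name` form. A small sketch of the pattern, assuming the hypothetical layout shown in the comments (it mirrors `dags/operators/backport/` and presumes an `__init__.py` is present):

```python
# Sketch of the import change for a module living inside a package, e.g.:
#
#   dags/operators/backport/
#       __init__.py
#       kube_client_1_10_2.py              # defines get_kube_client()
#       kubernetes_pod_operator_1_10_7.py  # this file
#
# Python 2 resolved the bare name as an implicit relative import; Python 3
# treats it as an absolute import and raises ModuleNotFoundError:
#
#   from kube_client_1_10_2 import get_kube_client
#
# The explicit relative form works on Python 3 (and Python 2.6+):
from .kube_client_1_10_2 import get_kube_client
```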
2 changes: 1 addition & 1 deletion dags/operators/emr_spark_operator.py
@@ -5,7 +5,7 @@
import boto3
from io import BytesIO
from gzip import GzipFile
from urlparse import urlparse
from urllib.parse import urlparse
import requests
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
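
The `urlparse` function itself is unchanged; only its home moved into `urllib.parse` in Python 3. A standalone sketch (the S3 URL is made up for illustration):

```python
# Python 2: from urlparse import urlparse
# Python 3: the module was folded into urllib.parse
from urllib.parse import urlparse

parts = urlparse("s3://telemetry-test-bucket/some/prefix/job.jar")  # illustrative URL
print(parts.scheme)  # s3
print(parts.netloc)  # telemetry-test-bucket
print(parts.path)    # /some/prefix/job.jar
```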
2 changes: 1 addition & 1 deletion dags/operators/gcp_container_operator.py
@@ -10,7 +10,7 @@

# We import upstream GKEPodOperator/KubernetesPodOperator from 1.10.7, modified to point to kube_client
# from 1.10.2, because of some Xcom push breaking changes when using GKEPodOperator.
from backport.gcp_container_operator_1_10_7 import GKEPodOperator as UpstreamGKEPodOperator
from .backport.gcp_container_operator_1_10_7 import GKEPodOperator as UpstreamGKEPodOperator

KUBE_CONFIG_ENV_VAR = "KUBECONFIG"
GCLOUD_APP_CRED = "CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE"
2 changes: 1 addition & 1 deletion dags/utils/mozetl.py
@@ -23,7 +23,7 @@ def mozetl_envvar(command, options, dev_options={}, other={}):

prefixed_options = {
"MOZETL_{}_{}".format(command.upper(), key.upper().replace("-", "_")): value
for key, value in options.iteritems()
for key, value in options.items()
}
prefixed_options["MOZETL_COMMAND"] = command
prefixed_options.update(other)
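
`dict.iteritems()` was removed in Python 3, while `.items()` exists on both versions (a list on Python 2, a view on Python 3), so the comprehension keeps working. A standalone sketch of the same pattern with made-up option values:

```python
# dict.iteritems() no longer exists on Python 3; dict.items() works on 2 and 3.
options = {"submission-date": "20200722", "bucket": "telemetry-test-bucket"}  # illustrative values

prefixed_options = {
    "MOZETL_EXAMPLE_{}".format(key.upper().replace("-", "_")): value
    for key, value in options.items()
}
print(prefixed_options)
# {'MOZETL_EXAMPLE_SUBMISSION_DATE': '20200722',
#  'MOZETL_EXAMPLE_BUCKET': 'telemetry-test-bucket'}
```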
24 changes: 24 additions & 0 deletions requirements.in
@@ -0,0 +1,24 @@
boto3
kombu==4.6.3 # CeleryExecutor issues with 1.10.2 supposedly fixed in 1.10.5 airflow, but still observed issues on 1.10.7
# removed hdfs
apache-airflow[celery,postgres,hive,jdbc,async,password,crypto,github_enterprise,datadog,statsd,s3,mysql,google_auth,gcp_api,kubernetes]==1.10.10
mozlogging
retrying
newrelic
redis
hiredis
requests
jsonschema
Flask-OAuthlib
pytz
werkzeug==0.16.0
# The next requirements are for kubernetes-client/python
urllib3>=1.24.2 # MIT
ipaddress>=1.0.17;python_version=="2.7" # PSF
websocket-client>=0.32.0,!=0.40.0,!=0.41.*,!=0.42.* # LGPLv2+
# Pin to older version, newer version has issues
JPype1==0.7.1
shelljob==0.5.6
# Fix no inspection available issue
# https://github.com/apache/airflow/issues/8211
SQLAlchemy==1.3.15
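
Note that the `ipaddress` line carries a `python_version=="2.7"` environment marker, so it drops out of the compiled requirements.txt under Python 3.7. A quick standalone check of that marker using the `packaging` library (assumed to be installed, e.g. alongside pip-tools):

```python
# Evaluate a PEP 508 environment marker against the running interpreter.
from packaging.markers import Marker

marker = Marker('python_version == "2.7"')
# False under Python 3.7, so pip-compile omits the ipaddress backport there.
print(marker.evaluate())
```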
