PyBuilder plugin to simplify building projects for Amazon EMR. The following use cases are supported:
This project is based heavily on pybuilder_aws_plugin by Immobilienscout. It uses Boto3 for S3 uploads.
Add the following plugin dependency to your build.py (it will be installed directly from PyPI and requires the install_dependencies plugin):
use_plugin('python.install_dependencies')
use_plugin('pypi:pybuilder_emr_plugin')
After this you have the following additional tasks, which are explained below:
emr_package
emr_upload_to_s3
emr_release
The emr_package task assembles the zip file (a.k.a. the emr-zip) that the emr_upload_to_s3 task uploads to S3. The files are assembled in the directory $target/emr-release. The task consists of the following steps:
Install every dependency that build.py declares with project.depends_on() into a temporary directory via pip install -t. These will be included in the resulting emr-zip. Set the project property install_dependencies_index_url to use a custom index URL (e.g. an internal PyPI server).
Note: This excludes boto, boto3 and pyspark, as they are included in the AWS EMR dependencies by default.
All modules found in src/main/python/ are copied directly into the emr-zip.
All files found in src/main/resources/ are copied directly into the emr-zip.
The content of the scripts folder (src/main/scripts) in a PyBuilder project is normally intended to be placed in /usr/bin. This plugin assumes this directory contains scripts that are passed as the main.py argument to spark-submit. They are therefore copied to the release directory and uploaded to S3 by the emr_upload_to_s3 task. They are not part of the emr-zip.
All of these files, except the script files, are packed into a zip file.
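The packaging steps above can be sketched with the standard library alone. This is a minimal illustration, not the plugin's actual implementation: the function names `pip_install_command` and `build_emr_zip` are hypothetical, dependencies are assumed to be bare package names without version specifiers, and the pip command is only built, never executed.

```python
import shutil
import sys
import zipfile
from pathlib import Path

# Packages excluded from the emr-zip because EMR already provides them.
EXCLUDED = {"boto", "boto3", "pyspark"}


def pip_install_command(dependencies, target_dir, index_url=None):
    """Build (but do not run) a `pip install -t` command line."""
    # Assumption: dependencies are bare names such as "requests".
    wanted = [dep for dep in dependencies if dep.lower() not in EXCLUDED]
    command = [sys.executable, "-m", "pip", "install", "-t", str(target_dir)]
    if index_url:
        command += ["--index-url", index_url]
    return command + wanted


def build_emr_zip(project_dir, release_dir, zip_name, deps_dir=None):
    """Zip modules, resources and optional dependencies; copy scripts beside the zip."""
    project_dir, release_dir = Path(project_dir), Path(release_dir)
    release_dir.mkdir(parents=True, exist_ok=True)
    zip_path = release_dir / zip_name

    roots = [project_dir / "src/main/python", project_dir / "src/main/resources"]
    if deps_dir is not None:
        roots.append(Path(deps_dir))

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root in roots:
            if not root.is_dir():
                continue
            for path in sorted(root.rglob("*")):
                if path.is_file():
                    # Paths are stored relative to their root, so modules
                    # end up at the top level of the archive.
                    archive.write(path, path.relative_to(root).as_posix())

    # Scripts are copied next to the zip, not into it.
    scripts = []
    scripts_dir = project_dir / "src/main/scripts"
    if scripts_dir.is_dir():
        for script in sorted(scripts_dir.iterdir()):
            if script.is_file():
                scripts.append(Path(shutil.copy(script, release_dir / script.name)))
    return zip_path, scripts
```

Calling `build_emr_zip(".", "target/emr-release", "projectname.zip")` from a project root would leave the zip and the copied scripts side by side in the release directory, which matches the layout the upload task expects according to the description above.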
The emr_upload_to_s3 task uploads the generated zip and all script files to an S3 bucket. The bucket name is set in build.py:
project.set_property('emr.s3.bucket-name', 'my_emr_bucket')
The default ACL for uploaded zips is bucket-owner-full-control. If you need a different ACL, you can override it in build.py as follows:
project.set_property('emr.s3.file-access-control', '<acl>')
Possible acl values are:
private
public-read
public-read-write
authenticated-read
bucket-owner-read
bucket-owner-full-control
For server-side encryption, use the properties:
project.set_property('emr.s3.sse-kms-keyid', '<keyAlias>')
project.set_property('emr.s3.server-side-encryption', '<sse>')
Possible sse values are:
* aws:kms
* AES256
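As a rough illustration of how the ACL and encryption properties could translate into upload arguments, the sketch below validates the values listed above and builds a dictionary using the Boto3 S3 parameter names `ACL`, `ServerSideEncryption` and `SSEKMSKeyId`. The helper name `upload_extra_args` and its validation rules are assumptions made for this example, not the plugin's code.

```python
# ACL and SSE values as listed in this README.
VALID_ACLS = {
    "private",
    "public-read",
    "public-read-write",
    "authenticated-read",
    "bucket-owner-read",
    "bucket-owner-full-control",
}
VALID_SSE = {"aws:kms", "AES256"}


def upload_extra_args(acl="bucket-owner-full-control", sse=None, kms_key_id=None):
    """Return keyword arguments for an S3 upload (hypothetical helper)."""
    if acl not in VALID_ACLS:
        raise ValueError("unsupported ACL: {0!r}".format(acl))
    args = {"ACL": acl}
    if sse is not None:
        if sse not in VALID_SSE:
            raise ValueError("unsupported SSE value: {0!r}".format(sse))
        args["ServerSideEncryption"] = sse
    if kms_key_id is not None:
        # Assumption of this sketch: a KMS key only makes sense with aws:kms.
        if sse != "aws:kms":
            raise ValueError("a KMS key id requires sse='aws:kms'")
        args["SSEKMSKeyId"] = kms_key_id
    return args
```

A dictionary built this way could be passed as `ExtraArgs` to Boto3's `upload_file`, or unpacked as keyword arguments to `put_object`.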
Furthermore, the plugin assumes that you already have a shell with AWS access enabled (exported keys or .boto or ...).
The uploaded files are placed in a directory named after the version number, for example v123/projectname.zip and v123/main.py.
Use the property emr.s3.bucket-prefix to add a prefix to the uploaded files. For example:
project.set_property('emr.s3.bucket-prefix', 'my_emr/')
This will upload the zip-file to the following key:
my_emr/v123/projectname.zip
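The key layout described above can be expressed as a small function. This is a sketch assuming the key is simply the bucket prefix, then `v` plus the version, then the file name. The function name `versioned_key` is hypothetical.

```python
def versioned_key(filename, version, bucket_prefix=""):
    """Build the versioned S3 key for an uploaded file (illustrative only)."""
    # Assumption: the prefix already carries its trailing slash, as in 'my_emr/'.
    return "{0}v{1}/{2}".format(bucket_prefix, version, filename)


print(versioned_key("projectname.zip", "123", "my_emr/"))  # my_emr/v123/projectname.zip
print(versioned_key("main.py", "123"))                     # v123/main.py
```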
The emr_release task copies the emr-zip and script files from the versioned path to a version-independent path named latest. For example, my_emr/v123/my-project.zip is copied to my_emr/latest/my-project.zip.
This provides a simple release mechanism that follows the "latest greatest" principle. Users can rely on the files under latest to be the latest tested version.
Use the property emr.s3.release-prefix to modify your release prefix. For example:
project.set_property('emr.s3.release-prefix', 'LATEST/')
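To make the release copy concrete, the sketch below computes the source and destination keys for each released file. It assumes that the release prefix replaces the `v<version>/` segment, that the default release prefix is `latest/`, and that both prefixes carry their own trailing slash. The helper name `release_copy_plan` is hypothetical and does not reflect the plugin's internals.

```python
def release_copy_plan(filenames, version, bucket_prefix="", release_prefix="latest/"):
    """Return (source_key, destination_key) pairs for a release (illustrative only)."""
    plan = []
    for name in filenames:
        source = "{0}v{1}/{2}".format(bucket_prefix, version, name)
        # Assumption: the release prefix sits directly under the bucket prefix.
        destination = "{0}{1}{2}".format(bucket_prefix, release_prefix, name)
        plan.append((source, destination))
    return plan


for source, destination in release_copy_plan(["my-project.zip", "main.py"], "123", "my_emr/"):
    print(source, "->", destination)
```

With `release_prefix="LATEST/"`, the destination for the zip would become `my_emr/LATEST/my-project.zip` under the same assumptions.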
Copyright 2017, Oberbaum Concept UG
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.