support spark 2.2.0 #73

Open · wants to merge 6 commits into master
Dockerfile (16 changes: 7 additions & 9 deletions)
@@ -1,24 +1,22 @@
-FROM sequenceiq/hadoop-docker:2.6.0
-MAINTAINER SequenceIQ
+FROM yohei1126/hadoop-docker:2.7.1
+MAINTAINER yohei1126

-#support for Hadoop 2.6.0
-RUN curl -s http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz | tar -xz -C /usr/local/
-RUN cd /usr/local && ln -s spark-1.6.1-bin-hadoop2.6 spark
+#support for Hadoop 2.7.1
+RUN curl -s http://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz | tar -xz -C /usr/local/
+RUN cd /usr/local && ln -s spark-2.2.0-bin-hadoop2.7 spark
ENV SPARK_HOME /usr/local/spark
RUN mkdir $SPARK_HOME/yarn-remote-client
ADD yarn-remote-client $SPARK_HOME/yarn-remote-client

-RUN $BOOTSTRAP && $HADOOP_PREFIX/bin/hadoop dfsadmin -safemode leave && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME-1.6.1-bin-hadoop2.6/lib /spark
+RUN $BOOTSTRAP && $HADOOP_PREFIX/bin/hadoop dfsadmin -safemode leave && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME-2.2.0-bin-hadoop2.7/jars /spark

ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop
ENV PATH $PATH:$SPARK_HOME/bin:$HADOOP_PREFIX/bin

# update boot script
COPY bootstrap.sh /etc/bootstrap.sh
RUN chown root.root /etc/bootstrap.sh
RUN chmod 700 /etc/bootstrap.sh

#install R
RUN rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
RUN yum -y install R

ENTRYPOINT ["/etc/bootstrap.sh"]
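
A side note on the `hdfs dfs -put ... /spark` step above: Spark 2.x ships its classes as a `jars/` directory rather than the single assembly jar that 1.6 used, so the old `spark.yarn.jar` setting no longer applies. If the intent is for YARN executors to pick the jars up from HDFS instead of re-uploading them on every submit, a minimal sketch of the Spark 2.x equivalent, assuming the `hdfs:///spark` target used above, would be:

```
# spark-defaults.conf (sketch): point YARN at the jars the Dockerfile uploaded to HDFS
spark.yarn.jars hdfs:///spark/*.jar
```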
README.md (33 changes: 12 additions & 21 deletions)
@@ -8,14 +8,14 @@ Apache Spark on Docker
This repository contains a Docker file to build a Docker image with Apache Spark. This Docker image depends on our previous [Hadoop Docker](https://github.com/sequenceiq/hadoop-docker) image, available at the SequenceIQ [GitHub](https://github.com/sequenceiq) page.
The base Hadoop Docker image is also available as an official [Docker image](https://registry.hub.docker.com/u/sequenceiq/hadoop-docker/).

-##Pull the image from Docker Repository
+## Pull the image from Docker Repository
```
-docker pull sequenceiq/spark:1.6.0
+docker pull sequenceiq/spark:2.2.0
```

## Building the image
```
-docker build --rm -t sequenceiq/spark:1.6.0 .
+docker build --rm -t sequenceiq/spark:2.2.0 .
```

## Running the image
@@ -24,16 +24,16 @@ docker build --rm -t sequenceiq/spark:1.6.0 .
* in your /etc/hosts file add $(boot2docker ip) as host 'sandbox' to make it easier to access your sandbox UI
* open yarn UI ports when running container
```
-docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:1.6.0 bash
+docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:2.2.0 bash
```
or
```
-docker run -d -h sandbox sequenceiq/spark:1.6.0 -d
+docker run -d -h sandbox sequenceiq/spark:2.2.0 -d
```

## Versions
```
-Hadoop 2.6.0 and Apache Spark v1.6.0 on Centos
+Hadoop 2.7.1 and Apache Spark v2.2.0 on CentOS
```

## Testing
@@ -47,7 +47,7 @@ In yarn-client mode, the driver runs in the client process, and the application
```
# run the spark shell
spark-shell \
---master yarn-client \
+--master yarn \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1
@@ -66,26 +66,17 @@ Estimating Pi (yarn-cluster mode):
# note you must specify --files argument in cluster mode to enable metrics
spark-submit \
--class org.apache.spark.examples.SparkPi \
--files $SPARK_HOME/conf/metrics.properties \
---master yarn-cluster \
+--master yarn \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
-$SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar
+$SPARK_HOME/examples/jars/spark-examples*.jar
```
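
One caveat worth flagging on this change: in Spark 2.x the `yarn-cluster` master string is deprecated, but plain `--master yarn` defaults to client deploy mode, so as written the example no longer runs the driver inside YARN. A sketch of a submit that keeps the cluster-mode behaviour the heading describes:

```
# run the driver inside YARN (cluster mode) instead of the client process
spark-submit \
--class org.apache.spark.examples.SparkPi \
--files $SPARK_HOME/conf/metrics.properties \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
$SPARK_HOME/examples/jars/spark-examples*.jar
```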

-Estimating Pi (yarn-client mode):
-
-```
-# execute the the following command which should print the "Pi is roughly 3.1418" to the screen
-spark-submit \
---class org.apache.spark.examples.SparkPi \
---master yarn-client \
---driver-memory 1g \
---executor-memory 1g \
---executor-cores 1 \
-$SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar
-```
+### View Result in Hadoop
+
+Visit `http://localhost:8088`
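
In cluster mode the "Pi is roughly ..." line is printed by the driver, which runs inside YARN rather than in your shell, so besides the web UI you can grep the aggregated logs for it. A sketch, assuming log aggregation is enabled and using a placeholder application id taken from the spark-submit output:

```
# pull the driver output for a finished application (the id below is a placeholder)
yarn logs -applicationId application_1504000000000_0001 | grep "Pi is roughly"
```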

### Submitting from outside the container
To use Spark from outside the container, set the YARN_CONF_DIR environment variable to a directory with a configuration appropriate for the Docker setup. The repository contains such a configuration in the yarn-remote-client directory.
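
For example, a minimal sketch from a host checkout of this repository (the paths are illustrative, and the container is assumed reachable as `sandbox` per the /etc/hosts note above):

```
# use the client-side YARN config shipped in this repo when submitting from the host
export YARN_CONF_DIR=$(pwd)/yarn-remote-client
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
$SPARK_HOME/examples/jars/spark-examples*.jar
```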
bootstrap.sh (11 changes: 4 additions & 7 deletions)
@@ -12,21 +12,18 @@ cd $HADOOP_PREFIX/share/hadoop/common ; for cp in ${ACP//,/ }; do echo == $cp;
# altering the core-site configuration
sed s/HOSTNAME/$HOSTNAME/ /usr/local/hadoop/etc/hadoop/core-site.xml.template > /usr/local/hadoop/etc/hadoop/core-site.xml

-# setting spark defaults
-echo spark.yarn.jar hdfs:///spark/spark-assembly-1.6.0-hadoop2.6.0.jar > $SPARK_HOME/conf/spark-defaults.conf
cp $SPARK_HOME/conf/metrics.properties.template $SPARK_HOME/conf/metrics.properties

service sshd start
$HADOOP_PREFIX/sbin/start-dfs.sh
$HADOOP_PREFIX/sbin/start-yarn.sh

$HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver


CMD=${1:-"exit 0"}
if [[ "$CMD" == "-d" ]];
then
  service sshd stop
  /usr/sbin/sshd -D -d
else
  /bin/bash -c "$*"
fi
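
For reference, the `-d` branch above is what the trailing `-d` in the README's `docker run -d -h sandbox sequenceiq/spark:2.2.0 -d` example reaches: the first `-d` detaches the container, while the second is passed through the ENTRYPOINT to this script as `$1` and keeps the container alive by running sshd in the foreground. Any other argument falls through to the `else` branch, e.g. (a sketch):

```
# run a one-off command through the bootstrap script instead of daemonizing
docker run -it -h sandbox sequenceiq/spark:2.2.0 spark-submit --version
```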