Get familiar with Spark shell
approx. 20-30 minutes
Java / Scala users: follow the Scala section.
Do the following to update the labs to the latest version:
$ cd ~/spark-labs
$ git pull # this will update the labs to latest
Scala shell:
$ cd ~/spark-labs
$ ~/spark/bin/spark-shell ## spark shell is in bin/ dir
Console output will look like this:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_72)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
...
scala>
Spark shell UI is available on port 4040.
In browser go to : http://your_machine_address:4040 (use the 'public' IP of the machine)
==> Explore the Stages, Storage, Environment and Executors tabs
==> Take note of the 'Event Timeline'; we will use this for monitoring our jobs later
==> Check the Spark master UI on port 8080. Do you see the Spark shell application connected? Why (not)?
Within the Spark shell, the variable sc is the SparkContext.
Type sc at the Scala prompt and see what happens.
Your output might look like this
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@5019fb90
To see all the methods available on the sc variable, type sc. and press TAB twice.
This will show all the available methods on the sc variable.
(This only works in the Scala shell for now.)
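For example, the completion output will look something like this (a truncated, representative sample; the exact listing depends on your Spark version):
scala> sc.
accumulable            accumulator            addFile            addJar
appName                applicationId          broadcast          cancelAllJobs
...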
Try the following:
==> Print the application name
sc.appName
==> Find the 'Spark master' for the shell
sc.master
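In a plain local shell the output will look something like this (the res numbers and master value may differ on your setup):
scala> sc.appName
res1: String = Spark shell

scala> sc.master
res2: String = local[*]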
We have data files under spark-labs/data.
Use the test file : data/twinkle/sample.txt
The file has a favorite nursery rhyme:
twinkle twinkle little star
how I wonder what you are
up above the world so high
like a diamond in the sky
twinkle twinkle little star
Let's load the file:
val f = sc.textFile("data/twinkle/sample.txt")
==> What is the 'type' of f ?
hint : type f on the console
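Typing f should produce output along these lines (the RDD id and console line number will vary):
scala> f
res3: org.apache.spark.rdd.RDD[String] = data/twinkle/sample.txt MapPartitionsRDD[1] at textFile at <console>:27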
==> Inspect the Spark shell UI on port 4040. Do you see any processing done? Why (not)?
==> Print the first line / record from RDD
hint : f.first()
==> Again, inspect the Spark shell UI on port 4040. Do you see any processing done? Why (not)?
==> Print first 3 lines of RDD
hint : f.take(???)
(provide the correct argument to take function)
==> Again, inspect the Spark shell UI on port 4040. Do you see any processing done? Why (not)?
==> Print all the content from the file
hint : f.collect()
==> How many lines are in the file?
hint : f.count()
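Here is a quick recap of the calls above with the results you should see for the sample file (a sketch; REPL output shortened):
val f = sc.textFile("data/twinkle/sample.txt") // lazy: no job runs until an action is called
f.first()   // action: runs a job, returns "twinkle twinkle little star"
f.take(3)   // action: returns the first three lines
f.collect() // action: returns all five lines as an Array[String]
f.count()   // action: returns 5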
==> Inspect the 'Jobs' section in the shell UI (in browser)
Also inspect the 'Event Timeline'
==> Inspect the 'Executors' section in the shell UI (in browser)
==> Quit the spark-shell session : Control + D
==> Quit the spark-shell session, if you haven't done so yet : Control + D
If the Spark server is not running, start it as follows:
$ ~/spark/sbin/start-all.sh
Use the jps command to inspect the Java processes. Your output might look like this:
731 Master
902 Jps
831 Worker
Spark master UI is available on port 8080. In browser go to : http://your_machine_address:8080 (use the 'public' IP of the machine)
Start the Spark shell again:
$ ~/spark/bin/spark-shell
Once the shell starts, check the server UI on port 8080.
==> Do you see the shell connected as an application? Why (not)?
Make a note of the Spark server URI (e.g. spark://host_name:7077).
==> Restart spark shell as follows
$ ~/spark/bin/spark-shell --master spark-server-uri
(update spark-server-uri to match your Spark server)
For example:
$ ~/spark/bin/spark-shell --master spark://localhost:7077
On an Amazon server you may have to use the internal IP for the Spark server, such as
$ ~/spark/bin/spark-shell --master spark://your_host_name:7077
On the ES VM you may have to use localhost.localdomain. In all cases, follow what the Spark master UI tells you.
==> Once the shell starts, check both UIs:
- Spark master UI at port 8080
- Spark shell UI at port 4040
Now our shell is connected to a server.
==> Load the file and test it as in Step (4)
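For instance, repeating the quick test from Step (4) in the cluster-connected shell (same file path as before):
val f = sc.textFile("data/twinkle/sample.txt")
f.count()   // this job should now show up under your application in the master UI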
Spark shell by default prints logs at the warning (WARN) level. If you want to change the logging level, do this at the Spark shell prompt:
sc.setLogLevel("INFO")
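To go back to the default, set the level to WARN again (other valid levels include ERROR and DEBUG):
sc.setLogLevel("WARN")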
If you don't want to see any logs, you can start the Spark shell as follows. All the logs will be sent to a 'logs' file.
$ ~/spark/bin/spark-shell 2> logs
- Using one terminal, start a shell and connect to the master using Step 5.3
- Using a second terminal (open one if you need to), start another shell connecting to the same master
- Check the master UI (port 8080). You should see both shells listed as applications; can you explain the behavior?
Bonus: you can run operating system commands from within the Scala shell:
import sys.process._
val result = "ls -al".!   // runs the command; output goes to the console, result holds the exit code
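To capture the command's output as a string instead of just the exit code, use !! (same import in scope):
val listing = "ls -al".!!   // returns the command's output as a String
println(listing)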