Using SQL statements with Spark SQL
Approximate time: 20-30 minutes
==> Change the working directory to spark-labs. This way, we can access data using relative paths (makes life simpler).
$ cd ~/spark-labs
$ ~/spark/bin/spark-shell
==> Create a DataFrame
val clickstreamDF = sqlContext.read.json("data/click-stream/clickstream.json")
==> Register the DataFrame as a table
clickstreamDF.registerTempTable("clickstream")
==> Select all logs
val logs = sqlContext.sql("select * from clickstream")
// 'logs' is a dataframe
logs.show
The output should look something like this:
+-------+-----------+----+-----------------+----+----------+-------------+------+
| action| campaign|cost| domain| ip| session| timestamp| user|
+-------+-----------+----+-----------------+----+----------+-------------+------+
|clicked|campaign_19| 118| youtube.com|ip_4|session_36|1420070400000|user_9|
|blocked|campaign_12| 5| facebook.com|ip_3|session_96|1420070400864|user_5|
|clicked| campaign_3| 54|sf.craigslist.org|ip_9|session_61|1420070401728|user_8|
...
==> Find records with only clicks
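One way to write this (a sketch; the 'clicked' action value comes from the sample output above):

```scala
// Select only the rows where the action was a click.
val clicks = sqlContext.sql("select * from clickstream where action = 'clicked'")
clicks.show
```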
==> Count records from each domain, sort the output by most to least
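A possible query for this step (group by domain, then sort by the count, descending):

```scala
// Count clickstream records per domain, most frequent first.
val domainCounts = sqlContext.sql(
  "select domain, count(*) as count from clickstream group by domain order by count desc")
domainCounts.show
```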
==> Load the domains dataset and register it as table domains
val domainsDF = sqlContext.read.json("data/click-stream/domain-info.json")
domainsDF.show
domainsDF.registerTempTable("domains")
==> Join the clickstream and domains tables
Hint: the syntax for joining two tables A and B is
select A.*, B.* from A join B ON (A.x = B.y)
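Following that hint, the join might look like this (this assumes domain-info.json has a domain column matching clickstream's; check domainsDF.printSchema if unsure):

```scala
// Join the two temp tables on their shared domain column.
val joined = sqlContext.sql(
  "select clickstream.*, domains.* from clickstream join domains on (clickstream.domain = domains.domain)")
joined.show
```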
==> Count traffic per domain category (news, social, etc.)
Hint: query the joined tables
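One way to finish (a sketch, assuming the domains table has a category column holding values like news and social):

```scala
// Count traffic per domain category by grouping the joined tables.
val trafficPerCategory = sqlContext.sql(
  """select domains.category, count(*) as count
     from clickstream join domains on (clickstream.domain = domains.domain)
     group by domains.category
     order by count desc""")
trafficPerCategory.show
```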