<< back to main index / SQL labs

Lab 5.2 : Spark SQL : SQL

Overview

Using SQL statements with Spark SQL

Depends On

None

Run time

20-30 mins


STEP 1: Start Spark Shell

Change the working directory to spark-labs. This way we can access the data using relative paths, which keeps things simple.

    $  cd ~/spark-labs
    $  ~/spark/bin/spark-shell

STEP 2: Load Clickstream data

==> Create a DataFrame

    val clickstreamDF = sqlContext.read.json("data/click-stream/clickstream.json")

==> Register the DataFrame as a temporary table

    clickstreamDF.registerTempTable("clickstream")
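
As an optional sanity check (not part of the original lab steps), you can print the schema and confirm the table is queryable:

    clickstreamDF.printSchema
    sqlContext.sql("select count(*) from clickstream").show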

STEP 3: Querying using SQL

==> Select all logs

    val logs = sqlContext.sql("select * from clickstream")
    // 'logs' is a DataFrame

    logs.show

The output might look like this:

+-------+-----------+----+-----------------+----+----------+-------------+------+
| action|   campaign|cost|           domain|  ip|   session|    timestamp|  user|
+-------+-----------+----+-----------------+----+----------+-------------+------+
|clicked|campaign_19| 118|      youtube.com|ip_4|session_36|1420070400000|user_9|
|blocked|campaign_12|   5|     facebook.com|ip_3|session_96|1420070400864|user_5|
|clicked| campaign_3|  54|sf.craigslist.org|ip_9|session_61|1420070401728|user_8|
...

==> Find only the records where the action is 'clicked'
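
One possible query, assuming the action column uses the value 'clicked' as seen in the sample output above:

    val clicks = sqlContext.sql("select * from clickstream where action = 'clicked'")
    clicks.show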

==> Count the records for each domain and sort the output from most to least
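
A possible solution sketch, grouping by domain and sorting by the count in descending order:

    val perDomain = sqlContext.sql(
      "select domain, count(*) as total from clickstream group by domain order by count(*) desc")
    perDomain.show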

STEP 4: Joining Datasets

==> Load the domains dataset and register it as a table named domains

    val domainsDF = sqlContext.read.json("data/click-stream/domain-info.json")
    domainsDF.show
    domainsDF.registerTempTable("domains")

==> Join the clickstream and domains tables
Hint : the general join syntax for two tables A and B is

    select A.*, B.* from A join B ON (A.x = B.y)
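
Applied to these two tables, one possible join is sketched below. This assumes both tables share a domain column; check the domainsDF.show output for the actual column names.

    val joined = sqlContext.sql(
      """select c.*, d.*
         from clickstream c join domains d on (c.domain = d.domain)""")
    joined.show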

==> Count the traffic per domain category (news, social, etc.)
Hint : query the joined tables
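
A possible sketch, assuming the domains table has a category column (adjust the column name to match what domainsDF.show displays):

    val trafficPerCategory = sqlContext.sql(
      """select d.category, count(*) as total
         from clickstream c join domains d on (c.domain = d.domain)
         group by d.category
         order by count(*) desc""")
    trafficPerCategory.show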

STEP 5: Explore DAGs
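
With the spark-shell still running, open the Spark UI, which is typically available at http://localhost:4040. The Jobs and Stages tabs show the DAG visualization for the queries run above; depending on the Spark version, a SQL tab may also show the query plans.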