
Lesson 15: Hadoop and Impala Exercises


What is Hadoop?

Hadoop is an open-source project containing a MapReduce job scheduler and a distributed filesystem (HDFS). Both are open-source implementations of systems described in Google papers:

  • Hadoop MapReduce :: the Google MapReduce paper
  • HDFS :: the Google File System paper (http://research.google.com/archive/gfs.html)

Hadoop sets itself apart from HPC systems by integrating computation and storage. Instead of pulling data onto a compute cluster, compute tasks are sent to the data: tasks are completed where the data lives.

Jobs are typically run using the MapReduce paradigm, with two phases:

  • Map: Filtering and transforming the data
  • Reduce: Aggregating records based on some key and an aggregation function

The same idea appears in many functional programming languages:

x = xrange(10)
x2 = [i ** 2 for i in x]  # map phase: each element is transformed by squaring (note **; ^ is XOR in Python)
total = sum(x2)           # reduce phase: all elements are combined into a single number

HDFS

HDFS is a distributed filesystem that serves as the backbone of Hadoop.

HDFS is easily scalable: more storage just means more machines. Data is also replicated across many machines, so it stays available even through disk or node failures.

Caveats:

  • All files are append-only - no updates or in-place edits.
  • All data is replicated, so no data is lost if any machine fails (at the cost of extra storage overhead)

Hadoop lets us access files on this special filesystem (which coexists with the local filesystem) using familiar command-line tools.

To list files, there is a command similar to `ls`:

hadoop fs -ls

To cat files, we also use the hadoop fs command:

hadoop fs -cat /path/to/file

If we want to copy something from the local filesystem to HDFS, we can use the -copyFromLocal flag:

hadoop fs -copyFromLocal <local_file> <folder_on_hdfs>

Now we can access this file on HDFS using cat:

hadoop fs -cat <some_file> | less

Hive

Hive is a data warehouse system for Hadoop that handles querying data on HDFS. Hive presents a SQL-like language (HiveQL), which it translates into MapReduce jobs. Most importantly, you can use your SQL skills without learning how to write MapReduce programs. HiveQL shares much of its syntax with MySQL.
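
For example, a HiveQL query like the one below (table and column names hypothetical) compiles to a MapReduce job: roughly speaking, the where filter and column selection run in the map phase, and the group by aggregation runs in the reduce phase.

select category, count(*)
from listings
where salary > 50000
group by category;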

The data itself is stored on HDFS; the metadata (schemas, table locations, and so on) is stored separately, in the Hive metastore.
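
To see that metadata, both Hive and Impala support describe; a quick sketch, assuming a table named mytable already exists:

describe mytable;            -- column names and types
describe formatted mytable;  -- full metadata, including the HDFS location of the data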

Hive DDL
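
Hive DDL statements create, alter, and drop table definitions; they manipulate the metadata rather than the data files themselves. A minimal sketch (table names hypothetical):

create table mytable (id int, description string);
alter table mytable rename to mytable2;
show tables;
drop table mytable2;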

Impala

The goal of Impala is to provide more interactive querying. The syntax mirrors that of Hive; however, some of Hive's features are not implemented yet.

Caveats

  • Impala does not launch MapReduce jobs; it runs queries on its own execution engine
  • No fault tolerance: if a node fails mid-query, the query fails and must be rerun

Queries are run interactively from the Impala shell:

impala-shell

Creating a Database and Table

create database mydb;
use mydb;

create table mytable
	(
		id int,
		description string
	);

Loading Data

insert into mytable (id, description) values (1, 'This is text'), (2, 'This is text');

OR

load data inpath '/path/to/files' into table mytable;
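
Either way, a quick sanity check confirms the rows arrived (using the table created above):

select count(*) from mytable;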

Specifying Data Storage

In the command above we were able to create a table, but we also have to say something about how the underlying data is stored. This is done with the ROW FORMAT clause.

create table mytable
	(
		id int,
		description string
	)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

We specified that the column delimiter is a tab and that the underlying files are plain text. There are alternatives as well - for example, the external, comma-delimited table below, or the Parquet sketch that follows it.

create external table if not exists salary_data
(Id int,
Title string,
FullDescription string,
LocationRaw string,
LocationNormalized string,
ContractType string,
ContractTime string,
Company string,
Category string,
SalaryRaw string,
SalaryNormalized float,
SourceName string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/admin/salary_data';
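
Other file formats can be requested with a STORED AS clause; for example, a sketch of a Parquet-backed table (column list shortened here for brevity; the keyword is PARQUETFILE in older Impala versions):

create table salary_data_parquet
(Id int,
Title string,
SalaryNormalized float)
STORED AS PARQUET;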

External vs. Internal Tables

Traditionally, you can load data into a table with the following command:

load data inpath '/path/to/files' into table mytable;

This command moves the files from /path/to/files into a folder managed by Hive. One important note: Hive needs write permission on the source folder in order to move the files. Alternatively, you can create an external table.

create external table mytable (...);

This leaves the files where they are and manages them in place. Unlike with internal tables, dropping an external table will not delete the data.
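
A minimal sketch of the contrast (table names and path hypothetical):

create table internal_t (id int);             -- data will live in a Hive-managed folder
create external table external_t (id int)
LOCATION '/user/admin/external_t';            -- data stays at this path

drop table internal_t;    -- removes the metadata AND the underlying files
drop table external_t;    -- removes the metadata only; files remain on HDFS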

Querying with Impala

  1. What is the average salary per Category?

select category, AVG(salarynormalized) from salary_data group by category;

  2. Print 15 entries where the job is located in London

select * from salary_data where locationnormalized = 'London' limit 15;

  3. What is the title of the highest-salary job?

select title
from salary_data t1
join
(select MAX(salarynormalized) as max_salary from salary_data) t2
on t1.salarynormalized = t2.max_salary;
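
Since Impala supports order by when paired with limit, an equivalent and shorter way to answer the last question:

select title from salary_data order by salarynormalized desc limit 1;

(If several jobs tie for the highest salary, this returns only one of them.)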

On your own

  1. What is the average salary per source?
  2. How many job listings are there per ContractType?
  3. What is the most frequently occurring Title?