dataflow scratchpad for proposed syntax
- you can have a consumer connect to a provider, or vice versa:
  - producer binds to a port, consumers connect to it: pub/sub
  - consumers open a port, producer connects to many: megaphone
- you can bring the provider online first and the consumers later, or vice versa (see the sketch below).
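The bind/connect and pub/sub language above reads a lot like ZeroMQ's socket model. Purely as a point of comparison (ZeroMQ is not part of Wukong; the ffi-rzmq gem, endpoint, and port below are assumptions), here is a minimal pub/sub sketch in Ruby. Because the PUB socket binds and SUB sockets connect, either side can come up first, which is the ordering flexibility described above:

```ruby
# Illustrative only -- a ZeroMQ-style pub/sub pair, not Wukong code.
# The producer binds to a port; any number of consumers connect to it.
# (The "megaphone" shape is the same calls with bind and connect swapped.)
require 'ffi-rzmq'

context = ZMQ::Context.new

publisher = context.socket(ZMQ::PUB)
publisher.bind("tcp://*:5556")

subscriber = context.socket(ZMQ::SUB)
subscriber.connect("tcp://127.0.0.1:5556")
subscriber.setsockopt(ZMQ::SUBSCRIBE, '')   # subscribe to every topic

sleep 0.5   # give the subscription a moment to propagate before publishing

publisher.send_string("hello from the provider")

message = ''
subscriber.recv_string(message)
puts message
```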
You can run the example script count_followers.rb (shown at the end of this page) from the command line:

    wukong count_followers.rb users.json followers_histogram.tsv

It will run in local mode, effectively doing

    cat users.json | {the map block} | sort | {the reduce block} > followers_histogram.tsv

You can instead run it in Hadoop mode, and it will launch the job across a distributed Hadoop cluster:

    wukong --run=hadoop count_followers.rb users.json followers_histogram.tsv
- tsv/csv
- json
- xml
- avro
- apache_log
- flat
- regexp
- gz/bz2/zip/snappy
Data consists of:
- record
- schema
- metadata
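To make that split concrete, here is a small plain-Ruby sketch (stdlib only; the schema fields and helper names are assumptions, not Wukong's API) of decoding the same logical record from two of the formats listed above. The format governs the bytes on the wire, the schema names the fields, and the record is the decoded value:

```ruby
# Illustrative sketch only, not Wukong's API.
require 'json'

# a minimal "schema": field names in TSV column order
SCHEMA = [:followers_count, :created_at]

def record_from_json(line)
  JSON.parse(line, symbolize_names: true)
end

def record_from_tsv(line)
  Hash[SCHEMA.zip(line.chomp.split("\t"))]
end
```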
Sources and sinks can be named explicitly, or inferred from where a stage sits in the dataflow:

    read('/foo/bar')      # source( FileSource.new('/foo/bar') )
    writes('/foo/bar')    # sink( FileSink.new('/foo/bar') )

    file('/foo/bar') | ...    # this we know is a source
    ... | file('/foo/bar')    # this we know is a sink
    file('/foo/bar')          # don't know; maybe we can guess later
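FileSource and FileSink aren't defined on this page; as a rough sketch of the behavior the names suggest (the #each / #<< interface is an assumption, not Wukong's actual classes), a source enumerates records and a sink receives them:

```ruby
# Rough sketch only: minimal file-backed source and sink objects.
class FileSource
  def initialize(path)
    @path = path
  end

  # yield each line of the file as a record
  def each(&block)
    File.foreach(@path, &block)
  end
end

class FileSink
  def initialize(path)
    @file = File.open(path, 'w')
  end

  # receive one record and write it out
  def <<(record)
    @file.puts(record)
    self
  end

  def close
    @file.close
  end
end
```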
Here is an example Wukong script, count_followers.rb:

    from :json

    # one record per user; bucket each user by the month they signed up
    mapper do |user|
      year_month = Time.parse(user[:created_at]).strftime("%Y%m")
      emit [ user[:followers_count], year_month ]
    end

    # count the records in each group
    reducer do
      start   { @count = 0 }
      each    do |followers_count, year_month|
        @count += 1
      end
      finally { emit [*@group_key, @count] }
    end
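For reference, here is roughly the computation that script describes, written as plain single-process Ruby. This is only a sketch of the local-mode `cat | map | sort | reduce` pipeline shown earlier, not how Wukong actually executes the script, and it assumes users.json holds one JSON user object per line:

```ruby
# Sketch only: the grouping and counting the mapper/reducer express, in one process.
require 'json'
require 'time'

counts = Hash.new(0)

File.foreach('users.json') do |line|
  user       = JSON.parse(line, symbolize_names: true)
  year_month = Time.parse(user[:created_at]).strftime("%Y%m")
  counts[[user[:followers_count], year_month]] += 1   # group key => count
end

File.open('followers_histogram.tsv', 'w') do |out|
  counts.each do |(followers_count, year_month), count|
    out.puts [followers_count, year_month, count].join("\t")
  end
end
```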