<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<style>
h1,
h2,
h3,
h4,
h5,
h6,
p,
blockquote {
margin: 0;
padding: 0;
}
body {
font-family: "Helvetica Neue", Helvetica, "Hiragino Sans GB", Arial, sans-serif;
font-size: 13px;
line-height: 18px;
color: #737373;
background-color: white;
margin: 10px 13px 10px 13px;
}
table {
margin: 10px 0 15px 0;
border-collapse: collapse;
}
td,th {
border: 1px solid #ddd;
padding: 3px 10px;
}
th {
padding: 5px 10px;
}
a {
color: #0069d6;
}
a:hover {
color: #0050a3;
text-decoration: none;
}
a img {
border: none;
}
p {
margin-bottom: 9px;
}
h1,
h2,
h3,
h4,
h5,
h6 {
color: #404040;
line-height: 36px;
}
h1 {
margin-bottom: 18px;
font-size: 30px;
}
h2 {
font-size: 24px;
}
h3 {
font-size: 18px;
}
h4 {
font-size: 16px;
}
h5 {
font-size: 14px;
}
h6 {
font-size: 13px;
}
hr {
margin: 0 0 19px;
border: 0;
border-bottom: 1px solid #ccc;
}
blockquote {
padding: 13px 13px 21px 15px;
margin-bottom: 18px;
font-family:georgia,serif;
font-style: italic;
}
blockquote:before {
content:"\201C";
font-size:40px;
margin-left:-10px;
font-family:georgia,serif;
color:#eee;
}
blockquote p {
font-size: 14px;
font-weight: 300;
line-height: 18px;
margin-bottom: 0;
font-style: italic;
}
code, pre {
font-family: Monaco, Andale Mono, Courier New, monospace;
}
code {
background-color: #fee9cc;
color: rgba(0, 0, 0, 0.75);
padding: 1px 3px;
font-size: 12px;
-webkit-border-radius: 3px;
-moz-border-radius: 3px;
border-radius: 3px;
}
pre {
display: block;
padding: 14px;
margin: 0 0 18px;
line-height: 16px;
font-size: 11px;
border: 1px solid #d9d9d9;
white-space: pre-wrap;
word-wrap: break-word;
}
pre code {
background-color: #fff;
color:#737373;
font-size: 11px;
padding: 0;
}
sup {
font-size: 0.83em;
vertical-align: super;
line-height: 0;
}
* {
-webkit-print-color-adjust: exact;
}
@media screen and (min-width: 914px) {
body {
width: 854px;
margin:10px auto;
}
}
@media print {
body,code,pre code,h1,h2,h3,h4,h5,h6 {
color: black;
}
table, pre {
page-break-inside: avoid;
}
}
</style>
<title>Scalding Workshop/Tutorial README</title>
<script type="text/x-mathjax-config">MathJax.Hub.Config({tex2jax:{inlineMath:[['$$$','$$$']]}});</script><script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
</head>
<body>
<h1>Scalding Workshop/Tutorial README</h1>
<p><strong>Dean Wampler, Lightbend</strong><br/>
<a href="mailto:[email protected]?subject=Question%20about%20your%20Scalding%20Workshop">[email protected]</a><br/>
<a href="https://twitter.com/deanwampler">@deanwampler</a><br/>
<a href="http://lightbend.com">Lightbend</a></p>
<p><img src="images/scalding-logo-small.png" alt="Scalding logo" /></p>
<h2>About this Workshop/Tutorial</h2>
<p>This session is a half-day tutorial on Scalding and its place in the Hadoop ecosystem. <a href="https://github.com/twitter/scalding">Scalding</a> is a Scala API developed at Twitter for distributed data programming that uses the <a href="http://www.cascading.org/">Cascading</a> Java API, which in turn sits on top of Hadoop's Java API. However, Scalding, through Cascading, also offers a <em>local</em> mode that makes it easy to run jobs without using the Hadoop libraries, for simpler testing and learning. We'll use this feature for most of this session.</p>
<h2>Getting Started</h2>
<p>We use <a href="http://www.scala-sbt.org/">sbt</a>, the <em>de facto</em> Scala build tool, to resolve dependencies (such as the Scalding and Cascading jars), and to compile the one Hadoop example (but not the rest of the exercises...). You will need to install Git, Java, Scala, and sbt for this workshop, as we discuss next.</p>
<p><strong>Please do the following installation steps <em>before</em> the workshop!</strong></p>
<p>It helps to pick a work directory where you will install some of the packages. In what follows, we'll assume you're using <code>$HOME/fun</code> on Linux, Mac OS X, or Cygwin for Windows with the <code>bash</code> shell (or a similar shell), or <code>C:\fun</code> on Windows.</p>
<h3>Git</h3>
<p>You'll need <a href="http://git-scm.com">git</a> to clone the workshop repository and optionally for other installs. See <a href="http://git-scm.com/book/en/Getting-Started-Installing-Git">Getting Started Installing Git</a> for details.</p>
<h3>This Workshop</h3>
<p>Once git is installed, <a href="https://github.com/deanwampler/scalding-workshop">clone this workshop from GitHub</a>. Use your favorite Git GUI or the command line. Using <code>bash</code>:</p>
<pre><code>cd $HOME/fun
git clone https://github.com/deanwampler/scalding-workshop.git
</code></pre>
<p>On Windows:</p>
<pre><code>cd C:\fun
git clone https://github.com/deanwampler/scalding-workshop.git
</code></pre>
<h3>Java v1.7 or Better</h3>
<p>If it's not already installed, install Java from <a href="http://www.java.com/en/download/help/download_options.xml">java.com</a>.</p>
<h3>Scala v2.11.7 (or v2.10.6)</h3>
<p>We'll use a build of Scalding for Scala v2.11.7 (although you can also use Scala v2.10.6). Install Scala following the instructions <a href="http://www.scala-lang.org/downloads">here</a>.</p>
<h3>SBT</h3>
<p>See the <a href="http://www.scala-sbt.org/">website for sbt</a> for installation instructions. What you actually install is a small Java launcher program. The version of <code>sbt</code> that the project uses is bootstrapped by the launcher the first time you run it.</p>
<h2>Setting Up The Project and a Sanity Check</h2>
<p>Once you've completed these steps, we need to "bootstrap" the project with <code>sbt</code> and then run a "sanity check" script, our exercise 0.</p>
<p>The first of the following three commands changes to the root directory of the workshop. (We'll spend the whole session working in this directory.) The second command runs <code>sbt</code> to create an "assembly", a single jar file that bundles most of the dependent jars we need. The third command runs the sanity check script using a Scala script called <code>run</code> in the root directory of the project, which we'll use for all the exercises.</p>
<p>Using <code>bash</code> (assuming you installed the workshop in <code>$HOME/fun</code>):</p>
<pre><code>cd $HOME/fun/scalding-workshop
sbt assembly
./run scripts/SanityCheck0.scala
</code></pre>
<p>On Windows (assuming you installed the workshop in <code>C:\fun</code>):</p>
<pre><code>cd C:\fun\scalding-workshop
sbt assembly
scala run scripts/SanityCheck0.scala
</code></pre>
<p>The commands should run without error. If you get an error like <code>sbt not found</code> or <code>scala not found</code>, make sure these tools are on your command path (the <code>PATH</code> environment variable).</p>
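<p>One quick way to check for missing tools before running the build is to ask the shell where it finds each one. The following is a minimal sketch for <code>bash</code> (or a similar shell). The function name <code>check_tools</code> is made up for this example and is not part of the workshop:</p>
<pre><code># Hypothetical helper, not part of the workshop: print the location of each
# tool found on the PATH, or a warning for each tool that is missing.
check_tools() {
  for tool in "$@"; do
    command -v "$tool" || echo "$tool: not found on PATH"
  done
}

check_tools git java scala sbt
</code></pre>
<p>Any tool reported as not found typically needs its <code>bin</code> directory added to your <code>PATH</code>. On Windows, <code>where sbt</code> at the command prompt serves a similar purpose.</p>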
<p>The <code>sbt assembly</code> command first runs an <code>update</code> task, which downloads all the dependencies, using the specification in <code>project/Build.scala</code>. You'll see lots of messages as it tries different repositories. Note that these dependencies will be downloaded to your <code>$HOME/.ivy2</code> directory (on *nix systems). <strong>This may take a while to run!!</strong></p>
<p>Next, the <code>assembly</code> task builds an all-inclusive "jar" (<em>Java ARchive</em>) file that includes all the dependencies, including Scalding and Hadoop. This jar file makes it easier to run Scalding scripts on Hadoop, because it simplifies working with dependency jars and the <code>CLASSPATH</code>. The output of <code>assembly</code> is <code>target/ScaldingWorkshop-X.Y.Z.jar</code>, where <code>X.Y.Z</code> will be the current version number for the workshop.</p>
<p>For completeness, note also that the version of <code>sbt</code> itself is specified in <code>project/build.properties</code>. There is also a <code>project/plugins.sbt</code> file that specifies some <code>sbt</code> plugins we use.</p>
<p>Finally, the <code>run</code> Scala script takes a moment to compile the Scalding script and then run it. The output is written to <code>output/SanityCheck0.txt</code>. (What's in that file?)</p>
<p>If you have Ruby installed on your system, there is a port of <code>run</code> in Ruby called <code>run.rb</code>. To use it, just replace the <code>run</code> command above with <code>run.rb</code>, for the *nix <code>bash</code> shell, or for Windows, use <code>ruby run.rb</code> instead of <code>scala run</code>.</p>
<p>See the Appendix below for "optional installs". If you decide to use Scalding after the tutorial, you'll want to install some of these packages.</p>
<h2>Next Steps</h2>
<p>You can now start with the workshop itself. Go to the companion <a href="https://github.com/deanwampler/scalding-workshop/blob/master/Workshop.html">Workshop page</a>.</p>
<h2>Notes on Releases</h2>
<h3>V0.4.0</h3>
<p>Moved to Scala v2.10.3 and Scalding v0.9.0rc4. Refined some of the exercises and added one that uses Scalding's newer "type-safe" API.</p>
<h3>V0.3.X</h3>
<p>Moved to Scala v2.10.2 and Scalding v0.8.6. Completely reworked the build process and the script running process. Refined many of the exercises.</p>
<h3>V0.2.1</h3>
<p>Added a file missing from distribution. Refined the run scripts to work better with different Java versions.</p>
<h3>V0.2</h3>
<p>Refined several exercises and fixed bugs. Added <code>Makefile</code> for building releases. (Since removed...)</p>
<h3>V0.1</h3>
<p>First release for the StrangeLoop 2012 workshop.</p>
<h2>For Further Information</h2>
<p>See the <a href="https://github.com/twitter/scalding">Scalding GitHub page</a> for more information about Scalding. The <a href="https://github.com/twitter/scalding/wiki">wiki</a> is indispensable. The Scaladocs for Scalding are <a href="http://twitter.github.io/scalding/">here</a>.</p>
<p>I'm <a href="mailto:[email protected]">Dean Wampler</a> from <a href="http://lightbend.com">Lightbend</a>. I prepared this workshop. Send me email with <a href="mailto:[email protected]?subject=Question%20about%20your%20Scalding%20Workshop">questions about the workshop</a> or for <a href="mailto:[email protected]?subject=Hiring%20Dean%20Wampler">information about consulting and training</a> on Scala, Scalding, the <a href="http://lightbend.com/platform">Lightbend Reactive Platform</a>, and other Hadoop and <em>Big Data</em> technologies.</p>
<p>Some of the data used in these exercises was obtained from <a href="http://infochimps.com">InfoChimps</a>.</p>
<p><strong>NOTE:</strong> The first version of this workshop was written while I worked at Think Big Analytics. The original and now obsolete fork of the workshop is <a href="https://github.com/ThinkBigAnalytics/scalding-workshop">here</a>.</p>
<p><strong>Dean Wampler</strong><br/>
<a href="mailto:[email protected]?subject=Question%20about%20your%20Scalding%20Workshop">[email protected]</a><br/>
<a href="https://twitter.com/deanwampler">@deanwampler</a><br/></p>
<h2>Appendix - Optional Installs</h2>
<p>If you're serious about using Scalding, you should clone and build the Scalding repo itself. We'll talk briefly about it in the workshop, but it isn't required.</p>
<h3>Scalding from GitHub</h3>
<p>Clone <a href="https://github.com/twitter/scalding">Scalding from GitHub</a>. Using <code>bash</code> and assuming you'll clone it into <code>$HOME/fun</code>:</p>
<pre><code>cd $HOME/fun
git clone https://github.com/twitter/scalding.git
</code></pre>
<p>Windows is similar.</p>
<h3>Ruby v1.8.7 or v1.9.X</h3>
<p>Scalding uses Ruby as a platform-independent language for its driver scripts (e.g., <code>scripts/scald.rb</code>). See <a href="http://ruby-lang.org">ruby-lang.org</a> for details on installing Ruby. Either version 1.8.7 or 1.9.X will work.</p>
<h3>Build Scalding</h3>
<p>Build Scalding according to its <a href="https://github.com/twitter/scalding/wiki/Getting-Started">Getting Started</a> page. By default, Twitter builds with Scala v2.9.3, but Scalding also builds with Scala v2.10.2 if you edit the <code>project/Build.scala</code> file for that version.</p>
<p>Edit <code>project/Build.scala</code>. Near the top, you'll see a line <code>scalaVersion := 2.9.2</code> and next to it, a commented line for version 2.10.0. Comment out the line with 2.9.2 and uncomment the 2.10.0 line, then change the last zero to "2" or "3". Save your changes.</p>
<p>Now, here is a synopsis of the build steps. Using <code>bash</code>:</p>
<pre><code>cd $HOME/fun/scalding
sbt update
sbt assembly
</code></pre>
<p>On Windows:</p>
<pre><code>cd C:\fun\scalding
sbt update
sbt assembly
</code></pre>
<p>(The Getting Started page says to build the <code>test</code> target between <code>update</code> and <code>assembly</code>, but the latter builds <code>test</code> itself.)</p>
<h3>Sanity Check</h3>
<p>Once you've built Scalding, run the following command as a sanity check to ensure everything is set up properly. Using <code>bash</code>:</p>
<pre><code>cd $HOME/fun/scalding
scripts/scald.rb --local tutorial/Tutorial0.scala
</code></pre>
<p>On Windows:</p>
<pre><code>cd C:\fun\scalding
ruby scripts\scald.rb --local tutorial/Tutorial0.scala
</code></pre>
</body>
</html>