forked from swcarpentry/web-data-python
-
Notifications
You must be signed in to change notification settings - Fork 0
/
05-makedata.html
114 lines (112 loc) · 7.51 KB
/
05-makedata.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Working With Data on the Web</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<h1 class="title">Working With Data on the Web</h1>
<h2 class="subtitle">Publishing Data</h2>
<div id="learning-objectives" class="objectives panel panel-warning">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Write Python programs that share static data sets.</li>
</ul>
</div>
</div>
<p>In our <a href="01-getdata.html">previous lesson</a>, we built functions called <code>get_annual_mean_temp_by_country</code> and <code>diff_records</code> to download temperature data for different countries and find annual differences. The next step is to share our findings with the world by making publishing the data sets we generate. To do this, we have to answer three questions:</p>
<ul>
<li>How are we going to store the data?</li>
<li>How are people going to download it?</li>
<li>How are people going to find it?</li>
</ul>
<p>The first question is the easiest to answer: <code>diff_records</code> returns a list of (year, difference) pairs that we can write out as a CSV file:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="ch">import</span> csv
<span class="kw">def</span> save_records(filename, records):
<span class="co">'''Save a list of [year, temp] pairs as CSV.'''</span>
<span class="kw">with</span> <span class="dt">open</span>(filename, <span class="st">'w'</span>) <span class="ch">as</span> raw:
writer = csv.writer(raw)
writer.writerows(records)</code></pre>
<p>Let’s test it:</p>
<pre class="sourceCode python"><code class="sourceCode python">save_records(<span class="st">'temp.csv'</span>, [[<span class="dv">1</span>, <span class="dv">2</span>], [<span class="dv">3</span>, <span class="dv">4</span>]])</code></pre>
<p>If we then look in the file <code>temp.csv</code>, we find:</p>
<pre><code>1,2
3,4</code></pre>
<p>as desired.</p>
<p>Now, where should this file go? The answer is clearly “a server”, since data on our laptop is only accessible when we’re online (and probably not even then, since most people don’t run a web server on their laptop). But where on the server, and what should we call it?</p>
<p>The answer to those questions depends on how the server is set up. On many multi-user Linux machines, users can create a directory called something like <code>public_html</code> under their home directory, and the web server will search in those directories. For example, if Nelle has a file called <code>thesis.pdf</code> in her <code>public_html</code> directory, the web server will find it when it gets the URL <code>http://the.server.name/u/nelle/thesis.pdf</code>. The specifics differ from one machine to the next, but the mechanism stays the same.</p>
<p>As for what we should call it, here we return to the key idea in REST: every data set should be identified by a “guessable” URL. In our case we’ll use the name <code>left-right.csv</code>, where <code>left</code> and <code>right</code> are the three-letter codes of the countries whose mean annual temperatures we are differencing. We can then tell people that if they want to compare Australia and Brazil, they should look for <code>http://the.server.name/u/nelle/AUS-BRA.csv</code>. (We use upper case to be consistent with the World Bank’s API.)</p>
<p>But what’s to prevent someone from creating a badly-named (and therefore unfindable) file? Someone could, for example, call <code>save_records('aus+bra.csv', records)</code>. To prevent this (or at least reduce the risk), let’s modify <code>save_records</code> as follows:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="ch">import</span> csv
<span class="kw">def</span> save_records(left, right, records):
<span class="co">'''Save a list of [year, temp] pairs as CSV.'''</span>
filename = left + <span class="st">'-'</span> + right + <span class="st">'.csv'</span>
<span class="kw">with</span> <span class="dt">open</span>(filename, <span class="st">'w'</span>) <span class="ch">as</span> raw:
writer = csv.writer(raw)
writer.writerows(records)</code></pre>
<p>We can now call it like this:</p>
<pre class="sourceCode python"><code class="sourceCode python">save_records(<span class="st">'AUS'</span>, <span class="st">'BRA'</span>, [[<span class="dv">1</span>, <span class="dv">2</span>], [<span class="dv">3</span>, <span class="dv">4</span>]])</code></pre>
<p>and then check that the right output file has been created. Since we are bound to have the country codes anyway (having used them to look up our data), this is as little extra work as possible.</p>
<div id="testing-output" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Testing Output</h2>
</div>
<div class="panel-body">
<p>Modify <code>save_records</code> so that it can be tested using <code>cStringIO</code>.</p>
</div>
</div>
<div id="deciding-what-to-check" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Deciding What to Check</h2>
</div>
<div class="panel-body">
<p>Should <code>save_records</code> check that every record in its input is the same length? Why or why not?</p>
</div>
</div>
<div id="setting-up-locally" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Setting Up Locally</h2>
</div>
<div class="panel-body">
<p>Find out how to publish a file on your department’s server.</p>
</div>
</div>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>