
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="generator" content="pandoc">
    <title>Software Carpentry: Working With Data on the Web</title>
    <link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
    <link rel="stylesheet" type="text/css" href="css/swc.css" />
    <link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
    <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
    <!--[if lt IE 9]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
  </head>
  <body class="lesson">
    <div class="container card">
      <div class="banner">
        <a href="http://software-carpentry.org" title="Software Carpentry">
          <img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
        </a>
      </div>
      <article>
      <div class="row">
        <div class="col-md-10 col-md-offset-1">
          <h1 class="title">Working With Data on the Web</h1>
          <h2 class="subtitle">Making Data Findable</h2>
<div id="learning-objectives" class="objectives panel panel-warning">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Make data sets more useful by providing metadata.</li>
</ul>
</div>
</div>
<p>The final step in this lesson is to make the data we generate findable. It’s not enough to tell people what the rule is for creating filenames, since that doesn’t tell them what data sets we’ve actually generated. Instead, we need to create an <a href="reference.html#index">index</a> to tell them what files exist. For reasons we will see in a moment, that index should also tell them when each data set was generated.</p>
<p>Here’s the format we will use:</p>
<pre><code>2014-05-26,FRA,TCD,FRA-TCD.csv
2014-05-27,AUS,BRA,AUS-BRA.csv
2014-05-27,AUS,CAN,AUS-CAN.csv
2014-05-28,BRA,CAN,BRA-CAN.csv</code></pre>
<p>The four columns are the date the data set was generated, the two country codes, and the name of the data file. Why do we bother to include the filename? After all, we can re-generate it easily given the two country codes. The answer is that while <em>we</em> know the rule for generating filenames, other people’s programs shouldn’t have to. Explicit is better than implicit.</p>
<p>Here’s a function that updates the index file every time we generate a new data file:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="ch">import</span> csv
<span class="ch">import</span> time

<span class="kw">def</span> update_index(index_filename, left, right):
    <span class="co">&#39;&#39;&#39;Append a record to the index.&#39;&#39;&#39;</span>

    <span class="co"># Read existing data.</span>
    <span class="kw">with</span> <span class="dt">open</span>(index_filename, <span class="st">&#39;r&#39;</span>) <span class="ch">as</span> raw:
        reader = csv.reader(raw)
        records = []
        <span class="kw">for</span> r in reader:
            records.append(r)
    
    <span class="co"># Create new record.</span>
    timestamp = time.strftime(<span class="st">&#39;%Y-%m-</span><span class="ot">%d</span><span class="st">&#39;</span>)
    data_filename = left + <span class="st">&#39;-&#39;</span> + right + <span class="st">&#39;.csv&#39;</span>
    new_record = (timestamp, left, right, data_filename)
    
    <span class="co"># Save.</span>
    records.append(new_record)
    <span class="kw">with</span> <span class="dt">open</span>(index_filename, <span class="st">&#39;w&#39;</span>) <span class="ch">as</span> raw:
        writer = csv.writer(raw)
        writer.writerows(records)</code></pre>
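<p>Because <code>update_index</code> only ever adds one row, it does not strictly need to re-read the whole file. The following is a minimal sketch of an alternative that opens the index in append mode, assuming Python 3. The name <code>append_to_index</code> and the optional <code>when</code> parameter are hypothetical additions for illustration, not part of the lesson’s code.</p>

```python
import csv
import time


def append_to_index(index_filename, left, right, when=None):
    '''Append one record to the index without re-reading it.

    append_to_index and the 'when' parameter are hypothetical names
    used for illustration; they are not part of the lesson's code.
    '''
    # Use today's date unless the caller supplies one (handy for testing).
    timestamp = when if when is not None else time.strftime('%Y-%m-%d')
    data_filename = left + '-' + right + '.csv'

    # Mode 'a' creates the file if it is missing, otherwise adds to the end.
    # newline='' is what the csv module recommends in Python 3.
    with open(index_filename, 'a', newline='') as raw:
        csv.writer(raw).writerow((timestamp, left, right, data_filename))
```

<p>Unlike the read-then-rewrite version, this sketch also works when the index file does not exist yet.</p>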
<p>Let’s test it. If our index file contains:</p>
<pre><code>2014-05-26,FRA,TCD,FRA-TCD.csv
2014-05-27,AUS,BRA,AUS-BRA.csv
2014-05-27,AUS,CAN,AUS-CAN.csv
2014-05-28,BRA,CAN,BRA-CAN.csv</code></pre>
<p>and we run:</p>
<pre class="sourceCode python"><code class="sourceCode python">update_index(<span class="st">&#39;data/index.csv&#39;</span>, <span class="st">&#39;TCD&#39;</span>, <span class="st">&#39;CAN&#39;</span>)</code></pre>
<p>then, assuming today’s date is 2014-05-29, our index file now contains:</p>
<pre><code>2014-05-26,FRA,TCD,FRA-TCD.csv
2014-05-27,AUS,BRA,AUS-BRA.csv
2014-05-27,AUS,CAN,AUS-CAN.csv
2014-05-28,BRA,CAN,BRA-CAN.csv
2014-05-29,TCD,CAN,TCD-CAN.csv</code></pre>
<p>Now that all of this is in place, it’s easy for us, and for other people, to do new and exciting things with our data. For example, we can write a small program that tells us which data sets include information about a particular country <em>and</em> were published on or after a given date, such as the last time we checked:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="kw">def</span> what_is_available(index_file, country, after):
    <span class="co">&#39;&#39;&#39;What data files include a country and have been published since &#39;after&#39;?&#39;&#39;&#39;</span>
    <span class="kw">with</span> <span class="dt">open</span>(index_file, <span class="st">&#39;r&#39;</span>) <span class="ch">as</span> raw:
        reader = csv.reader(raw)
        filenames = []
        <span class="kw">for</span> record in reader:
            <span class="kw">if</span> (after &lt;= record[<span class="dv">0</span>]) and (record[<span class="dv">1</span>] == country or record[<span class="dv">2</span>] == country):
                filenames.append(record[<span class="dv">3</span>])
    <span class="kw">return</span> filenames

<span class="dt">print</span>(what_is_available(<span class="st">&#39;data/index.csv&#39;</span>, <span class="st">&#39;BRA&#39;</span>, <span class="st">&#39;2014-05-27&#39;</span>))</code></pre>
<pre class="output"><code>[&#39;AUS-BRA.csv&#39;, &#39;BRA-CAN.csv&#39;]</code></pre>
<p>This may not seem like a breakthrough, but it is actually an example of how the web helps researchers do new kinds of science. With a little bit more work, we could create a file on <em>our</em> machine to record when we last ran <code>what_is_available</code> for each of several different sites that are producing data. Each time we run it, our program would:</p>
<ul>
<li>read our local “what to check” file;</li>
<li>ask each data source whether it had any new data for us;</li>
<li>download and process that data; and</li>
<li>present us with a summary of the results.</li>
</ul>
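<p>The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the in-memory <code>FAKE_SITES</code> table, and the <code>example</code> URLs are all made up, and the download-and-process step is left out. A real version would replace <code>fetch_index</code> with something that downloads each site’s index file.</p>

```python
import csv
import io


def new_files(index_text, country, after):
    '''Return filenames in CSV index text that mention country, dated on or after 'after'.'''
    found = []
    for date, left, right, filename in csv.reader(io.StringIO(index_text)):
        if after <= date and country in (left, right):
            found.append(filename)
    return found


def check_sources(last_checked, fetch_index, country):
    '''Ask every source for its index and summarize what is new for us.

    last_checked maps a source URL to the date we last looked at it.
    fetch_index is any function that takes a URL and returns index text;
    here it is a hypothetical stand-in for a real download.
    '''
    summary = {}
    for url, after in sorted(last_checked.items()):
        summary[url] = new_files(fetch_index(url), country, after)
    return summary


# Stand-in for the network: two made-up sites and their index files.
FAKE_SITES = {
    'http://site-a.example': '2014-05-27,AUS,BRA,AUS-BRA.csv\n2014-05-28,BRA,CAN,BRA-CAN.csv\n',
    'http://site-b.example': '2014-05-26,FRA,TCD,FRA-TCD.csv\n',
}

# Our local "what to check" record: when we last looked at each site.
last_checked = {
    'http://site-a.example': '2014-05-28',
    'http://site-b.example': '2014-05-01',
}

print(check_sources(last_checked, FAKE_SITES.get, 'BRA'))
```

<p>Passing the fetching function in as an argument keeps the sketch runnable without a network connection, and makes it easy to swap in a real downloader later.</p>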
<p>This is exactly how blogs work. Every blog reader keeps a list of blog URLs that it’s supposed to check. When it is run, it goes to each of those sites and asks them for their index file (which is typically called something like <code>feed.xml</code>). It then checks the articles listed in that index against its local record of what has already been seen, then downloads any articles that are new.</p>
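<p>To make the comparison concrete, here is a minimal sketch of the “check the index against what has been seen” step using Python’s standard <code>xml.etree.ElementTree</code> module. The feed text is a tiny hand-written RSS example and the function name <code>unseen_articles</code> is hypothetical; real feeds carry more fields, and some use the Atom format instead.</p>

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 feed (made up for this example).
FEED = '''<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First post</title><link>http://blog.example/1</link></item>
    <item><title>Second post</title><link>http://blog.example/2</link></item>
  </channel>
</rss>'''


def unseen_articles(feed_text, seen_links):
    '''Return (title, link) pairs for feed items whose link is not in seen_links.'''
    root = ET.fromstring(feed_text)
    unseen = []
    for item in root.iter('item'):
        link = item.findtext('link')
        if link not in seen_links:
            unseen.append((item.findtext('title'), link))
    return unseen


# We have already read the first post, so only the second is reported.
print(unseen_articles(FEED, {'http://blog.example/1'}))
```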
<p>By automating this process, blogging tools help us focus attention on things that are actually worth looking at.</p>
<div id="to-automate-or-not" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>To Automate or Not</h2>
</div>
<div class="panel-body">
<p>Should <code>update_index</code> be called inside <code>save_records</code> so that the index is automatically updated every time a new data set is generated? Why or why not?</p>
</div>
</div>
<div id="removing-redundant-redundancy" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Removing Redundant Redundancy</h2>
</div>
<div class="panel-body">
<p><code>update_index</code> and <code>save_records</code> both construct the name of the data file. Refactor them to remove this redundancy.</p>
</div>
</div>
<div id="generating-a-local-index-large" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Generating a Local Index (Large)</h2>
</div>
<div class="panel-body">
<p>Generate a local index file to show what data sets were created when.</p>
</div>
</div>
        </div>
      </div>
      </article>
      <div class="footer">
        <a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
        <a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
        <a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
        <a class="label swc-blue-bg" href="LICENSE.html">License</a>
      </div>
    </div>
    <!-- Javascript placed at the end of the document so the pages load faster -->
    <script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
    <script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
  </body>
</html>