Last updated: 2020-03-26
Table of contents
- Sources and Scrapers
This guide provides information on the criteria we use to determine whether a source should be added to the project, and offers technical details on how a source can be implemented.
Any source added to the scraper must meet the following criteria:
- No news articles, no aggregated sources, no Wikipedia.
- Additional data is welcome.
- In keeping with other datasets, presumptive cases should be considered part of the case total.
If you have found a source that matches the criteria above, read on!
Sources can pull JSON, CSV, or good ol' HTML down and are written in a sort of modular way, with a handful of helpers available to clean up the data. Sources can pull in data for anything -- cities, counties, states, countries, or collections thereof. See the existing scrapers for ideas on how to deal with different ways of data being presented.
Start by going to `src/shared/scrapers/` and creating a new file in the appropriate country and region directory (e.g. `src/shared/scrapers/USA/CA/mycounty-name.js`).

Note: any files you create that start with `_` will be ignored by the crawler. This is a good way to create utility code or functionality shared between scrapers.
Your source should export an object containing, at a minimum, the following properties:

- `url` - The source of the data
- `scraper` - An async function that scrapes data and returns objects, defined below
Add the following directly to the scraper object if the data you're pulling in is specific to a given location:

- `country` - ISO 3166-1 alpha-3 country code
- `state` - The state, province, or region
- `county` - The county or parish
- `city` - The city name
Additional flags can be set:

- `type` - one of `json`, `csv`, `table`, `list`, `paragraph`, `pdf`, `image`. Assumes `list` if `undefined`.
- `timeseries` - `true` if this source provides timeseries data, `false` or `undefined` if it only provides the latest data
- `headless` - whether this source requires a headless browser to scrape
- `certValidation` - `false` to skip certificate validation when running this scraper (used to work around certificate errors)
- `priority` - any number (negative or positive). `0` is the default; higher priority wins if duplicate data is present, and ties are broken by rating (see "Source rating" below).
For each scraper, we're now asking that you provide:

- `sources` - Array of objects with `{ name, url, description }` detailing the true source of the data, with `name` as a human-readable name and `url` as the URL for the source's landing page. This is required when using CSV and JSON sources that aren't webpages a human can read.
If this is a curated source (data aggregated by a single person or organization from multiple organizations):

- `curators` - Array of objects with `{ name, url, twitter, github, email }` indicating the name of the curator and their information so that they can get credit on the page.
If you're interested in maintaining the scraper and would like your name to appear on the sources page, add the following:

- `maintainers` - Array of objects with `{ name, url, twitter, github, email }`. If you provide a `url`, that will be used on the site; otherwise it will go down the list and link to whatever information you've provided. Anything beyond a name is totally optional, but `github` is encouraged.
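Putting these pieces together, a source object might look like the following sketch. The URL, names, and handles are placeholders, the scraper body is elided, and the country/state code format follows the sample scrapers later in this guide:

```javascript
{
  url: 'https://health.example.gov/covid-19/cases.csv', // placeholder URL
  type: 'csv',
  country: 'iso1:US',   // code format as used in the sample scrapers below
  state: 'iso2:US-CA',
  sources: [
    {
      name: 'Example County Health Department', // placeholder
      url: 'https://health.example.gov/covid-19',
      description: 'County-level case counts, updated daily'
    }
  ],
  maintainers: [{ name: 'Jane Doe', github: 'janedoe' }], // placeholder
  scraper: async function() {
    // fetch the data and return an object or array of objects (see below)
  }
}
```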
Everything defined on the source object starting with `_` will be available to the scraper via `this`.
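For example, a hypothetical `_countyMap` helper defined on the source object (similar to the Louisiana scraper shown later) can be read inside the scraper through `this`:

```javascript
{
  url: 'https://health.example.gov/covid-19', // placeholder URL
  _countyMap: { 'La Salle Parish': 'LaSalle Parish' }, // helper data shared with the scraper
  scraper: async function() {
    // underscore-prefixed properties are reachable via `this`
    const canonicalName = this._countyMap['La Salle Parish'];
    // ...
  }
}
```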
Sources are rated based on:

- How hard is it to read? - `csv` and `json` give the best scores, with `table` right behind, and `list` and `paragraph` worse. `pdf` gets no points, and `image` gets negative points.
- Timeseries? - Sources score points if they provide a timeseries.
- Completeness - Sources get points for having `cases`, `tested`, `deaths`, `hospitalized`, `discharged`, `recovered`, `country`, `state`, `county`, and `city`.
- SSL - Sources get points for serving over SSL.
- Headless? - Sources get docked points if they require a headless scraper.
Scrapers are `async` functions associated with the `scraper` attribute on the source object. You may implement one or multiple scrapers if the source changes its formatting (see "What to do if a scraper breaks?").

Your scraper should return an object, an array of objects, or `null` in case the source does not have any data.
The object may contain the following attributes:

- `country` - ISO 3166-1 alpha-3 country code [required]
- `state` - The state, province, or region (not required if defined on scraper object)
- `county` - The county or parish (not required if defined on scraper object)
- `city` - The city name (not required if defined on scraper object)
- `cases` - Total number of cases
- `deaths` - Total number of deaths
- `hospitalized` - Total number of hospitalized
- `discharged` - Total number of discharged
- `recovered` - Total number recovered
- `tested` - Total number tested
- `feature` - GeoJSON feature associated with the location (see Features and population data)
- `featureId` - Additional identifiers to aid with feature matching (see Features and population data)
- `population` - The estimated population of the location (see Features and population data)
- `coordinates` - Array of coordinates as `[longitude, latitude]` (see Features and population data)
Returning an array of objects is useful for aggregate sources: sources that provide information for more than one geographical area. For example, Canada provides information for all provinces of the country. If the scraper returns an array, each object in the array will have the attributes specified in the source object appended, meaning you only need to specify the fields that change per location (`county`, `cases`, `deaths`, for example).
`null` should be returned in case no data is available. This could be the case if the source has not provided an update for today, or we are fetching historical information for which we have no cached data.
At the moment, we provide support for scraping `HTML`, `CSV`, `TSV`, `JSON`, and `PDF` documents. Fetching is accomplished using the functions provided by `lib/fetch.js`. This module should be imported in your source file and provides 5 functions:

- `await fetch.page(url)` retrieves an HTML document and loads it using Cheerio. Make sure to refer to their documentation.
- `await fetch.csv(url)` and `await fetch.tsv(url)` retrieve a CSV/TSV document and load it as a JSON array. Each row of the CSV is an item in the array, and the CSV columns are the attributes of each item in the returned array.
- `await fetch.json(url)` retrieves a JSON document.
- `await fetch.pdf(url)` retrieves a PDF document. Returns an array of text objects with their associated x/y positions. We provide a number of helper functions to process documents in `lib/pdf.js`.
- `await fetch.headless(url)` - in certain instances, the page fetched requires a full browser to be able to fetch the data (e.g. we need JavaScript enabled). Uses a headless browser and returns the loaded HTML document using Cheerio.
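As an illustration, a hypothetical JSON-based scraper might use `fetch.json` like this (the field names are made up; `fetch` and `parse` are imported as in the sample scrapers below):

```javascript
scraper: async function() {
  // fetch.json returns the parsed JSON document
  const data = await fetch.json(this.url);
  return {
    cases: parse.number(data.totalCases),   // hypothetical field names
    deaths: parse.number(data.totalDeaths)
  };
}
```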
See library functions for the API of the available library/utility functions you can use in your scraper.

Key highlights:

- `lib/geography.js` provides helper functions related to location geography. Make sure to look at `addEmptyRegions` and `addCounty`, as they are often used.
- `lib/parse.js` provides helper functions to parse numbers, floats, and strings.
- `lib/transform.js` provides helper functions to perform common data manipulation operations. Make sure to look at `sumData`, as it is often used.
- `lib/datetime/index.js` provides helper functions to perform date-related manipulations.
Of course, if something is missing, `yarn add` it as a dependency and `import` it!
It's a tough challenge to write scrapers that will work when websites are inevitably updated. Here are some tips:
- If your source is an HTML table, validate its structure (see HTML table validation below).
- If data for a field is not present (e.g. no recovered information), do not put 0 for that field. Make sure to leave the field undefined so the scraper knows there is no information for that particular field.
- Write your scraper so it handles aggregate data with a single scraper entry (i.e. find a table, process the table).
- Try not to hardcode county or city names; instead, let the data on the page populate that.
- Try to make your scraper less brittle by avoiding generated class names (i.e. CSS modules).
- When targeting elements, don't assume the order will be the same (i.e. if there are multiple `.count` elements, don't assume the second one is deaths; verify it by parsing the label, as sketched below).
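For the last tip, a minimal sketch of label-based matching with Cheerio (the `.count`, `.label`, and `.value` selectors are hypothetical markup):

```javascript
// Don't assume the second .count element is deaths; check its label instead
let deaths;
$('.count').each((index, el) => {
  const $el = $(el);
  const label = parse.string($el.find('.label').text()).toLowerCase();
  if (label.includes('death')) {
    deaths = parse.number($el.find('.value').text());
  }
});
```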
If your source is an HTML page, you can use a simple HTML table validation to verify that the structure of your table is what you expect, prior to scraping.
At the top of your scraper, import the module:

```javascript
import * as htmlTableValidation from '../../../lib/html/table-validation.js';
```

And use it like this during your scrape (assuming the table is named `$table`):
```javascript
const rules = {
  headings: {
    0: /country/i,
    1: /number of cases/i,
    2: /deaths/i
  },
  data: [
    { column: 0, row: 'ANY', rule: /Adams/ },
    { column: 1, row: 'ALL', rule: /^[0-9]+$/ },
    { column: 2, row: 'ALL', rule: /(^[0-9]+|)$/ }
  ]
};
const opts = { includeErrCount: 5, logToConsole: true };
htmlTableValidation.throwIfErrors($table, rules, opts);
```
When this runs, if any rules are not satisfied, it will throw an Error with a few sample failures (5, in this case).

Sample output, logged to console:

```
3 validation errors.
[
  'heading 0 "County" does not match /country/i',
  'heading 1 "Cases" does not match /number of cases/i',
  'no row in column 0 matches /Adams/'
]
```

Error thrown:

```
Error processing : Error: 3 validation errors.. Sample: heading 0 "County" does not match /country/i;heading 1 ... [etc.]
```
### Sample scraper
Here's the scraper for Indiana that gets data from a CSV:
```javascript
{
  url: 'https://opendata.arcgis.com/datasets/d14de7e28b0448ab82eb36d6f25b1ea1_0.csv',
  country: 'iso1:US',
  state: 'iso2:US-IN',
  scraper: async function() {
    let data = await fetch.csv(this.url);
    let counties = [];
    for (let county of data) {
      counties.push({
        county: geography.addCounty(parse.string(county.COUNTYNAME)), // Add " County" to the end
        cases: parse.number(county.Total_Positive),
        deaths: parse.number(county.Total_Deaths),
        tested: parse.number(county.Total_Tested)
      });
    }
    // Also return data for IN itself
    counties.push(transform.sumData(counties));
    return counties;
  }
},
```
You can see that `country` and `state` are already defined on the object, and all the scraper has to do is pull down the CSV and return an array of objects.
Here's the scraper for Oregon that pulls data from an HTML table:
```javascript
{
  state: 'iso2:US-OR',
  country: 'iso1:US',
  url: 'https://www.oregon.gov/oha/PH/DISEASESCONDITIONS/DISEASESAZ/Pages/emerging-respiratory-infections.aspx',
  scraper: async function() {
    let counties = [];
    let $ = await fetch.page(this.url);
    let $table = $('table[summary="Cases by County in Oregon for COVID-19"]');
    let $trs = $table.find('tbody > tr:not(:first-child):not(:last-child)');
    $trs.each((index, tr) => {
      let $tr = $(tr);
      counties.push({
        county: geography.addCounty(parse.string($tr.find('td:first-child').text())),
        cases: parse.number($tr.find('td:nth-child(2)').text())
      });
    });
    // Also return data for OR itself
    counties.push(transform.sumData(counties));
    return counties;
  }
},
```
It first finds the table with the `[summary]` attribute, then iterates over each of the rows, extracting county names and cases (skipping the first and last rows), and finally returns an array of objects.
If your data source has timeseries data, you can include its data in retroactive regeneration (prior to this project's inception) by checking for `process.env['SCRAPE_DATE']`. This date is your target date; get it in whatever format you need, and only return results from your timeseries dataset for that date. See the JHU scraper for an example.
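A minimal sketch of that pattern for a timeseries CSV (the `Date`, `Cases`, and `Deaths` column names are hypothetical):

```javascript
scraper: async function() {
  const data = await fetch.csv(this.url);

  // SCRAPE_DATE is set during retroactive regeneration; fall back to today otherwise
  const scrapeDate = process.env['SCRAPE_DATE'] ? new Date(process.env['SCRAPE_DATE']) : new Date();

  // Only return results for the target date (hypothetical column names)
  const row = data.find(d => new Date(d.Date).toDateString() === scrapeDate.toDateString());
  if (!row) return null; // no data for that date

  return {
    cases: parse.number(row.Cases),
    deaths: parse.number(row.Deaths)
  };
}
```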
Scrapers need to be able to operate correctly on old data, so updates to scrapers must be backwards compatible. If you know the date the site broke, you can have two implementations (or more) of a scraper in the same function, based on date:
```javascript
{
  state: 'iso2:US-LA',
  country: 'iso1:US',
  aggregate: 'county',
  _countyMap: { 'La Salle Parish': 'LaSalle Parish' },
  scraper: {
    // 0 matches all dates before the next definition
    '0': async function() {
      this.url = 'http://ldh.la.gov/Coronavirus/';
      this.type = 'table';
      const counties = [];
      const $ = await fetch.page(this.url);
      const $table = $('p:contains("Louisiana Cases")').nextAll('table');
      // ...
      return counties;
    },
    // 2020-03-14 matches all dates starting with 2020-03-14
    '2020-03-14': async function() {
      this.url = 'https://opendata.arcgis.com/datasets/cba425c2e5b8421c88827dc0ec8c663b_0.csv';
      this.type = 'csv';
      const counties = [];
      const data = await fetch.csv(this.url);
      // ...
      return counties;
    },
    // 2020-03-17 matches all dates after 2020-03-14 and starting with 2020-03-17
    '2020-03-17': async function() {
      this.url = 'https://opendata.arcgis.com/datasets/79e1165ecb95496589d39faa25a83ad4_0.csv';
      this.type = 'csv';
      const counties = [];
      const data = await fetch.csv(this.url);
      // ...
      return counties;
    }
  }
}
```
As you can see, you can change `this.url` and `this.type` within your function (but be sure to set it every time so it works with timeseries generation).
Another example: when the HTML on the page changes, you can simply change the selectors or Cheerio function calls:
```javascript
let $table;
if (datetime.scrapeDateIsBefore('2020-03-16')) {
  $table = $('table[summary="Texas COVID-19 Cases"]');
} else {
  $table = $('table[summary="COVID-19 Cases in Texas Counties"]');
}
```
You can also use `datetime.scrapeDateIsAfter()` for more complex customization.
We strive to provide a GeoJSON feature and population number for every location in our dataset. When adding a source for a country, we may already have this information and can populate it automatically. For smaller regional entities, this information may not be available and has to be added manually.
Features can be specified in several ways: through the `country`, `state`, and `county` fields, by matching the `longitude` and `latitude` to a particular feature, through the `featureId` field, or through the `feature` field.

While the first two methods work most of the time, sometimes you will have to rely on `featureId` to help the crawler make the correct guess.
`featureId` is an object that specifies one or more of the attributes below:

- `name`
- `adm1_code`
- `iso_a2`
- `iso_3166_2`
- `code_hasc`
- `postal`

We compare the values you specify with the data stored in world-states-provinces.json (careful, big file!). If we find a match across all the fields you specify, we select that feature. There are many more attributes to use in that file, so make sure to give it a quick glance.
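For example, a hypothetical scraper for an ambiguously named region could pin down the feature like this (the names and codes below are placeholders; check world-states-provinces.json for the real values):

```javascript
{
  country: 'iso1:US',
  state: 'Example State',        // placeholder
  featureId: {
    name: 'Example State',       // matched against world-states-provinces.json
    code_hasc: 'US.EX'           // placeholder; every field specified must match
  },
  scraper: async function() {
    // ...
  }
}
```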
In case we do not have any geographical information for the location you are trying to scrape, you can provide a GeoJSON feature directly in the `feature` attribute of the object you return from the scraper.
If we have a feature for the location, we will calculate a `longitude` and `latitude` from it. You may also provide a custom longitude and latitude by specifying a value in the `coordinates` attribute.
Population can usually be guessed automatically, but if that is not the case, you can provide a population number by returning a value for the `population` field in the object returned by the scraper.
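For example, a scraper for a location we know nothing about could return coordinates and a population alongside its counts (all values below are made up):

```javascript
scraper: async function() {
  // ... fetch and parse the data ...
  return {
    city: 'Example City',
    cases: 10,
    coordinates: [-122.42, 37.77], // [longitude, latitude]
    population: 880000             // estimated population
    // a GeoJSON object could also be supplied via the `feature` attribute
  };
}
```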
You should test your source first by running `yarn test`. This will perform some basic tests to make sure nothing crashes and the source object is in the correct form.
To add test coverage for a scraper, you only need to provide test assets; no new tests need to be added.
- Add a `tests` folder to the scraper folder, e.g. `scrapers/FRA/tests` or `scrapers/USA/AK/tests`.
- Add a sample response from the target URL. The filename should be the URL, without the `http(s)://` prefix, and with all non-alphanumeric characters replaced with an underscore `_`. The file extension should match the format of the contents (`html`, `csv`, `json`, etc.). Example:
  - URL: https://raw.githubusercontent.com/opencovid19-fr/data/master/dist/chiffres-cles.csv
  - File name: raw_githubusercontent_com_opencovid19_fr_data_master_dist_chiffres_cles.csv
- Add a file named `expected.json` containing the array of values that the scraper is expected to return. (Leave out any GeoJSON `features` properties.)
For sources that have a time series, the `expected.json` file represents the latest result in the sample response provided. You can additionally test the return value for a specific date by adding a file with the name `expected.YYYY-MM-DD.json`; for example, `expected.2020-03-16.json`.
```
📁 USA
  📁 AK
    📄 index.js                  # scraper
    📁 tests
      📄 dhss_alaska_gov_dph_Epi_id_Pages_COVID_19_monitoring.html  # sample response
      📄 expected.json           # expected result
  ...
📁 FRA
  📄 index.js                    # scraper
  📁 tests
    📄 raw_githubusercontent_com_covid19_fr_data_chiffres_cles.csv  # sample response
    📄 expected.json             # expected result for most recent date in sample
    📄 expected.2020-03-16.json  # expected result for March 16, 2020
```
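For reference, a hypothetical `expected.json` for a county-level scraper might look like this (the locations and counts are made up):

```json
[
  { "county": "Anchorage Borough", "cases": 10, "deaths": 1 },
  { "county": "Fairbanks North Star Borough", "cases": 3 }
]
```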
You should run your source with the crawler by running `yarn start -l "<path to scraper>"`. The path to the scraper should be the relative path under `src/shared/scrapers`; e.g., the scraper for Missouri, USA would be `"US/MO"`.
After the crawler has finished running, look at how many counties, states, and countries were scraped. Also look for missing location or population information. Finally, look at the output located in the `dist` directory. `data.json` contains all the information the crawler could get from your source. `report.json` provides a report on the crawling process. `ratings.json` provides a rating for your source.
By using the `--writeTo` and `--onlyUseCache` options, and the script `tools/compare-report-dirs.js`, you can do some quick regression testing of any changes you make to scrapers or libraries.
For example, before starting work on some US/NV scrapers, you can generate all of the reports using only cached data, and save it to an arbitrary folder:
```
yarn start --date 2020-04-06 --onlyUseCache --location US/NV --writeTo zz_before
```
After doing your work, you regenerate to a different folder using the same cached data:
```
yarn start --date 2020-04-06 --onlyUseCache --location US/NV --writeTo zz_after
```
You can then run `tools/compare-report-dirs.js`, comparing the "left" and "right" folders. Each report is listed with the differences:
```
$ node tools/compare-report-dirs.js --left zz_before --right zz_after

data-2020-04-06.json
--------------------
* [3, Storey County, Nevada, United States]/deaths value: 0 != 41
* [5, Washoe County, Nevada, United States]/deaths value: 4 != -1

report.json
-----------
* /scrape/numCities value: 0 != 7

ratings.json
------------
equal

features-2020-04-06.json
------------------------
* /features[5]/properties/name value: Washoe County != Washoe

crawler-report.csv
------------------
equal

data-2020-04-06.csv
-------------------
* Line 3, col 79: "4" != "-"
```
For the above example, `zz_after/data-2020-04-06.json` has the following content:
```
[
  { "county": ... },
  ...
  {
    "county": "Washoe County",
    "state": "Nevada",
    "country": "United States",
    ...
    "cases": 281,
    "deaths": -1,
    "recovered": 30,
  },
]
```
and the diff in the report was:
```
* [5, Washoe County, Nevada, United States]/deaths value: 4 != -1
```
In the JSON diff:

- `[#]` indicates an array index.
- Sometimes, extra annotations are added to an array index to help you find it in the source JSON. `[5, Washoe County, ...]` says that the fifth element has some associated text `Washoe County` in one of its elements.
- `/` represents a child under a parent (or the root element, if the root is a hash).
The diff line `[5, Washoe County, ...]/deaths value: 4 != -1` therefore says that the fifth element under the root, which has `Washoe County` in it, has a child element `deaths`, which is different between the two files. `4 != -1` is this difference.
Another example from a different run:
```
[2703, Roberts County, South Dakota]/sources[0]/url value: X != https://doh.org
```
The above line says:

- the root array element 2703 (Roberts County, SD) ...
- has a `sources` array, of which element 0 ...
- has a `url` property which is different: the `--left` file contains `X`, and the `--right` file contains `https://doh.org`.