
Guardian Collection Workflow

---

Using the Guardian API to search for articles, and ChatGPT to write Python scripts that gather the full text of the newspaper articles

Requirements:

  • A Guardian API key (free developer key)
  • Access to ChatGPT
  • Google Colab
  • Excel (or another spreadsheet program)
  • OpenRefine
  • Microsoft Word

Workflow Steps:

  1. Use the Guardian API to get data

  • Get a Guardian API Key if you don't have one (request a developer key).

  • Search The Guardian using the Guardian's API Console:

    In a web browser, go to the URL of the Guardian's "Open Platform API Console" (beta version, not the old version): http://open-platform.theguardian.com/explore/

    The Guardian's API Console

    When the Guardian's API console search form loads, check the box in that form for "Show All Filters." Then fill in the following fields in the form:

    • search term in double quotes (e.g., "Human rights")
    • order-by (choose "oldest")
    • page-size (set at 200 to show the maximum number of hits per search-results page)
    • from-date & to-date (in the format, e.g., "06-01-2021 to 06-01-2024")
    • api-key (your Guardian API key)
    At the bottom of the Guardian API Console web page, you'll instantly see a live view of the JSON returned by your search parameters, including the URLs for the pages found. The maximum number of hits on a single search-results page is 200. Select and copy the JSON results, beginning with the first curly bracket "{" and ending with the last curly bracket "}".

    JSON results shown in The Guardian's API Console

    For multiple search-results pages:
      • The JSON search results start with metadata that includes the number of the current search-results page and the total number of search-results pages (e.g., currentPage: 2, pages: 2). This shows how many total results there are and whether you need to gather the JSON from additional search-results pages.
      • If so, after harvesting the results of one search-results page (by copying the JSON into a "guardian-json" spreadsheet through the process described below), you can use the "page" field in the Guardian API Console's search form to request the next page of search results.
      • You'll then need to copy and accumulate the JSON from each set of search results at the bottom of the JSON previously collected in the "guardian-json" spreadsheet.
  • Collect the URLs of the links from the Guardian search in Excel:

  1. Paste the JSON search results from the steps above into a blank spreadsheet, e.g., an Excel sheet (using "paste special" to paste without formatting). Name the spreadsheet "guardian-json.xlsx" and save it among your working data.
  2. Select the first column, then go to Data view ("Data" tab in Excel) and click on "Filter."
  3. Cell A1 will now show a little drop-down arrow. Click on the arrow, and choose: Text Filters > Begins with. Type in webUrl. Then click "OK" to filter the worksheet to show only the rows beginning with this expression.
  4. Next do a search-and-replace in Excel to replace "webUrl:" with nothing. This will leave you with a visible column of URLs without extraneous matter.
  5. Finally, copy the column of URLs, excluding any rows containing only a curly bracket, and save the URLs in your working data as "guardian_urls.txt" (the filename the scraping step below expects). A scripted alternative to this manual collection process is sketched after this list.
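If you'd rather script this step than copy JSON out of the console by hand, here is a minimal Python sketch. It calls the Guardian content API's search endpoint directly, pages through every search-results page, and writes the webUrl values to guardian_urls.txt. The search term, dates, and API key below are placeholders — substitute your own:

```python
import requests

API_KEY = "YOUR-GUARDIAN-API-KEY"  # placeholder: your developer key
BASE_URL = "https://content.guardianapis.com/search"

params = {
    "q": '"Human rights"',      # search term in double quotes
    "order-by": "oldest",
    "page-size": 200,           # maximum results per page
    "from-date": "2021-06-01",  # the API itself expects YYYY-MM-DD
    "to-date": "2024-06-01",
    "api-key": API_KEY,
    "page": 1,
}

urls = []
while True:
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    data = response.json()["response"]
    urls.extend(item["webUrl"] for item in data["results"])
    if data["currentPage"] >= data["pages"]:  # no more search-results pages
        break
    params["page"] += 1

with open("guardian_urls.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(urls))

print(f"Saved {len(urls)} URLs to guardian_urls.txt")
```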

2. Scrape articles using ChatGPT and Google Colab

Scraping (Phase 1): Initial Scrape

  1. Get full text articles

    • Ask ChatGPT: "Write a Python script to web scrape the content from articles in a text file called guardian_urls.txt." (A sketch of the kind of script this produces appears after this list.)

    • Open a new Colab notebook.

    • Paste the code from ChatGPT into Colab.

    • If you get errors, paste those into ChatGPT and ask it to revise your script.

    • Finally, export your results to Excel by asking ChatGPT: "Thank you! Can you help me export the results to Excel?"

    • You should see an output like this:

      (Images/Guardian/output.jpg)
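For reference, here is a minimal sketch of the kind of script ChatGPT tends to produce for this prompt, including the Excel export. It assumes the requests, beautifulsoup4, pandas, and openpyxl packages (all available in Colab by default). The selectors (the page's h1 and p elements) are a rough guess at where Guardian article text lives, and the output filename is an assumption; your ChatGPT-generated version may differ:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Read the list of article URLs collected earlier
with open("guardian_urls.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

rows = []
for url in urls:
    try:
        page = requests.get(url, timeout=30)
        page.raise_for_status()
        soup = BeautifulSoup(page.text, "html.parser")
        title = soup.find("h1")
        paragraphs = soup.find_all("p")
        rows.append({
            "url": url,
            "title": title.get_text(strip=True) if title else "",
            "article body": " ".join(p.get_text() for p in paragraphs),
        })
    except requests.RequestException as e:
        print(f"Skipping {url}: {e}")  # note failures and keep going

# Export the results to Excel, as in the final prompt above
pd.DataFrame(rows).to_excel("guardian_articles.xlsx", index=False)
print(f"Scraped {len(rows)} of {len(urls)} articles")
```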

Open your spreadsheet and check the results.


3. Clean your data

Use OpenRefine to clean your data and eliminate unwanted whitespace.

  1. Open the OpenRefine interface:
    1. On a Windows machine: Open the "C:/openrefine-win-2.6-rc2/openrefine-2.6-rc2" folder on your system and run "openrefine.exe" by double-clicking it. This opens the OpenRefine interface at the address 127.0.0.1:3333 (you can always navigate back to the interface by pointing your browser to this address, and can even use it in multiple browser windows at once).
    2. On a Mac: Open OpenRefine from Applications. It will open a tab in your web browser.
  2. Once you are in the OpenRefine interface, click "Create Project" and upload the spreadsheet you recently finished editing. Click "Next" and then "Create Project" again.
  3. Make sure there is no whitespace in any of your columns (a scripted equivalent is sketched after this list):
    1. Open each column's drop-down menu and select "Edit Cells" > "Common transformations" > "Trim leading and trailing whitespace".
    2. Within the "article body" column, select "Edit Cells" > "Common transformations" > "Collapse consecutive whitespace".
  4. Export your cleaned data as a .xls document.
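If you prefer to stay in Python rather than switch to OpenRefine, the same two transformations can be done with pandas. This sketch assumes the file and column names used earlier in this workflow; adjust them to match your own spreadsheet:

```python
import pandas as pd

df = pd.read_excel("guardian_articles.xlsx")  # assumed filename from the scrape

# Trim leading and trailing whitespace in every text column
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()

# Collapse consecutive whitespace within the article body
df["article body"] = df["article body"].str.replace(r"\s+", " ", regex=True)

df.to_excel("guardian_articles_clean.xlsx", index=False)
```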


4. Add a delimiter to separate your articles

Open a Word document that you will name, and paste in the contents of the above columns. (Be sure to paste unformatted.) This will create a file with all the articles (each beginning with date, author, and title, preceding the article body). Individual articles are separated from each other by a return (the only use of returns in the file).

  1. Using Word's find-and-replace function, replace all returns (found by searching for "^13") with three spaces, followed by two returns, followed by ten "@" signs ("   ^13^13@@@@@@@@@@"). This creates an easy-to-recognize and easy-to-manipulate delimiter between individual articles in the aggregate file. (A scripted alternative to this Word route is sketched after this list.)
  2. Finally, save or export the .docx Word file as a .txt file (aggregate-plain-txt) as follows:
    • When Word shows the dialogue for conversion to plain text, choose "other encoding" > "Unicode UTF8" (i.e., do not choose "Windows default").
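As a scripted alternative to the Word route, this sketch builds the delimited aggregate file directly from the cleaned spreadsheet. The input filename and the column names ("date", "author", "title", "article body") are assumptions based on the columns described above:

```python
import pandas as pd

df = pd.read_excel("guardian_articles_clean.xlsx")  # assumed filename

articles = []
for _, row in df.iterrows():
    # date, author, and title precede the article body, as in the Word version
    fields = ["date", "author", "title", "article body"]
    articles.append(" ".join(str(row.get(col, "")) for col in fields))

# Join articles with the same delimiter the Word replacement produces:
# three spaces, two returns, ten "@" signs
with open("aggregate-plain-txt.txt", "w", encoding="utf-8") as f:
    f.write("   \n\n@@@@@@@@@@".join(articles))
```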

5. Chop Into Individual Plain Text Files

  1. There are a number of tools to chop one file into multiple files using a specific delimiter. In our case, the delimiter is the ten "@" signs (@@@@@@@@@@) between each of our articles.
  2. You can use ChatGPT to write a Python script to separate the articles into separate plain text files. Ask: "Can you write a script for Python to separate the content into separate plain text files at the delimiter @@@@@@@@@@?" (A minimal version of such a script is sketched below.)
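A minimal version of the splitting script might look like this. The input filename and the output folder are assumptions; adjust them to match your own files:

```python
import os

# Read the aggregate file produced in the previous step
with open("aggregate-plain-txt.txt", encoding="utf-8") as f:
    text = f.read()

os.makedirs("articles", exist_ok=True)

# Split at the ten-"@" delimiter and write each article to its own file
for i, article in enumerate(text.split("@@@@@@@@@@"), start=1):
    article = article.strip()
    if not article:  # skip empty fragments around the delimiters
        continue
    out_path = os.path.join("articles", f"article_{i:04d}.txt")
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(article)

print("Done splitting articles.")
```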