Using the Guardian API for searching & ChatGPT to write Python scripts to gather the full text of newspaper articles
- Guardian API Key (register on the developer site and request a key for the developer API)
- ChatGPT and Google Colab
- OpenRefine (optional)
- Text Editor (e.g. Sublime)
Get a Guardian API Key if you don't have one (request a developer key).
Search The Guardian using the Guardian's API Console:
In a web browser, go to the URL of the Guardian's "Open Platform API Console" (beta version, not the old version): http://open-platform.theguardian.com/explore/
When the Guardian's API console search form loads, check the box in that form for "Show All Filters." Then fill in the following fields in the form:
- search term in double quotes (e.g., "Human rights")
- order-by (choose "oldest")
- page-size (set at 200 to show the maximum number of hits per search-results page)
- from-date & to-date (in the format YYYY-MM-DD, e.g., from-date "2021-01-06", to-date "2024-01-06")
- api-key (your Guardian api key)
At the bottom of the Guardian API Console web page, you'll instantly see a live view of the JSON returned by your search parameters, including the URLs for the pages found. The maximum number of hits on a single search-results page is 200 (the page-size you set above). Select and copy the JSON results, beginning with the first curly bracket "{" and ending with the last curly bracket "}".
For multiple search-results pages:
- The JSON search results start with metadata that includes the number of the current search-results page and the total number of search-results pages (e.g., "currentPage": 2, "pages": 2). This shows how many total results there are and whether you need to gather the JSON on extra search-results pages.
- If so, after harvesting the results of one search-results page (by copying the JSON into a "guardian-json" spreadsheet through the process described below), you can use the "page" field in the Guardian API Console's search form to request the next page of search results.
- You'll then need to copy and accumulate the JSON from each set of search results at the bottom of the JSON previously collected in the "guardian-json" spreadsheet.
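If you prefer not to copy JSON by hand across multiple pages, the same paged search can be scripted. Below is a minimal sketch, assuming the standard requests library and the Guardian's content-search endpoint; the parameter names mirror the console's form fields, and the output file name "guardian-json.json" is just an illustration.

```python
import json
import requests

API_KEY = "YOUR-GUARDIAN-API-KEY"  # substitute your own developer key
ENDPOINT = "https://content.guardianapis.com/search"

params = {
    "q": '"Human rights"',   # search term in double quotes
    "order-by": "oldest",
    "page-size": 200,        # maximum hits per search-results page
    "from-date": "2021-01-06",
    "to-date": "2024-01-06",
    "api-key": API_KEY,
    "page": 1,
}

results = []
while True:
    data = requests.get(ENDPOINT, params=params).json()["response"]
    results.extend(data["results"])
    if data["currentPage"] >= data["pages"]:  # last search-results page reached
        break
    params["page"] += 1

# Save the accumulated results for the steps below (illustrative file name)
with open("guardian-json.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)

print(f"Collected {len(results)} results")
```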
Collect the URLs of the links from the Guardian search in Excel:
- Paste the JSON search results from the steps above into a blank spreadsheet, e.g., an Excel sheet (using "paste special" to paste without formatting). Name the spreadsheet "guardian-json.xlsx" and save it among your working data.
- Select the first column, then go to Data view ("Data" tab in Excel) and click on "Filter."
- Cell A1 will now show a little drop-down arrow. Click on the arrow, and choose: Text Filters > Begins with. Type in webUrl. Then click "OK" to filter the worksheet to show only the rows beginning with this expression.
- Next do a search-and-replace in Excel to replace "webUrl:" with nothing. This will leave you with a visible column of URLs without extraneous matter.
- Finally, copy the column of URLs, excluding any rows with only a curly bracket, and save the URLs in your working data as "guardian_urls.txt" (the file name the scraping step below expects).
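Alternatively, if you saved the raw results with the paging sketch above, a few lines of Python can pull out the webUrl fields directly and skip the Excel filtering; the file names here are the ones assumed in that sketch.

```python
import json

# Load the results accumulated by the paging sketch above
with open("guardian-json.json", encoding="utf-8") as f:
    results = json.load(f)

# Each result object carries the article link in its "webUrl" field
urls = [item["webUrl"] for item in results]

with open("guardian_urls.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(urls) + "\n")

print(f"Wrote {len(urls)} URLs")
```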
Scraping (Phase 1): Initial Scrape
- Ask ChatGPT:
- write a python script to web scrape the content from articles in a text file called guardian_urls.txt
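ChatGPT's answer will vary from session to session. As a rough sketch of the kind of script it tends to produce, assuming the requests and BeautifulSoup libraries, it might look something like this; the paragraph-tag selector is a simplification and may need refining for the Guardian's actual markup.

```python
import requests
from bs4 import BeautifulSoup

# Read one Guardian URL per line from the file built earlier
with open("guardian_urls.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

articles = []
for url in urls:
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Collect the text of every paragraph tag on the page
    paragraphs = [p.get_text() for p in soup.find_all("p")]
    articles.append({"url": url, "body": "\n".join(paragraphs)})

print(f"Scraped {len(articles)} articles")
```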
- Open a new Colab notebook.
- Paste the code from ChatGPT into Colab.
- If you get errors, paste them into ChatGPT and ask it to revise your script.
- Finally, export your results to Excel by asking ChatGPT: "Thank you! Can you help me export the results to Excel?"
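Again, the exact code will differ, but the export step ChatGPT suggests is usually a couple of lines of pandas along these lines; the articles list and the output file name come from the scraping sketch above.

```python
import pandas as pd

# "articles" is the list of dicts built by the scraping sketch above;
# a tiny stand-in is included here so this snippet runs on its own
articles = [{"url": "https://www.theguardian.com/example", "body": "Article text"}]

df = pd.DataFrame(articles)
df.to_excel("guardian_articles.xlsx", index=False)  # requires openpyxl
```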
- You should see your scraped articles exported to an Excel file as output.
Open your spreadsheet and clean it in OpenRefine (optional):
- Open the OpenRefine interface:
- On a Windows machine: Open the "C:/openrefine-win-2.6-rc2/openrefine-2.6-rc2" folder on your system and double-click "openrefine.exe" to open the OpenRefine interface at the address 127.0.0.1:3333 (you can always navigate back to the interface by pointing your browser to this address, and can even use it in multiple browser windows).
- On a Mac: Open OpenRefine in Applications.
- It will open up a tab in your web browser.
- Once you are in the OpenRefine interface, click "Create Project" and upload the spreadsheet you recently finished editing. Click "Next" and then "Create Project" again.
- Make sure there is no extraneous whitespace in any of your columns:
- Open that column's drop down menu, and select "Edit Cells" > "Common transformations" > "trim leading and trailing whitespace”.
- Within the “article body” column, select “Edit Cells” > “Common transformations” > “Collapse consecutive whitespace”
- Export your cleaned data as a .xls document.
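If you would rather stay in Python than use OpenRefine, the same two cleanups can be approximated with pandas. The sketch below assumes the spreadsheet exported from Colab and a column named "article body"; adjust both to match your actual file.

```python
import pandas as pd

df = pd.read_excel("guardian_articles.xlsx")  # the spreadsheet exported from Colab

# Trim leading and trailing whitespace in every text column
text_cols = df.select_dtypes(include="object").columns
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())

# Collapse consecutive whitespace within the article body
df["article body"] = df["article body"].str.replace(r"\s+", " ", regex=True)

df.to_excel("guardian_articles_clean.xlsx", index=False)
```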
4. Add a delimiter to separate your articles
Open a new Word document (give it a name of your choice) and paste in the contents of the above columns. (Be sure to paste unformatted.) This will create a file containing all the articles (each beginning with date, author, and title, followed by the article body). Individual articles are separated from each other by a return (the only use of returns in the file).
- Using Word's find-and-replace function, replace all returns (found by searching for "^13") with three spaces, followed by two returns, followed by ten "@" signs ("   ^13^13@@@@@@@@@@"). This creates an easy-to-recognize and easy-to-manipulate delimiter between individual articles in the aggregate file.
- Finally, save or export the .docx Word file as a .txt file (aggregate-plain-txt) as follows:
- When Word shows the dialogue for conversion to plain text, choose "Other encoding" > "Unicode (UTF-8)" (i.e., do not choose "Windows default").
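If Word's find-and-replace feels fiddly, the aggregate file can also be built directly from the cleaned spreadsheet in Python. The column names below (date, author, title, article body) are assumptions based on the description above; rename them to match your data.

```python
import pandas as pd

df = pd.read_excel("guardian_articles_clean.xlsx")

# Three spaces, two returns, ten "@" signs: the delimiter described above
DELIMITER = "   \n\n" + "@" * 10

# Each article: date, author, title on one line, then the article body
# (column names are assumptions; adjust to your spreadsheet)
chunks = [
    f"{row['date']} {row['author']} {row['title']}\n{row['article body']}"
    for _, row in df.iterrows()
]

with open("aggregate-plain-txt.txt", "w", encoding="utf-8") as f:
    f.write(DELIMITER.join(chunks))
```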
- There are a number of tools to chop one file into multiple files using a specific delimiter. In our case, our delimiter is the ten "@" signs (@@@@@@@@@@) between each of our articles.
- You can use ChatGPT to write a Python script to separate the articles into separate plain text files. Ask: Can you write a script for Python to separate the content into separate plain text files at the delimiter @@@@@@@@@@?
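As elsewhere, ChatGPT's exact answer will vary; a minimal sketch of such a script, assuming the aggregate file is named aggregate-plain-txt.txt, could be:

```python
# Split the aggregate file into one plain text file per article
with open("aggregate-plain-txt.txt", encoding="utf-8") as f:
    text = f.read()

# Discard empty chunks produced by leading or trailing delimiters
articles = [chunk.strip() for chunk in text.split("@" * 10) if chunk.strip()]

for i, article in enumerate(articles, start=1):
    with open(f"article_{i:04d}.txt", "w", encoding="utf-8") as out:
        out.write(article)

print(f"Wrote {len(articles)} files")
```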