
download script #1

Open · wants to merge 77 commits into main
Conversation

@crdanielbusch (Contributor) commented Nov 5, 2024

Description

Write a script that downloads all climate data from the FAOSTAT website.
We need to:

  • create a downloaded_data directory
  • download the zip files
  • unpack the zip files

Checklist

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Changelog item added to changelog/

Notes

  • In the script it is best to call `datalad download-url` via the datalad Python API; that way the location of the data is recorded and the file is automatically unlocked. Otherwise you would always have to unlock the data before the script runs.
  • FAOSTAT has no version numbers in the data and quasi-rolling releases, which makes it a bit harder to see when new data is actually available. But maybe there is something in the data files.
  • The individual domains also have bulk downloads, which would also be an option. Then we could even download the metadata as well, so we have it right alongside the data.
  • Zip files can be pushed to GIN as well.
  • We need to use Selenium to execute the JavaScript and load the HTML text; BeautifulSoup and requests alone don't work.
  • Data is now saved in a folder named after the latest update, e.g. 2023-11-09.
  • Updated `.gitattributes` to cover `.zip` files as well.
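Since requests alone never sees the JavaScript-built links, the Selenium approach could look roughly like this sketch (the FAOSTAT URL and the `.zip` href pattern are assumptions for illustration; a headless Chrome with a matching chromedriver is assumed to be available):

```python
# Sketch: render a JS-heavy page with headless Chrome, then scrape zip links.
from html.parser import HTMLParser


class ZipLinkParser(HTMLParser):
    """Collect href targets that point at .zip archives."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.endswith(".zip"):
                    self.links.append(value)


def extract_zip_links(html: str) -> list[str]:
    parser = ZipLinkParser()
    parser.feed(html)
    return parser.links


def fetch_rendered_html(url: str) -> str:
    """Render the page with headless Chrome so JS-built links exist in the DOM."""
    from selenium import webdriver  # assumed installed, chromedriver on PATH

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```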

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "poetry run python3 scripts/download_all_domains.py",
 "dsid": "934d913e-8268-4342-aa0c-3702ee516d10",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
@JGuetschow left a comment

Looks good and ready to go ahead with data reading. I've added a few minor comments.

from faostat_data_primap.helper.definitions import downloaded_data_path


def run():


What is the delete function good for? Why would we want to delete a version?

@crdanielbusch (Contributor, Author):

We don't really need this script; I used it to test the workflow. Once I have the tests set up, I can delete it.

files_to_delete = os.listdir(path_to_files)

for file in files_to_delete:
    path_to_file = path_to_files / file


Deletion of committed files only works after unlocking them using `datalad unlock`.
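That unlock-then-delete flow could be sketched as follows (a hedged illustration assuming the datalad CLI is installed and the file is annexed; `unlock_then_delete` is a hypothetical helper, not part of this PR):

```python
# Sketch: git-annex keeps committed files as read-only symlinks, so they must
# be unlocked before they can be deleted.
import subprocess
from pathlib import Path


def unlock_command(path: Path) -> list[str]:
    """Build the `datalad unlock` invocation for one file."""
    return ["datalad", "unlock", str(path)]


def unlock_then_delete(path_to_file: Path) -> None:
    # `datalad unlock` swaps the read-only annex symlink for a writable copy;
    # only after that can the file actually be removed.
    subprocess.run(unlock_command(path_to_file), check=True)
    path_to_file.unlink()
```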

@crdanielbusch (Contributor, Author):

Good to know.


Returns
-------
True if the file was downloaded, False if a cached file was found


If there is no cached file but the download fails, requests throws an error, correct?

@crdanielbusch (Contributor, Author):

Pretty sure it does. I can try to add a test for that.



def download_all_domains(sources: list[tuple[str]]) -> list[str]:
"""


You could add downloading of the metadata that is available with the data. Having the methodology descriptions next to the data could be quite helpful.


But it's less important and can be added later.


It would be good to keep old versions of the methodology description docs though, to see if things have changed. I assume they will not always be updated, so we might want to use checksums, only store new versions if they have changed, and symlink to the old version instead. That could quickly rule out methodology changes (if they are not updated yearly).


@crdanielbusch crdanielbusch self-assigned this Nov 14, 2024
@crdanielbusch (Contributor, Author) commented:

@JGuetschow this is ready to be reviewed again.

Since you looked at it last time:

  • I have added the possibility to download the methodology document; that's the download_methodology function.
  • Added more tests: unit tests for the download functions and one integration test that downloads all available data.
  • I had to tweak some CI / pre-commit checks because I wasn't able to make them work: disabled the poetry-check command in the pre-commit config (I got a weird network error; we can also have a look at it together) and had to deactivate the poetry cache in the GitHub action (this is a known issue, see https://gitlab.com/climate-resource/copier-core-python-repository/-/issues/38).
