
download script #1

Open · wants to merge 77 commits into main
Conversation

@crdanielbusch (Contributor) commented Nov 5, 2024

Description

Write a script that downloads all climate data from the FAOSTAT website.
We need to:

  • create a downloaded_data directory
  • download the zip files
  • unpack the zip files

Checklist

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Changelog item added to changelog/

Notes

  • In the script it is best to call `datalad download-url` via the datalad Python API; that way the location of the data is recorded and the file is automatically unlocked. Otherwise you would always have to unlock the data before the script runs.
  • FAOSTAT has no version numbers in the data and quasi-rolling releases, which makes it a bit harder to see when new data is actually available. But maybe there is something in the data files.
  • The individual domains also have bulk downloads, which would also be an option. Then we could even download the metadata as well, so we have it right alongside the data.
  • Zip files can be pushed to GIN as well.
  • We need to use Selenium to execute the JavaScript and load the HTML text; BeautifulSoup and requests alone don't work.
  • Data is now saved in a folder named after the latest update, e.g. 2023-11-09.
  • Updated `.gitattributes` to cover `.zip` files as well.
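Since requests alone never sees the JavaScript-built links, the Selenium approach could look roughly like this sketch (the FAOSTAT URL and the `.zip` href pattern are assumptions for illustration; a headless Chrome with a matching chromedriver is assumed to be available):

```python
# Sketch: render a JS-heavy page with headless Chrome, then scrape zip links.
from html.parser import HTMLParser


class ZipLinkParser(HTMLParser):
    """Collect href targets that point at .zip archives."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.endswith(".zip"):
                    self.links.append(value)


def extract_zip_links(html: str) -> list[str]:
    parser = ZipLinkParser()
    parser.feed(html)
    return parser.links


def fetch_rendered_html(url: str) -> str:
    """Render the page with headless Chrome so JS-built links exist in the DOM."""
    from selenium import webdriver  # assumed installed, chromedriver on PATH

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```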

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "poetry run python3 scripts/download_all_domains.py",
 "dsid": "934d913e-8268-4342-aa0c-3702ee516d10",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
@JGuetschow left a comment

Looks good and ready to go ahead with data reading. I've added a few minor comments.

from faostat_data_primap.helper.definitions import downloaded_data_path


def run():


What is the delete function good for? Why would we want to delete a version?

@crdanielbusch (Contributor, Author):

We don't really need this script; I used it to test the workflow. Once I have the tests set up, I can delete it.

files_to_delete = os.listdir(path_to_files)

for file in files_to_delete:
    path_to_file = path_to_files / file


Deletion of committed files only works after unlocking them using `datalad unlock`.
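That unlock-then-delete flow could be sketched as follows (a hedged illustration assuming the datalad CLI is installed and the file is annexed; `unlock_then_delete` is a hypothetical helper, not part of this PR):

```python
# Sketch: git-annex keeps committed files as read-only symlinks, so they must
# be unlocked before they can be deleted.
import subprocess
from pathlib import Path


def unlock_command(path: Path) -> list[str]:
    """Build the `datalad unlock` invocation for one file."""
    return ["datalad", "unlock", str(path)]


def unlock_then_delete(path_to_file: Path) -> None:
    # `datalad unlock` swaps the read-only annex symlink for a writable copy;
    # only after that can the file actually be removed.
    subprocess.run(unlock_command(path_to_file), check=True)
    path_to_file.unlink()
```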

@crdanielbusch (Contributor, Author):

Good to know.


Returns
-------
True if the file was downloaded, False if a cached file was found


If there is no cached file but the download fails, requests throws an error, correct?

@crdanielbusch (Contributor, Author):

Pretty sure it does. I can try to add a test for that.



def download_all_domains(sources: list[tuple[str]]) -> list[str]:
"""


You could add downloading of the metadata that is available with the data. Having the methodology descriptions next to the data could be quite helpful.


But it's less important and can be added later.


It would be good to keep old versions of the methodology description docs though, to see if things have changed. I assume they will not always be updated, so we might want to use checksums, only store new versions if they have changed, and symlink to the old version instead. That could quickly rule out methodology changes (if they are not updated yearly).


@crdanielbusch crdanielbusch self-assigned this Nov 14, 2024
@crdanielbusch (Contributor, Author) commented:

@JGuetschow this is ready to be reviewed again.

Since you looked at it last time:

  • I have added the possibility to download the methodology document; that's the download_methodology function.
  • Added more tests: unit tests for the download functions and one integration test that downloads all available data.
  • I had to tweak some CI / pre-commit checks because I wasn't able to make them work: disabled the poetry-check command in the pre-commit config (I got a weird network error; we can also have a look at it together) and had to deactivate the poetry cache in the GitHub action (this is a known issue, see https://gitlab.com/climate-resource/copier-core-python-repository/-/issues/38).
