Memory issues when downloading images using GBIF data #7

Closed
LevanBokeria opened this issue Aug 14, 2023 · 1 comment
@LevanBokeria

This issue was originally raised on the parent repo: RolnickLab#2

Documenting it here for the AMI fork repository, and expanding with the latest updates:

Problem description:

The GBIF dwca file for Lepidoptera is huge, about 30GB zipped. The script that downloads GBIF images from dwca files, 02-fetch_gbif_moth_data.py, runs out of memory and is automatically killed by Slurm.

I tried running a smaller download for just one family of moths, Erebidae, which has a zipped dwca file of 1.5GB. But the process still exploded in memory use and was killed.

The cause of the problem:

I investigated the 02-fetch_gbif_moth_data.py code, and the likely reason for such high RAM use is the parallelisation of processes on lines 187-188:

with Pool() as pool:
    pool.map(fetch_image_data, taxon_keys)

This happens after the dwca file is loaded and the occurrence data is read, which leaves a very large dataframe in global memory. Parallelisation effectively duplicates those globals: each sub-process runs an independent instance of Python, so every worker ends up holding its own copy of the occurrence dataframe (even on Linux, where fork is copy-on-write, CPython's reference-count updates dirty the shared pages and force physical copies). This is likely the source of the astronomical memory usage.
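For illustration, here is a minimal, self-contained sketch of the pattern that triggers the blow-up. The dataframe contents and column names are made up, not the script's actual data:

from multiprocessing import Pool

import numpy as np
import pandas as pd

# Stand-in for the large occurrence dataframe loaded at module level
occurrence_df = pd.DataFrame({
    "taxonKey": np.random.randint(0, 100, 5_000_000),
    "identifier": "https://example.org/image.jpg",
})

def fetch_image_data(taxon_key):
    # With spawn-based multiprocessing the module is re-imported in every
    # worker, rebuilding occurrence_df from scratch; with fork, reference-
    # count updates dirty the copy-on-write pages. Either way, workers can
    # end up holding large private copies of the parent's big frame.
    rows = occurrence_df[occurrence_df["taxonKey"] == taxon_key]
    return len(rows)

if __name__ == "__main__":
    with Pool() as pool:
        counts = pool.map(fetch_image_data, range(100))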

Additionally, dwca files read into memory with the dwca-reader package take up a very large amount of space. The main culprit is the occurrence.txt file: the resulting dataframes consume far more RAM than their on-disk size suggests. For example, a zipped dwca file of ~200MB contains a ~1GB occurrence.txt, which consumes ~3.5GB of RAM once loaded.
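As a quick sanity check, pandas can report the in-memory footprint of a loaded frame directly (this assumes the occurrence dataframe is already loaded as occ):

# How much RAM does the occurrence dataframe actually occupy?
mem_gb = occ.memory_usage(deep=True).sum() / 1e9
print(f"occurrence dataframe: {mem_gb:.2f} GB in memory")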

Proposed solution:

  • I have turned off parallelisation in the download code, so the images for each species are downloaded serially. This takes a lot more time.
  • I have downloaded separate dwca files for the 16 moth families in the initial moths checklist provided by David Roy, and written a wrapper Python script that takes a family-specific dwca file as an argument and calls the functions from the 02-fetch_gbif_moth_data.py file (see the sketch below).
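Roughly, the wrapper looks like this (a sketch: the entry-point name fetch_images_from_dwca is an assumption, not the script's real API):

import argparse
import importlib.util

# The script name starts with "02-", so it cannot be imported with a
# plain "import" statement; load it from its file path instead.
spec = importlib.util.spec_from_file_location(
    "fetch_gbif", "02-fetch_gbif_moth_data.py")
fetch_gbif = importlib.util.module_from_spec(spec)
spec.loader.exec_module(fetch_gbif)

parser = argparse.ArgumentParser()
parser.add_argument("--dwca-file", required=True,
                    help="Path to the family-specific dwca file")
args = parser.parse_args()

# Hypothetical entry point -- substitute the script's real function
fetch_gbif.fetch_images_from_dwca(args.dwca_file)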

Will make a new branch and open a PR to document these changes.

@LevanBokeria LevanBokeria self-assigned this Aug 14, 2023
@LevanBokeria LevanBokeria moved this to 🐛 In Progress in AMBER Aug 14, 2023
@LevanBokeria (Author)

To document progress made on this front over the last months:

With the big lepidoptera.zip file, the process used to get killed because the dwca-reader package tried to unzip the contents into a temporary directory that was capped at 100GB, while the .zip contents exceeded that limit.

My initial solution was to pass a custom temporary folder into which the dwca contents were extracted. But this took a lot of time and wrote 200GB+ worth of files to disk on every run.
I later discovered that the dwca-reader package can also read already-extracted dwca files. So the solution is to manually extract the lepidoptera.zip file once into a folder in our project directory (make sure you have 200GB+ of space available) and then point the dwca-reader package at that directory. This avoids re-extracting the .zip file every time you use it to download images.
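In code this is just a matter of handing the extracted directory to the reader (a sketch: the path is an example, and it assumes a dwca-reader version that accepts directory paths, as recent ones do):

from dwca.read import DwCAReader

# Point the reader at the pre-extracted archive directory instead of
# lepidoptera.zip, so it skips the slow, disk-hungry re-extraction.
with DwCAReader("/project/gbif/lepidoptera_extracted") as dwca:
    occ = dwca.pd_read("occurrence.txt")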

Other facets of the solution to the memory issue (documented in the README of the gbif_download_standalone repo) have involved:

  • When reading the dwca occurrence dataframes, reading only the desired columns, which reduces the in-memory size drastically.
  • Saving the truncated occurrence dataframe as a CSV file, which drastically reduces its size on disk.
  • Splitting the big Lepidoptera occurrence data by species and saving a separate CSV file for each, so the downstream download code finds the corresponding CSV file for each species and only loads that into memory (see the sketch below).
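Putting these points together, a condensed sketch (the column names are examples; adjust them to whatever the download code actually needs):

import os

from dwca.read import DwCAReader

COLUMNS = ["taxonKey", "species", "identifier"]  # example subset

with DwCAReader("lepidoptera_extracted") as dwca:
    # pd_read forwards keyword arguments to pandas.read_csv, so usecols
    # drops every unneeded column at parse time and shrinks the frame.
    occ = dwca.pd_read("occurrence.txt", usecols=COLUMNS)

# The truncated frame is far smaller on disk than the raw occurrence.txt
occ.to_csv("occurrence_trimmed.csv", index=False)

# One CSV per species, so the download step loads only the file it needs
os.makedirs("by_species", exist_ok=True)
for species, group in occ.groupby("species"):
    group.to_csv(f"by_species/{species.replace(' ', '_')}.csv", index=False)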

@github-project-automation github-project-automation bot moved this from 🐛 In Progress to 🦋 Done in AMBER Oct 24, 2023