This issue was raised on the parent repo here - RolnickLab#2
Documenting it here for the AMI fork repository, and expanding with the latest updates:
Problem description:
The GBIF dwca file for Lepidoptera is huge, about 30GB zipped. The script to download GBIF images using dwca files, 02-fetch_gbif_moth_data.py, runs out of memory and is automatically killed by Slurm.
I tried running a smaller download, just for one family of moths, Erebidae, which has a zipped dwca file of 1.5GB. But the process still exploded in memory use and was killed.
The cause of the problem:
I investigated the 02-fetch_gbif_moth_data.py code, and the likely reason for such high RAM use is the parallelisation of processes on lines 187-188:
with Pool() as pool:
    pool.map(fetch_image_data, taxon_keys)
This happens after the dwca file is loaded and the occurrence data is read, which creates a very large dataframe in global memory. Parallelisation appears to give each sub-process an independent Python instance with its own copy of the global variables. This means the occurrence dataframe is duplicated for every entry in taxon_keys, which likely results in astronomical memory usage.
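A toy illustration of one likely mechanism (not from the repo): even with the default fork start method on Linux, where children begin as copy-on-write clones, CPython's reference counting writes to object headers, so the pages holding a large global are gradually copied into every worker.

from multiprocessing import Pool

big_global = list(range(20_000_000))  # stand-in for the occurrence dataframe

def touch(_):
    # Merely iterating over the list updates refcounts, which forces
    # copy-on-write page copies in each worker process.
    return sum(1 for _ in big_global)

if __name__ == "__main__":
    with Pool(4) as pool:
        pool.map(touch, range(4))  # watch the RSS of the four workers grow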
Additionally, the dwca files which are read into memory using the dwca-reader package take really large amount of space. The main culprit is the occurrence.txt file. When read into memory, these dataframes take unexpectedly large amount of space. For example, a zipped dwca file of ~200MB contains a ~1GB occurrence.txt file, but when read into memory consumes ~3.5GB of RAM.
Proposed solution:
I have turned off parallelisation in the download code, so the images for each species are downloaded serially. This takes a lot more time.
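A minimal sketch of the serial replacement for the Pool on lines 187-188, assuming the same fetch_image_data and taxon_keys as in the script:

# Process one taxon at a time; no sub-processes, no duplicated globals.
for taxon_key in taxon_keys:
    fetch_image_data(taxon_key)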
I have downloaded separate dwca files for the 16 moth families provided by David Roy in the initial moths checklist, and written a wrapper Python file which takes the family-specific dwca file as an argument and calls the functions from the 02-fetch_gbif_moth_data.py file.
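The wrapper might look roughly like this; a sketch under assumptions, since the entry-point name main and the argument handling are illustrative, not the actual repo code:

import argparse
import importlib.util

# Load the script as a module despite its non-identifier filename.
spec = importlib.util.spec_from_file_location(
    "fetch_gbif", "02-fetch_gbif_moth_data.py"
)
fetch_gbif = importlib.util.module_from_spec(spec)
spec.loader.exec_module(fetch_gbif)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Download GBIF images for a single moth family."
    )
    parser.add_argument("dwca_file", help="Path to the family-specific dwca file")
    args = parser.parse_args()
    fetch_gbif.main(args.dwca_file)  # hypothetical entry point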
Will make a new branch and open a PR to document these changes.
To document progress made on this front over the last months:
With the big lepidoptera.zip file, the process used to get killed because the dwca reader package was trying to unzip the contents into a temporary directory; that temporary directory was assigned a maximum of 100GB of space, while the .zip file contents exceeded that.
My initial solution involved passing a custom temporary folder into which to extract the dwca file contents. But this took a lot of time and wrote 200GB+ worth of files to disk on every run.
I later discovered that the dwca reader package can also read already-extracted dwca files. So the solution involves manually extracting the lepidoptera.zip file once into a folder in our project directory (make sure you have 200GB+ of space available), and then pointing the dwca reader package at that directory. This avoids having to re-extract the .zip file every time you use it to download images.
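As I understand the dwca-reader API, passing a directory path works the same as passing an archive (the path below is illustrative):

from dwca.read import DwCAReader

# Point the reader at the already-extracted archive directory instead of the
# .zip, so nothing is unzipped into a temp dir on each run.
with DwCAReader("/project/gbif/lepidoptera_dwca/") as dwca:
    occurrence_df = dwca.pd_read("occurrence.txt")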
Other facets of the solution to the memory issue (documented in the README of the gbif_download_standalone repo) have involved the following (a combined sketch follows the list):
When reading the dwca occurrence dataframes, reading only the desired columns, which reduces the in-memory size drastically.
Saving the truncated occurrence dataframe as a CSV file, which drastically reduces its size on disk.
Splitting the big lepidoptera occurrence data by species and saving a separate CSV file for each. The downstream download code finds the corresponding CSV file for each species and loads only that into memory.
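Roughly, the three steps fit together like this (the column names, paths, and the exact split key are assumptions for illustration; the actual code lives in the gbif_download_standalone repo):

from dwca.read import DwCAReader

KEEP = ["gbifID", "taxonKey", "species"]  # illustrative column subset

with DwCAReader("/project/gbif/lepidoptera_dwca/") as dwca:
    # 1. Read only the needed columns; this shrinks memory use drastically.
    occ = dwca.pd_read("occurrence.txt", usecols=KEEP)

# 2. Save the truncated dataframe once as a CSV for cheap reloading later.
occ.to_csv("occurrence_trimmed.csv", index=False)

# 3. Split by species so the download code can load one small CSV at a time.
for species, group in occ.groupby("species"):
    group.to_csv(f"occurrences_by_species/{species.replace(' ', '_')}.csv",
                 index=False)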