This issue was raised on the parent repo here - RolnickLab#2
Documenting it here for the AMI fork repository, and expanding with the latest updates:
Problem description:
The GBIF dwca file for Lepidoptera is huge, about 30GB zipped. The script to download GBIF images using dwca files, 02-fetch_gbif_moth_data.py, runs out of memory and is automatically killed by Slurm.
I tried running a smaller download, just for one family of moths, Erebidae, which has a zipped dwca file of 1.5GB. But the process still exploded in memory use and was killed.
The cause of the problem:
I investigated the 02-fetch_gbif_moth_data.py code, and the likely reason for such high RAM use is the parallelisation of processes on lines 187-188:
with Pool() as pool:
    pool.map(fetch_image_data, taxon_keys)
This happens after the dwca file is loaded and the occurrence data is read, which creates a very large dataframe in global memory. Parallelisation appears to give each sub-process an independent Python instance with its own copy of the global variables. This means the occurrence dataframe is duplicated for every entry in taxon_keys, which likely results in astronomical memory usage.
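A toy illustration of one likely mechanism (not from the repo): even with the default fork start method on Linux, where children begin as copy-on-write clones, CPython's reference counting writes to object headers, so the pages holding a large global are gradually copied into every worker.

from multiprocessing import Pool

big_global = list(range(20_000_000))  # stand-in for the occurrence dataframe

def touch(_):
    # Merely iterating over the list updates refcounts, which forces
    # copy-on-write page copies in each worker process.
    return sum(1 for _ in big_global)

if __name__ == "__main__":
    with Pool(4) as pool:
        pool.map(touch, range(4))  # watch the RSS of the four workers grow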
Additionally, the dwca files which are read into memory using the dwca-reader package take really large amount of space. The main culprit is the occurrence.txt file. When read into memory, these dataframes take unexpectedly large amount of space. For example, a zipped dwca file of ~200MB contains a ~1GB occurrence.txt file, but when read into memory consumes ~3.5GB of RAM.
Proposed solution:
I have turned off parallelisation in the download code, so the images for each species are downloaded serially. This takes a lot more time.
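A minimal sketch of the serial replacement for the Pool on lines 187-188, assuming the same fetch_image_data and taxon_keys as in the script:

# Process one taxon at a time; no sub-processes, no duplicated globals.
for taxon_key in taxon_keys:
    fetch_image_data(taxon_key)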
I have downloaded separate dwca files for the 16 moth families provided by David Roy in the initial moths checklist, and written a wrapper Python file which takes the family-specific dwca file as an argument and calls the functions from the 02-fetch_gbif_moth_data.py file.
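The wrapper might look roughly like this; a sketch under assumptions, since the entry-point name main and the argument handling are illustrative, not the actual repo code:

import argparse
import importlib.util

# Load the script as a module despite its non-identifier filename.
spec = importlib.util.spec_from_file_location(
    "fetch_gbif", "02-fetch_gbif_moth_data.py"
)
fetch_gbif = importlib.util.module_from_spec(spec)
spec.loader.exec_module(fetch_gbif)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Download GBIF images for a single moth family."
    )
    parser.add_argument("dwca_file", help="Path to the family-specific dwca file")
    args = parser.parse_args()
    fetch_gbif.main(args.dwca_file)  # hypothetical entry point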
Will make a new branch and open a PR to document these changes.
To document progress made on this front over the last months:
With the big lepidoptera.zip file, the process used to get killed because the dwca reader package was trying to unzip the contents into a temporary directory; that temporary directory was assigned a maximum of 100GB of space, while the .zip file contents exceeded that.
My initial solution involved passing a custom temporary folder into which to extract the dwca file contents. But this took a lot of time and wrote 200GB+ worth of files to disk on every run.
I later discovered that the dwca reader package can also read already-extracted dwca files. So the solution involves manually extracting the lepidoptera.zip file once into a folder in our project directory (make sure you have 200GB+ of space available), and then pointing the dwca reader package at that directory. This avoids having to re-extract the .zip file every time you use it to download images.
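As I understand the dwca-reader API, passing a directory path works the same as passing an archive (the path below is illustrative):

from dwca.read import DwCAReader

# Point the reader at the already-extracted archive directory instead of the
# .zip, so nothing is unzipped into a temp dir on each run.
with DwCAReader("/project/gbif/lepidoptera_dwca/") as dwca:
    occurrence_df = dwca.pd_read("occurrence.txt")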
Other facets of the solution to the memory issue (documented in the README of the gbif_download_standalone repo) have involved the following (a combined sketch follows the list):
When reading the dwca occurrence dataframes, reading only the desired columns, which reduces the in-memory size drastically.
Saving the truncated occurrence dataframe as a CSV file, which drastically reduces its size on disk.
Splitting the big lepidoptera occurrence data by species and saving a separate CSV file for each. The downstream download code finds the corresponding CSV file for each species and loads only that into memory.
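Roughly, the three steps fit together like this (the column names, paths, and the exact split key are assumptions for illustration; the actual code lives in the gbif_download_standalone repo):

from dwca.read import DwCAReader

KEEP = ["gbifID", "taxonKey", "species"]  # illustrative column subset

with DwCAReader("/project/gbif/lepidoptera_dwca/") as dwca:
    # 1. Read only the needed columns; this shrinks memory use drastically.
    occ = dwca.pd_read("occurrence.txt", usecols=KEEP)

# 2. Save the truncated dataframe once as a CSV for cheap reloading later.
occ.to_csv("occurrence_trimmed.csv", index=False)

# 3. Split by species so the download code can load one small CSV at a time.
for species, group in occ.groupby("species"):
    group.to_csv(f"occurrences_by_species/{species.replace(' ', '_')}.csv",
                 index=False)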