Memory limitations for processing the DwC-A file #2
Following up on this: I tried running a small task, using a DwC-A file for just the family name "Erebidae", which results in a dwca .zip file of 1.5GB. I was expecting this not to cause any memory issues. As the species list, I gave it a CSV file from David Roy containing 996 species from various families, including Erebidae. The script should have gone through each species name, found the corresponding occurrences in the dwca file, and downloaded images for species in the Erebidae family, while finding no occurrence data for the other species. However, the process was killed due to memory limitations. SLURM produces a .out and a .stats file that give insight into the memory use. As the .stats file shows, memory usage was 5799713988K, which is roughly 5.8 TB of RAM.
The .out file can be found here too. I investigated the issue: this happens after the dwca file is loaded and the occurrence data is read, which creates a very large object in global memory. It seems that parallelizing the processing actually duplicates the global variables, so each sub-process gets an independent instance of Python. This means that, for each worker, the large occurrence table is copied in memory. Have you encountered a similar issue previously @alcunha or @adityajain07? Perhaps there is a simple solution; apologies if I am missing something.
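For illustration, here is a minimal sketch of the pattern I suspect is happening (none of these names come from 02-fetch_gbif_moth_data.py; the DataFrame and worker function are made up):

```python
# Hypothetical minimal example: a large global DataFrame plus a worker pool.
import multiprocessing as mp
import pandas as pd

# Stand-in for the occurrence table read out of the DwC-A file.
occurrences = pd.DataFrame({"species": ["Erebidae sp."] * 5_000_000})

def count_occurrences(species_name):
    # Under the "fork" start method (the default on Linux/SLURM), workers
    # inherit the parent's memory copy-on-write, but CPython's reference
    # counting and GC write into the inherited objects as they are touched
    # (every string in an object-dtype column is a Python object), so each
    # worker ends up materialising its own copy of the shared pages.
    # Under "spawn" (the default on macOS), each worker re-imports the
    # module and rebuilds `occurrences` from scratch instead.
    return int((occurrences["species"] == species_name).sum())

if __name__ == "__main__":
    with mp.Pool(processes=8) as pool:
        # Peak memory grows roughly with (1 + n_workers) copies of the table.
        counts = pool.map(count_occurrences, ["Erebidae sp."] * 8)
```

If that is what is going on, the fix would presumably be to stop keeping the full occurrence table in a global and instead hand each worker only the slice it needs.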
During Step 2 of downloading data, the script 02-fetch_gbif_moth_data.py uses a Darwin Core Archive file to download the appropriate images for a list of species. I have downloaded the DwC-A file associated with order = Lepidoptera, which is 30GB zipped and contains large files like occurrence.txt, which is around 110GB by itself. The link to download the file is here.
However, given the large size of the file, the computer runs out of memory when trying to read occurrence.txt for subsequent processing, and the process is automatically killed. The full error output from the terminal is below (note that some of the text output, like "reading the occurrence.txt file...", was added by me inside the script).
I am able to download smaller DwC-A files, associated with just one family name = Erebidae, and I think those won't run into the same memory issues. However, since we'd like to train the model on a very large list of species, we would need to download images for all of them.
One solution would be to download images in chunks, separately for different "family" names, using separate DwC-A files for each family.
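Concretely, that could just be a driver loop over the existing script, with one (much smaller) DwC-A file per family. A rough sketch, reusing the command-line flags from the run shown below; the family list and file names are made up:

```python
# Hypothetical per-family driver: calls the existing download script once per
# family, each time with its own smaller DwC-A file. Paths/names are made up;
# the flags themselves are the ones 02-fetch_gbif_moth_data.py already takes.
import subprocess

families = ["Erebidae", "Geometridae", "Noctuidae"]  # example families only

for family in families:
    subprocess.run(
        [
            "python", "02-fetch_gbif_moth_data.py",
            "--write_directory", f"output_data/{family}/",
            "--dwca_file", f"dwca_{family}.zip",
            "--species_checklist", "species_checklist.csv",
            "--max_images_per_species", "2",
            "--resume_session", "True",
        ],
        check=True,  # stop the loop if one family fails
    )
```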
However, I was wondering if you might have a solution for downloading all the images in one go. Did you ever run into the same memory limitations when downloading images for your species, which I believe numbered in the thousands as well? If so, how did you work around it? Any advice would be greatly appreciated!
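For the one-go case, the workaround I keep coming back to is streaming occurrence.txt in fixed-size chunks and keeping only the rows that match the checklist, rather than loading the whole table at once. A rough sketch, untested on the real 110GB file; it assumes the file has been extracted from the zip, that the occurrence table has GBIF's usual speciesKey column, and that the checklist CSV has a matching speciesKey column (paths are made up):

```python
# Sketch of chunked filtering of occurrence.txt. File paths, the "speciesKey"
# column names and the chunk size are assumptions, not the script's real ones.
import csv
import pandas as pd

CHECKLIST = "species_checklist.csv"  # hypothetical path
OCCURRENCE = "occurrence.txt"        # extracted from the DwC-A zip

wanted = set(pd.read_csv(CHECKLIST)["speciesKey"].dropna().astype("int64"))

kept = []
for chunk in pd.read_csv(
    OCCURRENCE,
    sep="\t",
    usecols=["gbifID", "speciesKey"],  # read only the columns we need
    chunksize=1_000_000,               # memory is bounded by one chunk
    quoting=csv.QUOTE_NONE,            # GBIF's occurrence.txt is unquoted TSV
    on_bad_lines="skip",
):
    chunk = chunk.dropna(subset=["speciesKey"])
    kept.append(chunk[chunk["speciesKey"].astype("int64").isin(wanted)])

occurrences = pd.concat(kept, ignore_index=True)
print(f"kept {len(occurrences)} occurrence rows for checklist species")
```

I am not sure how cleanly this composes with python-dwca-reader, which appears to read the whole table in one read_csv call (see the dwca/read.py line in the output below), so it might mean bypassing the reader for occurrence.txt.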
Terminal output after attempting the download:
```
(gbif-species-trainer-AMI-fork) lbokeria@610-MJ6THLXQ7R data_download % python 02-fetch_gbif_moth_data.py \
--write_directory /Users/lbokeria/Documents/projects/gbif-species-trainer-AMI-fork/data_download/output_data/gbif_data_uksi_macro_moths_small_try/ \
--dwca_file /Users/lbokeria/Downloads/0001402-230530130749713.zip \
--species_checklist /Users/lbokeria/Documents/projects/gbif-species-trainer-AMI-fork/uksi-macro-moths-small-try-keys.csv \
--max_images_per_species 2 \
--resume_session True
INFO:Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:NumExpr defaulting to 8 threads.
reading the multimedia.txt file...
/Users/lbokeria/miniforge3/envs/gbif-species-trainer-AMI-fork/lib/python3.9/site-packages/dwca/read.py:203: DtypeWarning: Columns (4,5,6,7,8,9,10,12) have mixed types. Specify dtype option on import or set low_memory=False.
df = read_csv(self.absolute_temporary_path(relative_path), **kwargs)
finished
reading the occurrence.txt file...
zsh: killed python 02-fetch_gbif_moth_data.py --write_directory --dwca_file 2 True
```