Memory limitations for processing the DwC-A file #2

Open
LevanBokeria opened this issue Jun 9, 2023 · 1 comment

@LevanBokeria

During Step 2 of downloading data, the script 02-fetch_gbif_moth_data.py uses a Darwin Core Archive file to download the appropriate images for a list of species.

I have downloaded the DwC-A file associated with order = Lepidoptera, which is 30GB zipped and contains large files like occurrence.txt, which is around 110GB by itself. The link to download the file is here.

However, given the large size of the file, the computer runs out of memory when trying to read in occurrence.txt for subsequent processing, and the process is automatically killed. The full terminal output is below (note that some of the text output, like "reading the occurrence.txt file...", was added by me inside the script).

I am able to download smaller DwC-A files, associated with just one family name = Erebidae, and I think this won't run into the same memory issues. However, since we'd like to train the model on a very large list of species, we would need to download images for all of them.

One solution would be to download images in chunks, separately for different "family" names, using separate DwC-A files for each family.

However, I was wondering if you might have a solution for downloading all the images in one go. Did you ever run into the same memory limitations when downloading images for your species, which I believe numbered in the thousands as well? If so, how did you work around it? Any advice would be greatly appreciated!
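
Another thought, just to illustrate the idea (the tab separator, the column names, and the checklist column below are my assumptions, not the script's actual code): occurrence.txt could be streamed in chunks with pandas and filtered down to the checklist as it is read, so the full table never has to sit in memory at once.

import pandas as pd

# Taxon keys we actually care about (checklist file from the command further below; column name assumed)
wanted = set(pd.read_csv("uksi-macro-moths-small-try-keys.csv")["accepted_taxon_key"])

kept = []
for chunk in pd.read_csv(
    "occurrence.txt",                    # assumed to be extracted from the DwC-A zip beforehand
    sep="\t",                            # DwC-A text files are typically tab-separated
    usecols=["gbifID", "taxonKey"],      # read only the columns needed downstream (names assumed)
    chunksize=500_000,                   # tune to the available RAM
    on_bad_lines="skip",
):
    kept.append(chunk[chunk["taxonKey"].isin(wanted)])

occurrence_df = pd.concat(kept, ignore_index=True)

This would keep peak memory at roughly one chunk plus the filtered rows, rather than the whole 110GB table.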


Terminal output after attempting the download:

(gbif-species-trainer-AMI-fork) lbokeria@610-MJ6THLXQ7R data_download % python 02-fetch_gbif_moth_data.py \
--write_directory /Users/lbokeria/Documents/projects/gbif-species-trainer-AMI-fork/data_download/output_data/gbif_data_uksi_macro_moths_small_try/ \
--dwca_file /Users/lbokeria/Downloads/0001402-230530130749713.zip \
--species_checklist /Users/lbokeria/Documents/projects/gbif-species-trainer-AMI-fork/uksi-macro-moths-small-try-keys.csv \
--max_images_per_species 2 \
--resume_session True

INFO:Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:NumExpr defaulting to 8 threads.
reading the multimedia.txt file...
/Users/lbokeria/miniforge3/envs/gbif-species-trainer-AMI-fork/lib/python3.9/site-packages/dwca/read.py:203: DtypeWarning: Columns (4,5,6,7,8,9,10,12) have mixed types. Specify dtype option on import or set low_memory=False.
df = read_csv(self.absolute_temporary_path(relative_path), **kwargs)
finished
reading the occurrence.txt file...
zsh: killed python 02-fetch_gbif_moth_data.py --write_directory --dwca_file 2 True
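
The DtypeWarning above also suggests that the dwca reader forwards its keyword arguments straight to pandas.read_csv. If the script goes through that pathway (I have not verified the exact call site, so the method name and column names below are assumptions), forwarding usecols and dtype might cut memory use and silence the warning:

from dwca.read import DwCAReader

# Hypothetical sketch: forward read_csv options through the dwca reader to reduce memory use.
with DwCAReader("/Users/lbokeria/Downloads/0001402-230530130749713.zip") as dwca:
    occurrence_df = dwca.pd_read(
        "occurrence.txt",
        usecols=["gbifID", "taxonKey"],   # read only the columns needed downstream (names assumed)
        dtype=str,                        # avoid mixed-type inference (the DtypeWarning above)
    )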

@LevanBokeria (Author)

Following up on this:

I tried running a small task, using a DwC-A file for just the family "Erebidae", which results in a DwC-A .zip file of about 1.5GB. I was expecting this not to run into any memory issues.

As the species list, I gave it a CSV file from David Roy containing 996 species from various families, including Erebidae.

The script should have gone through each species name, found corresponding occurrences in the dwca file, and downloaded images for species in the Erebidae family while not finding any occurrence data for other species.

However, the process was killed due to memory limitations. SLURM produces .out and .stats files that give insight into the memory use. According to the .stats file, memory usage was 5799713988K, which is roughly 5.8 TB of RAM:

+--------------------------------------------------------------------------+
| Job on the Baskerville cluster:
| Starting at Thu Jun 29 13:31:23 2023 for rybf4168(102339)
| Identity jobid 459822 jobname 02-fetch_gbif_moth_data.sh
| Running against project vjgo8416-amber and in partition baskerville-a100_40
| Requested cpu=36,mem=108G,node=1,billing=36 - 02:00:00 walltime
| Assigned to nodes bask-pg0308u23a
| Command /bask/projects/v/vjgo8416-amber/projects/gbif-species-trainer-AMI-fork/data_download/02-fetch_gbif_moth_data.sh
| WorkDir /bask/projects/v/vjgo8416-amber/projects/gbif-species-trainer-AMI-fork/data_download
+--------------------------------------------------------------------------+
+--------------------------------------------------------------------------+
| Finished at Thu Jun 29 13:37:16 2023 for rybf4168(102339) on the Baskerville Cluster
| Required (03:31.988 cputime, 5799713988K memory used) - 00:05:53 walltime
| JobState COMPLETING - Reason OutOfMemory
| Exitcode 0:125
+--------------------------------------------------------------------------+

The .out file can be found here too.

I investigated the 02-fetch_gbif_moth_data.py code, and the likely reason for such high RAM use is the parallelisation of processes that happens on lines 187-188:

with Pool() as pool:
    pool.map(fetch_image_data, taxon_keys)

This happens after the DwC-A file is loaded and the occurrence data is read, which leaves a very large dataframe in global memory. Because each worker in the pool is a separate Python process with its own copy of the global variables, the occurrence dataframe is effectively duplicated for every worker handling a taxon key. This is likely what produces the astronomical memory usage.
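
A possible workaround, sketched below purely as an illustration (fetch_image_data_subset, the taxonKey column name, and the worker count are my assumptions, while occurrence_df and taxon_keys stand in for the script's existing variables): group the occurrence dataframe once in the parent and hand each worker only its own taxon's rows, so the children never need the full dataframe.

from multiprocessing import Pool

def fetch_image_data_subset(item):
    taxon_key, occurrences = item        # occurrences holds only this taxon's rows
    ...                                  # download images for this taxon as before

# Group once in the parent, then send each worker just its slice.
grouped = occurrence_df.groupby("taxonKey")
work_items = [(key, grouped.get_group(key)) for key in taxon_keys if key in grouped.groups]

with Pool(processes=4) as pool:          # capping the worker count also bounds memory
    pool.map(fetch_image_data_subset, work_items)

Since pool.map pickles each work item to the child process, only the relevant slice crosses the process boundary, instead of every worker holding its own copy of the whole occurrence dataframe.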

Have you encountered a similar issue previously, @alcunha or @adityajain07? Perhaps there is a simple solution; apologies if I am missing something.
