Can I use fitsio to loop quickly over 20k small fits-file? #335

Open
Nestak2 opened this issue Nov 9, 2021 · 1 comment

Nestak2 commented Nov 9, 2021

Hi, I need to extract information from a few columns in ~20k different FITS files. Each file is relatively small, ~0.2 MB. So far I have been doing this with a loop and astropy, like this:

import numpy as np
from astropy.io import fits

data = []
# fits_files_list holds the paths to the ~20k FITS files
for file_name in fits_files_list:
    with fits.open(file_name, memmap=False) as hdulist:
        lam = np.around(10**hdulist[1].data['loglam'], 4)
        flux = np.around(hdulist[1].data['flux'], 4)
        z = np.around(hdulist[2].data['z'], 4)
    data.append([lam, flux, z])

For the 20k FITS files this takes ~2.5 hours, and from time to time I need to loop through the files again for other reasons. So I wanted to cut down that time, and I tried fitsio like this:

import numpy as np
import fitsio

data = []
for file_name in fits_files_list[:300]:
    # use a context manager so each file is closed after reading
    with fitsio.FITS(file_name) as hdulist:
        lam = np.around(10**hdulist[1]['loglam'][:], 4)
        flux = np.around(hdulist[1]['flux'][:], 4)
        z = np.around(hdulist[2]['z'][:], 4)
    data.append([lam, flux, z])

But unfortunately it doesn't give me much of a time improvement, if any. So my question is: can I speed up the looping with fitsio? Do you know of other packages that would help? Or can I change my algorithm to make it run faster, e.g. somehow vectorize the loop? Or is there software to quickly stack 20k FITS files into a single FITS file (TOPCAT has no function that does this for more than 2 files)? Thanks for any ideas and comments!

@esheldon (Owner) commented

It might be good to profile this, to see if it is limited by reading from disk.
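
For illustration only, a minimal sketch of that profiling step using Python's built-in cProfile; 'example.fits' is a placeholder for one of the real files:

import cProfile

import fitsio
import numpy as np

def read_one(file_name):
    # extract the same three columns as in the loop above
    with fitsio.FITS(file_name) as hdulist:
        lam = np.around(10**hdulist[1]['loglam'][:], 4)
        flux = np.around(hdulist[1]['flux'][:], 4)
        z = np.around(hdulist[2]['z'][:], 4)
    return [lam, flux, z]

# if most of the cumulative time lands in low-level read calls,
# the loop is I/O bound rather than limited by fitsio itself
cProfile.run("read_one('example.fits')", sort='cumulative')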

If it is read limited, then the best way to speed it up would be to run multiple jobs on different machines and combine the results afterward.
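
As a single-machine variant of that suggestion (an assumption beyond the comment above: it only helps if several processes can read concurrently without saturating the disk), the jobs-plus-combine step might look like this with concurrent.futures; the worker count of 8 is arbitrary:

from concurrent.futures import ProcessPoolExecutor

import fitsio
import numpy as np

def read_one(file_name):
    # same per-file extraction as the serial loop
    with fitsio.FITS(file_name) as hdulist:
        lam = np.around(10**hdulist[1]['loglam'][:], 4)
        flux = np.around(hdulist[1]['flux'][:], 4)
        z = np.around(hdulist[2]['z'][:], 4)
    return [lam, flux, z]

if __name__ == '__main__':
    # fits_files_list is the same list of ~20k paths as above;
    # pool.map returns results in input order, matching the serial loop
    with ProcessPoolExecutor(max_workers=8) as pool:
        data = list(pool.map(read_one, fits_files_list, chunksize=200))

On a single local disk this may gain little, which is the point of the comment above, but on SSD or network storage parallel readers often do help.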
