This bug surfaced when we collected a dataset with a huge number of frames (writing ~1100 _data*.h5 files). When we ran dials.find_spots on this dataset, DIALS was extremely slow to start, and it failed with an error:
unable to open external link file name = '[...]_data_001013.h5'.
I think the error is related to our use of NFS for data storage, i.e. too many file handles were open at once. On a dataset with fewer than 1000 _data*.h5 files, dials.find_spots ran without an error but was still extremely slow.
The traceback points to this line in FormatNXmxEigerFilewriter.get_raw_data:
data_subsets = [v for k, v in sorted(nxdata.items()) if DATA_FILE_RE.match(k)]
https://github.com/cctbx/dxtbx/blob/f9013668291ff4bd8d1725178275444d72ac2fd1/src/dxtbx/format/FormatNXmxEigerFilewriter.py#L105C5-L105C83
It looks like every single linked _data*.h5 file is opened here, on every call to get_raw_data(), which is kind of crazy. However, doing this appears to be necessary because the number of images per data* file is not stored anywhere else, as far as I can tell. I made a patch in our local DIALS installation that mostly solves the problem, here: https://github.com/FlexXBeamline/dials-extensions/blob/faster-read-raw/dials_extensions/FormatNXmxEigerFilewriterCHESS.py
The patch only loads the first data file in the series and uses it to determine the data shape. Perhaps something like this could be incorporated into FormatNXmxEigerFilewriter?
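For illustration, a minimal sketch of that approach using h5py directly (this is not the dxtbx code; the /entry/data path and the data_NNNNNN key pattern follow the Eiger filewriter convention, and images_per_data_file is a hypothetical helper):

```python
import re

import h5py

# Key pattern for linked data files in the master file, following the
# Eiger filewriter convention (assumption; the real regex lives in dxtbx).
DATA_FILE_RE = re.compile(r"^data_\d{6}$")


def images_per_data_file(master_path: str) -> int:
    """Return the image count of the first linked data file (hypothetical helper)."""
    with h5py.File(master_path, "r") as f:
        nxdata = f["/entry/data"]
        # Iterating a group yields key names only; no external links are
        # resolved at this point.
        first_key = sorted(k for k in nxdata if DATA_FILE_RE.match(k))[0]
        # Indexing resolves the external link, so only this one data file
        # is opened rather than the whole series.
        return nxdata[first_key].shape[0]
```

Given that count, and assuming every file except possibly the last holds the same number of images, a global frame index maps to a file via divmod(frame, n_per_file) with no further file opens.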
@spmeisburger thanks for raising this - wasn't aware that this nonsense could happen for every get_raw_data() call: I am sure this is handled more gracefully for non-filewriter data.
Meanwhile you can also `ulimit -n unlimited` (raising the open-file limit), which will also make things work
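For what it's worth, the same effect can be had from inside the Python process itself; a minimal sketch, assuming a Unix system where the resource module is available:

```python
import resource

# Raise this process's soft open-file limit up to the hard limit. This is
# the in-process equivalent of `ulimit -n`; it cannot exceed the hard
# limit without elevated privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```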
The proper fix is probably to scan the linked data files once and cache the indexing, such that the get_raw_data() method can just seek and read. However, I am also fairly sure that there is a good implementation of this elsewhere in dxtbx which does this from C++, which could be preferable 🤔 - certainly I remember this for the old "nearly nexus" and friends format
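To make the caching idea concrete, here is a sketch of the index-once-then-seek approach (illustrative only, not dxtbx API; the class and attribute names are made up): record the cumulative image counts per file on first access, then resolve each frame with a binary search.

```python
import bisect


class FrameIndex:
    """Map a global frame number to (data file index, frame index within file)."""

    def __init__(self, images_per_file: list[int]):
        # cumulative[i] = total number of images in data files 0..i,
        # built once when the format instance is created.
        self.cumulative: list[int] = []
        total = 0
        for n in images_per_file:
            total += n
            self.cumulative.append(total)

    def locate(self, frame: int) -> tuple[int, int]:
        # Binary search for the first file whose cumulative count
        # exceeds the requested frame number.
        file_idx = bisect.bisect_right(self.cumulative, frame)
        first_frame = self.cumulative[file_idx - 1] if file_idx else 0
        return file_idx, frame - first_frame
```

For example, FrameIndex([100, 100, 37]).locate(150) gives (1, 50), so get_raw_data() only needs to open (or reuse a cached handle for) that single data file and read the one slice.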