
[Bug]: get_default_backend_configuration: auto chunk not good for time series data #1099

Open
bendichter opened this issue Sep 24, 2024 · 2 comments · May be fixed by #1184

@bendichter (Contributor)

What happened?

When using get_default_backend_configuration for long time series, the recommended chunk shape mirrors the shape of the dataset itself. This produces chunks that span very long stretches of time but only a few channels, which is sub-optimal for reading short time windows, e.g. the way data is accessed in neurosift. A better chunking for time series would deviate from the shape-similarity convention and instead provide chunks that hold more channels.

Steps to Reproduce

import numpy as np
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.file import mock_NWBFile
from neuroconv.tools.nwb_helpers import get_default_backend_configuration

# ~10 GB in-memory array: 10 million frames x 128 channels (float64)
data = np.ones((10_000_000, 128))

nwbfile = mock_NWBFile()
ts = mock_ElectricalSeries(data=data, nwbfile=nwbfile)

backend_config = get_default_backend_configuration(nwbfile, backend="hdf5")
backend_config.dataset_configurations["acquisition/ElectricalSeries/data"].chunk_shape

Output: (312500, 4)
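
For reference, (312500, 4) is consistent with a roughly 10 MB chunk that preserves the dataset's aspect ratio. A minimal sketch of the arithmetic, and of what a more channel-friendly chunking could look like (assumptions: float64 data, a 10 MB chunk target, and the 64-channel cap borrowed from the iterator default discussed below):

import numpy as np

shape = (10_000_000, 128)
itemsize = np.dtype("float64").itemsize  # np.ones defaults to float64

# The reported chunk: ~10 MB, same frames-to-channels ratio as the full dataset.
default_chunk = (312_500, 4)
print(np.prod(default_chunk) * itemsize / 1e6)  # 10.0 (MB)
print(shape[0] / shape[1], default_chunk[0] / default_chunk[1])  # 78125.0 78125.0

# A more time-series-friendly alternative: cap channels at 64 and spend the
# remaining budget on frames.
chunk_channels = min(shape[1], 64)
chunk_frames = int(10e6 / (itemsize * chunk_channels))
print((chunk_frames, chunk_channels))  # (19531, 64)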

Traceback

No response

Operating System

macOS

Python Executable

Conda

Python Version

3.10

Package Versions

No response

Code of Conduct

@bendichter bendichter added the bug label Sep 24, 2024
@h-mayorquin h-mayorquin self-assigned this Sep 26, 2024
@h-mayorquin (Collaborator)

Some relevant discussion here:
NeurodataWithoutBorders/pynwb#1945

@h-mayorquin (Collaborator) commented Jan 23, 2025

I looked into this yesterday and, fortunately, it is only a problem when the data of the ElectricalSeries is an in-memory array, as in your example.

For our ecephys conversions, the data is passed wrapped in a SpikeInterfaceRecordingDataChunkIterator, whose default chunk shape is computed as follows:

def _get_default_chunk_shape(self, chunk_mb: float = 10.0) -> tuple[int, int]:
    assert chunk_mb > 0, f"chunk_mb ({chunk_mb}) must be greater than zero!"

    chunk_channels = min(
        self.recording.get_num_channels(),
        64,  # from https://github.com/flatironinstitute/neurosift/issues/52#issuecomment-1671405249
    )
    chunk_frames = min(
        self.recording.get_num_frames(segment_index=self.segment_index),
        int(chunk_mb * 1e6 / (self.recording.get_dtype().itemsize * chunk_channels)),
    )

    return (chunk_frames, chunk_channels)

And the defaults work just fine (see the example):

import numpy as np
from neuroconv.tools.testing.mock_interfaces import MockRecordingInterface
from neuroconv.tools.nwb_helpers import get_default_backend_configuration


interface = MockRecordingInterface(num_channels=384, durations=[10.0])
nwbfile = interface.create_nwbfile()

backend_config = get_default_backend_configuration(nwbfile, backend="hdf5")
backend_config.dataset_configurations["acquisition/ElectricalSeries/data"].chunk_shape

Output:
(39062, 64)
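
Those numbers line up with the formula above, assuming the mock recording uses a 4-byte dtype and the default chunk_mb=10.0:

# Consistency check (assumes a 4-byte dtype for the mock recording):
chunk_channels = min(384, 64)                          # -> 64
chunk_frames = int(10.0 * 1e6 / (4 * chunk_channels))  # -> 39062
print((chunk_frames, chunk_channels))                  # (39062, 64)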

While passing data as an in-memory array is limited by available memory, and therefore should not affect very large datasets, it should still be fixed. I will push a fix so that, when the container is an ElectricalSeries, the default chunking matches that of the iterator above.
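
A minimal sketch of that idea (hypothetical, not the actual patch in #1184): when the dataset belongs to an ElectricalSeries backed by an in-memory array, cap the chunk at 64 channels and derive the frame count from the chunk-size budget, mirroring _get_default_chunk_shape above.

import numpy as np


def _electrical_series_chunk_shape(data_shape, dtype, chunk_mb: float = 10.0) -> tuple[int, int]:
    # Hypothetical helper: time-series-friendly chunking for (frames, channels) data.
    num_frames, num_channels = data_shape
    chunk_channels = min(num_channels, 64)  # same channel cap as the iterator above
    chunk_frames = min(num_frames, int(chunk_mb * 1e6 / (np.dtype(dtype).itemsize * chunk_channels)))
    return (chunk_frames, chunk_channels)


print(_electrical_series_chunk_shape((10_000_000, 128), "float64"))  # (19531, 64)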
