
[Bug]: get_default_backend_configuration: auto chunk not good for time series data #1099

Open
bendichter opened this issue Sep 24, 2024 · 2 comments · May be fixed by #1184

@bendichter (Contributor)

What happened?

When using get_default_backend_configuration for long time series, the recommended chunk shape mirrors the shape of the dataset itself. This produces chunks that span very long stretches of time but only a few channels, which is sub-optimal for reading short time windows, e.g. the way data is accessed in neurosift. A better chunking for time series would deviate from the shape-similarity convention and instead provide chunks that hold more channels.

Steps to Reproduce

import numpy as np
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.file import mock_NWBFile
from neuroconv.tools.nwb_helpers import get_default_backend_configuration

# ~10 GB in-memory array: 10 million frames x 128 channels (float64)
data = np.ones((10_000_000, 128))

nwbfile = mock_NWBFile()
ts = mock_ElectricalSeries(data=data, nwbfile=nwbfile)

backend_config = get_default_backend_configuration(nwbfile, backend="hdf5")
backend_config.dataset_configurations["acquisition/ElectricalSeries/data"].chunk_shape

Output: (312500, 4)
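
For reference, (312500, 4) is consistent with a roughly 10 MB chunk that preserves the dataset's aspect ratio. A minimal sketch of the arithmetic, and of what a more channel-friendly chunking could look like (assumptions: float64 data, a 10 MB chunk target, and the 64-channel cap borrowed from the iterator default discussed below):

import numpy as np

shape = (10_000_000, 128)
itemsize = np.dtype("float64").itemsize  # np.ones defaults to float64

# The reported chunk: ~10 MB, same frames-to-channels ratio as the full dataset.
default_chunk = (312_500, 4)
print(np.prod(default_chunk) * itemsize / 1e6)  # 10.0 (MB)
print(shape[0] / shape[1], default_chunk[0] / default_chunk[1])  # 78125.0 78125.0

# A more time-series-friendly alternative: cap channels at 64 and spend the
# remaining budget on frames.
chunk_channels = min(shape[1], 64)
chunk_frames = int(10e6 / (itemsize * chunk_channels))
print((chunk_frames, chunk_channels))  # (19531, 64)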

Traceback

No response

Operating System

macOS

Python Executable

Conda

Python Version

3.10

Package Versions

No response

Code of Conduct

@bendichter bendichter added the bug label Sep 24, 2024
@h-mayorquin h-mayorquin self-assigned this Sep 26, 2024
@h-mayorquin (Collaborator)

Some relevant discussion here:
NeurodataWithoutBorders/pynwb#1945

@h-mayorquin (Collaborator) commented Jan 23, 2025

I looked into this yesterday and, fortunately, it is only a problem when the data of the ElectricalSeries is an in-memory array, as in your example.

For our ecephys conversions, the data is passed wrapped in a SpikeInterfaceRecordingDataChunkIterator, whose default chunk shape is computed as follows:

def _get_default_chunk_shape(self, chunk_mb: float = 10.0) -> tuple[int, int]:
    assert chunk_mb > 0, f"chunk_mb ({chunk_mb}) must be greater than zero!"

    chunk_channels = min(
        self.recording.get_num_channels(),
        64,  # from https://github.com/flatironinstitute/neurosift/issues/52#issuecomment-1671405249
    )
    chunk_frames = min(
        self.recording.get_num_frames(segment_index=self.segment_index),
        int(chunk_mb * 1e6 / (self.recording.get_dtype().itemsize * chunk_channels)),
    )

    return (chunk_frames, chunk_channels)

And the defaults work just fine (see the example):

import numpy as np
from neuroconv.tools.testing.mock_interfaces import MockRecordingInterface
from neuroconv.tools.nwb_helpers import get_default_backend_configuration


interface = MockRecordingInterface(num_channels=384, durations=[10.0])
nwbfile = interface.create_nwbfile()

backend_config = get_default_backend_configuration(nwbfile, backend="hdf5")
backend_config.dataset_configurations["acquisition/ElectricalSeries/data"].chunk_shape

Output:
(39062, 64)
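
Those numbers line up with the formula above, assuming the mock recording uses a 4-byte dtype and the default chunk_mb=10.0:

# Consistency check (assumes a 4-byte dtype for the mock recording):
chunk_channels = min(384, 64)                          # -> 64
chunk_frames = int(10.0 * 1e6 / (4 * chunk_channels))  # -> 39062
print((chunk_frames, chunk_channels))                  # (39062, 64)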

While passing data as an in-memory array is limited by available memory, and therefore should not affect very large datasets, it should still be fixed. I will push a fix so that, when the container is an ElectricalSeries, the default chunking matches that of the iterator above.
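
A minimal sketch of that idea (hypothetical, not the actual patch in #1184): when the dataset belongs to an ElectricalSeries backed by an in-memory array, cap the chunk at 64 channels and derive the frame count from the chunk-size budget, mirroring _get_default_chunk_shape above.

import numpy as np


def _electrical_series_chunk_shape(data_shape, dtype, chunk_mb: float = 10.0) -> tuple[int, int]:
    # Hypothetical helper: time-series-friendly chunking for (frames, channels) data.
    num_frames, num_channels = data_shape
    chunk_channels = min(num_channels, 64)  # same channel cap as the iterator above
    chunk_frames = min(num_frames, int(chunk_mb * 1e6 / (np.dtype(dtype).itemsize * chunk_channels)))
    return (chunk_frames, chunk_channels)


print(_electrical_series_chunk_shape((10_000_000, 128), "float64"))  # (19531, 64)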
