Broken NaN encoding when writing v2 storage format from v3 library #2741

Open
d70-t opened this issue Jan 21, 2025 · 1 comment
Labels
bug Potential issues with the zarr-python library

Comments

@d70-t
Contributor

d70-t commented Jan 21, 2025

Zarr version

3.0.1

Numcodecs version

0.15.0

Python Version

3.12.8

Operating System

Mac

Installation

micromamba create -n zarr3 python 'zarr>=3'

Description

When writing a float array with fill_value=np.nan to a version 2 store, the spec requires the fill value to be encoded as a string (i.e. "NaN"), because JSON has no NaN literal. However, when doing so with zarr>=3, the fill value is not encoded as a string. The resulting .zarray is therefore not valid JSON and cannot be read by other JSON parsers.

This behaviour could be a regression of #412.
The bug was originally found by @lkluft.
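
To illustrate the encoding described above, here is a minimal sketch using only the Python standard library. The helper names encode_fill_value and decode_fill_value are hypothetical and are not part of zarr-python; the sketch assumes plain Python floats and the string spellings "NaN", "Infinity" and "-Infinity" from the v2 spec. It is not the library's actual implementation.

```python
import json
import math

# String spellings for special float values in Zarr v2 metadata.
_SPECIAL = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}


def encode_fill_value(value):
    """Hypothetical helper: return a JSON-safe form of a float fill value."""
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
    return value


def decode_fill_value(value):
    """Hypothetical helper: inverse of encode_fill_value for float dtypes."""
    if isinstance(value, str) and value in _SPECIAL:
        return _SPECIAL[value]
    return value


meta = {"dtype": "<f4", "fill_value": encode_fill_value(float("nan"))}

# allow_nan=False makes json.dumps raise ValueError instead of emitting
# a bare NaN token, so an unencoded fill value cannot slip through.
text = json.dumps(meta, allow_nan=False)
print(text)  # {"dtype": "<f4", "fill_value": "NaN"}
```

Passing allow_nan=False is a useful guard in a sketch like this: by default json.dumps writes a bare NaN token, which matches the symptom shown in this report.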

Steps to reproduce

Using this Python script as check_nan_encoding.py:

import zarr
import numpy as np

import numcodecs
import sys

print("zarr version", zarr.__version__)
print("numcodecs version", numcodecs.__version__)
print("python version", sys.version)

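# Compare the major version numerically; a plain string comparison
# such as zarr.__version__ >= "3" would misorder e.g. "10.0".
ZARR_V3 = int(zarr.__version__.split(".")[0]) >= 3

def make_array(**kwargs):
    if ZARR_V3:
        return zarr.create_array(zarr_format=2, **kwargs)
    else:
        return zarr.create(**kwargs)

def to_str(buffer):
    if ZARR_V3:
        return buffer.to_bytes().decode("utf-8")
    else:
        return buffer.decode("utf-8")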

def get_fill_value_line(store):
    return [line.strip(" ,")
            for line in to_str(store[".zarray"]).split("\n")
            if "fill_value" in line][0]

store = {}
z = make_array(
    store=store,
    shape=(1,),
    chunks=(1,),
    dtype="f4",
    fill_value=np.nan,
)
print(get_fill_value_line(store))

When running in a zarr>=3 environment:

micromamba create -n zarr3 python 'zarr>=3'
micromamba run -n zarr3 python3 check_nan_encoding.py

the following is printed:

zarr version 3.0.1
numcodecs version 0.15.0
python version 3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:19:53) [Clang 18.1.8 ]
"fill_value": NaN

Which is not a valid JSON encoding.
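
Python's own json module accepts a bare NaN token by default, so the problem only shows up with stricter parsers. As a rough sketch, a strict parser can be emulated with the parse_constant hook of json.loads; the function name is_strict_json is a hypothetical helper for illustration only.

```python
import json


def _reject_constant(name):
    # Called by the json module for the tokens NaN, Infinity and -Infinity.
    raise ValueError(f"non-standard JSON constant: {name}")


def is_strict_json(text):
    """Hypothetical helper: True if text parses without NaN/Infinity tokens."""
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError:
        return False
    return True


print(is_strict_json('{"fill_value": NaN}'))    # False
print(is_strict_json('{"fill_value": "NaN"}'))  # True
```

A check along these lines could serve as a regression test on the raw .zarray bytes, since it does not depend on how the metadata was produced.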


When running the same code in a zarr>=2,<3 environment:

micromamba env create -n zarr2 -c conda-forge python 'zarr>=2,<3'
micromamba run -n zarr2 python3 check_nan_encoding.py
zarr version 2.18.4
numcodecs version 0.15.0
python version 3.13.1 | packaged by conda-forge | (main, Jan 13 2025, 09:45:31) [Clang 18.1.8 ]
"fill_value": "NaN"

Which is fine according to the spec.

Additional output

No response

@d70-t d70-t added the bug Potential issues with the zarr-python library label Jan 21, 2025
@d-v-b
Contributor

d-v-b commented Jan 21, 2025

thanks for this report, that's a pretty glaring error. we need to fix it and add tests. for zarr v3, the relevant code block is here, and for v2 it's here. Some opportunities for refactoring here I think.
