Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clib.conversion._to_numpy: Add tests for pandas.Series with pyarrow numeric dtypes #3585

Merged
merged 18 commits into from
Nov 12, 2024

Conversation

seisman
Copy link
Member

@seisman seisman commented Nov 5, 2024

This PR adds tests for pandas.Series with pyarrow numeric dtypes. Available pyarrow numeric dtypes are at https://arrow.apache.org/docs/python/api/datatypes.html.

Similar to the issue reported in #3584, there are also dtype conversion behavior changes in pandas 2.2.

With pandas>=2.2, everything works as expected:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: x = pd.Series([1, 2, 3], dtype="int32[pyarrow]")

In [4]: np.ascontiguousarray(x)
Out[4]: array([1, 2, 3], dtype=int32)

In [5]: x = pd.Series([1, pd.NA, 3], dtype="int32[pyarrow]")

In [6]: np.ascontiguousarray(x)
Out[6]: array([ 1., nan,  3.])

With pandas 2.1, missing values needs to convert to numpy float dtype explicitly.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: x = pd.Series([1, 2, 3], dtype="int32[pyarrow]")

In [4]: np.ascontiguousarray(x)
Out[4]: array([1, 2, 3], dtype=int32)

In [5]: x = pd.Series([1, pd.NA, 3], dtype="int32[pyarrow]")

In [6]: np.ascontiguousarray(x)
Out[6]: array([1, <NA>, 3], dtype=object)

@seisman seisman force-pushed the to_numpy/pyarrow_dtypes branch 4 times, most recently from 2a2ab7a to 69b44ad Compare November 6, 2024 14:54
@seisman seisman force-pushed the to_numpy/pandas_numeric branch 2 times, most recently from 6b77f42 to 7222db2 Compare November 7, 2024 00:17
@seisman seisman changed the base branch from to_numpy/pandas_numeric to main November 7, 2024 13:43
@seisman seisman marked this pull request as ready for review November 7, 2024 13:45
@seisman seisman changed the title WIP: clib.conversion._to_numpy: Add tests for pandas.Series with pyarrow numeric dtypes and pyarrow arrays clib.conversion._to_numpy: Add tests for pandas.Series and pyarrow.arrays with pyarrow numeric dtypes Nov 7, 2024
@seisman seisman changed the title clib.conversion._to_numpy: Add tests for pandas.Series and pyarrow.arrays with pyarrow numeric dtypes clib.conversion._to_numpy: Add tests for pandas.Series and pyarrow.array with pyarrow numeric dtypes Nov 7, 2024
# PyArrow dtypes can be specified using the following formats:
#
# - Prefixed with the name of the dtype and "[pyarrow]" (e.g., "int8[pyarrow]")
# - Specified using ``ArrowDType`` (e.g., "pd.ArrowDtype(pa.int8())")
Copy link
Member Author

@seisman seisman Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't use pd.ArrowDtype(pa.int8()) here because pa is not defined when pyarrow is not installed. So we have to use the string aliases.

pytest.param("float64[pyarrow]", np.float64, id="float64[pyarrow]"),
],
)
def test_to_numpy_pandas_series_pyarrow_dtypes_numeric(dtype, expected_dtype):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is exactly the same as the test test_to_numpy_pandas_series_numpy_dtypes_numeric above, but I don't think we should merge them into a single test.

To merge them into a single test, we have to change the pytest.param to

pytest.param("float64[pyarrow]", np.float64, id="float64[pyarrow]", marks=skip_if_no(package="pyarrow")),

which is too long to fit in one line and will make the pytest.params too long to read.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Managed to fit it into a single test like so:

from pygmt.helpers.testing import skip_if_no

pa_marks = {"marks": skip_if_no(package="pyarrow")}

@pytest.mark.parametrize(
    ("dtype", "expected_dtype"),
    [
        ...,
        pytest.param("int8[pyarrow]", np.int8, id="int8[pyarrow]", **pa_marks),
        pytest.param("int16[pyarrow]", np.int16, id="int16[pyarrow]", **pa_marks),
        pytest.param("int32[pyarrow]", np.int32, id="int32[pyarrow]", **pa_marks),
        pytest.param("int64[pyarrow]", np.int64, id="int64[pyarrow]", **pa_marks),
        pytest.param("uint8[pyarrow]", np.uint8, id="uint8[pyarrow]", **pa_marks),
        pytest.param("uint16[pyarrow]", np.uint16, id="uint16[pyarrow]", **pa_marks),
        pytest.param("uint32[pyarrow]", np.uint32, id="uint32[pyarrow]", **pa_marks),
        pytest.param("uint64[pyarrow]", np.uint64, id="uint64[pyarrow]", **pa_marks),
        pytest.param("float16[pyarrow]", np.float16, id="float16[pyarrow]", **pa_marks),
        pytest.param("float32[pyarrow]", np.float32, id="float32[pyarrow]", **pa_marks),
        pytest.param("float64[pyarrow]", np.float64, id="float64[pyarrow]", **pa_marks),
    ],
)

The longest line is just under 88 characters.

Copy link
Member Author

@seisman seisman Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice changes. We probably should merge this PR into #3584, so that we can test all pandas dtypes (including pyarrow-backed) in a single PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I see that this branch is targetting to_numpy/pandas_numeric/#3584, so you can merge this in then.

pygmt/tests/test_clib_to_numpy.py Show resolved Hide resolved
pygmt/tests/test_clib_to_numpy.py Outdated Show resolved Hide resolved
# In PyArrow, array types can be specified in two ways:
#
# - Using string aliases (e.g., "int8")
# - Using pyarrow.DataType (e.g., ``pa.int8()``)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, we can't use pa.int8() here because pa is undefined if pyarrow is not installed.

@seisman seisman changed the title clib.conversion._to_numpy: Add tests for pandas.Series and pyarrow.array with pyarrow numeric dtypes clib.conversion._to_numpy: Add tests for pandas.Series with pyarrow numeric dtypes Nov 8, 2024
@seisman seisman marked this pull request as ready for review November 9, 2024 08:56
@seisman seisman added maintenance Boring but important stuff for the core devs skip-changelog Skip adding Pull Request to changelog needs review This PR has higher priority and needs review. labels Nov 9, 2024
@seisman seisman added this to the 0.14.0 milestone Nov 9, 2024
@seisman seisman changed the base branch from main to to_numpy/pandas_numeric November 10, 2024 04:49
@seisman seisman removed the needs review This PR has higher priority and needs review. label Nov 10, 2024
@seisman seisman added the needs review This PR has higher priority and needs review. label Nov 11, 2024
@seisman seisman merged commit a65f6ae into to_numpy/pandas_numeric Nov 12, 2024
14 of 16 checks passed
@seisman seisman deleted the to_numpy/pyarrow_dtypes branch November 12, 2024 00:53
@seisman seisman removed the needs review This PR has higher priority and needs review. label Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs skip-changelog Skip adding Pull Request to changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants