Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clib.conversion._to_numpy: Add tests for pyarrow.array with pyarrow numeric types #3599

Merged
merged 4 commits into from
Nov 9, 2024

Conversation

seisman
Copy link
Member

@seisman seisman commented Nov 8, 2024

Description of proposed changes

This PR adds tests for _to_numpy to check if it works for PyArrow arrays with pyarrow numeric dtypes.

PyArrow provides 11 numeric dtypes (https://arrow.apache.org/docs/python/api/datatypes.html): int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, and float64. All can be converted to numpy dtype without issues.

Here is a minimal example to play with pyarrow types.

>>> import numpy as np
>>> import pyarrow as pa
>>> x = pa.array([1, 2, 3], type=pa.int16())
>>> x
<pyarrow.lib.Int16Array object at 0x7fc28a73fb80>
[
  1,
  2,
  3
]
>>> np.ascontiguousarray(x)
array([1, 2, 3], dtype=int16)

>>> x = pa.array([1, 2, None], type=pa.int16())
>>> x
<pyarrow.lib.Int16Array object at 0x7fc28a7ba620>
[
  1,
  2,
  null
]

>>> np.ascontiguousarray(x)
array([ 1.,  2., nan])

Please note that pa.float16 needs special handling:

>>> x = pa.array([1.0, 2.0, 3.0], type=pa.float16())
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Cell In[9], line 1
----> 1 x = pa.array([1.0, 2.0, 3.0], type=pa.float16())

File ~/opt/miniforge/envs/pygmt/lib/python3.12/site-packages/pyarrow/array.pxi:370, in pyarrow.lib.array()

File ~/opt/miniforge/envs/pygmt/lib/python3.12/site-packages/pyarrow/array.pxi:42, in pyarrow.lib._sequence_to_array()

File ~/opt/miniforge/envs/pygmt/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/opt/miniforge/envs/pygmt/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowTypeError: Expected np.float16 instance

>>> # The solution is from https://arrow.apache.org/docs/python/generated/pyarrow.float16.html#pyarrow.float16
>>> x = pa.array(np.array([1.0, 2.0, 3.0], dtype=np.float16), type=pa.float16())
>>> x
<pyarrow.lib.HalfFloatArray object at 0x7fc279308b20>
[
  15360,
  16384,
  16896
]
>>> # The numbers above look weird but when converting to numpy, they're correct
>>> np.ascontiguousarray(x)
array([1., 2., 3.], dtype=float16)

@seisman seisman added maintenance Boring but important stuff for the core devs skip-changelog Skip adding Pull Request to changelog needs review This PR has higher priority and needs review. labels Nov 8, 2024
@seisman seisman added this to the 0.14.0 milestone Nov 8, 2024
@seisman
Copy link
Member Author

seisman commented Nov 8, 2024

Please review this PR first before PRs #3584, #3585, #3596, since this PR is not affected by the behavior changes between pandas <2.2 and pandas >= 2.2. Then we can better know if workarounds in #3596 have side-effects.

seisman added a commit that referenced this pull request Nov 8, 2024
@seisman
Copy link
Member Author

seisman commented Nov 8, 2024

I'm not sure why the Python 3.10 job keeps failing on macOS.

pygmt/tests/test_clib_to_numpy.py Show resolved Hide resolved
pygmt/tests/test_clib_to_numpy.py Show resolved Hide resolved
Comment on lines 167 to 169
# PyArrow provides the following dtypes:
#
# - Numeric dtypes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor detail, but PyArrow calls them types instead of dtypes, e.g. at https://arrow.apache.org/docs/python/data.html#type-metadata. So should we call them types instead of dtypes?

Suggested change
# PyArrow provides the following dtypes:
#
# - Numeric dtypes:
# PyArrow provides the following types:
#
# - Numeric types:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also noticed the difference. I'm OK with using type in the comments, but for the tests, currently, the test name is:

test_to_numpy_pyarrow_array_pyarrow_dtypes_numeric(dtype, expected_dtype):

Strictly speaking, it should be:

test_to_numpy_pyarrow_array_pyarrow_types_numeric(type, expected_type):

But I feel the minor changes in the tests are unnecessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think they use 'type' in the comments, but use 'dtype' in the function name to be consistent with the other unit tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, in 111738c, I've updated 'dtype' to 'type' for comments about PyArrow and left the tests unchanged.

pygmt/tests/test_clib_to_numpy.py Outdated Show resolved Hide resolved
pygmt/tests/test_clib_to_numpy.py Outdated Show resolved Hide resolved
@seisman seisman merged commit 189f376 into main Nov 9, 2024
17 of 18 checks passed
@seisman seisman deleted the to_numpy/pyarrow_array branch November 9, 2024 08:29
@seisman seisman removed the needs review This PR has higher priority and needs review. label Nov 9, 2024
@seisman seisman changed the title clib.conversion._to_numpy: Add tests for pyarrow.array with pyarrow numeric dtypes clib.conversion._to_numpy: Add tests for pyarrow.array with pyarrow numeric types Nov 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs skip-changelog Skip adding Pull Request to changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants