GH1033 Add overloads of engine for pd.read_json #1035

Open
wants to merge 1 commit into base: main
Conversation

loicdiridollou (Contributor, Author):
For engine="pyarrow", you are forced to pass lines=True, and you cannot pass a StringIO buffer; it has to be a file path or a ReadBuffer[bytes].
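A minimal runtime illustration of that constraint (assuming pandas is installed; the engine="pyarrow" call is shown commented out because it additionally requires the pyarrow package):

```python
import io

import pandas as pd

data = b'{"a": 1}\n{"a": 2}\n'

# The default ujson engine happily accepts a StringIO buffer:
df = pd.read_json(io.StringIO(data.decode()), lines=True)
print(df["a"].tolist())  # [1, 2]

# engine="pyarrow" requires lines=True and a file path or a bytes buffer;
# a StringIO source is rejected (and the pyarrow package must be installed):
# df = pd.read_json(io.BytesIO(data), lines=True, engine="pyarrow")
```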

@loicdiridollou loicdiridollou changed the title GHXXX Add overloads of engine for pd.read_json GH1033 Add overloads of engine for pd.read_json Nov 16, 2024
Dr-Irv (Collaborator) left a comment:

I'm a little concerned about the misuse of ellipses with default arguments. An ellipsis should only be used when the argument is optional; when you want a specific result to happen as a result of the argument being specified, you don't use an ellipsis. The overloads that require values to be specified (i.e., the ones without ellipses) should come before the ones that use ellipses, and the ones with ellipses should have "broad" types. So writing something like engine: Literal["pyarrow"] = ... can't be correct: the default value of engine is "ujson", so if the stub is to work without that parameter being specified, it would have to be engine: Literal["ujson", "pyarrow"] = ... .
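The ordering rule can be sketched with a toy pair of overloads. This is a hypothetical, heavily simplified stand-in, not the actual pandas-stubs signatures (the real function returns a DataFrame and has many more parameters): the overload that requires engine="pyarrow" carries no ellipsis defaults and comes first, while the catch-all overload that also matches calls omitting the argument uses the broad default type.

```python
from typing import Literal, overload


@overload
def read_json(path: str, *, lines: Literal[True], engine: Literal["pyarrow"]) -> str: ...
@overload
def read_json(
    path: str, *, lines: bool = ..., engine: Literal["ujson", "pyarrow"] = ...
) -> str: ...
def read_json(path: str, *, lines: bool = False, engine: str = "ujson") -> str:
    # Toy implementation so the sketch is runnable and checkable.
    return f"{engine}:{lines}"
```

A call that omits both keyword arguments falls through to the second, broad overload, while read_json(p, lines=True, engine="pyarrow") is matched by the first, stricter one.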

Comment on lines +145 to +151
lines: Literal[True] = ...,
chunksize: None = ...,
compression: CompressionOptions = ...,
nrows: int | None = ...,
storage_options: StorageOptions = ...,
dtype_backend: DtypeBackend | NoDefault = ...,
engine: Literal["pyarrow"] = ...,
Collaborator:

Since the default value of lines is False, having ... for both lines and engine means that you don't have to specify either. So I think you don't want the ellipses here on either argument.

@@ -72,6 +98,7 @@ def read_json(
nrows: int | None = ...,
storage_options: StorageOptions = ...,
dtype_backend: DtypeBackend | NoDefault = ...,
engine: Literal["pyarrow"] = ...,
Collaborator:

I think the ellipses here should be removed.

lines: Literal[True],
chunksize: int,
compression: CompressionOptions = ...,
nrows: int | None = ...,
storage_options: StorageOptions = ...,
dtype_backend: DtypeBackend | NoDefault = ...,
engine: Literal["pyarrow"] = ...,
Collaborator:

Remove the ellipses.

Comment on lines +195 to +201
lines: Literal[True] = ...,
chunksize: None = ...,
compression: CompressionOptions = ...,
nrows: int | None = ...,
storage_options: StorageOptions = ...,
dtype_backend: DtypeBackend | NoDefault = ...,
engine: Literal["pyarrow"] = ...,
Collaborator:

Same comment about the ellipses.

Comment on lines +1646 to +1652
check(
assert_type(
pd.read_json(dd, lines=True, engine="pyarrow"),
pd.DataFrame,
),
pd.DataFrame,
)
Collaborator:

You should add a test with TYPE_CHECKING_INVALID_USAGE that makes sure we disallow lines=False with engine="pyarrow".
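A hedged sketch of what such a negative test could look like. The pattern guards the deliberately invalid call behind a constant that is False at runtime, so the call is only seen by the type checker; the constant is defined inline here for self-containment (pandas-stubs provides its own in the test helpers), and the file name is illustrative:

```python
from typing import TYPE_CHECKING

import pandas as pd

# False at runtime, True for the type checker: code under this guard is
# type-checked but never executed.
TYPE_CHECKING_INVALID_USAGE = TYPE_CHECKING

if TYPE_CHECKING_INVALID_USAGE:
    # Should be flagged by the checker: the pyarrow engine only supports
    # lines=True, so this call must not match any overload.
    pd.read_json("data.json", lines=False, engine="pyarrow")  # type: ignore[call-overload]
```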

Successfully merging this pull request may close these issues:

Added new argument engine to read_json() to support parsing JSON with pyarrow by specifying engine="pyarrow"