ResultSet `fetchmany_arrow`/`fetchall_arrow` methods fail during `concat_tables` #418
Hi @ksofeikov! Thank you for reporting this issue and attaching the stacktrace. Before we go deeper into fixing the code, I would like to understand how it ended up that two pieces of the same result set have different schemas. Is it possible to create a synthetic example that reproduces this issue, so you don't have to reveal your data? It would help us check whether the `fetchmany_arrow`/`fetchall_arrow` methods fail during `concat_tables`.
@kravets-levko tbh, I'm not sure how to create a reproducible example here, since the cursor just bulk-reads the table from the store. Since this happens during the result read-out stage, it's not really something that touches the client code or is controlled by me. What I can do is maybe try to stop at a breakpoint and see how the cloud fetch gets a table with a different schema. There is another pointer I could give, I guess. Let me know if I completely misunderstood it, but I remember seeing a similar pyarrow schema problem with …; however, memory might be letting me down here :)
I suspect one way to create an example is to craft two arrow files with one column each, where one has no null values in that column and the other one has nulls there. Then try to read those files as if they were the query result; this will probably trigger it.
Yeah, seems like the null assumption was correct. A minimal working example built that way will throw the same error.
Also, the root cause of this is that strings are backed by Python …, which generates …
@ksofeikov Yes, I totally understand what the error means. What I don't understand is why it happens. In your example, you literally created two different tables with different schemas. However, when you run a SQL query you get a single result set, all rows of which should have the same schema. Even if you …, I have some suspicions about where the schema may potentially get changed while reading the data from the server, and I will try to reproduce your issue. Meanwhile, I want to ask you to check some other things: …
Just checked …
Changing the env from … to … helped. Is there a way to verify the env version on the client, just to make sure it was picked up after changing the compute on the platform? Regardless, there are a couple of suitable workarounds for now, so good luck with the fix!
It's good that the workaround helped and you can continue using the library while we're looking for the fix. Too bad that we have another issue with CloudFetch 🤔 Anyway, thank you for the bug report and all your help, which allowed us to narrow down the scope of the issue (even though I still need to reproduce it 🙂). Will keep you posted on any updates or other questions.
I just got bitten by this as well. FWIW, no pandas was involved in my use case, just SQL. It's good to know that disabling cloud fetch is a workaround.
Hi there,
I'm using this client library to fetch large amounts of data from our DBX environment. The version I'm using is `3.3.0`. The library keeps crashing when it attempts to concatenate the current and partial results. I cannot attach the full trace, because it contains some of the internal schemas, but here is the gist of it: …
The exact stack trace is …
The data is coming through a cursor like this: …
The source table is created through a CTAS statement, so all fields are nullable by default. I have found two ways to resolve the issue:
1. Change the concatenation to `results = pyarrow.concat_tables([results, partial_results], promote_options="permissive")`, i.e. set the promote options to `permissive` so pyarrow can marry the two schemas, or
2. Downgrade to `2.9.6`.
I checked the `2.9.6` source code and it does not seem to be using a permissive schema casting, so this seems like a regression in this case. I'm not sure if I can add anything else beyond that, but do let me know.
And to be clear, I request about 100k records at a time there, and can iterate through roughly 95k of them before it fails. So I'm not really sure there is a reliable way to reproduce that.
If the cluster runtime matters: `13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)`.
Thanks!