Unable to connect to Redshift using adbc_driver_postgresql #1563

Open
j0bekt01 opened this issue Feb 23, 2024 · 12 comments
Labels
Type: bug Something isn't working

Comments

@j0bekt01

j0bekt01 commented Feb 23, 2024

What happened?

I was attempting to create a connection to Redshift using adbc_driver_postgresql, but I get this error: IO: [libpq] Failed to build type mapping table: ERROR: column "typarray" does not exist in pg_type.
I am aware that Redshift is based on an earlier version of Postgres. Is it currently possible to connect to Redshift? If not, are there plans to do so?
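
For reference, a minimal sketch that triggers the error (the endpoint and credentials below are placeholders):

```python
# Minimal sketch of the failing call; endpoint and credentials are placeholders.
import adbc_driver_postgresql.dbapi

uri = "postgresql://user:password@my-cluster.redshift.example.com:5439/dev"

# The driver inspects pg_type at connection time to build its Arrow type
# mapping table, which is where the error above is raised.
conn = adbc_driver_postgresql.dbapi.connect(uri)
```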

Environment/Setup
- Kernel: 4.14.336-253.554.amzn2.x86_64
- Python 3.9.18
- adbc-driver-manager 0.10.0
- adbc-driver-postgresql 0.10.0
- pyarrow 14.0.2

j0bekt01 added the Type: bug label Feb 23, 2024
@lidavidm
Member

Hmm, it looks like Redshift speaks the PostgreSQL wire protocol but isn't quite compatible, so we'd have to handle it separately. Even then, maintaining CI would be a challenge, and it appears Redshift is based on PostgreSQL 8? We've had issues with older PostgreSQL versions in the past, too.

@lidavidm
Member

Specifically for the error here, we are querying the database metadata at startup to figure out which column types map to which Arrow types, but Redshift appears to not provide one of the required columns.
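
Illustratively, the startup logic depends on a catalog query along these lines (a simplified sketch, not the driver's exact query); it succeeds on stock PostgreSQL but fails on Redshift because its pg_type has no typarray column:

```python
# Illustrative sketch only; the driver's real startup query is more involved.
import adbc_driver_postgresql.dbapi

with adbc_driver_postgresql.dbapi.connect("postgresql://localhost:5432/postgres") as conn:
    with conn.cursor() as cur:
        # typarray holds the OID of each type's array type; Redshift's
        # pg_type catalog does not expose this column.
        cur.execute("SELECT oid, typname, typarray FROM pg_type LIMIT 5")
        print(cur.fetchall())
```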

@Sairam90

Sairam90 commented Oct 3, 2024

Any updates on this issue?

@paleolimbot
Member

We can probably solve the specific issue of not having the `typarray` column. I don't have a way to test on Redshift, but I would be happy to look into it given a way to do so!

@gtomitsuka

gtomitsuka commented Oct 4, 2024

We're blocked on a transition to ADBC by this; TextQL is happy to provide access to one of our demo data warehouses with its default sample database for a few weeks – it's serverless, so it may require a cold start sometimes, but it works very well. I assume you won't need more than a few thousand queries for this.

Who should I talk to about this? Feel free to reach out, [email protected].

I may be able to contribute myself in a month or so if the issue turns out to be much bigger than expected. This guide about the differences is pretty decent.

@paleolimbot
Member

I'll send you a note... I happen to be working on the Postgres driver right now and I'm happy to kick the tires a little to see if I can get it to work. To claim Redshift support long term, we would need a set of more permanent credentials that we can use in CI, but there are a few companies already doing this kind of testing that might be willing to share with us if we can demonstrate a POC 🙂

@gtomitsuka

gtomitsuka commented Oct 4, 2024

My CTO is currently on his honeymoon; I might be able to get long-term credentials for you too, but I would definitely need his approval first. Will keep you updated!

Responded to your email with credentials for you, might be able to get more for more people if needed.

@j0bekt01
Author

j0bekt01 commented Oct 4, 2024

Supporting Redshift would significantly boost the exposure and usage of this driver. In my experience working with databases, particularly Redshift, the primary bottleneck is extracting data quickly, and this support would greatly alleviate that issue. ConnectorX has gained popularity as the default engine for Polars, but it only supports read operations. For writing, one must use SQLAlchemy, which is less performant. Alternatively, you can use the Redshift COPY command, which involves dumping data to S3 via Arrow and then using the Redshift Data API to submit a COPY command. The ability to both read and write with the ADBC driver would be a game-changer.
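
For concreteness, that workaround looks roughly like this (bucket, cluster, table, and IAM role are placeholders, and error handling is omitted):

```python
# Hedged sketch of the S3 + COPY workaround; all resource names are placeholders.
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

# 1. Write the Arrow data to a local Parquet file and stage it on S3.
table = pa.table({"id": [1, 2, 3]})
pq.write_table(table, "batch.parquet")
boto3.client("s3").upload_file("batch.parquet", "my-bucket", "staging/batch.parquet")

# 2. Ask Redshift to ingest the staged file via the Data API.
boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="admin",
    Sql=(
        "COPY my_table FROM 's3://my-bucket/staging/batch.parquet' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' FORMAT AS PARQUET"
    ),
)
```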
I am happy to help contribute in any way I can.

@WillAyd
Contributor

WillAyd commented Oct 4, 2024

> Alternatively, you can use the Redshift COPY command, which involves dumping data to S3 via Arrow and then using the Redshift Data API to submit a COPY command.

I believe you can also just use the AWS SDK for pandas (previously known as awswrangler), which does this automatically for you. That may be the best of your intermediate options, although yeah, having ADBC support in the near term is probably best.
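
As a rough sketch (the Glue connection, bucket, and table names below are placeholders):

```python
# Hedged sketch using awswrangler; the Glue connection, bucket, and table
# names are placeholders.
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})

con = wr.redshift.connect("my-glue-connection")
wr.redshift.copy(          # stages to S3 and issues COPY under the hood
    df=df,
    path="s3://my-bucket/staging/",
    con=con,
    table="my_table",
    schema="public",
)
con.close()
```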

@j0bekt01
Author

j0bekt01 commented Oct 4, 2024

> > Alternatively, you can use the Redshift COPY command, which involves dumping data to S3 via Arrow and then using the Redshift Data API to submit a COPY command.
>
> I believe you can also just use the AWS SDK for pandas (previously known as awswrangler), which does this automatically for you. That may be the best of your intermediate options, although yeah, having ADBC support in the near term is probably best.

Yes, you are correct; awswrangler handles exactly this internally. However, I prefer using Polars over pandas, and I know conversion between the two is possible (see the sketch below). I just believe that supporting a premier data warehouse like Redshift would solidify this driver as a strong alternative to JDBC/ODBC drivers.
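
For example, a minimal round-trip sketch:

```python
# Minimal sketch of round-tripping between Polars and pandas, e.g. to hand
# data to awswrangler and convert results back.
import pandas as pd
import polars as pl

pl_df = pl.DataFrame({"id": [1, 2, 3]})
pd_df = pl_df.to_pandas()      # awswrangler expects a pandas DataFrame
back = pl.from_pandas(pd_df)   # convert results back to Polars
```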

@WillAyd
Contributor

WillAyd commented Oct 4, 2024

Does the Apache project get any promotional credits from cloud providers? While having someone donate access is nice, it might be safest if we had a dedicated Redshift instance to connect to over OIDC (ditto for Google Cloud and the BigQuery driver).

This article is pretty old, but I wonder if it still applies:

https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/

@lidavidm
Member

lidavidm commented Oct 7, 2024

I do not believe we get any credits, unfortunately.

paleolimbot added a commit that referenced this issue Nov 5, 2024
…dshift (#2219)

Just following up on #1563 to see if the missing `typarray` column is the only issue. Getting the details right might be a large project, but we might be able to support a basic connection without too much effort. Parameter binding and non-COPY result fetching seem to work... the default query fetch method (COPY) is not supported, `connection_get_info()` fails, and at a glance, `connection_get_objects()` might be returning incorrect results (and fails at the column depth).

``` r
library(adbcdrivermanager)

db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = Sys.getenv("ADBC_REDSHIFT_TEST_URI"),
  adbc.postgresql.load_array_types = FALSE
)

con <- db |> 
  adbc_connection_init()

stmt <- con |> 
  adbc_statement_init(adbc.postgresql.use_copy = FALSE)

stream <- nanoarrow::nanoarrow_allocate_array_stream()
stmt |> 
  adbc_statement_bind(data.frame(45)) |> 
  adbc_statement_set_sql_query("SELECT 1 + $1 as foofy, 'string' as foofy_string") |> 
  adbc_statement_execute_query(stream)
#> [1] -1

tibble::as_tibble(stream)
#> # A tibble: 1 × 2
#>   foofy foofy_string
#>   <dbl> <chr>       
#> 1    46 string
```

<sup>Created on 2024-10-04 with [reprex
v2.1.1](https://reprex.tidyverse.org)</sup>
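
A rough Python analogue of the reprex above might look like the following (untested sketch; the URI is a placeholder, and the option names are copied from the R example):

```python
# Untested sketch; URI is a placeholder, option names come from the R reprex.
import adbc_driver_postgresql.dbapi

conn = adbc_driver_postgresql.dbapi.connect(
    "postgresql://user:password@my-cluster.redshift.example.com:5439/dev",
    db_kwargs={"adbc.postgresql.load_array_types": "false"},
)
cur = conn.cursor()
# Disable COPY-based result fetching on the underlying statement.
cur.adbc_statement.set_options(**{"adbc.postgresql.use_copy": "false"})
cur.execute("SELECT 1 + 45 AS foofy, 'string' AS foofy_string")
print(cur.fetchall())
```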

---------

Co-authored-by: William Ayd <[email protected]>