Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

r: Support bool and dictionary<string> in AdbcStatementBind() #1008

Open
nbenn opened this issue Aug 30, 2023 · 8 comments
Open

r: Support bool and dictionary<string> in AdbcStatementBind() #1008

nbenn opened this issue Aug 30, 2023 · 8 comments

Comments

@nbenn
Copy link
Collaborator

nbenn commented Aug 30, 2023

Some examples:

library(adbcdrivermanager)

con <- adbc_connection_init(
  adbc_database_init(adbcsqlite::adbcsqlite(), uri = ":memory:")
)

write_adbc(data.frame(a = as.factor(letters)), con, "fct")
#> Error in adbc_statement_execute_query(stmt) :
#>   Column 0 has unsupported type dictionary
write_adbc(data.frame(a = c(TRUE, FALSE, NA)), con, "bol")
#> Error in adbc_statement_execute_query(stmt) :
#>   Column 0 has unsupported type bool
write_adbc(data.frame(a = as.difftime(1, units = "hours")), con, "dtm")
#> Error in adbc_statement_execute_query(stmt) :
#>   Column 0 has unsupported type duration

While the errors are correct in the sense that SQLite does not offer proper support for any of these types, DBI/RSQLite is more lenient:

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

DBI::dbWriteTable(con, "types", data.frame(
  fct = as.factor(letters[1:3]),
  bol = c(TRUE, FALSE, NA),
  dtm = as.difftime(1:3, units = "hours")
))

str(DBI::dbReadTable(con, "types"))
#> 'data.frame':	3 obs. of  3 variables:
#>  $ fct: chr  "a" "b" "c"
#>  $ bol: int  1 0 NA
#>  $ dtm: int  1 2 3
@nbenn nbenn changed the title r/adbcdrivermanager: how do we deal with type-incompatibilities r/adbcdrivermanager: how do we deal with type-incompatibilities? Aug 30, 2023
@nbenn
Copy link
Collaborator Author

nbenn commented Aug 31, 2023

@paleolimbot

@paleolimbot
Copy link
Member

I think it seems reasonable that a driver R package should have the opportunity to have some control over how a given data.frame() is converted to a nanoarrow_array_stream. Right now that call is just nanoarrow::as_nanoarrow_array_stream(), which doesn't leave any room for - as you noted - the database driver to negotiate types it does not support but could (either losslessly - like storing a logical as an integer - or by stripping some metadata with a warning - like for difftime).

Perhaps:

as_adbc_bind_array_stream <- function(x) {
  UseMethod("as_adbc_bind_array_stream")
}

as_adbc_bind_array_stream.default <- function(x) {
  nanoarrow::as_nanoarrow_array_stream(x)
}

The SQLite driver could then do:

as_adbc_bind_array_stream.adbcsqlite_statement <- function(x) {
  if (is.data.frame(x)) {
    # loop and reassign columns with unsupported types, possibly warning for
    # assignments that will loose information
  }

  nanoarrow::as_nanoarrow_array_stream(x)
}

@paleolimbot
Copy link
Member

@krlmlr You brought up the concept of roundtripping in a conversation offline: I think that may be more of an issue for conversion from the other side. In general, conversions to arrow (via nanoarrow or arrow) only happen without error if the vector (and any associated metadata - like timezone or unit) can be represented in Arrow losslessly. If you know the ptype that you want back, nanoarrow::convert_array_stream() should be able to losslessly roundtrip (e.g., convert_array_stream(result, to = data.frame(x = factor(levels = c("a", "b", ...))).

@nbenn
Copy link
Collaborator Author

nbenn commented Sep 1, 2023

@paleolimbot Yeah, I think something like this goes a long way for solving this type of issues.

As a more general remark: I don't think the current behavior is bad. It's just that I feel we need to give the user an option to opt in to (or out of) a more forgiving mechanism. We could keep the current behavior as "strict mode" (this might even be the default) and if strict mode not enabled (or disabled) something like the above kicks in where factors can end up as TEXT columns (in the case of SQLite). The down side of not being in "strict mode" is that you can't always round-trip: The TEXT column from before will yield a character vector if read back. Maybe keeping around "strict mode" addresses @krlmlr's comment as well: when enabled you will always be able to round trip (or error).

@ianmcook Is this the type of discussion you wanted us to have?

@paleolimbot
Copy link
Member

It should also be said that we could also:

  • Support dictionary-encoded columns in the drivers. It's a tiny bit of a pain for the drivers implemented in C but not impossible (basically ArrowArrayViewGetString(array_view->dictionary, ArrowArrayViewGetInt(array_view, i)).
  • Not use dictionary encoding to represent factors by default when converting in nanoarrow. That would be a somewhat large departure from the conversions that happen in the arrow package and I think is also slower.
  • Work around this by never using dictionary encoding in adbi when passing a data frame to adbcdrivermanager.

Example of how you could avoid dictionary encoding without changes to nanoarrow or driver R packages:

library(nanoarrow)

without_dictionary_encoding <- function(x) {
  x$children <- lapply(x$children, without_dictionary_encoding)
  
  if (is.null(x$dictionary)) {
    x
  } else {
    x$dictionary
  }
}

df <- palmerpenguins::penguins[1]
(schema <- infer_nanoarrow_schema(df))
#> <nanoarrow_schema struct>
#>  $ format    : chr "+s"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 0
#>  $ children  :List of 1
#>   ..$ species:<nanoarrow_schema dictionary(int32)<string>>
#>   .. ..$ format    : chr "i"
#>   .. ..$ name      : chr "species"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary:<nanoarrow_schema string>
#>   .. .. ..$ format    : chr "u"
#>   .. .. ..$ name      : chr ""
#>   .. .. ..$ metadata  : list()
#>   .. .. ..$ flags     : int 2
#>   .. .. ..$ children  : list()
#>   .. .. ..$ dictionary: NULL
#>  $ dictionary: NULL
as_nanoarrow_array(df)
#> <nanoarrow_array struct[344]>
#>  $ length    : int 344
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 1
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>  $ children  :List of 1
#>   ..$ species:<nanoarrow_array dictionary(int32)<string>[344]>
#>   .. ..$ length    : int 344
#>   .. ..$ null_count: int 0
#>   .. ..$ offset    : int 0
#>   .. ..$ buffers   :List of 2
#>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. ..$ :<nanoarrow_buffer data<int32>[344][1376 b]> `0 0 0 0 0 0 0 0 0 ...`
#>   .. ..$ dictionary:<nanoarrow_array string[3]>
#>   .. .. ..$ length    : int 3
#>   .. .. ..$ null_count: int 0
#>   .. .. ..$ offset    : int 0
#>   .. .. ..$ buffers   :List of 3
#>   .. .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. .. ..$ :<nanoarrow_buffer data_offset<int32>[4][16 b]> `0 6 15 21`
#>   .. .. .. ..$ :<nanoarrow_buffer data<string>[21 b]> `AdelieChinstrapGentoo`
#>   .. .. ..$ dictionary: NULL
#>   .. .. ..$ children  : list()
#>   .. ..$ children  : list()
#>  $ dictionary: NULL
as_nanoarrow_array(df, schema = without_dictionary_encoding(schema))
#> <nanoarrow_array struct[344]>
#>  $ length    : int 344
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 1
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>  $ children  :List of 1
#>   ..$ species:<nanoarrow_array string[344]>
#>   .. ..$ length    : int 344
#>   .. ..$ null_count: int 0
#>   .. ..$ offset    : int 0
#>   .. ..$ buffers   :List of 3
#>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. ..$ :<nanoarrow_buffer data_offset<int32>[345][1380 b]> `0 6 12 18 2...`
#>   .. .. ..$ :<nanoarrow_buffer data<string>[2268 b]> `AdelieAdelieAdelieAdel...`
#>   .. ..$ dictionary: NULL
#>   .. ..$ children  : list()
#>  $ dictionary: NULL

Created on 2023-09-01 with reprex v2.0.2

@paleolimbot paleolimbot changed the title r/adbcdrivermanager: how do we deal with type-incompatibilities? r: Support bool and dictionary<string> in AdbcStatementBind() Sep 14, 2023
@paleolimbot
Copy link
Member

We had a somewhat extensive offline conversation about this, and I think the action item we decided on was to add support for bool, dictionary<string>, and duration into the sqlite driver. I think that bool and dictionary<string> are straightforward (but I may have to punt on duration).

@lidavidm
Copy link
Member

And for reference duration should be doable in SQLite as an ISO8601 string, which would at least preserve the unit and be a little more recognizable than just a plain integer

@paleolimbot
Copy link
Member

Bool seems to be already supported; dictionary is implemented in #1224; blobs were already implemented but there was no way to roundtrip them until #1223.

Bool example:

library(adbcdrivermanager)

db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = ":memory:")
con <- adbc_connection_init(db)

df <- data.frame(x = c(TRUE, FALSE, NA))
write_adbc(df, con, "tbl")

read_adbc(con, "SELECT * from tbl") |> 
  as.data.frame()  
#>    x
#> 1  1
#> 2  0
#> 3 NA

Created on 2023-10-25 with reprex v2.0.2

lidavidm pushed a commit that referenced this issue Oct 26, 2023
…binary types (#1224)

This PR adds the ability to ingest dictionary-encoded string and binary
columns.

Part of addressing #1008.

From the R bindings:

``` r
library(adbcdrivermanager)

db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = ":memory:")
con <- adbc_connection_init(db)

df <- data.frame(x = factor(letters[1:10]))
write_adbc(df, con, "tbl")

read_adbc(con, "SELECT * from tbl") |> 
  as.data.frame()  
#>    x
#> 1  a
#> 2  b
#> 3  c
#> 4  d
#> 5  e
#> 6  f
#> 7  g
#> 8  h
#> 9  i
#> 10 j
```

<sup>Created on 2023-10-25 with [reprex
v2.0.2](https://reprex.tidyverse.org)</sup>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants