r: Support `bool` and `dictionary<string>` in `AdbcStatementBind()` #1008

nbenn · 2023-08-30T10:43:00Z

Some examples:

library(adbcdrivermanager)

con <- adbc_connection_init(
  adbc_database_init(adbcsqlite::adbcsqlite(), uri = ":memory:")
)

write_adbc(data.frame(a = as.factor(letters)), con, "fct")
#> Error in adbc_statement_execute_query(stmt) :
#>   Column 0 has unsupported type dictionary
write_adbc(data.frame(a = c(TRUE, FALSE, NA)), con, "bol")
#> Error in adbc_statement_execute_query(stmt) :
#>   Column 0 has unsupported type bool
write_adbc(data.frame(a = as.difftime(1, units = "hours")), con, "dtm")
#> Error in adbc_statement_execute_query(stmt) :
#>   Column 0 has unsupported type duration

While the errors are correct in the sense that SQLite does not offer proper support for any of these types, DBI/RSQLite is more lenient:

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

DBI::dbWriteTable(con, "types", data.frame(
  fct = as.factor(letters[1:3]),
  bol = c(TRUE, FALSE, NA),
  dtm = as.difftime(1:3, units = "hours")
))

str(DBI::dbReadTable(con, "types"))
#> 'data.frame':	3 obs. of  3 variables:
#>  $ fct: chr  "a" "b" "c"
#>  $ bol: int  1 0 NA
#>  $ dtm: int  1 2 3

nbenn · 2023-08-31T14:57:39Z

@paleolimbot

paleolimbot · 2023-08-31T19:47:42Z

I think it seems reasonable that a driver R package should have the opportunity to have some control over how a given data.frame() is converted to a nanoarrow_array_stream. Right now that call is just nanoarrow::as_nanoarrow_array_stream(), which doesn't leave any room for - as you noted - the database driver to negotiate types it does not support but could (either losslessly - like storing a logical as an integer - or by stripping some metadata with a warning - like for difftime).

Perhaps:

as_adbc_bind_array_stream <- function(x) {
  UseMethod("as_adbc_bind_array_stream")
}

as_adbc_bind_array_stream.default <- function(x) {
  nanoarrow::as_nanoarrow_array_stream(x)
}

The SQLite driver could then do:

as_adbc_bind_array_stream.adbcsqlite_statement <- function(x) {
  if (is.data.frame(x)) {
    # loop and reassign columns with unsupported types, possibly warning for
    # assignments that will loose information
  }

  nanoarrow::as_nanoarrow_array_stream(x)
}

paleolimbot · 2023-08-31T19:57:12Z

@krlmlr You brought up the concept of roundtripping in a conversation offline: I think that may be more of an issue for conversion from the other side. In general, conversions to arrow (via nanoarrow or arrow) only happen without error if the vector (and any associated metadata - like timezone or unit) can be represented in Arrow losslessly. If you know the ptype that you want back, nanoarrow::convert_array_stream() should be able to losslessly roundtrip (e.g., convert_array_stream(result, to = data.frame(x = factor(levels = c("a", "b", ...))).

nbenn · 2023-09-01T07:23:33Z

@paleolimbot Yeah, I think something like this goes a long way for solving this type of issues.

As a more general remark: I don't think the current behavior is bad. It's just that I feel we need to give the user an option to opt in to (or out of) a more forgiving mechanism. We could keep the current behavior as "strict mode" (this might even be the default) and if strict mode not enabled (or disabled) something like the above kicks in where factors can end up as TEXT columns (in the case of SQLite). The down side of not being in "strict mode" is that you can't always round-trip: The TEXT column from before will yield a character vector if read back. Maybe keeping around "strict mode" addresses @krlmlr's comment as well: when enabled you will always be able to round trip (or error).

@ianmcook Is this the type of discussion you wanted us to have?

paleolimbot · 2023-09-02T01:17:31Z

It should also be said that we could also:

Support dictionary-encoded columns in the drivers. It's a tiny bit of a pain for the drivers implemented in C but not impossible (basically ArrowArrayViewGetString(array_view->dictionary, ArrowArrayViewGetInt(array_view, i)).
Not use dictionary encoding to represent factors by default when converting in nanoarrow. That would be a somewhat large departure from the conversions that happen in the arrow package and I think is also slower.
Work around this by never using dictionary encoding in adbi when passing a data frame to adbcdrivermanager.

Example of how you could avoid dictionary encoding without changes to nanoarrow or driver R packages:

library(nanoarrow)

without_dictionary_encoding <- function(x) {
  x$children <- lapply(x$children, without_dictionary_encoding)
  
  if (is.null(x$dictionary)) {
    x
  } else {
    x$dictionary
  }
}

df <- palmerpenguins::penguins[1]
(schema <- infer_nanoarrow_schema(df))
#> <nanoarrow_schema struct>
#>  $ format    : chr "+s"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 0
#>  $ children  :List of 1
#>   ..$ species:<nanoarrow_schema dictionary(int32)<string>>
#>   .. ..$ format    : chr "i"
#>   .. ..$ name      : chr "species"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary:<nanoarrow_schema string>
#>   .. .. ..$ format    : chr "u"
#>   .. .. ..$ name      : chr ""
#>   .. .. ..$ metadata  : list()
#>   .. .. ..$ flags     : int 2
#>   .. .. ..$ children  : list()
#>   .. .. ..$ dictionary: NULL
#>  $ dictionary: NULL
as_nanoarrow_array(df)
#> <nanoarrow_array struct[344]>
#>  $ length    : int 344
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 1
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>  $ children  :List of 1
#>   ..$ species:<nanoarrow_array dictionary(int32)<string>[344]>
#>   .. ..$ length    : int 344
#>   .. ..$ null_count: int 0
#>   .. ..$ offset    : int 0
#>   .. ..$ buffers   :List of 2
#>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. ..$ :<nanoarrow_buffer data<int32>[344][1376 b]> `0 0 0 0 0 0 0 0 0 ...`
#>   .. ..$ dictionary:<nanoarrow_array string[3]>
#>   .. .. ..$ length    : int 3
#>   .. .. ..$ null_count: int 0
#>   .. .. ..$ offset    : int 0
#>   .. .. ..$ buffers   :List of 3
#>   .. .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. .. ..$ :<nanoarrow_buffer data_offset<int32>[4][16 b]> `0 6 15 21`
#>   .. .. .. ..$ :<nanoarrow_buffer data<string>[21 b]> `AdelieChinstrapGentoo`
#>   .. .. ..$ dictionary: NULL
#>   .. .. ..$ children  : list()
#>   .. ..$ children  : list()
#>  $ dictionary: NULL
as_nanoarrow_array(df, schema = without_dictionary_encoding(schema))
#> <nanoarrow_array struct[344]>
#>  $ length    : int 344
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 1
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>  $ children  :List of 1
#>   ..$ species:<nanoarrow_array string[344]>
#>   .. ..$ length    : int 344
#>   .. ..$ null_count: int 0
#>   .. ..$ offset    : int 0
#>   .. ..$ buffers   :List of 3
#>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. ..$ :<nanoarrow_buffer data_offset<int32>[345][1380 b]> `0 6 12 18 2...`
#>   .. .. ..$ :<nanoarrow_buffer data<string>[2268 b]> `AdelieAdelieAdelieAdel...`
#>   .. ..$ dictionary: NULL
#>   .. ..$ children  : list()
#>  $ dictionary: NULL

^{Created on 2023-09-01 with reprex v2.0.2}

paleolimbot · 2023-09-14T16:37:04Z

We had a somewhat extensive offline conversation about this, and I think the action item we decided on was to add support for bool, dictionary<string>, and duration into the sqlite driver. I think that bool and dictionary<string> are straightforward (but I may have to punt on duration).

lidavidm · 2023-09-14T16:47:49Z

And for reference duration should be doable in SQLite as an ISO8601 string, which would at least preserve the unit and be a little more recognizable than just a plain integer

paleolimbot · 2023-10-25T18:03:23Z

Bool seems to be already supported; dictionary is implemented in #1224; blobs were already implemented but there was no way to roundtrip them until #1223.

Bool example:

library(adbcdrivermanager)

db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = ":memory:")
con <- adbc_connection_init(db)

df <- data.frame(x = c(TRUE, FALSE, NA))
write_adbc(df, con, "tbl")

read_adbc(con, "SELECT * from tbl") |> 
  as.data.frame()  
#>    x
#> 1  1
#> 2  0
#> 3 NA

^{Created on 2023-10-25 with reprex v2.0.2}

…binary types (#1224) This PR adds the ability to ingest dictionary-encoded string and binary columns. Part of addressing #1008. From the R bindings: ``` r library(adbcdrivermanager) db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = ":memory:") con <- adbc_connection_init(db) df <- data.frame(x = factor(letters[1:10])) write_adbc(df, con, "tbl") read_adbc(con, "SELECT * from tbl") |> as.data.frame() #> x #> 1 a #> 2 b #> 3 c #> 4 d #> 5 e #> 6 f #> 7 g #> 8 h #> 9 i #> 10 j ``` <sup>Created on 2023-10-25 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

nbenn changed the title ~~r/adbcdrivermanager: how do we deal with type-incompatibilities~~ r/adbcdrivermanager: how do we deal with type-incompatibilities? Aug 30, 2023

paleolimbot changed the title ~~r/adbcdrivermanager: how do we deal with type-incompatibilities?~~ r: Support bool and dictionary<string> in AdbcStatementBind() Sep 14, 2023

paleolimbot mentioned this issue Oct 25, 2023

feat(c/driver/sqlite): Support binding dictionary-encoded string and binary types #1224

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

r: Support `bool` and `dictionary<string>` in `AdbcStatementBind()` #1008

r: Support `bool` and `dictionary<string>` in `AdbcStatementBind()` #1008

nbenn commented Aug 30, 2023

nbenn commented Aug 31, 2023

paleolimbot commented Aug 31, 2023

paleolimbot commented Aug 31, 2023

nbenn commented Sep 1, 2023

paleolimbot commented Sep 2, 2023

paleolimbot commented Sep 14, 2023

lidavidm commented Sep 14, 2023

paleolimbot commented Oct 25, 2023

r: Support bool and dictionary<string> in AdbcStatementBind() #1008

r: Support bool and dictionary<string> in AdbcStatementBind() #1008

Comments

nbenn commented Aug 30, 2023

nbenn commented Aug 31, 2023

paleolimbot commented Aug 31, 2023

paleolimbot commented Aug 31, 2023

nbenn commented Sep 1, 2023

paleolimbot commented Sep 2, 2023

paleolimbot commented Sep 14, 2023

lidavidm commented Sep 14, 2023

paleolimbot commented Oct 25, 2023

r: Support `bool` and `dictionary<string>` in `AdbcStatementBind()` #1008

r: Support `bool` and `dictionary<string>` in `AdbcStatementBind()` #1008