feat: Throw exception when non-utf8 characters are in a data.frame #16

Tmonster · 2023-09-22T13:12:40Z

The duckdb engine assumes all strings are valid UTF-8. In the R client, we forgot to check whether strings are in fact valid UTF-8. Here we check them while scanning the data frame when we register it.
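For readers who want to see the idea at the R level, here is a minimal sketch of such a check. It is not the code in this PR (the PR does the check inside the C++ scan); it assumes base R's `validUTF8()` is an acceptable stand-in, and the helper names `find_invalid_utf8_columns()` and `assert_utf8_data_frame()` are hypothetical.

```r
# Hypothetical helper (not part of duckdb-r): returns the names of character
# columns that contain at least one string that is not valid UTF-8.
find_invalid_utf8_columns <- function(df) {
  stopifnot(is.data.frame(df))
  is_chr <- vapply(df, is.character, logical(1))
  bad <- vapply(
    df[is_chr],
    function(col) !all(validUTF8(col[!is.na(col)])),  # skip NA values
    logical(1)
  )
  names(bad)[bad]
}

# Hypothetical wrapper: fail early instead of handing bad bytes to the engine.
assert_utf8_data_frame <- function(df) {
  bad <- find_invalid_utf8_columns(df)
  if (length(bad) > 0) {
    stop("Invalid UTF-8 in column(s): ", paste(bad, collapse = ", "), call. = FALSE)
  }
  invisible(df)
}

# Latin-1 bytes for "b\xf6rk", built from raw bytes to stay locale-independent.
broken <- rawToChar(as.raw(c(0x62, 0xf6, 0x72, 0x6b)))
df <- data.frame(
  id = 1:2,
  ok = c("a", "b"),
  bad = c("fine", broken),
  stringsAsFactors = FALSE
)

find_invalid_utf8_columns(df)
```

Calling `assert_utf8_data_frame(df)` on this data frame raises an error that names the `bad` column, which is roughly the user-facing behavior the PR title describes.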

Closes #12.

krlmlr · 2023-11-08T12:59:21Z

Thanks. Are there any other locations where we need to care about this? I'm thinking mainly about parameter binding via dbBind() or dbSendQuery(params = ...).

krlmlr · 2024-03-11T04:38:34Z

Coming back to this.

What are the costs of this preemptive check? I assume that the code must read the entire string? What are the consequences of avoiding this check?

Example for constructing a broken string:

bork <- iconv("börk", to = "latin1")
Encoding(bork) <- "UTF-8"
bork
#> [1] "b\xf6rk"

Created on 2024-03-11 with reprex v2.1.0
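To make the consequence concrete, here is a small sketch showing that the encoding mark and the actual bytes can disagree. It assumes base R's `validUTF8()` as the byte-level check (the PR itself relies on utf8proc on the C++ side), and it builds the string from raw bytes so the result does not depend on the session locale.

```r
# Same bytes as the broken string above: "b", 0xf6, "r", "k" (Latin-1 for "börk").
bork <- rawToChar(as.raw(c(0x62, 0xf6, 0x72, 0x6b)))
Encoding(bork) <- "UTF-8"   # the mark is a declaration, not a validation

mark_says_utf8 <- Encoding(bork) == "UTF-8"  # what the mark claims
bytes_are_utf8 <- validUTF8(bork)            # what a scan of the bytes finds

# A correctly encoded string for comparison (\u escapes always yield UTF-8).
good <- "b\u00f6rk"
good_is_utf8 <- validUTF8(good)
```

Here `mark_says_utf8` is `TRUE` while `bytes_are_utf8` is `FALSE`, so skipping the scan means a string like this reaches the engine unchecked. The scan does have to read every byte of every string, which is the cost being asked about.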

krlmlr · 2024-03-21T07:04:07Z

There's a faster way:

x <- "Est\xe2ncia"
Encoding(x)
#> [1] "unknown"
x <- "Estância"
Encoding(x)
#> [1] "UTF-8"
x <- enc2utf8("Estância")
Encoding(x)
#> [1] "UTF-8"
Encoding(iconv("Est\xe2ncia", from = "latin1", to = "UTF-8"))
#> [1] "UTF-8"

Created on 2024-03-21 with reprex v2.1.0

This shows different results when run line by line in the RStudio IDE: the second example also yields "unknown". I propose to do the fast check first, and only use utf8proc if that returns "unknown".
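A rough R-level sketch of that two-stage idea follows. It is only an illustration: the function name `check_utf8_two_stage()` is hypothetical, and base R's `validUTF8()` stands in for the utf8proc scan that the real implementation would run in C++.

```r
# Hypothetical two-stage check: trust strings already marked "UTF-8" (fast),
# and scan the bytes only for the rest.
check_utf8_two_stage <- function(x) {
  stopifnot(is.character(x))
  ok <- rep(TRUE, length(x))                        # NA values stay TRUE
  needs_scan <- !is.na(x) & Encoding(x) != "UTF-8"  # stage 1: look at the mark
  ok[needs_scan] <- validUTF8(x[needs_scan])        # stage 2: full scan
  ok
}

# "Est\xe2ncia" as raw Latin-1 bytes, so the example is locale-independent.
broken <- rawToChar(as.raw(c(0x45, 0x73, 0x74, 0xe2, 0x6e, 0x63, 0x69, 0x61)))
x <- c("hello", broken, "Est\u00e2ncia", NA)

result <- check_utf8_two_stage(x)
```

The trade-off is that stage one believes the mark. A string whose mark was set by hand with `Encoding(x) <- "UTF-8"`, like the `bork` example earlier in this thread, would pass without ever being scanned.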

Tmonster · 2024-03-26T12:07:20Z

I agree, I think doing the fast check first is better. Otherwise we have to copy the whole string. Will work on this now.

Tmonster · 2024-03-26T14:15:40Z

I actually wonder now if the issue is in the function enc2utf8. In register.R we encode the values of the data frame in the function encode_values(). enc2utf8 is applied to all character columns in the data frame that is passed. Playing around with the enc2utf8 function, there seem to be some inconsistencies when the encoding of the passed string is unknown.

> enc2utf8("Est\xe2ncia")
# [1] "Est\xe2ncia"
> iconv(enc2utf8("Est\xe2ncia"), from="latin1", to="UTF-8")
# [1] "Estância"
> iconv(enc2utf8("Est\xe2ncia"), from="UTF-8", to="UTF-8")
# [1] NA
> iconv(enc2utf8("hello"), from="UTF-8", to="UTF-8")
# [1] "hello"
> iconv(enc2utf8("hello"), from="latin1", to="UTF-8")
# [1] "hello"

Now I just check that Encoding(x) is valid for every value of a varchar column. This seems like a lot of overhead though, so I'm open to other ideas.
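One thing worth keeping in mind when checking `Encoding(x)` per value: plain ASCII strings are never marked, so they report "unknown" just like strings with stray Latin-1 bytes. The sketch below illustrates this and shows base R's `validUTF8()` as one possible vectorized way to tell the two apart with a single call per column. It is an illustration only, not the check implemented in this PR.

```r
ascii  <- "hello"
# "Est\xe2ncia" as raw Latin-1 bytes, so the example is locale-independent.
broken <- rawToChar(as.raw(c(0x45, 0x73, 0x74, 0xe2, 0x6e, 0x63, 0x69, 0x61)))
marked <- "Est\u00e2ncia"  # \u escapes always produce a UTF-8 marked string

x <- c(ascii, broken, marked)

marks <- Encoding(x)   # ASCII and the broken string share the same mark
valid <- validUTF8(x)  # one vectorized call, no R-level loop over values
```

With these inputs, `marks` is "unknown", "unknown", "UTF-8", while `valid` flags only the second element as invalid. Rejecting every "unknown" value would therefore also reject ordinary ASCII data.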

EDIT: changes for readability

Tmonster added 6 commits September 15, 2023 14:46

add bug steps'

2bec241

fix, want to check smoke test

d2afe3c

adding a test case

ac19d45

removing bug script adding test script

6fe54ed

looks much better

5e12514

Merge branch 'main' into unicode_strings_fix

5f385ae

Tmonster and others added 4 commits September 22, 2023 15:53

clean up the code a little bit

12024e2

remove iostream

dbcd319

skip test on previous versions of R

e8ec3a6

fix skip if condition

253e2d7

Tmonster force-pushed the unicode_strings_fix branch from b891555 to 253e2d7 Compare September 26, 2023 11:17

krlmlr force-pushed the unicode_strings_fix branch from 253e2d7 to 0a48fd4 Compare November 8, 2023 12:58

SimonCoulombe mentioned this pull request Mar 3, 2024

Excessive RAM usage for DBI::dbWriteTable() and dplyr::collect() #97

Open

Tmonster and others added 9 commits March 11, 2024 05:36

add bug steps'

0e1a958

fix, want to check smoke test

eb6f613

adding a test case

0187cbe

removing bug script adding test script

4260efe

looks much better

4a0f24c

clean up the code a little bit

f23078b

remove iostream

5f08052

skip test on previous versions of R

7028fa4

fix skip if condition

2286bda

krlmlr force-pushed the unicode_strings_fix branch from 0a48fd4 to 2286bda Compare March 11, 2024 04:36

krlmlr changed the title ~~throw exception when non-utf8 characters are in a data.frame~~ feat: Throw exception when non-utf8 characters are in a data.frame Mar 11, 2024

krlmlr added this to the 0.10.1 milestone Mar 21, 2024

krlmlr force-pushed the main branch from 3acd095 to 7886e14 Compare March 23, 2024 05:54

Tmonster added 4 commits March 26, 2024 13:10

Merge branch 'main' into unicode_strings_fix

36185c5

check that every value has an encoding. error if not

9f8e93d

Merge branch 'unicode_strings_fix' of github.com:Tmonster/duckdb-r in…

0cb29b7

…to unicode_strings_fix

fix merge issue

e401e60

add code to check utf-8 errors

e84547d

krlmlr force-pushed the main branch from 640f15f to 169cc0a Compare October 19, 2024 15:51
