-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Throw exception when non-utf8 characters are in a data.frame #16
base: main
Are you sure you want to change the base?
Conversation
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️ |
b891555
to
253e2d7
Compare
253e2d7
to
0a48fd4
Compare
Thanks. Are there any other locations where we need to care about this? Thinking mainly about parameter binding via |
0a48fd4
to
2286bda
Compare
Coming back to this. What are the costs of this preemptive check? I assume that the code must read the entire string? What are the consequences of avoiding this check? Example for constructing a broken string: bork <- iconv("börk", to = "latin1")
Encoding(bork) <- "UTF-8"
bork
#> [1] "b\xf6rk" Created on 2024-03-11 with reprex v2.1.0 |
There's a faster way: x <- "Est\xe2ncia"
Encoding(x)
#> [1] "unknown"
x <- "Estância"
Encoding(x)
#> [1] "UTF-8"
x <- enc2utf8("Estância")
Encoding(x)
#> [1] "UTF-8"
Encoding(iconv("Est\xe2ncia", from = "latin1", to = "UTF-8"))
#> [1] "UTF-8" Created on 2024-03-21 with reprex v2.1.0 This shows different results when ran line by line in the RStudio IDE: the second example also yields "unknown". I propose to do the fast check first, and only use utf8proc if that returns "unknown". |
I agree, I think the fast check first is better. Otherwise we have to copy the whole string. Will work on this now |
I actually wonder now if the issue is in the function > enc2utf8("Est\xe2ncia")
# [1] "Est\xe2ncia"
> iconv(enc2utf8("Est\xe2ncia"), from="latin1", to="UTF-8")
# [1] "Estância"
> iconv(enc2utf8("Est\xe2ncia"), from="UTF-8", to="UTF-8")
# [1] NA
> iconv(enc2utf8("hello"), from="UTF-8", to="UTF-8")
# [1] "hello"
> iconv(enc2utf8("hello"), from="latin1", to="UTF-8")
# [1] "hello" Now I just check that Encoding(x) is valid for every value of a varcher column. This seems like a lot of overhead though, so open to other ideas. EDIT: changes for readability |
The duckdb engine assumes all strings are valid utf-8. In the R-client, we forgot to check if strings were in fact utf8. Here we check them when the scanning the df when we register it.
Closes #12.