-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Always use get_src_tbl_names()
to fetch tbl Id
s within dm_from_con()
#1934
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very good to me, thank you for your contribution!
I tried using this for SQL Server, trying to access tables in schemas, making use of the schema
argument passed on in the ellipsis. This did not work, therefore I rewrote some of your code slightly, cf. comment.
Also, I think the function should fail in case specifically requested tables cannot be found, which is how it was supposed to behave before. I also changed that in the suggested code.
Does this make sense to you, could you please adapt?
set_names(src_tbl_names, nms) %>% | ||
quote_ids(con) %>% | ||
map(possibly(tbl, NULL), src = src) | ||
tbls <- map(tbl_ids, possibly(tbl, NULL), src = src) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using schemas this code fails as well, since table_names
normally does not include the schema, yet names(tbl_ids)
does.
Therefore, I would suggest to replace lines 96-109 with the following:
tbl_ids <- get_src_tbl_names(src, ..., names = .names)
# Fetch only the tbls which were specifically requested
if (!is.null(table_names)) {
nms <- purrr::map_chr(tbl_ids, ~ .x@name[["table"]])
bad_table_names <- table_names[!(table_names %in% nms)]
tbl_ids <- tbl_ids[which(nms %in% table_names)]
}
tbls <- map(tbl_ids, possibly(tbl, NULL), src = src)
bad <- map_lgl(tbls, is_null)
if (any(bad) || (exists("bad_table_names", inherits = FALSE) && length(bad_table_names) > 0)) {
if (is_null(table_names)) {
warn_tbl_access(names(tbls)[bad])
tbls <- tbls[!bad]
} else {
abort_tbl_access(union(names(tbls)[bad], bad_table_names))
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks for this!! I hadn't spotted the *_tbl_access()
functions, those look super handy 😁 I'll certainly adjust to include those.
Regarding table_names
, I've just been thinking about the following situation:
Suppose there is a table name which exists in multiple schemas, e.g. suppose schema1.table1
and schema2.table1
both exist.
Now if a user asks for:
dm_from_con(con, table_names = "table1", schemas = c("schema1", "schema2"))
What should we return in such a case?
One option would be for the user to provide "schema.table"
-format names in table_names
. The documentation for table_names
would perhaps need some adjustment - but:
- If a user has requested multiple schemas, then we can assume the user is already aware of the
schema.table
syntax - see discussion in comment here feat:dm_from_con()
gains.names
argument for pattern-based construction of table names in the dm object #1790 (comment) - There is zero ambiguity if names are specified like this
- I think it works with the current
names(tbl_ids)
implementation?
Or maybe we should include both schema1.table1
and schema2.table1
in the returned dm
object; which I think is what your suggested changes will result in? I'm conflicted about this, because I feel like once we start dealing with multiple schemas, we should always be explicit about those schemas - but maybe that's just me being too strict with SQL ideals 😛
To offer one more alternative, we could include only the first table found with that name, and then raising a warning, as described here? #1789 (comment)
I'll happily proceed with whichever you think is best!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your reply, good catch about the same table name in different schemas.
In my opinion, if the user provides 1 table name and more than 1 schema name, as in your example, this should lead to an error: either schema
is not provided, has length 1 or needs to have a length matching the length of table_names
.
names(tbl_ids)
is a vector of the resulting names in the dm according to the specification in the .names
argument. In case schema
is given with multiple schema names, the default for this argument -- as you know -- is "{.schema}.{.table}", which leads to dm
-tables just like you suggested (schema1.table1
and schema2.table1
), if the user gave table_names = c("table1", "table1")
and schema = c("schema1", "schema2")
.
In general names(tbl_ids)
could be anything though and for this reason it cannot be used in any conditions.
I think the error messages need to be adapted to include the schema names of the tables that could not be accessed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okidoki! So based on these comments so far, I think we might need to modify how table_names
works, such that it can accept a list in the multiple-schemas case - something like:
# No `schema` arg => use default schema, i.e. same as one-schema case (next example)
dm_from_con(con, table_names = c("table1", "table2"))
# One specified schema => `table_names` is a single vector of names
# Returned dm object contains: schema1.table1, schema1.table2
dm_from_con(con, schema = "schema1", table_names = c("table1", "table2"))
# Multiple schemas => `table_names` is a list of vectors (must it be named?)
# Returned dm object contains: schema1.table1, schema1.table2, schema2.table1, schema2.table3
dm_from_con(
con,
schema = c("schema1", "schema2"),
table_names = list(
"schema1" = c("table1", "table2"),
"schema2" = c("table1", "table3")
)
)
I'll make one more case for leaving things as they are currently implemented, i.e. based on names(tbl_ids)
, because it avoids adding complexity to table_names
.
In general
names(tbl_ids)
could be anything though
True - but names are always constructed using the .names
pattern, which is either the default (which multi-schema users must be aware of already), or specified by the user!
So the user does always know how elements of the returned dm
object will be named; and therefore they know how the strings in table_names
should look.
# Default schema / one schema
dm_from_con(con, table_names = c("table1", "table2"))
dm_from_con(con, schema = "schema1", table_names = c("table1", "table2"))
# Multiple schemas, default `.names`
dm_from_con(
con,
schemas = c("schema1", "schema2"),
table_names = c("schema1.table1", "schema1.table2", "schema2.table1", "schema2.table3")
)
# Multiple schemas, custom `.names` - user knows that `table_names` should match `.names`
dm_from_con(
con,
schemas = c("schema1", "schema2"),
# The user chooses `.names`...
.names = "{.schema}__{.table}",
# ... so, they know what `table_names` should look like!
table_names = c("schema1__table1", "schema1__table2", "schema2__table1", "schema2__table3")
)
Thinking back to before multiple schema functionality was added, I think one interpretation of table_names
was:
"I only want the returned
dm
object to contain elements with names like I've specified intable_names
."
And I think that with this implementation, that interpretation is still valid now in the multi-schema case.
What do you think? For either of those approaches, I reckon I now have a solid specification to go away and implement 😇 many thanks for your patience and guidance with this one!
Thanks for your contribution! What's the next action here? Can we add a test for this change? |
I think we want the following:
This gets rid of a lot of legacy code and simplifies things. Let me tinker with that a bit, no action needed here for now. |
According to the docs of
dm_from_con()
, thetable_names
parameter allows the user to include only specific tables in the returneddm
object:Currently, we use two different methods to fetch tbls:
table_names
was not provided, usedm:::get_src_tbl_names()
to fetch tblId
s, which are later passed totbl()
table_names
was provided, pass the character-type names directly totbl()
But the second method does not work with non-default schema names, or with multiple schemas; and we also create additional problems when we try to set names for the list of tbls (#1933).
I think we can actually use the first method in all cases, i.e. use
get_src_tbl_names()
to return a named list of tblId
s.Then we can use
table_names
to take a subset from that list, before passing totbl()
.Fixes #1933.