bug: NA returned by get_uid when using division_filter #909

Open
davised opened this issue Mar 6, 2023 · 3 comments

davised commented Mar 6, 2023

I'm running into a strange bug. I'm querying various bacterial taxa, and a handful of my taxa have conflicting genus names in the database, e.g. Rhodococcus.

I attempted to resolve this issue using the division_filter option, but then I realized I was still getting NA returned.

I tried to solve this using the rows option, but I don't think I can depend on a particular row being the proper one. Here, the second row is what I want.

The odd thing is that with division_filter = "bacteria" the function reports the taxon as found, yet the returned value is NA.

> taxize::get_uid("Rhodococcus", rank_query = "Genus", rows = 2)
══  1 queries  ═══════════════

Retrieving data for taxon 'Rhodococcus'

Found:  Rhodococcus
══  Results  ═════════════════

• Total: 1
• Found: 1
• Not Found: 0
[1] "1827"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] TRUE
attr(,"pattern_match")
[1] TRUE
attr(,"uri")
[1] "https://www.ncbi.nlm.nih.gov/taxonomy/1827"
> taxize::get_uid("Rhodococcus", rank_query = "Genus", division_filter = "bacteria")
══  1 queries  ═══════════════

Retrieving data for taxon 'Rhodococcus'

Found:  Rhodococcus
══  Results  ═════════════════

• Total: 1
• Found: 1
• Not Found: 0
[1] NA
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] TRUE
attr(,"pattern_match")
[1] FALSE
Session Info
> sessioninfo::session_info()
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.3 (2022-03-10)
 os       Fedora Linux 36 (Workstation Edition)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2023-03-06
 pandoc   2.14.0.3 @ /usr/bin/pandoc

─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 ape           5.6-2   2022-03-02 [1] CRAN (R 4.1.2)
 bold          1.2.0   2021-05-11 [1] CRAN (R 4.1.3)
 cli           3.4.1   2022-09-23 [1] CRAN (R 4.1.3)
 codetools     0.2-18  2020-11-04 [2] CRAN (R 4.1.3)
 conditionz    0.1.0   2019-04-24 [1] CRAN (R 4.1.3)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.1.3)
 crul          1.3     2022-09-03 [1] CRAN (R 4.1.3)
 curl          4.3.2   2021-06-23 [1] CRAN (R 4.1.2)
 data.table    1.14.2  2021-09-27 [1] CRAN (R 4.1.2)
 foreach       1.5.2   2022-02-02 [1] CRAN (R 4.1.2)
 httpcode      0.3.0   2020-04-10 [1] CRAN (R 4.1.3)
 iterators     1.0.14  2022-02-05 [1] CRAN (R 4.1.2)
 jsonlite      1.8.2   2022-10-02 [1] CRAN (R 4.1.3)
 lattice       0.20-45 2021-09-22 [2] CRAN (R 4.1.3)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
 nlme          3.1-155 2022-01-16 [2] CRAN (R 4.1.3)
 plyr          1.8.7   2022-03-24 [1] CRAN (R 4.1.3)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
 Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.3)
 reshape       0.8.9   2022-04-12 [1] CRAN (R 4.1.3)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.3)
 stringi       1.7.8   2022-07-11 [1] CRAN (R 4.1.3)
 stringr       1.4.1   2022-08-20 [1] CRAN (R 4.1.3)
 taxize        0.9.100 2022-04-22 [1] CRAN (R 4.1.3)
 triebeard     0.3.0   2016-08-04 [1] CRAN (R 4.1.3)
 urltools      1.7.3   2019-04-14 [1] CRAN (R 4.1.3)
 uuid          1.1-0   2022-04-19 [1] CRAN (R 4.1.3)
 xml2          1.3.3   2021-11-30 [1] CRAN (R 4.1.2)
 zoo           1.8-10  2022-04-15 [1] CRAN (R 4.1.3)

 [1] /home/davised/R/x86_64-redhat-linux-gnu-library/4.1
 [2] /usr/lib64/R/library
 [3] /usr/share/R/library

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────

davised commented Mar 6, 2023

For completeness, I tested a second genus, Paracoccus, and got the same bug.

Looks like that one also prefers the second row. Maybe I can try rows = 2 and see if that's sufficient for now.

zachary-foster (Collaborator) commented

Hi Ed!

So the issue is that the division is technically called "high G+C Gram-positive bacteria" for whatever reason. You can see that when you run it without a division filter:

> taxize::get_uid("Rhodococcus", rank_query = "Genus")
══  1 queries  ═══════════════

Retrieving data for taxon 'Rhodococcus'

More than one UID found for taxon 'Rhodococcus'!

            Enter rownumber of taxon (other inputs will return 'NA'):

  status  rank                        division scientificname commonname     uid genus species subsp modificationdate
1 active genus                   scale insects    Rhodococcus            1661425                     2015/09/16 00:00
2 active genus high G+C Gram-positive bacteria    Rhodococcus               1827                     2022/09/18 00:00

Note that the input to division_filter is a regex.
"bacteria" on its own does not match because taxize automatically adds ^ and $ to the division_filter argument, so the pattern has to match the entire division name. I think I will remove that so partial matches are possible.
Here is what you would need to use to make it work:

> taxize::get_uid("Rhodococcus", rank_query = "Genus", division_filter = "high G\\+C Gram-positive bacteria")
══  1 queries  ═══════════════

Retrieving data for taxon 'Rhodococcus'

✔  Found:  Rhodococcus
══  Results  ═════════════════

• Total: 1 
• Found: 1 
• Not Found: 0
[1] "1827"
attr(,"class")
[1] "uid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] TRUE
attr(,"pattern_match")
[1] TRUE
attr(,"uri")
[1] "https://www.ncbi.nlm.nih.gov/taxonomy/1827"

Note that + is a regex metacharacter, so it needs to be escaped with \\ in the R string. After the change I just made to the version on GitHub, just "bacteria" will work in this instance as well.
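
To make the anchoring and escaping points concrete, here is a minimal base R sketch. It only emulates the behaviour described above: `anchored_match` is a hypothetical helper, not taxize's actual code, and the division strings are copied from the table earlier in this comment.

```r
# Division names as shown in the interactive prompt above.
divisions <- c("scale insects", "high G+C Gram-positive bacteria")

# Hypothetical helper: wrap the pattern in ^ and $ so that it must match
# the whole string. This is an assumption about how the filter behaves,
# not a copy of the taxize source.
anchored_match <- function(pattern, x) {
  grepl(paste0("^", pattern, "$"), x)
}

# "bacteria" alone cannot match a whole division name.
res_bare <- anchored_match("bacteria", divisions)

# The unescaped "+" is a quantifier ("one or more G"), so this fails too.
res_unescaped <- anchored_match("high G+C Gram-positive bacteria", divisions)

# Escaping "+" makes it a literal plus sign, so the second division matches.
res_escaped <- anchored_match("high G\\+C Gram-positive bacteria", divisions)

# Without anchors, a partial match on "bacteria" is enough.
res_partial <- grepl("bacteria", divisions)

print(rbind(res_bare, res_unescaped, res_escaped, res_partial))
```

Only the escaped full name and the unanchored partial pattern select the second row.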


davised commented Mar 14, 2023

Ah, I figured it was a regex, but I didn't realize that it behaved like grep -x (whole-line matching), which is why I thought it was a bug.

I think having a flag (e.g. full=TRUE) by default could be fine, then I can set it to full=FALSE for my usage.

That way folks can choose how it works. Might make sense to let folks know it's a regex as well in the first place (maybe that's documented and I didn't see it).

Cheers
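
As a rough illustration of the flag idea above, this is how a full/partial switch could look in base R. Everything here is hypothetical: `filter_division` and its `full` argument are made up for the sketch and are not part of the taxize API.

```r
# Hypothetical helper, not a taxize function. With full = TRUE the pattern
# must match the entire division name (like grep -x); with full = FALSE a
# partial match is enough.
filter_division <- function(divisions, pattern, full = TRUE) {
  if (full) {
    # Group the pattern so that alternations such as "a|b" are anchored as a whole.
    pattern <- paste0("^(", pattern, ")$")
  }
  grepl(pattern, divisions)
}

divisions <- c("scale insects", "high G+C Gram-positive bacteria")

full_hits <- filter_division(divisions, "bacteria")                  # whole-name match
partial_hits <- filter_division(divisions, "bacteria", full = FALSE) # partial match

print(divisions[partial_hits])
```

With the default, "bacteria" selects nothing; with full = FALSE it selects the Gram-positive division.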
