submission: pangoling: Access to word predictability using large language (transformer) models #575
Comments
Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type |
🚀 Editor check started 👋 |
Checks for pangoling (v0.0.0.9005)

git hash: 543c11bd

Important: All failing checks above must be addressed prior to proceeding

Package License: MIT + file LICENSE

1. Package Dependencies

Details of Package Dependency Usage (click to open)
The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.
Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base
lapply (19), length (7), c (6), dim (6), paste0 (6), t (6), unlist (4), by (3), list (3), names (3), seq_len (3), which (3), do.call (2), getOption (2), matrix (2), ncol (2), rep (2), seq_along (2), sum (2), unname (2), as.list (1), floor (1), for (1), grepl (1), lengths (1), mode (1), new.env (1), options (1), rownames (1), split (1), switch (1), vector (1)

pangoling
create_tensor_lst (5), lst_to_kwargs (5), char_to_token (4), encode (4), get_id (4), get_vocab (4), get_word_by_word_texts (2), masked_lp_mat (2), causal_config (1), causal_lp (1), causal_lp_mats (1), causal_mat (1), causal_next_tokens_tbl (1), causal_preload (1), causal_tokens_lp_tbl (1), chr_detect (1), masked_config (1), num_to_token (1), word_lp (1)

tidytable
map_chr. (4), map2 (3), map. (2), pmap. (2), arrange. (1), map (1), map_dbl. (1), map_dfr (1), map_dfr. (1), map2_dbl. (1), pmap_chr (1), relocate (1), tidytable (1)

reticulate
py_to_r (5)

memoise
memoise (3)

cachem
cache_mem (2)

graphics
text (2)

data.table
chmatch (1)

stats
lm (1)

tidyselect
everything (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.

2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)
The package has:
Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the The final measure (
2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package

3.
|
id | name | conclusion | sha | run_number | date |
---|---|---|---|---|---|
4232217801 | pages build and deployment | success | c6ce46 | 16 | 2023-02-21 |
4232181659 | pkgdown | success | 543c11 | 44 | 2023-02-21 |
4232181660 | R-CMD-check | success | 543c11 | 37 | 2023-02-21 |
4232181654 | test-coverage | success | 543c11 | 39 | 2023-02-21 |
3b. goodpractice
results
R CMD check
with rcmdcheck
rcmdcheck found no errors, warnings, or notes
Test coverage with covr
Package coverage: 0.89
The following files are not completely covered by tests:
file | coverage |
---|---|
R/tr_causal.R | 0% |
R/tr_masked.R | 0% |
R/tr_utils.R | 0% |
R/utils.R | 0% |
R/zzz.R | 0% |
Cyclocomplexity with cyclocomp
The following function has cyclocomplexity >= 15:
function | cyclocomplexity |
---|---|
word_lp | 16 |
Static code analyses with lintr
lintr found the following 39 potential issues:
message | number of times |
---|---|
Avoid library() and require() calls in packages | 5 |
Lines should not be more than 80 characters. | 34 |
Package Versions
package | version |
---|---|
pkgstats | 0.1.3 |
pkgcheck | 0.1.1.11 |
Editor-in-Chief Instructions:
Processing may not proceed until the items marked with ✖️ have been resolved.
Hi, Also when I run ✔ Package coverage is 94.8%. |
Thanks @bnicenboim for your full submission and for explaining the issue with test coverage. Your explanation makes sense, so we can move forward. I'll start searching for a handling editor. In the meantime you may want to start thinking of potential reviewers to suggest to the handling editor. |
Is there a pool of potential reviewers that I can have access to? |
I guess authors are mostly guided by their knowledge of their intended audience. But for inspiration see how editors look for reviewers. Editors have access to a private airtable database, but often we look elsewhere. |
Dear @bnicenboim I'm sorry for the extraordinary delay in finding a handling editor. Most editors are busy and some handling more than one package. And the very few available are not yet due to handle another submission. Please hold a bit longer. |
ok, thanks for letting me know, no problem. |
@ropensci-review-bot assign @karthik as editor |
Assigned! @karthik is now the editor |
👋 @bnicenboim |
Hi, any news about the next steps? |
Hi @bnicenboim |
Editor checks:
Editor comments

No additional comments at this time. I'm looking for reviewers at the moment, but if you've got any suggestions for people with expertise but no conflict, please suggest names. |
@ropensci-review-bot seeking reviewers |
Please add this badge to the README of your package repository: [![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/575_status.svg)](https://github.com/ropensci/software-review/issues/575) Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news |
I really don't know about reviewers, I guess someone involved in the packages named here: Or maybe someone based on the reverse imports of reticulate: |
@ropensci-review-bot assign @lisalevinson as reviewer |
@lisalevinson added to the reviewers list. Review due date is 2023-05-24. Thanks @lisalevinson for accepting to review! Please refer to our reviewer guide. rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more. |
@lisalevinson: If you haven't done so, please fill this form for us to update our reviewers records. |
@ropensci-review-bot assign @utkuturk as reviewer |
@utkuturk added to the reviewers list. Review due date is 2023-05-29. Thanks @utkuturk for accepting to review! Please refer to our reviewer guide. rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more. |
📆 @lisalevinson you have 2 days left before the due date for your review (2023-05-24). |
📆 @utkuturk you have 2 days left before the due date for your review (2023-05-29). |
Hi @bnicenboim, @karthik, @lisalevinson, Sorry, my review took more time than necessary. Tremendous thanks to @bnicenboim for his efforts writing this package. My review is just some notes from a regular user. I already used this package before this review, but used another laptop to test things out. I plan on using this package for the foreseeable future as well. I am happy to continue testing things as the package develops. :) I am also happy to help with writing a community vignette on a minimal use of reticulate to get everything started with miniconda and conda environments.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:
Even though the installation of the package is straightforward, I do not think installing the package itself makes it immediately usable. I have some notes on this. See below.
Functionality
I am not sure about what to look for here. I did not mark this as complete mainly because "tests" are skipped for transformers, and the author did not include additional unit tests in his functions (at least I was not able to find them). I am happy to edit this part if I am corrected on the issue. This is simply my ignorance.
The sole reason that it does not conform to packaging guidelines is that the package does not include CONTRIBUTING and does not have contribution guidelines in the README. It includes necessary descriptions for

Estimated hours spent reviewing: 10 hours
Review Comments

My main comments will be around two topics: (1) installation of the dependencies, (2) use of other arguments for the exported functions.

Installation: Clarity Problem wrt Python

Installing from GitHub and installing locally both went smoothly, without any problems. However, there's a little hiccup that occurs with both installation methods: the Python modules that the package depends on are not installed at that point. They are only installed when you try to run R code that requires them. The good news is that this installation is taken care of automatically once you have successfully defined your Python in R. This could be highly valuable for advanced users who will have no problem using Python in R, as it grants them control over their Python modules. But I do not think this is true of many psycholinguists or linguists, who are the target audience for the package.

I think there is a lack of clarity regarding the overall process with respect to Python. It's not evident whether the package utilizes its own Python environment, relies on an API, or requires any adjustments to the existing Python installation.

Suggestions

I know both of these suggestions are somewhat against rOpenSci guidelines, since they advise against unnecessary start-up messages. However, I think these are not as unnecessary as their examples.
To enhance transparency, it would be helpful to have a concise warning message displayed when the pangoling library is initially attached, similar to the cmdstanr package. This warning message could include the following components:
It is reasonable to assume that users of this package are already familiar with using python via R. However, it would indeed be beneficial to explicitly state this assumption.
Given that the package aims to serve as a wrapper for these python tools, targeting individuals like myself who may not possess advanced technical knowledge, it would be advantageous to provide a vignette that offers guidance on getting started with pangoling. This vignette could include step-by-step instructions on the following fundamental aspects:
By including such a vignette, users with limited technical background would be able to grasp the essential procedures involved in setting up and utilizing pangoling effectively.

Another clarity issue: Downloading packages

It is essential to ensure clarity regarding the download of the models to the local system and emphasize that the package does not operate through an API or an online instantiation. While this may be common knowledge for regular users of transformers or Python, it is important to remember that not all psycholinguists possess the same level of familiarity with these tools. To address this, it would be beneficial to explicitly state some of the following points in either README:
Advanced Topics in Vignettes

One aspect that I noticed was the absence of information regarding the arguments that default to NULL in the exported functions. It would be greatly appreciated if the authors could provide examples illustrating the use of these arguments. Although it is understood that each pretrained model may have unique configurations, having basic examples directly from the authors demonstrating how these arguments are implemented in the code would be highly beneficial. Currently, the function description includes a link for more details directing us to the "from_pretrained" configs. However, supplementing it with concrete examples would assist users in understanding the practical application of non-NULL values for those arguments in the context of this package. If the transformers or text packages have this, a link to those vignettes might be useful. This additional guidance from the authors would be a valuable addition to the documentation. |
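As a rough illustration of the kind of transparency about the Python side suggested in the review above, the following is a minimal sketch of a status check. The helper name `python_modules_status` is hypothetical and not part of pangoling; the only module name used, "transformers", is taken from the submission text, and whether other modules are needed is not stated in this thread. It relies on `reticulate::py_module_available()`, which initializes Python if it is not already running.

```r
# Hypothetical helper (not part of pangoling): report which Python modules
# reticulate can see. The default module name "transformers" comes from the
# submission text; any other module names are the caller's assumption.
python_modules_status <- function(modules = "transformers") {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    stop("The R package 'reticulate' is needed for this check.")
  }
  # py_module_available() initializes Python if it is not already running.
  found <- vapply(modules, reticulate::py_module_available, logical(1))
  for (m in names(found)) {
    message(sprintf(
      "Python module '%s': %s", m,
      if (found[[m]]) "found" else "not found"
    ))
  }
  invisible(found)
}
```

Calling `python_modules_status()` in an interactive session would print one line per module; a message like this could in principle be shown in a vignette or at attach time, though that is a design decision for the package author.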
Thanks so much for your work on this package @bnicenboim! Sorry this review is a bit later than expected. I have made a lot of comments in my review on my personal user interface/workflow preferences, but the package is already very useful “as is”. Let me know if you have any questions, or would like any help with some of the vignette/documentation suggestions I have made - I’m happy to help, though I can’t promise a speedy turnaround.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

I have no prior personal or working relationship with the package creator.

Documentation

The package includes all the following forms of documentation:
The Rmd vignettes exist (and are viewable as rendered on the package website), but they did not seem to install properly, either from the local package folder or from GitHub, even when building vignettes is set to true. I believe this may be related to an issue that occurs due to the
Examples run, but one has a typo, described below.
Functionality
I’m not experienced with testing, but there were some failures and warnings noted below. There did not seem to be many tests.
Estimated hours spent reviewing: 11
Review Comments

Thanks so much for creating this package - I demo’d it to my students when it first came out to show them ways to do “Python” things within the R environment, and it was a great example for that. I also am very grateful to have a tool that would allow me to more easily carry out these kinds of analyses from R, rather than passing data over from Python projects (as I had previously been doing). It seems similar in spirit to the Python package minicons.

Further Comments on Documentation

I did not have a “fresh” system without Python and

Given the variety of inputs and outputs, it would be very helpful in the documentation/vignettes to have a table summarizing these for each function so it’s easier to tell which one you need for which purpose.

Workflow

One general comment is that the functions could be a little easier to use for the output that I am usually needing to generate for analyses. In my work, I’m typically gathering probabilities/surprisals for all of the words in a sentence or passage, not just one target word. Thus the Python scripts I have been using are based on specifying a complete sentence (“The apple fell far from the tree.”) and either specifying a target word position or getting probabilities across all of the word positions. This is awkward to do with the
The default “fill” for
Perhaps functions like these could be separate ones from those already provided.

Reporting and Documentation of Probabilities

Based on some independent testing, it seems that the log probabilities are natural log, but I couldn’t find that documented anywhere. It is fairly common to use base 2 log for surprisals (in bits), so which log is used should at least be indicated clearly. It would be great to make this an argument which can be specified, as

On this note, it seems that most folks use surprisals in the psycholinguistic literature, and they are a bit easier to interpret than negative probabilities. It would be nice to offer a “surprisal” option, which would usually be negative log2 probability.

Output Formats

The outputs of the functions are in a variety of structures, which is a bit confusing. It would be nice to have more consistency, though I can see to some extent why in some cases there is a named vector and in others a tidytable. I think it would be helpful for users to have a vignette with some more examples of working with the output to get it into more common shapes for next steps in an analysis, such as using

Non-English Usage and Examples

As a linguistics-oriented package, it would be helpful to include some vignette examples that include languages other than English, and to mention any potential limitations for languages with different orthographic/tokenization issues. For example, Mandarin Chinese does not use spaces, but it might be possible to use the functions if spaces (or another delimiter) are indicated where word boundaries are desired. There are other issues that arise for the tokenization of these languages, however, as most (all?) of the models will be based on single character tokenization, which may not provide the best probabilities for words which are predominantly multi-character.
These things are a little more clear if one is coding the Python for transformers, but are obscured by the wrapper functions, so seem relevant to at least point towards other resources on.

library(pangoling)
library(tidyverse)

Specific Function Comments
|
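The workflow point in the review above (starting from complete sentences rather than single target words) can be sketched in base R. This is only an illustration under stated assumptions: the helper name `sentences_to_words` is hypothetical and not part of pangoling, words are assumed to be separated by single spaces, and the second sentence is made up for the example. It produces a long table with one word per row and a sentence index.

```r
# Hypothetical helper (not part of pangoling): turn complete sentences into a
# long, word-by-word data frame. Splitting on single spaces is an assumption.
sentences_to_words <- function(sentences) {
  words <- strsplit(sentences, " ", fixed = TRUE)
  data.frame(
    sent_n = rep(seq_along(words), lengths(words)),  # sentence index per word
    word = unlist(words),                            # one word per row
    stringsAsFactors = FALSE
  )
}

df_words <- sentences_to_words(c(
  "The apple fell far from the tree.",  # sentence from the review
  "The tree is tall."                   # illustrative sentence
))
df_words
```

Punctuation stays attached to the preceding word here; whether that is the right choice depends on how the stimuli were presented in the experiment.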
Thank you both! Very useful! |
ok, I'm still struggling with docker to answer the main concern of @utkuturk and check how the installation goes. Hopefully, I'll be able to figure out what an installation from scratch looks like. In the meantime I have a question for @lisalevinson (but utku feel free to comment). Lisa was confused with the two uses of

Ok, first load the packages.

library(pangoling)
library(tidytable)

There are two main formats that causal_lp addresses. One is when you have data word-by-word (or phrase-by-phrase), which is common for the stimuli of many self-paced reading, eye-tracking, and ERP experiments that deal with naturalistic texts.

df_psychl <- tidytable(
  sent_n = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
  word = c("The", "apple", "doesn't", "fall", "far", "from", "the", "tree.",
           "Don't", "judge", "a", "book", "by", "its", "cover."),
  other_word_level_stuff = NA
)
df_psychl
#> # A tidytable: 15 × 3
#> sent_n word other_word_level_stuff
#> <dbl> <chr> <lgl>
#> 1 1 The NA
#> 2 1 apple NA
#> 3 1 doesn't NA
#> 4 1 fall NA
#> 5 1 far NA
#> 6 1 from NA
#> 7 1 the NA
#> 8 1 tree. NA
#> 9 2 Don't NA
#> 10 2 judge NA
#> 11 2 a NA
#> 12 2 book NA
#> 13 2 by NA
#> 14 2 its NA
#> 15 2 cover. NA

If the two sentences are not part of one single text, that's when you must use

df_psychl <- df_psychl |>
  mutate(lp = causal_lp(x = word, .by = sent_n))
#> Processing using causal model ''...
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Text id: 1
#> `The apple doesn't fall far from the tree.`
#> Text id: 2
#> `Don't judge a book by its cover.`
df_psychl
#> # A tidytable: 15 × 4
#> sent_n word other_word_level_stuff lp
#> <dbl> <chr> <lgl> <dbl>
#> 1 1 The NA NA
#> 2 1 apple NA -10.9
#> 3 1 doesn't NA -5.50
#> 4 1 fall NA -3.60
#> 5 1 far NA -2.91
#> 6 1 from NA -0.745
#> 7 1 the NA -0.207
#> 8 1 tree. NA -1.58
#> 9 2 Don't NA NA
#> 10 2 judge NA -6.27
#> 11 2 a NA -2.33
#> 12 2 book NA -1.97
#> 13 2 by NA -0.409
#> 14 2 its NA -0.257
#> 15 2 cover. NA -1.38

The other use case is an experiment where the stimuli are also sentences but, regardless of how you present the context, you only care about the critical region:

df_psychl2 <- tidytable(
item_n =
c(1, 2, 3, 4),
context = c("The apple doesn't fall far from the",
"The apple doesn't fall far from the",
"Don't judge a book by its",
"Don't judge a book by its"),
critical = c("tree.","floor.","cover.","back."),
other_word_level_stuff = NA
)
df_psychl2
#> # A tidytable: 4 × 4
#> item_n context critical other_word_level_stuff
#> <dbl> <chr> <chr> <lgl>
#> 1 1 The apple doesn't fall far from the tree. NA
#> 2 2 The apple doesn't fall far from the floor. NA
#> 3 3 Don't judge a book by its cover. NA
#> 4 4 Don't judge a book by its back. NA

In that case each row is an item, and then

df_psychl2 <- df_psychl2 |>
  mutate(lp = causal_lp(x = critical, l_contexts = context))
#> Processing using causal model ''...
#> Ignoring `.by` argument
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Text id: 1
#> `The apple doesn't fall far from the tree.`
#> Text id: 2
#> `The apple doesn't fall far from the floor.`
#> Text id: 3
#> `Don't judge a book by its cover.`
#> Text id: 4
#> `Don't judge a book by its back.`
df_psychl2
#> # A tidytable: 4 × 5
#> item_n context critical other_word_level_stuff lp
#> <dbl> <chr> <chr> <lgl> <dbl>
#> 1 1 The apple doesn't fall far from… tree. NA -1.58
#> 2 2 The apple doesn't fall far from… floor. NA -10.2
#> 3 3 Don't judge a book by its cover. NA -1.38
#> 4 4 Don't judge a book by its back. NA -17.2

Created on 2023-06-07 with reprex v2.0.2

Do you think that having better examples would help? Or that I should name the arguments differently, or that I should have two different functions? Also for both of you, do you think I should change Also |
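On the log-base question raised in the review: if the `lp` values in the output above are natural-log probabilities (the review reports this from independent testing, so it is treated as an assumption here), converting them to surprisal in bits is a change of base plus a sign flip. The helper name `lp_to_surprisal` and its arguments are hypothetical, not part of pangoling.

```r
# Log probabilities copied from the `lp` column of the output above.
lp <- c(-1.58, -10.2, -1.38, -17.2)

# Hypothetical helper (not part of pangoling): convert log probabilities given
# in base `from` into surprisal (negative log probability) in base `to`.
# Assumption: the input is in natural log unless `from` says otherwise.
lp_to_surprisal <- function(lp, from = exp(1), to = 2) {
  # Change of base: log_to(p) = log_from(p) * log(from) / log(to)
  -lp * log(from) / log(to)
}

surprisal_bits <- lp_to_surprisal(lp)
round(surprisal_bits, 2)
```

With the defaults this reduces to `-lp / log(2)`; passing `to = exp(1)` would give surprisal in nats instead.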
Regarding @utkuturk's comment about the installation. I used docker to create a fresh installation of R (using https://rocker-project.org/images/other/r-ubuntu.html) with
The only thing I had to do was to install So basically, I had to do.
And that's it. Then the first time I ran a command from pangoling, I was asked
I answered @utkuturk, did you have any specific problem with the installation? Or were you worried that users without python and conda would not manage? |
@karthik, @utkuturk, @lisalevinson, I was hoping to get answers to move forward with the revision |
Hey Bruno, Apologies for the delayed response! After reading your reply, I decided to try using a virtual machine with a fresh installation of Mac OS. Just as you mentioned, everything worked flawlessly with the new miniconda installation. Later, I went back to my original machine, the one I initially had trouble with. It turned out that there was an existing version of miniconda installed, and the issue was simply because that particular machine didn't have the proper python installation with PATH specifications. Please consider my previous message as more of a general concern rather than a serious problem. |
Hi @bnicenboim, I'm checking in on issues that have been static for a while. I see this issue was moving along alright and then fell silent. Let us know if there is progress to report or any questions you may have. We can move it to a "Hold" status, or keep it open as-is if you expect to continue to respond to the reviews. |
@ropensci-review-bot put on hold |
Submission on hold! |
Hi, I made a lot of progress (and unrelated changes) and I had a question for @lisalevinson that was never answered, but regardless it's fine to have this on hold until my teaching is over and I can finalize the last things. |
@bnicenboim So sorry - I never saw your questions at all. I was on medical leave when you posted them and had an away message on my email, but that wouldn't go back to you through GitHub notifications of course! And then it must have gotten lost in a sea of miscellaneous GitHub emails when I tried to sort through everything after. It would take me some time to try it all out again and remember how everything works - honestly right now this will be hard for me to do before the end of May because I am co-organizing HSP and that is right after our current semester ends. Would that be an OK timeline for when you plan to pick it up again? I may be able to squeeze it in earlier but I don't want to make promises I'm not sure that I can keep! |
I don't think I'll touch anything until maybe June, so no worries from my side. |
@ldecicco-USGS: Please review the holding status |
Hi @bnicenboim - checking in with the rOpenSci editors team. Do you still plan on picking this back up? |
Yes, I'm picking it up. Sorry for the long delay. I also wanted to add some other features. I'll see how easy it is or if at least the function names that I'll have are compatible with these other usages I'm thinking about. Also, I don't think my answer to @lisalevinson is relevant anymore. She's completely right in that the two uses of causal_lp are confusing. I'll divide it into two functions. |
Phew, it took some time but I'm mostly done. I'm checking that I still comply with all the requisites of ropensci. Then, I'll report the changes I made and how I answered the reviewers' issues. |
Submitting Author Name: Bruno Nicenboim
Submitting Author Github Handle: @bnicenboim
Due date for @lisalevinson: 2023-05-24
Repository: https://github.com/bnicenboim/pangoling
Version submitted: 0.0.0.9005
Submission type: Standard
Editor: @karthik
Reviewers: @lisalevinson, @utkuturk
Due date for @utkuturk: 2023-05-29
Archive: TBD
Version accepted: TBD
Language: en
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
The package is built on top of the python package transformers, and it offers some basic functionality for text analysis, including tokenization and perplexity calculation. Crucially, pangoling also offers word predictability, which is widely used as a predictor in psycho- and neurolinguistics, and it's not trivial to obtain. Also, transformers works with "tokens" rather than "words", and pangoling takes care of the mapping between the tokens and the target words (or even phrases).

This is mostly for psycho/neuro/- linguists that use word predictability as a predictor in their research, such as in ERP/EEG and reading studies.
Another R package that acts as a wrapper for transformers is text. However, text is more general, and its focus is on Natural Language Processing and Machine Learning. pangoling is much more specific and the focus is on measures used as predictors in analyses of data from experiments, rather than NLP. text doesn't allow for generating pangoling output in a straightforward way and in fact, I'm not sure if it's even possible to get token probabilities from text since it seems more limited than the python package transformers.

NA
#573

pkgcheck items which your package is unable to pass.

pkgcheck fails only because of the use of <<-. But this is done in .onLoad as recommended by reticulate. Also see this issue.

Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
Do you intend for this package to go on CRAN?
Do you intend for this package to go on Bioconductor?
Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
MEE Options
Code of conduct