Question: Why NLTK TweetTokenizer? #26

freecraver · 2022-02-07T17:19:30Z

Thanks for your work on this nice project.

I intend to create a library for text simplification, and potentially would like to integrate your package.
The selection of a tokenizer has an impact on the obtained readability scores and I was wondering how you approached this issue.

Was there a specific reason for choosing the TweetTokenizer over, e.g., the default/recommended NLTK tokenizer, which follows the Penn Treebank's definition of word boundaries more closely?

py-readability-metrics/readability/text/analyzer.py

Line 128 in 3ffb97f

tokenizer = TweetTokenizer()
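
To illustrate why the tokenizer choice can shift readability scores, here is a minimal sketch using only the standard library. Both tokenizers are toy stand-ins, not NLTK's implementations: `whole_word_tokens` keeps contractions as one token, while `treebank_like_tokens` splits off "n't", loosely in the Penn Treebank style. All function names and the sample text are assumptions made for this illustration.

```python
import re


def whole_word_tokens(text):
    """Toy tokenizer: keeps contractions such as "don't" as a single token."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def treebank_like_tokens(text):
    """Toy tokenizer: splits a trailing "n't" into its own token."""
    tokens = []
    for word in whole_word_tokens(text):
        if word.lower().endswith("n't") and len(word) > 3:
            # "don't" becomes "do" + "n't"
            tokens.extend([word[:-3], word[-3:]])
        else:
            tokens.append(word)
    return tokens


def words_per_sentence(text, tokenize):
    """Average word count per sentence, an input to many readability formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(tokenize(text)) / len(sentences)


sample = "We don't know. They can't tell."
print(words_per_sentence(sample, whole_word_tokens))     # 3.0
print(words_per_sentence(sample, treebank_like_tokens))  # 4.0
```

The same text yields a different words-per-sentence value depending only on how contractions are split, which is the kind of difference that would carry through to any score built on word counts.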

cdimascio · 2022-02-20T19:07:42Z

@freecraver I'm open to changing the tokenizer. Would you be interested in investigating the effort to switch over?

freecraver · 2022-02-21T10:02:41Z

Sure - please check #27 for my suggested changes.

freecraver mentioned this issue Feb 21, 2022

Allow word tokenization override #27
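
As a rough idea of what a word tokenization override could look like, here is a minimal sketch. It is not the API proposed in #27; `WordCounter`, `default_tokenize`, and the `tokenize` parameter are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Optional

# A tokenizer is any callable that maps text to a list of word tokens.
Tokenizer = Callable[[str], List[str]]


def default_tokenize(text: str) -> List[str]:
    """Hypothetical default: plain whitespace splitting."""
    return text.split()


class WordCounter:
    """Hypothetical analyzer that lets the caller swap in a tokenizer."""

    def __init__(self, tokenize: Optional[Tokenizer] = None):
        # Fall back to the default when no override is supplied.
        self._tokenize = tokenize or default_tokenize

    def count_words(self, text: str) -> int:
        return len(self._tokenize(text))


print(WordCounter().count_words("a well-known fact"))  # 3
```

Accepting a plain callable keeps the default behavior unchanged while letting a caller pass, for example, the `tokenize` method of an NLTK tokenizer instance.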

Open
