Tags are joined with a comma and padded with asterisks #3491
Conversation
Thanks for tagging me on this. I think there is an issue with your Mecab configuration and this PR should not be accepted, but it raises a point that can be fixed in the spaCy use of Mecab. The format with the dashes is the

This format was used in the draft Unidic/UD mapping I received in 2017 and mentioned in the paper Universal Dependencies Version 2 for Japanese (there's not a correspondence table, but look for "名詞" in the text). Can you post the output of

Because Unidic ships a default format and it's not obvious how to change it (see taku910/mecab#38), I didn't think to set the format explicitly in spaCy when loading Mecab, but that would be a good idea. I'll see about adding it; it should just involve passing a format string to the Tagger when it's created.

Looking at Unidic v2.3.0, the latest version, there's a
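The difference between the two tag styles discussed here can be sketched roughly as follows. This is an illustrative, dependency-free sketch, not spaCy's actual code: the function name `dash_join_pos` and the assumption that the first four comma-separated Unidic feature fields are the POS fields are my own, chosen for the example.

```python
def dash_join_pos(feature_str, n_pos_fields=4):
    """Convert a raw Unidic feature string, whose POS fields are
    comma-separated and padded with "*" for unused slots, into the
    dash-joined style used in the Unidic/UD mapping.
    Sketch only; field layout is an assumption, not spaCy's API."""
    fields = feature_str.split(",")[:n_pos_fields]
    # drop "*" padding before joining with dashes
    return "-".join(f for f in fields if f != "*")
```

For example, `dash_join_pos("名詞,固有名詞,地名,一般,*,*")` gives `"名詞-固有名詞-地名-一般"`, while a shorter entry like `"動詞,一般,*,*"` collapses to `"動詞-一般"` once the padding is stripped.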
Thank you for your prompt review. The output of

I haven't changed

I looked at my Ubuntu/18.04 environment and confirmed that it also had exactly the same output from

By the way, I had to manually downgrade unidic-mecab on Ubuntu as the latest
It looks to me that spaCy/spacy/lang/ja/__init__.py (lines 52 to 68 in 9e14b2b)
You're absolutely right. I confirmed that the tests are currently failing but run correctly with your patch. I guess there must have been a formatting change at some point and this test wasn't updated... Thanks for catching it!
@HiromuHota @polm Thanks for your work on Japanese and the analysis! So just to confirm: this PR should be merged then, right?

Btw, in case you haven't seen it, I ended up making a small modification to the way the Mecab tags are stored on the

```python
get_mecab_tag = lambda token: token.doc.user_data["mecab_tags"][token.i]
Token.set_extension('mecab_tag', getter=get_mecab_tag)
```
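The getter-extension pattern in the snippet above can be exercised without installing spaCy. The following is a minimal, dependency-free sketch; the `Doc` and `Token` stand-in classes below are illustrative substitutes I wrote for this example, not spaCy's real classes — only the getter wiring mirrors the snippet.

```python
class Doc:
    def __init__(self, words, mecab_tags):
        # spaCy stores arbitrary state in doc.user_data; mirror that here
        self.user_data = {"mecab_tags": mecab_tags}
        self.tokens = [Token(self, i, w) for i, w in enumerate(words)]

class Token:
    _extensions = {}

    def __init__(self, doc, i, text):
        self.doc, self.i, self.text = doc, i, text

    @classmethod
    def set_extension(cls, name, getter):
        cls._extensions[name] = getter

    @property
    def _(self):
        # emulate spaCy's token._ namespace by resolving registered getters
        token = self
        class Underscore:
            def __getattr__(self, name):
                return Token._extensions[name](token)
        return Underscore()

# the same two lines as in the comment above, applied to the stand-ins
get_mecab_tag = lambda token: token.doc.user_data["mecab_tags"][token.i]
Token.set_extension("mecab_tag", getter=get_mecab_tag)
```

With this in place, `Doc(["東京", "タワー"], ["名詞-固有名詞-地名-一般", "名詞-普通名詞-一般"]).tokens[0]._.mecab_tag` resolves through the getter to the first stored tag.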
I think this PR is good to merge 👍 Also noted on the tags change, that looks good too, thanks for the heads up!
* Add failing test for explosion#3356
* Fix test that caused pytest to choke on Python3
* adding kb_id as field to token, el as nlp pipeline component
* annotate kb_id through ents in doc
* kb snippet, draft by Matt (wip)
* documented some comments and todos
* hash the entity name
* add pyx and separate method to add aliases
* fix compile errors
* adding aliases per entity in the KB
* very minimal KB functionality working
* adding and retrieving aliases
* get candidates by alias
* bugfix adding aliases
* use StringStore
* raising error when adding alias for unknown entity + unit test
* avoid value 0 in preshmap and helpful user warnings
* check and unit test in case prior probs exceed 1
* correct size, not counting dummy elements in the vector
* check the length of entities and probabilities vector + unit test
* create candidate object from entry pointer (not fully functional yet)
* store entity hash instead of pointer
* unit test on number of candidates generated
* property getters and keep track of KB internally
* Entity class
* ensure no candidates are returned for unknown aliases
* minimal EL pipe
* name per entity
* select candidate with highest prior probabiity
* use nlp's vocab for stringstore
* error msg and unit tests for setting kb_id on span
* delete sandbox folder
* Update v2-1.md
* Fix xfail marker
* Update wasabi pin
* Fix tokenizer on Python2.7 (explosion#3460)

  spaCy v2.1 switched to the built-in re module, where v2.0 had been using the third-party regex library. When the tokenizer was deserialized on Python2.7, the `re.compile()` function was called with expressions that featured escaped unicode codepoints that were not in Python2.7's unicode database. Problems occurred when we had a range between two of these unknown codepoints, like this:

  ```
  '[\\uAA77-\\uAA79]'
  ```

  On Python2.7, the unknown codepoints are not unescaped correctly, resulting in arbitrary out-of-range characters being matched by the expression. This problem does not occur if we instead have a range between two unicode literals, rather than the escape sequences. To fix the bug, we therefore add a new compat function that unescapes unicode sequences using the `ast.literal_eval()` function. Care is taken to ensure we do not also escape non-unicode sequences. Closes explosion#3356.

  - [x] I have submitted the spaCy Contributor Agreement.
  - [x] I ran the tests, and all new and existing tests passed.
  - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Set version to 2.1.2
* 'entity_linker' instead of 'el'
* specify unicode strings for python 2.7
* Merge branch 'spacy.io' [ci skip]
* Add missing space in landing page (explosion#3462) [ci skip]
* Fix train loop for train_textcat example
* property annotations for fields with only a getter
* adding future import unicode literals to .py files
* error and warning messages
* Update Binder [ci skip]
* Fix typo [ci skip]
* Update landing example [ci skip]
* Improve landing example [ci skip]
* Add xfailing test for explosion#3468
* Slightly modify test for explosion#3468

  Check for Token.is_sent_start first (which is serialized/deserialized correctly)
* Fix test for explosion#3468
* Add xfail test for explosion#3433. Improve test for add label.
* 💫 Fix class mismap on parser deserializing (closes explosion#3433) (explosion#3470)

  v2.1 introduced a regression when deserializing the parser after parser.add_label() had been called. The code around the class mapping is pretty confusing currently, as it was written to accommodate backwards model compatibility. It needs to be revised when the models are next retrained.
  Closes explosion#3433
* 💫 Add better and serializable sentencizer (explosion#3471)
* Add better serializable sentencizer component
* Replace default factory
* Add tests
* Tidy up
* Pass test
* Update docs
* Add cheat sheet to spaCy 101
* Add blog post to v2.1 page
* Bug fixes and options for TextCategorizer (explosion#3472)
* Fix code for bag-of-words feature extraction

  The _ml.py module had a redundant copy of a function to extract unigram bag-of-words features, except one had a bug that set values to 0. Another function allowed extraction of bigram features. Replace all three with a new function that supports arbitrary ngram sizes and also allows control of which attribute is used (e.g. ORTH, LOWER, etc).
* Support 'bow' architecture for TextCategorizer

  This allows efficient ngram bag-of-words models, which are better when the classifier needs to run quickly, especially when the texts are long. Pass architecture="bow" to use it. The extra arguments ngram_size and attr are also available, e.g. ngram_size=2 means unigram and bigram features will be extracted.
* Fix size limits in train_textcat example
* Explain architectures better in docs
* Fix formatting [ci skip]
* Merge branch 'spacy.io' [ci skip]
* Small tweak to ensemble textcat model
* Set version to v2.1.3
* Update binderVersion
* Update favicon (closes explosion#3475) [ci skip]
* Update Thai tag map (explosion#3480)
* Update Thai tag map

  Update Thai tag map
* Create wannaphongcom.md
* Add Estonian to docs [ci skip] (closes explosion#3482)
* entity as one field instead of both ID and name
* Fix GPU training for textcat.

  Closes explosion#3473
* Fix social image
* DOC: Update tokenizer docs to include default value for batch_size in pipe (explosion#3492)
* Fix/irreg adverbs extension (explosion#3499)
* extended list of irreg adverbs
* added test to exceptions
* fixed typo
* fix(util): fix decaying function output (explosion#3495)
* fix(util): fix decaying function output
* fix(util): better test and adhere to code standards
* fix(util): correct variable name, pytestify test, update website text
* adds textpipe to universe (explosion#3500) [ci skip]
* Adds textpipe to universe
* signed contributor agreement
* Adjust formatting, code style and use "standalone" category
* Fix met a description in universe projects [ci skip]
* Tags are joined with a comma and padded with asterisks (explosion#3491)

  <!--- Provide a general summary of your changes in the title. -->

  ## Description
  <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. -->

  Fix a bug in the test of JapaneseTokenizer. This PR may require @polm's review.

  ### Types of change
  <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? -->

  Bug fix

  ## Checklist
  <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] -->

  - [x] I have submitted the spaCy Contributor Agreement.
  - [x] I ran the tests, and all new and existing tests passed.
  - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Add spaCy IRL to landing [ci skip]
* Update landing.js
* added tag_map for indonesian
* changed tag map from .py to .txt to see if tests pass
* added symbols import
* added utf8 encoding flag
* added missing SCONJ symbol
* Auto-format
* Remove unused imports
* Make tag map available in Indonesian defaults
* Auto-format
* added tag_map for indonesian (explosion#3515)
* added tag_map for indonesian
* changed tag map from .py to .txt to see if tests pass
* added symbols import
* added utf8 encoding flag
* added missing SCONJ symbol
* Auto-format
* Remove unused imports
* Make tag map available in Indonesian defaults
* Update compatibility [ci skip]
* failing test for Issue explosion#3449
* failing test for Issue explosion#3521
* fixing Issue explosion#3521 by adding all hyphen variants for each stopword
* unicode string for python 2.7
* specify encoding in files
* Update links and http -> https (explosion#3532)
* update links and http -> https
* SCA
* Update Thai tokenizer_exception list (explosion#3529)
* add tokenizer_exceptions word (ก-น) from https://goo.gl/JpJ2qq
* update tokenizer_exceptions word list
* add contributor file
* Remove non-existent example (closes explosion#3533)
* Don't make "settings" or "title" required in displaCy data (closes explosion#3531)
* addressed all comments by Ines
* Improved Dutch language resources and Dutch lemmatization (explosion#3409)
* Improved Dutch language resources and Dutch lemmatization
* Fix conftest
* Update punctuation.py
* Auto-format
* Format and fix tests
* Remove unused test file
* Re-add deleted test
* removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains
* Cleaner lemmatization files
* updated tag map with missing tags
* fixed tag_map.py merge conflict
* fix typos in tag_map flagged by `python -m debug-data` (explosion#3542)

  ## Checklist
  <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] -->

  - [ ] I have submitted the spaCy Contributor Agreement.
  - [ ] I ran the tests, and all new and existing tests passed.
  - [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

  Co-authored-by: Ines Montani <[email protected]>
* Update Thai stop words (explosion#3545)
* test sPacy commit to git fri 04052019 10:54
* change Data format from my format to master format
* ทัทั้งนี้ ---> ทั้งนี้
* delete stop_word translate from Eng
* Adjust formatting and readability
* Added Ludwig among the projects (explosion#3548) [ci skip]
* Added Ludwig among the projects
* Create w4nderlust.md
* Add Uber to logo wall
* Removes duplicate in table (explosion#3550)
* Removes duplicate in table

  Just fixing typos.
* Remove newline

  Co-authored-by: Ines Montani <[email protected]>
* Auto-format
* Make sure path is string (resolves explosion#3546)
* Add xfailing test for explosion#3555
* Fix typo in web docs cli.md (explosion#3559)
* Tidy up and auto-format
* Ensure match pattern error isn't raised on empty errors (closes explosion#3549)
* Fix website docs for Vectors.from_glove (explosion#3565)
* Fix website docs for Vectors.from_glove
* Add myself as a contributor
* Added project gracyql to Universe (explosion#3570) (resolves explosion#3568)

  As discussed with Ines in explosion#3568, adding a new project proposal for the community in SpaCy Universe website GracyQL a tiny graphql wrapper aroung spacy using graphene and starlette.

  ## Description
  Change only in universe.json file to add a new project

  ### Types of change
  New project reference in Universe

  ## Checklist
  - [x ] I have submitted the spaCy Contributor Agreement.
  - [x ] I ran the tests, and all new and existing tests passed.
  - [ x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Add myself to contributors (explosion#3575)
* Signed agreement (explosion#3577)
* Added Turkish Lira symbol(₺) (explosion#3576)

  Added Turkish Lira symbol(₺) https://en.wikipedia.org/wiki/Turkish_lira
* Change default output format from `jsonl` to `json` for cli convert (explosion#3583) (closes explosion#3523)
* Changing default ouput format from jsonl to json for cli convert
* Adding Contributor Agreement
* Remove Datacamp
* Fix formatting
* Improved training and evaluation (explosion#3538)
* Add early stopping
* Add return_score option to evaluate
* Fix missing str to path conversion
* Fix import + old python compatibility
* Fix bad beam_width setting during cpu evaluation in spacy train with gpu option turned on
* Fix symlink creation to show error message on failure (explosion#3589) (resolves explosion#3307)
* Fix symlink creation to show error message on failure. Update tests to reflect those changes.
* Fix test to succeed on non windows systems.
* Fix issue explosion#3551: Upper case lemmas

  If the Morphology class tries to lemmatize a word that's not in the string store, it's forced to just return it as-is. While loading exceptions, the class could hit a case where these strings weren't in the string store yet. The resulting lemmas could then be cached, leading to some words receiving upper-case lemmas. Closes explosion#3551.
* Set version to v2.1.4.dev0
* Create fizban99.md (explosion#3601)
* entity types for colors should be in uppercase (explosion#3599)

  although the text indicates the entity types should be in lowercase, the sample code shows uppercase, which is the correct format.
* Create Dobita21.md (explosion#3614)

  <!--- Provide a general summary of your changes in the title. -->

  ## Description
  <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. -->

  ### Types of change
  <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? -->

  ## Checklist
  <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] -->

  - [x] I have submitted the spaCy Contributor Agreement.
  - [x] I ran the tests, and all new and existing tests passed.
  - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Update landing and feature overview
* Remove unused image
* Add course to 101
* Update link [ci skip]
* Add Thai norm_exceptions (explosion#3612)
* test sPacy commit to git fri 04052019 10:54
* change Data format from my format to master format
* ทัทั้งนี้ ---> ทั้งนี้
* delete stop_word translate from Eng
* Adjust formatting and readability
* add Thai norm_exception
* Add Dobita21 SCA
* editรึ : หรือ,
* Update Dobita21.md
* Auto-format
* Integrate norms into language defaults
* Add save after `--save-every` batches for `spacy pretrain` (explosion#3510)

  <!--- Provide a general summary of your changes in the title. -->

  When using `spacy pretrain`, the model is saved only after every epoch. But each epoch can be very big since `pretrain` is used for language modeling tasks. So I added a `--save-every` option in the CLI to save after every `--save-every` batches.

  ## Description
  <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. -->

  To test...
  Save this file to `sample_sents.jsonl`

  ```
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  {"text": "hello there."}
  ```

  Then run `--save-every 2` when pretraining.

  ```bash
  spacy pretrain sample_sents.jsonl en_core_web_md here -nw 1 -bs 1 -i 10 --save-every 2
  ```

  And it should save the model to the `here/` folder after every 2 batches. The models that are saved during an epoch will have a `.temp` appended to the save name. At the end of the training, you should see these files (`ls here/`):

  ```bash
  config.json     model2.bin      model5.bin      model8.bin
  log.jsonl       model2.temp.bin model5.temp.bin model8.temp.bin
  model0.bin      model3.bin      model6.bin      model9.bin
  model0.temp.bin model3.temp.bin model6.temp.bin model9.temp.bin
  model1.bin      model4.bin      model7.bin
  model1.temp.bin model4.temp.bin model7.temp.bin
  ```

  ### Types of change
  <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? -->

  This is a new feature to `spacy pretrain`.

  🌵 **Unfortunately, I haven't been able to test this because compiling from source is not working (cythonize error).**

  ```
  Processing matcher.pyx
  [Errno 2] No such file or directory: '/Users/mwu/github/spaCy/spacy/matcher.pyx'
  Traceback (most recent call last):
    File "/Users/mwu/github/spaCy/bin/cythonize.py", line 169, in <module>
      run(args.root)
    File "/Users/mwu/github/spaCy/bin/cythonize.py", line 158, in run
      process(base, filename, db)
    File "/Users/mwu/github/spaCy/bin/cythonize.py", line 124, in process
      preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp")
    File "/Users/mwu/github/spaCy/bin/cythonize.py", line 87, in preserve_cwd
      func(*args)
    File "/Users/mwu/github/spaCy/bin/cythonize.py", line 63, in process_pyx
      raise Exception("Cython failed")
  Exception: Cython failed
  Traceback (most recent call last):
    File "setup.py", line 276, in <module>
      setup_package()
    File "setup.py", line 209, in setup_package
      generate_cython(root, "spacy")
    File "setup.py", line 132, in generate_cython
      raise RuntimeError("Running cythonize failed")
  RuntimeError: Running cythonize failed
  ```

  Edit: Fixed! after deleting all `.cpp` files: `find spacy -name "*.cpp" | xargs rm`

  ## Checklist
  <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] -->

  - [x] I have submitted the spaCy Contributor Agreement.
  - [x] I ran the tests, and all new and existing tests passed.
  - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Allow jupyter=False to override Jupyter mode (closes explosion#3598)
* Make flag shortcut consistent and document
* Update spacy evaluate example
* Auto-format
* Rename early_stopping_iter to n_early_stopping
* Document early stopping
* update norm_exceptions (explosion#3627)
* test sPacy commit to git fri 04052019 10:54
* change Data format from my format to master format
* ทัทั้งนี้ ---> ทั้งนี้
* delete stop_word translate from Eng
* Adjust formatting and readability
* add Thai norm_exception
* Add Dobita21 SCA
* editรึ : หรือ,
* Update Dobita21.md
* Auto-format
* Integrate norms into language defaults
* add acronym and some norm exception words
* Update seo.js
* Update Universe Website for pyInflect (explosion#3641)
* Improve redundant variable name (explosion#3643)
* Improve redundant variable name
* Apply suggestions from code review

  Co-Authored-By: pickfire <[email protected]>
* Doc changes for local website setup (explosion#3651)
* Create yaph.md so I can contribute (explosion#3658)
* Fix broken link to Dive Into Python 3 website (explosion#3656)
* Fix broken link to Dive Into Python 3 website
* Sign spaCy Contributor Agreement
* Remove dangling M (explosion#3657)

  I assume this is a typo. Sorry if it has a meaning that I'm not aware of.
* Update French example sents and add two German stop words (explosion#3662)
* Update french example sentences
* Add 'anderem' and 'ihren' to German stop words
* update response after calling add_pipe (explosion#3661)
* update response after calling add_pipe

  component: print_info is appended last, so need to show it at the end of the pipeline
* Create henry860916.md
* Add Thai lex_attrs (explosion#3655)
* test sPacy commit to git fri 04052019 10:54
* change Data format from my format to master format
* ทัทั้งนี้ ---> ทั้งนี้
* delete stop_word translate from Eng
* Adjust formatting and readability
* add Thai norm_exception
* Add Dobita21 SCA
* editรึ : หรือ,
* Update Dobita21.md
* Auto-format
* Integrate norms into language defaults
* add acronym and some norm exception words
* add lex_attrs
* Add lexical attribute getters into the language defaults
* fix LEX_ATTRS

  Co-authored-by: Donut <[email protected]>
  Co-authored-by: Ines Montani <[email protected]>
* Update universe.json (explosion#3653) [ci skip]
* Update universe.json
* Update universe.json
* Relax jsonschema pin (closes explosion#3628)
* Adjust wording and formatting [ci skip]
* Fix inconsistant lemmatizer issue explosion#3484 (explosion#3646)
* Fix inconsistant lemmatizer issue explosion#3484
* Remove test case
* Rewrite example to use Retokenizer (resolves explosion#3681)

  Also add helper to filter spans
* Fix typo (see explosion#3681)
* Simplify helper (see explosion#3681) [ci skip]
* Auto-format [ci skip]
* Re-added Universe readme (explosion#3688) (closes explosion#3680)
* Fix offset bug in loading pre-trained word2vec. (explosion#3689)
* Fix offset bug in loading pre-trained word2vec.
* add contributor agreement
* Add util.filter_spans helper (explosion#3686)
* Request to include Holmes in spaCy Universe (explosion#3685)
* Request to add Holmes to spaCy Universe

  Dear spaCy team, I would be grateful if you would consider my Python library Holmes for inclusion in the spaCy Universe.
  Holmes transforms the syntactic structures delivered by spaCy into semantic structures that, together with various other techniques including ontological matching and word embeddings, serve as the basis for information extraction. Holmes supports several use cases including chatbot, structured search, topic matching and supervised document classification. I had the basic idea for Holmes around 15 years ago and now spaCy has made it possible to build an implementation that is stable and fast enough to actually be of use - thank you! At present Holmes supports English and German (I am based in Munich) but could easily be extended to support any other language with a spaCy model.
* Added
* Add version tag to `--base-model` argument (closes explosion#3720)
* Submit contributor agreement (explosion#3705)
* fix thai bug (explosion#3693)

  fix tokenize for pythainlp
* Update glossary.py to match information found in documentation (explosion#3704) (closes explosion#3679)
* Update glossary.py to match information found in documentation

  I used regexes to add any dependency tag that was in the documentation but not in the glossary. Solves explosion#3679 👍
* Adds forgotten colon
* fixing regex matcher examples (explosion#3708) (explosion#3719)
* Improve Token.prob and Lexeme.prob docs (resolves explosion#3701)
* Fix DependencyParser.predict docs (resolves explosion#3561)
* Make "text" key in JSONL format optional when "tokens" key is provided (explosion#3721)
* Fix issue with forcing text key when it is not required
* Extending the docs to reflect the new behavior
* Call rmtree and copytree with strings (closes explosion#3713)
* Auto-format
* Add TWiML podcast to universe [ci skip]
* Fix return value of Language.update (closes explosion#3692)
* Set version to v2.1.4.dev1
* Fix push-tag script
* Fix .iob converter (closes explosion#3620)
* Replace cytoolz.partition_all with util.minibatch
* Set version to v2.1.4
* Merge branch 'spacy.io' [ci skip]
* 💫 Improve introspection of custom extension attributes (explosion#3729)
* Add custom __dir__ to Underscore (see explosion#3707)
* Make sure custom extension methods keep their docstrings (see explosion#3707)
* Improve tests
* Prepend note on partial to docstring (see explosion#3707)
* Remove print statement
* Handle cases where docstring is None
* Add check for callable to 'Language.replace_pipe' to fix explosion#3737 (explosion#3741)
* Fix lex_id docs (closes explosion#3743)
* Enhancing Kannada language Resources (explosion#3755)
* Updated stop_words.py

  Added more stopwords
* Create ujwal-narayan.md

  Enhancing Kannada language resources
* Update Scorer and add API docs
* Update Language.update docs
* Document Language.evaluate
* Marathi Language Support (explosion#3767)
* Adding Marathi language details and folder to it
* Adding few changes and running tests
* Adding few changes and running tests
* Update __init__.py

  mh -> mr
* Rename spacy/lang/mh/__init__.py to spacy/lang/mr/__init__.py
* mh -> mr
* Update norm_exceptions.py (explosion#3778)
* Update norm_exceptions.py

  Extended the Currency set to include Franc, Indian Rupee, Bangladeshi Taka,
  Korean Won, Mexican Dollar, and Egyptian Pound
* Fix formatting [ci skip]
* Use string name in setup.py

  Hopefully this will trick GitHub's parser into recognising it as a Python package and show us the dependents / "used by" statistics 🤞
* Corrected example model URL in requirements.txt (explosion#3786)

  The URL used to show how to add a model to the requirements.txt had the old release path (excl. explosion).
* Make jsonschema dependency optional (explosion#3784)
* fix all references to BILUO annotation format (explosion#3797)
* Incorrect Token attribute ent_iob_ description (explosion#3800)
* Incorrect Token attribute ent_iob_ description
* Add spaCy contributor agreement
* Fix typos in docs (closes explosion#3802) [ci skip]
* Improve E024 text for incorrect GoldParse (closes explosion#3558)
* Update UNIVERSE.md
* Create NirantK.md (explosion#3807) [ci skip]
* Add Baderlab/saber to universe.json (explosion#3806)
* Overwrites default getter for like_num in Spanish by adding _num_words and like_num to lex_attrs.py (explosion#3810) (closes explosion#3803)
* (explosion#3803) Spanish like_num returning false for number-like token
* (explosion#3803) Spanish like_num now returning True for number-like token
* Add multiple packages to universe.json (explosion#3809) [ci skip]
* Add multiple packages to universe.json

  Added following packages: NLPArchitect, NLPRe, Chatterbot, alibi, NeuroNER
* Auto-format
* Update slogan (probably just copy-paste mistake)
* Adjust formatting
* Update tags / categories
* Tidy up universe [ci skip]
* Update universe [ci skip]
* Update universe [ci skip]
* Update universe [ci skip]
* Fix for explosion#3811 (explosion#3815)

  Corrected type of seed parameter.
* Create intrafindBreno.md (explosion#3814)
* minor fix to broken link in documentation (explosion#3819) [ci skip]
* Update universe [ci skip]
* Update srsly pin
* Add merge_subtokens as parser post-process.

  Re explosion#3830
* Create Azagh3l.md (explosion#3836)
* Update lex_attrs.py (explosion#3835)

  Corrected typos, added french (from France) versions of some numbers.
* Add resume logic to spacy pretrain (explosion#3652)
* Added ability to resume training
* Add to readmee
* Remove duplicate entry
* Tidy up [ci skip]
* Add regression test for explosion#3839
* Update exemples.py (explosion#3838)

  Added missing hyphen and accent.
* Update error raising for CLI pretrain to fix explosion#3840 (explosion#3843)
* Add check for empty input file to CLI pretrain
* Raise error if JSONL is not a dict or contains neither `tokens` nor `text` key
* Skip empty values for correct pretrain keys and log a counter as warning
* Add tests for CLI pretrain core function make_docs.
* Add a short hint for the `tokens` key to the CLI pretrain docs
* Add success message to CLI pretrain
* Update model loading to fix the tests
* Skip empty values and do not create docs out of it
* Change vector training to work with latest gensim (fix explosion#3749) (explosion#3757)
* Dependency tree pattern matcher (explosion#3465)
* Functional dependency tree pattern matcher
* Tests fail due to inconsistent behaviour
* Renamed dependencymatcher and added optimizations
* Add optional `id` property to EntityRuler patterns (explosion#3591)
* Adding support for entity_id in EntityRuler pipeline component
* Adding Spacy Contributor aggreement
* Updating EntityRuler to use string.format instead of f strings
* Update Entity Ruler to support an 'id' attribute per pattern that explicitly identifies an entity.
* Fixing tests
* Remove custom extension entity_id and use built in ent_id token attribute.
* Changing entity_id to ent_id for consistent naming
* entity_ids => ent_ids
* Removing kb, cleaning up tests, making util functions private, use rsplit instead of split
* Update tokenizer.md for construction example (explosion#3790)
* Update tokenizer.md for construction example

  Self contained example. You should really say what nlp is so that the example will work as is
* Update CONTRIBUTOR_AGREEMENT.md
* Restore contributor agreement
* Adjust construction examples
* Auto-format [ci skip]
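The Python 2.7 tokenizer fix near the top of this commit list (explosion#3460) hinges on turning escaped `\uXXXX` sequences into actual codepoints before `re.compile()` sees them. A minimal sketch of such a compat helper is below; this is my own illustration of the `ast.literal_eval()` approach the commit message describes, not spaCy's actual implementation, and it assumes the pattern contains no triple quotes and does not end with a quote character.

```python
import ast

def unescape_unicode(string):
    """Turn literal \\uXXXX / \\UXXXXXXXX escape sequences into the
    actual codepoints, leaving other escapes (like \\d) untouched.
    Sketch only: assumes the input contains no triple quotes and
    does not end with a double quote."""
    if "\\u" not in string and "\\U" not in string:
        # nothing to unescape; avoid touching patterns like \d or \w
        return string
    # wrap in a string literal and let the parser resolve the escapes
    return ast.literal_eval('"""' + string + '"""')
```

With this helper, the problematic pattern from the commit message, `'[\uAA77-\uAA79]'` written with escaped backslashes, becomes a range between two real unicode literals, which compiles consistently.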
Description
Fix a bug in the test of JapaneseTokenizer.
This PR may require @polm's review.
Types of change
Bug fix
Checklist