forked from CogStack/MedCAT
Merge pull request #92 from CogStack/spacy-v3
Upgrade spaCy to v3 and add the CI build pipeline
Showing 30 changed files with 516 additions and 322 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
name: build | ||
|
||
on: | ||
  push: | ||
    branches: [ master ] | ||
  pull_request: | ||
    branches: [ master ] | ||
|
||
jobs: | ||
  build: | ||
|
||
    runs-on: ubuntu-latest | ||
    strategy: | ||
      matrix: | ||
        python-version: [ 3.7, 3.8, 3.9 ] | ||
      max-parallel: 3 | ||
|
||
    steps: | ||
      - uses: actions/checkout@v2 | ||
      - name: Set up Python ${{ matrix.python-version }} | ||
        uses: actions/setup-python@v2 | ||
        with: | ||
          python-version: ${{ matrix.python-version }} | ||
      - name: Install dependencies | ||
        run: | | ||
          python -m pip install --upgrade pip | ||
          pip install -r requirements.txt | ||
      - name: Test | ||
        run: | | ||
          python -m unittest discover |
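
The `Test` step relies on the standard library's test discovery. As a rough sketch of what `python -m unittest discover` does, the snippet below runs the same discovery programmatically against a throwaway directory. The file name `test_ci_sketch.py` and the test inside it are made up for illustration and are not part of this commit.

```python
import tempfile
import textwrap
import unittest
from pathlib import Path

# Hypothetical test module, written to a temp dir purely for illustration.
EXAMPLE_TEST = textwrap.dedent(
    """
    import unittest

    class ExampleTest(unittest.TestCase):
        def test_addition(self):
            self.assertEqual(1 + 1, 2)
    """
)


def run_discovery(start_dir):
    """Roughly what `python -m unittest discover` does from `start_dir`."""
    # "test*.py" is the default discovery pattern.
    suite = unittest.defaultTestLoader.discover(start_dir, pattern="test*.py")
    return unittest.TextTestRunner(verbosity=0).run(suite)


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "test_ci_sketch.py").write_text(EXAMPLE_TEST)
    result = run_discovery(tmp)

print(result.testsRun, result.wasSuccessful())
```

On the command line a failing test makes the process exit with a non-zero status, which is what causes the workflow step to fail.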
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -41,3 +41,4 @@ tmp.py | |
|
||
# models files | ||
*.dat | ||
!examples/*.dat |
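
This hunk appears to come from an ignore file: `*.dat` excludes model files everywhere, and the new `!examples/*.dat` line re-includes the ones under `examples/`. The sketch below imitates the "last matching rule wins" behaviour in plain Python. It is a deliberate simplification (no directory rules, no `**`, and `*` may cross `/` here, unlike in git), and `is_ignored` is a hypothetical helper, not a git API.

```python
from fnmatch import fnmatchcase


def is_ignored(path, rules):
    """Return True if `path` ends up ignored; the last matching rule wins."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if "/" in pattern:
            # Patterns containing a slash are matched against the whole path.
            matched = fnmatchcase(path, pattern)
        else:
            # Slash-free patterns are matched against the file name alone.
            matched = fnmatchcase(path.rsplit("/", 1)[-1], pattern)
        if matched:
            ignored = not negated
    return ignored


RULES = ["*.dat", "!examples/*.dat"]

for candidate in ("models/vocab.dat", "examples/vocab.dat", "README.md"):
    print(candidate, is_ignored(candidate, RULES))
```

Rule order matters: if the negated rule came first, the later `*.dat` line would ignore the example files again.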
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
house 34444 0.3232 0.123213 1.231231 | ||
dog 14444 0.76762 0.76767 1.45454 |
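
The two added rows look like vocabulary entries: a token, an integer count, then the components of a vector. Assuming that layout (the file name and the exact delimiter are not visible in this view), a minimal parser might look like the following. `parse_vocab_line` is a hypothetical helper, not a MedCAT function.

```python
def parse_vocab_line(line):
    """Split one row into (word, count, vector).

    Assumes whitespace-separated fields: word, integer count, then floats.
    """
    parts = line.split()
    if len(parts) < 3:
        raise ValueError(f"expected word, count and a vector, got: {line!r}")
    word, count, *vector = parts
    return word, int(count), [float(value) for value in vector]


rows = [
    "house 34444 0.3232 0.123213 1.231231",
    "dog 14444 0.76762 0.76767 1.45454",
]
for row in rows:
    print(parse_vocab_line(row))
```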
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,30 +1,38 @@ | ||
  import re | ||
|
||
- def tag_skip_and_punct(doc, config): | ||
+ def tag_skip_and_punct(nlp, name, config): | ||
      r''' Detects and tags spacy tokens that are punctuation and that should be skipped. | ||
-     Args: | ||
-         doc (`spacy.tokens.Doc`): | ||
-             Spacy document that will be tagged. | ||
-         config (`medcat.config.Config`): | ||
-             Global config for medcat. | ||
-     Return: | ||
-         (`spacy.tokens.Doc): | ||
-             Tagged spacy document | ||
+     Args: | ||
+         nlp (spacy.language.<lng>): | ||
+             The base spacy NLP pipeline. | ||
+         name (`str`): | ||
+             The component instance name. | ||
+         config (`medcat.config.Config`): | ||
+             Global config for medcat. | ||
      ''' | ||
-     # Make life easier | ||
-     cnf_p = config.preprocessing | ||
|
||
-     for token in doc: | ||
-         if config.punct_checker.match(token.lower_) and token.text not in cnf_p['keep_punct']: | ||
-             # There can't be punct in a token if it also has text | ||
-             token._.is_punct = True | ||
-             token._.to_skip = True | ||
-         elif config.word_skipper.match(token.lower_): | ||
-             # Skip if specific strings | ||
-             token._.to_skip = True | ||
-         elif cnf_p['skip_stopwords'] and token.is_stop: | ||
-             token._.to_skip = True | ||
|
||
-     return doc | ||
|
||
+     return _Tagger(nlp, name, config) | ||
|
||
|
||
+ class _Tagger(object): | ||
|
||
+     def __init__(self, nlp, name, config): | ||
+         self.nlp = nlp | ||
+         self.name = name | ||
+         self.config = config | ||
|
||
+     def __call__(self, doc): | ||
+         # Make life easier | ||
+         cnf_p = self.config.preprocessing | ||
|
||
+         for token in doc: | ||
+             if self.config.punct_checker.match(token.lower_) and token.text not in cnf_p['keep_punct']: | ||
+                 # There can't be punct in a token if it also has text | ||
+                 token._.is_punct = True | ||
+                 token._.to_skip = True | ||
+             elif self.config.word_skipper.match(token.lower_): | ||
+                 # Skip if specific strings | ||
+                 token._.to_skip = True | ||
+             elif cnf_p['skip_stopwords'] and token.is_stop: | ||
+                 token._.to_skip = True | ||
|
||
+         return doc |
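
The change above turns `tag_skip_and_punct` from a function that tags a `doc` directly into a factory that returns a callable component, which matches how spaCy v3 hands `nlp` and `name` to component factories. How MedCAT registers the factory is not shown in this part of the diff, so the sketch below only exercises the factory-plus-callable shape with stand-in objects. The token stand-ins, the two regular expressions, and the `keep_punct` values are assumptions made up for the example, not MedCAT defaults.

```python
import re
from types import SimpleNamespace


class _Tagger:
    """Callable component, mirroring the class added in the diff."""

    def __init__(self, nlp, name, config):
        self.nlp = nlp
        self.name = name
        self.config = config

    def __call__(self, doc):
        cnf_p = self.config.preprocessing
        for token in doc:
            if self.config.punct_checker.match(token.lower_) and token.text not in cnf_p["keep_punct"]:
                token._.is_punct = True
                token._.to_skip = True
            elif self.config.word_skipper.match(token.lower_):
                token._.to_skip = True
            elif cnf_p["skip_stopwords"] and token.is_stop:
                token._.to_skip = True
        return doc


def tag_skip_and_punct(nlp, name, config):
    """Factory: builds the component instead of tagging a doc itself."""
    return _Tagger(nlp, name, config)


def make_token(text, is_stop=False):
    """Stand-in for a spaCy token with the custom `_` extension attributes."""
    return SimpleNamespace(
        text=text,
        lower_=text.lower(),
        is_stop=is_stop,
        _=SimpleNamespace(is_punct=False, to_skip=False),
    )


# Hypothetical config values, chosen only to drive each branch once.
config = SimpleNamespace(
    punct_checker=re.compile(r"[^a-z0-9]+"),
    word_skipper=re.compile(r"^(nos)$"),
    preprocessing={"keep_punct": {".", ":"}, "skip_stopwords": True},
)

tagger = tag_skip_and_punct(nlp=None, name="tag_skip_and_punct", config=config)
doc = [make_token("Fever"), make_token(","), make_token("the", is_stop=True), make_token(".")]
for token in tagger(doc):
    print(repr(token.text), token._.is_punct, token._.to_skip)
```

Keeping the state (`nlp`, `name`, `config`) on the instance is what lets the per-document call take only `doc`.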
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
. | ||
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
. | ||
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,2 @@ | ||
. | ||
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_md-0.4.0.tar.gz |
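
The three requirements hunks differ only in which scispaCy model archive they install (`lg`, `sm`, `md`), alongside `.` for the package itself. As a small sketch, the model name and version can be read back from such a URL. The naming pattern `<name>-<version>.tar.gz` is inferred from the three URLs above, and `model_from_url` is a hypothetical helper.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Inferred from the URLs in the diff: "<name>-<version>.tar.gz".
ARCHIVE_RE = re.compile(r"(?P<name>[A-Za-z0-9_]+)-(?P<version>\d+(?:\.\d+)*)\.tar\.gz")


def model_from_url(url):
    """Return (model_name, version) parsed from a model archive URL."""
    filename = PurePosixPath(urlparse(url).path).name
    match = ARCHIVE_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognised archive name: {filename!r}")
    return match["name"], match["version"]


URL = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_md-0.4.0.tar.gz"
print(model_from_url(URL))
```

Once the archive is installed, the name part is presumably the string one would pass to `spacy.load`.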