The goal of the project is to implement a part-of-speech (POS) tagger using structured learning.
This project uses Conditional Random Fields (CRF) as the modeling method and sklearn-crfsuite as the implementation of this method.
The source code of the project can be found on GitHub.
The basic implementation of the tagger is similar to the sklearn-crfsuite tutorial, but uses different datasets, feature sets and labels.
crf-pos-tagger uses the pos_train.conll dataset for training the model and pos_test.conll for evaluating the results.
The datasets contain Twitter messages. Each message is a sequence of
(POS tag, token) pairs, one pair per line, and messages are separated by an
empty line. A token is a word, mention, URL, hashtag, number, punctuation mark,
special symbol and so on. Tweets are not very good in terms of spelling (ppl, u,
ill, etc.) and absolutely awful in terms of letter case (i LOVE U sO MuCH).
Also, it is pretty hard to separate sentences: a sentence may or may not end
with a period. That is why crf-pos-tagger uses tweets instead of sentences.
For such a dataset, and under the assumption that a tweet ~== a sentence, it is pretty easy to write a parser that generates a list of lists of pairs.
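def parse_file(filename):
    """Read a CoNLL-style file into a list of tweets, each a list of (token, tag) pairs."""
    sentences = []
    s = []
    with open(filename, 'r') as f:
        for line in f:
            if line.strip():
                # Each non-empty line is "<tag>\t<token>".
                tag, token = line.strip().split('\t')
                s.append((token, tag))
            else:
                # An empty line ends the current tweet.
                sentences.append(s)
                s = []
    # Keep the last tweet if the file does not end with an empty line.
    if s:
        sentences.append(s)
    return sentences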
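The parsed list of lists can then be split into two parallel structures, one list of feature dicts per tweet and one list of tags per tweet, which is the shape the fit method of sklearn-crfsuite takes. The sketch below only illustrates that split. The function names follow the sklearn-crfsuite tutorial, while the single bias feature and the tags MENTION and WORD are assumptions made up for the example, not taken from the project or the dataset.

```python
def word2features(sent, i):
    # Placeholder feature set: a constant bias only (an assumption).
    return {'bias': 1.0}


def sent2features(sent):
    # One feature dict per token of the tweet.
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    # One tag per token of the tweet.
    return [tag for token, tag in sent]


# Two tiny hand-made tweets in the (token, tag) shape produced by the parser.
sentences = [
    [('@user', 'MENTION'), ('hello', 'WORD')],
    [('ok', 'WORD')],
]

X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]
```

X and y always have the same nesting: the i-th tweet in X has exactly as many feature dicts as the i-th tweet in y has tags.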
sklearn-crfsuite provides five algorithms:
– ‘lbfgs’ - Gradient descent using the L-BFGS method
– ‘l2sgd’ - Stochastic Gradient Descent with L2 regularization term
– ‘ap’ - Averaged Perceptron
– ‘pa’ - Passive Aggressive (PA)
– ‘arow’ - Adaptive Regularization Of Weight Vector (AROW)
The choice of algorithm wasn’t based on implementation details. Experimenting with different parameter values and different feature sets showed that the following configuration gives one of the best results for the project's needs:
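import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',  # gradient descent using the L-BFGS method
    c1=0.1,             # coefficient for L1 regularization
    c2=0.1,             # coefficient for L2 regularization
)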
Evaluation is straightforward: for each token, crf-pos-tagger compares the tag generated by the model with the tag written in pos_test.conll, and calculates the number of matches divided by the total number of tokens.
The default implementation without features gives an accuracy (fraction of matched tokens) of 0.1053.
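The metric described above can be sketched in a few lines. This is a minimal illustration, not the project's code: token_accuracy is a hypothetical name, and the tags A, B and C are placeholders rather than tags from the dataset.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted tag equals the reference tag."""
    matched = 0
    total = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        for t, p in zip(true_tags, pred_tags):
            if t == p:
                matched += 1
            total += 1
    # Avoid division by zero on an empty test set.
    return matched / total if total else 0.0


y_true = [['A', 'B', 'C'], ['A']]
y_pred = [['A', 'B', 'A'], ['A']]
print(token_accuracy(y_true, y_pred))  # 3 of 4 tokens match
```

The score is computed over tokens, not over tweets, so long tweets weigh more than short ones.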
After implementing the basics, it is necessary to add some features to the model to improve its accuracy.
feature set | accuracy |
---|---|
without features | 0.1053 |
+word suffix | 0.7636 |
+mention | 0.7756 |
+hashtag | 0.7859 |
+lowercased suffix | 0.8042 |
+urls | 0.8082 |
+word itself | 0.8203 |
+number | 0.8211 |
+more lowercased suffixes | 0.8412 |
+is title | 0.8519 |
+is upper | 0.8537 |
The first, most obvious and intuitive idea is to use the suffix of the word to determine the part of speech. Trying different suffix lengths shows that the best score is reached with the last four characters.
'word[-4:]': word[-4:],
Mentions and hashtags are obvious features for a dataset based on tweets, and they help the model find Twitter-specific parts of speech. 66 and 26 lines out of 2361 contain @ and #, respectively.
'mention': word.startswith('@') and len(word) > 1,
'hashtag': word.startswith('#') and len(word) > 1,
This improvement also uses knowledge of the dataset's nature. Twitter users don’t care about the case of their chAracTeRs. That is why a lowercased suffix can increase accuracy.
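A small sketch of why lowercasing helps: differently cased spellings of the same word collapse into one feature value instead of several. The helper name suffix_feature and the sample words are assumptions for illustration only.

```python
def suffix_feature(word, n=4):
    # Lowercase first, then take the last n characters.
    return word.lower()[-n:]


variants = ['MuCH', 'much', 'MUCH']

# Raw suffixes differ for every spelling...
raw = {w[-4:] for w in variants}
# ...while lowercased suffixes are all the same.
lowered = {suffix_feature(w) for w in variants}

print(len(raw), len(lowered))
```

With raw suffixes the model has to learn a separate weight for every casing it happens to see, while the lowercased version shares one weight across all of them.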
Another Twitter-specific part of speech is a URL.
import re

def is_url(s):
    # https://gist.github.com/gruber/249502#gistcomment-6465
    return bool(re.match(r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', s))
'url': is_url(word),
Why not use the word itself as a feature? It seems it may help in some cases, and it actually helps a lot. Of course, we use the lowercased version of the word.
'word.lower()': word.lower(),
Another binary feature, which just emits True if the token is a number.
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False
'number': is_number(word),
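One property of this check is worth keeping in mind: float() accepts more than plain digits. The snippet below repeats is_number so that it runs on its own and probes a few tokens. The tokens are made up for the example and are not taken from the dataset.

```python
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False


# Made-up probe tokens (assumptions, not dataset samples).
tokens = ['42', '3.14', '1e5', 'nan', 'inf', '1,000', 'u2']
results = {token: is_number(token) for token in tokens}

for token, flag in results.items():
    print(token, flag)
```

The first five tokens are reported as numbers, including nan and inf, while 1,000 and u2 are not. Whether that matters for tweets is an open question; it is just something to be aware of when reading the feature.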
Adding more suffixes of different lengths significantly improves the model’s score.
'word[-3:]': word.lower()[-3:],
'word[-2:]': word.lower()[-2:],
'word[-1:]': word.lower()[-1:],
Many users don’t care about the case of their letters, but some people do. Probably the case of the characters correlates with the part of speech in some cases.
'word.istitle()': word.istitle(),
This feature can be useful for recognizing abbreviations or the token I, for example.
'word.isupper()': word.isupper(),
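Put together, the features listed above could be assembled into one function along these lines. This is a sketch under assumptions, not the project's actual code: the function name word2features follows the sklearn-crfsuite tutorial, is_url is replaced by a much simpler stand-in pattern to keep the example short, the four-character suffix is assumed to be lowercased like the shorter ones, and the tag TAG is a placeholder.

```python
import re


def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False


def is_url(s):
    # Simplified stand-in for the longer regex-based check (an assumption).
    return bool(re.match(r'(?i)(https?://|www\.)\S+', s))


def word2features(sent, i):
    word = sent[i][0]
    lowered = word.lower()
    return {
        'word[-4:]': lowered[-4:],
        'word[-3:]': lowered[-3:],
        'word[-2:]': lowered[-2:],
        'word[-1:]': lowered[-1:],
        'mention': word.startswith('@') and len(word) > 1,
        'hashtag': word.startswith('#') and len(word) > 1,
        'url': is_url(word),
        'word.lower()': lowered,
        'number': is_number(word),
        'word.istitle()': word.istitle(),
        'word.isupper()': word.isupper(),
    }


# A made-up tweet; 'TAG' is a placeholder, not a tag from the dataset.
sent = [('@bob', 'TAG'), ('LOVE', 'TAG'), ('http://example.com', 'TAG'), ('2', 'TAG')]
features = [word2features(sent, i) for i in range(len(sent))]
```

Each token gets its own dict, so a tweet becomes a list of dicts, which is the per-sequence input format sklearn-crfsuite works with.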
Some more advanced techniques can be used to improve the model's accuracy. For example, some attributes of neighboring tokens can be added to the features (if the previous token is a number, the probability of the current token being a noun is higher, maybe :).
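The neighbor idea could look roughly like this. It is only a sketch of a possible extension, not something the project implements: the function name neighbor_features and the feature keys prefixed with -1: are assumptions, and TAG is a placeholder tag.

```python
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False


def neighbor_features(sent, i):
    """Features describing the token before position i, if there is one."""
    features = {}
    if i > 0:
        prev_word = sent[i - 1][0]
        features['-1:number'] = is_number(prev_word)
        features['-1:word.lower()'] = prev_word.lower()
    return features


sent = [('3', 'TAG'), ('cats', 'TAG')]
first = neighbor_features(sent, 0)   # no previous token, so no features
second = neighbor_features(sent, 1)  # describes the token '3'
print(first, second)
```

The returned dict can be merged into the token's own feature dict with dict.update, and the same pattern works for the next token with i + 1.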
Position in the sentence can also affect the results. A token at the beginning of a sentence is more likely to be a noun or a preposition than another part of speech.
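One simple way to expose position to the model is a pair of boolean flags for the first and the last token of a tweet, similar to the BOS and EOS flags in the sklearn-crfsuite tutorial. The sketch below is a possible extension, not part of the project; position_features is a hypothetical name and TAG is a placeholder tag.

```python
def position_features(sent, i):
    # Flags for the beginning and the end of the tweet.
    return {
        'BOS': i == 0,
        'EOS': i == len(sent) - 1,
    }


sent = [('i', 'TAG'), ('LOVE', 'TAG'), ('U', 'TAG')]
flags = [position_features(sent, i) for i in range(len(sent))]
print(flags)
```

Because a tweet is treated as one sentence here, these flags mark the start and end of the tweet, not of every grammatical sentence inside it.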
It is a study project and it is not ready for any kind of production usage, but feel free to contribute via a PR, issue, comment or any other way.