layout | title |
---|---|
default | Generic Features Library |
This document describes the generic features library that is available as part of `ddlib`, the utility library included with DeepDive (under `$DEEPDIVE_HOME/ddlib/ddlib`).
By "generic features" we denote a set of features that are not application- or domain-dependent and can be used to obtain good baseline quality for mention and relation extraction. Feature engineering is one of the most time-consuming operations in Knowledge Base Construction (KBC), and it is often difficult to start building features from scratch. The goal of the generic features library is to allow users of DeepDive who are not KBC experts to get their application off the ground with good starting quality.
The generic features library builds its features from the Natural Language Processing (NLP) annotations (Part of Speech, Named Entity Recognition, dependency paths, ...) of the sentences in the corpus. Examples of features include: the dependency path between two mentions composing a relation mention, the Named Entity Recognition tags of the words composing a mention, the dependency path between a mention and a keyword from a user-specified dictionary, and many others. See below for the complete list.
The user of the library can optionally specify one or more dictionaries. These are sets of words that the user believes are relevant for the correct classification of mentions and relations, and are often domain- or application-specific. The generic features library uses the dictionaries to create additional features, allowing domain knowledge to be included in the set of features. More details about dictionaries and their use in the library are in the Using Dictionaries section below.
The generic features library creates different sets of features for mentions and for relations, because the two kinds of objects differ in nature and in which features are most relevant to them.
There are various "classes" of generic features, which can be distinguished by their prefix.
The list of generic features for a mention is the following:
- The set of Part of Speech tag(s) of the word(s) composing the mention (prefix: `POS_SEQ`);
- The set of Named Entity Recognition tag(s) of the word(s) composing the mention (`NER_SEQ`);
- The set of lemmas of the word(s) composing the mention (`LEMMA_SEQ`);
- The set of word(s) composing the mention (`WORD_SEQ`);
- The (sum of the) length(s) of the word(s) composing the mention (`LENGTH`);
- A feature denoting whether the first word of the mention starts with a capital letter (`STARTS_WITH_CAPITAL`);
- The lemmas and the NER tags in a window of size up to 3 around the mention, both on the left and on the right of the mention. These are also combined (i.e., a window on the left and a window on the right are merged into a single feature), to give a total of (up to) 15 features (3 on left, 3 on right, 3 times 3 combinations of left and right) with lemmas, and 15 for NERs (`W`);
- Features denoting whether the mention (or a substring of it of length up to 3) appears in a user-specified dictionary (`IN_DICT`);
- Features indicating whether the sentence containing the mention also contains some keyword that appears in a user-specified dictionary (`KW_IND`);
- The shortest dependency path(s) between the mention and the keyword(s) from user-specified dictionaries that appear in the sentence. Multiple variants of the dependency path are used as features (edge labels and lemmas, edge labels only, edge labels and lemmas replaced with dictionary identifier if the lemma is in a dictionary) (`KW`).
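The count of 15 window features per annotation type can be checked with a small sketch. The helper `window_features` and the feature-string format below are hypothetical and only illustrate the counting; they are not ddlib's implementation, and the strings ddlib actually emits may look different.

```python
def window_features(tokens, start, end, max_size=3):
    """Enumerate left, right, and combined windows around tokens[start:end].

    Hypothetical helper: the string format is illustrative, not ddlib's.
    """
    lefts = []
    rights = []
    for size in range(1, max_size + 1):
        # A window is only produced if it fits inside the sentence.
        if start - size >= 0:
            lefts.append((size, tokens[start - size:start]))
        if end + size <= len(tokens):
            rights.append((size, tokens[end:end + size]))

    features = []
    for size, window in lefts:
        features.append("W_LEFT_{}_[{}]".format(size, " ".join(window)))
    for size, window in rights:
        features.append("W_RIGHT_{}_[{}]".format(size, " ".join(window)))
    # Every left window is merged with every right window.
    for lsize, lwin in lefts:
        for rsize, rwin in rights:
            features.append("W_L_{}_R_{}_[{}]_[{}]".format(
                lsize, rsize, " ".join(lwin), " ".join(rwin)))
    return features


lemmas = ["yesterday", "in", "paris", "barack", "obama",
          "marry", "michelle", "robinson", "again"]
# The mention "barack obama" covers tokens 3 and 4 (the end index 5 is exclusive).
feats = window_features(lemmas, 3, 5)
print(len(feats))  # 3 left + 3 right + 3 * 3 combined = 15
```

A mention at the very start or end of a sentence has fewer windows available, which is why the totals above are stated as "up to" 15.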
The list of generic features for a relation is the following (the prefixes are the same as the ones for the mentions, except where otherwise specified):
- The set of Part of Speech tags for the words between the mentions in the relation;
- The set of Named Entity Recognition tags for the words between the mentions in the relation;
- The set of lemmas of the words between the mentions in the relation;
- The set of words between the mentions in the relation;
- The sum of the lengths of the words in the mentions;
- Indicator feature for whether the mentions start with a capital letter;
- The n-grams of size up to 3 of the lemmas and the NER tags of the words between the mentions in the relation (prefix: `NGRAM`);
- The lemmas and the NERs in a window of size up to 3 around the mentions composing the relation. These windows are used only in combined form (i.e., a left window and a right window are merged into a single feature), giving a total of (up to) 9 features for the lemmas and 9 for the NERs;
- Features denoting whether the mentions in the relation (or substrings of them of size up to 3) appear in some user-specified dictionaries;
- Indicator features denoting whether the sentence containing the relation also contains keywords appearing in user-specified dictionaries;
- The shortest dependency paths between the mentions and keywords in user-specified dictionaries that are in the sentence. Each feature is composed of both dependency paths, one from each mention to the keyword. Multiple variants of the paths are used, as in the mention case.

If the two mentions composing a relation are 'inverted' with respect to a canonical order defined by the user, a prefix indicating this fact is prepended to all the generic features.
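The n-gram features in the relation list can be pictured with the following sketch. The generator `ngram_features` is a hypothetical stand-in, not ddlib code. The `NGRAM_1_[word]` shape follows the example given later in this document; the format used here for n-grams longer than one word is an assumption.

```python
def ngram_features(tokens, max_n=3):
    """Yield one feature string per n-gram of size 1..max_n.

    Hypothetical helper: multi-word formatting is assumed, not taken from ddlib.
    """
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield "NGRAM_{}_[{}]".format(n, " ".join(tokens[i:i + n]))


# Lemmas of the words found between the two mentions of a relation.
between = ["and", "his", "wife"]
ngrams = list(ngram_features(between))
for feature in ngrams:
    print(feature)
```

For three words between the mentions this yields three unigrams, two bigrams, and one trigram.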
In order to use the generic features functionality, the user must import `ddlib` in their Python extractor:

    import ddlib

`$DEEPDIVE_HOME/ddlib/ddlib` must appear in the user's `PYTHONPATH` environment variable in order to be able to use `ddlib`.
As explained in the introduction of this document, the user may optionally specify one or more dictionaries of keywords. These are used to create generic features and are a way to incorporate domain- or application-specific knowledge into the set of generic features.
Dictionaries are seen as sets of keywords that are mapped to a dictionary identifier. All keywords in a dictionary are mapped to the same dictionary identifier. In some features, keywords are replaced by their dictionary identifier, which reduces sparsity. In practice, a dictionary is a plain text file containing one keyword per line:

    keyword1
    keyword2
    keyword3
    ...

Note that a keyword can consist of multiple words.
The user may load a dictionary by calling the `ddlib.load_dictionary` function, e.g.:

    import ddlib
    ...
    ddlib.load_dictionary("marriage_keywords.txt", dict_id="marry")
    ...

The `dict_id` parameter is optional and allows the user to specify the dictionary identifier. If this is not specified, the system will use an incremental positive integer as identifier. Multiple dictionaries can be loaded through multiple calls and they will all be used in the generic features.
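To make the file format and the keyword-to-identifier mapping concrete, here is a minimal sketch that does not use ddlib at all. The helper `read_keyword_file` is hypothetical and only mimics the idea described above (every keyword in a file maps to the same identifier). It is not how `ddlib.load_dictionary` is implemented.

```python
import os
import tempfile


def read_keyword_file(path, dict_id, mapping):
    """Map every non-empty line of the file at `path` to `dict_id`.

    Hypothetical helper for illustration; not part of ddlib.
    """
    with open(path) as f:
        for line in f:
            keyword = line.strip()
            if keyword:
                mapping[keyword] = dict_id
    return mapping


# Write a small dictionary file: one keyword per line,
# where a keyword may consist of several words.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("wife\nhusband\nget married\n")
    path = tmp.name

keyword_to_dict = read_keyword_file(path, "marry", {})
os.remove(path)
print(keyword_to_dict)
```

Calling the helper again with another file and another identifier would add more entries to the same mapping, which mirrors loading several dictionaries through multiple calls.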
The library represents features as strings.
To obtain the generic features for a mention, the library provides the generator `ddlib.get_generic_features_mention`, which can be used as follows:

    import ddlib
    ...
    for feature in ddlib.get_generic_features_mention(sentence, span):
        # do something with the feature
        print(feature)
The first parameter, `sentence`, is an ordered list of `ddlib.Word` objects, where each object represents a word in the sentence and the list is sorted according to the order of the words in the sentence. The second parameter, `span`, is a `ddlib.Span` object, representing the text span corresponding to the mention. Consult the Pydoc documentation (and the code) for ddlib for more information about these objects and how to generate them (especially the `get_sentence` and `get_span` functions).
For relations, the user can obtain the generic features using the `ddlib.get_generic_features_relation` generator as follows:

    import ddlib
    ...
    for feature in ddlib.get_generic_features_relation(sentence, span1, span2):
        # do something with the feature
        print(feature)
The parameters are, respectively, an ordered list of `ddlib.Word` objects and the two `ddlib.Span` objects representing the mentions composing the relation.
Note that `ddlib.get_generic_features_mention` and `ddlib.get_generic_features_relation` are Python generators, so they should be used in a loop.
Moreover, the generators may yield multiple copies of the same feature (e.g., if a word appears twice between two mentions in a relation, the feature `NGRAM_1_[word]` will be generated twice). It is the user's responsibility to filter out duplicated features if needed.
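One simple way to filter duplicates is to remember which features have already been seen while iterating. The sketch below uses a stand-in generator, `fake_feature_generator`, in place of the ddlib generators so that it runs without ddlib installed. Both it and the `unique` helper are hypothetical names introduced for this example.

```python
def fake_feature_generator():
    """Stand-in for a ddlib generator; yields one feature twice."""
    yield "NGRAM_1_[the]"
    yield "NGRAM_1_[wife]"
    yield "NGRAM_1_[the]"


def unique(features):
    """Yield each feature once, keeping first-seen order."""
    seen = set()
    for feature in features:
        if feature not in seen:
            seen.add(feature)
            yield feature


deduped = list(unique(fake_feature_generator()))
print(deduped)
```

In an extractor, the same `unique` wrapper could be placed around the real generator call. If the order of the features does not matter, collecting them into a `set` is enough.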