nltk.classify.maxent module¶
A classifier model based on the maximum entropy modeling framework. This framework considers all of the probability distributions that are empirically consistent with the training data, and chooses the distribution with the highest entropy. A probability distribution is “empirically consistent” with a set of training data if the estimated frequency with which a class and a feature vector value co-occur is equal to the actual frequency in the data.
Terminology: ‘feature’¶
The term feature is usually used to refer to some property of an unlabeled token. For example, when performing word sense disambiguation, we might define a 'prevword' feature whose value is the word preceding the target word. However, in the context of maxent modeling, the term feature is typically used to refer to a property of a “labeled” token. In order to prevent confusion, we will introduce two distinct terms to disambiguate these two different concepts:
- An “input-feature” is a property of an unlabeled token.
- A “joint-feature” is a property of a labeled token.
In the rest of the nltk.classify module, the term “features” is used to refer to what we will call “input-features” in this module. In the literature that describes and discusses maximum entropy models, input-features are typically called “contexts”, and joint-features are simply referred to as “features”.
Converting Input-Features to Joint-Features¶
In maximum entropy models, joint-features are required to have numeric values. Typically, each input-feature input_feat is mapped to a set of joint-features of the form:

    joint_feat(token, label) = { 1 if input_feat(token) == feat_val
                               {      and label == some_label
                               {
                               { 0 otherwise

for all values of feat_val and some_label. This mapping is performed by classes that implement the MaxentFeatureEncodingI interface.
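For illustration, here is a minimal sketch in plain Python of this expansion (the 'prevword' feature name and the labels are hypothetical, echoing the word sense disambiguation example above):

    # Build one binary joint-feature per (feat_val, some_label) pair,
    # closing over the input-feature value and the label to check for.
    def make_joint_feature(feat_val, some_label):
        def joint_feat(token, label):
            # 1 if the input-feature matches feat_val AND the label matches
            return 1 if token.get('prevword') == feat_val and label == some_label else 0
        return joint_feat

    f = make_joint_feature('the', 'NOUN')
    print(f({'prevword': 'the'}, 'NOUN'))  # 1
    print(f({'prevword': 'the'}, 'VERB'))  # 0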
- class nltk.classify.maxent.BinaryMaxentFeatureEncoding[source]¶
Bases:
MaxentFeatureEncodingI
A feature encoding that generates vectors containing binary joint-features of the form:

    joint_feat(fs, l) = { 1 if (fs[fname] == fval) and (l == label)
                        {
                        { 0 otherwise

Where fname is the name of an input-feature, fval is a value for that input-feature, and label is a label.

Typically, these features are constructed based on a training corpus, using the train() method. This method will create one feature for each combination of fname, fval, and label that occurs at least once in the training corpus.

The unseen_features parameter can be used to add “unseen-value features”, which are used whenever an input feature has a value that was not encountered in the training corpus. These features have the form:

    joint_feat(fs, l) = { 1 if is_unseen(fname, fs[fname])
                        {      and l == label
                        {
                        { 0 otherwise

Where is_unseen(fname, fval) is true if the encoding does not contain any joint features that are true when fs[fname]==fval.

The alwayson_features parameter can be used to add “always-on features”, which have the form:

    joint_feat(fs, l) = { 1 if (l == label)
                        {
                        { 0 otherwise

These always-on features allow the maxent model to directly model the prior probabilities of each label.
- __init__(labels, mapping, unseen_features=False, alwayson_features=False)[source]¶
- Parameters
labels – A list of the “known labels” for this encoding.
mapping – A dictionary mapping from (fname,fval,label) tuples to corresponding joint-feature indexes. These indexes must be the set of integers from 0…len(mapping). If mapping[fname,fval,label] = id, then self.encode({..., fname:fval, ...}, label)[id] is 1; otherwise, it is 0.
unseen_features – If true, then include unseen-value features in the generated joint-feature vectors.
alwayson_features – If true, then include always-on features in the generated joint-feature vectors.
- describe(f_id)[source]¶
- Returns
A string describing the value of the joint-feature whose index in the generated feature vectors is f_id.
- Return type
str
- encode(featureset, label)[source]¶
Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.
- Return type
list(tuple(int, int))
- labels()[source]¶
- Returns
A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.
- Return type
list
- length()[source]¶
- Returns
The size of the fixed-length joint-feature vectors that are generated by this encoding.
- Return type
int
- classmethod train(train_toks, count_cutoff=0, labels=None, **options)[source]¶
Construct and return a new feature encoding, based on a given training corpus train_toks. See the class description for BinaryMaxentFeatureEncoding for a description of the joint-features that will be included in this encoding.
- Parameters
train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature takes the value 1 fewer than count_cutoff times in the training corpus, then that joint-feature is not included in the generated encoding.
labels (list) – A list of labels that should be used by the classifier. If not specified, then the set of labels attested in train_toks will be used.
options – Extra parameters for the constructor, such as unseen_features and alwayson_features.
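For example, a minimal usage sketch (the feature names, values, and labels are toy data):

    from nltk.classify.maxent import BinaryMaxentFeatureEncoding

    train_toks = [({'color': 'red'}, 'pos'),
                  ({'color': 'blue'}, 'neg'),
                  ({'color': 'red'}, 'pos')]

    encoding = BinaryMaxentFeatureEncoding.train(train_toks, alwayson_features=True)
    print(encoding.length())                         # number of joint-features
    print(encoding.encode({'color': 'red'}, 'pos'))  # sparse (index, value) pairs
    print(encoding.describe(0))                      # human-readable description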
- nltk.classify.maxent.ConditionalExponentialClassifier¶
Alias for MaxentClassifier.
- class nltk.classify.maxent.FunctionBackedMaxentFeatureEncoding[source]¶
Bases:
MaxentFeatureEncodingI
A feature encoding that calls a user-supplied function to map a given featureset/label pair to a sparse joint-feature vector.
- __init__(func, length, labels)[source]¶
Construct a new feature encoding based on the given function.
- Parameters
func (callable) – A function that takes two arguments, a featureset and a label, and returns the sparse joint-feature vector that encodes them:

    func(featureset, label) -> feature_vector

This sparse joint-feature vector (feature_vector) is a list of (index, value) tuples.
length (int) – The size of the fixed-length joint-feature vectors that are generated by this encoding.
labels (list) – A list of the “known labels” for this encoding – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.
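For example, a minimal sketch of a user-supplied encoding function (the labels and the encoding scheme here are hypothetical):

    from nltk.classify.maxent import FunctionBackedMaxentFeatureEncoding

    LABELS = ['pos', 'neg']

    def encode_fn(featureset, label):
        # toy scheme: a single always-on joint-feature per label
        # (index 0 for 'pos', index 1 for 'neg')
        return [(LABELS.index(label), 1)]

    encoding = FunctionBackedMaxentFeatureEncoding(encode_fn, len(LABELS), LABELS)
    print(encoding.encode({'anything': True}, 'neg'))  # [(1, 1)]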
- describe(fid)[source]¶
- Returns
A string describing the value of the joint-feature whose index in the generated feature vectors is fid.
- Return type
str
- encode(featureset, label)[source]¶
Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.
- Return type
list(tuple(int, int))
- class nltk.classify.maxent.GISEncoding[source]¶
Bases:
BinaryMaxentFeatureEncoding
A binary feature encoding which adds one new joint-feature to the joint-features defined by BinaryMaxentFeatureEncoding: a correction feature, whose value is chosen to ensure that the sparse vector always sums to a constant non-negative number. This new feature is used to ensure two preconditions for the GIS training algorithm:
- At least one feature vector index must be nonzero for every token.
- The feature vector must sum to a constant non-negative number for every token.
- property C¶
The non-negative constant that all encoded feature vectors will sum to.
- __init__(labels, mapping, unseen_features=False, alwayson_features=False, C=None)[source]¶
- Parameters
C – The correction constant. The value of the correction feature is based on this value. In particular, its value is C - sum([v for (f,v) in encoding]).
- Seealso
BinaryMaxentFeatureEncoding.__init__
- describe(f_id)[source]¶
- Returns
A string describing the value of the joint-feature whose index in the generated feature vectors is f_id.
- Return type
str
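For example, a short sketch (with toy data) showing that every encoded vector sums to the constant C:

    from nltk.classify.maxent import GISEncoding

    train_toks = [({'a': 1, 'b': 2}, 'x'),
                  ({'a': 1}, 'y')]
    enc = GISEncoding.train(train_toks)

    vec = enc.encode({'a': 1, 'b': 2}, 'x')
    # the correction feature pads every vector so that it sums to C
    print(enc.C, sum(v for (i, v) in vec))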
- class nltk.classify.maxent.MaxentClassifier[source]¶
Bases:
ClassifierI
A maximum entropy classifier (also known as a “conditional exponential classifier”). This classifier is parameterized by a set of “weights”, which are used to combine the joint-features that are generated from a featureset by an “encoding”. In particular, the encoding maps each (featureset, label) pair to a vector. The probability of each label is then computed using the following equation:

                         dotprod(weights, encode(fs,label))
    prob(label|fs) = ---------------------------------------------------
                     sum(dotprod(weights, encode(fs,l)) for l in labels)

Where dotprod is the dot product:

    dotprod(a,b) = sum(x*y for (x,y) in zip(a,b))
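As a literal, non-vectorized sketch of the equation above (assuming the encoding follows MaxentFeatureEncodingI and that non-logarithmic weights are used; with logarithmic weights each dot product would be exponentiated first):

    def dotprod(a, b):
        # dot product of two dense vectors
        return sum(x * y for (x, y) in zip(a, b))

    def label_prob(weights, encoding, fs, label, labels):
        def dense(l):
            # expand the sparse (index, value) encoding into a dense vector
            vec = [0] * encoding.length()
            for (i, v) in encoding.encode(fs, l):
                vec[i] = v
            return vec
        return (dotprod(weights, dense(label)) /
                sum(dotprod(weights, dense(l)) for l in labels))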
- ALGORITHMS = ['GIS', 'IIS', 'MEGAM', 'TADM']¶
A list of the algorithm names that are accepted for the train() method’s algorithm parameter.
- __init__(encoding, weights, logarithmic=True)[source]¶
Construct a new maxent classifier model. Typically, new classifier models are created using the train() method.
- Parameters
encoding (MaxentFeatureEncodingI) – An encoding that is used to convert the featuresets that are given to the classify method into joint-feature vectors, which are used by the maxent classifier model.
weights (list of float) – The feature weight vector for this classifier.
logarithmic (bool) – If false, then use non-logarithmic weights.
- classify(featureset)[source]¶
- Returns
the most appropriate label for the given featureset.
- Return type
label
- explain(featureset, columns=4)[source]¶
Print a table showing the effect of each of the features in the given feature set, and how they combine to determine the probabilities of each label for that featureset.
- labels()[source]¶
- Returns
the list of category labels used by this classifier.
- Return type
list of (immutable)
- most_informative_features(n=10)[source]¶
Generates the ranked list of informative features from most to least.
- prob_classify(featureset)[source]¶
- Returns
a probability distribution over labels for the given featureset.
- Return type
ProbDistI
- set_weights(new_weights)[source]¶
Set the feature weight vector for this classifier.
- Parameters
new_weights (list of float) – The new feature weight vector.
- show_most_informative_features(n=10, show='all')[source]¶
- Parameters
show (str) – 'all', 'neg', or 'pos' (to show all features, only negative-weight features, or only positive-weight features)
n (int) – The number of top features to show
- classmethod train(train_toks, algorithm=None, trace=3, encoding=None, labels=None, gaussian_prior_sigma=0, **cutoffs)[source]¶
Train a new maxent classifier based on the given corpus of training samples. This classifier will have its weights chosen to maximize entropy while remaining empirically consistent with the training corpus.
- Return type
MaxentClassifier
- Returns
The new maxent classifier
- Parameters
train_toks (list) – Training data, represented as a list of pairs, the first member of which is a featureset, and the second of which is a classification label.
algorithm (str) – A case-insensitive string, specifying which algorithm should be used to train the classifier. The following algorithms are currently available:
- Iterative Scaling Methods: Generalized Iterative Scaling ('GIS'), Improved Iterative Scaling ('IIS')
- External Libraries (requiring megam): LM-BFGS algorithm, with training performed by Megam ('megam')
The default algorithm is 'IIS'.
trace (int) – The level of diagnostic tracing output to produce. Higher values produce more verbose output.
encoding (MaxentFeatureEncodingI) – A feature encoding, used to convert featuresets into feature vectors. If none is specified, then a BinaryMaxentFeatureEncoding will be built based on the features that are attested in the training corpus.
labels (list(str)) – The set of possible labels. If none is given, then the set of all labels attested in the training data will be used instead.
gaussian_prior_sigma – The sigma value for a gaussian prior on model weights. Currently, this is supported by megam. For other algorithms, its value is ignored.
cutoffs – Arguments specifying various conditions under which the training should be halted. (Some of the cutoff conditions are not supported by some algorithms.)
- max_iter=v: Terminate after v iterations.
- min_ll=v: Terminate after the negative average log-likelihood drops under v.
- min_lldelta=v: Terminate if a single iteration improves log likelihood by less than v.
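For example, a minimal end-to-end usage sketch (the featuresets and labels are toy data):

    from nltk.classify.maxent import MaxentClassifier

    train_toks = [({'prevword': 'the', 'suffix': 'ing'}, 'VERB'),
                  ({'prevword': 'a', 'suffix': 'ing'}, 'VERB'),
                  ({'prevword': 'the', 'suffix': 'er'}, 'NOUN')]

    classifier = MaxentClassifier.train(train_toks, algorithm='IIS',
                                        trace=0, max_iter=10)
    print(classifier.classify({'prevword': 'the', 'suffix': 'ing'}))
    pdist = classifier.prob_classify({'prevword': 'the', 'suffix': 'er'})
    print(pdist.prob('NOUN'))
    classifier.show_most_informative_features(n=5)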
- class nltk.classify.maxent.MaxentFeatureEncodingI[source]¶
Bases:
object
A mapping that converts a set of input-feature values to a vector of joint-feature values, given a label. This conversion is necessary to translate featuresets into a format that can be used by maximum entropy models.
The set of joint-features used by a given encoding is fixed, and each index in the generated joint-feature vectors corresponds to a single joint-feature. The length of the generated joint-feature vectors is therefore constant (for a given encoding).
Because the joint-feature vectors generated by MaxentFeatureEncodingI are typically very sparse, they are represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Feature encodings are generally created using the train() method, which generates an appropriate encoding based on the input-feature values and labels that are present in a given corpus.
- describe(fid)[source]¶
- Returns
A string describing the value of the joint-feature whose index in the generated feature vectors is fid.
- Return type
str
- encode(featureset, label)[source]¶
Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.
- Return type
list(tuple(int, int))
- labels()[source]¶
- Returns
A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.
- Return type
list
- length()[source]¶
- Returns
The size of the fixed-length joint-feature vectors that are generated by this encoding.
- Return type
int
- train(train_toks)[source]¶
Construct and return a new feature encoding, based on a given training corpus train_toks.
- Parameters
train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
- class nltk.classify.maxent.TadmEventMaxentFeatureEncoding[source]¶
Bases:
BinaryMaxentFeatureEncoding
- __init__(labels, mapping, unseen_features=False, alwayson_features=False)[source]¶
- Parameters
labels – A list of the “known labels” for this encoding.
mapping – A dictionary mapping from (fname,fval,label) tuples to corresponding joint-feature indexes. These indexes must be the set of integers from 0…len(mapping). If mapping[fname,fval,label] = id, then self.encode({..., fname:fval, ...}, label)[id] is 1; otherwise, it is 0.
unseen_features – If true, then include unseen-value features in the generated joint-feature vectors.
alwayson_features – If true, then include always-on features in the generated joint-feature vectors.
- describe(fid)[source]¶
- Returns
A string describing the value of the joint-feature whose index in the generated feature vectors is fid.
- Return type
str
- encode(featureset, label)[source]¶
Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.
- Return type
list(tuple(int, int))
- labels()[source]¶
- Returns
A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.
- Return type
list
- length()[source]¶
- Returns
The size of the fixed-length joint-feature vectors that are generated by this encoding.
- Return type
int
- classmethod train(train_toks, count_cutoff=0, labels=None, **options)[source]¶
Construct and return a new feature encoding, based on a given training corpus train_toks. See the class description for BinaryMaxentFeatureEncoding for a description of the joint-features that will be included in this encoding.
- Parameters
train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature takes the value 1 fewer than count_cutoff times in the training corpus, then that joint-feature is not included in the generated encoding.
labels (list) – A list of labels that should be used by the classifier. If not specified, then the set of labels attested in train_toks will be used.
options – Extra parameters for the constructor, such as unseen_features and alwayson_features.
- class nltk.classify.maxent.TadmMaxentClassifier[source]¶
Bases:
MaxentClassifier
- classmethod train(train_toks, **kwargs)[source]¶
Train a new maxent classifier based on the given corpus of training samples. This classifier will have its weights chosen to maximize entropy while remaining empirically consistent with the training corpus.
- Return type
MaxentClassifier
- Returns
The new maxent classifier
- Parameters
train_toks (list) – Training data, represented as a list of pairs, the first member of which is a featureset, and the second of which is a classification label.
algorithm (str) – A case-insensitive string, specifying which algorithm should be used to train the classifier. The following algorithms are currently available:
- Iterative Scaling Methods: Generalized Iterative Scaling ('GIS'), Improved Iterative Scaling ('IIS')
- External Libraries (requiring megam): LM-BFGS algorithm, with training performed by Megam ('megam')
The default algorithm is 'IIS'.
trace (int) – The level of diagnostic tracing output to produce. Higher values produce more verbose output.
encoding (MaxentFeatureEncodingI) – A feature encoding, used to convert featuresets into feature vectors. If none is specified, then a BinaryMaxentFeatureEncoding will be built based on the features that are attested in the training corpus.
labels (list(str)) – The set of possible labels. If none is given, then the set of all labels attested in the training data will be used instead.
gaussian_prior_sigma – The sigma value for a gaussian prior on model weights. Currently, this is supported by megam. For other algorithms, its value is ignored.
cutoffs – Arguments specifying various conditions under which the training should be halted. (Some of the cutoff conditions are not supported by some algorithms.)
- max_iter=v: Terminate after v iterations.
- min_ll=v: Terminate after the negative average log-likelihood drops under v.
- min_lldelta=v: Terminate if a single iteration improves log likelihood by less than v.
- class nltk.classify.maxent.TypedMaxentFeatureEncoding[source]¶
Bases:
MaxentFeatureEncodingI
A feature encoding that generates vectors containing integer, float and binary joint-features of the form:

Binary (for string and boolean features):

    joint_feat(fs, l) = { 1 if (fs[fname] == fval) and (l == label)
                        {
                        { 0 otherwise

Value (for integer and float features):

    joint_feat(fs, l) = { fval if (fs[fname] == type(fval))
                        {         and (l == label)
                        {
                        { not encoded otherwise

Where fname is the name of an input-feature, fval is a value for that input-feature, and label is a label.

Typically, these features are constructed based on a training corpus, using the train() method.

For string and boolean features [type(fval) not in (int, float)] this method will create one feature for each combination of fname, fval, and label that occurs at least once in the training corpus.

For integer and float features [type(fval) in (int, float)] this method will create one feature for each combination of fname and label that occurs at least once in the training corpus.

For binary features the unseen_features parameter can be used to add “unseen-value features”, which are used whenever an input feature has a value that was not encountered in the training corpus. These features have the form:

    joint_feat(fs, l) = { 1 if is_unseen(fname, fs[fname])
                        {      and l == label
                        {
                        { 0 otherwise

Where is_unseen(fname, fval) is true if the encoding does not contain any joint features that are true when fs[fname]==fval.

The alwayson_features parameter can be used to add “always-on features”, which have the form:

    joint_feat(fs, l) = { 1 if (l == label)
                        {
                        { 0 otherwise

These always-on features allow the maxent model to directly model the prior probabilities of each label.
- __init__(labels, mapping, unseen_features=False, alwayson_features=False)[source]¶
- Parameters
labels – A list of the “known labels” for this encoding.
mapping – A dictionary mapping from (fname,fval,label) tuples to corresponding joint-feature indexes. These indexes must be the set of integers from 0…len(mapping). If mapping[fname,fval,label] = id, then self.encode({..., fname:fval, ...}, label)[id] is 1; otherwise, it is 0.
unseen_features – If true, then include unseen-value features in the generated joint-feature vectors.
alwayson_features – If true, then include always-on features in the generated joint-feature vectors.
- describe(f_id)[source]¶
- Returns
A string describing the value of the joint-feature whose index in the generated feature vectors is f_id.
- Return type
str
- encode(featureset, label)[source]¶
Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.
- Return type
list(tuple(int, int))
- labels()[source]¶
- Returns
A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.
- Return type
list
- length()[source]¶
- Returns
The size of the fixed-length joint-feature vectors that are generated by this encoding.
- Return type
int
- classmethod train(train_toks, count_cutoff=0, labels=None, **options)[source]¶
Construct and return a new feature encoding, based on a given training corpus train_toks. See the class description for TypedMaxentFeatureEncoding for a description of the joint-features that will be included in this encoding.

Note: the recognized feature value types are (int, float); other types are interpreted as regular binary features.
- Parameters
train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.
count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature takes the value 1 fewer than count_cutoff times in the training corpus, then that joint-feature is not included in the generated encoding.
labels (list) – A list of labels that should be used by the classifier. If not specified, then the set of labels attested in train_toks will be used.
options – Extra parameters for the constructor, such as unseen_features and alwayson_features.
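For example, a minimal usage sketch with mixed feature types (toy data):

    from nltk.classify.maxent import TypedMaxentFeatureEncoding

    # 'length' is numeric, so it is encoded by value;
    # 'shape' is a string, so it becomes a binary joint-feature
    train_toks = [({'length': 3.0, 'shape': 'round'}, 'fruit'),
                  ({'length': 10.0, 'shape': 'long'}, 'vegetable')]

    enc = TypedMaxentFeatureEncoding.train(train_toks)
    print(enc.encode({'length': 3.0, 'shape': 'round'}, 'fruit'))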
- nltk.classify.maxent.calculate_deltas(train_toks, classifier, unattested, ffreq_empirical, nfmap, nfarray, nftranspose, encoding)[source]¶
Calculate the update values for the classifier weights for this iteration of IIS. These update weights are the value of delta that solves the equation:

    ffreq_empirical[i] = SUM[fs,l] (classifier.prob_classify(fs).prob(l) *
                                    feature_vector(fs,l)[i] *
                                    exp(delta[i] * nf(feature_vector(fs,l))))

Where:
- (fs,l) is a (featureset, label) tuple from train_toks
- feature_vector(fs,l) = encoding.encode(fs,l)
- nf(vector) = sum([val for (id,val) in vector])

This method uses Newton’s method to solve this equation for delta[i]. In particular, it starts with a guess of delta[i] = 1 and iteratively updates delta with:

    delta[i] -= (ffreq_empirical[i] - sum1[i]) / (-sum2[i])

until convergence, where sum1 and sum2 are defined as:

    sum1[i](delta) = SUM[fs,l] f[i](fs,l,delta)
    sum2[i](delta) = SUM[fs,l] (f[i](fs,l,delta) * nf(feature_vector(fs,l)))
    f[i](fs,l,delta) = (classifier.prob_classify(fs).prob(l) *
                        feature_vector(fs,l)[i] *
                        exp(delta[i] * nf(feature_vector(fs,l))))

Note that sum1 and sum2 depend on delta, so they need to be re-computed each iteration. A toy sketch of this Newton iteration is given after the parameter list below.

The variables nfmap, nfarray, and nftranspose are used to generate a dense encoding for nf(feature_vector). This allows calculate_deltas to compute sum1 and sum2 using matrices, which yields a significant performance improvement.
- Parameters
train_toks (list(tuple(dict, str))) – The set of training tokens.
classifier (ClassifierI) – The current classifier.
ffreq_empirical (sequence of float) – An array containing the empirical frequency for each feature. The ith element of this array is the empirical frequency for feature i.
unattested (sequence of int) – An array that is 1 for features that are not attested in the training data, and 0 for features that are attested. In other words, unattested[i]==1 iff ffreq_empirical[i]==0.
nfmap (dict(int -> int)) – A map that can be used to compress nf to a dense vector.
nfarray (array(float)) – An array that can be used to uncompress nf from a dense vector.
nftranspose (array(float)) – The transpose of nfarray.
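The Newton update above has the generic form delta -= g(delta)/g'(delta), with g(delta) = ffreq_empirical[i] - sum1[i](delta) and derivative -sum2[i](delta). Here is a minimal sketch of that iteration on a toy scalar equation (this is not the library’s vectorized implementation):

    import math

    def newton_solve(g, g_prime, delta=1.0, tol=1e-10, max_iter=300):
        # iterate delta -= g(delta)/g'(delta) until the step size converges
        for _ in range(max_iter):
            step = g(delta) / g_prime(delta)
            delta -= step
            if abs(step) < tol:
                break
        return delta

    # toy example: solve exp(delta) - 2 == 0, i.e. delta = ln(2)
    print(newton_solve(lambda d: math.exp(d) - 2.0,
                       lambda d: math.exp(d)))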
- nltk.classify.maxent.calculate_nfmap(train_toks, encoding)[source]¶
Construct a map that can be used to compress nf (which is typically sparse).

nf(feature_vector) is the sum of the feature values for feature_vector. This represents the number of features that are active for a given labeled text. This method finds all values of nf(t) that are attested for at least one token in the given list of training tokens, and constructs a dictionary mapping these attested values to a continuous range 0…N. For example, if the only values of nf() that were attested were 3, 5, and 7, then calculate_nfmap might return the dictionary {3:0, 5:1, 7:2}.
- Returns
A map that can be used to compress nf to a dense vector.
- Return type
dict(int -> int)
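For example (toy data; the exact mapping depends on which nf() values are attested):

    from nltk.classify.maxent import BinaryMaxentFeatureEncoding, calculate_nfmap

    train_toks = [({'a': 1}, 'x'),
                  ({'a': 1, 'b': 2}, 'x')]
    encoding = BinaryMaxentFeatureEncoding.train(train_toks)
    # maps each attested value of nf() to a dense index, e.g. {1: 0, 2: 1}
    print(calculate_nfmap(train_toks, encoding))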
- nltk.classify.maxent.train_maxent_classifier_with_gis(train_toks, trace=3, encoding=None, labels=None, **cutoffs)[source]¶
Train a new ConditionalExponentialClassifier, using the given training samples, using the Generalized Iterative Scaling algorithm. This ConditionalExponentialClassifier will encode the model that maximizes entropy from all the models that are empirically consistent with train_toks.
- See
train_maxent_classifier() for parameter descriptions.
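For example (toy data):

    from nltk.classify.maxent import train_maxent_classifier_with_gis

    train_toks = [({'f': 'a'}, 'x'),
                  ({'f': 'b'}, 'y')]
    classifier = train_maxent_classifier_with_gis(train_toks, trace=0, max_iter=5)
    print(classifier.classify({'f': 'a'}))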
- nltk.classify.maxent.train_maxent_classifier_with_iis(train_toks, trace=3, encoding=None, labels=None, **cutoffs)[source]¶
Train a new ConditionalExponentialClassifier, using the given training samples, using the Improved Iterative Scaling algorithm. This ConditionalExponentialClassifier will encode the model that maximizes entropy from all the models that are empirically consistent with train_toks.
- See
train_maxent_classifier() for parameter descriptions.
- nltk.classify.maxent.train_maxent_classifier_with_megam(train_toks, trace=3, encoding=None, labels=None, gaussian_prior_sigma=0, **kwargs)[source]¶
Train a new ConditionalExponentialClassifier, using the given training samples, using the external megam library. This ConditionalExponentialClassifier will encode the model that maximizes entropy from all the models that are empirically consistent with train_toks.
- See
train_maxent_classifier() for parameter descriptions.
- See
nltk.classify.megam