nltk.classify.maxent module

A classifier model based on the maximum entropy modeling framework. This framework considers all of the probability distributions that are empirically consistent with the training data, and chooses the distribution with the highest entropy. A probability distribution is “empirically consistent” with a set of training data if the estimated frequency with which a class and a feature vector value co-occur is equal to the actual frequency in the data.

Terminology: ‘feature’

The term feature is usually used to refer to some property of an unlabeled token. For example, when performing word sense disambiguation, we might define a 'prevword' feature whose value is the word preceding the target word. However, in the context of maxent modeling, the term feature is typically used to refer to a property of a “labeled” token. In order to prevent confusion, we will introduce two distinct terms to disambiguate these two different concepts:

  • An “input-feature” is a property of an unlabeled token.

  • A “joint-feature” is a property of a labeled token.

In the rest of the nltk.classify module, the term “features” is used to refer to what we will call “input-features” in this module.

In literature that describes and discusses maximum entropy models, input-features are typically called “contexts”, and joint-features are simply referred to as “features”.

Converting Input-Features to Joint-Features

In maximum entropy models, joint-features are required to have numeric values. Typically, each input-feature input_feat is mapped to a set of joint-features of the form:

joint_feat(token, label) = { 1 if input_feat(token) == feat_val
                           {      and label == some_label
                           {
                           { 0 otherwise

For all values of feat_val and some_label. This mapping is performed by classes that implement the MaxentFeatureEncodingI interface.
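This mapping can be sketched in plain Python (an illustration of the idea, not NLTK's implementation; `make_joint_feature` and the `'prevword'`/`'NOUN'` names are invented for the example):

```python
# Sketch: build one binary joint-feature for a given (fname, feat_val,
# some_label) combination, following the definition above.
def make_joint_feature(fname, feat_val, some_label):
    """Return joint_feat(token, label) for one (fname, feat_val, some_label)."""
    def joint_feat(token, label):
        # token is a featureset: a dict mapping input-feature names to values
        if token.get(fname) == feat_val and label == some_label:
            return 1
        return 0
    return joint_feat

jf = make_joint_feature('prevword', 'the', 'NOUN')
print(jf({'prevword': 'the'}, 'NOUN'))  # -> 1
print(jf({'prevword': 'a'}, 'NOUN'))    # -> 0
```

In a real encoding, one such joint-feature would exist for every attested combination of feat_val and some_label.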

class nltk.classify.maxent.BinaryMaxentFeatureEncoding[source]

Bases: MaxentFeatureEncodingI

A feature encoding that generates vectors containing binary joint-features of the form:

joint_feat(fs, l) = { 1 if (fs[fname] == fval) and (l == label)
                    {
                    { 0 otherwise

Where fname is the name of an input-feature, fval is a value for that input-feature, and label is a label.

Typically, these features are constructed based on a training corpus, using the train() method. This method will create one feature for each combination of fname, fval, and label that occurs at least once in the training corpus.

The unseen_features parameter can be used to add “unseen-value features”, which are used whenever an input feature has a value that was not encountered in the training corpus. These features have the form:

joint_feat(fs, l) = { 1 if is_unseen(fname, fs[fname])
                    {      and l == label
                    {
                    { 0 otherwise

Where is_unseen(fname, fval) is true if the encoding does not contain any joint features that are true when fs[fname]==fval.

The alwayson_features parameter can be used to add “always-on features”, which have the form:

joint_feat(fs, l) = { 1 if (l == label)
                    {
                    { 0 otherwise

These always-on features allow the maxent model to directly model the prior probabilities of each label.

__init__(labels, mapping, unseen_features=False, alwayson_features=False)[source]
Parameters
  • labels – A list of the “known labels” for this encoding.

  • mapping – A dictionary mapping from (fname,fval,label) tuples to corresponding joint-feature indexes. These indexes must be the set of integers from 0…len(mapping). If mapping[fname,fval,label]=id, then self.encode({..., fname: fval, ...}, label)[id] is 1; otherwise, it is 0.

  • unseen_features – If true, then include unseen value features in the generated joint-feature vectors.

  • alwayson_features – If true, then include always-on features in the generated joint-feature vectors.

describe(f_id)[source]
Returns

A string describing the value of the joint-feature whose index in the generated feature vectors is f_id.

Return type

str

encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type

list(tuple(int, int))

labels()[source]
Returns

A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.

Return type

list

length()[source]
Returns

The size of the fixed-length joint-feature vectors that are generated by this encoding.

Return type

int

classmethod train(train_toks, count_cutoff=0, labels=None, **options)[source]

Construct and return a new feature encoding, based on a given training corpus train_toks. See the class description BinaryMaxentFeatureEncoding for a description of the joint-features that will be included in this encoding.

Parameters
  • train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.

  • count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature occurs fewer than count_cutoff times in the training corpus, then that joint-feature is not included in the generated encoding.

  • labels (list) – A list of labels that should be used by the classifier. If not specified, then the set of labels attested in train_toks will be used.

  • options – Extra parameters for the constructor, such as unseen_features and alwayson_features.
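What train() does can be illustrated with a short sketch (not NLTK's actual code): count every (fname, fval, label) combination in a toy corpus and assign a joint-feature index to each one that clears the cutoff:

```python
from collections import defaultdict

# Sketch of train()'s core idea: one joint-feature per (fname, fval, label)
# combination that occurs at least count_cutoff times in the corpus.
def build_mapping(train_toks, count_cutoff=0):
    counts = defaultdict(int)
    for featureset, label in train_toks:
        for fname, fval in featureset.items():
            counts[fname, fval, label] += 1
    mapping = {}
    for key, n in counts.items():
        if n >= count_cutoff:
            mapping[key] = len(mapping)   # assign the next free index
    return mapping

train_toks = [({'prevword': 'the'}, 'NOUN'),
              ({'prevword': 'the'}, 'NOUN'),
              ({'prevword': 'ran'}, 'ADV')]
# With a cutoff of 2, only ('prevword', 'the', 'NOUN') survives.
mapping = build_mapping(train_toks, count_cutoff=2)
```

The resulting dictionary has exactly the shape that the constructor's mapping parameter expects.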

nltk.classify.maxent.ConditionalExponentialClassifier

Alias for MaxentClassifier.

class nltk.classify.maxent.FunctionBackedMaxentFeatureEncoding[source]

Bases: MaxentFeatureEncodingI

A feature encoding that calls a user-supplied function to map a given featureset/label pair to a sparse joint-feature vector.

__init__(func, length, labels)[source]

Construct a new feature encoding based on the given function.

Parameters
  • func (callable) –

    A function that takes two arguments, a featureset and a label, and returns the sparse joint feature vector that encodes them:

    func(featureset, label) -> feature_vector
    

    This sparse joint feature vector (feature_vector) is a list of (index,value) tuples.

  • length (int) – The size of the fixed-length joint-feature vectors that are generated by this encoding.

  • labels (list) – A list of the “known labels” for this encoding – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.
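A minimal example of the kind of func this encoding expects (the feature names and labels here are invented for illustration; the commented-out wrapper shows how it would be handed to the class):

```python
# A user-supplied function mapping (featureset, label) to a sparse
# joint-feature vector of (index, value) tuples.
LABELS = ['pos', 'neg']

def func(featureset, label):
    vec = []
    if featureset.get('contains(great)') and label == 'pos':
        vec.append((0, 1))
    if featureset.get('contains(awful)') and label == 'neg':
        vec.append((1, 1))
    return vec

# With NLTK available, this could be wrapped as:
# encoding = FunctionBackedMaxentFeatureEncoding(func, length=2, labels=LABELS)
print(func({'contains(great)': True}, 'pos'))  # -> [(0, 1)]
```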

describe(fid)[source]
Returns

A string describing the value of the joint-feature whose index in the generated feature vectors is fid.

Return type

str

encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type

list(tuple(int, int))

labels()[source]
Returns

A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.

Return type

list

length()[source]
Returns

The size of the fixed-length joint-feature vectors that are generated by this encoding.

Return type

int

class nltk.classify.maxent.GISEncoding[source]

Bases: BinaryMaxentFeatureEncoding

A binary feature encoding which adds one new joint-feature to the joint-features defined by BinaryMaxentFeatureEncoding: a correction feature, whose value is chosen to ensure that the sparse vector always sums to a constant non-negative number. This new feature is used to ensure two preconditions for the GIS training algorithm:

  • At least one feature vector index must be nonzero for every token.

  • The feature vector must sum to a constant non-negative number for every token.
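The second precondition can be illustrated with a toy helper (a sketch of the idea, not NLTK's internals): given a sparse vector and the constant C, the correction feature's value is C minus the current sum:

```python
# Sketch: append a GIS correction feature so every encoded vector
# sums to the same constant C. The correction_index is assumed to be
# the last slot in the (now length+1) feature vector.
def add_correction(encoding_vec, C, correction_index):
    total = sum(v for (f, v) in encoding_vec)
    return encoding_vec + [(correction_index, C - total)]

vec = [(0, 1), (3, 1)]                       # two active binary features
padded = add_correction(vec, C=5, correction_index=7)
assert sum(v for (_, v) in padded) == 5      # constant sum restored
```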

property C

The non-negative constant that all encoded feature vectors will sum to.

__init__(labels, mapping, unseen_features=False, alwayson_features=False, C=None)[source]
Parameters

C – The correction constant. The value of the correction feature is based on this value. In particular, its value is C - sum([v for (f,v) in encoding]).

Seealso

BinaryMaxentFeatureEncoding.__init__

describe(f_id)[source]
Returns

A string describing the value of the joint-feature whose index in the generated feature vectors is f_id.

Return type

str

encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type

list(tuple(int, int))

length()[source]
Returns

The size of the fixed-length joint-feature vectors that are generated by this encoding.

Return type

int

class nltk.classify.maxent.MaxentClassifier[source]

Bases: ClassifierI

A maximum entropy classifier (also known as a “conditional exponential classifier”). This classifier is parameterized by a set of “weights”, which are used to combine the joint-features that are generated from a featureset by an “encoding”. In particular, the encoding maps each (featureset, label) pair to a vector. The probability of each label is then computed using the following equation:

                          dotprod(weights, encode(fs,label))
prob(label|fs) = ---------------------------------------------------
                 sum(dotprod(weights, encode(fs,l)) for l in labels)

Where dotprod is the dot product:

dotprod(a,b) = sum(x*y for (x,y) in zip(a,b))
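This equation can be evaluated directly with toy, non-logarithmic weights and a made-up encode() (every name below is illustrative, not part of NLTK's API):

```python
# Toy model: 3 joint-features, 2 labels, hand-picked weights.
weights = [2.0, 0.5, 1.0]
labels = ['pos', 'neg']

def encode(fs, label):
    # Returns a sparse (index, value) joint-feature vector.
    if label == 'pos':
        return [(0, 1)] if fs.get('good') else [(2, 1)]
    return [(1, 1)]

def dotprod(weights, sparse_vec):
    # Dot product against a sparse vector of (index, value) tuples.
    return sum(weights[i] * v for (i, v) in sparse_vec)

def prob(fs, label):
    total = sum(dotprod(weights, encode(fs, l)) for l in labels)
    return dotprod(weights, encode(fs, label)) / total

print(prob({'good': True}, 'pos'))  # -> 0.8  (= 2.0 / (2.0 + 0.5))
```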

ALGORITHMS = ['GIS', 'IIS', 'MEGAM', 'TADM']

A list of the algorithm names that are accepted for the train() method’s algorithm parameter.

__init__(encoding, weights, logarithmic=True)[source]

Construct a new maxent classifier model. Typically, new classifier models are created using the train() method.

Parameters
  • encoding (MaxentFeatureEncodingI) – An encoding that is used to convert the featuresets that are given to the classify method into joint-feature vectors, which are used by the maxent classifier model.

  • weights (list of float) – The feature weight vector for this classifier.

  • logarithmic (bool) – If false, then use non-logarithmic weights.

classify(featureset)[source]
Returns

the most appropriate label for the given featureset.

Return type

label

explain(featureset, columns=4)[source]

Print a table showing the effect of each of the features in the given feature set, and how they combine to determine the probabilities of each label for that featureset.

labels()[source]
Returns

the list of category labels used by this classifier.

Return type

list of (immutable)

most_informative_features(n=10)[source]

Generates the ranked list of informative features from most to least.

prob_classify(featureset)[source]
Returns

a probability distribution over labels for the given featureset.

Return type

ProbDistI

set_weights(new_weights)[source]

Set the feature weight vector for this classifier.

Parameters

new_weights (list of float) – The new feature weight vector.

show_most_informative_features(n=10, show='all')[source]
Parameters
  • show (str) – all, neg, or pos (to show all features, only negative-weight features, or only positive-weight features)

  • n (int) – The number of top features to show

classmethod train(train_toks, algorithm=None, trace=3, encoding=None, labels=None, gaussian_prior_sigma=0, **cutoffs)[source]

Train a new maxent classifier based on the given corpus of training samples. This classifier will have its weights chosen to maximize entropy while remaining empirically consistent with the training corpus.

Return type

MaxentClassifier

Returns

The new maxent classifier

Parameters
  • train_toks (list) – Training data, represented as a list of pairs, the first member of which is a featureset, and the second of which is a classification label.

  • algorithm (str) –

    A case-insensitive string, specifying which algorithm should be used to train the classifier. The following algorithms are currently available.

    • Iterative Scaling Methods: Generalized Iterative Scaling ('GIS'), Improved Iterative Scaling ('IIS')

    • External Libraries (requiring megam): LM-BFGS algorithm, with training performed by Megam ('megam')

    The default algorithm is 'IIS'.

  • trace (int) – The level of diagnostic tracing output to produce. Higher values produce more verbose output.

  • encoding (MaxentFeatureEncodingI) – A feature encoding, used to convert featuresets into feature vectors. If none is specified, then a BinaryMaxentFeatureEncoding will be built based on the features that are attested in the training corpus.

  • labels (list(str)) – The set of possible labels. If none is given, then the set of all labels attested in the training data will be used instead.

  • gaussian_prior_sigma – The sigma value for a gaussian prior on model weights. Currently, this is supported by megam. For other algorithms, its value is ignored.

  • cutoffs

    Arguments specifying various conditions under which the training should be halted. (Some of the cutoff conditions are not supported by some algorithms.)

    • max_iter=v: Terminate after v iterations.

    • min_ll=v: Terminate after the negative average log-likelihood drops under v.

    • min_lldelta=v: Terminate if a single iteration improves log likelihood by less than v.
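How the cutoff arguments interact can be sketched with a toy training loop (not NLTK's trainer): it stops on whichever of max_iter or min_lldelta fires first. The step callback and its return value are assumptions made for the illustration:

```python
# Sketch: a training loop honoring max_iter and min_lldelta cutoffs.
# step(i) is assumed to perform iteration i and return the resulting
# log-likelihood; max_iter is assumed to be >= 1.
def train_loop(step, max_iter=10, min_lldelta=1e-4):
    last_ll = float('-inf')
    for i in range(1, max_iter + 1):
        ll = step(i)
        if ll - last_ll < min_lldelta:   # min_lldelta cutoff
            return i, ll
        last_ll = ll
    return max_iter, ll                  # max_iter cutoff

# Simulated log-likelihoods that converge quickly:
lls = iter([-5.0, -4.0, -3.99995])
def step(i):
    return next(lls)

n_iter, ll = train_loop(step, max_iter=10, min_lldelta=1e-3)
print(n_iter)  # -> 3  (third iteration improved by less than 1e-3)
```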

weights()[source]
Returns

The feature weight vector for this classifier.

Return type

list of float

class nltk.classify.maxent.MaxentFeatureEncodingI[source]

Bases: object

A mapping that converts a set of input-feature values to a vector of joint-feature values, given a label. This conversion is necessary to translate featuresets into a format that can be used by maximum entropy models.

The set of joint-features used by a given encoding is fixed, and each index in the generated joint-feature vectors corresponds to a single joint-feature. The length of the generated joint-feature vectors is therefore constant (for a given encoding).

Because the joint-feature vectors generated by MaxentFeatureEncodingI are typically very sparse, they are represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.
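For example, a sparse vector can be expanded to its dense equivalent with a small helper (not part of NLTK; shown only to make the representation concrete):

```python
# Expand a sparse list of (index, value) tuples into a dense vector
# whose length is the encoding's fixed length(); absent indexes are 0.
def densify(sparse_vec, length):
    dense = [0] * length
    for index, value in sparse_vec:
        dense[index] = value
    return dense

print(densify([(1, 1), (4, 2)], length=6))  # -> [0, 1, 0, 0, 2, 0]
```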

Feature encodings are generally created using the train() method, which generates an appropriate encoding based on the input-feature values and labels that are present in a given corpus.

describe(fid)[source]
Returns

A string describing the value of the joint-feature whose index in the generated feature vectors is fid.

Return type

str

encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type

list(tuple(int, int))

labels()[source]
Returns

A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.

Return type

list

length()[source]
Returns

The size of the fixed-length joint-feature vectors that are generated by this encoding.

Return type

int

train(train_toks)[source]

Construct and return a new feature encoding, based on a given training corpus train_toks.

Parameters

train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.

class nltk.classify.maxent.TadmEventMaxentFeatureEncoding[source]

Bases: BinaryMaxentFeatureEncoding

__init__(labels, mapping, unseen_features=False, alwayson_features=False)[source]
Parameters
  • labels – A list of the “known labels” for this encoding.

  • mapping – A dictionary mapping from (fname,fval,label) tuples to corresponding joint-feature indexes. These indexes must be the set of integers from 0…len(mapping). If mapping[fname,fval,label]=id, then self.encode({..., fname: fval, ...}, label)[id] is 1; otherwise, it is 0.

  • unseen_features – If true, then include unseen value features in the generated joint-feature vectors.

  • alwayson_features – If true, then include always-on features in the generated joint-feature vectors.

describe(fid)[source]
Returns

A string describing the value of the joint-feature whose index in the generated feature vectors is fid.

Return type

str

encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type

list(tuple(int, int))

labels()[source]
Returns

A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.

Return type

list

length()[source]
Returns

The size of the fixed-length joint-feature vectors that are generated by this encoding.

Return type

int

classmethod train(train_toks, count_cutoff=0, labels=None, **options)[source]

Construct and return a new feature encoding, based on a given training corpus train_toks. See the class description BinaryMaxentFeatureEncoding for a description of the joint-features that will be included in this encoding.

Parameters
  • train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.

  • count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature occurs fewer than count_cutoff times in the training corpus, then that joint-feature is not included in the generated encoding.

  • labels (list) – A list of labels that should be used by the classifier. If not specified, then the set of labels attested in train_toks will be used.

  • options – Extra parameters for the constructor, such as unseen_features and alwayson_features.

class nltk.classify.maxent.TadmMaxentClassifier[source]

Bases: MaxentClassifier

classmethod train(train_toks, **kwargs)[source]

Train a new maxent classifier based on the given corpus of training samples. This classifier will have its weights chosen to maximize entropy while remaining empirically consistent with the training corpus.

Return type

MaxentClassifier

Returns

The new maxent classifier

Parameters
  • train_toks (list) – Training data, represented as a list of pairs, the first member of which is a featureset, and the second of which is a classification label.

  • algorithm (str) –

    A case-insensitive string, specifying which algorithm should be used to train the classifier. The following algorithms are currently available.

    • Iterative Scaling Methods: Generalized Iterative Scaling ('GIS'), Improved Iterative Scaling ('IIS')

    • External Libraries (requiring megam): LM-BFGS algorithm, with training performed by Megam ('megam')

    The default algorithm is 'IIS'.

  • trace (int) – The level of diagnostic tracing output to produce. Higher values produce more verbose output.

  • encoding (MaxentFeatureEncodingI) – A feature encoding, used to convert featuresets into feature vectors. If none is specified, then a BinaryMaxentFeatureEncoding will be built based on the features that are attested in the training corpus.

  • labels (list(str)) – The set of possible labels. If none is given, then the set of all labels attested in the training data will be used instead.

  • gaussian_prior_sigma – The sigma value for a gaussian prior on model weights. Currently, this is supported by megam. For other algorithms, its value is ignored.

  • cutoffs

    Arguments specifying various conditions under which the training should be halted. (Some of the cutoff conditions are not supported by some algorithms.)

    • max_iter=v: Terminate after v iterations.

    • min_ll=v: Terminate after the negative average log-likelihood drops under v.

    • min_lldelta=v: Terminate if a single iteration improves log likelihood by less than v.

class nltk.classify.maxent.TypedMaxentFeatureEncoding[source]

Bases: MaxentFeatureEncodingI

A feature encoding that generates vectors containing integer, float and binary joint-features of the form:

Binary (for string and boolean features):

joint_feat(fs, l) = { 1 if (fs[fname] == fval) and (l == label)
                    {
                    { 0 otherwise

Value (for integer and float features):

joint_feat(fs, l) = { fval if (type(fs[fname]) == type(fval))
                    {         and (l == label)
                    {
                    { not encoded otherwise

Where fname is the name of an input-feature, fval is a value for that input-feature, and label is a label.

Typically, these features are constructed based on a training corpus, using the train() method.

For string and boolean features [type(fval) not in (int, float)] this method will create one feature for each combination of fname, fval, and label that occurs at least once in the training corpus.

For integer and float features [type(fval) in (int, float)] this method will create one feature for each combination of fname and label that occurs at least once in the training corpus.
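The distinction between value features and binary features can be sketched as follows (an illustration of the rule above, not NLTK's exact code; the mapping keys shown are assumptions about how typed features could be indexed):

```python
# Sketch: numeric input-features contribute their value and are keyed by
# (fname, type, label); all other input-features are binary and keyed by
# (fname, fval, label).
def typed_encode(featureset, mapping, label):
    vec = []
    for fname, fval in featureset.items():
        if type(fval) in (int, float):
            key = (fname, type(fval), label)   # one feature per fname/label
            if key in mapping:
                vec.append((mapping[key], fval))
        else:
            key = (fname, fval, label)
            if key in mapping:
                vec.append((mapping[key], 1))
    return vec

mapping = {('len', int, 'pos'): 0, ('word', 'great', 'pos'): 1}
print(typed_encode({'len': 7, 'word': 'great'}, mapping, 'pos'))
# -> [(0, 7), (1, 1)]
```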

For binary features the unseen_features parameter can be used to add “unseen-value features”, which are used whenever an input feature has a value that was not encountered in the training corpus. These features have the form:

joint_feat(fs, l) = { 1 if is_unseen(fname, fs[fname])
                    {      and l == label
                    {
                    { 0 otherwise

Where is_unseen(fname, fval) is true if the encoding does not contain any joint features that are true when fs[fname]==fval.

The alwayson_features parameter can be used to add “always-on features”, which have the form:

joint_feat(fs, l) = { 1 if (l == label)
                    {
                    { 0 otherwise

These always-on features allow the maxent model to directly model the prior probabilities of each label.

__init__(labels, mapping, unseen_features=False, alwayson_features=False)[source]
Parameters
  • labels – A list of the “known labels” for this encoding.

  • mapping – A dictionary mapping from (fname,fval,label) tuples to corresponding joint-feature indexes. These indexes must be the set of integers from 0…len(mapping). If mapping[fname,fval,label]=id, then self.encode({..., fname: fval, ...}, label)[id] is 1; otherwise, it is 0.

  • unseen_features – If true, then include unseen value features in the generated joint-feature vectors.

  • alwayson_features – If true, then include always-on features in the generated joint-feature vectors.

describe(f_id)[source]
Returns

A string describing the value of the joint-feature whose index in the generated feature vectors is f_id.

Return type

str

encode(featureset, label)[source]

Given a (featureset, label) pair, return the corresponding vector of joint-feature values. This vector is represented as a list of (index, value) tuples, specifying the value of each non-zero joint-feature.

Return type

list(tuple(int, int))

labels()[source]
Returns

A list of the “known labels” – i.e., all labels l such that self.encode(fs,l) can be a nonzero joint-feature vector for some value of fs.

Return type

list

length()[source]
Returns

The size of the fixed-length joint-feature vectors that are generated by this encoding.

Return type

int

classmethod train(train_toks, count_cutoff=0, labels=None, **options)[source]

Construct and return a new feature encoding, based on a given training corpus train_toks. See the class description TypedMaxentFeatureEncoding for a description of the joint-features that will be included in this encoding.

Note: the recognized feature value types are (int, float); other types are interpreted as regular binary features.

Parameters
  • train_toks (list(tuple(dict, str))) – Training data, represented as a list of pairs, the first member of which is a feature dictionary, and the second of which is a classification label.

  • count_cutoff (int) – A cutoff value that is used to discard rare joint-features. If a joint-feature occurs fewer than count_cutoff times in the training corpus, then that joint-feature is not included in the generated encoding.

  • labels (list) – A list of labels that should be used by the classifier. If not specified, then the set of labels attested in train_toks will be used.

  • options – Extra parameters for the constructor, such as unseen_features and alwayson_features.

nltk.classify.maxent.calculate_deltas(train_toks, classifier, unattested, ffreq_empirical, nfmap, nfarray, nftranspose, encoding)[source]

Calculate the update values for the classifier weights for this iteration of IIS. These update weights are the value of delta that solves the equation:

ffreq_empirical[i] = SUM[fs,l] (classifier.prob_classify(fs).prob(l) *
                                feature_vector(fs,l)[i] *
                                exp(delta[i] * nf(feature_vector(fs,l))))

Where:
  • (fs,l) is a (featureset, label) tuple from train_toks

  • feature_vector(fs,l) = encoding.encode(fs,l)

  • nf(vector) = sum([val for (id,val) in vector])

This method uses Newton’s method to solve this equation for delta[i]. In particular, it starts with a guess of delta[i] = 1; and iteratively updates delta with:

delta[i] -= (ffreq_empirical[i] - sum1[i])/(-sum2[i])

until convergence, where sum1 and sum2 are defined as:

sum1[i](delta) = SUM[fs,l] f[i](fs,l,delta)
sum2[i](delta) = SUM[fs,l] (f[i](fs,l,delta) * nf(feature_vector(fs,l)))
f[i](fs,l,delta) = (classifier.prob_classify(fs).prob(l) *
                    feature_vector(fs,l)[i] *
                    exp(delta[i] * nf(feature_vector(fs,l))))

Note that sum1 and sum2 depend on delta; so they need to be re-computed each iteration.
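The update above has the usual Newton shape x -= f(x)/f'(x). As an illustration only (a generic scalar Newton iteration applied to a toy equation, not the IIS internals), the same pattern looks like this:

```python
import math

# Generic Newton iteration: repeatedly subtract f(x)/f'(x) until the
# step size falls below a tolerance.
def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Solve exp(x) = 2, starting (like the delta update) from an initial
# guess of 1; the root is log(2).
root = newton(lambda x: math.exp(x) - 2, math.exp, 1.0)
```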

The variables nfmap, nfarray, and nftranspose are used to generate a dense encoding for nf(feature_vector(fs,l)). This allows calculate_deltas to compute sum1 and sum2 using matrices, which yields a significant performance improvement.

Parameters
  • train_toks (list(tuple(dict, str))) – The set of training tokens.

  • classifier (ClassifierI) – The current classifier.

  • ffreq_empirical (sequence of float) – An array containing the empirical frequency for each feature. The ith element of this array is the empirical frequency for feature i.

  • unattested (sequence of int) – An array that is 1 for features that are not attested in the training data; and 0 for features that are attested. In other words, unattested[i]==1 iff ffreq_empirical[i]==0.

  • nfmap (dict(int -> int)) – A map that can be used to compress nf to a dense vector.

  • nfarray (array(float)) – An array that can be used to uncompress nf from a dense vector.

  • nftranspose (array(float)) – The transpose of nfarray

nltk.classify.maxent.calculate_empirical_fcount(train_toks, encoding)[source]
nltk.classify.maxent.calculate_estimated_fcount(classifier, train_toks, encoding)[source]
nltk.classify.maxent.calculate_nfmap(train_toks, encoding)[source]

Construct a map that can be used to compress nf (which is typically sparse).

nf(feature_vector) is the sum of the feature values for feature_vector.

This represents the number of features that are active for a given labeled text. This method finds all values of nf(t) that are attested for at least one token in the given list of training tokens, and constructs a dictionary mapping these attested values to a continuous range 0…N. For example, if the only attested values of nf() were 3, 5, and 7, then this function might return the dictionary {3:0, 5:1, 7:2}.

Returns

A map that can be used to compress nf to a dense vector.

Return type

dict(int -> int)
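The construction can be sketched in a few lines (an illustration, not NLTK's own code; the attested values are sorted here for determinism, whereas NLTK's ordering may differ):

```python
# Sketch of nfmap construction: compute nf for each sparse joint-feature
# vector, then map each attested nf value to a dense index 0..N-1.
def make_nfmap(sparse_vectors):
    nfset = set()
    for vec in sparse_vectors:
        nfset.add(sum(val for (fid, val) in vec))
    return {nf: i for (i, nf) in enumerate(sorted(nfset))}

vecs = [[(0, 1), (2, 2)],          # nf = 3
        [(1, 5)],                  # nf = 5
        [(0, 3), (1, 4)],          # nf = 7
        [(3, 3)]]                  # nf = 3 again
print(make_nfmap(vecs))  # -> {3: 0, 5: 1, 7: 2}
```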

nltk.classify.maxent.demo()[source]
nltk.classify.maxent.train_maxent_classifier_with_gis(train_toks, trace=3, encoding=None, labels=None, **cutoffs)[source]

Train a new ConditionalExponentialClassifier on the given training samples, using the Generalized Iterative Scaling algorithm. This ConditionalExponentialClassifier will encode the model that maximizes entropy among all the models that are empirically consistent with train_toks.

See

train_maxent_classifier() for parameter descriptions.

nltk.classify.maxent.train_maxent_classifier_with_iis(train_toks, trace=3, encoding=None, labels=None, **cutoffs)[source]

Train a new ConditionalExponentialClassifier on the given training samples, using the Improved Iterative Scaling algorithm. This ConditionalExponentialClassifier will encode the model that maximizes entropy among all the models that are empirically consistent with train_toks.

See

train_maxent_classifier() for parameter descriptions.

nltk.classify.maxent.train_maxent_classifier_with_megam(train_toks, trace=3, encoding=None, labels=None, gaussian_prior_sigma=0, **kwargs)[source]

Train a new ConditionalExponentialClassifier on the given training samples, using the external megam library. This ConditionalExponentialClassifier will encode the model that maximizes entropy among all the models that are empirically consistent with train_toks.

See

train_maxent_classifier() for parameter descriptions.

See

nltk.classify.megam