nltk.classify.naivebayes module
A classifier based on the Naive Bayes algorithm. To find the probability of a label, this algorithm first uses Bayes' rule to express P(label|features) in terms of P(label) and P(features|label):
P(label|features) = P(label) * P(features|label) / P(features)
The algorithm then makes the ‘naive’ assumption that all features are independent, given the label:
P(label|features) = P(label) * P(f1|label) * ... * P(fn|label) / P(features)
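Rather than computing P(features) explicitly, the algorithm calculates only the numerator for each label and normalizes these values so that they sum to one:
P(label|features) = P(label) * P(f1|label) * ... * P(fn|label) / SUM[l]( P(l) * P(f1|l) * ... * P(fn|l) )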
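The computation described above can be sketched in plain Python. This is a minimal illustration rather than NLTK's own code: the function name `posterior`, the nested-dictionary layout, and the toy probabilities are all assumptions made for the example.

```python
from math import prod


def posterior(label_prior, feature_likelihood, features):
    """Return P(label|features) for every label under the naive assumption."""
    # Numerator for each label: P(label) * P(f1|label) * ... * P(fn|label)
    numerators = {
        label: prior * prod(feature_likelihood[label][f] for f in features)
        for label, prior in label_prior.items()
    }
    # Dividing by the sum of the numerators stands in for dividing by P(features).
    total = sum(numerators.values())
    return {label: value / total for label, value in numerators.items()}


# Hypothetical toy numbers, chosen only for illustration.
label_prior = {"spam": 0.4, "ham": 0.6}
feature_likelihood = {
    "spam": {"has_link": 0.7, "all_caps": 0.5},
    "ham": {"has_link": 0.2, "all_caps": 0.1},
}
result = posterior(label_prior, feature_likelihood, ["has_link", "all_caps"])
```

Because every numerator is divided by the same total, the returned values sum to one without P(features) ever being computed directly.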
- class nltk.classify.naivebayes.NaiveBayesClassifier
Bases: ClassifierI
A Naive Bayes classifier. Naive Bayes classifiers are parameterized by two probability distributions:
P(label) gives the probability that an input will receive each label, given no information about the input’s features.
P(fname=fval|label) gives the probability that a given feature (fname) will receive a given value (fval), given the label (label).
If the classifier encounters an input with a feature that has never been seen with any label, then rather than assigning a probability of 0 to all labels, it will ignore that feature.
The feature value ‘None’ is reserved for unseen feature values; you generally should not use ‘None’ as a feature value for one of your own features.
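The rule for features never seen with any label can be sketched as a filtering step. The helper name `drop_unseen_features` and the use of plain dictionaries in place of probability distribution objects are assumptions for illustration, not NLTK's API.

```python
def drop_unseen_features(featureset, feature_probdist, labels):
    """Return a copy of featureset without features unseen for every label."""
    return {
        fname: fval
        for fname, fval in featureset.items()
        # Keep the feature if at least one label has a distribution for it.
        if any((label, fname) in feature_probdist for label in labels)
    }


# Hypothetical distributions keyed by (label, fname) pairs.
feature_probdist = {
    ("spam", "has_link"): {True: 0.7, False: 0.3},
    ("ham", "has_link"): {True: 0.2, False: 0.8},
}
kept = drop_unseen_features(
    {"has_link": True, "language": "fr"}, feature_probdist, ["spam", "ham"]
)
```

Here `language` is dropped because no label has a distribution for it, so it cannot force every label's probability to 0.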
- __init__(label_probdist, feature_probdist)
- Parameters:
label_probdist – P(label), the probability distribution over labels. It is expressed as a ProbDistI whose samples are labels. I.e., P(label) = label_probdist.prob(label).
feature_probdist – P(fname=fval|label), the probability distribution for feature values, given labels. It is expressed as a dictionary whose keys are (label, fname) pairs and whose values are ProbDistI objects over feature values. I.e., P(fname=fval|label) = feature_probdist[label,fname].prob(fval). If a given (label,fname) is not a key in feature_probdist, then it is assumed that the corresponding P(fname=fval|label) is 0 for all values of fval.
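The lookup convention for feature_probdist can be sketched as follows. Plain dictionaries stand in for the ProbDistI objects here, and the helper name `feature_prob` is hypothetical. Returning 0 for a value missing from a plain dictionary is also an assumption of this sketch, since a real distribution decides for itself what probability an unseen value receives.

```python
def feature_prob(feature_probdist, label, fname, fval):
    """Look up P(fname=fval|label), treating a missing (label, fname) key as 0."""
    dist = feature_probdist.get((label, fname))
    if dist is None:
        # No distribution for this (label, fname) pair: probability 0 for all fval.
        return 0.0
    # Assumption of this sketch: a value absent from the plain dict also gets 0.
    return dist.get(fval, 0.0)


# Hypothetical data: "has_link" was only ever observed with the "spam" label.
feature_probdist = {("spam", "has_link"): {True: 0.7, False: 0.3}}

p_spam = feature_prob(feature_probdist, "spam", "has_link", True)
p_ham = feature_prob(feature_probdist, "ham", "has_link", True)
```

In this example `p_ham` is 0 because the pair ("ham", "has_link") is not a key, even though the feature is known for another label.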
- classify(featureset)
- Returns:
the most appropriate label for the given featureset.
- Return type:
label
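One way to read "most appropriate label" is the label with the highest posterior probability. Since normalization does not change which label scores highest, the unnormalized numerators are enough to pick it. The sketch below assumes plain dictionaries and a hypothetical function name `pick_label`; it is not the library's implementation.

```python
def pick_label(label_prior, feature_probdist, featureset):
    """Return the label with the largest P(label) * product of P(f|label)."""
    scores = {}
    for label, prior in label_prior.items():
        score = prior
        for fname, fval in featureset.items():
            score *= feature_probdist[label, fname][fval]
        scores[label] = score
    # The highest unnormalized score also has the highest posterior.
    return max(scores, key=scores.get)


# Hypothetical toy model with a single feature.
label_prior = {"play": 0.6, "stay": 0.4}
feature_probdist = {
    ("play", "outlook"): {"sunny": 0.8, "rainy": 0.2},
    ("stay", "outlook"): {"sunny": 0.3, "rainy": 0.7},
}
sunny_label = pick_label(label_prior, feature_probdist, {"outlook": "sunny"})
rainy_label = pick_label(label_prior, feature_probdist, {"outlook": "rainy"})
```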
- labels()
- Returns:
the list of category labels used by this classifier.
- Return type:
list of (immutable)
- most_informative_features(n=100)
Return a list of the ‘most informative’ features used by this classifier. For the purpose of this function, the informativeness of a feature (fname,fval) is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label:
max[ P(fname=fval|label1) / P(fname=fval|label2) ]
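The ratio above can be computed directly from per-label probabilities. The sketch below uses plain dictionaries and a hypothetical helper `informativeness`; it assumes every label assigns the value a nonzero probability, and it says nothing about how the library itself handles ties or zero probabilities.

```python
def informativeness(feature_probdist, labels, fname, fval):
    """Highest P(fname=fval|label) divided by the lowest, over all labels."""
    probs = [feature_probdist[label, fname][fval] for label in labels]
    return max(probs) / min(probs)


# Hypothetical toy distributions keyed by (label, fname) pairs.
labels = ["spam", "ham"]
feature_probdist = {
    ("spam", "has_link"): {True: 0.7, False: 0.3},
    ("ham", "has_link"): {True: 0.2, False: 0.8},
    ("spam", "all_caps"): {True: 0.5, False: 0.5},
    ("ham", "all_caps"): {True: 0.1, False: 0.9},
}

candidates = [("has_link", True), ("all_caps", True)]
# Sort so that the feature with the largest ratio comes first.
ranked = sorted(
    candidates,
    key=lambda f: informativeness(feature_probdist, labels, f[0], f[1]),
    reverse=True,
)
```

With these numbers, ("all_caps", True) ranks first because its ratio, 0.5 / 0.1, is larger than the 0.7 / 0.2 ratio for ("has_link", True).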