nltk.tag package

Submodules

nltk.tag.api module

Interface for tagging each token in a sentence with supplementary information, such as its part of speech.

class nltk.tag.api.FeaturesetTaggerI[source]

Bases: nltk.tag.api.TaggerI

A tagger that requires tokens to be featuresets. A featureset is a dictionary that maps from feature names to feature values. See nltk.classify for more information about features and featuresets.

class nltk.tag.api.TaggerI[source]

Bases: builtins.object

A processing interface for assigning a tag to each token in a list. Tags are case-sensitive strings that identify some property of each token, such as its part of speech or its sense.

Some taggers require specific types for their tokens. This is generally indicated by the use of a sub-interface to TaggerI. For example, featureset taggers, which are subclassed from FeaturesetTaggerI, require that each token be a featureset.

Subclasses must define:
  • either tag() or tag_sents() (or both)
evaluate(gold)[source]

Score the accuracy of the tagger against the gold standard. Strip the tags from the gold standard text, retag it using the tagger, then compute the accuracy score.

Parameters:gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
Return type:float
tag(tokens)[source]

Determine the most appropriate tag sequence for the given token sequence, and return a corresponding list of tagged tokens. A tagged token is encoded as a tuple (token, tag).

Return type:list(tuple(str, str))
tag_sents(sentences)[source]

Apply self.tag() to each element of sentences. I.e.:

return [self.tag(sent) for sent in sentences]
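
As a rough illustration of this contract, the following standalone sketch mimics the interface without importing nltk. The class name SuffixTagger and its suffix rule are hypothetical; only the method names tag(), tag_sents() and evaluate() come from the description above.

```python
# Standalone sketch of the TaggerI contract; it does not import nltk.
class SuffixTagger:
    """Hypothetical tagger: words ending in 's' get NNS, everything else NN."""

    def tag(self, tokens):
        # Return one (token, tag) tuple per input token.
        return [(tok, "NNS" if tok.endswith("s") else "NN") for tok in tokens]

    def tag_sents(self, sentences):
        # Tag each sentence in turn, as described for tag_sents() above.
        return [self.tag(sent) for sent in sentences]

    def evaluate(self, gold):
        # Strip the gold tags, retag the bare tokens, then compare tag by tag.
        untagged = [[tok for tok, _ in sent] for sent in gold]
        predicted = self.tag_sents(untagged)
        pairs = [
            (p_tag, g_tag)
            for p_sent, g_sent in zip(predicted, gold)
            for (_, p_tag), (_, g_tag) in zip(p_sent, g_sent)
        ]
        return sum(p == g for p, g in pairs) / len(pairs)


gold = [[("dogs", "NNS"), ("bark", "VBP")], [("the", "DT"), ("cat", "NN")]]
tagger = SuffixTagger()
accuracy = tagger.evaluate(gold)  # two of the four tags match the gold standard
```

A real tagger would subclass nltk.tag.api.TaggerI and inherit tag_sents() and evaluate() rather than redefine them.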

nltk.tag.brill module

class nltk.tag.brill.BrillTagger(initial_tagger, rules, training_stats=None)[source]

Bases: nltk.tag.api.TaggerI

Brill’s transformational rule-based tagger. Brill taggers use an initial tagger (such as tag.DefaultTagger) to assign an initial tag sequence to a text; and then apply an ordered list of transformational rules to correct the tags of individual tokens. These transformation rules are specified by the TagRule interface.

Brill taggers can be created directly, from an initial tagger and a list of transformational rules; but more often, Brill taggers are created by learning rules from a training corpus, using one of the TaggerTrainers available.
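
The idea of correcting an initial tagging with an ordered rule list can be sketched in plain Python. This is only an illustration of the technique, not the BrillTagger implementation: the Rule tuple, initial_tag() and apply_rules() are hypothetical names, and the rules condition only on the current word and the previous tag.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    """Hypothetical rule: retag `original` as `replacement` when the conditions hold."""
    original: str
    replacement: str
    prev_tag: Optional[str] = None  # required tag of the previous token, if any
    word: Optional[str] = None      # required text of the current token, if any


def initial_tag(tokens):
    # Stand-in for an initial tagger: every token starts out as NN.
    return [(tok, "NN") for tok in tokens]


def matches(tagged, i, rule):
    word, tag = tagged[i]
    if tag != rule.original:
        return False
    if rule.word is not None and word != rule.word:
        return False
    if rule.prev_tag is not None and (i == 0 or tagged[i - 1][1] != rule.prev_tag):
        return False
    return True


def apply_rules(tagged, rules):
    tagged = list(tagged)
    for rule in rules:  # rules fire in order, each over the whole sequence
        # Collect matching positions first, then rewrite them.
        positions = [i for i in range(len(tagged)) if matches(tagged, i, rule)]
        for i in positions:
            tagged[i] = (tagged[i][0], rule.replacement)
    return tagged


tokens = ["I", "want", "to", "run"]
rules = [
    Rule("NN", "TO", word="to"),      # NN->TO if Word:to@[0]
    Rule("NN", "VB", prev_tag="TO"),  # NN->VB if Pos:TO@[-1]
]
corrected = apply_rules(initial_tag(tokens), rules)
```

The second rule only fires because the first one has already introduced a TO tag, which is why the rule list is ordered.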

batch_tag_incremental(sequences, gold)[source]

Tags by applying each rule to the entire corpus (rather than all rules to a single sequence). The point is to collect statistics on the test set for individual rules.

NOTE: This is inefficient (it does not build any index, so it will traverse the entire corpus N times for N rules). Usually you would not care about statistics for individual rules and would use tag_sents() instead.

Parameters:
  • sequences (list of list of strings) – lists of token sequences (sentences, in some applications) to be tagged
  • gold (list of list of strings) – the gold standard
Returns:

tuple of (tagged_sequences, ordered list of rule scores (one for each rule))

classmethod decode_json_obj(obj)[source]
encode_json_obj()[source]
json_tag = 'nltk.tag.BrillTagger'
print_template_statistics(test_stats=None, printunused=True)[source]

Print a list of all templates, ranked according to efficiency.

If test_stats is available, the templates are ranked according to their relative contribution (summed for all rules created from a given template, weighted by score) to the performance on the test set. If no test_stats, then statistics collected during training are used instead. There is also an unweighted measure (just counting the rules). This is less informative, though, as many low-score rules will appear towards end of training.

Parameters:
  • test_stats (dict of str -> any (but usually numbers)) – dictionary of statistics collected during testing
  • printunused (bool) – if True, print a list of all unused templates
Returns:

None

Return type:

None

rules()[source]

Return the ordered list of transformation rules that this tagger has learnt

Returns:the ordered list of transformation rules that correct the initial tagging
Return type:list of Rules
tag(tokens)[source]
train_stats(statistic=None)[source]

Return a named statistic collected during training, or a dictionary of all available statistics if no name given

Parameters:statistic (str) – name of statistic
Returns:some statistic collected during training of this tagger
Return type:any (but usually a number)
class nltk.tag.brill.Pos(positions, end=None)[source]

Bases: nltk.tbl.feature.Feature

Feature which examines the tags of nearby tokens.

static extract_property(tokens, index)[source]

Returns:the given token’s tag

json_tag = 'nltk.tag.brill.Pos'
class nltk.tag.brill.Word(positions, end=None)[source]

Bases: nltk.tbl.feature.Feature

Feature which examines the text (word) of nearby tokens.

static extract_property(tokens, index)[source]

Returns:the given token’s text

json_tag = 'nltk.tag.brill.Word'
nltk.tag.brill.brill24()[source]

Return 24 templates of the seminal TBL paper, Brill (1995)

nltk.tag.brill.describe_template_sets()[source]

Print the available template sets in this demo, with a short description.

nltk.tag.brill.fntbl37()[source]

Return 37 templates taken from the postagging task of the fntbl distribution http://www.cs.jhu.edu/~rflorian/fntbl/ (37 is after excluding a handful which do not condition on Pos[0]; fntbl can do that but the current nltk implementation cannot.)

nltk.tag.brill.nltkdemo18()[source]

Return 18 templates, from the original nltk demo, in multi-feature syntax

nltk.tag.brill.nltkdemo18plus()[source]

Return 18 templates, from the original nltk demo, and additionally a few multi-feature ones (the motivation is easy comparison with nltkdemo18)

nltk.tag.brill_trainer module

class nltk.tag.brill_trainer.BrillTaggerTrainer(initial_tagger, templates, trace=0, deterministic=None, ruleformat='str')[source]

Bases: builtins.object

A trainer for tbl taggers.

train(train_sents, max_rules=200, min_score=2, min_acc=None)[source]

Trains the Brill tagger on the corpus train_sents, producing at most max_rules transformations, each of which reduces the net number of errors in the corpus by at least min_score, and each of which has accuracy not lower than min_acc.

#imports
>>> from nltk.tbl.template import Template
>>> from nltk.tag.brill import Pos, Word
>>> from nltk.tag import RegexpTagger, BrillTaggerTrainer, untag

#some data
>>> from nltk.corpus import treebank
>>> training_data = treebank.tagged_sents()[:100]
>>> baseline_data = treebank.tagged_sents()[100:200]
>>> gold_data = treebank.tagged_sents()[200:300]
>>> testing_data = [untag(s) for s in gold_data]

>>> backoff = RegexpTagger([
... (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
... (r'(The|the|A|a|An|an)$', 'AT'),   # articles
... (r'.*able$', 'JJ'),                # adjectives
... (r'.*ness$', 'NN'),                # nouns formed from adjectives
... (r'.*ly$', 'RB'),                  # adverbs
... (r'.*s$', 'NNS'),                  # plural nouns
... (r'.*ing$', 'VBG'),                # gerunds
... (r'.*ed$', 'VBD'),                 # past tense verbs
... (r'.*', 'NN')                      # nouns (default)
... ])
>>> baseline = backoff #see NOTE1
>>> baseline.evaluate(gold_data) 
0.2450142...

#templates
>>> Template._cleartemplates() #clear any templates created in earlier tests
>>> templates = [Template(Pos([-1])), Template(Pos([-1]), Word([0]))]

#construct a BrillTaggerTrainer
>>> tt = BrillTaggerTrainer(baseline, templates, trace=3)
>>> tagger1 = tt.train(training_data, max_rules=10)
TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: None)
Finding initial useful rules...
    Found 845 useful rules.
<BLANKLINE>
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  47  63  16 161  | NN->IN if Pos:NNS@[-1]
  33  33   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | IN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | IN->, if Pos:NNS@[-1] & Word:,@[0]
  22  27   5  24  | NN->-NONE- if Pos:VBD@[-1]
  17  17   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]
>>> tagger1.rules()[1:3]
(Rule('001', 'NN', ',', [(Pos([-1]),'NN'), (Word([0]),',')]), Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]))
>>> train_stats = tagger1.train_stats()
>>> [train_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1775, 1269, [132, 85, 69, 51, 47, 33, 26, 24, 22, 17]]

##FIXME: the following test fails -- why?
#
#>>> tagger1.print_template_statistics(printunused=False)
#TEMPLATE STATISTICS (TRAIN)  2 templates, 10 rules)
#TRAIN (   3163 tokens) initial  2358 0.2545 final:  1719 0.4565
##ID | Score (train) |  #Rules     | Template
#--------------------------------------------
#001 |   404   0.632 |   7   0.700 | Template(Pos([-1]),Word([0]))
#000 |   235   0.368 |   3   0.300 | Template(Pos([-1]))
#<BLANKLINE>
#<BLANKLINE>

>>> tagger1.evaluate(gold_data) 
0.43996...
>>> (tagged, test_stats) = tagger1.batch_tag_incremental(testing_data, gold_data)
>>> tagged[33][12:] == [('foreign', 'IN'), ('debt', 'NN'), ('of', 'IN'), ('$', 'NN'), ('64', 'CD'),
... ('billion', 'NN'), ('*U*', 'NN'), ('--', 'NN'), ('the', 'DT'), ('third-highest', 'NN'), ('in', 'NN'),
... ('the', 'DT'), ('developing', 'VBG'), ('world', 'NN'), ('.', '.')]
True
>>> [test_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1855, 1376, [100, 85, 67, 58, 27, 36, 27, 16, 31, 32]]

##a high-accuracy tagger
>>> tagger2 = tt.train(training_data, max_rules=10, min_acc=0.99)
TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: 0.99)
Finding initial useful rules...
    Found 845 useful rules.
<BLANKLINE>
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  36  36   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | NN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | NN->, if Pos:NNS@[-1] & Word:,@[0]
  19  19   0   6  | NN->VB if Pos:TO@[-1]
  18  18   0   0  | CD->-NONE- if Pos:NN@[-1] & Word:0@[0]
  18  18   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]
>>> tagger2.evaluate(gold_data) 
0.44159544...
>>> tagger2.rules()[2:4]
(Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]), Rule('001', 'NN', 'IN', [(Pos([-1]),'NN'), (Word([0]),'of')]))

#NOTE1: (!!FIXME) A far better baseline uses nltk.tag.UnigramTagger,
#with a RegexpTagger only as backoff. For instance,
#>>> baseline = UnigramTagger(baseline_data, backoff=backoff)
#However, as of Nov 2013, nltk.tag.UnigramTagger does not yield consistent results
#between python versions. The simplistic backoff above is a workaround to make doctests
#get consistent input.

Parameters:
  • train_sents (list(list(tuple))) – training data
  • max_rules (int) – output at most max_rules rules
  • min_score (int) – stop training when no rules better than min_score can be found
  • min_acc (float or None) – discard any rule with lower accuracy than min_acc
Returns:

the learned tagger

Return type:

BrillTagger
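
The Score, Fixed, Broken and Other columns of the training trace can be reproduced for a single candidate rule with a few lines of Python. This is a sketch under assumptions, not the trainer's code: score_rule() is a hypothetical helper, and the accuracy formula Fixed / (Fixed + Broken) is an assumption (it is consistent with the min_acc=0.99 trace, where a rule with a non-zero Other count is still accepted).

```python
def score_rule(current, gold, changes):
    """Count the effect of one candidate rule.

    current -- list of tags before the rule is applied
    gold    -- list of gold-standard tags, same length
    changes -- dict mapping position -> tag the rule would assign there
    """
    fixed = broken = other = 0
    for i, new_tag in changes.items():
        if new_tag == current[i]:
            continue                  # nothing changes at this position
        if new_tag == gold[i]:
            fixed += 1                # incorrect -> correct
        elif current[i] == gold[i]:
            broken += 1               # correct -> incorrect
        else:
            other += 1                # incorrect -> incorrect
    score = fixed - broken
    # Assumed definition of rule accuracy; undefined when nothing is fixed or broken.
    accuracy = fixed / (fixed + broken) if fixed + broken else None
    return {"score": score, "fixed": fixed, "broken": broken,
            "other": other, "accuracy": accuracy}


current = ["NN", "NN", "NN", "NN"]
gold = ["IN", "NN", "DT", "IN"]
result = score_rule(current, gold, {0: "IN", 1: "IN", 2: "IN", 3: "IN"})


def keep_rule(result, min_score=2, min_acc=None):
    # Mirror the two thresholds documented for train().
    if result["score"] < min_score:
        return False
    if min_acc is not None and (result["accuracy"] is None or result["accuracy"] < min_acc):
        return False
    return True
```

With the toy data above the rule fixes two tags, breaks one and leaves one wrong, so its score of 1 falls below the default min_score of 2.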

nltk.tag.brill_trainer_orig module

class nltk.tag.brill_trainer_orig.BrillTaggerTrainer(initial_tagger, templates, trace=0, deterministic=None, ruleformat='str')[source]

Bases: builtins.object

A trainer for tbl taggers, superseded by nltk.tag.brill_trainer.BrillTaggerTrainer

Parameters:deterministic – If true, then choose between rules that have the same score by picking the one whose __repr__ is lexicographically smaller. If false, then just pick the first rule we find with a given score – this will depend on the order in which keys are returned from dictionaries, and so may not be the same from one run to the next. If not specified, treat as true iff trace > 0.
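
The tie-breaking behaviour described for deterministic can be sketched as follows. The helper names pick_rule() and resolve_deterministic() are hypothetical, and rules are represented here by plain strings rather than Rule objects.

```python
def resolve_deterministic(deterministic, trace):
    # If not specified, treat as true iff trace > 0.
    return trace > 0 if deterministic is None else bool(deterministic)


def pick_rule(scored_rules, deterministic):
    """Choose among the highest-scoring rules.

    scored_rules -- list of (rule, score) pairs in the order they were found
    """
    best = max(score for _, score in scored_rules)
    tied = [rule for rule, score in scored_rules if score == best]
    if deterministic:
        # Lexicographically smallest repr wins, so the choice is reproducible.
        return min(tied, key=repr)
    # Otherwise take the first one found; this depends on iteration order.
    return tied[0]


scored = [
    ("NN->VB if Pos:TO@[-1]", 5),
    ("NN->IN if Pos:NNS@[-1]", 5),
    ("AT->DT if Pos:NN@[-1]", 3),
]
stable_choice = pick_rule(scored, deterministic=True)
first_found = pick_rule(scored, deterministic=False)
```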
train(train_sents, max_rules=200, min_score=2, min_acc=None)[source]

Trains the Brill tagger on the corpus train_sents, producing at most max_rules transformations, each of which reduces the net number of errors in the corpus by at least min_score, and each of which has accuracy not lower than min_acc.

#imports
>>> from nltk.tbl.template import Template
>>> from nltk.tag.brill import Pos, Word
>>> from nltk.tag import RegexpTagger, untag
>>> from nltk.tag.brill_trainer_orig import BrillTaggerTrainer

#some data
>>> from nltk.corpus import treebank
>>> training_data = treebank.tagged_sents()[:100]
>>> baseline_data = treebank.tagged_sents()[100:200]
>>> gold_data = treebank.tagged_sents()[200:300]
>>> testing_data = [untag(s) for s in gold_data]

>>> backoff = RegexpTagger([
... (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
... (r'(The|the|A|a|An|an)$', 'AT'),   # articles
... (r'.*able$', 'JJ'),                # adjectives
... (r'.*ness$', 'NN'),                # nouns formed from adjectives
... (r'.*ly$', 'RB'),                  # adverbs
... (r'.*s$', 'NNS'),                  # plural nouns
... (r'.*ing$', 'VBG'),                # gerunds
... (r'.*ed$', 'VBD'),                 # past tense verbs
... (r'.*', 'NN')                      # nouns (default)
... ])
>>> baseline = backoff #see NOTE1
>>> baseline.evaluate(gold_data) 
0.2450142...

#templates
>>> Template._cleartemplates() #clear any templates created in earlier tests
>>> templates = [Template(Pos([-1])), Template(Pos([-1]), Word([0]))]

#construct a BrillTaggerTrainer
>>> tt = BrillTaggerTrainer(baseline, templates, trace=3)
>>> tagger1 = tt.train(training_data, max_rules=10)
TBL train (orig) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: None)
<BLANKLINE>
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  47  63  16 161  | NN->IN if Pos:NNS@[-1]
  33  33   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | IN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | IN->, if Pos:NNS@[-1] & Word:,@[0]
  22  27   5  24  | NN->-NONE- if Pos:VBD@[-1]
  17  17   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]
>>> tagger1.rules()[1:3]
(Rule('001', 'NN', ',', [(Pos([-1]),'NN'), (Word([0]),',')]), Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]))
>>> train_stats = tagger1.train_stats()
>>> [train_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1775, 1269, [132, 85, 69, 51, 47, 33, 26, 24, 22, 17]]

##FIXME: the following test fails -- why?
#
#>>> tagger1.print_template_statistics(printunused=False)
#TEMPLATE STATISTICS (TRAIN)  2 templates, 10 rules)
#TRAIN (   3163 tokens) initial  2358 0.2545 final:  1719 0.4565
##ID | Score (train) |  #Rules     | Template
#--------------------------------------------
#001 |   404   0.632 |   7   0.700 | Template(Pos([-1]),Word([0]))
#000 |   235   0.368 |   3   0.300 | Template(Pos([-1]))
#<BLANKLINE>
#<BLANKLINE>

>>> tagger1.evaluate(gold_data) 
0.43996...
>>> (tagged, test_stats) = tagger1.batch_tag_incremental(testing_data, gold_data)
>>> tagged[33][12:] == [('foreign', 'IN'), ('debt', 'NN'), ('of', 'IN'), ('$', 'NN'), ('64', 'CD'),
... ('billion', 'NN'), ('*U*', 'NN'), ('--', 'NN'), ('the', 'DT'), ('third-highest', 'NN'), ('in', 'NN'),
... ('the', 'DT'), ('developing', 'VBG'), ('world', 'NN'), ('.', '.')]
True
>>> [test_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1855, 1376, [100, 85, 67, 58, 27, 36, 27, 16, 31, 32]]

##a high-accuracy tagger
>>> tagger2 = tt.train(training_data, max_rules=10, min_acc=0.99)
TBL train (orig) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: 0.99)
<BLANKLINE>
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  36  36   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | NN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | NN->, if Pos:NNS@[-1] & Word:,@[0]
  19  19   0   6  | NN->VB if Pos:TO@[-1]
  18  18   0   0  | CD->-NONE- if Pos:NN@[-1] & Word:0@[0]
  18  18   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]
>>> tagger2.evaluate(gold_data) 
0.44159544...
>>> tagger2.rules()[2:4]
(Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]), Rule('001', 'NN', 'IN', [(Pos([-1]),'NN'), (Word([0]),'of')]))

#NOTE1: (!!FIXME) A far better baseline uses nltk.tag.UnigramTagger,
#with a RegexpTagger only as backoff. For instance,
#>>> baseline = UnigramTagger(baseline_data, backoff=backoff)
#However, as of Nov 2013, nltk.tag.UnigramTagger does not yield consistent results
#between python versions. The simplistic backoff above is a workaround to make doctests
#get consistent input.

Parameters:
  • train_sents (list(list(tuple))) – training data
  • max_rules (int) – output at most max_rules rules
  • min_score (int) – stop training when no rules better than min_score can be found
  • min_acc (float or None) – discard any rule with lower accuracy than min_acc
Returns:

the learned tagger

Return type:

BrillTagger

nltk.tag.hmm module

Hidden Markov Models (HMMs) are largely used to assign the correct label sequence to sequential data or to assess the probability of a given label and data sequence. These models are finite state machines characterised by a number of states, transitions between these states, and output symbols emitted while in each state. The HMM is an extension to the Markov chain, where each state corresponds deterministically to a given event. In the HMM the observation is a probabilistic function of the state. HMMs share the Markov chain’s assumption that the probability of transition from one state to another depends only on the current state, i.e. the series of states that led to the current state is not used. They are also time invariant.

The HMM is a directed graph, with probability weighted edges (representing the probability of a transition between the source and sink states) where each vertex emits an output symbol when entered. The symbol (or observation) is non-deterministically generated. For this reason, knowing that a sequence of output observations was generated by a given HMM does not mean that the corresponding sequence of states (and what the current state is) is known. This is the ‘hidden’ in the hidden Markov model.

Formally, an HMM can be characterised by:

  • the output observation alphabet. This is the set of symbols which may be observed as output of the system.
  • the set of states.
  • the transition probabilities a_{ij} = P(s_t = j | s_{t-1} = i). These represent the probability of transition to each state from a given state.
  • the output probability matrix b_i(k) = P(X_t = o_k | s_t = i). These represent the probability of observing each symbol in a given state.
  • the initial state distribution. This gives the probability of starting in each state.

To ground this discussion, take a common NLP application, part-of-speech (POS) tagging. An HMM is desirable for this task as the highest probability tag sequence can be calculated for a given sequence of word forms. This differs from other tagging techniques which often tag each word individually, seeking to optimise each individual tagging greedily without regard to the optimal combination of tags for a larger unit, such as a sentence. The HMM does this with the Viterbi algorithm, which efficiently computes the optimal path through the graph given the sequence of word forms.

In POS tagging the states usually have a 1:1 correspondence with the tag alphabet - i.e. each state represents a single tag. The output observation alphabet is the set of word forms (the lexicon), and the remaining three parameters are derived by a training regime. With this information the probability of a given sentence can be easily derived, by simply summing the probability of each distinct path through the model. Similarly, the highest probability tagging sequence can be derived with the Viterbi algorithm, yielding a state sequence which can be mapped into a tag sequence.
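
Both computations can be sketched on a toy model in plain Python. The two-state model below (its tags, words and probabilities) is invented for illustration, and the functions forward_probability() and viterbi() are hypothetical helpers rather than the module's implementation; they work with raw probabilities instead of log-probabilities to stay short.

```python
# Toy HMM parameters; all values are made up for this sketch.
states = ["DT", "NN"]
priors = {"DT": 0.6, "NN": 0.4}                       # initial state distribution
transitions = {"DT": {"DT": 0.1, "NN": 0.9},          # transitions[i][j] = P(s_t = j | s_{t-1} = i)
               "NN": {"DT": 0.4, "NN": 0.6}}
outputs = {"DT": {"the": 0.9, "dog": 0.1},            # outputs[i][k] = P(X_t = k | s_t = i)
           "NN": {"the": 0.1, "dog": 0.9}}


def forward_probability(words):
    # alpha[s] = probability of the words seen so far with the model ending in state s.
    alpha = {s: priors[s] * outputs[s][words[0]] for s in states}
    for word in words[1:]:
        alpha = {
            s: sum(alpha[r] * transitions[r][s] for r in states) * outputs[s][word]
            for s in states
        }
    # Summing over the final state sums over every distinct path.
    return sum(alpha.values())


def viterbi(words):
    # best[s] = (probability, path) of the most probable path ending in state s.
    best = {s: (priors[s] * outputs[s][words[0]], [s]) for s in states}
    for word in words[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * transitions[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * outputs[s][word], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


sentence = ["the", "dog"]
sentence_probability = forward_probability(sentence)
best_probability, best_tags = viterbi(sentence)
```

The forward pass replaces the explicit sum over paths with a sum at each time step, and Viterbi replaces that sum with a max while remembering the path.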

This discussion assumes that the HMM has been trained. This is probably the most difficult task with the model, and requires either MLE estimates of the parameters or unsupervised learning using the Baum-Welch algorithm, a variant of EM.

For more information, please consult the source code for this module, which includes extensive demonstration code.

class nltk.tag.hmm.HiddenMarkovModelTagger(symbols, states, transitions, outputs, priors, transform=_identity)[source]

Bases: nltk.tag.api.TaggerI

Hidden Markov model class, a generative model for labelling sequence data. These models define the joint probability of a sequence of symbols and their labels (state transitions) as the product of the starting state probability, the probability of each state transition, and the probability of each observation being generated from each state. This is described in more detail in the module documentation.

This implementation is based on the HMM description in Chapter 8 of Huang, Acero and Hon, Spoken Language Processing, and includes an extension for training shallow HMM parsers or specialized HMMs as in Molina et al., 2002. A specialized HMM modifies training data by applying a specialization function to create a new training set that is more appropriate for sequential tagging with an HMM. A typical use case is chunking.

Parameters:
  • symbols (seq of any) – the set of output symbols (alphabet)
  • states (seq of any) – a set of states representing state space
  • transitions (ConditionalProbDistI) – transition probabilities; Pr(s_i | s_j) is the probability of transition to state i given the model is in state j
  • outputs (ConditionalProbDistI) – output probabilities; Pr(o_k | s_i) is the probability of emitting symbol k when entering state i
  • priors (ProbDistI) – initial state distribution; Pr(s_i) is the probability of starting in state i
  • transform (callable) – an optional function for transforming training instances, defaults to the identity function.
best_path(unlabeled_sequence)[source]

Returns the state sequence of the optimal (most probable) path through the HMM. Uses the Viterbi algorithm to calculate this part by dynamic programming.

Returns:the state sequence
Return type:sequence of any
Parameters:unlabeled_sequence (list) – the sequence of unlabeled symbols
best_path_simple(unlabeled_sequence)[source]

Returns the state sequence of the optimal (most probable) path through the HMM. Uses the Viterbi algorithm to calculate this part by dynamic programming. This uses a simple, direct method, and is included for teaching purposes.

Returns:the state sequence
Return type:sequence of any
Parameters:unlabeled_sequence (list) – the sequence of unlabeled symbols
entropy(unlabeled_sequence)[source]

Returns the entropy over labellings of the given sequence. This is given by:

H(O) = - sum_S Pr(S | O) log Pr(S | O)

where the summation ranges over all state sequences, S. Let Z = Pr(O) = sum_S Pr(S, O) where the summation ranges over all state sequences and O is the observation sequence. As such the entropy can be re-expressed as:

H = - sum_S Pr(S | O) log [ Pr(S, O) / Z ]
= log Z - sum_S Pr(S | O) log Pr(S, O)
= log Z - sum_S Pr(S | O) [ log Pr(S_0) + sum_t log Pr(S_t | S_{t-1}) + sum_t log Pr(O_t | S_t) ]

The order of summation for the log terms can be flipped, allowing dynamic programming to be used to calculate the entropy. Specifically, we use the forward and backward probabilities (alpha, beta) giving:

H = log Z - sum_s0 alpha_0(s0) beta_0(s0) / Z * log Pr(s0)
- sum_t,si,sj alpha_t(si) Pr(sj | si) Pr(O_t+1 | sj) beta_t+1(sj) / Z * log Pr(sj | si)
- sum_t,st alpha_t(st) beta_t(st) / Z * log Pr(O_t | st)

This simply uses alpha and beta to find the probabilities of partial sequences, constrained to include the given state(s) at some point in time.
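
For very short sequences the definition of H(O) can be checked by brute force, enumerating every state sequence instead of using alpha and beta. The sketch below does that on an invented two-state model; the names joint() and entropy_brute_force() are hypothetical, and base-2 logarithms are an arbitrary choice made for this example.

```python
import itertools
import math

# Toy HMM parameters; all values are made up for this sketch.
states = ["DT", "NN"]
priors = {"DT": 0.6, "NN": 0.4}
transitions = {"DT": {"DT": 0.1, "NN": 0.9}, "NN": {"DT": 0.4, "NN": 0.6}}
outputs = {"DT": {"the": 0.9, "dog": 0.1}, "NN": {"the": 0.1, "dog": 0.9}}


def joint(path, words):
    # Pr(S, O): prior, then one transition and one emission per position.
    p = priors[path[0]] * outputs[path[0]][words[0]]
    for t in range(1, len(words)):
        p *= transitions[path[t - 1]][path[t]] * outputs[path[t]][words[t]]
    return p


def entropy_brute_force(words):
    joints = [joint(path, words)
              for path in itertools.product(states, repeat=len(words))]
    z = sum(joints)  # Z = Pr(O)
    # Direct definition: H = - sum_S Pr(S | O) log Pr(S | O)
    direct = -sum((p / z) * math.log2(p / z) for p in joints if p > 0)
    # Re-expressed form: H = log Z - sum_S Pr(S | O) log Pr(S, O)
    rearranged = math.log2(z) - sum((p / z) * math.log2(p) for p in joints if p > 0)
    return direct, rearranged


direct, rearranged = entropy_brute_force(["the", "dog"])
```

The enumeration grows exponentially with sequence length, which is the cost the forward and backward probabilities avoid.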

log_probability(sequence)[source]

Returns the log-probability of the given symbol sequence. If the sequence is labelled, then returns the joint log-probability of the symbol, state sequence. Otherwise, uses the forward algorithm to find the log-probability over all label sequences.

Returns:the log-probability of the sequence
Return type:float
Parameters:sequence (Token) – the sequence of symbols which must contain the TEXT property, and optionally the TAG property
point_entropy(unlabeled_sequence)[source]

Returns the pointwise entropy over the possible states at each position in the chain, given the observation sequence.

probability(sequence)[source]

Returns the probability of the given symbol sequence. If the sequence is labelled, then returns the joint probability of the symbol, state sequence. Otherwise, uses the forward algorithm to find the probability over all label sequences.

Returns:the probability of the sequence
Return type:float
Parameters:sequence (Token) – the sequence of symbols which must contain the TEXT property, and optionally the TAG property
random_sample(rng, length)[source]

Randomly sample the HMM to generate a sentence of a given length. This samples the prior distribution then the observation distribution and transition distribution for each subsequent observation and state. This will mostly generate unintelligible garbage, but can provide some amusement.

Returns:

the randomly created state/observation sequence, generated according to the HMM’s probability distributions. The SUBTOKENS have TEXT and TAG properties containing the observation and state respectively.

Return type:

list

Parameters:
  • rng (Random (or any object with a random() method)) – random number generator
  • length (int) – desired output length
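
The sampling procedure can be sketched in plain Python as follows. The toy model and the helper draw() are invented for illustration; like the method above, the function takes any object with a random() method, and here it returns (symbol, state) pairs.

```python
import random

# Toy HMM parameters; all values are made up for this sketch.
priors = {"DT": 0.6, "NN": 0.4}
transitions = {"DT": {"DT": 0.1, "NN": 0.9}, "NN": {"DT": 0.4, "NN": 0.6}}
outputs = {"DT": {"the": 0.9, "a": 0.1}, "NN": {"dog": 0.5, "cat": 0.5}}


def draw(rng, dist):
    # Inverse-CDF sampling from a {outcome: probability} dictionary.
    r = rng.random()
    cumulative = 0.0
    for outcome, p in dist.items():
        cumulative += p
        if r < cumulative:
            return outcome
    return outcome  # guard against floating-point rounding


def random_sample(rng, length):
    if length <= 0:
        return []
    # Sample the prior once, then alternate emission and transition.
    state = draw(rng, priors)
    sample = [(draw(rng, outputs[state]), state)]
    for _ in range(1, length):
        state = draw(rng, transitions[state])
        sample.append((draw(rng, outputs[state]), state))
    return sample


sample = random_sample(random.Random(0), 5)
```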
reset_cache()[source]
tag(unlabeled_sequence)[source]

Tags the sequence with the highest probability state sequence. This uses the best_path method to find the Viterbi path.

Returns:a labelled sequence of symbols
Return type:list
Parameters:unlabeled_sequence (list) – the sequence of unlabeled symbols
test(test_sequence, verbose=False, **kwargs)[source]

Tests the HiddenMarkovModelTagger instance.

Parameters:
  • test_sequence (list(list)) – a sequence of labeled test instances
  • verbose (bool) – boolean flag indicating whether testing should be verbose or include printed output
classmethod train(labeled_sequence, test_sequence=None, unlabeled_sequence=None, **kwargs)[source]

Train a new HiddenMarkovModelTagger using the given labeled and unlabeled training instances. Testing will be performed if test instances are provided.

Returns:

a hidden markov model tagger

Return type:

HiddenMarkovModelTagger

Parameters:
  • labeled_sequence (list(list)) – a sequence of labeled training instances, i.e. a list of sentences represented as tuples
  • test_sequence (list(list)) – a sequence of labeled test instances
  • unlabeled_sequence (list(list)) – a sequence of unlabeled training instances, i.e. a list of sentences represented as words
  • transform (function) – an optional function for transforming training instances, defaults to the identity function, see transform()
  • estimator (class or function) – an optional function or class that maps a condition’s frequency distribution to its probability distribution, defaults to a Lidstone distribution with gamma = 0.1
  • verbose (bool) – boolean flag indicating whether training should be verbose or include printed output
  • max_iterations (int) – number of Baum-Welch iterations to perform
unicode_repr()
class nltk.tag.hmm.HiddenMarkovModelTrainer(states=None, symbols=None)[source]

Bases: builtins.object

Algorithms for learning HMM parameters from training data. These include both supervised learning (MLE) and unsupervised learning (Baum-Welch).

Creates an HMM trainer to induce an HMM with the given states and output symbol alphabet. Either a supervised or an unsupervised training method may be used. If either the states or the symbols are not given, they may be derived from supervised training.

Parameters:
  • states (sequence of any) – the set of state labels
  • symbols (sequence of any) – the set of observation symbols
train(labeled_sequences=None, unlabeled_sequences=None, **kwargs)[source]

Trains the HMM using both (or either of) supervised and unsupervised techniques.

Returns:

the trained model

Return type:

HiddenMarkovModelTagger

Parameters:
  • labeled_sequences (list) – the supervised training data, a set of labelled sequences of observations
  • unlabeled_sequences (list) – the unsupervised training data, a set of sequences of observations
  • kwargs – additional arguments to pass to the training methods
train_supervised(labelled_sequences, estimator=None)[source]

Supervised training maximising the joint probability of the symbol and state sequences. This is done via collecting frequencies of transitions between states, symbol observations while within each state and which states start a sentence. These frequency distributions are then normalised into probability estimates, which can be smoothed if desired.

Returns:

the trained model

Return type:

HiddenMarkovModelTagger

Parameters:
  • labelled_sequences (list) – the training data, a set of labelled sequences of observations
  • estimator – a function taking a FreqDist and a number of bins and returning a CProbDistI; otherwise an MLE estimate is used
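
The counting and normalising steps of supervised training can be sketched without nltk. This is an unsmoothed maximum-likelihood illustration only: mle_train() and normalise() are hypothetical names, and the three training sentences are invented.

```python
from collections import Counter, defaultdict


def normalise(counter):
    # Turn raw counts into relative frequencies (the MLE estimate).
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def mle_train(labelled_sequences):
    starts = Counter()               # which states start a sentence
    trans = defaultdict(Counter)     # transitions between states
    emits = defaultdict(Counter)     # symbols observed while in each state
    for sequence in labelled_sequences:
        previous = None
        for word, tag in sequence:
            if previous is None:
                starts[tag] += 1
            else:
                trans[previous][tag] += 1
            emits[tag][word] += 1
            previous = tag
    priors = normalise(starts)
    transitions = {state: normalise(c) for state, c in trans.items()}
    outputs = {state: normalise(c) for state, c in emits.items()}
    return priors, transitions, outputs


data = [
    [("the", "DT"), ("dog", "NN")],
    [("the", "DT"), ("cat", "NN")],
    [("dogs", "NN"), ("bark", "VB")],
]
priors, transitions, outputs = mle_train(data)
```

Unseen events get probability zero under this estimate, which is why a smoothing estimator can be supplied instead.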
train_unsupervised(unlabeled_sequences, update_outputs=True, **kwargs)[source]

Trains the HMM using the Baum-Welch algorithm to maximise the probability of the data sequence. This is a variant of the EM algorithm, and is unsupervised in that it doesn’t need the state sequences for the symbols. The code is based on ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’, Lawrence Rabiner, IEEE, 1989.

Returns:the trained model
Return type:HiddenMarkovModelTagger
Parameters:unlabeled_sequences (list) – the training data, a set of sequences of observations

kwargs may include following parameters:

Parameters:
  • model – a HiddenMarkovModelTagger instance used to begin the Baum-Welch algorithm
  • max_iterations – the maximum number of EM iterations
  • convergence_logprob – the maximum change in log probability to allow convergence
nltk.tag.hmm.demo()[source]
nltk.tag.hmm.demo_bw()[source]
nltk.tag.hmm.demo_pos()[source]
nltk.tag.hmm.demo_pos_bw(test=10, supervised=20, unsupervised=10, verbose=True, max_iterations=5)[source]
nltk.tag.hmm.load_pos(num_sents)[source]
nltk.tag.hmm.logsumexp2(arr)[source]

nltk.tag.hunpos module

A module for interfacing with the HunPos open-source POS-tagger.

class nltk.tag.hunpos.HunposTagger(path_to_model, path_to_bin=None, encoding='ISO-8859-1', verbose=False)[source]

Bases: nltk.tag.api.TaggerI

A class for POS tagging with HunPos. The input is the paths to:
  • a model trained on training data
  • (optionally) the path to the hunpos-tag binary
  • (optionally) the encoding of the training data (default: ISO-8859-1)

Example:

>>> from nltk.tag.hunpos import HunposTagger
>>> ht = HunposTagger('english.model')
>>> ht.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'NN'), ('swallow', 'VB'), ('?', '.')]
>>> ht.close()

This class communicates with the hunpos-tag binary via pipes. When the tagger object is no longer needed, the close() method should be called to free system resources. The class supports the context manager interface; if used in a with statement, the close() method is invoked automatically:

>>> with HunposTagger('english.model') as ht:
...     ht.tag('What is the airspeed of an unladen swallow ?'.split())
...
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'NN'), ('swallow', 'VB'), ('?', '.')]
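The context manager behaviour follows Python's standard protocol. A minimal sketch, using a hypothetical ClosingTagger class that stands in for any tagger holding an external resource (it does not start the hunpos-tag binary):

```python
class ClosingTagger:
    """Hypothetical stand-in for a tagger that owns an external resource."""

    def __init__(self):
        self.closed = False

    def tag(self, tokens):
        if self.closed:
            raise ValueError("tagger is closed")
        # A real tagger would talk to the external process here.
        return [(token, None) for token in tokens]

    def close(self):
        # Release the resource (for HunPos, the pipe to the executable).
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised in the with block


with ClosingTagger() as tagger:
    result = tagger.tag(["unladen", "swallow"])
```

After the with block ends, close() has run even if tagging raised an exception.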
close()[source]

Closes the pipe to the hunpos executable.

tag(tokens)[source]

Tags a single sentence: a list of words. The tokens should not contain any newline characters.

nltk.tag.hunpos.setup_module(module)[source]

nltk.tag.mapping module

Interface for converting POS tags from various treebanks to the universal tagset of Petrov, Das, & McDonald.

The tagset consists of the following 12 coarse tags:

  • VERB - verbs (all tenses and modes)
  • NOUN - nouns (common and proper)
  • PRON - pronouns
  • ADJ - adjectives
  • ADV - adverbs
  • ADP - adpositions (prepositions and postpositions)
  • CONJ - conjunctions
  • DET - determiners
  • NUM - cardinal numbers
  • PRT - particles or other function words
  • X - other: foreign words, typos, abbreviations
  • . - punctuation

See http://arxiv.org/abs/1104.2086 and http://code.google.com/p/universal-pos-tags/

nltk.tag.mapping.map_tag(source, target, source_tag)[source]

Maps the tag from the source tagset to the target tagset.

>>> map_tag('en-ptb', 'universal', 'VBZ')
'VERB'
>>> map_tag('en-ptb', 'universal', 'VBP')
'VERB'
>>> map_tag('en-ptb', 'universal', '``')
'.'
nltk.tag.mapping.tagset_mapping(source, target)[source]

Retrieve the mapping dictionary between tagsets.

>>> tagset_mapping('ru-rnc', 'universal') == {'!': '.', 'A': 'ADJ', 'C': 'CONJ', 'AD': 'ADV',            'NN': 'NOUN', 'VG': 'VERB', 'COMP': 'CONJ', 'NC': 'NUM', 'VP': 'VERB', 'P': 'ADP',            'IJ': 'X', 'V': 'VERB', 'Z': 'X', 'VI': 'VERB', 'YES_NO_SENT': 'X', 'PTCL': 'PRT'}
True
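Conceptually, mapping a tag amounts to a dictionary lookup. The sketch below uses only the three en-ptb entries shown in the examples above; the helper names and the fallback to 'X' for unknown tags are assumptions of the sketch, not documented behaviour of map_tag.

```python
# A few en-ptb -> universal entries taken from the examples above;
# the mapping tables shipped with NLTK are much larger.
PTB_TO_UNIVERSAL = {"VBZ": "VERB", "VBP": "VERB", "``": "."}


def map_tag_sketch(mapping, source_tag, default="X"):
    """Look up source_tag; the fallback to 'X' is an assumption."""
    return mapping.get(source_tag, default)


def map_tagged_sentence(mapping, tagged_sentence):
    """Convert every (word, tag) pair of a sentence to the coarse tagset."""
    return [(word, map_tag_sketch(mapping, tag)) for word, tag in tagged_sentence]


coarse = map_tagged_sentence(PTB_TO_UNIVERSAL, [("is", "VBZ"), ("``", "``")])
```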

nltk.tag.senna module

A module for interfacing with the SENNA pipeline.

class nltk.tag.senna.CHKTagger(path, encoding='utf-8')[source]

Bases: nltk.tag.senna.SennaTagger

A chunker.

The input is:
  • path to the directory that contains SENNA executables.
  • (optionally) the encoding of the input data (default: utf-8)

Example:

>>> from nltk.tag.senna import CHKTagger
>>> chktagger = CHKTagger('/usr/share/senna-v2.0')
>>> chktagger.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', u'B-NP'), ('is', u'B-VP'), ('the', u'B-NP'), ('airspeed', u'I-NP'),
('of', u'B-PP'), ('an', u'B-NP'), ('unladen', u'I-NP'), ('swallow',u'I-NP'),
('?', u'O')]
tag_sents(sentences)[source]

Applies the tag method over a list of sentences. This method will return for each sentence a list of tuples of (word, tag).

exception nltk.tag.senna.Error[source]

Bases: builtins.Exception

Basic error handling class to be extended by the module specific exceptions

exception nltk.tag.senna.ExecutableNotFound[source]

Bases: nltk.tag.senna.Error

Raised if the senna executable does not exist

class nltk.tag.senna.NERTagger(path, encoding='utf-8')[source]

Bases: nltk.tag.senna.SennaTagger

A named entity extractor.

The input is:
  • path to the directory that contains SENNA executables.
  • (optionally) the encoding of the input data (default: utf-8)

Example:

>>> from nltk.tag.senna import NERTagger
>>> nertagger = NERTagger('/usr/share/senna-v2.0')
>>> nertagger.tag('Shakespeare theatre was in London .'.split())
[('Shakespeare', u'B-PER'), ('theatre', u'O'), ('was', u'O'), ('in', u'O'),
('London', u'B-LOC'), ('.', u'O')]
>>> nertagger.tag('UN headquarters are in NY , USA .'.split())
[('UN', u'B-ORG'), ('headquarters', u'O'), ('are', u'O'), ('in', u'O'),
('NY', u'B-LOC'), (',', u'O'), ('USA', u'B-LOC'), ('.', u'O')]
tag_sents(sentences)[source]

Applies the tag method over a list of sentences. This method will return for each sentence a list of tuples of (word, tag).

class nltk.tag.senna.POSTagger(path, encoding='utf-8')[source]

Bases: nltk.tag.senna.SennaTagger

A Part of Speech tagger.

The input is:
  • path to the directory that contains SENNA executables.
  • (optionally) the encoding of the input data (default: utf-8)

Example:

>>> from nltk.tag.senna import POSTagger
>>> postagger = POSTagger('/usr/share/senna-v2.0')
>>> postagger.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'),
('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
tag_sents(sentences)[source]

Applies the tag method over a list of sentences. This method will return for each sentence a list of tuples of (word, tag).

exception nltk.tag.senna.RunFailure[source]

Bases: nltk.tag.senna.Error

Raised if the pipeline fails to execute

class nltk.tag.senna.SennaTagger(senna_path, operations, encoding='utf-8')[source]

Bases: nltk.tag.api.TaggerI

A general interface of the SENNA pipeline that supports any of the operations specified in SUPPORTED_OPERATIONS.

Applying multiple operations at once has a speed advantage. For example, SENNA v2.0 computes the POS tags anyway when it extracts named entities, so applying both operations costs only the time needed to extract the named entities.

The SENNA pipeline has a fixed maximum size for the sentences that it can read. By default it is 1024 tokens per sentence. If you have longer sentences, consider changing the MAX_SENTENCE_SIZE value in SENNA_main.c and rebuilding your system-specific binary. Otherwise, misalignment errors may occur.
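One way to avoid misalignment is to check sentence lengths before handing data to the pipeline. The helper below is a hypothetical pre-check, not part of nltk.tag.senna; it assumes that a sentence of exactly the limit is acceptable, which is not stated above.

```python
MAX_SENTENCE_SIZE = 1024  # default limit quoted above; adjust if SENNA was rebuilt


def oversized_sentences(sentences, limit=MAX_SENTENCE_SIZE):
    """Return the indices of sentences with more than `limit` tokens."""
    return [i for i, sent in enumerate(sentences) if len(sent) > limit]


sentences = [["a"] * 3, ["b"] * 1025, ["c"] * 1024]
too_long = oversized_sentences(sentences)
```

Sentences flagged this way could be split or skipped before tagging.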

The input is:
  • path to the directory that contains SENNA executables.
  • list of the operations to be performed.
  • (optionally) the encoding of the input data (default: utf-8)

Example:

>>> from nltk.tag.senna import SennaTagger
>>> pipeline = SennaTagger('/usr/share/senna-v2.0', ['pos', 'chk', 'ner'])
>>> sent = u'Düsseldorf is an international business center'.split()
>>> pipeline.tag(sent)
[{'word': u'D\xfcsseldorf', 'chk': u'B-NP', 'ner': u'B-PER', 'pos': u'NNP'},
{'word': u'is', 'chk': u'B-VP', 'ner': u'O', 'pos': u'VBZ'},
{'word': u'an', 'chk': u'B-NP', 'ner': u'O', 'pos': u'DT'},
{'word': u'international', 'chk': u'I-NP', 'ner': u'O', 'pos': u'JJ'},
{'word': u'business', 'chk': u'I-NP', 'ner': u'O', 'pos': u'NN'},
{'word': u'center', 'chk': u'I-NP', 'ner': u'O','pos': u'NN'}]
SUPPORTED_OPERATIONS = ['pos', 'chk', 'ner']
executable[source]

A property that determines the system-specific binary that should be used in the pipeline. If the system is not known, the senna binary is used.

tag(tokens)[source]

Applies the specified operation(s) on a list of tokens.

tag_sents(sentences)[source]

Applies the tag method over a list of sentences. This method will return a list of dictionaries. Every dictionary will contain a word with its calculated annotations/tags.

exception nltk.tag.senna.SentenceMisalignment[source]

Bases: nltk.tag.senna.Error

Raised if the new sentence is shorter than the original one or the number of sentences in the result is less than the input.

nltk.tag.senna.setup_module(module)[source]

nltk.tag.sequential module

Classes for tagging sentences sequentially, left to right. The abstract base class SequentialBackoffTagger serves as the base class for all the taggers in this module. Tagging of individual words is performed by the method choose_tag(), which is defined by subclasses of SequentialBackoffTagger. If a tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted instead. Any SequentialBackoffTagger may serve as a backoff tagger for any other SequentialBackoffTagger.
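The backoff mechanism can be illustrated with a small pure-Python sketch. The class names below are hypothetical and the code is not NLTK's implementation; it only shows how choose_tag() and a backoff chain fit together.

```python
class BackoffTaggerSketch:
    """Minimal illustration of sequential backoff tagging (not NLTK code)."""

    def __init__(self, backoff=None):
        self.backoff = backoff

    def choose_tag(self, tokens, index, history):
        return None  # subclasses override this

    def tag_one(self, tokens, index, history):
        # Try this tagger first, then walk down the backoff chain.
        tagger = self
        while tagger is not None:
            tag = tagger.choose_tag(tokens, index, history)
            if tag is not None:
                return tag
            tagger = tagger.backoff
        return None

    def tag(self, tokens):
        history = []  # tags assigned so far, left to right
        for index in range(len(tokens)):
            history.append(self.tag_one(tokens, index, history))
        return list(zip(tokens, history))


class LookupSketch(BackoffTaggerSketch):
    """Tags words found in a table; returns None for anything else."""

    def __init__(self, table, backoff=None):
        super().__init__(backoff)
        self.table = table

    def choose_tag(self, tokens, index, history):
        return self.table.get(tokens[index])


class ConstantSketch(BackoffTaggerSketch):
    """Assigns the same tag to every token."""

    def __init__(self, tag):
        super().__init__()
        self.constant = tag

    def choose_tag(self, tokens, index, history):
        return self.constant


tagger = LookupSketch({"the": "AT", "dog": "NN"}, backoff=ConstantSketch("UNK"))
tagged = tagger.tag(["the", "dog", "barks"])
```

Here "barks" is missing from the lookup table, so its tag comes from the backoff tagger.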

class nltk.tag.sequential.AffixTagger(train=None, model=None, affix_length=-3, min_stem_length=2, backoff=None, cutoff=0, verbose=False)[source]

Bases: nltk.tag.sequential.ContextTagger

A tagger that chooses a token’s tag based on a leading or trailing substring of its word string. (It is important to note that these substrings are not necessarily “true” morphological affixes). In particular, a fixed-length substring of the word is looked up in a table, and the corresponding tag is returned. Affix taggers are typically constructed by training them on a tagged corpus.

Construct a new affix tagger.

Parameters:
  • affix_length – The length of the affixes that should be considered during training and tagging. Use negative numbers for suffixes.
  • min_stem_length – Any words whose length is less than min_stem_length+abs(affix_length) will be assigned a tag of None by this tagger.
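The effect of the two parameters can be sketched as a small function that computes the lookup key for a word. The function name affix_context is hypothetical; the sketch only mirrors the rules stated above.

```python
def affix_context(word, affix_length=-3, min_stem_length=2):
    """Return the affix used as the lookup key, or None for short words."""
    if len(word) < min_stem_length + abs(affix_length):
        return None  # too short: this tagger would not assign a tag
    if affix_length > 0:
        return word[:affix_length]  # positive length: leading substring
    return word[affix_length:]      # negative length: trailing substring


suffix = affix_context("running")
```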
context(tokens, index, history)[source]
classmethod decode_json_obj(obj)[source]
encode_json_obj()[source]
json_tag = 'nltk.tag.sequential.AffixTagger'
class nltk.tag.sequential.BigramTagger(train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]

Bases: nltk.tag.sequential.NgramTagger

A tagger that chooses a token’s tag based on its word string and on the preceding word’s tag. In particular, a tuple consisting of the previous tag and the word is looked up in a table, and the corresponding tag is returned.

Parameters:
  • train (list(list(tuple(str, str)))) – The corpus of training data, a list of tagged sentences
  • model (dict) – The tagger model
  • backoff (TaggerI) – Another tagger which this tagger will consult when it is unable to tag a word
  • cutoff (int) – The number of instances of training data the tagger must see in order not to use the backoff tagger
classmethod decode_json_obj(obj)[source]
encode_json_obj()[source]
json_tag = 'nltk.tag.sequential.BigramTagger'
class nltk.tag.sequential.ClassifierBasedPOSTagger(feature_detector=None, train=None, classifier_builder=NaiveBayesClassifier.train, classifier=None, backoff=None, cutoff_prob=None, verbose=False)[source]

Bases: nltk.tag.sequential.ClassifierBasedTagger

A classifier-based part-of-speech tagger.

feature_detector(tokens, index, history)[source]
class nltk.tag.sequential.ClassifierBasedTagger(feature_detector=None, train=None, classifier_builder=NaiveBayesClassifier.train, classifier=None, backoff=None, cutoff_prob=None, verbose=False)[source]

Bases: nltk.tag.sequential.SequentialBackoffTagger, nltk.tag.api.FeaturesetTaggerI

A sequential tagger that uses a classifier to choose the tag for each token in a sentence. The featureset input for the classifier is generated by a feature detector function:

feature_detector(tokens, index, history) -> featureset

Here tokens is the list of unlabeled tokens in the sentence; index is the index of the token for which feature detection should be performed; and history is a list of the tags for all tokens before index.

Construct a new classifier-based sequential tagger.

Parameters:
  • feature_detector – A function used to generate the featureset input for the classifier: feature_detector(tokens, index, history) -> featureset
  • train – A tagged corpus consisting of a list of tagged sentences, where each sentence is a list of (word, tag) tuples.
  • backoff – A backoff tagger, to be used by the new tagger if it encounters an unknown context.
  • classifier_builder – A function used to train a new classifier based on the data in train. It should take one argument, a list of labeled featuresets (i.e., (featureset, label) tuples).
  • classifier – The classifier that should be used by the tagger. This is only useful if you want to manually construct the classifier; normally, you would use train instead.
  • cutoff_prob – If specified, then this tagger will fall back on its backoff tagger if the probability of the most likely tag is less than cutoff_prob.
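A feature detector with the signature above might look like the following sketch. The function name, the feature names and the '<START>' placeholder are all hypothetical choices made for illustration.

```python
def suffix_feature_detector(tokens, index, history):
    """Hypothetical feature detector returning a small featureset dict."""
    word = tokens[index]
    return {
        "word.lower": word.lower(),
        "suffix3": word[-3:],
        "is_capitalised": word[:1].isupper(),
        # history holds the tags already chosen for tokens before index.
        "prev_tag": history[index - 1] if index > 0 else "<START>",
    }


features = suffix_feature_detector(["The", "dog", "barked"], 2, ["AT", "NN"])
```

A function like this would be passed as the feature_detector argument; the classifier then sees one featureset per token.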
choose_tag(tokens, index, history)[source]
classifier()[source]

Return the classifier that this tagger uses to choose a tag for each word in a sentence. The input for this classifier is generated using this tagger’s feature detector. See feature_detector()

feature_detector(tokens, index, history)[source]

Return the feature detector that this tagger uses to generate featuresets for its classifier. The feature detector is a function with the signature:

feature_detector(tokens, index, history) -> featureset

See classifier()

unicode_repr()
class nltk.tag.sequential.ContextTagger(context_to_tag, backoff=None)[source]

Bases: nltk.tag.sequential.SequentialBackoffTagger

An abstract base class for sequential backoff taggers that choose a tag for a token based on the value of its “context”. Different subclasses are used to define different contexts.

A ContextTagger chooses the tag for a token by calculating the token’s context, and looking up the corresponding tag in a table. This table can be constructed manually; or it can be automatically constructed based on a training corpus, using the _train() factory method.

Variables:_context_to_tag – Dictionary mapping contexts to tags.
choose_tag(tokens, index, history)[source]
context(tokens, index, history)[source]
Returns:the context that should be used to look up the tag for the specified token; or None if the specified token should not be handled by this tagger.
Return type:(hashable)
size()[source]
Returns:The number of entries in the table used by this tagger to map from contexts to tags.
unicode_repr()
class nltk.tag.sequential.DefaultTagger(tag)[source]

Bases: nltk.tag.sequential.SequentialBackoffTagger

A tagger that assigns the same tag to every token.

>>> from nltk.tag.sequential import DefaultTagger
>>> default_tagger = DefaultTagger('NN')
>>> list(default_tagger.tag('This is a test'.split()))
[('This', 'NN'), ('is', 'NN'), ('a', 'NN'), ('test', 'NN')]

This tagger is recommended as a backoff tagger, in cases where a more powerful tagger is unable to assign a tag to the word (e.g. because the word was not seen during training).

Parameters:tag (str) – The tag to assign to each token
choose_tag(tokens, index, history)[source]
classmethod decode_json_obj(obj)[source]
encode_json_obj()[source]
json_tag = 'nltk.tag.sequential.DefaultTagger'
unicode_repr()
class nltk.tag.sequential.NgramTagger(n, train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]

Bases: nltk.tag.sequential.ContextTagger

A tagger that chooses a token’s tag based on its word string and on the preceding n-1 words’ tags. In particular, a tuple (tags[i-n+1:i], words[i]) is looked up in a table, and the corresponding tag is returned. N-gram taggers are typically trained on a tagged corpus.

Train a new NgramTagger using the given training data or the supplied model. In particular, construct a new tagger whose table maps from each context (tags[i-n+1:i], words[i]) to the most frequent tag for that context, excluding any contexts that are already tagged perfectly by the backoff tagger.

Parameters:
  • train – A tagged corpus consisting of a list of tagged sentences, where each sentence is a list of (word, tag) tuples.
  • backoff – A backoff tagger, to be used by the new tagger if it encounters an unknown context.
  • cutoff – If the most likely tag for a context occurs fewer than cutoff times, then exclude it from the context-to-tag table for the new tagger.
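The construction of the context-to-tag table can be sketched as follows. This is an illustration only: it assumes the context is the preceding n-1 tags plus the current word (which matches the BigramTagger and TrigramTagger descriptions), it applies the cutoff as worded above, and it does not implement the exclusion of contexts already handled by a backoff tagger. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def ngram_context(tokens, index, history, n):
    """Context key: the previous n-1 tags plus the current word."""
    previous_tags = tuple(history[max(0, index - n + 1):index])
    return previous_tags, tokens[index]


def train_context_table(tagged_sentences, n, cutoff=0):
    """Map each context to its most frequent tag (illustrative only)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        tokens = [word for word, _ in sentence]
        tags = [tag for _, tag in sentence]
        for index, tag in enumerate(tags):
            context = ngram_context(tokens, index, tags[:index], n)
            counts[context][tag] += 1

    table = {}
    for context, tag_counts in counts.items():
        best_tag, hits = tag_counts.most_common(1)[0]
        # Following the wording above: drop contexts whose best tag
        # was seen fewer than `cutoff` times.
        if hits >= cutoff:
            table[context] = best_tag
    return table
```

With n=1 the tag part of the context is always empty, so the table degenerates to a per-word lookup.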
context(tokens, index, history)[source]
classmethod decode_json_obj(obj)[source]
encode_json_obj()[source]
json_tag = 'nltk.tag.sequential.NgramTagger'
class nltk.tag.sequential.RegexpTagger(regexps, backoff=None)[source]

Bases: nltk.tag.sequential.SequentialBackoffTagger

Regular Expression Tagger

The RegexpTagger assigns tags to tokens by comparing their word strings to a series of regular expressions. The following tagger uses word suffixes to make guesses about the correct Brown Corpus part of speech tag:

>>> from nltk.corpus import brown
>>> from nltk.tag.sequential import RegexpTagger
>>> test_sent = brown.sents(categories='news')[0]
>>> regexp_tagger = RegexpTagger(
...     [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
...      (r'(The|the|A|a|An|an)$', 'AT'),   # articles
...      (r'.*able$', 'JJ'),                # adjectives
...      (r'.*ness$', 'NN'),                # nouns formed from adjectives
...      (r'.*ly$', 'RB'),                  # adverbs
...      (r'.*s$', 'NNS'),                  # plural nouns
...      (r'.*ing$', 'VBG'),                # gerunds
...      (r'.*ed$', 'VBD'),                 # past tense verbs
...      (r'.*', 'NN')                      # nouns (default)
... ])
>>> regexp_tagger
<Regexp Tagger: size=9>
>>> regexp_tagger.tag(test_sent)
[('The', 'AT'), ('Fulton', 'NN'), ('County', 'NN'), ('Grand', 'NN'), ('Jury', 'NN'),
('said', 'NN'), ('Friday', 'NN'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'NN'),
("Atlanta's", 'NNS'), ('recent', 'NN'), ('primary', 'NN'), ('election', 'NN'),
('produced', 'VBD'), ('``', 'NN'), ('no', 'NN'), ('evidence', 'NN'), ("''", 'NN'),
('that', 'NN'), ('any', 'NN'), ('irregularities', 'NNS'), ('took', 'NN'),
('place', 'NN'), ('.', 'NN')]
Parameters:regexps (list(tuple(str, str))) – A list of (regexp, tag) pairs, each of which indicates that a word matching regexp should be tagged with tag. The pairs are evaluated in order. If none of the regexps matches a word, the optional backoff tagger is invoked; if there is no backoff tagger, the word is assigned the tag None.
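Because the pairs are tried in order, the first matching pattern wins. A minimal pure-Python sketch of that behaviour, assuming patterns are matched from the start of the word with re.match (the function name and pattern list are hypothetical):

```python
import re

PATTERNS = [
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*ing$", "VBG"),                # gerunds
    (r".*s$", "NNS"),                  # plural nouns
]


def regexp_choose_tag(word, patterns=PATTERNS):
    """Return the tag of the first pattern that matches, else None."""
    for pattern, tag in patterns:
        if re.match(pattern, word):
            return tag
    return None


number_tag = regexp_choose_tag("3.14")
```

Reordering the list changes the result for words that match more than one pattern, so more specific patterns should come first.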
choose_tag(tokens, index, history)[source]
classmethod decode_json_obj(obj)[source]
encode_json_obj()[source]
json_tag = 'nltk.tag.sequential.RegexpTagger'
unicode_repr()
class nltk.tag.sequential.SequentialBackoffTagger(backoff=None)[source]

Bases: nltk.tag.api.TaggerI

An abstract base class for taggers that tag words sequentially, left to right. Tagging of individual words is performed by the choose_tag() method, which should be defined by subclasses. If a tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.

Variables:_taggers – A list of all the taggers that should be tried to tag a token (i.e., self and its backoff taggers).
backoff[source]

The backoff tagger for this tagger.

choose_tag(tokens, index, history)[source]

Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None – do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.

Return type:

str

Parameters:
  • tokens (list) – The list of words that are being tagged.
  • index (int) – The index of the word whose tag should be returned.
  • history (list(str)) – A list of the tags for all words before index.
tag(tokens)[source]
tag_one(tokens, index, history)[source]

Determine an appropriate tag for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.

Return type:

str

Parameters:
  • tokens (list) – The list of words that are being tagged.
  • index (int) – The index of the word whose tag should be returned.
  • history (list(str)) – A list of the tags for all words before index.
class nltk.tag.sequential.TrigramTagger(train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]

Bases: nltk.tag.sequential.NgramTagger

A tagger that chooses a token’s tag based on its word string and on the preceding two words’ tags. In particular, a tuple consisting of the previous two tags and the word is looked up in a table, and the corresponding tag is returned.

Parameters:
  • train (list(list(tuple(str, str)))) – The corpus of training data, a list of tagged sentences
  • model (dict) – The tagger model
  • backoff (TaggerI) – Another tagger which this tagger will consult when it is unable to tag a word
  • cutoff (int) – The number of instances of training data the tagger must see in order not to use the backoff tagger
classmethod decode_json_obj(obj)[source]
encode_json_obj()[source]
json_tag = 'nltk.tag.sequential.TrigramTagger'
class nltk.tag.sequential.UnigramTagger(train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]

Bases: nltk.tag.sequential.NgramTagger

Unigram Tagger

The UnigramTagger finds the most likely tag for each word in a training corpus, and then uses that information to assign tags to new tokens.

>>> from nltk.corpus import brown
>>> from nltk.tag.sequential import UnigramTagger
>>> test_sent = brown.sents(categories='news')[0]
>>> unigram_tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> for tok, tag in unigram_tagger.tag(test_sent):
...     print("(%s, %s), " % (tok, tag))
(The, AT), (Fulton, NP-TL), (County, NN-TL), (Grand, JJ-TL),
(Jury, NN-TL), (said, VBD), (Friday, NR), (an, AT),
(investigation, NN), (of, IN), (Atlanta's, NP$), (recent, JJ),
(primary, NN), (election, NN), (produced, VBD), (``, ``),
(no, AT), (evidence, NN), ('', ''), (that, CS), (any, DTI),
(irregularities, NNS), (took, VBD), (place, NN), (., .),
Parameters:
  • train (list(list(tuple(str, str)))) – The corpus of training data, a list of tagged sentences
  • model (dict) – The tagger model
  • backoff (TaggerI) – Another tagger which this tagger will consult when it is unable to tag a word
  • cutoff (int) – The number of instances of training data the tagger must see in order not to use the backoff tagger
context(tokens, index, history)[source]
classmethod decode_json_obj(obj)[source]
encode_json_obj()[source]
json_tag = 'nltk.tag.sequential.UnigramTagger'

nltk.tag.stanford module

A module for interfacing with the Stanford taggers.

class nltk.tag.stanford.NERTagger(*args, **kwargs)[source]

Bases: nltk.tag.stanford.StanfordTagger

A class for NER tagging with Stanford Tagger. The input is the paths to:

  • a model trained on training data
  • (optionally) the path to the stanford tagger jar file. If not specified here, then this jar file must be specified in the CLASSPATH environment variable.
  • (optionally) the encoding of the training data (default: ASCII)

Example:

>>> from nltk.tag.stanford import NERTagger
>>> st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
...                '/usr/share/stanford-ner/stanford-ner.jar') 
>>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split()) 
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
 ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
 ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
parse_output(text)[source]
class nltk.tag.stanford.POSTagger(*args, **kwargs)[source]

Bases: nltk.tag.stanford.StanfordTagger

A class for POS tagging with Stanford Tagger. The input is the paths to:
  • a model trained on training data
  • (optionally) the path to the stanford tagger jar file. If not specified here, then this jar file must be specified in the CLASSPATH environment variable.
  • (optionally) the encoding of the training data (default: ASCII)

Example:

>>> from nltk.tag.stanford import POSTagger
>>> st = POSTagger('/usr/share/stanford-postagger/models/english-bidirectional-distsim.tagger',
...                '/usr/share/stanford-postagger/stanford-postagger.jar') 
>>> st.tag('What is the airspeed of an unladen swallow ?'.split()) 
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
class nltk.tag.stanford.StanfordTagger(path_to_model, path_to_jar=None, encoding='ascii', verbose=False, java_options='-mx1000m')[source]

Bases: nltk.tag.api.TaggerI

An interface to Stanford taggers. Subclasses must define:

  • _cmd property: A property that returns the command that will be executed.
  • _SEPARATOR: Class constant that represents the character that is used to separate the tokens from their tags.
  • _JAR file: Class constant that represents the jar file name.
parse_output(text)[source]
tag(tokens)[source]
tag_sents(sentences)[source]

nltk.tag.tnt module

Implementation of ‘TnT - A Statistical Part of Speech Tagger’ by Thorsten Brants

http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf

class nltk.tag.tnt.TnT(unk=None, Trained=False, N=1000, C=False)[source]

Bases: nltk.tag.api.TaggerI

TnT - Statistical POS tagger

IMPORTANT NOTES:

  • DOES NOT AUTOMATICALLY DEAL WITH UNSEEN WORDS
    • It is possible to provide an untrained POS tagger to create tags for unknown words, see __init__ function
  • SHOULD BE USED WITH SENTENCE-DELIMITED INPUT
    • Due to the nature of this tagger, it works best when trained over sentence delimited input.
    • However it still produces good results if the training data and testing data are separated on all punctuation eg: [,.?!]
    • Input for training is expected to be a list of sentences where each sentence is a list of (word, tag) tuples
    • Input for tag function is a single sentence
    • Input for tagdata function is a list of sentences
    • Output is of a similar form
  • Function provided to process text that is unsegmented
    • Please see basic_sent_chop()

TnT uses a second order Markov model to produce tags for a sequence of input, specifically:

argmax [Prod_i(P(t_i|t_i-1,t_i-2)P(w_i|t_i))] P(t_T+1 | t_T)

i.e. the tag sequence that maximises the product of these probabilities.

The set of possible tags for a given word is derived from the training data. It is the set of all tags that exact word has been assigned.

For speed and numerical precision, we can add log probabilities instead of multiplying probabilities, specifically:

argmax [Sigma(log(P(t_i|t_i-1,t_i-2))+log(P(w_i|t_i)))] +
log(P(t_T+1|t_T))

The probability of a tag for a given word is the linear interpolation of three Markov models: a zero-order, a first-order, and a second-order model.

P(t_i| t_i-1, t_i-2) = l1*P(t_i) + l2*P(t_i| t_i-1) +
l3*P(t_i| t_i-1, t_i-2)
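The interpolation formula and the log-space scoring above can be sketched numerically. The function names and the example weights are hypothetical; how TnT actually estimates l1, l2 and l3 is not shown here.

```python
import math


def interpolated_prob(p_uni, p_bi, p_tri, lambdas):
    """Linear interpolation l1*P(t) + l2*P(t|t-1) + l3*P(t|t-1,t-2)."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


def sequence_log_score(steps):
    """Sum log transition and log emission probabilities over a sequence.

    `steps` is a list of (transition_prob, emission_prob) pairs, all of
    which must be greater than zero because math.log(0) is undefined.
    """
    return sum(math.log(t) + math.log(e) for t, e in steps)


# Example weights chosen for illustration; they sum to 1.
p = interpolated_prob(0.1, 0.2, 0.4, (0.2, 0.3, 0.5))
score = sequence_log_score([(0.5, 0.2), (0.25, 0.4)])
```

Exponentiating the summed log score recovers the product of the individual probabilities, which is why the two formulations pick the same best sequence.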

A beam search is used to limit the memory usage of the algorithm. The degree of the beam can be changed using N in the initialization. N represents the maximum number of possible solutions to maintain while tagging.

It is possible to differentiate the tags which are assigned to capitalized words. However this does not result in a significant gain in the accuracy of the results.

tag(data)[source]

Tags a single sentence

Parameters:data ([string,]) – list of words
Returns:[(word, tag),]

Calls recursive function ‘_tagword’ to produce a list of tags

Associates the sequence of returned tags with the correct words in the input sequence

returns a list of (word, tag) tuples

tagdata(data)[source]

Tags each sentence in a list of sentences

Parameters:data ([[string,],]) – list of list of words
Returns:list of list of (word, tag) tuples

Invokes the tag(sent) function for each sentence and compiles the results into a list of tagged sentences. Each tagged sentence is a list of (word, tag) tuples.

train(data)[source]

Uses a set of tagged data to train the tagger. If an unknown word tagger is specified, it is trained on the same data.

Parameters:data (list(list(tuple(str, str)))) – List of lists of (word, tag) tuples
nltk.tag.tnt.basic_sent_chop(data, raw=True)[source]

Basic method for tokenizing input into sentences for this tagger:

Parameters:
  • data (str or tuple(str, str)) – list of tokens (words or (word, tag) tuples)
  • raw (bool) – boolean flag marking the input data as a list of words or a list of tagged words
Returns:

a list of sentences; each sentence is a list of tokens, and tokens have the same form as in the input

The function takes a list of tokens and separates them into lists, where each list represents a sentence fragment. It can separate both tagged and raw sequences into basic sentences.

Sentence markers are the set of [,.!?]

This is a simple method which enhances the performance of the TnT tagger. Better sentence tokenization will further enhance the results.
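The chopping behaviour described above can be sketched in plain Python. The function below is a hypothetical illustration, not the library code; in particular, keeping tokens that follow the last marker as a final fragment is a choice of this sketch.

```python
SENTENCE_MARKERS = {",", ".", "!", "?"}


def chop_sketch(data, raw=True):
    """Split a flat token list into fragments ending at sentence markers."""
    sentences = []
    current = []
    for token in data:
        # Tagged input holds (word, tag) tuples; raw input holds words.
        word = token if raw else token[0]
        current.append(token)
        if word in SENTENCE_MARKERS:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)  # trailing fragment (sketch's own choice)
    return sentences


fragments = chop_sketch(["Hello", ",", "world", ".", "Bye"])
```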

nltk.tag.tnt.demo()[source]
nltk.tag.tnt.demo2()[source]
nltk.tag.tnt.demo3()[source]

nltk.tag.util module

nltk.tag.util.str2tuple(s, sep='/')[source]

Given the string representation of a tagged token, return the corresponding tuple representation. The rightmost occurrence of sep in s will be used to divide s into a word string and a tag string. If sep does not occur in s, return (s, None).

>>> from nltk.tag.util import str2tuple
>>> str2tuple('fly/NN')
('fly', 'NN')
Parameters:
  • s (str) – The string representation of a tagged token.
  • sep (str) – The separator string used to separate word strings from tags.
nltk.tag.util.tuple2str(tagged_token, sep='/')[source]

Given the tuple representation of a tagged token, return the corresponding string representation. This representation is formed by concatenating the token’s word string, followed by the separator, followed by the token’s tag. (If the tag is None, then just return the bare word string.)

>>> from nltk.tag.util import tuple2str
>>> tagged_token = ('fly', 'NN')
>>> tuple2str(tagged_token)
'fly/NN'
Parameters:
  • tagged_token (tuple(str, str)) – The tuple representation of a tagged token.
  • sep (str) – The separator string used to separate word strings from tags.
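The two conversions above are inverses of each other for well-formed input. A pure-Python sketch of the described behaviour follows; the function names are hypothetical, and the sketch does not reproduce any extra normalisation the library functions may apply.

```python
def str2tuple_sketch(s, sep="/"):
    """Split on the rightmost separator; no separator gives (s, None)."""
    word, found, tag = s.rpartition(sep)
    if not found:
        return (s, None)
    return (word, tag)


def tuple2str_sketch(tagged_token, sep="/"):
    """Join word and tag with sep; a None tag yields the bare word."""
    word, tag = tagged_token
    if tag is None:
        return word
    return f"{word}{sep}{tag}"


pair = str2tuple_sketch("and/or/CC")
```

Splitting at the rightmost separator matters for words that themselves contain the separator, as in the example.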
nltk.tag.util.untag(tagged_sentence)[source]

Given a tagged sentence, return an untagged version of that sentence. I.e., return a list containing the first element of each tuple in tagged_sentence.

>>> from nltk.tag.util import untag
>>> untag([('John', 'NNP'), ('saw', 'VBD'), ('Mary', 'NNP')])
['John', 'saw', 'Mary']

Module contents

NLTK Taggers

This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.

A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):

>>> tagged_tok = ('fly', 'NN')

An off-the-shelf tagger is available. It uses the Penn Treebank tagset:

>>> from nltk.tag import pos_tag  
>>> from nltk.tokenize import word_tokenize 
>>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) 
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
('.', '.')]

This package defines several taggers, which take a token list (typically a sentence), assign a tag to each token, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:

>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
...     print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None

Note that words that the tagger has not seen during training receive a tag of None.

We evaluate a tagger on data that was not seen during training:

>>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
0.73...
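Conceptually, evaluation strips the tags from the gold data, retags the tokens, and reports the fraction of tags that match. A minimal sketch of that computation, where accuracy_sketch and the tag_function argument are hypothetical stand-ins rather than the library's code:

```python
def accuracy_sketch(tag_function, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold_sentence in gold:
        tokens = [word for word, _ in gold_sentence]  # strip the tags
        predicted = tag_function(tokens)              # retag the tokens
        for (_, gold_tag), (_, predicted_tag) in zip(gold_sentence, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


gold = [[("the", "AT"), ("dog", "NN")], [("barks", "VBZ")]]
score = accuracy_sketch(lambda tokens: [(t, "NN") for t in tokens], gold)
```

In this toy case only one of three tokens is tagged correctly, since the stand-in tagger labels everything 'NN'.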

For more information, please consult chapter 5 of the NLTK Book.

nltk.tag.pos_tag(tokens)[source]

Use NLTK’s currently recommended part of speech tagger to tag the given list of tokens.

>>> from nltk.tag import pos_tag 
>>> from nltk.tokenize import word_tokenize 
>>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) 
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
('.', '.')]
Parameters:tokens (list(str)) – Sequence of tokens to be tagged
Returns:The tagged tokens
Return type:list(tuple(str, str))
nltk.tag.pos_tag_sents(sentences)[source]

Use NLTK’s currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.