nltk.tag.brill module

class nltk.tag.brill.Word[source]

Bases: nltk.tbl.feature.Feature

Feature which examines the text (word) of nearby tokens.

json_tag = 'nltk.tag.brill.Word'
static extract_property(tokens, index)[source]

Returns

The given token’s text.

class nltk.tag.brill.Pos[source]

Bases: nltk.tbl.feature.Feature

Feature which examines the tags of nearby tokens.

json_tag = 'nltk.tag.brill.Pos'
static extract_property(tokens, index)[source]

Returns

The given token’s tag.
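Concretely, both extractors simply index into the token sequence. A minimal pure-Python sketch (not nltk’s actual implementation), assuming tokens are (word, tag) tuples as elsewhere in nltk.tag:

```python
# Illustrative sketch of the two extract_property methods, assuming
# tokens are (word, tag) tuples as used throughout nltk.tag.
def word_property(tokens, index):
    return tokens[index][0]  # the token's text (what Word examines)

def pos_property(tokens, index):
    return tokens[index][1]  # the token's tag (what Pos examines)

sent = [("The", "DT"), ("cat", "NN"), ("sat", "VBD")]
print(word_property(sent, 1))  # cat
print(pos_property(sent, 2))   # VBD
```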

nltk.tag.brill.nltkdemo18()[source]

Return 18 templates, from the original nltk demo, in multi-feature syntax

nltk.tag.brill.nltkdemo18plus()[source]

Return 18 templates, from the original nltk demo, and additionally a few multi-feature ones (the motivation is easy comparison with nltkdemo18)

nltk.tag.brill.fntbl37()[source]

Return 37 templates taken from the POS-tagging task of the fntbl distribution, https://www.cs.jhu.edu/~rflorian/fntbl/ (37 after excluding a handful that do not condition on Pos[0]; fntbl can do that, but the current nltk implementation cannot).

nltk.tag.brill.brill24()[source]

Return the 24 templates of the seminal TBL paper, Brill (1995)

nltk.tag.brill.describe_template_sets()[source]

Print the available template sets in this demo, with a short description of each

class nltk.tag.brill.BrillTagger[source]

Bases: nltk.tag.api.TaggerI

Brill’s transformational rule-based tagger. Brill taggers use an initial tagger (such as tag.DefaultTagger) to assign an initial tag sequence to a text; and then apply an ordered list of transformational rules to correct the tags of individual tokens. These transformation rules are specified by the TagRule interface.

Brill taggers can be created directly, from an initial tagger and a list of transformational rules; but more often, Brill taggers are created by learning rules from a training corpus, using one of the TaggerTrainers available.
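The core idea can be sketched in a few lines of pure Python (a toy illustration of the scheme, not nltk’s API): an initial tagger assigns default tags, then an ordered list of rules patches individual tags in place.

```python
# Toy sketch of Brill-style tagging: default-tag everything, then apply
# transformation rules in their learned order. All rules here are invented.
def initial_tag(words):
    return [(w, "NN") for w in words]  # naive initial tagger: everything is a noun

# Each rule: (from_tag, to_tag, condition on (tokens, index))
rules = [
    ("NN", "DT", lambda toks, i: toks[i][0].lower() == "the"),
    ("NN", "VBD", lambda toks, i: i > 0 and toks[i - 1][1] == "NN"),
]

def brill_tag(words):
    toks = initial_tag(words)
    for from_tag, to_tag, cond in rules:  # ordered: later rules see earlier corrections
        for i, (w, t) in enumerate(toks):
            if t == from_tag and cond(toks, i):
                toks[i] = (w, to_tag)
    return toks

print(brill_tag(["the", "cat", "sat"]))
# [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]
```

Note that rule order matters: the second rule fires on “sat” only because “cat” is still tagged NN when it is considered.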

json_tag = 'nltk.tag.BrillTagger'
__init__(initial_tagger, rules, training_stats=None)[source]
Parameters
  • initial_tagger (TaggerI) – The initial tagger

  • rules (list(TagRule)) – An ordered list of transformation rules that should be used to correct the initial tagging.

  • training_stats (dict) – A dictionary of statistics collected during training, for possible later use

encode_json_obj()[source]
classmethod decode_json_obj(obj)[source]
rules()[source]

Return the ordered list of transformation rules that this tagger has learnt

Returns

the ordered list of transformation rules that correct the initial tagging

Return type

list(TagRule)

train_stats(statistic=None)[source]

Return a named statistic collected during training, or a dictionary of all available statistics if no name given

Parameters

statistic (str) – name of statistic

Returns

some statistic collected during training of this tagger

Return type

any (but usually a number)
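The lookup contract is simple: one statistic by name, or the whole dictionary when no name is given. A sketch with invented statistic names:

```python
# Sketch of the train_stats contract; the keys and values here are
# invented for illustration, not nltk's actual statistic names.
stats = {"initialerrors": 300, "finalerrors": 120, "rulescores": [12, 9, 5]}

def train_stats(statistic=None):
    # no name given -> return every statistic collected during training
    return stats if statistic is None else stats[statistic]

print(train_stats("finalerrors"))  # 120
print(train_stats())               # the full dictionary
```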

tag(tokens)[source]

Determine the most appropriate tag sequence for the given token sequence, and return a corresponding list of tagged tokens. A tagged token is encoded as a tuple (token, tag).

Return type

list(tuple(str, str))

print_template_statistics(test_stats=None, printunused=True)[source]

Print a list of all templates, ranked according to efficiency.

If test_stats is given, the templates are ranked according to their relative contribution (summed over all rules created from a given template, weighted by score) to performance on the test set. Otherwise, statistics collected during training are used instead. There is also an unweighted measure (just counting the rules), but it is less informative, as many low-score rules appear towards the end of training.

Parameters
  • test_stats (dict of str -> any (but usually numbers)) – dictionary of statistics collected during testing

  • printunused (bool) – if True, print a list of all unused templates

Returns

None

Return type

None
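The weighted ranking amounts to grouping the learned rules by their originating template and summing their scores. A sketch with invented template names and scores:

```python
# Sketch of the weighted template ranking used here: sum rule scores per
# template, then sort descending. Template names and scores are invented.
from collections import defaultdict

learned_rules = [  # (template, rule score) pairs collected during training
    ("Pos([-1])", 32), ("Word([0])", 20), ("Pos([-1])", 5), ("Word([0])", 1),
]

totals = defaultdict(int)
for template, score in learned_rules:
    totals[template] += score

ranking = sorted(totals.items(), key=lambda kv: -kv[1])
print(ranking)  # [('Pos([-1])', 37), ('Word([0])', 21)]
```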

batch_tag_incremental(sequences, gold)[source]

Tags by applying each rule to the entire corpus (rather than all rules to a single sequence). The point is to collect statistics on the test set for individual rules.

NOTE: This is inefficient (it builds no index, so it traverses the entire corpus N times for N rules). Usually you do not need statistics for individual rules and should use batch_tag() instead.

Parameters
  • sequences (list of list of strings) – lists of token sequences (sentences, in some applications) to be tagged

  • gold (list of list of strings) – the gold standard

Returns

tuple of (tagged_sequences, ordered list of rule scores (one for each rule))
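The rule-major loop order can be sketched as follows (an illustrative toy, not nltk’s implementation): the outer loop runs over rules, so each rule sweeps the whole corpus and its change count falls out as a per-rule score.

```python
# Sketch of rule-major application: apply each rule to the entire corpus
# (rather than all rules to one sentence), collecting per-rule statistics.
def apply_rules_rule_major(corpus, rules):
    scores = []
    for rule in rules:          # outer loop over rules ...
        changed = 0
        for sent in corpus:     # ... so each rule traverses the whole corpus
            changed += rule(sent)
        scores.append(changed)  # one score per rule, in order
    return corpus, scores

def make_rule(from_tag, to_tag, word):
    """A toy rule: retag `word` from from_tag to to_tag, returning the change count."""
    def rule(sent):
        n = 0
        for i, (w, t) in enumerate(sent):
            if t == from_tag and w == word:
                sent[i] = (w, to_tag)
                n += 1
        return n
    return rule

corpus = [[("the", "NN"), ("cat", "NN")], [("the", "NN"), ("dog", "NN")]]
tagged, scores = apply_rules_rule_major(corpus, [make_rule("NN", "DT", "the")])
print(scores)  # [2]  -- the single rule fired twice across the corpus
```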