nltk.tbl.demo module

nltk.tbl.demo.corpus_size(seqs)[source]
nltk.tbl.demo.demo()[source]

Run a demo with defaults. See source comments for details, or docstrings of any of the more specific demo_* functions.
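
A minimal invocation, assuming NLTK and its sample corpora are installed:

>>> from nltk.tbl import demo as tbl_demo
>>> tbl_demo.demo()  # train and evaluate a Brill tagger with default settings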

nltk.tbl.demo.demo_error_analysis()[source]

Writes a file with context for each erroneously tagged word after tagging the testing data
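
This corresponds to passing error_output to postag() below; for instance (filename illustrative):

>>> from nltk.tbl.demo import postag
>>> postag(error_output="errors.txt")  # write mistagged words, with context, to errors.txt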

nltk.tbl.demo.demo_generated_templates()[source]

Template.expand and Feature.expand are class methods that facilitate generating large numbers of templates. See their documentation for details.

Note: training with 500 templates can easily fill all available memory, even on relatively small corpora
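
A sketch of the expansion pattern this demo follows (argument values are illustrative; see the expand docstrings for exact semantics):

>>> from nltk.tag.brill import Word, Pos
>>> from nltk.tbl import Template
>>> # all Word features with window lengths 1-2 over positions -1..1
>>> wordtpls = Word.expand([-1, 0, 1], [1, 2], excludezero=False)
>>> # all Pos features with window lengths 1-2 over positions -2..1
>>> tagtpls = Pos.expand([-2, -1, 0, 1], [1, 2], excludezero=True)
>>> # all templates combining one to three of these features
>>> templates = list(Template.expand([wordtpls, tagtpls], combinations=(1, 3)))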

nltk.tbl.demo.demo_high_accuracy_rules()[source]

Discard rules with low accuracy. This may hurt performance a bit, but will often produce rules that are more interesting to a human reader.
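
The cutoff is controlled by the min_acc parameter of postag() below; for instance (values illustrative):

>>> from nltk.tbl.demo import postag
>>> postag(min_acc=0.96, min_score=10)  # discard rules whose accuracy is below 0.96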

nltk.tbl.demo.demo_learning_curve()[source]

Plot a learning curve – the contribution of the individual rules to tagging accuracy. Note: requires matplotlib
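
For example (filename illustrative):

>>> from nltk.tbl.demo import postag
>>> # plot the contribution of the first 300 rules to a file
>>> postag(learning_curve_output="learningcurve.png", learning_curve_take=300)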

nltk.tbl.demo.demo_multifeature_template()[source]

Templates can have more than a single feature.
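
A sketch of a two-feature template, conditioning on the current word together with the tags of the two preceding words (the concrete combination is illustrative):

>>> from nltk.tag.brill import Word, Pos
>>> from nltk.tbl import Template
>>> template = Template(Word([0]), Pos([-2, -1]))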

nltk.tbl.demo.demo_multiposition_feature()[source]

The feature(s) of a template take a list of positions relative to the current word where the feature should be looked for, conceptually joined by logical OR. For instance, Pos([-1, 1]), given a value V, will hold whenever V is found one step to the left and/or one step to the right.

For contiguous ranges, a 2-arg form giving inclusive end points can also be used: Pos(-3, -1) is the same as Pos([-3, -2, -1]).
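
For example:

>>> from nltk.tag.brill import Pos
>>> Pos([-1, 1])  # holds if the value is found one step left and/or one step right
>>> Pos(-3, -1)   # contiguous 2-arg form, same as Pos([-3, -2, -1])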

nltk.tbl.demo.demo_repr_rule_format()[source]

Exemplify repr(Rule) (see also str(Rule) and Rule.format("verbose"))
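
The format is selected through the ruleformat parameter of postag() below; for instance:

>>> from nltk.tbl.demo import postag
>>> postag(ruleformat="repr")  # "str" and "verbose" select the other formats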

nltk.tbl.demo.demo_serialize_tagger()[source]

Serializes the learned tagger to a file in pickle format; reloads it and validates the process.
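
A sketch of the round trip via postag() and the standard pickle module (filename illustrative):

>>> import pickle
>>> from nltk.tbl.demo import postag
>>> postag(serialize_output="tagger.pickle")  # train, then save the learned tagger
>>> with open("tagger.pickle", "rb") as f:
...     tagger = pickle.load(f)  # reload it for later use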

nltk.tbl.demo.demo_str_rule_format()[source]

Exemplify str(Rule) (see also repr(Rule) and Rule.format("verbose"))

nltk.tbl.demo.demo_template_statistics()[source]

Show aggregate statistics per template. Little-used templates are candidates for deletion; much-used templates may be candidates for refinement.

Deleting unused templates is mostly about saving time and/or space: training is basically O(T) in the number of templates T (also in terms of memory usage, which often will be the limiting factor).
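
For example:

>>> from nltk.tbl.demo import postag
>>> # print per-template statistics, collected while tagging incrementally
>>> postag(template_stats=True, incremental_stats=True)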

nltk.tbl.demo.demo_verbose_rule_format()[source]

Exemplify Rule.format("verbose")

nltk.tbl.demo.postag(templates=None, tagged_data=None, num_sents=1000, max_rules=300, min_score=3, min_acc=None, train=0.8, trace=3, randomize=False, ruleformat='str', incremental_stats=False, template_stats=False, error_output=None, serialize_output=None, learning_curve_output=None, learning_curve_take=300, baseline_backoff_tagger=None, separate_baseline_data=False, cache_baseline_tagger=None)[source]

Brill Tagger Demonstration

Parameters
  • templates (list of Template) – the templates to be used in rule learning

  • tagged_data (list of tagged sentences) – the tagged corpus to use for training and testing

  • num_sents (C{int}) – how many sentences of training and testing data to use

  • max_rules (C{int}) – maximum number of rule instances to create

  • min_score (C{int}) – the minimum score for a rule in order for it to be considered

  • min_acc (C{float}) – the minimum accuracy for a rule in order for it to be considered

  • train (C{float}) – the fraction of the corpus to be used for training (1=all)

  • trace (C{int}) – the level of diagnostic tracing output to produce (0-4)

  • randomize (C{bool}) – whether the training data should be a random subset of the corpus

  • ruleformat (C{str}) – rule output format, one of "str", "repr", "verbose"

  • incremental_stats (C{bool}) – if true, will tag incrementally and collect stats for each rule (rather slow)

  • template_stats (C{bool}) – if true, will print per-template statistics collected in training and (optionally) testing

  • error_output (C{str}) – the file where errors will be saved

  • serialize_output (C{str}) – the file where the learned tbl tagger will be saved

  • learning_curve_output (C{str}) – filename of plot of learning curve(s) (train and also test, if available)

  • learning_curve_take (C{int}) – how many rules to include in the plot

  • baseline_backoff_tagger (tagger) – the backoff tagger to use for the baseline unigram tagger

  • separate_baseline_data (C{bool}) – use a fraction of the training data exclusively for training the baseline

  • cache_baseline_tagger (C{str}) – cache baseline tagger to this file (only interesting as a temporary workaround to get deterministic output from the baseline unigram tagger between python versions)

Note on separate_baseline_data: if False, the training data is reused for both the baseline and the rule learner. That is fast and fine for a demo, but the tagger is then likely to generalize worse on unseen data. It also cannot sensibly be combined with learning curves on training data (the baseline will be artificially high).
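
For example, a fuller call exercising several of these options (values and filenames illustrative):

>>> from nltk.tbl.demo import postag
>>> postag(
...     num_sents=2000,
...     max_rules=200,
...     separate_baseline_data=True,  # train the baseline on held-out data
...     learning_curve_output="curve.png",
... )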