nltk.tbl.demo module

nltk.tbl.demo.corpus_size(seqs)[source]
nltk.tbl.demo.demo()[source]

Run a demo with defaults. See source comments for details, or docstrings of any of the more specific demo_* functions.
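
A minimal invocation, assuming NLTK and its sample corpora are installed:

>>> from nltk.tbl import demo as tbl_demo
>>> tbl_demo.demo()  # train and evaluate a Brill tagger with default settings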

nltk.tbl.demo.demo_error_analysis()[source]

Writes a file with context for each erroneously tagged word after tagging the testing data
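
This corresponds to passing error_output to postag() below; for instance (filename illustrative):

>>> from nltk.tbl.demo import postag
>>> postag(error_output="errors.txt")  # write mistagged words, with context, to errors.txt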

nltk.tbl.demo.demo_generated_templates()[source]

Template.expand and Feature.expand are class methods that facilitate generating large numbers of templates. See their documentation for details.

Note: training with 500 templates can easily fill all available memory, even on relatively small corpora
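
A sketch of the expansion pattern this demo follows (argument values are illustrative; see the expand docstrings for exact semantics):

>>> from nltk.tag.brill import Word, Pos
>>> from nltk.tbl import Template
>>> # all Word features with window lengths 1-2 over positions -1..1
>>> wordtpls = Word.expand([-1, 0, 1], [1, 2], excludezero=False)
>>> # all Pos features with window lengths 1-2 over positions -2..1
>>> tagtpls = Pos.expand([-2, -1, 0, 1], [1, 2], excludezero=True)
>>> # all templates combining one to three of these features
>>> templates = list(Template.expand([wordtpls, tagtpls], combinations=(1, 3)))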

nltk.tbl.demo.demo_high_accuracy_rules()[source]

Discard rules with low accuracy. This may hurt performance a bit, but will often produce rules that are more interesting to a human reader.
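
The cutoff is controlled by the min_acc parameter of postag() below; for instance (values illustrative):

>>> from nltk.tbl.demo import postag
>>> postag(min_acc=0.96, min_score=10)  # discard rules whose accuracy is below 0.96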

nltk.tbl.demo.demo_learning_curve()[source]

Plot a learning curve – the contribution of the individual rules to tagging accuracy. Note: requires matplotlib
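
For example (filename illustrative):

>>> from nltk.tbl.demo import postag
>>> # plot the contribution of the first 300 rules to a file
>>> postag(learning_curve_output="learningcurve.png", learning_curve_take=300)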

nltk.tbl.demo.demo_multifeature_template()[source]

Templates can have more than a single feature.
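
A sketch of a two-feature template, conditioning on the current word together with the tags of the two preceding words (the concrete combination is illustrative):

>>> from nltk.tag.brill import Word, Pos
>>> from nltk.tbl import Template
>>> template = Template(Word([0]), Pos([-2, -1]))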

nltk.tbl.demo.demo_multiposition_feature()[source]

The feature(s) of a template take a list of positions relative to the current word where the feature should be looked for, conceptually joined by logical OR. For instance, Pos([-1, 1]), given a value V, will hold whenever V is found one step to the left and/or one step to the right.

For contiguous ranges, a 2-arg form giving inclusive end points can also be used: Pos(-3, -1) is the same as Pos([-3, -2, -1]).
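
For example:

>>> from nltk.tag.brill import Pos
>>> Pos([-1, 1])  # holds if the value is found one step left and/or one step right
>>> Pos(-3, -1)   # contiguous 2-arg form, same as Pos([-3, -2, -1])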

nltk.tbl.demo.demo_repr_rule_format()[source]

Exemplify repr(Rule) (see also str(Rule) and Rule.format("verbose"))
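
The format is selected through the ruleformat parameter of postag() below; for instance:

>>> from nltk.tbl.demo import postag
>>> postag(ruleformat="repr")  # "str" and "verbose" select the other formats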

nltk.tbl.demo.demo_serialize_tagger()[source]

Serializes the learned tagger to a file in pickle format; reloads it and validates the process.
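
A sketch of the round trip via postag() and the standard pickle module (filename illustrative):

>>> import pickle
>>> from nltk.tbl.demo import postag
>>> postag(serialize_output="tagger.pickle")  # train, then save the learned tagger
>>> with open("tagger.pickle", "rb") as f:
...     tagger = pickle.load(f)  # reload it for later use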

nltk.tbl.demo.demo_str_rule_format()[source]

Exemplify str(Rule) (see also repr(Rule) and Rule.format("verbose"))

nltk.tbl.demo.demo_template_statistics()[source]

Show aggregate statistics per template. Little-used templates are candidates for deletion; much-used templates may be candidates for refinement.

Deleting unused templates is mostly about saving time and/or space: training is basically O(T) in the number of templates T (also in terms of memory usage, which often will be the limiting factor).
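
For example:

>>> from nltk.tbl.demo import postag
>>> # print per-template statistics, collected while tagging incrementally
>>> postag(template_stats=True, incremental_stats=True)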

nltk.tbl.demo.demo_verbose_rule_format()[source]

Exemplify Rule.format("verbose")

nltk.tbl.demo.postag(templates=None, tagged_data=None, num_sents=1000, max_rules=300, min_score=3, min_acc=None, train=0.8, trace=3, randomize=False, ruleformat='str', incremental_stats=False, template_stats=False, error_output=None, serialize_output=None, learning_curve_output=None, learning_curve_take=300, baseline_backoff_tagger=None, separate_baseline_data=False, cache_baseline_tagger=None)[source]

Brill Tagger Demonstration

Parameters
  • templates (list of Template) – the templates to be used in rule learning

  • tagged_data (list of tagged sentences) – the tagged corpus to use for training and testing

  • num_sents (C{int}) – how many sentences of training and testing data to use

  • max_rules (C{int}) – maximum number of rule instances to create

  • min_score (C{int}) – the minimum score for a rule in order for it to be considered

  • min_acc (C{float}) – the minimum accuracy for a rule in order for it to be considered

  • train (C{float}) – the fraction of the corpus to be used for training (1=all)

  • trace (C{int}) – the level of diagnostic tracing output to produce (0-4)

  • randomize (C{bool}) – whether the training data should be a random subset of the corpus

  • ruleformat (C{str}) – rule output format, one of "str", "repr", "verbose"

  • incremental_stats (C{bool}) – if true, will tag incrementally and collect stats for each rule (rather slow)

  • template_stats (C{bool}) – if true, will print per-template statistics collected in training and (optionally) testing

  • error_output (C{str}) – the file where errors will be saved

  • serialize_output (C{str}) – the file where the learned tbl tagger will be saved

  • learning_curve_output (C{str}) – filename of plot of learning curve(s) (train and also test, if available)

  • learning_curve_take (C{int}) – how many rules to include in the plot

  • baseline_backoff_tagger (tagger) – the backoff tagger to use for the baseline unigram tagger

  • separate_baseline_data (C{bool}) – use a fraction of the training data exclusively for training the baseline

  • cache_baseline_tagger (C{str}) – cache baseline tagger to this file (only interesting as a temporary workaround to get deterministic output from the baseline unigram tagger between python versions)

Note on separate_baseline_data: if False, the training data is reused for both the baseline and the rule learner. That is fast and fine for a demo, but the tagger is then likely to generalize worse on unseen data. It also cannot sensibly be combined with learning curves on training data (the baseline will be artificially high).
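
For example, a fuller call exercising several of these options (values and filenames illustrative):

>>> from nltk.tbl.demo import postag
>>> postag(
...     num_sents=2000,
...     max_rules=200,
...     separate_baseline_data=True,  # train the baseline on held-out data
...     learning_curve_output="curve.png",
... )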