nltk.toolbox module

Module for reading, writing and manipulating Toolbox databases and settings files.

class nltk.toolbox.StandardFormat[source]

Bases: object

Class for reading and processing standard format marker files and strings.

__init__(filename=None, encoding=None)[source]

close()[source]

Close a previously opened standard format marker file or string.

fields(strip=True, unwrap=True, encoding=None, errors='strict', unicode_fields=None)[source]

Return an iterator that yields each field as a (marker, value) tuple. If an encoding is passed to fields(), marker and value are unicode strings; otherwise they are non-unicode strings.

Parameters
  • strip (bool) – Strip trailing whitespace from the last line of each field.

  • unwrap (bool) – Convert newlines in a field to spaces.

  • encoding (str or None) – Name of an encoding to use. If specified, fields() returns unicode strings rather than non-unicode strings.

  • errors (str) – Error-handling scheme for the codec. Same as the errors argument of the built-in decode() method.

  • unicode_fields (sequence) – Set of marker names whose values are UTF-8 encoded. Ignored if encoding is None. If the whole file is UTF-8 encoded, set encoding='utf8' and leave unicode_fields at its default value of None.

Return type

iter(tuple(str, str))
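The effect of the strip and unwrap options can be illustrated without NLTK installed. The following is a minimal stdlib-only sketch, not NLTK's implementation: postprocess_field is a hypothetical helper, and collapsing a run of several newlines into a single space is an assumption about how unwrapping treats blank lines.

```python
import re


def postprocess_field(value, strip=True, unwrap=True):
    """Hypothetical helper mirroring the documented strip/unwrap options."""
    if unwrap:
        # Assumption: each run of newlines becomes one space.
        value = re.sub(r"\n+", " ", value)
    if strip:
        # Drop trailing whitespace at the end of the field.
        value = value.rstrip()
    return value


raw = "a long gloss\nwrapped over two lines  "
unwrapped = postprocess_field(raw)
kept_breaks = postprocess_field(raw, unwrap=False)
print(repr(unwrapped))
print(repr(kept_breaks))
```

With NLTK available, calling open_string() on a StandardFormat instance and then iterating over fields() should yield (marker, value) pairs whose values have been treated in a comparable way.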

open(sfm_file)[source]

Open a standard format marker file for sequential reading.

Parameters

sfm_file (str) – name of the standard format marker input file

open_string(s)[source]

Open a standard format marker string for sequential reading.

Parameters

s (str) – string to parse as a standard format marker input file

raw_fields()[source]

Return an iterator that yields each field as a (marker, value) tuple. Line breaks and trailing whitespace are preserved, except for the final newline in each field.

Return type

iter(tuple(str, str))
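To make the shape of the raw (marker, value) pairs concrete, here is a rough stdlib-only sketch of splitting standard format marker text into fields. It is an approximation for illustration, not NLTK's code: iter_raw_fields and LINE_PAT are hypothetical names, byte order marks and encodings are ignored, and any text before the first marker is dropped.

```python
import re

# An optional "\marker" at the start of a line, then the rest of the line.
LINE_PAT = re.compile(r"^(?:\\(\S+)[ \t]*)?(.*)$")


def iter_raw_fields(text):
    """Hypothetical helper: yield (marker, value) pairs from SFM text."""
    marker, lines = None, []
    for line in text.splitlines():
        new_marker, rest = LINE_PAT.match(line).groups()
        if new_marker:
            # A new marker closes the field collected so far.
            if marker is not None:
                yield marker, "\n".join(lines)
            marker, lines = new_marker, [rest]
        else:
            # Continuation line: keep it verbatim, including whitespace.
            lines.append(rest)
    if marker is not None:
        yield marker, "\n".join(lines)


sample = "\\lx kaa\n\\ge gag\n   second line  \n\\ps N\n"
pairs = list(iter_raw_fields(sample))
for marker, value in pairs:
    print(marker, repr(value))
```

The second field keeps its internal line break and the trailing spaces of its continuation line, which is the difference between raw fields and the unwrapped, stripped values.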

class nltk.toolbox.ToolboxData[source]

Bases: StandardFormat

parse(grammar=None, **kwargs)[source]

class nltk.toolbox.ToolboxSettings[source]

Bases: StandardFormat

This class is the base class for settings files.

__init__()[source]

parse(encoding=None, errors='strict', **kwargs)[source]

Return the contents of a Toolbox settings file as a nested structure.

Parameters
  • encoding (str) – encoding used by settings file

  • errors (str) – Error-handling scheme for the codec. Same as the errors argument of the built-in decode() method.

  • kwargs (dict) – Keyword arguments passed to StandardFormat.fields()

Return type

ElementTree._ElementInterface
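One way to picture the nested structure is the sketch below, which builds an element tree from (marker, value) pairs. It assumes that markers prefixed with + and - open and close a block, as in Toolbox settings files; build_settings_tree is a hypothetical helper and the marker names in the sample are purely illustrative.

```python
from xml.etree.ElementTree import TreeBuilder


def build_settings_tree(fields):
    """Hypothetical helper: nest (marker, value) pairs using +/- markers."""
    builder = TreeBuilder()
    for marker, value in fields:
        if marker.startswith("+"):
            # Block start: open an element and keep it open.
            builder.start(marker[1:], {})
            builder.data(value)
        elif marker.startswith("-"):
            # Block end: close the matching element.
            builder.end(marker[1:])
        else:
            # Ordinary field: a leaf element holding the value.
            builder.start(marker, {})
            builder.data(value)
            builder.end(marker)
    return builder.close()


fields = [
    ("+DatabaseType", "Dictionary"),
    ("ver", "5.0"),
    ("+mkrset", ""),
    ("lngDefault", "Default"),
    ("-mkrset", ""),
    ("-DatabaseType", ""),
]
root = build_settings_tree(fields)
print(root.tag, [child.tag for child in root])
```

The returned root can then be navigated with the usual ElementTree methods such as find() and iteration over children.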

nltk.toolbox.add_blank_lines(tree, blanks_before, blanks_between)[source]

Add blank lines before all elements and subelements specified in blanks_before.

Parameters
  • tree (ElementTree._ElementInterface) – toolbox data in an elementtree structure

  • blanks_before (dict(tuple)) – elements and subelements to add blank lines before

nltk.toolbox.add_default_fields(elem, default_fields)[source]

Add blank elements and subelements specified in default_fields.

Parameters
  • elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure

  • default_fields (dict(tuple)) – fields to add to each type of element and subelement
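The shape of default_fields, a mapping from an element's tag to the field names it should contain, is easiest to see in a small example. The sketch below is a stdlib-only approximation of the described behaviour, not NLTK's code; add_defaults_sketch is a hypothetical name, and the record, lx, ps and ge tags are sample data.

```python
from xml.etree.ElementTree import Element, SubElement


def add_defaults_sketch(elem, default_fields):
    """Hypothetical helper: add empty children named in default_fields."""
    for field in default_fields.get(elem.tag, ()):
        # Only add a field that is not already present.
        if elem.find(field) is None:
            SubElement(elem, field)
    for child in elem:
        add_defaults_sketch(child, default_fields)


record = Element("record")
SubElement(record, "lx").text = "kaa"
add_defaults_sketch(record, {"record": ("lx", "ps", "ge")})
tags = [child.tag for child in record]
print(tags)
```

Existing fields are left untouched; only the missing ones are appended as blank elements.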

nltk.toolbox.demo()[source]

nltk.toolbox.remove_blanks(elem)[source]

Remove all elements and subelements with no text and no child elements.

Parameters

elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure
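As an illustration of what counts as blank, the following stdlib-only sketch prunes elements that have neither text nor children, working bottom-up so that a group holding only blank fields is removed as well. remove_blanks_sketch is a hypothetical stand-in written for this example, not NLTK's implementation, and the tags are sample data.

```python
from xml.etree.ElementTree import Element, SubElement


def remove_blanks_sketch(elem):
    """Hypothetical helper: drop children with no text and no children."""
    kept = []
    for child in elem:
        # Prune the child's own subtree first.
        remove_blanks_sketch(child)
        if child.text or len(child) > 0:
            kept.append(child)
    # Replace the child list in place.
    elem[:] = kept


record = Element("record")
SubElement(record, "lx").text = "kaa"
SubElement(record, "ps")                # no text, no children
sense = SubElement(record, "sense")     # holds only a blank child
SubElement(sense, "ge").text = ""
remove_blanks_sketch(record)
remaining = [child.tag for child in record]
print(remaining)
```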

nltk.toolbox.sort_fields(elem, field_orders)[source]

Sort the elements and subelements in the order specified in field_orders.

Parameters
  • elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure

  • field_orders (dict(tuple)) – order of fields for each type of element and subelement
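A small example may help show how field_orders maps an element's tag to the desired order of its children. The sketch below is a stdlib-only approximation, not NLTK's code: sort_fields_sketch is a hypothetical name, and placing unlisted fields after the listed ones in their original relative order is an assumption.

```python
from xml.etree.ElementTree import Element, SubElement


def sort_fields_sketch(elem, field_orders):
    """Hypothetical helper: reorder children according to field_orders."""
    order = field_orders.get(elem.tag)
    if order is not None:
        rank = {tag: i for i, tag in enumerate(order)}
        # sorted() is stable, so unlisted tags keep their relative order
        # and end up after every listed tag.
        elem[:] = sorted(elem, key=lambda child: rank.get(child.tag, len(rank)))
    for child in elem:
        sort_fields_sketch(child, field_orders)


record = Element("record")
for tag in ("ge", "xx", "ps", "lx"):
    SubElement(record, tag)
sort_fields_sketch(record, {"record": ("lx", "ps", "ge")})
sorted_tags = [child.tag for child in record]
print(sorted_tags)
```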

nltk.toolbox.to_settings_string(tree, encoding=None, errors='strict', unicode_fields=None)[source]

nltk.toolbox.to_sfm_string(tree, encoding=None, errors='strict', unicode_fields=None)[source]

Return a string with a standard format representation of the toolbox data in tree (tree can be a toolbox database or a single record).

Parameters
  • tree (ElementTree._ElementInterface) – flat representation of toolbox data (whole database or single record)

  • encoding (str) – Name of an encoding to use.

  • errors (str) – Error handling scheme for codec. Same as the encode() builtin string method.

  • unicode_fields (dict(str) or set(str)) – Marker names whose values are UTF-8 encoded.

Return type

str
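To show what a standard format representation of a flat tree looks like, here is a stdlib-only sketch that serialises either a whole database or a single record. It is an approximation and not NLTK's implementation: to_sfm_sketch is a hypothetical name, the toolbox_data and record tag names are assumptions about the tree layout, and encoding is not handled at all.

```python
from xml.etree.ElementTree import Element, SubElement


def to_sfm_sketch(tree):
    """Hypothetical helper: serialise a flat toolbox tree as SFM text."""
    # Accept either a single record or a database of records.
    records = [tree] if tree.tag == "record" else list(tree)
    blocks = []
    for record in records:
        lines = []
        for field in record:
            value = field.text or ""
            # Marker alone when the value is empty, else "\marker value".
            line = "\\" + field.tag
            if value:
                line += " " + value
            lines.append(line)
        blocks.append("\n".join(lines) + "\n")
    # Records are separated by one blank line.
    return "\n".join(blocks)


db = Element("toolbox_data")
first = SubElement(db, "record")
SubElement(first, "lx").text = "kaa"
SubElement(first, "ge").text = "gag"
second = SubElement(db, "record")
SubElement(second, "lx").text = "kaat"
SubElement(second, "ge")
sfm_text = to_sfm_sketch(db)
print(sfm_text)
```

Passing a single record element instead of the database produces just that record's lines.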