nltk.toolbox module
Module for reading, writing and manipulating Toolbox databases and settings files.
- class nltk.toolbox.StandardFormat
Bases: object
Class for reading and processing standard format marker files and strings.
- fields(strip=True, unwrap=True, encoding=None, errors='strict', unicode_fields=None)
Return an iterator that returns the next field in a (marker, value) tuple, where marker and value are unicode strings if an encoding was specified in the fields() method. Otherwise they are non-unicode strings.
- Parameters:
strip (bool) – Strip trailing whitespace from the last line of each field.
unwrap (bool) – Convert newlines in a field to spaces.
encoding (str or None) – Name of an encoding to use. If it is specified, the fields() method returns unicode strings rather than non-unicode strings.
errors (str) – Error handling scheme for the codec. Same as the decode() builtin string method.
unicode_fields (sequence) – Set of marker names whose values are UTF-8 encoded. Ignored if encoding is None. If the whole file is UTF-8 encoded, set encoding='utf8' and leave unicode_fields with its default value of None.
- Return type:
iter(tuple(str, str))
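To make the shape of the returned pairs concrete, the following standard-library sketch shows how standard format text could break into (marker, value) tuples under the strip and unwrap options. The helpers sketch_fields and _finish and the sample entry are hypothetical. This is not NLTK's implementation, only an illustration of the behaviour documented above.

```python
import re

def sketch_fields(text, strip=True, unwrap=True):
    """Yield (marker, value) pairs from standard format marker text.

    Hypothetical helper for illustration only, not NLTK's code.
    """
    marker, lines = None, []
    for line in text.splitlines():
        match = re.match(r"\\(\S+)\s*(.*)$", line)
        if match:
            # A line starting with a backslash marker begins a new field.
            if marker is not None:
                yield _finish(marker, lines, strip, unwrap)
            marker, lines = match.group(1), [match.group(2)]
        elif marker is not None:
            # Any other line continues the current field's value.
            lines.append(line)
    if marker is not None:
        yield _finish(marker, lines, strip, unwrap)

def _finish(marker, lines, strip, unwrap):
    value = "\n".join(lines)
    if unwrap:
        value = re.sub(r"\n+", " ", value)  # newlines inside a field become spaces
    if strip:
        value = value.rstrip()              # drop trailing whitespace
    return marker, value

# Made-up dictionary entry; the \ge field wraps onto a second line.
sample = "\\lx kaa\n\\ps N\n\\ge gag\nsecond line \n"
pairs = list(sketch_fields(sample))
print(pairs)
```

With unwrap=False and strip=False the last value keeps its embedded newline and trailing space, which is the difference the two flags control.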
- open(sfm_file)
Open a standard format marker file for sequential reading.
- Parameters:
sfm_file (str) – name of the standard format marker input file
- class nltk.toolbox.ToolboxData
Bases: StandardFormat
- class nltk.toolbox.ToolboxSettings
Bases: StandardFormat
This class is the base class for settings files.
- parse(encoding=None, errors='strict', **kwargs)
Return the contents of a toolbox settings file as a nested structure.
- Parameters:
encoding (str) – Encoding used by the settings file.
errors (str) – Error handling scheme for the codec. Same as the decode() builtin method.
kwargs (dict) – Keyword arguments passed to StandardFormat.fields().
- Return type:
ElementTree._ElementInterface
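As an illustration of what a nested structure built from flat fields can look like, here is a standard-library sketch. It assumes, as is conventional in Toolbox settings files, that a marker beginning with + opens a block and a marker beginning with - closes it. The text above does not state this convention. The helper sketch_parse and the sample markers are hypothetical, and the code is not NLTK's implementation.

```python
from xml.etree.ElementTree import TreeBuilder

def sketch_parse(fields):
    """Build a nested element tree from (marker, value) pairs.

    Assumption: '+name' opens a block and '-name' closes it.
    Hypothetical helper for illustration only.
    """
    builder = TreeBuilder()
    for marker, value in fields:
        if marker.startswith("+"):
            # Block start: open an element and keep it open for children.
            builder.start(marker[1:], {})
            builder.data(value)
        elif marker.startswith("-"):
            # Block end: close the matching element.
            builder.end(marker[1:])
        else:
            # Ordinary field: a leaf element holding the value as text.
            builder.start(marker, {})
            builder.data(value)
            builder.end(marker)
    return builder.close()

# Made-up settings fields, in the (marker, value) form that fields() yields.
fields = [
    ("+mkrset", ""),
    ("lngDefault", "English"),
    ("+mkr", "dt"),
    ("nam", "Date"),
    ("-mkr", ""),
    ("-mkrset", ""),
]
root = sketch_parse(fields)
print(root.tag, root.find("mkr/nam").text)
```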
- nltk.toolbox.add_blank_lines(tree, blanks_before, blanks_between)
Add blank lines before all elements and subelements specified in blanks_before.
- Parameters:
tree (ElementTree._ElementInterface) – toolbox data in an elementtree structure
blanks_before (dict(tuple)) – elements and subelements to add blank lines before
blanks_between (dict(tuple)) – elements and subelements to add blank lines between
- nltk.toolbox.add_default_fields(elem, default_fields)
Add blank elements and subelements specified in default_fields.
- Parameters:
elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure
default_fields (dict(tuple)) – fields to add to each type of element and subelement
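The effect described above can be sketched with the standard library alone. The sketch assumes that default_fields maps a parent element's tag to a tuple of field names that should exist under it. The helper sketch_add_default_fields and the sample record are hypothetical, and the code is not NLTK's implementation.

```python
from xml.etree.ElementTree import Element, SubElement

def sketch_add_default_fields(elem, default_fields):
    """Add an empty subelement for every default field that is missing.

    Hypothetical helper for illustration only.
    """
    for field in default_fields.get(elem.tag, ()):
        if elem.find(field) is None:
            SubElement(elem, field)  # blank element: no text, no children
    for child in elem:
        sketch_add_default_fields(child, default_fields)

# Made-up record that only has a lexeme field.
record = Element("record")
SubElement(record, "lx").text = "kaa"

sketch_add_default_fields(record, {"record": ("lx", "ps", "ge")})
tags = [child.tag for child in record]
print(tags)
```

Existing fields are left untouched; only the missing ones are appended as blank elements.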
- nltk.toolbox.remove_blanks(elem)
Remove all elements and subelements with no text and no child elements.
- Parameters:
elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure
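A minimal standard-library sketch of the rule described above: drop every element that has neither text nor child elements. The helper sketch_remove_blanks and the sample record are hypothetical, and the code is not NLTK's implementation.

```python
from xml.etree.ElementTree import Element, SubElement

def sketch_remove_blanks(elem):
    """Remove descendants that have no text and no child elements, in place.

    Hypothetical helper for illustration only.
    """
    kept = []
    for child in elem:
        # Clean the subtree first, so a parent emptied by the cleanup
        # is itself treated as blank.
        sketch_remove_blanks(child)
        if child.text or len(child) > 0:
            kept.append(child)
    elem[:] = kept

# Made-up record with one blank field.
record = Element("record")
SubElement(record, "lx").text = "kaa"
SubElement(record, "ps")                 # no text, no children
SubElement(record, "ge").text = "gag"

sketch_remove_blanks(record)
remaining = [child.tag for child in record]
print(remaining)
```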
- nltk.toolbox.sort_fields(elem, field_orders)
Sort the elements and subelements in order specified in field_orders.
- Parameters:
elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure
field_orders (dict(tuple)) – order of fields for each type of element and subelement
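The ordering step can be sketched with the standard library. The sketch assumes that field_orders maps a parent element's tag to a tuple giving the desired order of its children, and that fields not listed are placed after the listed ones. The helper sketch_sort_fields and the sample record are hypothetical, and the code is not NLTK's implementation.

```python
from xml.etree.ElementTree import Element, SubElement

def sketch_sort_fields(elem, field_orders):
    """Reorder children of each element according to field_orders, in place.

    Hypothetical helper for illustration only.
    """
    order = field_orders.get(elem.tag)
    if order is not None:
        rank = {tag: i for i, tag in enumerate(order)}
        # sorted() is stable, so unlisted fields keep their relative
        # order and end up after all listed fields.
        elem[:] = sorted(elem, key=lambda child: rank.get(child.tag, len(rank)))
    for child in elem:
        sketch_sort_fields(child, field_orders)

# Made-up record whose fields are out of order; "dt" is not in the order tuple.
record = Element("record")
for tag in ("ge", "dt", "ps", "lx"):
    SubElement(record, tag)

sketch_sort_fields(record, {"record": ("lx", "ps", "ge")})
sorted_tags = [child.tag for child in record]
print(sorted_tags)
```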
- nltk.toolbox.to_sfm_string(tree, encoding=None, errors='strict', unicode_fields=None)
Return a string with a standard format representation of the toolbox data in tree (tree can be a toolbox database or a single record).
- Parameters:
tree (ElementTree._ElementInterface) – flat representation of toolbox data (whole database or single record)
encoding (str) – Name of an encoding to use.
errors (str) – Error handling scheme for the codec. Same as the encode() builtin string method.
unicode_fields (dict(str) or set(str))
- Return type:
str
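To show what a standard format representation of a flat tree might look like, here is a standard-library sketch. It assumes that records are elements tagged record whose direct children are the fields, and that records are separated by a blank line. It also ignores encoding altogether. The helper sketch_to_sfm and the sample data are hypothetical, and the code is not NLTK's implementation.

```python
from xml.etree.ElementTree import Element, SubElement

def sketch_to_sfm(tree):
    """Serialise a flat tree (database or single record) to marker text.

    Hypothetical helper for illustration only; no encoding handling.
    """
    # Accept either a single record or a container of records.
    records = [tree] if tree.tag == "record" else list(tree)
    blocks = []
    for record in records:
        # One "\marker value" line per field.
        lines = [f"\\{field.tag} {field.text or ''}" for field in record]
        blocks.append("\n".join(lines) + "\n")
    # A blank line separates consecutive records.
    return "\n".join(blocks)

# Made-up database with two records.
db = Element("toolbox_data")
for lexeme, gloss in (("kaa", "gag"), ("kab", "dog")):
    rec = SubElement(db, "record")
    SubElement(rec, "lx").text = lexeme
    SubElement(rec, "ge").text = gloss

sfm_text = sketch_to_sfm(db)
print(sfm_text)
```

Passing a single record, for example db[0], yields just that record's lines, matching the note above that tree can be a whole database or one record.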