nltk.toolbox.StandardFormat

class nltk.toolbox.StandardFormat[source]

Bases: object

Class for reading and processing standard format marker files and strings.

__init__(filename=None, encoding=None)[source]

open(sfm_file)[source]

Open a standard format marker file for sequential reading.

Parameters

sfm_file (str) – name of the standard format marker input file
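As a minimal sketch of open(), assuming NLTK is installed, the following writes a small made-up record to a temporary file and reads it back. The file name sample.sfm and the record contents are illustrative assumptions, not part of the library.

```python
import os
import tempfile

from nltk.toolbox import StandardFormat

# Made-up record: each field starts with a backslash marker.
sample = "\\lx kaa\n\\ps N\n\\ge gag\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.sfm")  # hypothetical file name
    with open(path, "w") as handle:
        handle.write(sample)

    sf = StandardFormat()
    sf.open(path)                    # open the file for sequential reading
    file_fields = list(sf.fields())  # consume every (marker, value) pair
    sf.close()                       # release the file before cleanup

print(file_fields)
```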

open_string(s)[source]

Open a standard format marker string for sequential reading.

Parameters

s (str) – string to parse as a standard format marker input file
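For data that is already in memory, open_string() can be used in place of open(). The sketch below assumes NLTK is installed and uses a made-up record whose markers are unique, so the fields can be collected into a dictionary.

```python
from nltk.toolbox import StandardFormat

# Made-up record held in a string rather than a file.
record_text = "\\lx kaa\n\\ps N\n\\ge gag\n"

sf = StandardFormat()
sf.open_string(record_text)  # parse the string as if it were a file
record = dict(sf.fields())   # safe here only because no marker repeats
sf.close()

print(record["lx"], record["ge"])
```

A dictionary discards repeated markers, so a list of tuples is the safer choice for real lexicon files, where markers usually recur.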

raw_fields()[source]

Return an iterator that yields each field as a (marker, value) tuple. Line breaks and trailing whitespace are preserved, except for the final newline in each field.

Return type

iter(tuple(str, str))
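The following sketch illustrates what raw_fields() preserves, assuming NLTK is installed. The sample record is invented: its \ge value wraps onto a second line and ends with two spaces, so the yielded value should keep the internal line break and the trailing spaces while losing only the final newline.

```python
from nltk.toolbox import StandardFormat

# Made-up record: the \ge value spans two lines and ends with two spaces.
wrapped_text = "\\lx kaa\n\\ge gag\nmore text  \n"

sf = StandardFormat()
sf.open_string(wrapped_text)
raw = list(sf.raw_fields())  # values are returned without any clean-up
sf.close()

for marker, value in raw:
    print(marker, repr(value))  # repr() makes newlines and spaces visible
```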

fields(strip=True, unwrap=True, encoding=None, errors='strict', unicode_fields=None)[source]

Return an iterator that yields each field as a (marker, value) tuple, where marker and value are unicode strings if an encoding was specified in the call to fields(). Otherwise they are non-unicode strings.

Parameters
  • strip (bool) – Strip trailing whitespace from the last line of each field.

  • unwrap (bool) – Convert newlines in a field to spaces.

  • encoding (str or None) – Name of an encoding to use. If specified, fields() returns unicode strings rather than non-unicode strings.

  • errors (str) – Error handling scheme for the codec. Same as the errors argument of the built-in bytes.decode() method.

  • unicode_fields (sequence) – Marker names whose values are UTF-8 encoded. Ignored if encoding is None. If the whole file is UTF-8 encoded, set encoding='utf8' and leave unicode_fields at its default value of None.

Return type

iter(tuple(str, str))
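The effect of strip and unwrap can be compared side by side. This is a sketch under the assumption that NLTK is installed. The sample record and the helper read_fields() are hypothetical names introduced here for illustration.

```python
from nltk.toolbox import StandardFormat

# Made-up record: the \ge value wraps onto a second line and ends with spaces.
sample = "\\lx kaa\n\\ge gag\nmore text  \n"

def read_fields(**options):
    """Parse the sample afresh and return its fields as a list."""
    sf = StandardFormat()
    sf.open_string(sample)
    try:
        return list(sf.fields(**options))
    finally:
        sf.close()

defaults = read_fields()                            # strip=True, unwrap=True
untouched = read_fields(strip=False, unwrap=False)  # keep newlines and spaces
stripped_only = read_fields(unwrap=False)           # keep newlines, drop trailing spaces

for label, result in [("defaults", defaults),
                      ("untouched", untouched),
                      ("stripped_only", stripped_only)]:
    print(label, result)
```

The string is reopened for every call because the reader is sequential: once the iterator is exhausted, the same object yields nothing further until it is opened again.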

close()[source]

Close a previously opened standard format marker file or string.
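Because the class exposes a close() method, one way to guarantee cleanup is the standard library's contextlib.closing. This is a sketch of one possible pattern, assuming NLTK is installed, and is not a usage prescribed by the library. The sample record is made up.

```python
from contextlib import closing

from nltk.toolbox import StandardFormat

# Made-up record used only to have something to read.
text = "\\lx kaa\n\\ge gag\n"

sf = StandardFormat()
sf.open_string(text)

# closing() calls sf.close() on exit, even if reading raises an exception.
with closing(sf):
    closed_example = list(sf.fields())

print(closed_example)
```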