nltk.data module

Functions to find and load NLTK resource files, such as corpora, grammars, and saved processing objects. Resource files are identified using URLs, such as nltk:corpora/abc/rural.txt or https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg. The following URL protocols are supported:

  • file:path: Specifies the file whose path is path. Both relative and absolute paths may be used.

  • https://host/path: Specifies the file stored on the web server host at path path.

  • nltk:path: Specifies the file stored in the NLTK data package at path. NLTK will search for these files in the directories specified by nltk.data.path.

If no protocol is specified, then the default protocol nltk: will be used.
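The protocol rules above can be sketched in a few lines. This is only an illustration of the convention, not NLTK's own parsing code; `split_protocol` and `KNOWN_PROTOCOLS` are hypothetical names introduced for this example.

```python
# Hypothetical helper (not part of nltk.data): split a resource URL into
# (protocol, path), falling back to the default "nltk" protocol.
KNOWN_PROTOCOLS = ("file", "https", "nltk")


def split_protocol(resource_url):
    protocol, sep, rest = resource_url.partition(":")
    if sep and protocol.lower() in KNOWN_PROTOCOLS:
        return protocol.lower(), rest
    # No recognized protocol prefix: assume the NLTK data package.
    return "nltk", resource_url


print(split_protocol("nltk:corpora/abc/rural.txt"))
print(split_protocol("corpora/abc/rural.txt"))
print(split_protocol("file:grammars/toy.cfg"))
```

The first two calls yield the same pair, which is the point of the default: a bare resource name is treated as if it had been written with the nltk: prefix.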

This module provides two functions that can be used to access a resource file, given its URL: load() loads a given resource and adds it to a resource cache; retrieve() copies a given resource to a local file.

nltk.data.AUTO_FORMATS = {'cfg': 'cfg', 'fcfg': 'fcfg', 'fol': 'fol', 'json': 'json', 'logic': 'logic', 'pcfg': 'pcfg', 'pickle': 'pickle', 'text': 'text', 'txt': 'text', 'val': 'val', 'yaml': 'yaml'}

A dictionary mapping from file extensions to format names, used by load() when format="auto" to decide the format for a given resource URL.
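As a rough sketch of how such a mapping can drive format detection, the snippet below looks up the text after the last dot in a resource name. The dictionary is copied from the value shown above; `guess_format` is a hypothetical helper, and the real load() may handle additional cases.

```python
# Copy of the mapping documented above, so the example is self-contained.
AUTO_FORMATS = {
    "cfg": "cfg", "fcfg": "fcfg", "fol": "fol", "json": "json",
    "logic": "logic", "pcfg": "pcfg", "pickle": "pickle",
    "text": "text", "txt": "text", "val": "val", "yaml": "yaml",
}


def guess_format(resource_url):
    """Hypothetical helper: map a resource name's extension to a format name."""
    extension = resource_url.rsplit(".", 1)[-1]
    format_name = AUTO_FORMATS.get(extension)
    if format_name is None:
        raise ValueError(f"Could not determine format for {resource_url!r}")
    return format_name


print(guess_format("grammars/toy.cfg"))
print(guess_format("corpora/abc/rural.txt"))
```

Note that the txt extension maps to the text format, so the lookup key and the resulting format name are not always identical.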

nltk.data.BufferedGzipFile(*args, **kwargs)

A GzipFile subclass for compatibility with older nltk releases.

Use GzipFile directly as it also buffers in all supported Python versions.

Deprecated: use gzip.GzipFile instead, as it also uses a buffer.
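For code that is migrating away from BufferedGzipFile, the standard-library class can be used directly. The following is a minimal round trip with gzip.GzipFile; the file name and contents are made up for the example.

```python
import gzip
import os
import tempfile

# Write a small gzip-compressed file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")

with gzip.GzipFile(path, mode="wb") as out:
    out.write(b"hello, gzip\n")

with gzip.GzipFile(path, mode="rb") as src:
    data = src.read()

print(data)
```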

nltk.data.FORMATS = {'cfg': 'A context free grammar.', 'fcfg': 'A feature CFG.', 'fol': 'A list of first order logic expressions, parsed with nltk.sem.logic.Expression.fromstring.', 'json': 'A serialized python object, stored using the json module.', 'logic': 'A list of first order logic expressions, parsed with nltk.sem.logic.LogicParser.  Requires an additional logic_parser parameter', 'pcfg': 'A probabilistic CFG.', 'pickle': 'A serialized python object, stored using the pickle module.', 'raw': 'The raw (byte string) contents of a file.', 'text': 'The raw (unicode string) contents of a file. ', 'val': 'A semantic valuation, parsed by nltk.sem.Valuation.fromstring.', 'yaml': 'A serialized python object, stored using the yaml module.'}

A dictionary describing the formats that are supported by NLTK’s load() method. Keys are format names, and values are format descriptions.

class nltk.data.FileSystemPathPointer[source]

Bases: PathPointer, str

A path pointer that identifies a file which can be accessed directly via a given absolute path.

__init__(_path)[source]

Create a new path pointer for the given absolute path.

Raises

IOError – If the given path does not exist.

file_size()[source]

Return the size of the file pointed to by this path pointer, in bytes.

Raises

IOError – If the path specified by this pointer does not contain a readable file.

join(fileid)[source]

Return a new path pointer formed by starting at the path identified by this pointer, and then following the relative path given by fileid. The path components of fileid should be separated by forward slashes, regardless of the underlying file system’s path separator character.
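The forward-slash convention can be illustrated with plain os.path calls. This is a sketch of the idea only; `join_fileid` is a hypothetical helper and does not return a path pointer object as join() does.

```python
import os


def join_fileid(base, fileid):
    """Hypothetical helper: append a '/'-separated fileid to a base path
    using the platform's own separator."""
    return os.path.join(base, *fileid.split("/"))


joined = join_fileid("nltk_data", "corpora/abc/rural.txt")
print(joined)
```

On POSIX systems the result uses forward slashes; on Windows the same call produces backslash-separated components, which is why the fileid itself is always written with forward slashes.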

open(encoding=None)[source]

Return a seekable read-only stream that can be used to read the contents of the file identified by this path pointer.

Raises

IOError – If the path specified by this pointer does not contain a readable file.

property path

The absolute path identified by this path pointer.

class nltk.data.GzipFileSystemPathPointer[source]

Bases: FileSystemPathPointer

A subclass of FileSystemPathPointer that identifies a gzip-compressed file located at a given absolute path. GzipFileSystemPathPointer is appropriate for loading large gzip-compressed pickle objects efficiently.

open(encoding=None)[source]

Return a seekable read-only stream that can be used to read the contents of the file identified by this path pointer.

Raises

IOError – If the path specified by this pointer does not contain a readable file.

class nltk.data.LazyLoader[source]

Bases: object

__init__(_path)[source]

class nltk.data.OpenOnDemandZipFile[source]

Bases: ZipFile

A subclass of zipfile.ZipFile that closes its file pointer whenever it is not using it, and re-opens it when it needs to read data from the zipfile. This is useful for reducing the number of open file handles when many zip files are being accessed at once. OpenOnDemandZipFile must be constructed from a filename, not a file-like object (to allow re-opening). OpenOnDemandZipFile is read-only (i.e. write() and writestr() are disabled).

__init__(filename)[source]

Open the ZIP file named filename in read mode. filename must be a string naming a file, not a file-like object, so that the archive can be re-opened later.

read(name)[source]

Return file bytes for name.

write(*args, **kwargs)[source]

Raises

NotImplementedError – OpenOnDemandZipFile is read-only

writestr(*args, **kwargs)[source]

Raises

NotImplementedError – OpenOnDemandZipFile is read-only
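The open-on-demand idea behind OpenOnDemandZipFile can be sketched with a small standalone class that stores only the filename and opens the archive for the duration of each read. This is not NLTK's implementation (which subclasses zipfile.ZipFile); `ReopeningZipReader` and the demo archive are hypothetical.

```python
import os
import tempfile
import zipfile


class ReopeningZipReader:
    """Hypothetical read-only reader that holds no open handle between reads."""

    def __init__(self, filename):
        if not isinstance(filename, str):
            raise TypeError("filename must be a string so the archive can be re-opened")
        self.filename = filename

    def read(self, name):
        # Open the archive only while reading, then close it again.
        with zipfile.ZipFile(self.filename) as archive:
            return archive.read(name)

    def write(self, *args, **kwargs):
        raise NotImplementedError("ReopeningZipReader is read-only")


# Build a small archive to read from.
archive_path = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("demo/hello.txt", "hello")

reader = ReopeningZipReader(archive_path)
content = reader.read("demo/hello.txt")
print(content)
```

Because no file handle survives between calls, many such readers can exist at once without exhausting the process's open-file limit, at the cost of re-opening the archive on every read.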

class nltk.data.PathPointer[source]

Bases: object

An abstract base class for ‘path pointers,’ used by NLTK’s data package to identify specific paths. Two subclasses exist: FileSystemPathPointer identifies a file that can be accessed directly via a given absolute path. ZipFilePathPointer identifies a file contained within a zipfile, that can be accessed by reading that zipfile.

abstract file_size()[source]

Return the size of the file pointed to by this path pointer, in bytes.

Raises

IOError – If the path specified by this pointer does not contain a readable file.

abstract join(fileid)[source]

Return a new path pointer formed by starting at the path identified by this pointer, and then following the relative path given by fileid. The path components of fileid should be separated by forward slashes, regardless of the underlying file system’s path separator character.

abstract open(encoding=None)[source]

Return a seekable read-only stream that can be used to read the contents of the file identified by this path pointer.

Raises

IOError – If the path specified by this pointer does not contain a readable file.

class nltk.data.SeekableUnicodeStreamReader[source]

Bases: object

A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader), but still supports the seek() and tell() operations correctly. This is in contrast to codecs.StreamReader, which provides broken seek() and tell() methods.

This class was motivated by StreamBackedCorpusView, which makes extensive use of seek() and tell(), and needs to be able to handle unicode-encoded files.

Note: this class requires stateless decoders. To my knowledge, this shouldn’t cause a problem with any of Python’s built-in unicode encodings.

DEBUG = True

__init__(stream, encoding, errors='strict')[source]

bytebuffer

A buffer used to hold bytes that have been read but have not yet been decoded. This is only used when the final bytes from a read do not form a complete encoding for a character.
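The situation that bytebuffer handles can be reproduced with a multi-byte UTF-8 character split across two reads. The sketch below decodes the complete prefix of a chunk and returns the incomplete tail for the next read; `decode_prefix` is a hypothetical helper written for this illustration, not a method of the class.

```python
def decode_prefix(data, encoding="utf-8"):
    """Hypothetical helper: decode as much of data as possible and return
    (text, pending_bytes), where pending_bytes is an incomplete character."""
    try:
        return data.decode(encoding), b""
    except UnicodeDecodeError as exc:
        # An error that reaches the end of the data is treated as truncation.
        if exc.end == len(data):
            return data[:exc.start].decode(encoding), data[exc.start:]
        raise


encoded = "café".encode("utf-8")  # the final character occupies two bytes

# First read stops in the middle of the last character.
text, pending = decode_prefix(encoded[:-1])

# The pending byte is prepended to the next read.
rest, leftover = decode_prefix(pending + encoded[-1:])

print(repr(text), pending, repr(rest), leftover)
```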

char_seek_forward(offset)[source]

Move the read pointer forward by offset characters.

close()[source]

Close the underlying stream.

property closed

True if the underlying stream is closed.

decode

The function that is used to decode byte strings into unicode strings.

discard_line()[source]

encoding

The name of the encoding that should be used to decode the underlying stream.

errors

The error mode that should be used when decoding data from the underlying stream. Can be ‘strict’, ‘ignore’, or ‘replace’.

linebuffer

A buffer used by readline() to hold characters that have been read, but have not yet been returned by read() or readline(). This buffer consists of a list of unicode strings, where each string corresponds to a single line. The final element of the list may or may not be a complete line. Note that the existence of a linebuffer makes the tell() operation more complex, because it must backtrack to the beginning of the buffer to determine the correct file position in the underlying byte stream.

property mode

The mode of the underlying stream.

property name

The name of the underlying stream.

next()[source]

Return the next decoded line from the underlying stream.

read(size=None)[source]

Read up to size bytes, decode them using this reader’s encoding, and return the resulting unicode string.

Parameters

size (int) – The maximum number of bytes to read. If not specified, then read as many bytes as possible.

Return type

unicode

readline(size=None)[source]

Read a line of text, decode it using this reader’s encoding, and return the resulting unicode string.

Parameters

size (int) – The maximum number of bytes to read. If no newline is encountered before size bytes have been read, then the returned value may not be a complete line of text.

readlines(sizehint=None, keepends=True)[source]

Read this file’s contents, decode them using this reader’s encoding, and return them as a list of unicode lines.

Return type

list(unicode)

Parameters
  • sizehint – Ignored.

  • keepends – If false, then strip newlines.

seek(offset, whence=0)[source]

Move the stream to a new file position. If the reader is maintaining any buffers, then they will be cleared.

Parameters
  • offset – A byte count offset.

  • whence – If 0, then the offset is from the start of the file (offset should be positive), if 1, then the offset is from the current position (offset may be positive or negative); and if 2, then the offset is from the end of the file (offset should typically be negative).
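The three whence values follow the usual file-object convention, which can be tried out on an ordinary in-memory byte stream. The example uses io.BytesIO rather than SeekableUnicodeStreamReader, so it only illustrates the meaning of offset and whence.

```python
import io

stream = io.BytesIO(b"0123456789")

# whence=0: offset counted from the start of the stream.
from_start = stream.seek(3, 0)

# whence=1: offset counted from the current position (3 + 2).
from_current = stream.seek(2, 1)

# whence=2: offset counted from the end of the stream (10 - 4).
from_end = stream.seek(-4, 2)

tail = stream.read()
print(from_start, from_current, from_end, tail)
```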

stream

The underlying stream.

tell()[source]

Return the current file position on the underlying byte stream. If this reader is maintaining any buffers, then the returned file position will be the position of the beginning of those buffers.

xreadlines()[source]

Return self

nltk.data.clear_cache()[source]

Remove all objects from the resource cache. See also load().
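The effect of clearing a resource cache can be sketched with a plain dictionary. Everything in this example (`_sketch_cache`, `cached_load`, `clear_sketch_cache`, `fake_loader`) is hypothetical and only mimics the behaviour described for load() and clear_cache(); it does not reflect how NLTK stores its cache internally.

```python
_sketch_cache = {}


def cached_load(resource_url, loader, cache=True):
    """Return the cached value for resource_url, loading it on a cache miss."""
    if cache and resource_url in _sketch_cache:
        return _sketch_cache[resource_url]
    value = loader(resource_url)
    if cache:
        _sketch_cache[resource_url] = value
    return value


def clear_sketch_cache():
    """Remove every entry, forcing the next call to load again."""
    _sketch_cache.clear()


calls = []


def fake_loader(url):
    # Stand-in for an expensive load; records each real invocation.
    calls.append(url)
    return url.upper()


cached_load("nltk:demo", fake_loader)
cached_load("nltk:demo", fake_loader)   # served from the cache
calls_before_clear = len(calls)

clear_sketch_cache()
cached_load("nltk:demo", fake_loader)   # loaded again after clearing
calls_after_clear = len(calls)

print(calls_before_clear, calls_after_clear)
```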

nltk.data.find(resource_name, paths=None)[source]

Find the given resource by searching through the directories and zip files in paths, where an entry of None or the empty string means that resource_name is treated as an absolute path. Returns a corresponding path name. If the given resource is not found, raise a LookupError, whose message gives a pointer to the installation instructions for the NLTK downloader.

Zip File Handling:

  • If resource_name contains a component with a .zip extension, then it is assumed to be a zipfile; and the remaining path components are used to look inside the zipfile.

  • If any element of nltk.data.path has a .zip extension, then it is assumed to be a zipfile.

  • If a given resource name that does not contain any zipfile component is not found initially, then find() will make a second attempt to find that resource, by replacing each component p in the path with p.zip/p. For example, this allows find() to map the resource name corpora/chat80/cities.pl to a zip file path pointer to corpora/chat80.zip/chat80/cities.pl.

  • When using find() to locate a directory contained in a zipfile, the resource name must end with the forward slash character. Otherwise, find() will not locate the directory.

Parameters

resource_name (str or unicode) – The name of the resource to search for. Resource names are posix-style relative path names, such as corpora/brown. Directory names will be automatically converted to a platform-appropriate path separator.

Return type

str
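The second-attempt rule described under Zip File Handling, in which each component p is replaced with p.zip/p, can be sketched as a function that lists the candidate names. `zip_candidates` is a hypothetical helper; it only generates names and does not touch the file system.

```python
def zip_candidates(resource_name):
    """Hypothetical helper: for each path component p, build the name
    obtained by replacing p with p.zip/p."""
    pieces = resource_name.split("/")
    return [
        "/".join(pieces[:i] + [pieces[i] + ".zip"] + pieces[i:])
        for i in range(len(pieces))
    ]


candidates = zip_candidates("corpora/chat80/cities.pl")
for name in candidates:
    print(name)
```

The second candidate is the mapping given in the example above, corpora/chat80.zip/chat80/cities.pl.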

nltk.data.load(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_reader=None, encoding=None)[source]

Load a given resource from the NLTK data package. The following resource formats are currently supported:

  • pickle

  • json

  • yaml

  • cfg (context free grammars)

  • pcfg (probabilistic CFGs)

  • fcfg (feature-based CFGs)

  • fol (formulas of First Order Logic)

  • logic (Logical formulas to be parsed by the given logic_parser)

  • val (valuation of First Order Logic model)

  • text (the file contents as a unicode string)

  • raw (the raw file contents as a byte string)

If no format is specified, load() will attempt to determine a format based on the resource name’s file extension. If that fails, load() will raise a ValueError exception.

For all text formats (everything except pickle, json, yaml and raw), it tries to decode the raw contents using UTF-8, and if that doesn’t work, it tries with ISO-8859-1 (Latin-1), unless the encoding is specified.

Parameters
  • resource_url (str) – A URL specifying where the resource should be loaded from. The default protocol is “nltk:”, which searches for the file in the NLTK data package.

  • cache (bool) – If true, add this resource to a cache. If load() finds a resource in its cache, then it will return it from the cache rather than loading it.

  • verbose (bool) – If true, print a message when loading a resource. Messages are not displayed when a resource is retrieved from the cache.

  • logic_parser (LogicParser) – The parser that will be used to parse logical expressions.

  • fstruct_reader (FeatStructReader) – The parser that will be used to parse the feature structure of an fcfg.

  • encoding (str) – The encoding of the input; only used for text formats.
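The decoding fallback described for text formats (UTF-8 first, then ISO-8859-1, unless an encoding is given) can be sketched as follows. `decode_text` is a hypothetical helper that mirrors the documented behaviour; it is not the code used inside load().

```python
def decode_text(raw, encoding=None):
    """Hypothetical helper: decode raw bytes with the given encoding, or try
    UTF-8 and fall back to ISO-8859-1 when no encoding is specified."""
    if encoding is not None:
        return raw.decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")


utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("iso-8859-1")

print(decode_text(utf8_bytes))
print(decode_text(latin1_bytes))
print(decode_text(latin1_bytes, encoding="iso-8859-1"))
```

One consequence of this fallback is that a file in some other single-byte encoding will be decoded without error but possibly with wrong characters, which is a reason to pass encoding explicitly when it is known.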

nltk.data.path = ['C:\\Users\\Tom/nltk_data', 'c:\\github\\nltk\\.env39\\nltk_data', 'c:\\github\\nltk\\.env39\\share\\nltk_data', 'c:\\github\\nltk\\.env39\\lib\\nltk_data', 'C:\\Users\\Tom\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', 'D:\\nltk_data', 'E:\\nltk_data']

A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e.g., in their home directory under ~/nltk_data).
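The in-order search means that a copy of a resource in an earlier directory shadows copies in later ones. The sketch below reproduces that rule with two temporary directories; `first_match` and the directory layout are hypothetical, and the real lookup also handles zip files.

```python
import os
import tempfile


def first_match(search_path, resource_name):
    """Hypothetical helper: return the first existing file for resource_name,
    checking the directories of search_path in order."""
    for directory in search_path:
        candidate = os.path.join(directory, *resource_name.split("/"))
        if os.path.exists(candidate):
            return candidate
    raise LookupError(f"Resource {resource_name!r} not found")


# Two directories that both contain corpora/demo.txt.
user_dir = tempfile.mkdtemp()
system_dir = tempfile.mkdtemp()
for directory in (user_dir, system_dir):
    os.makedirs(os.path.join(directory, "corpora"))
    with open(os.path.join(directory, "corpora", "demo.txt"), "w") as handle:
        handle.write("demo")

found = first_match([user_dir, system_dir], "corpora/demo.txt")
print(found.startswith(user_dir))
```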

nltk.data.retrieve(resource_url, filename=None, verbose=True)[source]

Copy the given resource to a local file. If no filename is specified, then use the URL’s filename. If there is already a file named filename, then raise a ValueError.

Parameters

resource_url (str) – A URL specifying where the resource should be loaded from. The default protocol is “nltk:”, which searches for the file in the NLTK data package.
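The refuse-to-overwrite behaviour of retrieve() can be sketched for local files with the standard library. `copy_if_absent` is a hypothetical helper that copies between two local paths; it does not resolve resource URLs as retrieve() does.

```python
import os
import shutil
import tempfile


def copy_if_absent(source, filename):
    """Hypothetical helper: copy source to filename, raising ValueError
    if filename already exists."""
    if os.path.exists(filename):
        raise ValueError(f"File {filename!r} already exists")
    shutil.copyfile(source, filename)
    return filename


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "toy.cfg")
with open(source, "w") as handle:
    handle.write("S -> NP VP\n")

target = copy_if_absent(source, os.path.join(workdir, "copy.cfg"))

try:
    copy_if_absent(source, target)
    second_copy_refused = False
except ValueError:
    second_copy_refused = True

print(second_copy_refused)
```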

nltk.data.show_cfg(resource_url, escape='##')[source]

Write out a grammar file, ignoring escaped and empty lines.

Parameters
  • resource_url (str) – A URL specifying where the resource should be loaded from. The default protocol is “nltk:”, which searches for the file in the NLTK data package.

  • escape (str) – Prepended string that signals lines to be ignored.
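The filtering that show_cfg() describes, skipping empty lines and lines that start with the escape string, can be sketched on an in-memory grammar. `visible_grammar_lines` and the toy grammar text are hypothetical; the sketch returns the lines instead of writing them out.

```python
def visible_grammar_lines(text, escape="##"):
    """Hypothetical helper: keep lines that are neither empty nor escaped."""
    return [
        line
        for line in text.splitlines()
        if line and not line.startswith(escape)
    ]


grammar_text = (
    "## toy grammar\n"
    "S -> NP VP\n"
    "\n"
    "## lexicon\n"
    "NP -> 'dogs'\n"
    "VP -> 'bark'\n"
)

kept = visible_grammar_lines(grammar_text)
for line in kept:
    print(line)
```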