nltk.data module¶
Functions to find and load NLTK resource files, such as corpora, grammars, and saved processing objects. Resource files are identified using URLs, such as nltk:corpora/abc/rural.txt or https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg.
The following URL protocols are supported:
file:path
: Specifies the file whose path is path. Both relative and absolute paths may be used.
https://host/path
: Specifies the file stored on the web server host at path path.
nltk:path
: Specifies the file stored in the NLTK data package at path. NLTK will search for these files in the directories specified by nltk.data.path.
If no protocol is specified, then the default protocol nltk: will be used.
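The default-protocol rule can be sketched in a few lines. This is an illustration only: split_protocol and SUPPORTED_PROTOCOLS are hypothetical names, not part of nltk.data, and the module itself may normalize URLs differently.

```python
# Protocol names taken from the list above.
SUPPORTED_PROTOCOLS = ("file", "https", "nltk")


def split_protocol(resource_url):
    """Hypothetical helper: return (protocol, remainder) for a resource URL."""
    protocol, sep, remainder = resource_url.partition(":")
    if sep and protocol in SUPPORTED_PROTOCOLS:
        return protocol, remainder
    # No recognized protocol prefix, so fall back to the default (nltk:).
    return "nltk", resource_url


print(split_protocol("corpora/abc/rural.txt"))
print(split_protocol("file:grammars/toy.cfg"))
```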
This module provides two functions that can be used to access a resource file, given its URL: load() loads a given resource and adds it to a resource cache, and retrieve() copies a given resource to a local file.
- nltk.data.AUTO_FORMATS = {'cfg': 'cfg', 'fcfg': 'fcfg', 'fol': 'fol', 'json': 'json', 'logic': 'logic', 'pcfg': 'pcfg', 'pickle': 'pickle', 'text': 'text', 'txt': 'text', 'val': 'val', 'yaml': 'yaml'}¶
A dictionary mapping from file extensions to format names, used by load() when format="auto" to decide the format for a given resource URL.
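As a rough sketch of how such a mapping can drive format detection, the snippet below copies the dictionary shown above and looks up a URL's extension in it. guess_format is a hypothetical helper written for this illustration. It is not the lookup code that load() uses, and it ignores any special cases the library may handle.

```python
import os

# Copied from the AUTO_FORMATS value documented above.
AUTO_FORMATS = {
    "cfg": "cfg", "fcfg": "fcfg", "fol": "fol", "json": "json",
    "logic": "logic", "pcfg": "pcfg", "pickle": "pickle",
    "text": "text", "txt": "text", "val": "val", "yaml": "yaml",
}


def guess_format(resource_url):
    """Hypothetical helper: map a resource URL's extension to a format name."""
    extension = os.path.splitext(resource_url)[1].lstrip(".")
    if extension not in AUTO_FORMATS:
        raise ValueError(f"cannot determine a format for {resource_url!r}")
    return AUTO_FORMATS[extension]


print(guess_format("nltk:corpora/abc/rural.txt"))
```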
- nltk.data.BufferedGzipFile(*args, **kwargs)¶
A GzipFile subclass kept for compatibility with older NLTK releases. Deprecated: use gzip.GzipFile directly, as it also buffers in all supported Python versions.
- nltk.data.FORMATS = {'cfg': 'A context free grammar.', 'fcfg': 'A feature CFG.', 'fol': 'A list of first order logic expressions, parsed with nltk.sem.logic.Expression.fromstring.', 'json': 'A serialized python object, stored using the json module.', 'logic': 'A list of first order logic expressions, parsed with nltk.sem.logic.LogicParser. Requires an additional logic_parser parameter', 'pcfg': 'A probabilistic CFG.', 'pickle': 'A serialized python object, stored using the pickle module.', 'raw': 'The raw (byte string) contents of a file.', 'text': 'The raw (unicode string) contents of a file. ', 'val': 'A semantic valuation, parsed by nltk.sem.Valuation.fromstring.', 'yaml': 'A serialized python object, stored using the yaml module.'}¶
A dictionary describing the formats that are supported by NLTK’s load() method. Keys are format names, and values are format descriptions.
- class nltk.data.FileSystemPathPointer[source]¶
Bases: PathPointer, str
A path pointer that identifies a file which can be accessed directly via a given absolute path.
- __init__(_path)[source]¶
Create a new path pointer for the given absolute path.
- Raises:
IOError – If the given path does not exist.
- file_size()[source]¶
Return the size of the file pointed to by this path pointer, in bytes.
- Raises:
IOError – If the path specified by this pointer does not contain a readable file.
- join(fileid)[source]¶
Return a new path pointer formed by starting at the path identified by this pointer, and then following the relative path given by fileid. The path components of fileid should be separated by forward slashes, regardless of the underlying file system’s path separator character.
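The forward-slash convention for fileid can be illustrated with a small sketch. join_fileid is a hypothetical helper, not the method's implementation. It only shows how a slash-separated fileid might be turned into a platform-appropriate path.

```python
import os


def join_fileid(base_path, fileid):
    """Hypothetical helper: follow a forward-slash fileid from base_path."""
    # Split on "/" regardless of platform, then let os.path.join
    # insert the local separator.
    return os.path.join(base_path, *fileid.split("/"))


print(join_fileid("nltk_data", "corpora/brown"))
```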
- open(encoding=None)[source]¶
Return a seekable read-only stream that can be used to read the contents of the file identified by this path pointer.
- Raises:
IOError – If the path specified by this pointer does not contain a readable file.
- property path¶
The absolute path identified by this path pointer.
- class nltk.data.GzipFileSystemPathPointer[source]¶
Bases: FileSystemPathPointer
A subclass of FileSystemPathPointer that identifies a gzip-compressed file located at a given absolute path. GzipFileSystemPathPointer is appropriate for loading large gzip-compressed pickle objects efficiently.
- class nltk.data.OpenOnDemandZipFile[source]¶
Bases: ZipFile
A subclass of zipfile.ZipFile that closes its file pointer whenever it is not using it, and re-opens it when it needs to read data from the zipfile. This is useful for reducing the number of open file handles when many zip files are being accessed at once. OpenOnDemandZipFile must be constructed from a filename, not a file-like object (to allow re-opening). OpenOnDemandZipFile is read-only (i.e., write() and writestr() are disabled).
- class nltk.data.PathPointer[source]¶
Bases: object
An abstract base class for ‘path pointers,’ used by NLTK’s data package to identify specific paths. Two subclasses exist:
FileSystemPathPointer identifies a file that can be accessed directly via a given absolute path.
ZipFilePathPointer identifies a file contained within a zipfile, that can be accessed by reading that zipfile.
- abstract file_size()[source]¶
Return the size of the file pointed to by this path pointer, in bytes.
- Raises:
IOError – If the path specified by this pointer does not contain a readable file.
- abstract join(fileid)[source]¶
Return a new path pointer formed by starting at the path identified by this pointer, and then following the relative path given by fileid. The path components of fileid should be separated by forward slashes, regardless of the underlying file system’s path separator character.
- class nltk.data.SeekableUnicodeStreamReader[source]¶
Bases: object
A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader), but still supports the seek() and tell() operations correctly. This is in contrast to codecs.StreamReader, which provides broken seek() and tell() methods.
This class was motivated by StreamBackedCorpusView, which makes extensive use of seek() and tell(), and needs to be able to handle unicode-encoded files.
Note: this class requires stateless decoders. This is not known to cause a problem with any of Python’s built-in unicode encodings.
- DEBUG = True¶
- bytebuffer¶
A buffer that holds bytes that have been read but have not yet been decoded. This is only used when the final bytes from a read do not form a complete encoding for a character.
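To see why such a buffer is needed, consider a read that ends in the middle of a multi-byte character. The sketch below is not the class's own code: decode_prefix and decode_chunks are hypothetical helpers that carry leftover bytes from one chunk to the next, using only the standard library.

```python
def decode_prefix(data, encoding="utf-8"):
    """Hypothetical helper: decode what we can, return (text, leftover bytes)."""
    try:
        return data.decode(encoding), b""
    except UnicodeDecodeError as exc:
        # If the failure reaches the end of the data, assume a character was
        # cut off and keep its bytes for the next read. This simple check
        # cannot tell a truncated character from an invalid trailing byte.
        if exc.end == len(data):
            return data[:exc.start].decode(encoding), data[exc.start:]
        raise


def decode_chunks(chunks, encoding="utf-8"):
    """Decode a sequence of byte chunks, buffering incomplete characters."""
    bytebuffer = b""
    pieces = []
    for chunk in chunks:
        text, bytebuffer = decode_prefix(bytebuffer + chunk, encoding)
        pieces.append(text)
    return "".join(pieces), bytebuffer


# The two-byte UTF-8 sequence for "é" is split across the two chunks.
print(decode_chunks([b"caf\xc3", b"\xa9!"]))
```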
- property closed¶
True if the underlying stream is closed.
- decode¶
The function that is used to decode byte strings into unicode strings.
- encoding¶
The name of the encoding that should be used to decode the underlying stream.
- errors¶
The error mode that should be used when decoding data from the underlying stream. Can be ‘strict’, ‘ignore’, or ‘replace’.
- linebuffer¶
A buffer used by readline() to hold characters that have been read, but have not yet been returned by read() or readline(). This buffer consists of a list of unicode strings, where each string corresponds to a single line. The final element of the list may or may not be a complete line. Note that the existence of a linebuffer makes the tell() operation more complex, because it must backtrack to the beginning of the buffer to determine the correct file position in the underlying byte stream.
- property mode¶
The mode of the underlying stream.
- property name¶
The name of the underlying stream.
- read(size=None)[source]¶
Read up to size bytes, decode them using this reader’s encoding, and return the resulting unicode string.
- Parameters:
size (int) – The maximum number of bytes to read. If not specified, then read as many bytes as possible.
- Return type:
unicode
- readline(size=None)[source]¶
Read a line of text, decode it using this reader’s encoding, and return the resulting unicode string.
- Parameters:
size (int) – The maximum number of bytes to read. If no newline is encountered before size bytes have been read, then the returned value may not be a complete line of text.
- readlines(sizehint=None, keepends=True)[source]¶
Read this file’s contents, decode them using this reader’s encoding, and return them as a list of unicode lines.
- Return type:
list(unicode)
- Parameters:
sizehint – Ignored.
keepends – If false, then strip newlines.
- seek(offset, whence=0)[source]¶
Move the stream to a new file position. If the reader is maintaining any buffers, then they will be cleared.
- Parameters:
offset – A byte count offset.
whence – If 0, then the offset is from the start of the file (offset should be positive); if 1, then the offset is from the current position (offset may be positive or negative); and if 2, then the offset is from the end of the file (offset should typically be negative).
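The three whence values follow the usual file-object convention. The snippet below demonstrates that convention on a plain io.BytesIO stream from the standard library. It does not construct a SeekableUnicodeStreamReader, so treat it only as a reminder of what each value means for byte offsets.

```python
import io

stream = io.BytesIO(b"0123456789")

stream.seek(3, 0)            # whence=0: offset counted from the start
from_start = stream.tell()

stream.seek(2, 1)            # whence=1: offset relative to the current position
from_current = stream.tell()

stream.seek(-4, 2)           # whence=2: offset counted back from the end
from_end = stream.tell()

print(from_start, from_current, from_end)
```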
- stream¶
The underlying stream.
- nltk.data.find(resource_name, paths=None)[source]¶
Find the given resource by searching through the directories and zip files in paths, where a None or empty string specifies an absolute path. Returns a corresponding path name. If the given resource is not found, raise a LookupError, whose message gives a pointer to the installation instructions for the NLTK downloader.
Zip File Handling:
If resource_name contains a component with a .zip extension, then it is assumed to be a zipfile; and the remaining path components are used to look inside the zipfile.
If any element of nltk.data.path has a .zip extension, then it is assumed to be a zipfile.
If a given resource name that does not contain any zipfile component is not found initially, then find() will make a second attempt to find that resource, by replacing each component p in the path with p.zip/p. For example, this allows find() to map the resource name corpora/chat80/cities.pl to a zip file path pointer to corpora/chat80.zip/chat80/cities.pl.
When using find() to locate a directory contained in a zipfile, the resource name must end with the forward slash character. Otherwise, find() will not locate the directory.
- Parameters:
resource_name (str or unicode) – The name of the resource to search for. Resource names are posix-style relative path names, such as corpora/brown. Directory names will be automatically converted to a platform-appropriate path separator.
- Return type:
str
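The p to p.zip/p rewriting described under Zip File Handling can be sketched as follows. zip_candidates is a hypothetical helper that only generates the candidate names. It does not search any directories, and it is not a copy of the logic inside find().

```python
def zip_candidates(resource_name):
    """Hypothetical helper: names obtained by replacing one component p with p.zip/p."""
    pieces = resource_name.split("/")
    candidates = []
    for i, piece in enumerate(pieces):
        # Insert "<piece>.zip" in front of the i-th component.
        candidates.append("/".join(pieces[:i] + [piece + ".zip"] + pieces[i:]))
    return candidates


for name in zip_candidates("corpora/chat80/cities.pl"):
    print(name)
```

One of the generated names is corpora/chat80.zip/chat80/cities.pl, the mapping given in the example above.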
- nltk.data.load(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_reader=None, encoding=None)[source]¶
Load a given resource from the NLTK data package. The following resource formats are currently supported:
pickle
json
yaml
cfg (context free grammars)
pcfg (probabilistic CFGs)
fcfg (feature-based CFGs)
fol (formulas of First Order Logic)
logic (Logical formulas to be parsed by the given logic_parser)
val (valuation of First Order Logic model)
text (the file contents as a unicode string)
raw (the raw file contents as a byte string)
If no format is specified, load() will attempt to determine a format based on the resource name’s file extension. If that fails, load() will raise a ValueError exception.
For all text formats (everything except pickle, json, yaml and raw), it tries to decode the raw contents using UTF-8, and if that doesn’t work, it tries with ISO-8859-1 (Latin-1), unless the encoding is specified.
- Parameters:
resource_url (str) – A URL specifying where the resource should be loaded from. The default protocol is “nltk:”, which searches for the file in the NLTK data package.
cache (bool) – If true, add this resource to a cache. If load() finds a resource in its cache, then it will return it from the cache rather than loading it.
verbose (bool) – If true, print a message when loading a resource. Messages are not displayed when a resource is retrieved from the cache.
logic_parser (LogicParser) – The parser that will be used to parse logical expressions.
fstruct_reader (FeatStructReader) – The parser that will be used to parse the feature structure of an fcfg.
encoding (str) – The encoding of the input; only used for text formats.
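The decoding fallback described for text formats can be approximated with the standard library alone. decode_text below is a hypothetical helper that mirrors the documented order (explicit encoding if given, otherwise UTF-8, then ISO-8859-1). It is a sketch of the behavior, not the code load() runs.

```python
def decode_text(raw, encoding=None):
    """Hypothetical helper: decode bytes following the documented fallback order."""
    if encoding is not None:
        # An explicit encoding disables the fallback.
        return raw.decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")


print(decode_text("café".encode("utf-8")))
print(decode_text(b"caf\xe9"))
```

Because ISO-8859-1 accepts any byte sequence, a file in some other single-byte encoding would decode without error but with wrong characters, which is one reason to pass encoding explicitly when it is known.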
- nltk.data.path = ['/Users/stevenbird/nltk_data', '/opt/local/Library/Frameworks/Python.framework/Versions/3.12/nltk_data', '/opt/local/Library/Frameworks/Python.framework/Versions/3.12/share/nltk_data', '/opt/local/Library/Frameworks/Python.framework/Versions/3.12/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']¶
A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e.g., in their home directory under ~/nltk_data).
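The in-order search is what lets an earlier directory shadow a later one. A minimal sketch of that idea, using a hypothetical first_match helper and ignoring zip files entirely, might look like this:

```python
import os


def first_match(resource_name, search_dirs):
    """Hypothetical helper: return the first existing path for resource_name."""
    for directory in search_dirs:
        candidate = os.path.join(directory, *resource_name.split("/"))
        if os.path.exists(candidate):
            # Earlier directories win, so a user copy can override a system copy.
            return candidate
    raise LookupError(f"resource {resource_name!r} not found")
```

With a search list such as [home_dir, system_dir] (both names are placeholders), a copy of corpora/brown under home_dir would be returned even if system_dir also contains one.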
- nltk.data.retrieve(resource_url, filename=None, verbose=True)[source]¶
Copy the given resource to a local file. If no filename is specified, then use the URL’s filename. If there is already a file named filename, then raise a ValueError.
- Parameters:
resource_url (str) – A URL specifying where the resource should be loaded from. The default protocol is “nltk:”, which searches for the file in the NLTK data package.
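The refuse-to-overwrite rule can be sketched without NLTK. copy_without_overwrite is a hypothetical helper that writes bytes to a new file and raises ValueError if the target already exists. It stands in for the documented behavior and is not the body of retrieve().

```python
import os
import tempfile


def copy_without_overwrite(data, filename):
    """Hypothetical helper: write data to filename unless it already exists."""
    if os.path.exists(filename):
        raise ValueError(f"file {filename!r} already exists")
    with open(filename, "wb") as outfile:
        outfile.write(data)


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "toy.cfg")
    copy_without_overwrite(b"S -> NP VP\n", target)
    try:
        copy_without_overwrite(b"other\n", target)
    except ValueError as err:
        print("refused:", err)
```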
- nltk.data.show_cfg(resource_url, escape='##')[source]¶
Write out a grammar file, ignoring escaped and empty lines.
- Parameters:
resource_url (str) – A URL specifying where the resource should be loaded from. The default protocol is “nltk:”, which searches for the file in the NLTK data package.
escape (str) – Prepended string that signals lines to be ignored.
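The filtering that show_cfg() is described as doing can be sketched on an in-memory string. visible_lines is a hypothetical helper and the grammar text is made up for the example. The sketch assumes that "escaped" means a line starting with the escape string, as the parameter description suggests.

```python
def visible_lines(grammar_text, escape="##"):
    """Hypothetical helper: drop empty lines and lines starting with escape."""
    return [
        line
        for line in grammar_text.splitlines()
        if line and not line.startswith(escape)
    ]


# Illustrative grammar text, not taken from an NLTK resource.
sample = "## A toy grammar\n\nS -> NP VP\n## end\nNP -> 'dogs'\n"
for line in visible_lines(sample):
    print(line)
```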