nltk Package¶
The Natural Language Toolkit (NLTK) is an open source Python library for Natural Language Processing. A free online book is available. (If you use the library for academic research, please cite the book.)
Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media Inc. http://nltk.org/book
@version: 3.6
collocations Module¶
Tools to identify collocations — words that often appear consecutively — within corpora. They may also be used to find other associations between word occurrences. See Manning and Schutze ch. 5 at http://nlp.stanford.edu/fsnlp/promo/colloc.pdf and the Text::NSP Perl package at http://ngram.sourceforge.net
Finding collocations requires first calculating the frequencies of words and their appearance in the context of other words. Often the collection of words will then require filtering to only retain useful content terms. Each ngram of words may then be scored according to some association measure, in order to determine the relative likelihood of each ngram being a collocation.
The BigramCollocationFinder and TrigramCollocationFinder classes provide these functionalities, dependent on being provided a function which scores an ngram given appropriate frequency counts. A number of standard association measures are provided in bigram_measures and trigram_measures.
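For example, a minimal sketch of ranking bigram collocations by pointwise mutual information (this assumes the Brown corpus has already been installed with nltk.download('brown'); the frequency cutoff of 3 is arbitrary):
>>> import nltk
>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.metrics import BigramAssocMeasures
>>> finder = BigramCollocationFinder.from_words(nltk.corpus.brown.words())
>>> finder.apply_freq_filter(3)                        # drop bigrams seen fewer than 3 times
>>> top10 = finder.nbest(BigramAssocMeasures.pmi, 10)  # ten highest-scoring bigrams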
-
class
nltk.collocations.
BigramCollocationFinder
(word_fd, bigram_fd, window_size=2)[source]¶ Bases:
nltk.collocations.AbstractCollocationFinder
A tool for the finding and ranking of bigram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly.
-
default_ws
= 2¶
-
-
class
nltk.collocations.
QuadgramCollocationFinder
(word_fd, quadgram_fd, ii, iii, ixi, ixxi, iixi, ixii)[source]¶ Bases:
nltk.collocations.AbstractCollocationFinder
A tool for the finding and ranking of quadgram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly.
-
default_ws
= 4¶
-
-
class
nltk.collocations.
TrigramCollocationFinder
(word_fd, bigram_fd, wildcard_fd, trigram_fd)[source]¶ Bases:
nltk.collocations.AbstractCollocationFinder
A tool for the finding and ranking of trigram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly.
-
bigram_finder
()[source]¶ Constructs a bigram collocation finder with the bigram and unigram data from this finder. Note that this does not include any filtering applied to this finder.
-
default_ws
= 3¶
-
data Module¶
Functions to find and load NLTK resource files, such as corpora,
grammars, and saved processing objects. Resource files are identified
using URLs, such as nltk:corpora/abc/rural.txt
or
http://nltk.org/sample/toy.cfg
. The following URL protocols are
supported:
file:path
: Specifies the file whose path is path. Both relative and absolute paths may be used.
http://host/path
: Specifies the file stored on the web server host at path path.
nltk:path
: Specifies the file stored in the NLTK data package at path. NLTK will search for these files in the directories specified by nltk.data.path.
If no protocol is specified, then the default protocol nltk:
will
be used.
This module provides two functions that can be used to access a
resource file, given its URL: load()
loads a given resource, and
adds it to a resource cache; and retrieve()
copies a given resource
to a local file.
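A short sketch of typical usage, assuming the abc corpus behind the example URL above has been installed (any resource URL in a supported format would do):
>>> from nltk import data
>>> text = data.load('nltk:corpora/abc/rural.txt', format='text')   # cached after the first call
>>> data.retrieve('nltk:corpora/abc/rural.txt', 'rural.txt')         # copy to a local file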
-
nltk.data.
AUTO_FORMATS
= {'cfg': 'cfg', 'fcfg': 'fcfg', 'fol': 'fol', 'json': 'json', 'logic': 'logic', 'pcfg': 'pcfg', 'pickle': 'pickle', 'text': 'text', 'txt': 'text', 'val': 'val', 'yaml': 'yaml'}¶ A dictionary mapping from file extensions to format names, used by load() when format=”auto” to decide the format for a given resource url.
-
nltk.data.
BufferedGzipFile
(*args, **kwargs)¶ A
GzipFile
subclass for compatibility with older nltk releases. Use GzipFile directly as it also buffers in all supported Python versions. @deprecated: Use gzip.GzipFile instead as it also uses a buffer.
-
nltk.data.
FORMATS
= {'cfg': 'A context free grammar.', 'fcfg': 'A feature CFG.', 'fol': 'A list of first order logic expressions, parsed with nltk.sem.logic.Expression.fromstring.', 'json': 'A serialized python object, stored using the json module.', 'logic': 'A list of first order logic expressions, parsed with nltk.sem.logic.LogicParser. Requires an additional logic_parser parameter', 'pcfg': 'A probabilistic CFG.', 'pickle': 'A serialized python object, stored using the pickle module.', 'raw': 'The raw (byte string) contents of a file.', 'text': 'The raw (unicode string) contents of a file. ', 'val': 'A semantic valuation, parsed by nltk.sem.Valuation.fromstring.', 'yaml': 'A serialized python object, stored using the yaml module.'}¶ A dictionary describing the formats that are supported by NLTK’s load() method. Keys are format names, and values are format descriptions.
-
class
nltk.data.
FileSystemPathPointer
(_path)[source]¶ Bases:
nltk.data.PathPointer
, str
A path pointer that identifies a file which can be accessed directly via a given absolute path.
-
file_size
()[source]¶ Return the size of the file pointed to by this path pointer, in bytes.
- Raises
IOError – If the path specified by this pointer does not contain a readable file.
-
join
(fileid)[source]¶ Return a new path pointer formed by starting at the path identified by this pointer, and then following the relative path given by
fileid
. The path components of fileid
should be separated by forward slashes, regardless of the underlying file system’s path separator character.
-
open
(encoding=None)[source]¶ Return a seekable read-only stream that can be used to read the contents of the file identified by this path pointer.
- Raises
IOError – If the path specified by this pointer does not contain a readable file.
-
property
path
¶ The absolute path identified by this path pointer.
-
-
class
nltk.data.
GzipFileSystemPathPointer
(_path)[source]¶ Bases:
nltk.data.FileSystemPathPointer
A subclass of
FileSystemPathPointer
that identifies a gzip-compressed file located at a given absolute path. GzipFileSystemPathPointer
is appropriate for loading large gzip-compressed pickle objects efficiently.
-
class
nltk.data.
OpenOnDemandZipFile
(filename)[source]¶ Bases:
zipfile.ZipFile
A subclass of
zipfile.ZipFile
that closes its file pointer whenever it is not using it; and re-opens it when it needs to read data from the zipfile. This is useful for reducing the number of open file handles when many zip files are being accessed at once. OpenOnDemandZipFile must be constructed from a filename, not a file-like object (to allow re-opening). OpenOnDemandZipFile is read-only (i.e. write() and writestr() are disabled).
-
class
nltk.data.
PathPointer
[source]¶ Bases:
object
An abstract base class for ‘path pointers,’ used by NLTK’s data package to identify specific paths. Two subclasses exist:
FileSystemPathPointer
identifies a file that can be accessed directly via a given absolute path. ZipFilePathPointer
identifies a file contained within a zipfile, that can be accessed by reading that zipfile.
-
abstract
file_size
()[source]¶ Return the size of the file pointed to by this path pointer, in bytes.
- Raises
IOError – If the path specified by this pointer does not contain a readable file.
-
abstract
join
(fileid)[source]¶ Return a new path pointer formed by starting at the path identified by this pointer, and then following the relative path given by
fileid
. The path components of fileid
should be separated by forward slashes, regardless of the underlying file system’s path separator character.
-
-
class
nltk.data.
SeekableUnicodeStreamReader
(stream, encoding, errors='strict')[source]¶ Bases:
object
A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader); but still supports the seek() and tell() operations correctly. This is in contrast to codecs.StreamReader, which provides broken seek() and tell() methods.
This class was motivated by StreamBackedCorpusView, which makes extensive use of seek() and tell(), and needs to be able to handle unicode-encoded files.
Note: this class requires stateless decoders. To my knowledge, this shouldn’t cause a problem with any of python’s builtin unicode encodings.
-
DEBUG
= True¶
-
bytebuffer
¶ A buffer for bytes that have been read but have not yet been decoded. This is only used when the final bytes from a read do not form a complete encoding for a character.
-
property
closed
¶ True if the underlying stream is closed.
-
decode
¶ The function that is used to decode byte strings into unicode strings.
-
encoding
¶ The name of the encoding that should be used to decode the underlying stream.
-
errors
¶ The error mode that should be used when decoding data from the underlying stream. Can be ‘strict’, ‘ignore’, or ‘replace’.
-
linebuffer
¶ A buffer used by
readline()
to hold characters that have been read, but have not yet been returned byread()
orreadline()
. This buffer consists of a list of unicode strings, where each string corresponds to a single line. The final element of the list may or may not be a complete line. Note that the existence of a linebuffer makes thetell()
operation more complex, because it must backtrack to the beginning of the buffer to determine the correct file position in the underlying byte stream.
-
property
mode
¶ The mode of the underlying stream.
-
property
name
¶ The name of the underlying stream.
-
read
(size=None)[source]¶ Read up to
size
bytes, decode them using this reader’s encoding, and return the resulting unicode string.- Parameters
size (int) – The maximum number of bytes to read. If not specified, then read as many bytes as possible.
- Return type
unicode
-
readline
(size=None)[source]¶ Read a line of text, decode it using this reader’s encoding, and return the resulting unicode string.
- Parameters
size (int) – The maximum number of bytes to read. If no newline is encountered before
size
bytes have been read, then the returned value may not be a complete line of text.
-
readlines
(sizehint=None, keepends=True)[source]¶ Read this file’s contents, decode them using this reader’s encoding, and return them as a list of unicode lines.
- Return type
list(unicode)
- Parameters
sizehint – Ignored.
keepends – If false, then strip newlines.
-
seek
(offset, whence=0)[source]¶ Move the stream to a new file position. If the reader is maintaining any buffers, then they will be cleared.
- Parameters
offset – A byte count offset.
whence – If 0, then the offset is from the start of the file (offset should be positive), if 1, then the offset is from the current position (offset may be positive or negative); and if 2, then the offset is from the end of the file (offset should typically be negative).
-
stream
¶ The underlying stream.
-
-
nltk.data.
find
(resource_name, paths=None)[source]¶ Find the given resource by searching through the directories and zip files in paths, where a None or empty string specifies an absolute path. Returns a corresponding path name. If the given resource is not found, raise a
LookupError
, whose message gives a pointer to the installation instructions for the NLTK downloader.
Zip File Handling:
If resource_name contains a component with a .zip extension, then it is assumed to be a zipfile; and the remaining path components are used to look inside the zipfile.
If any element of nltk.data.path has a .zip extension, then it is assumed to be a zipfile.
If a given resource name that does not contain any zipfile component is not found initially, then find() will make a second attempt to find that resource, by replacing each component p in the path with p.zip/p. For example, this allows find() to map the resource name corpora/chat80/cities.pl to a zip file path pointer to corpora/chat80.zip/chat80/cities.pl.
When using find() to locate a directory contained in a zipfile, the resource name must end with the forward slash character. Otherwise, find() will not locate the directory.
- Parameters
resource_name (str or unicode) – The name of the resource to search for. Resource names are posix-style relative path names, such as
corpora/brown
. Directory names will be automatically converted to a platform-appropriate path separator.- Return type
str
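For instance, a brief sketch assuming the abc corpus has been installed; the resource may resolve either to a plain file or to a path inside corpora/abc.zip, as described above:
>>> from nltk import data
>>> ptr = data.find('corpora/abc/rural.txt')    # returns a path pointer
>>> stream = ptr.open(encoding='utf8')
>>> first_line = stream.readline()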
-
nltk.data.
load
(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_reader=None, encoding=None)[source]¶ Load a given resource from the NLTK data package. The following resource formats are currently supported:
pickle
json
yaml
cfg (context free grammars)
pcfg (probabilistic CFGs)
fcfg (feature-based CFGs)
fol (formulas of First Order Logic)
logic (Logical formulas to be parsed by the given logic_parser)
val (valuation of First Order Logic model)
text (the file contents as a unicode string)
raw (the raw file contents as a byte string)
If no format is specified, load() will attempt to determine a format based on the resource name’s file extension. If that fails, load() will raise a ValueError exception.
For all text formats (everything except pickle, json, yaml and raw), it tries to decode the raw contents using UTF-8, and if that doesn’t work, it tries with ISO-8859-1 (Latin-1), unless the encoding is specified.
- Parameters
resource_url (str) – A URL specifying where the resource should be loaded from. The default protocol is “nltk:”, which searches for the file in the NLTK data package.
cache (bool) – If true, add this resource to a cache. If load() finds a resource in its cache, then it will return it from the cache rather than loading it.
verbose (bool) – If true, print a message when loading a resource. Messages are not displayed when a resource is retrieved from the cache.
logic_parser (LogicParser) – The parser that will be used to parse logical expressions.
fstruct_reader (FeatStructReader) – The parser that will be used to parse the feature structure of an fcfg.
encoding (str) – the encoding of the input; only used for text formats.
-
nltk.data.
path
= ['/Users/sbird1/nltk_data', '/opt/local/Library/Frameworks/Python.framework/Versions/3.8/nltk_data', '/opt/local/Library/Frameworks/Python.framework/Versions/3.8/share/nltk_data', '/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']¶ A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e.g., in their home directory under ~/nltk_data).
-
nltk.data.
retrieve
(resource_url, filename=None, verbose=True)[source]¶ Copy the given resource to a local file. If no filename is specified, then use the URL’s filename. If there is already a file named
filename
, then raise aValueError
.- Parameters
resource_url (str) – A URL specifying where the resource should be loaded from. The default protocol is “nltk:”, which searches for the file in the NLTK data package.
-
nltk.data.
show_cfg
(resource_url, escape='##')[source]¶ Write out a grammar file, ignoring escaped and empty lines.
- Parameters
resource_url (str) – A URL specifying where the resource should be loaded from. The default protocol is “nltk:”, which searches for the file in the NLTK data package.
escape (str) – Prepended string that signals lines to be ignored
downloader Module¶
The NLTK corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with NLTK.
Downloading Packages¶
If called with no arguments, download()
will display an interactive
interface which can be used to download and install new packages.
If Tkinter is available, then a graphical interface will be shown,
otherwise a simple text interface will be provided.
Individual packages can be downloaded by calling the download()
function with a single argument, giving the package identifier for the
package that should be downloaded:
>>> download('treebank')
[nltk_data] Downloading package 'treebank'...
[nltk_data] Unzipping corpora/treebank.zip.
NLTK also provides a number of “package collections”, consisting of
a group of related packages. To download all packages in a
collection, simply call download()
with the collection’s
identifier:
>>> download('all-corpora')
[nltk_data] Downloading package 'abc'...
[nltk_data] Unzipping corpora/abc.zip.
[nltk_data] Downloading package 'alpino'...
[nltk_data] Unzipping corpora/alpino.zip.
...
[nltk_data] Downloading package 'words'...
[nltk_data] Unzipping corpora/words.zip.
Download Directory¶
By default, packages are installed in either a system-wide directory
(if Python has sufficient access to write to it); or in the current
user’s home directory. However, the download_dir
argument may be
used to specify a different installation target, if desired.
See Downloader.default_download_dir()
for a more detailed
description of how the default download directory is chosen.
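For example, a hedged sketch of installing a package to a non-default location; the /tmp/nltk_data path is purely illustrative, and the directory must also appear in nltk.data.path for the resource to be found later:
>>> import nltk
>>> ok = nltk.download('treebank', download_dir='/tmp/nltk_data', quiet=True)  # illustrative path
>>> nltk.data.path.append('/tmp/nltk_data')                                    # make it searchable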
NLTK Download Server¶
Before downloading any packages, the corpus and module downloader
contacts the NLTK download server, to retrieve an index file
describing the available packages. By default, this index file is
loaded from https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
.
If necessary, it is possible to create a new Downloader
object,
specifying a different URL for the package index file.
Usage:
python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
or:
python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
-
class
nltk.downloader.
Collection
(id, children, name=None, **kw)[source]¶ Bases:
object
A directory entry for a collection of downloadable packages. These entries are extracted from the XML index file that is downloaded by
Downloader
.-
children
¶ A list of the
Collections
or Packages
directly contained by this collection.
-
id
¶ A unique identifier for this collection.
-
name
¶ A string name for this collection.
-
packages
¶ A list of
Packages
contained by this collection or any collections it recursively contains.
-
-
class
nltk.downloader.
Downloader
(server_index_url=None, download_dir=None)[source]¶ Bases:
object
A class used to access the NLTK data server, which can be used to download corpora and other data packages.
-
DEFAULT_URL
= 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml'¶ The default URL for the NLTK data server’s index. An alternative URL can be specified when creating a new
Downloader
object.
-
INDEX_TIMEOUT
= 3600¶ The amount of time after which the cached copy of the data server index will be considered ‘stale,’ and will be re-downloaded.
-
INSTALLED
= 'installed'¶ A status string indicating that a package or collection is installed and up-to-date.
-
NOT_INSTALLED
= 'not installed'¶ A status string indicating that a package or collection is not installed.
-
PARTIAL
= 'partial'¶ A status string indicating that a collection is partially installed (i.e., only some of its packages are installed.)
-
STALE
= 'out of date'¶ A status string indicating that a package or collection is corrupt or out-of-date.
-
default_download_dir
()[source]¶ Return the directory to which packages will be downloaded by default. This value can be overridden using the constructor, or on a case-by-case basis using the download_dir argument when calling download().
On Windows, the default download directory is PYTHONHOME/lib/nltk, where PYTHONHOME is the directory containing Python, e.g. C:\Python25.
On all other platforms, the default directory is the first of the following which exists or which can be created with write permission: /usr/share/nltk_data, /usr/local/share/nltk_data, /usr/lib/nltk_data, /usr/local/lib/nltk_data, ~/nltk_data.
-
download
(info_or_id=None, download_dir=None, quiet=False, force=False, prefix='[nltk_data] ', halt_on_error=True, raise_on_error=False, print_error_to=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>)[source]¶
-
property
download_dir
¶ The default directory to which packages will be downloaded. This defaults to the value returned by
default_download_dir()
. To override this default on a case-by-case basis, use the download_dir argument when calling download().
-
index
()[source]¶ Return the XML index describing the packages available from the data server. If necessary, this index will be downloaded from the data server.
-
list
(download_dir=None, show_packages=True, show_collections=True, header=True, more_prompt=False, skip_installed=False)[source]¶
-
status
(info_or_id, download_dir=None)[source]¶ Return a constant describing the status of the given package or collection. Status can be one of
INSTALLED, NOT_INSTALLED, STALE, or PARTIAL.
-
property
url
¶ The URL for the data server’s index file.
-
-
class
nltk.downloader.
DownloaderGUI
(dataserver, use_threads=True)[source]¶ Bases:
object
Graphical interface for downloading packages from the NLTK data server.
-
COLUMNS
= ['', 'Identifier', 'Name', 'Size', 'Status', 'Unzipped Size', 'Copyright', 'Contact', 'License', 'Author', 'Subdir', 'Checksum']¶ A list of the names of columns. This controls the order in which the columns will appear. If this is edited, then
_package_to_columns()
may need to be edited to match.
-
COLUMN_WEIGHTS
= {'': 0, 'Name': 5, 'Size': 0, 'Status': 0}¶ A dictionary specifying how columns should be resized when the table is resized. Columns with weight 0 will not be resized at all; and columns with high weight will be resized more. Default weight (for columns not explicitly listed) is 1.
-
COLUMN_WIDTHS
= {'': 1, 'Identifier': 20, 'Name': 45, 'Size': 10, 'Status': 12, 'Unzipped Size': 10}¶ A dictionary specifying how wide each column should be, in characters. The default width (for columns not explicitly listed) is specified by
DEFAULT_COLUMN_WIDTH
.
-
DEFAULT_COLUMN_WIDTH
= 30¶ The default width for columns that are not explicitly listed in
COLUMN_WIDTHS
.
-
HELP
= 'This tool can be used to download a variety of corpora and models\nthat can be used with NLTK. Each corpus or model is distributed\nin a single zip file, known as a "package file." You can\ndownload packages individually, or you can download pre-defined\ncollections of packages.\n\nWhen you download a package, it will be saved to the "download\ndirectory." A default download directory is chosen when you run\n\nthe downloader; but you may also select a different download\ndirectory. On Windows, the default download directory is\n\n\n"package."\n\nThe NLTK downloader can be used to download a variety of corpora,\nmodels, and other data packages.\n\nKeyboard shortcuts::\n [return]\t Download\n [up]\t Select previous package\n [down]\t Select next package\n [left]\t Select previous tab\n [right]\t Select next tab\n'¶
-
INITIAL_COLUMNS
= ['', 'Identifier', 'Name', 'Size', 'Status']¶ The set of columns that should be displayed by default.
-
c
= 'Status'¶
-
-
class
nltk.downloader.
DownloaderMessage
[source]¶ Bases:
object
A status message object, used by
incr_download
to communicate its progress.
-
class
nltk.downloader.
ErrorMessage
(package, message)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server encountered an error
-
class
nltk.downloader.
FinishCollectionMessage
(collection)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has finished working on a collection of packages.
-
class
nltk.downloader.
FinishDownloadMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has finished downloading a package.
-
class
nltk.downloader.
FinishPackageMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has finished working on a package.
-
class
nltk.downloader.
FinishUnzipMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has finished unzipping a package.
-
class
nltk.downloader.
Package
(id, url, name=None, subdir='', size=None, unzipped_size=None, checksum=None, svn_revision=None, copyright='Unknown', contact='Unknown', license='Unknown', author='Unknown', unzip=True, **kw)[source]¶ Bases:
object
A directory entry for a downloadable package. These entries are extracted from the XML index file that is downloaded by
Downloader
. Each package consists of a single file; but if that file is a zip file, then it can be automatically decompressed when the package is installed.
-
author
¶ Author of this package.
-
checksum
¶ The MD-5 checksum of the package file.
-
contact
¶ Name & email of the person who should be contacted with questions about this package.
-
copyright
¶ Copyright holder for this package.
-
filename
¶ The filename that should be used for this package’s file. It is formed by joining
self.subdir
with self.id
, and using the same extension as url
.
-
id
¶ A unique identifier for this package.
-
license
¶ License information for this package.
-
name
¶ A string name for this package.
-
size
¶ The filesize (in bytes) of the package file.
-
subdir
¶ The subdirectory where this package should be installed. E.g.,
'corpora'
or 'taggers'
.
-
svn_revision
¶ A subversion revision number for this package.
-
unzip
¶ A flag indicating whether this corpus should be unzipped by default.
-
unzipped_size
¶ The total filesize of the files contained in the package’s zipfile.
-
url
¶ A URL that can be used to download this package’s file.
-
class
nltk.downloader.
ProgressMessage
(progress)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Indicates how much progress the data server has made
-
class
nltk.downloader.
SelectDownloadDirMessage
(download_dir)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Indicates what download directory the data server is using
-
class
nltk.downloader.
StaleMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
The package download file is out-of-date or corrupt
-
class
nltk.downloader.
StartCollectionMessage
(collection)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has started working on a collection of packages.
-
class
nltk.downloader.
StartDownloadMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has started downloading a package.
-
class
nltk.downloader.
StartPackageMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has started working on a package.
-
class
nltk.downloader.
StartUnzipMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
Data server has started unzipping a package.
-
class
nltk.downloader.
UpToDateMessage
(package)[source]¶ Bases:
nltk.downloader.DownloaderMessage
The package download file is already up-to-date
-
nltk.downloader.
build_index
(root, base_url)[source]¶ Create a new data.xml index file, by combining the xml description files for various packages and collections.
root
should be the path to a directory containing the package xml and zip files; and the collection xml files. The root
directory is expected to have the following subdirectories:
root/
  packages/ .................. subdirectory for packages
    corpora/ ................. zip & xml files for corpora
    grammars/ ................ zip & xml files for grammars
    taggers/ ................. zip & xml files for taggers
    tokenizers/ .............. zip & xml files for tokenizers
    etc.
  collections/ ............... xml files for collections
For each package, there should be two files: package.zip (where package is the package name) which contains the package itself as a compressed zip file; and package.xml, which is an xml description of the package. The zipfile package.zip should expand to a single subdirectory named package/. The base filename package must match the identifier given in the package’s xml file.
For each collection, there should be a single file collection.zip describing the collection, where collection is the name of the collection.
All identifiers (for both packages and collections) must be unique.
-
nltk.downloader.
download
(info_or_id=None, download_dir=None, quiet=False, force=False, prefix='[nltk_data] ', halt_on_error=True, raise_on_error=False, print_error_to=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>)¶
-
nltk.downloader.
md5_hexdigest
(file)[source]¶ Calculate and return the MD5 checksum for a given file.
file
may either be a filename or an open stream.
featstruct Module¶
Basic data classes for representing feature structures, and for
performing basic operations on those feature structures. A feature
structure is a mapping from feature identifiers to feature values,
where each feature value is either a basic value (such as a string or
an integer), or a nested feature structure. There are two types of
feature structure, implemented by two subclasses of FeatStruct
:
feature dictionaries, implemented by FeatDict, act like Python dictionaries. Feature identifiers may be strings or instances of the Feature class.
feature lists, implemented by FeatList, act like Python lists. Feature identifiers are integers.
Feature structures are typically used to represent partial information about objects. A feature identifier that is not mapped to a value stands for a feature whose value is unknown (not a feature without a value). Two feature structures that represent (potentially overlapping) information about the same object can be combined by unification. When two inconsistent feature structures are unified, the unification fails and returns None.
Features can be specified using “feature paths”, or tuples of feature identifiers that specify a path through the nested feature structures to a value. Feature structures may contain reentrant feature values. A “reentrant feature value” is a single feature value that can be accessed via multiple feature paths. Unification preserves the reentrance relations imposed by both of the unified feature structures. In the feature structure resulting from unification, any modifications to a reentrant feature value will be visible using any of its feature paths.
Feature structure variables are encoded using the nltk.sem.Variable
class. The variables’ values are tracked using a bindings
dictionary, which maps variables to their values. When two feature
structures are unified, a fresh bindings dictionary is created to
track their values; and before unification completes, all bound
variables are replaced by their values. Thus, the bindings
dictionaries are usually strictly internal to the unification process.
However, it is possible to track the bindings of variables if you
choose to, by supplying your own initial bindings dictionary to the
unify()
function.
When unbound variables are unified with one another, they become aliased. This is encoded by binding one variable to the other.
Lightweight Feature Structures¶
Many of the functions defined by nltk.featstruct
can be applied
directly to simple Python dictionaries and lists, rather than to
full-fledged FeatDict
and FeatList
objects. In other words,
Python dicts
and lists
can be used as “light-weight” feature
structures.
>>> from nltk.featstruct import unify
>>> unify(dict(x=1, y=dict()), dict(a='a', y=dict(b='b')))
{'y': {'b': 'b'}, 'x': 1, 'a': 'a'}
However, you should keep in mind the following caveats:
Python dictionaries & lists ignore reentrance when checking for equality between values. But two FeatStructs with different reentrances are considered nonequal, even if all their base values are equal.
FeatStructs can be easily frozen, allowing them to be used as keys in hash tables. Python dictionaries and lists can not.
FeatStructs display reentrance in their string representations; Python dictionaries and lists do not.
FeatStructs may not be mixed with Python dictionaries and lists (e.g., when performing unification).
FeatStructs provide a number of useful methods, such as
walk()
andcyclic()
, which are not available for Python dicts and lists.
In general, if your feature structures will contain any reentrances,
or if you plan to use them as dictionary keys, it is strongly
recommended that you use full-fledged FeatStruct
objects.
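A brief sketch of working with full FeatStruct objects (the feature names and values are only illustrative):
>>> from nltk.featstruct import FeatStruct
>>> fs1 = FeatStruct('[agr=[num=sg, per=3]]')
>>> fs2 = FeatStruct('[agr=[gen=fem]]')
>>> fs3 = fs1.unify(fs2)            # combines the information from both
>>> fs3.freeze()                    # frozen structures are hashable...
>>> table = {fs3: 'some value'}     # ...so they can be used as dict keys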
-
class
nltk.featstruct.
FeatDict
(features=None, **morefeatures)[source]¶ Bases:
nltk.featstruct.FeatStruct
, dict
A feature structure that acts like a Python dictionary. I.e., a mapping from feature identifiers to feature values, where a feature identifier can be a string or a
Feature
; and where a feature value can be either a basic value (such as a string or an integer), or a nested feature structure. A feature identifier for a FeatDict
is sometimes called a “feature name”.
Two feature dicts are considered equal if they assign the same values to all features, and have the same reentrances.
- See
FeatStruct
for information about feature paths, reentrance, cyclic feature structures, mutability, freezing, and hashing.
-
clear
() → None. Remove all items from D.¶ If self is frozen, raise ValueError.
-
get
(name_or_path, default=None)[source]¶ If the feature with the given name or path exists, return its value; otherwise, return
default
.
-
pop
(k[, d]) → v, remove specified key and return the corresponding value.¶ If key is not found, d is returned if given, otherwise KeyError is raised. If self is frozen, raise ValueError.
-
popitem
(*args, **kwargs)¶ Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty. If self is frozen, raise ValueError.
-
setdefault
(*args, **kwargs)¶ Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default. If self is frozen, raise ValueError.
-
class
nltk.featstruct.
FeatList
(features=None, **morefeatures)[source]¶ Bases:
nltk.featstruct.FeatStruct
, list
A list of feature values, where each feature value is either a basic value (such as a string or an integer), or a nested feature structure.
Feature lists may contain reentrant feature values. A “reentrant feature value” is a single feature value that can be accessed via multiple feature paths. Feature lists may also be cyclic.
Two feature lists are considered equal if they assign the same values to all features, and have the same reentrances.
- See
FeatStruct
for information about feature paths, reentrance, cyclic feature structures, mutability, freezing, and hashing.
-
append
(*args, **kwargs)¶ Append object to the end of the list. If self is frozen, raise ValueError.
-
extend
(*args, **kwargs)¶ Extend list by appending elements from the iterable. If self is frozen, raise ValueError.
-
insert
(*args, **kwargs)¶ Insert object before index. If self is frozen, raise ValueError.
-
pop
(*args, **kwargs)¶ Remove and return item at index (default last).
Raises IndexError if list is empty or index is out of range. If self is frozen, raise ValueError.
-
remove
(*args, **kwargs)¶ Remove first occurrence of value.
Raises ValueError if the value is not present. If self is frozen, raise ValueError.
-
reverse
(*args, **kwargs)¶ Reverse IN PLACE. If self is frozen, raise ValueError.
-
sort
(*args, **kwargs)¶ Sort the list in ascending order and return None.
The sort is in-place (i.e. the list itself is modified) and stable (i.e. the order of two equal elements is maintained).
If a key function is given, apply it once to each list item and sort them, ascending or descending, according to their function values.
The reverse flag can be set to sort in descending order. If self is frozen, raise ValueError.
-
class
nltk.featstruct.
FeatStruct
(features=None, **morefeatures)[source]¶ Bases:
nltk.sem.logic.SubstituteBindingsI
A mapping from feature identifiers to feature values, where each feature value is either a basic value (such as a string or an integer), or a nested feature structure. There are two types of feature structure:
feature dictionaries, implemented by
FeatDict
, act like Python dictionaries. Feature identifiers may be strings or instances of the Feature
class.
feature lists, implemented by
FeatList
, act like Python lists. Feature identifiers are integers.
Feature structures may be indexed using either simple feature identifiers or ‘feature paths.’ A feature path is a sequence of feature identifiers that stand for a corresponding sequence of indexing operations. In particular,
fstruct[(f1,f2,...,fn)]
is equivalent to fstruct[f1][f2]...[fn].
Feature structures may contain reentrant feature structures. A “reentrant feature structure” is a single feature structure object that can be accessed via multiple feature paths. Feature structures may also be cyclic. A feature structure is “cyclic” if there is any feature path from the feature structure to itself.
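A short illustration of feature-path indexing (the nesting here is arbitrary):
>>> from nltk.featstruct import FeatStruct
>>> fs = FeatStruct('[a=[b=[c=1]]]')
>>> fs['a', 'b', 'c']               # same as fs['a']['b']['c']
1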
Two feature structures are considered equal if they assign the same values to all features, and have the same reentrancies.
By default, feature structures are mutable. They may be made immutable with the
freeze()
method. Once they have been frozen, they may be hashed, and thus used as dictionary keys.-
copy
(deep=True)[source]¶ Return a new copy of
self
. The new copy will not be frozen.- Parameters
deep – If true, create a deep copy; if false, create a shallow copy.
-
equal_values
(other, check_reentrance=False)[source]¶ Return True if
self
andother
assign the same value to every feature. In particular, return true if self[p]==other[p]
for every feature path p such thatself[p]
orother[p]
is a base value (i.e., not a nested feature structure).- Parameters
check_reentrance – If True, then also return False if there is any difference between the reentrances of
self
andother
.- Note
the
==
is equivalent toequal_values()
withcheck_reentrance=True
.
-
freeze
()[source]¶ Make this feature structure, and any feature structures it contains, immutable. Note: this method does not attempt to ‘freeze’ any feature value that is not a
FeatStruct
; it is recommended that you use only immutable feature values.
-
frozen
()[source]¶ Return True if this feature structure is immutable. Feature structures can be made immutable with the
freeze()
method. Immutable feature structures may not be made mutable again, but new mutable copies can be produced with thecopy()
method.
-
remove_variables
()[source]¶ Return the feature structure that is obtained by deleting any feature whose value is a
Variable
.- Return type
FeatStruct
-
rename_variables
(vars=None, used_vars=(), new_vars=None)[source]¶ - See
nltk.featstruct.rename_variables()
-
class
nltk.featstruct.
FeatStructReader
(features=(*slash*, *type*), fdict_class=<class 'nltk.featstruct.FeatStruct'>, flist_class=<class 'nltk.featstruct.FeatList'>, logic_parser=None)[source]¶ Bases:
object
-
VALUE_HANDLERS
= [('read_fstruct_value', re.compile('\\s*(?:\\((\\d+)\\)\\s*)?(\\??[\\w-]+)?(\\[)')), ('read_var_value', re.compile('\\?[a-zA-Z_][a-zA-Z0-9_]*')), ('read_str_value', re.compile('[uU]?[rR]?([\'"])')), ('read_int_value', re.compile('-?\\d+')), ('read_sym_value', re.compile('[a-zA-Z_][a-zA-Z0-9_]*')), ('read_app_value', re.compile('<(app)\\((\\?[a-z][a-z]*)\\s*,\\s*(\\?[a-z][a-z]*)\\)>')), ('read_logic_value', re.compile('<(.*?)(?<!-)>')), ('read_set_value', re.compile('{')), ('read_tuple_value', re.compile('\\('))]¶ A table indicating how feature values should be processed. Each entry in the table is a pair (handler, regexp). The first entry with a matching regexp will have its handler called. Handlers should have the following signature:
def handler(s, position, reentrances, match): ...
and should return a tuple (value, position), where position is the string position where the value ended. (n.b.: order is important here!)
-
fromstring
(s, fstruct=None)[source]¶ Convert a string representation of a feature structure (as displayed by repr) into a
FeatStruct
. This process imposes the following restrictions on the string representation:Feature names cannot contain any of the following: whitespace, parentheses, quote marks, equals signs, dashes, commas, and square brackets. Feature names may not begin with plus signs or minus signs.
Only the following basic feature values are supported: strings, integers, variables, None, and unquoted alphanumeric strings.
For reentrant values, the first mention must specify a reentrance identifier and a value; and any subsequent mentions must use arrows (
'->'
) to reference the reentrance identifier.
-
read_partial
(s, position=0, reentrances=None, fstruct=None)[source]¶ Helper function that reads in a feature structure.
- Parameters
s – The string to read.
position – The position in the string to start parsing.
reentrances – A dictionary from reentrance ids to values. Defaults to an empty dictionary.
- Returns
A tuple (val, pos) of the feature structure created by parsing and the position where the parsed feature structure ends.
- Return type
tuple
-
-
class
nltk.featstruct.
Feature
(name, default=None, display=None)[source]¶ Bases:
object
A feature identifier that’s specialized to put additional constraints, default values, etc.
-
property
default
¶ Default value for this feature.
-
property
display
¶ Custom display location: can be prefix, or slash.
-
property
name
¶ The name of this feature.
-
-
class
nltk.featstruct.
RangeFeature
(name, default=None, display=None)[source]¶ Bases:
nltk.featstruct.Feature
-
RANGE_RE
= re.compile('(-?\\d+):(-?\\d+)')¶
-
-
class
nltk.featstruct.
SlashFeature
(name, default=None, display=None)[source]¶ Bases:
nltk.featstruct.Feature
-
nltk.featstruct.
conflicts
(fstruct1, fstruct2, trace=0)[source]¶ Return a list of the feature paths of all features which are assigned incompatible values by
fstruct1
andfstruct2
.- Return type
list(tuple)
-
nltk.featstruct.
subsumes
(fstruct1, fstruct2)[source]¶ Return True if
fstruct1
subsumes fstruct2. I.e., return true if unifying fstruct1 with fstruct2 would result in a feature structure equal to fstruct2.
- Return type
bool
-
nltk.featstruct.
unify
(fstruct1, fstruct2, bindings=None, trace=False, fail=None, rename_vars=True, fs_class='default')[source]¶ Unify
fstruct1
with fstruct2, and return the resulting feature structure. This unified feature structure is the minimal feature structure that contains all feature value assignments from both fstruct1 and fstruct2, and that preserves all reentrancies.
If no such feature structure exists (because fstruct1 and fstruct2 specify incompatible values for some feature), then unification fails, and unify returns None.
Bound variables are replaced by their values. Aliased variables are replaced by their representative variable (if unbound) or the value of their representative variable (if bound). I.e., if variable v is in bindings, then v is replaced by bindings[v]. This will be repeated until the variable is replaced by an unbound variable or a non-variable value.
Unbound variables are bound when they are unified with values; and aliased when they are unified with variables. I.e., if variable v is not in bindings, and is unified with a variable or value x, then bindings[v] is set to x.
If bindings is unspecified, then all variables are assumed to be unbound. I.e., bindings defaults to an empty dict.
>>> from nltk.featstruct import FeatStruct
>>> FeatStruct('[a=?x]').unify(FeatStruct('[b=?x]'))
[a=?x, b=?x2]
- Parameters
bindings (dict(Variable -> any)) – A set of variable bindings to be used and updated during unification.
trace (bool) – If true, generate trace output.
rename_vars (bool) – If True, then rename any variables in
fstruct2
that are also used infstruct1
, in order to avoid collisions on variable names.
grammar Module¶
Basic data classes for representing context free grammars. A
“grammar” specifies which trees can represent the structure of a
given text. Each of these trees is called a “parse tree” for the
text (or simply a “parse”). In a “context free” grammar, the set of
parse trees for any piece of a text can depend only on that piece, and
not on the rest of the text (i.e., the piece’s context). Context free
grammars are often used to find possible syntactic structures for
sentences. In this context, the leaves of a parse tree are word
tokens; and the node values are phrasal categories, such as NP
and VP
.
The CFG
class is used to encode context free grammars. Each
CFG
consists of a start symbol and a set of productions.
The “start symbol” specifies the root node value for parse trees. For example,
the start symbol for syntactic parsing is usually S
. Start
symbols are encoded using the Nonterminal
class, which is discussed
below.
A Grammar’s “productions” specify what parent-child relationships a parse
tree can contain. Each production specifies that a particular
node can be the parent of a particular set of children. For example,
the production <S> -> <NP> <VP>
specifies that an S
node can
be the parent of an NP
node and a VP
node.
Grammar productions are implemented by the Production
class.
Each Production
consists of a left hand side and a right hand
side. The “left hand side” is a Nonterminal
that specifies the
node type for a potential parent; and the “right hand side” is a list
that specifies allowable children for that parent. This list
consists of Nonterminals
and text types: each Nonterminal
indicates that the corresponding child may be a TreeToken
with the
specified node type; and each text type indicates that the
corresponding child may be a Token
with that type.
The Nonterminal
class is used to distinguish node values from leaf
values. This prevents the grammar from accidentally using a leaf
value (such as the English word “A”) as the node of a subtree. Within
a CFG
, all node values are wrapped in the Nonterminal
class. Note, however, that the trees that are specified by the grammar do
not include these Nonterminal
wrappers.
Grammars can also be given a more procedural interpretation. According to this interpretation, a Grammar specifies any tree structure tree that can be produced by the following procedure:
The operation of replacing the left hand side (lhs) of a production with the right hand side (rhs) in a tree (tree) is known as “expanding” lhs to rhs in tree.
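For example, a small toy grammar built with CFG.fromstring() (the rules are only illustrative):
>>> from nltk import CFG
>>> grammar = CFG.fromstring("""
...     S -> NP VP
...     NP -> Det N
...     VP -> V NP
...     Det -> 'the'
...     N -> 'dog' | 'cat'
...     V -> 'chased'
... """)
>>> grammar.start()
S
>>> len(grammar.productions())
7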
-
class
nltk.grammar.
CFG
(start, productions, calculate_leftcorners=True)[source]¶ Bases:
object
A context-free grammar. A grammar consists of a start state and a set of productions. The set of terminals and nonterminals is implicitly specified by the productions.
If you need efficient key-based access to productions, you can use a subclass to implement it.
-
classmethod
binarize
(grammar, padding='@$@')[source]¶ Convert all non-binary rules into binary by introducing new tokens. Example:
Original:
    A => B C D
After Conversion:
    A => B A@$@B
    A@$@B => C D
-
check_coverage
(tokens)[source]¶ Check whether the grammar rules cover the given list of tokens. If not, then raise an exception.
-
chomsky_normal_form
(new_token_padding='@$@', flexible=False)[source]¶ Returns a new Grammar that is in Chomsky Normal Form.
- Parameters
new_token_padding – Customise new rule formation during binarisation
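A rough sketch of converting a small toy grammar to Chomsky Normal Form (the grammar is illustrative; grammars with empty or mixed rules are not handled by this method):
>>> from nltk import CFG
>>> g = CFG.fromstring("""
...     S -> NP VP
...     VP -> V NP PP
...     PP -> P NP
...     NP -> 'dogs' | 'cats'
...     V -> 'chase'
...     P -> 'into'
... """)
>>> g_cnf = g.chomsky_normal_form()   # the ternary rule VP -> V NP PP gets binarized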
-
classmethod
eliminate_start
(grammar)[source]¶ Eliminate start rule in case it appears on RHS. Example: if S -> S0 S1 and S0 -> S1 S, then another rule S0_Sigma -> S is added.
-
classmethod
fromstring
(input, encoding=None)[source]¶ Return the grammar instance corresponding to the input string(s).
- Parameters
input – a grammar, either in the form of a string or as a list of strings.
-
is_binarised
()[source]¶ Return True if all productions are at most binary. Note that there can still be empty and unary productions.
-
is_chomsky_normal_form
()[source]¶ Return True if the grammar is of Chomsky Normal Form, i.e. all productions are of the form A -> B C, or A -> “s”.
-
is_flexible_chomsky_normal_form
()[source]¶ Return True if all productions are of the forms A -> B C, A -> B, or A -> “s”.
-
is_leftcorner
(cat, left)[source]¶ True if left is a leftcorner of cat, where left can be a terminal or a nonterminal.
- Parameters
cat (Nonterminal) – the parent of the leftcorner
left (Terminal or Nonterminal) – the suggested leftcorner
- Return type
bool
-
is_nonlexical
()[source]¶ Return True if all lexical rules are “preterminals”, that is, unary rules which can be separated in a preprocessing step.
This means that all productions are of the forms A -> B1 … Bn (n>=0), or A -> “s”.
Note: is_lexical() and is_nonlexical() are not opposites. There are grammars which are neither, and grammars which are both.
-
leftcorner_parents
(cat)[source]¶ Return the set of all nonterminals for which the given category is a left corner. This is the inverse of the leftcorner relation.
- Parameters
cat (Nonterminal) – the suggested leftcorner
- Returns
the set of all parents to the leftcorner
- Return type
set(Nonterminal)
-
leftcorners
(cat)[source]¶ Return the set of all nonterminals that the given nonterminal can start with, including itself.
This is the reflexive, transitive closure of the immediate leftcorner relation: (A > B) iff (A -> B beta)
- Parameters
cat (Nonterminal) – the parent of the leftcorners
- Returns
the set of all leftcorners
- Return type
set(Nonterminal)
-
productions
(lhs=None, rhs=None, empty=False)[source]¶ Return the grammar productions, filtered by the left-hand side or the first item in the right-hand side.
- Parameters
lhs – Only return productions with the given left-hand side.
rhs – Only return productions with the given first item in the right-hand side.
empty – Only return productions with an empty right-hand side.
- Returns
A list of productions matching the given constraints.
- Return type
list(Production)
-
classmethod
remove_unitary_rules
(grammar)[source]¶ Remove nonlexical unitary rules and convert them to lexical
-
-
class
nltk.grammar.
DependencyGrammar
(productions)[source]¶ Bases:
object
A dependency grammar. A DependencyGrammar consists of a set of productions. Each production specifies a head/modifier relationship between a pair of words.
-
class
nltk.grammar.
DependencyProduction
(lhs, rhs)[source]¶ Bases:
nltk.grammar.Production
A dependency grammar production. Each production maps a single head word to an unordered list of one or more modifier words.
-
class
nltk.grammar.
Nonterminal
(symbol)[source]¶ Bases:
object
A non-terminal symbol for a context free grammar.
Nonterminal
is a wrapper class for node values; it is used byProduction
objects to distinguish node values from leaf values. The node value that is wrapped by aNonterminal
is known as its “symbol”. Symbols are typically strings representing phrasal categories (such as “NP” or “VP”). However, more complex symbol types are sometimes used (e.g., for lexicalized grammars). Since symbols are node values, they must be immutable and hashable. Two Nonterminals
are considered equal if their symbols are equal.- See
CFG
,Production
- Variables
_symbol – The node value corresponding to this
Nonterminal
. This value must be immutable and hashable.
-
class
nltk.grammar.
PCFG
(start, productions, calculate_leftcorners=True)[source]¶ Bases:
nltk.grammar.CFG
A probabilistic context-free grammar. A PCFG consists of a start state and a set of productions with probabilities. The set of terminals and nonterminals is implicitly specified by the productions.
PCFG productions use the
ProbabilisticProduction
class. PCFGs
impose the constraint that the set of productions with any given left-hand-side must have probabilities that sum to 1 (allowing for a small margin of error). If you need efficient key-based access to productions, you can use a subclass to implement it.
- Variables
EPSILON – The acceptable margin of error for checking that productions with a given left-hand side have probabilities that sum to 1.
-
EPSILON
= 0.01¶
-
class
nltk.grammar.
ProbabilisticProduction
(lhs, rhs, **prob)[source]¶ Bases:
nltk.grammar.Production
, nltk.probability.ImmutableProbabilisticMixIn
A probabilistic context free grammar production. A PCFG
ProbabilisticProduction
is essentially just a Production
that has an associated probability, which represents how likely it is that this production will be used. In particular, the probability of a ProbabilisticProduction
records the likelihood that its right-hand side is the correct instantiation for any given occurrence of its left-hand side.- See
Production
-
class
nltk.grammar.
Production
(lhs, rhs)[source]¶ Bases:
object
A grammar production. Each production maps a single symbol on the “left-hand side” to a sequence of symbols on the “right-hand side”. (In the case of context-free productions, the left-hand side must be a
Nonterminal
, and the right-hand side is a sequence of terminals and Nonterminals.) “terminals” can be any immutable hashable object that is not a Nonterminal. Typically, terminals are strings representing words, such as “dog” or “under”
.- See
CFG
- See
DependencyGrammar
- See
Nonterminal
- Variables
_lhs – The left-hand side of the production.
_rhs – The right-hand side of the production.
-
is_lexical
()[source]¶ Return True if the right-hand side contains at least one terminal token.
- Return type
bool
-
is_nonlexical
()[source]¶ Return True if the right-hand side only contains
Nonterminals
- Return type
bool
-
nltk.grammar.
induce_pcfg
(start, productions)[source]¶ Induce a PCFG grammar from a list of productions.
The probability of a production A -> B C in a PCFG is:
P(B, C | A) = count(A -> B C) / count(A -> *), where * is any right hand side.
- Parameters
start (Nonterminal) – The start symbol
productions (list(Production)) – The list of productions that defines the grammar
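For example, a sketch of inducing a PCFG from the productions of the first few parsed sentences in the Penn Treebank sample (assumes the treebank corpus is installed):
>>> from nltk import Nonterminal, induce_pcfg
>>> from nltk.corpus import treebank
>>> productions = []
>>> for tree in treebank.parsed_sents()[:10]:
...     productions += tree.productions()
>>> grammar = induce_pcfg(Nonterminal('S'), productions)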
-
nltk.grammar.
nonterminals
(symbols)[source]¶ Given a string containing a list of symbol names, return a list of
Nonterminals
constructed from those symbols.- Parameters
symbols (str) – The symbol name string. This string can be delimited by either spaces or commas.
- Returns
A list of
Nonterminals
constructed from the symbol names given in symbols. The Nonterminals
are sorted in the same order as the symbol names.- Return type
list(Nonterminal)
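For instance, a minimal sketch:
>>> from nltk import nonterminals
>>> S, NP, VP, PP = nonterminals('S, NP, VP, PP')
>>> S.symbol()
'S'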
-
nltk.grammar.
read_grammar
(input, nonterm_parser, probabilistic=False, encoding=None)[source]¶ Return a pair consisting of a starting category and a list of
Productions
.- Parameters
input – a grammar, either in the form of a string or else as a list of strings.
nonterm_parser – a function for parsing nonterminals. It should take a
(string, position)
as argument and return a (nonterminal, position)
as result.
probabilistic (bool) – are the grammar rules probabilistic?
encoding (str) – the encoding of the grammar, if it is a binary string
probability Module¶
Classes for representing and processing probabilistic information.
The FreqDist
class is used to encode “frequency distributions”,
which count the number of times that each outcome of an experiment
occurs.
The ProbDistI
class defines a standard interface for “probability
distributions”, which encode the probability of each outcome for an
experiment. There are two types of probability distribution:
“derived probability distributions” are created from frequency distributions. They attempt to model the probability distribution that generated the frequency distribution.
“analytic probability distributions” are created directly from parameters (such as variance).
The ConditionalFreqDist
class and ConditionalProbDistI
interface
are used to encode conditional distributions. Conditional probability
distributions can be derived or analytic; but currently the only
implementation of the ConditionalProbDistI
interface is
ConditionalProbDist
, a derived distribution.
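For example, a brief sketch of deriving a probability distribution from a frequency distribution, using MLEProbDist (the sample data is arbitrary):
>>> from nltk.probability import FreqDist, MLEProbDist
>>> fdist = FreqDist(['a', 'b', 'a', 'c', 'a'])   # counts: a=3, b=1, c=1
>>> fdist['a']
3
>>> pdist = MLEProbDist(fdist)
>>> pdist.prob('a')
0.6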
-
class
nltk.probability.
ConditionalFreqDist
(cond_samples=None)[source]¶ Bases:
collections.defaultdict
A collection of frequency distributions for a single experiment run under different conditions. Conditional frequency distributions are used to record the number of times each sample occurred, given the condition under which the experiment was run. For example, a conditional frequency distribution could be used to record the frequency of each word (type) in a document, given its length. Formally, a conditional frequency distribution can be defined as a function that maps from each condition to the FreqDist for the experiment under that condition.
Conditional frequency distributions are typically constructed by repeatedly running an experiment under a variety of conditions, and incrementing the sample outcome counts for the appropriate conditions. For example, the following code will produce a conditional frequency distribution that encodes how often each word type occurs, given the length of that word type:
>>> from nltk.probability import ConditionalFreqDist
>>> from nltk.tokenize import word_tokenize
>>> sent = "the the the dog dog some other words that we do not care about"
>>> cfdist = ConditionalFreqDist()
>>> for word in word_tokenize(sent):
...     condition = len(word)
...     cfdist[condition][word] += 1
An equivalent way to do this is with the initializer:
>>> cfdist = ConditionalFreqDist((len(word), word) for word in word_tokenize(sent))
The frequency distribution for each condition is accessed using the indexing operator:
>>> cfdist[3] FreqDist({'the': 3, 'dog': 2, 'not': 1}) >>> cfdist[3].freq('the') 0.5 >>> cfdist[3]['dog'] 2
When the indexing operator is used to access the frequency distribution for a condition that has not been accessed before,
ConditionalFreqDist
creates a new empty FreqDist for that condition.-
N
()[source]¶ Return the total number of sample outcomes that have been recorded by this
ConditionalFreqDist
.- Return type
int
-
conditions
()[source]¶ Return a list of the conditions that have been accessed for this
ConditionalFreqDist
. Use the indexing operator to access the frequency distribution for a given condition. Note that the frequency distributions for some conditions may contain zero sample outcomes.- Return type
list
-
plot
(*args, **kwargs)[source]¶ Plot the given samples from the conditional frequency distribution. For a cumulative plot, specify cumulative=True. (Requires Matplotlib to be installed.)
- Parameters
samples (list) – The samples to plot
title (str) – The title for the graph
conditions (list) – The conditions to plot (default is all)
-
-
class
nltk.probability.
ConditionalProbDist
(cfdist, probdist_factory, *factory_args, **factory_kw_args)[source]¶ Bases:
nltk.probability.ConditionalProbDistI
A conditional probability distribution modeling the experiments that were used to generate a conditional frequency distribution. A ConditionalProbDist is constructed from a
ConditionalFreqDist
and aProbDist
factory:The
ConditionalFreqDist
specifies the frequency distribution for each condition.The
ProbDist
factory is a function that takes a condition’s frequency distribution, and returns its probability distribution. AProbDist
class’s name (such asMLEProbDist
orHeldoutProbDist
) can be used to specify that class’s constructor.
The first argument to the
ProbDist
factory is the frequency distribution that it should model; and the remaining arguments are specified by thefactory_args
parameter to theConditionalProbDist
constructor. For example, the following code constructs aConditionalProbDist
, where the probability distribution for each condition is anELEProbDist
with 10 bins:>>> from nltk.corpus import brown >>> from nltk.probability import ConditionalFreqDist >>> from nltk.probability import ConditionalProbDist, ELEProbDist >>> cfdist = ConditionalFreqDist(brown.tagged_words()[:5000]) >>> cpdist = ConditionalProbDist(cfdist, ELEProbDist, 10) >>> cpdist['passed'].max() 'VBD' >>> cpdist['passed'].prob('VBD') 0.423...
-
class
nltk.probability.
ConditionalProbDistI
[source]¶ Bases:
dict
A collection of probability distributions for a single experiment run under different conditions. Conditional probability distributions are used to estimate the likelihood of each sample, given the condition under which the experiment was run. For example, a conditional probability distribution could be used to estimate the probability of each word type in a document, given the length of the word type. Formally, a conditional probability distribution can be defined as a function that maps from each condition to the
ProbDist
for the experiment under that condition.
-
class
nltk.probability.
CrossValidationProbDist
(freqdists, bins)[source]¶ Bases:
nltk.probability.ProbDistI
The cross-validation estimate for the probability distribution of the experiment used to generate a set of frequency distributions. The “cross-validation estimate” for the probability of a sample is found by averaging the held-out estimates for the sample in each pair of frequency distributions.
-
SUM_TO_ONE
= False¶ True if the probabilities of the samples in this probability distribution will always sum to one.
-
discount
()[source]¶ Return the ratio by which counts are discounted on average: c*/c
- Return type
float
-
freqdists
()[source]¶ Return the list of frequency distributions that this
ProbDist
is based on.- Return type
list(FreqDist)
-
-
class
nltk.probability.
DictionaryConditionalProbDist
(probdist_dict)[source]¶ Bases:
nltk.probability.ConditionalProbDistI
An alternative ConditionalProbDist that simply wraps a dictionary of ProbDists rather than creating these from FreqDists.
-
class
nltk.probability.
DictionaryProbDist
(prob_dict=None, log=False, normalize=False)[source]¶ Bases:
nltk.probability.ProbDistI
A probability distribution whose probabilities are directly specified by a given dictionary. The given dictionary maps samples to probabilities.
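A minimal usage sketch with invented probabilities:

from nltk.probability import DictionaryProbDist

pd = DictionaryProbDist({'win': 0.6, 'lose': 0.4})
print(pd.prob('win'))   # 0.6
print(pd.max())         # 'win'

# With normalize=True the supplied values are rescaled to sum to one.
pd2 = DictionaryProbDist({'a': 2, 'b': 1, 'c': 1}, normalize=True)
print(pd2.prob('a'))    # 0.5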
-
logprob
(sample)[source]¶ Return the base 2 logarithm of the probability for a given sample.
- Parameters
sample (any) – The sample whose probability should be returned.
- Return type
float
-
max
()[source]¶ Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
-
-
class
nltk.probability.
ELEProbDist
(freqdist, bins=None)[source]¶ Bases:
nltk.probability.LidstoneProbDist
The expected likelihood estimate for the probability distribution of the experiment used to generate a frequency distribution. The “expected likelihood estimate” approximates the probability of a sample with count c from an experiment with N outcomes and B bins as (c+0.5)/(N+B/2). This is equivalent to adding 0.5 to the count for each bin, and taking the maximum likelihood estimate of the resulting frequency distribution.
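A minimal sketch contrasting the expected likelihood estimate with the unsmoothed maximum likelihood estimate; the toy counts are invented for illustration:

from nltk.probability import FreqDist, MLEProbDist, ELEProbDist

fdist = FreqDist('abracadabra')        # a:5, b:2, r:2, c:1, d:1  ->  N = 11
mle = MLEProbDist(fdist)
ele = ELEProbDist(fdist, bins=26)      # assume 26 possible outcomes

print(mle.prob('z'))   # 0.0 -- an unseen sample gets no mass under MLE
print(ele.prob('z'))   # (0 + 0.5) / (11 + 26/2) ~= 0.021 -- ELE reserves mass for it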
-
class
nltk.probability.
FreqDist
(samples=None)[source]¶ Bases:
collections.Counter
A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.
Frequency distributions are generally constructed by running a number of experiments, and incrementing the count for a sample every time it is an outcome of an experiment. For example, the following code will produce a frequency distribution that encodes how often each word occurs in a text:
>>> from nltk.tokenize import word_tokenize >>> from nltk.probability import FreqDist >>> sent = 'This is an example sentence' >>> fdist = FreqDist() >>> for word in word_tokenize(sent): ... fdist[word.lower()] += 1
An equivalent way to do this is with the initializer:
>>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
-
B
()[source]¶ Return the total number of sample values (or “bins”) that have counts greater than zero. For the total number of sample outcomes recorded, use
FreqDist.N()
. (FreqDist.B() is the same as len(FreqDist).)- Return type
int
-
N
()[source]¶ Return the total number of sample outcomes that have been recorded by this FreqDist. For the number of unique sample values (or bins) with counts greater than zero, use
FreqDist.B()
.- Return type
int
-
freq
(sample)[source]¶ Return the frequency of a given sample. The frequency of a sample is defined as the count of that sample divided by the total number of sample outcomes that have been recorded by this FreqDist. The count of a sample is defined as the number of times that sample outcome was recorded by this FreqDist. Frequencies are always real numbers in the range [0, 1].
- Parameters
sample (any) – the sample whose frequency should be returned.
- Return type
float
-
max
()[source]¶ Return the sample with the greatest number of outcomes in this frequency distribution. If two or more samples have the same number of outcomes, return one of them; which sample is returned is undefined. If no outcomes have occurred in this frequency distribution, return None.
- Returns
The sample with the maximum number of outcomes in this frequency distribution.
- Return type
any or None
-
pformat
(maxlen=10)[source]¶ Return a string representation of this FreqDist.
- Parameters
maxlen (int) – The maximum number of items to display
- Return type
string
-
plot
(*args, **kwargs)[source]¶ Plot samples from the frequency distribution displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been plotted. For a cumulative plot, specify cumulative=True. (Requires Matplotlib to be installed.)
- Parameters
title (str) – The title for the graph
cumulative – A flag to specify whether the plot is cumulative (default = False)
-
pprint
(maxlen=10, stream=None)[source]¶ Print a string representation of this FreqDist to ‘stream’
- Parameters
maxlen (int) – The maximum number of items to print
stream – The stream to print to. stdout by default
-
r_Nr
(bins=None)[source]¶ Return the dictionary mapping r to Nr, the number of samples with frequency r, where Nr > 0.
- Parameters
bins (int) – The number of possible sample outcomes.
bins
is used to calculate Nr(0). In particular, Nr(0) isbins-self.B()
. Ifbins
is not specified, it defaults toself.B()
(so Nr(0) will be 0).- Return type
dict(int, int)
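A small sketch of r_Nr() on a toy distribution:

from nltk.probability import FreqDist

fdist = FreqDist('abracadabra')   # a:5, b:2, r:2, c:1, d:1
# Two samples occur once, two occur twice, one occurs five times;
# Nr(0) defaults to 0 because bins is not given.
print(fdist.r_Nr())               # maps r -> Nr, e.g. {5: 1, 2: 2, 1: 2, 0: 0}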
-
tabulate
(*args, **kwargs)[source]¶ Tabulate the given samples from the frequency distribution (cumulative), displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been plotted.
- Parameters
samples (list) – The samples to plot (default is all samples)
cumulative – A flag to specify whether the freqs are cumulative (default = False)
-
-
class
nltk.probability.
HeldoutProbDist
(base_fdist, heldout_fdist, bins=None)[source]¶ Bases:
nltk.probability.ProbDistI
The heldout estimate for the probability distribution of the experiment used to generate two frequency distributions. These two frequency distributions are called the “heldout frequency distribution” and the “base frequency distribution.” The “heldout estimate” uses the “heldout frequency distribution” to predict the probability of each sample, given its frequency in the “base frequency distribution”.
In particular, the heldout estimate approximates the probability for a sample that occurs r times in the base distribution as the average frequency in the heldout distribution of all samples that occur r times in the base distribution.
This average frequency is Tr[r]/(Nr[r].N), where:
Tr[r] is the total count in the heldout distribution for all samples that occur r times in the base distribution.
Nr[r] is the number of samples that occur r times in the base distribution.
N is the number of outcomes recorded by the heldout frequency distribution.
In order to increase the efficiency of the
prob
member function, Tr[r]/(Nr[r].N) is precomputed for each value of r when theHeldoutProbDist
is created.- Variables
_estimate – A list mapping from r, the number of times that a sample occurs in the base distribution, to the probability estimate for that sample.
_estimate[r]
is calculated by finding the average frequency in the heldout distribution of all samples that occur r times in the base distribution. In particular,_estimate[r]
= Tr[r]/(Nr[r].N)._max_r – The maximum number of times that any sample occurs in the base distribution.
_max_r
is used to decide how large_estimate
must be.
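A minimal construction sketch; the two tiny corpora below are invented for illustration:

from nltk.probability import FreqDist, HeldoutProbDist

base = FreqDist('the cat sat on the mat'.split())
heldout = FreqDist('the dog sat on the log'.split())

hpd = HeldoutProbDist(base, heldout, bins=10)
# 'the' occurs twice in the base distribution, so its estimate is the average
# heldout frequency of all samples that occur twice in the base distribution.
print(hpd.prob('the'))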
-
SUM_TO_ONE
= False¶ True if the probabilities of the samples in this probability distribution will always sum to one.
-
base_fdist
()[source]¶ Return the base frequency distribution that this probability distribution is based on.
- Return type
FreqDist
-
discount
()[source]¶ Return the ratio by which counts are discounted on average: c*/c
- Return type
float
-
heldout_fdist
()[source]¶ Return the heldout frequency distribution that this probability distribution is based on.
- Return type
FreqDist
-
max
()[source]¶ Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
-
class
nltk.probability.
ImmutableProbabilisticMixIn
(**kwargs)[source]¶
-
class
nltk.probability.
KneserNeyProbDist
(freqdist, bins=None, discount=0.75)[source]¶ Bases:
nltk.probability.ProbDistI
Kneser-Ney estimate of a probability distribution. This is a version of back-off that estimates how likely an n-gram is given that its (n-1)-gram prefix has been seen in training. It extends the ProbDistI interface and requires a trigram FreqDist instance to train on. Optionally, a discount value other than the default can be specified; the default discount is 0.75.
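A minimal training sketch on an invented token list; real use would train on a much larger corpus:

from nltk.util import trigrams
from nltk.probability import FreqDist, KneserNeyProbDist

tokens = 'the cat sat on the mat and the cat slept'.split()
tri_fdist = FreqDist(trigrams(tokens))    # the class expects a trigram FreqDist

kn = KneserNeyProbDist(tri_fdist)
print(kn.prob(('the', 'cat', 'sat')))     # probability of this trigram under the model
print(kn.discount())                      # 0.75 by default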
-
discount
()[source]¶ Return the value by which counts are discounted. By default set to 0.75.
- Return type
float
-
max
()[source]¶ Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
-
prob
(trigram)[source]¶ Return the probability for a given sample. Probabilities are always real numbers in the range [0, 1].
- Parameters
sample (any) – The sample whose probability should be returned.
- Return type
float
-
-
class
nltk.probability.
LaplaceProbDist
(freqdist, bins=None)[source]¶ Bases:
nltk.probability.LidstoneProbDist
The Laplace estimate for the probability distribution of the experiment used to generate a frequency distribution. The “Laplace estimate” approximates the probability of a sample with count c from an experiment with N outcomes and B bins as (c+1)/(N+B). This is equivalent to adding one to the count for each bin, and taking the maximum likelihood estimate of the resulting frequency distribution.
-
class
nltk.probability.
LidstoneProbDist
(freqdist, gamma, bins=None)[source]¶ Bases:
nltk.probability.ProbDistI
The Lidstone estimate for the probability distribution of the experiment used to generate a frequency distribution. The “Lidstone estimate” is parameterized by a real number gamma, which typically ranges from 0 to 1. The Lidstone estimate approximates the probability of a sample with count c from an experiment with N outcomes and B bins as
(c+gamma)/(N+B*gamma)
. This is equivalent to adding gamma to the count for each bin, and taking the maximum likelihood estimate of the resulting frequency distribution.-
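A minimal sketch of the formula in use, with toy counts and an arbitrary gamma:

from nltk.probability import FreqDist, LidstoneProbDist

fdist = FreqDist('abracadabra')              # a:5, b:2, r:2, c:1, d:1  ->  N = 11
lid = LidstoneProbDist(fdist, 0.1, bins=26)  # gamma = 0.1, 26 possible outcomes

# For 'a': (5 + 0.1) / (11 + 26*0.1) = 5.1 / 13.6 ~= 0.375
print(lid.prob('a'))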
SUM_TO_ONE
= False¶ True if the probabilities of the samples in this probability distribution will always sum to one.
-
discount
()[source]¶ Return the ratio by which counts are discounted on average: c*/c
- Return type
float
-
freqdist
()[source]¶ Return the frequency distribution that this probability distribution is based on.
- Return type
FreqDist
-
max
()[source]¶ Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
-
-
class
nltk.probability.
MLEProbDist
(freqdist, bins=None)[source]¶ Bases:
nltk.probability.ProbDistI
The maximum likelihood estimate for the probability distribution of the experiment used to generate a frequency distribution. The “maximum likelihood estimate” approximates the probability of each sample as the frequency of that sample in the frequency distribution.
-
freqdist
()[source]¶ Return the frequency distribution that this probability distribution is based on.
- Return type
FreqDist
-
max
()[source]¶ Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
-
-
class
nltk.probability.
MutableProbDist
(prob_dist, samples, store_logs=True)[source]¶ Bases:
nltk.probability.ProbDistI
A mutable probdist whose probabilities may be easily modified. This simply copies an existing probdist, storing the probability values in a mutable dictionary and providing an update method.
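A minimal sketch of copying and then adjusting a distribution; the new values are invented for illustration:

from nltk.probability import FreqDist, MLEProbDist, MutableProbDist

mle = MLEProbDist(FreqDist(['a', 'a', 'b']))
mutable = MutableProbDist(mle, ['a', 'b'])

mutable.update('a', 0.5, log=False)   # the caller must keep the distribution consistent
mutable.update('b', 0.5, log=False)
print(mutable.prob('a'), mutable.prob('b'))   # 0.5 0.5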
-
logprob
(sample)[source]¶ Return the base 2 logarithm of the probability for a given sample.
- Parameters
sample (any) – The sample whose probability should be returned.
- Return type
float
-
max
()[source]¶ Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
-
prob
(sample)[source]¶ Return the probability for a given sample. Probabilities are always real numbers in the range [0, 1].
- Parameters
sample (any) – The sample whose probability should be returned.
- Return type
float
-
samples
()[source]¶ Return a list of all samples that have nonzero probabilities. Use
prob
to find the probability of each sample.- Return type
list
-
update
(sample, prob, log=True)[source]¶ Update the probability for the given sample. This may cause the object to stop being a valid probability distribution; the user must ensure that they update the sample probabilities such that all samples have probabilities between 0 and 1 and that all probabilities sum to one.
- Parameters
sample (any) – the sample for which to update the probability
prob (float) – the new probability
log (bool) – is the probability already logged
-
-
class
nltk.probability.
ProbDistI
[source]¶ Bases:
object
A probability distribution for the outcomes of an experiment. A probability distribution specifies how likely it is that an experiment will have any given outcome. For example, a probability distribution could be used to predict the probability that a token in a document will have a given type. Formally, a probability distribution can be defined as a function mapping from samples to nonnegative real numbers, such that the sum of every number in the function’s range is 1.0. A
ProbDist
is often used to model the probability distribution of the experiment used to generate a frequency distribution.-
SUM_TO_ONE
= True¶ True if the probabilities of the samples in this probability distribution will always sum to one.
-
discount
()[source]¶ Return the ratio by which counts are discounted on average: c*/c
- Return type
float
-
generate
()[source]¶ Return a randomly selected sample from this probability distribution. The probability of returning each sample
samp
is equal toself.prob(samp)
.
-
logprob
(sample)[source]¶ Return the base 2 logarithm of the probability for a given sample.
- Parameters
sample (any) – The sample whose probability should be returned.
- Return type
float
-
abstract
max
()[source]¶ Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
-
-
class
nltk.probability.
ProbabilisticMixIn
(**kwargs)[source]¶ Bases:
object
A mix-in class to associate probabilities with other classes (trees, rules, etc.). To use the
ProbabilisticMixIn
class, define a new class that derives from an existing class and from ProbabilisticMixIn. You will need to define a new constructor for the new class, which explicitly calls the constructors of both its parent classes. For example:>>> from nltk.probability import ProbabilisticMixIn >>> class A: ... def __init__(self, x, y): self.data = (x,y) ... >>> class ProbabilisticA(A, ProbabilisticMixIn): ... def __init__(self, x, y, **prob_kwarg): ... A.__init__(self, x, y) ... ProbabilisticMixIn.__init__(self, **prob_kwarg)
See the documentation for the ProbabilisticMixIn
constructor<__init__>
for information about the arguments it expects.You should generally also redefine the string representation methods, the comparison methods, and the hashing method.
-
logprob
()[source]¶ Return
log(p)
, wherep
is the probability associated with this object.- Return type
float
-
-
class
nltk.probability.
SimpleGoodTuringProbDist
(freqdist, bins=None)[source]¶ Bases:
nltk.probability.ProbDistI
SimpleGoodTuringProbDist approximates the relationship between frequency and frequency of frequency by fitting a straight line in log space with linear regression. Details of the Simple Good-Turing algorithm can be found in:
“Good-Turing smoothing without tears” (Gale & Sampson 1995), Journal of Quantitative Linguistics, vol. 2, pp. 217-237.
“Speech and Language Processing” (Jurafsky & Martin), 2nd Edition, Chapter 4.5, p. 103 (log(Nc) = a + b*log(c))
Given a set of pairs (xi, yi), where xi denotes a frequency and yi denotes the frequency of that frequency, we want to minimize the squared error. E(x) and E(y) denote the means of xi and yi.
slope: b = sigma((xi - E(x))(yi - E(y))) / sigma((xi - E(x))(xi - E(x)))
intercept: a = E(y) - b * E(x)
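A minimal usage sketch; on a corpus this small the regression has very few points, so real use would involve far more data:

from nltk.probability import FreqDist, SimpleGoodTuringProbDist

fdist = FreqDist('the cat sat on the mat with the hat'.split())
sgt = SimpleGoodTuringProbDist(fdist, bins=20)   # bins >= number of possible outcomes

print(sgt.prob('the'))       # smoothed probability of a seen word
print(sgt.prob('unseen'))    # share of the mass reserved for unseen words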
-
SUM_TO_ONE
= False¶ True if the probabilities of the samples in this probability distribution will always sum to one.
-
discount
()[source]¶ This function returns the total mass of probability transfers from the seen samples to the unseen samples.
-
find_best_fit
(r, nr)[source]¶ Use simple linear regression to tune parameters self._slope and self._intercept in the log-log space based on count and Nr(count) (Work in log space to avoid floating point underflow.)
-
max
()[source]¶ Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
-
prob
(sample)[source]¶ Return the sample’s probability.
- Parameters
sample (str) – sample of the event
- Return type
float
-
class
nltk.probability.
UniformProbDist
(samples)[source]¶ Bases:
nltk.probability.ProbDistI
A probability distribution that assigns equal probability to each sample in a given set; and a zero probability to all other samples.
-
max
()[source]¶ Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
-
-
class
nltk.probability.
WittenBellProbDist
(freqdist, bins=None)[source]¶ Bases:
nltk.probability.ProbDistI
The Witten-Bell estimate of a probability distribution. This distribution allocates uniform probability mass to as yet unseen events by using the number of events that have only been seen once. The probability mass reserved for unseen events is equal to T / (N + T) where T is the number of observed event types and N is the total number of observed events. This equates to the maximum likelihood estimate of a new type event occurring. The remaining probability mass is discounted such that all probability estimates sum to one, yielding:
p = T / (Z * (N + T)), if count = 0, where Z is the number of unseen event types (Z = bins - T)
p = c / (N + T), otherwise
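A minimal sketch of the reserved mass in practice; the toy counts and the choice of bins are invented for illustration:

from nltk.probability import FreqDist, WittenBellProbDist

fdist = FreqDist('the cat sat on the mat'.split())   # T = 5 types, N = 6 tokens
wb = WittenBellProbDist(fdist, bins=10)              # so Z = 10 - 5 = 5 unseen bins

print(wb.prob('the'))      # 2 / (6 + 5) ~= 0.182
print(wb.prob('unseen'))   # 5 / (5 * (6 + 5)) ~= 0.091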
-
discount
()[source]¶ Return the ratio by which counts are discounted on average: c*/c
- Return type
float
-
max
()[source]¶ Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
- Return type
any
text
Module¶
This module brings together a variety of NLTK functionality for text analysis, and provides simple, interactive interfaces. Functionality includes: concordancing, collocation discovery, regular expression search over tokenized strings, and distributional similarity.
-
class
nltk.text.
ConcordanceIndex
(tokens, key=<function ConcordanceIndex.<lambda>>)[source]¶ Bases:
object
An index that can be used to look up the offset locations at which a given word occurs in a document.
-
find_concordance
(word, width=80)[source]¶ Find all concordance lines given the query word.
Provided with a list of words, these will be found as a phrase.
-
offsets
(word)[source]¶ - Return type
list(int)
- Returns
A list of the offset positions at which the given word occurs. If a key function was specified for the index, then given word’s key will be looked up.
-
print_concordance
(word, width=80, lines=25)[source]¶ Print concordance lines given the query word.
- Parameters
word (str or list) – The target word or phrase (a list of strings)
width (int) – The width of each line, in characters (default=80)
lines (int) – The number of lines to display (default=25)
save (bool) – The option to save the concordance.
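A minimal sketch of building and querying an index over a hand-made token list:

from nltk.text import ConcordanceIndex

tokens = "The cat sat on the mat . The cat slept .".split()
index = ConcordanceIndex(tokens, key=lambda s: s.lower())   # case-insensitive lookup

print(index.offsets('cat'))      # token positions at which 'cat' occurs
index.print_concordance('cat')   # print the aligned concordance lines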
-
-
class
nltk.text.
ContextIndex
(tokens, context_func=None, filter=None, key=<function ContextIndex.<lambda>>)[source]¶ Bases:
object
A bidirectional index between words and their ‘contexts’ in a text. The context of a word is usually defined to be the words that occur in a fixed window around the word; but other definitions may also be used by providing a custom context function.
-
common_contexts
(words, fail_on_unknown=False)[source]¶ Find contexts where the specified words can all appear; and return a frequency distribution mapping each context to the number of times that context was used.
- Parameters
words (str) – The words used to seed the similarity search
fail_on_unknown – If true, then raise a value error if any of the given words do not occur at all in the index.
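A minimal sketch over a hand-made token list, chosen so that 'dog' and 'cat' share a context:

from nltk.text import ContextIndex

tokens = "a small dog barked and a small cat barked".split()
idx = ContextIndex(tokens)

# Both 'dog' and 'cat' occur in the context ('small', 'barked').
print(idx.common_contexts(['dog', 'cat']))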
-
-
class
nltk.text.
Text
(tokens, name=None)[source]¶ Bases:
object
A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the
Text
class, and use the appropriate analysis function or class directly instead.A
Text
is typically initialized from a given document or corpus. E.g.:>>> import nltk.corpus >>> from nltk.text import Text >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
-
collocation_list
(num=20, window_size=2)[source]¶ Return collocations derived from the text, ignoring stopwords.
>>> from nltk.book import text4 >>> text4.collocation_list()[:2] [('United', 'States'), ('fellow', 'citizens')]
- Parameters
num (int) – The maximum number of collocations to return.
window_size (int) – The number of tokens spanned by a collocation (default=2)
- Return type
list(tuple(str, str))
-
collocations
(num=20, window_size=2)[source]¶ Print collocations derived from the text, ignoring stopwords.
>>> from nltk.book import text4 >>> text4.collocations() United States; fellow citizens; four years; ...
- Parameters
num (int) – The maximum number of collocations to print.
window_size (int) – The number of tokens spanned by a collocation (default=2)
-
common_contexts
(words, num=20)[source]¶ Find contexts where the specified words appear; list most frequent common contexts first.
- Parameters
words (str) – The words used to seed the similarity search
num (int) – The number of words to generate (default=20)
- Seealso
ContextIndex.common_contexts()
-
concordance
(word, width=79, lines=25)[source]¶ Prints a concordance for
word
with the specified context window. Word matching is not case-sensitive.- Parameters
word (str or list) – The target word or phrase (a list of strings)
width (int) – The width of each line, in characters (default=79)
lines (int) – The number of lines to display (default=25)
- Seealso
ConcordanceIndex
-
concordance_list
(word, width=79, lines=25)[source]¶ Generate a concordance for
word
with the specified context window. Word matching is not case-sensitive.- Parameters
word (str or list) – The target word or phrase (a list of strings)
width (int) – The width of each line, in characters (default=79)
lines (int) – The number of lines to display (default=25)
- Seealso
ConcordanceIndex
-
dispersion_plot
(words)[source]¶ Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.
- Parameters
words (list(str)) – The words to be plotted
- Seealso
nltk.draw.dispersion_plot()
-
findall
(regexp)[source]¶ Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.
>>> print('hack'); from nltk.book import text1, text5, text9 hack... >>> text5.findall("<.*><.*><bro>") you rule bro; telling you bro; u twizted bro >>> text1.findall("<a>(<.*>)<man>") monied; nervous; dangerous; white; white; white; pious; queer; good; mature; white; Cape; great; wise; wise; butterless; white; fiendish; pale; furious; better; certain; complete; dismasted; younger; brave; brave; brave; brave >>> text9.findall("<th.*>{3,}") thread through those; the thought that; that the thing; the thing that; that that thing; through these than through; them that the; through the thick; them that they; thought that the
- Parameters
regexp (str) – A regular expression
-
generate
(length=100, text_seed=None, random_seed=42)[source]¶ Print random text, generated using a trigram language model. See also help(nltk.lm).
- Parameters
length (int) – The length of text to generate (default=100)
text_seed (list(str)) – Generation can be conditioned on preceding context.
random_seed (int) – A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible. (default=42)
-
similar
(word, num=20)[source]¶ Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.
- Parameters
word (str) – The word used to seed the similarity search
num (int) – The number of words to generate (default=20)
- Seealso
ContextIndex.similar_words()
-
-
class
nltk.text.
TextCollection
(source)[source]¶ Bases:
nltk.text.Text
A collection of texts, which can be loaded with a list of texts, or with a corpus consisting of one or more texts, and which supports counting, concordancing, collocation discovery, etc. Initialize a TextCollection as follows:
>>> import nltk.corpus >>> from nltk.text import TextCollection >>> print('hack'); from nltk.book import text1, text2, text3 hack... >>> gutenberg = TextCollection(nltk.corpus.gutenberg) >>> mytexts = TextCollection([text1, text2, text3])
Iterating over a TextCollection produces all the tokens of all the texts in order.
-
class
nltk.text.
TokenSearcher
(tokens)[source]¶ Bases:
object
A class that makes it easier to use regular expressions to search over tokenized strings. The tokenized string is converted to a string where tokens are marked with angle brackets – e.g.,
'<the><window><is><still><open>'
. The regular expression passed to thefindall()
method is modified to treat angle brackets as non-capturing parentheses, in addition to matching the token boundaries; and to have'.'
not match the angle brackets.-
findall
(regexp)[source]¶ Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.
>>> from nltk.text import TokenSearcher >>> print('hack'); from nltk.book import text1, text5, text9 hack... >>> text5.findall("<.*><.*><bro>") you rule bro; telling you bro; u twizted bro >>> text1.findall("<a>(<.*>)<man>") monied; nervous; dangerous; white; white; white; pious; queer; good; mature; white; Cape; great; wise; wise; butterless; white; fiendish; pale; furious; better; certain; complete; dismasted; younger; brave; brave; brave; brave >>> text9.findall("<th.*>{3,}") thread through those; the thought that; that the thing; the thing that; that that thing; through these than through; them that the; through the thick; them that they; thought that the
- Parameters
regexp (str) – A regular expression
-
toolbox
Module¶
Module for reading, writing and manipulating Toolbox databases and settings files.
-
class
nltk.toolbox.
StandardFormat
(filename=None, encoding=None)[source]¶ Bases:
object
Class for reading and processing standard format marker files and strings.
-
fields
(strip=True, unwrap=True, encoding=None, errors='strict', unicode_fields=None)[source]¶ Return an iterator that returns the next field in a
(marker, value)
tuple, wheremarker
andvalue
are unicode strings if anencoding
was specified in thefields()
method. Otherwise they are non-unicode strings.- Parameters
strip (bool) – strip trailing whitespace from the last line of each field
unwrap (bool) – Convert newlines in a field to spaces.
encoding (str or None) – Name of an encoding to use. If it is specified then the
fields()
method returns unicode strings rather than non unicode strings.errors (str) – Error handling scheme for codec. Same as the
decode()
builtin string method.unicode_fields (sequence) – Set of marker names whose values are UTF-8 encoded. Ignored if encoding is None. If the whole file is UTF-8 encoded set
encoding='utf8'
and leaveunicode_fields
with its default value of None.
- Return type
iter(tuple(str, str))
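A minimal reading sketch; 'lexicon.db' is a hypothetical standard format marker file:

from nltk.toolbox import StandardFormat

sf = StandardFormat('lexicon.db', encoding='utf8')   # hypothetical input file
for marker, value in sf.fields():
    print(marker, value)
sf.close()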
-
open
(sfm_file)[source]¶ Open a standard format marker file for sequential reading.
- Parameters
sfm_file (str) – name of the standard format marker input file
-
-
class
nltk.toolbox.
ToolboxData
(filename=None, encoding=None)[source]¶ Bases:
nltk.toolbox.StandardFormat
-
class
nltk.toolbox.
ToolboxSettings
[source]¶ Bases:
nltk.toolbox.StandardFormat
This class is the base class for settings files.
-
parse
(encoding=None, errors='strict', **kwargs)[source]¶ Return the contents of toolbox settings file with a nested structure.
- Parameters
encoding (str) – encoding used by settings file
errors (str) – Error handling scheme for codec. Same as
decode()
builtin method.kwargs (dict) – Keyword arguments passed to
StandardFormat.fields()
- Return type
ElementTree._ElementInterface
-
-
nltk.toolbox.
add_blank_lines
(tree, blanks_before, blanks_between)[source]¶ Add blank lines before all elements and subelements specified in blank_before.
- Parameters
elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure
blank_before (dict(tuple)) – elements and subelements to add blank lines before
-
nltk.toolbox.
add_default_fields
(elem, default_fields)[source]¶ Add blank elements and subelements specified in default_fields.
- Parameters
elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure
default_fields (dict(tuple)) – fields to add to each type of element and subelement
-
nltk.toolbox.
remove_blanks
(elem)[source]¶ Remove all elements and subelements with no text and no child elements.
- Parameters
elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure
-
nltk.toolbox.
sort_fields
(elem, field_orders)[source]¶ Sort the elements and subelements in order specified in field_orders.
- Parameters
elem (ElementTree._ElementInterface) – toolbox data in an elementtree structure
field_orders (dict(tuple)) – order of fields for each type of element and subelement
-
nltk.toolbox.
to_sfm_string
(tree, encoding=None, errors='strict', unicode_fields=None)[source]¶ Return a string with a standard format representation of the toolbox data in tree (tree can be a toolbox database or a single record).
- Parameters
tree (ElementTree._ElementInterface) – flat representation of toolbox data (whole database or single record)
encoding (str) – Name of an encoding to use.
errors (str) – Error handling scheme for codec. Same as the
encode()
builtin string method.unicode_fields (dict(str) or set(str)) –
- Return type
str
translate
Module¶
Experimental features for machine translation. These interfaces are prone to change.
tree
Module¶
Class for representing hierarchical language structures, such as syntax trees and morphological trees.
-
class
nltk.tree.
ImmutableProbabilisticTree
(node, children=None, **prob_kwargs)[source]¶ Bases:
nltk.tree.ImmutableTree
,nltk.probability.ProbabilisticMixIn
-
class
nltk.tree.
ImmutableTree
(node, children=None)[source]¶ Bases:
nltk.tree.Tree
-
pop
(v=None)[source]¶ Remove and return item at index (default last).
Raises IndexError if list is empty or index is out of range.
-
set_label
(value)[source]¶ Set the node label. This will only succeed the first time the node label is set, which should occur in ImmutableTree.__init__().
-
sort
()[source]¶ Sort the list in ascending order and return None.
The sort is in-place (i.e. the list itself is modified) and stable (i.e. the order of two equal elements is maintained).
If a key function is given, apply it once to each list item and sort them, ascending or descending, according to their function values.
The reverse flag can be set to sort in descending order.
-
-
class
nltk.tree.
MultiParentedTree
(node, children=None)[source]¶ Bases:
nltk.tree.AbstractParentedTree
A
Tree
that automatically maintains parent pointers for multi-parented trees. The following are methods for querying the structure of a multi-parented tree:parents()
,parent_indices()
,left_siblings()
,right_siblings()
,roots
,treepositions
.Each
MultiParentedTree
may have zero or more parents. In particular, subtrees may be shared. If a singleMultiParentedTree
is used as multiple children of the same parent, then that parent will appear multiple times in itsparents()
method.MultiParentedTrees
should never be used in the same tree asTrees
orParentedTrees
. Mixing tree implementations may result in incorrect parent pointers and inTypeError
exceptions.-
left_siblings
()[source]¶ A list of all left siblings of this tree, in any of its parent trees. A tree may be its own left sibling if it is used as multiple contiguous children of the same parent. A tree may appear multiple times in this list if it is the left sibling of this tree with respect to multiple parents.
- Type
list(MultiParentedTree)
-
parent_indices
(parent)[source]¶ Return a list of the indices where this tree occurs as a child of
parent
. If this child does not occur as a child ofparent
, then the empty list is returned. The following is always true:for parent_index in ptree.parent_indices(parent): parent[parent_index] is ptree
-
parents
()[source]¶ The set of parents of this tree. If this tree has no parents, then
parents
is the empty set. To check if a tree is used as multiple children of the same parent, use theparent_indices()
method.- Type
list(MultiParentedTree)
-
right_siblings
()[source]¶ A list of all right siblings of this tree, in any of its parent trees. A tree may be its own right sibling if it is used as multiple contiguous children of the same parent. A tree may appear multiple times in this list if it is the right sibling of this tree with respect to multiple parents.
- Type
list(MultiParentedTree)
-
roots
()[source]¶ The set of all roots of this tree. This set is formed by tracing all possible parent paths until trees with no parents are found.
- Type
list(MultiParentedTree)
-
-
class
nltk.tree.
ParentedTree
(node, children=None)[source]¶ Bases:
nltk.tree.AbstractParentedTree
A
Tree
that automatically maintains parent pointers for single-parented trees. The following are methods for querying the structure of a parented tree:parent
,parent_index
,left_sibling
,right_sibling
,root
,treeposition
.Each
ParentedTree
may have at most one parent. In particular, subtrees may not be shared. Any attempt to reuse a singleParentedTree
as a child of more than one parent (or as multiple children of the same parent) will cause aValueError
exception to be raised.ParentedTrees
should never be used in the same tree asTrees
orMultiParentedTrees
. Mixing tree implementations may result in incorrect parent pointers and inTypeError
exceptions.-
parent_index
()[source]¶ The index of this tree in its parent. I.e.,
ptree.parent()[ptree.parent_index()] is ptree
. Note thatptree.parent_index()
is not necessarily equal toptree.parent.index(ptree)
, since theindex()
method returns the first child that is equal to its argument.
-
-
class
nltk.tree.
ProbabilisticMixIn
(**kwargs)[source]¶ Bases:
object
A mix-in class to associate probabilities with other classes (trees, rules, etc.). To use the
ProbabilisticMixIn
class, define a new class that derives from an existing class and from ProbabilisticMixIn. You will need to define a new constructor for the new class, which explicitly calls the constructors of both its parent classes. For example:>>> from nltk.probability import ProbabilisticMixIn >>> class A: ... def __init__(self, x, y): self.data = (x,y) ... >>> class ProbabilisticA(A, ProbabilisticMixIn): ... def __init__(self, x, y, **prob_kwarg): ... A.__init__(self, x, y) ... ProbabilisticMixIn.__init__(self, **prob_kwarg)
See the documentation for the ProbabilisticMixIn
constructor<__init__>
for information about the arguments it expects.You should generally also redefine the string representation methods, the comparison methods, and the hashing method.
-
logprob
()[source]¶ Return
log(p)
, wherep
is the probability associated with this object.- Return type
float
-
-
class
nltk.tree.
ProbabilisticTree
(node, children=None, **prob_kwargs)[source]¶
-
class
nltk.tree.
Tree
(node, children=None)[source]¶ Bases:
list
A Tree represents a hierarchical grouping of leaves and subtrees. For example, each constituent in a syntax tree is represented by a single Tree.
A tree’s children are encoded as a list of leaves and subtrees, where a leaf is a basic (non-tree) value; and a subtree is a nested Tree.
>>> from nltk.tree import Tree >>> print(Tree(1, [2, Tree(3, [4]), 5])) (1 2 (3 4) 5) >>> vp = Tree('VP', [Tree('V', ['saw']), ... Tree('NP', ['him'])]) >>> s = Tree('S', [Tree('NP', ['I']), vp]) >>> print(s) (S (NP I) (VP (V saw) (NP him))) >>> print(s[1]) (VP (V saw) (NP him)) >>> print(s[1,1]) (NP him) >>> t = Tree.fromstring("(S (NP I) (VP (V saw) (NP him)))") >>> s == t True >>> t[1][1].set_label('X') >>> t[1][1].label() 'X' >>> print(t) (S (NP I) (VP (V saw) (X him))) >>> t[0], t[1,1] = t[1,1], t[0] >>> print(t) (S (X him) (VP (V saw) (NP I)))
The length of a tree is the number of children it has.
>>> len(t) 2
The set_label() and label() methods allow individual constituents to be labeled. For example, syntax trees use this label to specify phrase tags, such as “NP” and “VP”.
Several Tree methods use “tree positions” to specify children or descendants of a tree. Tree positions are defined as follows:
The tree position i specifies a Tree’s ith child.
The tree position
()
specifies the Tree itself.If p is the tree position of descendant d, then p+i specifies the ith child of d.
I.e., every tree position is either a single index i, specifying
tree[i]
; or a sequence i1, i2, …, iN, specifyingtree[i1][i2]...[iN]
.Construct a new tree. This constructor can be called in one of two ways:
Tree(label, children)
constructs a new tree with thespecified label and list of children.
Tree.fromstring(s)
constructs a new tree by parsing the strings
.
-
chomsky_normal_form
(factor='right', horzMarkov=None, vertMarkov=0, childChar='|', parentChar='^')[source]¶ This method can modify a tree in three ways:
Convert a tree into its Chomsky Normal Form (CNF) equivalent – Every subtree has either two non-terminals or one terminal as its children. This process requires the creation of more “artificial” non-terminal nodes.
Markov (horizontal) smoothing of children in new artificial nodes
Vertical (parent) annotation of nodes
- Parameters
factor (str = [left|right]) – Right or left factoring method (default = “right”)
horzMarkov (int | None) – Markov order for sibling smoothing in artificial nodes (None (default) = include all siblings)
vertMarkov (int | None) – Markov order for parent smoothing (0 (default) = no vertical annotation)
childChar (str) – A string used in construction of the artificial nodes, separating the head of the original subtree from the child nodes that have yet to be expanded (default = “|”)
parentChar (str) – A string used to separate the node representation from its vertical annotation
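A minimal in-place binarization sketch; the sentence is invented for illustration:

from nltk.tree import Tree

t = Tree.fromstring("(S (NP I) (VP (V saw) (NP him) (PP (P with) (NP glasses))))")
t.chomsky_normal_form(factor='right', horzMarkov=2)
print(t)   # the three-child VP is now binarized with artificial VP|<...> nodes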
-
collapse_unary
(collapsePOS=False, collapseRoot=False, joinChar='+')[source]¶ Collapse subtrees with a single child (ie. unary productions) into a new non-terminal (Tree node) joined by ‘joinChar’. This is useful when working with algorithms that do not allow unary productions, and completely removing the unary productions would require loss of useful information. The Tree is modified directly (since it is passed by reference) and no value is returned.
- Parameters
collapsePOS (bool) – ‘False’ (default) will not collapse the parent of leaf nodes (ie. Part-of-Speech tags) since they are always unary productions
collapseRoot (bool) – ‘False’ (default) will not modify the root production if it is unary. For the Penn WSJ treebank corpus, this corresponds to the TOP -> productions.
joinChar (str) – A string used to connect collapsed node values (default = “+”)
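A minimal sketch collapsing a unary NP over NP production:

from nltk.tree import Tree

t = Tree.fromstring("(S (NP (NP (D the) (N dog))) (VP (V barked)))")
t.collapse_unary()
print(t)   # (S (NP+NP (D the) (N dog)) (VP (V barked)))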
-
classmethod
convert
(tree)[source]¶ Convert a tree between different subtypes of Tree.
cls
determines which class will be used to encode the new tree.- Parameters
tree (Tree) – The tree that should be converted.
- Returns
The new Tree.
-
flatten
()[source]¶ Return a flat version of the tree, with all non-root non-terminals removed.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))") >>> print(t.flatten()) (S the dog chased the cat)
- Returns
a tree consisting of this tree’s root connected directly to its leaves, omitting all intervening non-terminal nodes.
- Return type
Tree
-
classmethod
fromlist
(l)[source]¶ Convert nested lists to an NLTK Tree.- Parameters
l (list) – a tree represented as nested lists
- Returns
A tree corresponding to the list representation
l
.- Return type
Tree
-
classmethod
fromstring
(s, brackets='()', read_node=None, read_leaf=None, node_pattern=None, leaf_pattern=None, remove_empty_top_bracketing=False)[source]¶ Read a bracketed tree string and return the resulting tree. Trees are represented as nested brackettings, such as:
(S (NP (NNP John)) (VP (V runs)))
- Parameters
s (str) – The string to read
brackets (str (length=2)) – The bracket characters used to mark the beginning and end of trees and subtrees.
read_leaf (read_node,) –
If specified, these functions are applied to the substrings of
s
corresponding to nodes and leaves (respectively) to obtain the values for those nodes and leaves. They should have the following signature:read_node(str) -> value
For example, these functions could be used to process nodes and leaves whose values should be some type other than string (such as
FeatStruct
). Note that by default, node strings and leaf strings are delimited by whitespace and brackets; to override this default, use thenode_pattern
andleaf_pattern
arguments.leaf_pattern (node_pattern,) – Regular expression patterns used to find node and leaf substrings in
s
. By default, both nodes patterns are defined to match any sequence of non-whitespace non-bracket characters.remove_empty_top_bracketing (bool) – If the resulting tree has an empty node label, and is length one, then return its single child instead. This is useful for treebank trees, which sometimes contain an extra level of bracketing.
- Returns
A tree corresponding to the string representation
s
. If this class method is called using a subclass of Tree, then it will return a tree of that type.- Return type
Tree
-
height
()[source]¶ Return the height of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))") >>> t.height() 5 >>> print(t[0,0]) (D the) >>> t[0,0].height() 2
- Returns
The height of this tree. The height of a tree containing no children is 1; the height of a tree containing only leaves is 2; and the height of any other tree is one plus the maximum of its children’s heights.
- Return type
int
-
label
()[source]¶ Return the node label of the tree.
>>> t = Tree.fromstring('(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))') >>> t.label() 'S'
- Returns
the node label (typically a string)
- Return type
any
-
leaf_treeposition
(index)[source]¶ - Returns
The tree position of the
index
-th leaf in this tree. I.e., iftp=self.leaf_treeposition(i)
, thenself[tp]==self.leaves()[i]
.- Raises
IndexError – If this tree contains fewer than
index+1
leaves, or ifindex<0
.
-
leaves
()[source]¶ Return the leaves of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))") >>> t.leaves() ['the', 'dog', 'chased', 'the', 'cat']
- Returns
a list containing this tree’s leaves. The order reflects the order of the leaves in the tree’s hierarchical structure.
- Return type
list
-
property
node
¶ Outdated method to access the node value; use the label() method instead.
-
pformat
(margin=70, indent=0, nodesep='', parens='()', quotes=False)[source]¶ - Returns
A pretty-printed string representation of this tree.
- Return type
str
- Parameters
margin (int) – The right margin at which to do line-wrapping.
indent (int) – The indentation level at which printing begins. This number is used to decide how far to indent subsequent lines.
nodesep – A string that is used to separate the node from the children. E.g., the value
':'
gives trees like(S: (NP: I) (VP: (V: saw) (NP: it)))
.
-
pformat_latex_qtree
()[source]¶ Returns a representation of the tree compatible with the LaTeX qtree package. This consists of the string
\Tree
followed by the tree represented in bracketed notation.For example, the following result was generated from a parse tree of the sentence
The announcement astounded us
:\Tree [.I'' [.N'' [.D The ] [.N' [.N announcement ] ] ] [.I' [.V'' [.V' [.V astounded ] [.N'' [.N' [.N us ] ] ] ] ] ] ]
See http://www.ling.upenn.edu/advice/latex.html for the LaTeX style file for the qtree package.
- Returns
A latex qtree representation of this tree.
- Return type
str
-
pos
()[source]¶ Return a sequence of pos-tagged words extracted from the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))") >>> t.pos() [('the', 'D'), ('dog', 'N'), ('chased', 'V'), ('the', 'D'), ('cat', 'N')]
- Returns
a list of tuples containing leaves and pre-terminals (part-of-speech tags). The order reflects the order of the leaves in the tree’s hierarchical structure.
- Return type
list(tuple)
-
pretty_print
(sentence=None, highlight=(), stream=None, **kwargs)[source]¶ Pretty-print this tree as ASCII or Unicode art. For explanation of the arguments, see the documentation for nltk.treeprettyprinter.TreePrettyPrinter.
-
productions
()[source]¶ Generate the productions that correspond to the non-terminal nodes of the tree. For each subtree of the form (P: C1 C2 … Cn) this produces a production of the form P -> C1 C2 … Cn.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))") >>> t.productions() [S -> NP VP, NP -> D N, D -> 'the', N -> 'dog', VP -> V NP, V -> 'chased', NP -> D N, D -> 'the', N -> 'cat']
- Return type
list(Production)
-
set_label
(label)[source]¶ Set the node label of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))") >>> t.set_label("T") >>> print(t) (T (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))
- Parameters
label (any) – the node label (typically a string)
-
subtrees
(filter=None)[source]¶ Generate all the subtrees of this tree, optionally restricted to trees matching the filter function.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))") >>> for s in t.subtrees(lambda t: t.height() == 2): ... print(s) (D the) (N dog) (V chased) (D the) (N cat)
- Parameters
filter (function) – the function to filter all local trees
-
treeposition_spanning_leaves
(start, end)[source]¶ - Returns
The tree position of the lowest descendant of this tree that dominates
self.leaves()[start:end]
.- Raises
ValueError – if
end <= start
-
treepositions
(order='preorder')[source]¶ >>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))") >>> t.treepositions() [(), (0,), (0, 0), (0, 0, 0), (0, 1), (0, 1, 0), (1,), (1, 0), (1, 0, 0), ...] >>> for pos in t.treepositions('leaves'): ... t[pos] = t[pos][::-1].upper() >>> print(t) (S (NP (D EHT) (N GOD)) (VP (V DESAHC) (NP (D EHT) (N TAC))))
- Parameters
order – One of:
preorder
,postorder
,bothorder
,leaves
.
-
un_chomsky_normal_form
(expandUnary=True, childChar='|', parentChar='^', unaryChar='+')[source]¶ This method modifies the tree in three ways:
Transforms a tree in Chomsky Normal Form back to its original structure (branching greater than two)
Removes any parent annotation (if it exists)
(optional) expands unary subtrees (if previously collapsed with collapseUnary(…) )
- Parameters
expandUnary (bool) – Flag to expand unary or not (default = True)
childChar (str) – A string separating the head node from its children in an artificial node (default = “|”)
parentChar (str) – A string separating the node label from its parent annotation (default = “^”)
unaryChar (str) – A string joining two non-terminals in a unary production (default = “+”)
-
nltk.tree.
sinica_parse
(s)[source]¶ Parse a Sinica Treebank string and return a tree. Trees are represented as nested brackettings, as shown in the following example (X represents a Chinese character): S(goal:NP(Head:Nep:XX)|theme:NP(Head:Nhaa:X)|quantity:Dab:X|Head:VL2:X)#0(PERIODCATEGORY)
- Returns
A tree corresponding to the string representation.
- Return type
Tree
- Parameters
s (str) – The string to be converted
treetransforms
Module¶
A collection of methods for tree (grammar) transformations used in parsing natural language.
Although many of these methods are technically grammar transformations (ie. Chomsky Normal Form), when working with treebanks it is much more natural to visualize these modifications in a tree structure. Hence, we will apply all transformations directly to the tree itself. Transforming the tree directly also allows us to do parent annotation. A grammar can then be simply induced from the modified tree.
The following is a short tutorial on the available transformations.
Chomsky Normal Form (binarization)
It is well known that any grammar has a Chomsky Normal Form (CNF) equivalent grammar where CNF is defined by every production having either two non-terminals or one terminal on its right hand side. When we have hierarchically structured data (ie. a treebank), it is natural to view this in terms of productions where the root of every subtree is the head (left hand side) of the production and all of its children are the right hand side constituents. In order to convert a tree into CNF, we simply need to ensure that every subtree has either two subtrees as children (binarization), or one leaf node (a terminal). In order to binarize a subtree with more than two children, we must introduce artificial nodes.
There are two popular methods to convert a tree into CNF: left factoring and right factoring. The following example demonstrates the difference between them. Example:
  Original          Right-Factored           Left-Factored

       A                  A                        A
     / | \              /   \                    /   \
    B  C  D    ==>     B    A|<C-D>     OR   A|<B-C>   D
                            /   \             /   \
                           C     D           B     C
Parent Annotation
In addition to binarizing the tree, there are two standard modifications to node labels we can do in the same traversal: parent annotation and Markov order-N smoothing (or sibling smoothing).
The purpose of parent annotation is to refine the probabilities of productions by adding a small amount of context. With this simple addition, a CYK (inside-outside, dynamic programming chart parse) can improve from 74% to 79% accuracy. A natural generalization from parent annotation is to grandparent annotation and beyond. The tradeoff becomes accuracy gain vs. computational complexity. We must also keep in mind data sparsity issues. Example:
  Original          Parent Annotation

       A                     A^<?>
     / | \                  /     \
    B  C  D    ==>     B^<A>      A|<C-D>^<?>     where ? is the
                                  /     \         parent of A
                              C^<A>     D^<A>
Markov order-N smoothing
Markov smoothing combats data sparsity issues as well as decreasing computational requirements by limiting the number of children included in artificial nodes. In practice, most people use an order 2 grammar. Example:
  Original       No Smoothing         Markov order 1     Markov order 2     etc.

    __A__             A                     A                   A
   / /|\ \          /   \                 /   \               /   \
  B C D E F  ==>   B    A|<C-D-E-F> ==>  B    A|<C>    ==>   B    A|<C-D>
                        /   \                 /   \               /   \
                       C    ...              C    ...            C    ...

Annotation decisions can be thought about in the vertical direction (parent, grandparent, etc) and the horizontal direction (number of siblings to keep). Parameters to the following functions specify these values. For more information see:
Dan Klein and Chris Manning (2003) “Accurate Unlexicalized Parsing”, ACL-03. http://www.aclweb.org/anthology/P03-1054
Unary Collapsing
Collapse unary productions (ie. subtrees with a single child) into a new non-terminal (Tree node). This is useful when working with algorithms that do not allow unary productions, yet you do not wish to lose the parent information. Example:
    A
    |
    B      ==>   A+B
   / \          /   \
  C   D        C     D
-
nltk.treetransforms.
chomsky_normal_form
(tree, factor='right', horzMarkov=None, vertMarkov=0, childChar='|', parentChar='^')[source]¶
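A minimal sketch of the functional interface, combining binarization, sibling smoothing and parent annotation, and then undoing the transform; the sentence is invented for illustration:

from nltk.tree import Tree
from nltk.treetransforms import chomsky_normal_form, un_chomsky_normal_form

t = Tree.fromstring("(S (NP I) (VP (V saw) (NP him) (PP (P with) (NP glasses))))")
chomsky_normal_form(t, factor='right', horzMarkov=2, vertMarkov=1)
print(t)                    # binarized, with parent-annotated labels such as VP^<S>

un_chomsky_normal_form(t)   # strip the annotations and artificial nodes again
print(t)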
-
nltk.treetransforms.
collapse_unary
(tree, collapsePOS=False, collapseRoot=False, joinChar='+')[source]¶ Collapse subtrees with a single child (ie. unary productions) into a new non-terminal (Tree node) joined by ‘joinChar’. This is useful when working with algorithms that do not allow unary productions, and completely removing the unary productions would require loss of useful information. The Tree is modified directly (since it is passed by reference) and no value is returned.
- Parameters
tree (Tree) – The Tree to be collapsed
collapsePOS (bool) – ‘False’ (default) will not collapse the parent of leaf nodes (ie. Part-of-Speech tags) since they are always unary productions
collapseRoot (bool) – ‘False’ (default) will not modify the root production if it is unary. For the Penn WSJ treebank corpus, this corresponds to the TOP -> productions.
joinChar (str) – A string used to connect collapsed node values (default = “+”)
util
Module¶
-
nltk.util.
acyclic_branches_depth_first
(tree, children=<built-in function iter>, depth=-1, cut_mark=None, traversed=None)[source]¶ Traverse the nodes of a tree in depth-first order, discarding eventual cycles within the same branch, but keeping duplicate paths in different branches. Add cut_mark (when defined) if cycles were truncated.
The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.
Catches only cycles within the same branch, but keeps duplicate paths from different branches:
>>> import nltk
>>> from nltk.util import acyclic_branches_depth_first as tree
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(tree(wn.synset('certified.a.01'), lambda s:s.also_sees(), cut_mark='...', depth=4))
[Synset('certified.a.01'),
 [Synset('authorized.a.01'),
  [Synset('lawful.a.01'),
   [Synset('legal.a.01'),
    "Cycle(Synset('lawful.a.01'),0,...)",
    [Synset('legitimate.a.01'), '...']],
   [Synset('straight.a.06'),
    [Synset('honest.a.01'), '...'],
    "Cycle(Synset('lawful.a.01'),0,...)"]],
  [Synset('legitimate.a.01'),
   "Cycle(Synset('authorized.a.01'),1,...)",
   [Synset('legal.a.01'),
    [Synset('lawful.a.01'), '...'],
    "Cycle(Synset('legitimate.a.01'),0,...)"],
   [Synset('valid.a.01'),
    "Cycle(Synset('legitimate.a.01'),0,...)",
    [Synset('reasonable.a.01'), '...']]],
  [Synset('official.a.01'), "Cycle(Synset('authorized.a.01'),1,...)"]],
 [Synset('documented.a.01')]]
-
nltk.util.
acyclic_breadth_first
(tree, children=<built-in function iter>, maxdepth=-1)[source]¶ Traverse the nodes of a tree in breadth-first order, discarding eventual cycles.
The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.
-
nltk.util.
acyclic_depth_first
(tree, children=<built-in function iter>, depth=-1, cut_mark=None, traversed=None)[source]¶ Traverse the nodes of a tree in depth-first order, discarding eventual cycles within any branch, adding cut_mark (when specified) if cycles were truncated.
The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.
Catches all cycles:
>>> import nltk
>>> from nltk.util import acyclic_depth_first as acyclic_tree
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(acyclic_tree(wn.synset('dog.n.01'), lambda s:s.hypernyms(),cut_mark='...'))
[Synset('dog.n.01'),
 [Synset('canine.n.02'),
  [Synset('carnivore.n.01'),
   [Synset('placental.n.01'),
    [Synset('mammal.n.01'),
     [Synset('vertebrate.n.01'),
      [Synset('chordate.n.01'),
       [Synset('animal.n.01'),
        [Synset('organism.n.01'),
         [Synset('living_thing.n.01'),
          [Synset('whole.n.02'),
           [Synset('object.n.01'),
            [Synset('physical_entity.n.01'),
             [Synset('entity.n.01')]]]]]]]]]]]]],
 [Synset('domestic_animal.n.01'), "Cycle(Synset('animal.n.01'),-3,...)"]]
-
nltk.util.
acyclic_dic2tree
(node, dic)[source]¶ Convert acyclic dictionary ‘dic’, where the keys are nodes, and the values are lists of children, to output tree suitable for pprint(), starting at root ‘node’, with subtrees as nested lists.
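A small example of the dictionary-to-tree conversion (the graph below is invented for illustration):

from pprint import pprint
from nltk.util import acyclic_dic2tree

# Keys are nodes, values are lists of children; 'a' is the root.
dic = {'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': []}
pprint(acyclic_dic2tree('a', dic))
# ['a', ['b', ['d']], ['c']]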
-
nltk.util.
bigrams
(sequence, **kwargs)[source]¶ Return the bigrams generated from a sequence of items, as an iterator. For example:
>>> from nltk.util import bigrams
>>> list(bigrams([1,2,3,4,5]))
[(1, 2), (2, 3), (3, 4), (4, 5)]
Wrap with list for a list version of this function.
- Parameters
sequence (sequence or iter) – the source data to be converted into bigrams
- Return type
iter(tuple)
-
nltk.util.
binary_search_file
(file, key, cache={}, cacheDepth=-1)[source]¶ Return the line from the file whose first word is key. Searches through a sorted file using the binary search algorithm.
- Parameters
file (file) – the file to be searched through.
key (str) – the identifier we are searching for.
-
nltk.util.
breadth_first
(tree, children=<built-in function iter>, maxdepth=-1)[source]¶ Traverse the nodes of a tree in breadth-first order. (No check for cycles.) The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.
-
nltk.util.
choose
(n, k)[source]¶ This function is a fast way to calculate binomial coefficients, commonly known as nCk, i.e. the number of combinations of n things taken k at a time. (https://en.wikipedia.org/wiki/Binomial_coefficient).
This is equivalent to scipy.special.comb() with long integer computation, but this approximation is faster; see https://github.com/nltk/nltk/issues/1181
>>> choose(4, 2)
6
>>> choose(6, 2)
15
- Parameters
n (int) – The number of things.
k (int) – The number of things taken at a time.
-
nltk.util.
elementtree_indent
(elem, level=0)[source]¶ Recursive function to indent an ElementTree._ElementInterface for pretty printing. Run this function on elem and then output in the normal way.
- Parameters
elem (ElementTree._ElementInterface) – element to be indented. will be modified.
level (nonnegative integer) – level of indentation for this element
- Return type
ElementTree._ElementInterface
- Returns
Contents of elem indented to reflect its structure
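A minimal usage sketch (the XML snippet is invented for illustration):

from xml.etree import ElementTree
from nltk.util import elementtree_indent

root = ElementTree.fromstring("<doc><s><w>hello</w><w>world</w></s></doc>")
elementtree_indent(root)   # whitespace is added to the element tree in place
print(ElementTree.tostring(root, encoding="unicode"))   # nested, indented XML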
-
nltk.util.
everygrams
(sequence, min_len=1, max_len=-1, pad_left=False, pad_right=False, **kwargs)[source]¶ Returns all possible ngrams generated from a sequence of items, as an iterator.
>>> sent = 'a b c'.split()
- Output of the current version of everygrams:
>>> list(everygrams(sent))
[('a',), ('a', 'b'), ('a', 'b', 'c'), ('b',), ('b', 'c'), ('c',)]
- Output order of older versions of everygrams (reproduced by sorting on length):
>>> sorted(everygrams(sent), key=len)
[('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
>>> list(everygrams(sent, max_len=2))
[('a',), ('a', 'b'), ('b',), ('b', 'c'), ('c',)]
- Parameters
sequence (sequence or iter) – the source data to be converted into ngrams. If max_len is not provided, this sequence will be loaded into memory
min_len (int) – minimum length of the ngrams, aka. n-gram order/degree of ngram
max_len (int) – maximum length of the ngrams (set to length of sequence by default)
pad_left (bool) – whether the ngrams should be left-padded
pad_right (bool) – whether the ngrams should be right-padded
- Return type
iter(tuple)
-
nltk.util.
flatten
(*args)[source]¶ Flatten a list.
>>> from nltk.util import flatten
>>> flatten(1, 2, ['b', 'a' , ['c', 'd']], 3)
[1, 2, 'b', 'a', 'c', 'd', 3]
- Parameters
args – items and lists to be combined into a single list
- Return type
list
-
nltk.util.
guess_encoding
(data)[source]¶ Given a byte string, attempt to decode it. Tries the standard ‘UTF8’ and ‘latin-1’ encodings, plus several gathered from locale information.
The calling program must first call:
locale.setlocale(locale.LC_ALL, '')
If successful it returns (decoded_unicode, successful_encoding). If unsuccessful it raises a UnicodeError.
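A minimal usage sketch; the byte string is invented for illustration, and the encoding actually reported depends on the locale of the machine running the code:

import locale
from nltk.util import guess_encoding

locale.setlocale(locale.LC_ALL, '')     # required before calling guess_encoding
data = 'Olá mundo'.encode('latin-1')    # a byte string of unknown encoding
decoded, encoding = guess_encoding(data)
print(decoded, encoding)                # e.g. Olá mundo latin-1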
-
nltk.util.
in_idle
()[source]¶ Return True if this function is run within IDLE. Tkinter programs that are run in IDLE should never call Tk.mainloop; so this function should be used to gate all calls to Tk.mainloop.
- Warning
This function works by checking sys.stdin. If the user has modified sys.stdin, then it may return incorrect results.
- Return type
bool
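A minimal sketch of the gating pattern described above (the small Tkinter program is invented for illustration):

import tkinter
from nltk.util import in_idle

def demo():
    top = tkinter.Tk()
    tkinter.Label(top, text="hello").pack()
    # IDLE already runs a Tk mainloop, so only start one when not inside IDLE.
    if not in_idle():
        top.mainloop()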
-
nltk.util.
invert_graph
(graph)[source]¶ Inverts a directed graph.
- Parameters
graph (dict(set)) – the graph, represented as a dictionary of sets
- Returns
the inverted graph
- Return type
dict(set)
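A small example on an invented graph; every edge's direction is reversed in the result:

from nltk.util import invert_graph

graph = {'a': {'b', 'c'}, 'b': {'c'}}   # edges point from each key to its values
print(invert_graph(graph))
# {'b': {'a'}, 'c': {'a', 'b'}}  (key and set ordering may vary)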
-
nltk.util.
ngrams
(sequence, n, **kwargs)[source]¶ Return the ngrams generated from a sequence of items, as an iterator. For example:
>>> from nltk.util import ngrams
>>> list(ngrams([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
Wrap with list for a list version of this function. Set pad_left or pad_right to true in order to get additional ngrams:
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
- Parameters
sequence (sequence or iter) – the source data to be converted into ngrams
n (int) – the degree of the ngrams
pad_left (bool) – whether the ngrams should be left-padded
pad_right (bool) – whether the ngrams should be right-padded
left_pad_symbol (any) – the symbol to use for left padding (default is None)
right_pad_symbol (any) – the symbol to use for right padding (default is None)
- Return type
sequence or iter
-
nltk.util.
pad_sequence
(sequence, n, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)[source]¶ Returns a padded sequence of items before ngram extraction.
>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
['<s>', 1, 2, 3, 4, 5, '</s>']
>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
['<s>', 1, 2, 3, 4, 5]
>>> list(pad_sequence([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[1, 2, 3, 4, 5, '</s>']
- Parameters
sequence (sequence or iter) – the source data to be padded
n (int) – the degree of the ngrams
pad_left (bool) – whether the ngrams should be left-padded
pad_right (bool) – whether the ngrams should be right-padded
left_pad_symbol (any) – the symbol to use for left padding (default is None)
right_pad_symbol (any) – the symbol to use for right padding (default is None)
- Return type
sequence or iter
-
nltk.util.
pr
(data, start=0, end=None)[source]¶ Pretty print a sequence of data items
- Parameters
data (sequence or iter) – the data stream to print
start (int) – the start position
end (int) – the end position
-
nltk.util.
print_string
(s, width=70)[source]¶ Pretty print a string, breaking lines on whitespace
- Parameters
s (str) – the string to print, consisting of words and spaces
width (int) – the display width
-
nltk.util.
re_show
(regexp, string, left='{', right='}')[source]¶ Return a string with markers surrounding the matched substrings. Search string for substrings matching regexp and wrap the matches with braces. This is convenient for learning about regular expressions.
- Parameters
regexp (str) – The regular expression.
string (str) – The string being matched.
left (str) – The left delimiter (printed before the matched substring)
right (str) – The right delimiter (printed after the matched substring)
- Return type
str
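A minimal usage sketch with an invented sentence:

from nltk.util import re_show

re_show(r'\d+', 'Moby Dick was published in 1851, not 1849.')
# each match is wrapped in the delimiters:
#   Moby Dick was published in {1851}, not {1849}.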
-
nltk.util.
set_proxy
(proxy, user=None, password='')[source]¶ Set the HTTP proxy for Python to download through.
If proxy is None, then it tries to set the proxy from environment or system settings.
- Parameters
proxy – The HTTP proxy server to use. For example: ‘http://proxy.example.com:3128/’
user – The username to authenticate with. Use None to disable authentication.
password – The password to authenticate with.
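A minimal usage sketch; proxy.example.com and the credentials are placeholders, not a real server or account:

from nltk.util import set_proxy

# Route subsequent NLTK downloads through an authenticated HTTP proxy.
set_proxy('http://proxy.example.com:3128/', user='alice', password='secret')

# With user=None (the default), no proxy authentication is attempted:
# set_proxy('http://proxy.example.com:3128/')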
-
nltk.util.
skipgrams
(sequence, n, k, **kwargs)[source]¶ Returns all possible skipgrams generated from a sequence of items, as an iterator. Skipgrams are ngrams that allow tokens to be skipped. Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
>>> sent = "Insurgents killed in ongoing fighting".split() >>> list(skipgrams(sent, 2, 2)) [('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')] >>> list(skipgrams(sent, 3, 2)) [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
- Parameters
sequence (sequence or iter) – the source data to be converted into skipgrams
n (int) – the degree of the ngrams
k (int) – the skip distance
- Return type
iter(tuple)
-
nltk.util.
tokenwrap
(tokens, separator=' ', width=70)[source]¶ Pretty print a list of text tokens, breaking lines on whitespace
- Parameters
tokens (list) – the tokens to print
separator (str) – the string to use to separate tokens
width (int) – the display width (default=70)
-
nltk.util.
transitive_closure
(graph, reflexive=False)[source]¶ Calculate the transitive closure of a directed graph, optionally the reflexive transitive closure.
The algorithm is a slight modification of the “Marking Algorithm” of Ioannidis & Ramakrishnan (1998) “Efficient Transitive Closure Algorithms”.
- Parameters
graph (dict(set)) – the initial graph, represented as a dictionary of sets
reflexive (bool) – if set, also make the closure reflexive
- Return type
dict(set)
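A small example on an invented graph:

from nltk.util import transitive_closure

graph = {'a': {'b'}, 'b': {'c'}, 'c': set()}
print(transitive_closure(graph))
# {'a': {'b', 'c'}, 'b': {'c'}, 'c': set()}   (ordering may vary)
print(transitive_closure(graph, reflexive=True))
# {'a': {'a', 'b', 'c'}, 'b': {'b', 'c'}, 'c': {'c'}}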
-
nltk.util.
trigrams
(sequence, **kwargs)[source]¶ Return the trigrams generated from a sequence of items, as an iterator. For example:
>>> from nltk.util import trigrams
>>> list(trigrams([1,2,3,4,5]))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
Wrap with list for a list version of this function.
- Parameters
sequence (sequence or iter) – the source data to be converted into trigrams
- Return type
iter(tuple)
-
nltk.util.
unweighted_minimum_spanning_tree
(tree, children=<built-in function iter>)[source]¶ Output a Minimum Spanning Tree (MST) of an unweighted graph, by traversing the nodes of a tree in breadth-first order, discarding eventual cycles.
The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.
>>> import nltk
>>> from nltk.util import unweighted_minimum_spanning_tree as mst
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(mst(wn.synset('bound.a.01'), lambda s:s.also_sees()))
[Synset('bound.a.01'),
 [Synset('unfree.a.02'),
  [Synset('confined.a.02')],
  [Synset('dependent.a.01')],
  [Synset('restricted.a.01'), [Synset('classified.a.02')]]]]
wsd
Module¶
-
nltk.wsd.
lesk
(context_sentence, ambiguous_word, pos=None, synsets=None)[source]¶ Return a synset for an ambiguous word in a context.
- Parameters
context_sentence (iter) – The context sentence where the ambiguous word occurs, passed as an iterable of words.
ambiguous_word (str) – The ambiguous word that requires WSD.
pos (str) – A specified Part-of-Speech (POS).
synsets (iter) – Possible synsets of the ambiguous word.
- Returns
lesk_sense
The Synset() object with the highest signature overlaps.
This function is an implementation of the original Lesk algorithm (1986) [1].
Usage example:
>>> lesk(['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.'], 'bank', 'n')
Synset('savings_bank.n.02')
[1] Lesk, Michael. “Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone.” Proceedings of the 5th Annual International Conference on Systems Documentation. ACM, 1986. http://dl.acm.org/citation.cfm?id=318728
Subpackages¶
- nltk.app package
- nltk.ccg package
- nltk.chat package
- nltk.chunk package
- nltk.classify package
- Submodules
- nltk.classify.api module
- nltk.classify.decisiontree module
- nltk.classify.maxent module
- nltk.classify.megam module
- nltk.classify.naivebayes module
- nltk.classify.positivenaivebayes module
- nltk.classify.rte_classify module
- nltk.classify.scikitlearn module
- nltk.classify.senna module
- nltk.classify.svm module
- nltk.classify.tadm module
- nltk.classify.textcat module
- nltk.classify.util module
- nltk.classify.weka module
- Module contents
- nltk.cluster package
- nltk.corpus package
- Subpackages
- nltk.corpus.reader package
- Submodules
- nltk.corpus.reader.aligned module
- nltk.corpus.reader.api module
- nltk.corpus.reader.bnc module
- nltk.corpus.reader.bracket_parse module
- nltk.corpus.reader.categorized_sents module
- nltk.corpus.reader.chasen module
- nltk.corpus.reader.childes module
- nltk.corpus.reader.chunked module
- nltk.corpus.reader.cmudict module
- nltk.corpus.reader.comparative_sents module
- nltk.corpus.reader.conll module
- nltk.corpus.reader.crubadan module
- nltk.corpus.reader.dependency module
- nltk.corpus.reader.framenet module
- nltk.corpus.reader.ieer module
- nltk.corpus.reader.indian module
- nltk.corpus.reader.ipipan module
- nltk.corpus.reader.knbc module
- nltk.corpus.reader.lin module
- nltk.corpus.reader.mte module
- nltk.corpus.reader.nkjp module
- nltk.corpus.reader.nombank module
- nltk.corpus.reader.nps_chat module
- nltk.corpus.reader.opinion_lexicon module
- nltk.corpus.reader.panlex_lite module
- nltk.corpus.reader.panlex_swadesh module
- nltk.corpus.reader.pl196x module
- nltk.corpus.reader.plaintext module
- nltk.corpus.reader.ppattach module
- nltk.corpus.reader.propbank module
- nltk.corpus.reader.pros_cons module
- nltk.corpus.reader.reviews module
- nltk.corpus.reader.rte module
- nltk.corpus.reader.semcor module
- nltk.corpus.reader.senseval module
- nltk.corpus.reader.sentiwordnet module
- nltk.corpus.reader.sinica_treebank module
- nltk.corpus.reader.string_category module
- nltk.corpus.reader.switchboard module
- nltk.corpus.reader.tagged module
- nltk.corpus.reader.timit module
- nltk.corpus.reader.toolbox module
- nltk.corpus.reader.twitter module
- nltk.corpus.reader.udhr module
- nltk.corpus.reader.util module
- nltk.corpus.reader.verbnet module
- nltk.corpus.reader.wordlist module
- nltk.corpus.reader.wordnet module
- nltk.corpus.reader.xmldocs module
- nltk.corpus.reader.ycoe module
- Module contents
- nltk.corpus.reader package
- Submodules
- nltk.corpus.europarl_raw module
- nltk.corpus.util module
- Module contents
- Subpackages
- nltk.draw package
- nltk.inference package
- nltk.metrics package
- nltk.misc package
- nltk.parse package
- Submodules
- nltk.parse.api module
- nltk.parse.bllip module
- nltk.parse.chart module
- nltk.parse.corenlp module
- nltk.parse.dependencygraph module
- nltk.parse.earleychart module
- nltk.parse.evaluate module
- nltk.parse.featurechart module
- nltk.parse.generate module
- nltk.parse.malt module
- nltk.parse.nonprojectivedependencyparser module
- nltk.parse.pchart module
- nltk.parse.projectivedependencyparser module
- nltk.parse.recursivedescent module
- nltk.parse.shiftreduce module
- nltk.parse.stanford module
- nltk.parse.transitionparser module
- nltk.parse.util module
- nltk.parse.viterbi module
- Module contents
- nltk.sem package
- Submodules
- nltk.sem.boxer module
- nltk.sem.chat80 module
- nltk.sem.cooper_storage module
- nltk.sem.drt module
- nltk.sem.drt_glue_demo module
- nltk.sem.evaluate module
- nltk.sem.glue module
- nltk.sem.hole module
- nltk.sem.lfg module
- nltk.sem.linearlogic module
- nltk.sem.logic module
- nltk.sem.relextract module
- nltk.sem.skolemize module
- nltk.sem.util module
- Module contents
- nltk.stem package
- Submodules
- nltk.stem.api module
- nltk.stem.arlstem module
- nltk.stem.arlstem2 module
- nltk.stem.cistem module
- nltk.stem.isri module
- nltk.stem.lancaster module
- nltk.stem.porter module
- nltk.stem.regexp module
- nltk.stem.rslp module
- nltk.stem.snowball module
- nltk.stem.util module
- nltk.stem.wordnet module
- Module contents
- nltk.tag package
- Submodules
- nltk.tag.api module
- nltk.tag.brill module
- nltk.tag.brill_trainer module
- nltk.tag.crf module
- nltk.tag.hmm module
- nltk.tag.hunpos module
- nltk.tag.mapping module
- nltk.tag.perceptron module
- nltk.tag.senna module
- nltk.tag.sequential module
- nltk.tag.stanford module
- nltk.tag.tnt module
- nltk.tag.util module
- Module contents
- nltk.test package
- Subpackages
- nltk.test.unit package
- Subpackages
- nltk.test.unit.lm package
- nltk.test.unit.translate package
- Submodules
- nltk.test.unit.translate.test_bleu module
- nltk.test.unit.translate.test_gdfa module
- nltk.test.unit.translate.test_ibm1 module
- nltk.test.unit.translate.test_ibm2 module
- nltk.test.unit.translate.test_ibm3 module
- nltk.test.unit.translate.test_ibm4 module
- nltk.test.unit.translate.test_ibm5 module
- nltk.test.unit.translate.test_ibm_model module
- nltk.test.unit.translate.test_meteor module
- nltk.test.unit.translate.test_nist module
- nltk.test.unit.translate.test_stack_decoder module
- Module contents
- Submodules
- nltk.test.unit.conftest module
- nltk.test.unit.test_aline module
- nltk.test.unit.test_brill module
- nltk.test.unit.test_cfd_mutation module
- nltk.test.unit.test_cfg2chomsky module
- nltk.test.unit.test_chunk module
- nltk.test.unit.test_classify module
- nltk.test.unit.test_collocations module
- nltk.test.unit.test_concordance module
- nltk.test.unit.test_corenlp module
- nltk.test.unit.test_corpora module
- nltk.test.unit.test_corpus_views module
- nltk.test.unit.test_data module
- nltk.test.unit.test_disagreement module
- nltk.test.unit.test_freqdist module
- nltk.test.unit.test_hmm module
- nltk.test.unit.test_json2csv_corpus module
- nltk.test.unit.test_json_serialization module
- nltk.test.unit.test_naivebayes module
- nltk.test.unit.test_nombank module
- nltk.test.unit.test_pl196x module
- nltk.test.unit.test_pos_tag module
- nltk.test.unit.test_rte_classify module
- nltk.test.unit.test_seekable_unicode_stream_reader module
- nltk.test.unit.test_senna module
- nltk.test.unit.test_stem module
- nltk.test.unit.test_tag module
- nltk.test.unit.test_tgrep module
- nltk.test.unit.test_tokenize module
- nltk.test.unit.test_twitter_auth module
- nltk.test.unit.test_util module
- nltk.test.unit.test_wordnet module
- Module contents
- Subpackages
- nltk.test.unit package
- Submodules
- nltk.test.all module
- nltk.test.childes_fixt module
- nltk.test.classify_fixt module
- nltk.test.discourse_fixt module
- nltk.test.gensim_fixt module
- nltk.test.gluesemantics_malt_fixt module
- nltk.test.inference_fixt module
- nltk.test.nonmonotonic_fixt module
- nltk.test.portuguese_en_fixt module
- nltk.test.probability_fixt module
- Module contents
- Subpackages
- nltk.tokenize package
- Submodules
- nltk.tokenize.api module
- nltk.tokenize.casual module
- nltk.tokenize.destructive module
- nltk.tokenize.legality_principle module
- nltk.tokenize.mwe module
- nltk.tokenize.nist module
- nltk.tokenize.punkt module
- nltk.tokenize.regexp module
- nltk.tokenize.repp module
- nltk.tokenize.sexpr module
- nltk.tokenize.simple module
- nltk.tokenize.sonority_sequencing module
- nltk.tokenize.stanford module
- nltk.tokenize.stanford_segmenter module
- nltk.tokenize.texttiling module
- nltk.tokenize.toktok module
- nltk.tokenize.treebank module
- nltk.tokenize.util module
- Module contents