nltk.downloader module

The NLTK corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with NLTK.

Downloading Packages

If called with no arguments, download() displays an interactive interface that can be used to download and install new packages. If Tkinter is available, a graphical interface is shown; otherwise, a simple text interface is provided.

Individual packages can be downloaded by calling the download() function with a single argument, giving the package identifier for the package that should be downloaded:

>>> download('treebank') 
[nltk_data] Downloading package 'treebank'...
[nltk_data]   Unzipping corpora/treebank.zip.

NLTK also provides a number of “package collections”, each consisting of a group of related packages. To download all packages in a collection, call download() with the collection’s identifier:

>>> download('all-corpora') 
[nltk_data] Downloading package 'abc'...
[nltk_data]   Unzipping corpora/abc.zip.
[nltk_data] Downloading package 'alpino'...
[nltk_data]   Unzipping corpora/alpino.zip.
  ...
[nltk_data] Downloading package 'words'...
[nltk_data]   Unzipping corpora/words.zip.

Download Directory

By default, packages are installed either in a system-wide directory (if Python has sufficient access to write to it) or in the current user’s home directory. The download_dir argument may be used to specify a different installation target.

See Downloader.default_download_dir() for a more detailed description of how the default download directory is chosen.
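The download_dir argument can be combined with is_installed() to avoid repeated downloads. The following is a minimal sketch, not part of NLTK: ensure_installed is a hypothetical helper, and it assumes only the is_installed() and download() signatures listed on this page.

```python
import os


def ensure_installed(downloader, package_id, target_dir):
    """Download package_id into target_dir unless it is already there.

    `downloader` is expected to behave like nltk.downloader.Downloader;
    only its is_installed() and download() methods are used.
    This helper is hypothetical and not part of NLTK.
    """
    os.makedirs(target_dir, exist_ok=True)
    if downloader.is_installed(package_id, download_dir=target_dir):
        return True
    # quiet=True suppresses the "[nltk_data] ..." progress lines.
    return bool(downloader.download(package_id, download_dir=target_dir, quiet=True))
```

With NLTK installed, the helper could be called as ensure_installed(Downloader(), 'treebank', target_dir), where target_dir is any writable directory.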

NLTK Download Server

Before downloading any packages, the corpus and module downloader contacts the NLTK download server, to retrieve an index file describing the available packages. By default, this index file is loaded from https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml. If necessary, it is possible to create a new Downloader object, specifying a different URL for the package index file.
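A Downloader pointed at a different index might be created as sketched below. The helper name make_downloader and the mirror URL are hypothetical and used only for illustration; the keyword arguments follow the Downloader constructor signature listed further down this page.

```python
def make_downloader(index_url, download_dir=None):
    """Return a Downloader that reads its package index from index_url."""
    # Imported lazily so that defining this helper does not require NLTK.
    from nltk.downloader import Downloader

    return Downloader(server_index_url=index_url, download_dir=download_dir)


# Hypothetical mirror URL, used only for illustration.
MIRROR_INDEX_URL = "https://example.org/nltk_data/index.xml"
```

Calling make_downloader(MIRROR_INDEX_URL).download('treebank') would then fetch the package from whatever location that index describes.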

Usage:

python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

or:

python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
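
When the downloader has to be driven from another script, the command line can be assembled programmatically. This is a sketch under assumptions: downloader_command is a hypothetical helper, and it assumes that -q and -f correspond to the quiet and force options of download(); the -k flag is left out because its meaning is not described here.

```python
import sys


def downloader_command(package_ids, data_dir=None, quiet=False, force=False):
    """Build the argv list for `python -m nltk.downloader`."""
    cmd = [sys.executable, "-m", "nltk.downloader"]
    if data_dir is not None:
        cmd += ["-d", data_dir]
    if quiet:
        cmd.append("-q")  # assumed to mean "quiet"
    if force:
        cmd.append("-f")  # assumed to mean "force"
    return cmd + list(package_ids)
```

The resulting list can be passed to subprocess.run(cmd, check=True).
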
class nltk.downloader.Collection[source]

Bases: object

A directory entry for a collection of downloadable packages. These entries are extracted from the XML index file that is downloaded by Downloader.

__init__(id, children, name=None, **kw)[source]
children

A list of the Collections or Packages directly contained by this collection.

static fromxml(xml)[source]
id

A unique identifier for this collection.

name

A string name for this collection.

packages

A list of Packages contained by this collection or any collections it recursively contains.
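The difference between children (direct members) and packages (all packages, found recursively) can be illustrated with stand-in classes. Pkg, Coll, and all_packages below are hypothetical and do not reproduce NLTK's implementation; dropping duplicates is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Pkg:
    id: str


@dataclass
class Coll:
    id: str
    children: List[Union["Coll", Pkg]] = field(default_factory=list)


def all_packages(collection):
    """Return every Pkg under collection, depth first, without duplicates."""
    seen, result = set(), []
    stack = list(reversed(collection.children))
    while stack:
        item = stack.pop()
        if isinstance(item, Coll):
            # Visit the nested collection's children next, in order.
            stack.extend(reversed(item.children))
        elif item.id not in seen:
            seen.add(item.id)
            result.append(item)
    return result
```
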

class nltk.downloader.Downloader[source]

Bases: object

A class used to access the NLTK data server, which can be used to download corpora and other data packages.

DEFAULT_URL = 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml'

The default URL for the NLTK data server’s index. An alternative URL can be specified when creating a new Downloader object.

INDEX_TIMEOUT = 3600

The amount of time, in seconds, after which the cached copy of the data server index is considered ‘stale’ and is re-downloaded.
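A staleness check of this kind can be sketched as follows. The function index_is_stale is hypothetical, and the sketch assumes the timeout is measured in seconds against the time of the last fetch.

```python
import time

INDEX_TIMEOUT = 3600  # seconds, mirroring Downloader.INDEX_TIMEOUT


def index_is_stale(last_fetched, now=None, timeout=INDEX_TIMEOUT):
    """Return True if an index fetched at last_fetched should be refreshed."""
    if last_fetched is None:  # the index has never been fetched
        return True
    if now is None:
        now = time.time()
    return now - last_fetched > timeout
```
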

INSTALLED = 'installed'

A status string indicating that a package or collection is installed and up-to-date.

NOT_INSTALLED = 'not installed'

A status string indicating that a package or collection is not installed.

PARTIAL = 'partial'

A status string indicating that a collection is partially installed (i.e., only some of its packages are installed).

STALE = 'out of date'

A status string indicating that a package or collection is corrupt or out-of-date.

__init__(server_index_url=None, download_dir=None)[source]
clear_status_cache(id=None)[source]
collections()[source]
corpora()[source]
default_download_dir()[source]

Return the directory to which packages will be downloaded by default. This value can be overridden using the constructor, or on a case-by-case basis using the download_dir argument when calling download().

On Windows, the default download directory is PYTHONHOME/lib/nltk, where PYTHONHOME is the directory containing Python, e.g. C:\Python25.

On all other platforms, the default directory is the first of the following which exists or which can be created with write permission: /usr/share/nltk_data, /usr/local/share/nltk_data, /usr/lib/nltk_data, /usr/local/lib/nltk_data, ~/nltk_data.
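The "first directory that exists or can be created" rule can be sketched with the standard library alone. first_usable_dir is a hypothetical helper, and treating a missing directory as creatable only when its parent exists and is writable is an assumption of this sketch, not a statement about NLTK's exact checks.

```python
import os


def first_usable_dir(candidates):
    """Return the first candidate that is writable or could be created."""
    for path in candidates:
        path = os.path.expanduser(path)
        if os.path.isdir(path):
            if os.access(path, os.W_OK):
                return path
            continue
        parent = os.path.dirname(path.rstrip(os.sep)) or os.curdir
        # A missing directory counts as usable only if its parent exists
        # and is writable, so that a single mkdir would succeed.
        if (
            not os.path.exists(path)
            and os.path.isdir(parent)
            and os.access(parent, os.W_OK)
        ):
            return path
    return None
```

For example, first_usable_dir(['/usr/share/nltk_data', '~/nltk_data']) would typically fall through to the home directory for an unprivileged user.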

download(info_or_id=None, download_dir=None, quiet=False, force=False, prefix='[nltk_data] ', halt_on_error=True, raise_on_error=False, print_error_to=sys.stderr)[source]
property download_dir

The default directory to which packages will be downloaded. This defaults to the value returned by default_download_dir(). To override this default on a case-by-case basis, use the download_dir argument when calling download().

incr_download(info_or_id, download_dir=None, force=False)[source]
index()[source]

Return the XML index describing the packages available from the data server. If necessary, this index will be downloaded from the data server.

info(id)[source]

Return the Package or Collection record for the given item.

is_installed(info_or_id, download_dir=None)[source]
is_stale(info_or_id, download_dir=None)[source]
list(download_dir=None, show_packages=True, show_collections=True, header=True, more_prompt=False, skip_installed=False)[source]
models()[source]
packages()[source]
status(info_or_id, download_dir=None)[source]

Return a constant describing the status of the given package or collection. Status can be one of INSTALLED, NOT_INSTALLED, STALE, or PARTIAL.
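For a collection, the status has to be derived from the statuses of its members. One plausible way to combine them is sketched below; collection_status is hypothetical, and the precedence (STALE first, then PARTIAL) is an assumption rather than documented behaviour.

```python
INSTALLED = "installed"
NOT_INSTALLED = "not installed"
PARTIAL = "partial"
STALE = "out of date"


def collection_status(child_statuses):
    """Combine the statuses of a collection's children into one status."""
    statuses = set(child_statuses)
    if STALE in statuses:
        return STALE
    if PARTIAL in statuses or statuses >= {INSTALLED, NOT_INSTALLED}:
        return PARTIAL
    if NOT_INSTALLED in statuses:
        return NOT_INSTALLED
    # Every child is installed (an empty collection also lands here).
    return INSTALLED
```
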

update(quiet=False, prefix='[nltk_data] ')[source]

Re-download any packages whose status is STALE.

property url

The URL for the data server’s index file.

xmlinfo(id)[source]

Return the XML info record for the given item.

class nltk.downloader.DownloaderGUI[source]

Bases: object

Graphical interface for downloading packages from the NLTK data server.

COLUMNS = ['', 'Identifier', 'Name', 'Size', 'Status', 'Unzipped Size', 'Copyright', 'Contact', 'License', 'Author', 'Subdir', 'Checksum']

A list of the names of columns. This controls the order in which the columns will appear. If this is edited, then _package_to_columns() may need to be edited to match.

COLUMN_WEIGHTS = {'': 0, 'Name': 5, 'Size': 0, 'Status': 0}

A dictionary specifying how columns should be resized when the table is resized. Columns with weight 0 are not resized at all, and columns with higher weights are resized more. The default weight (for columns not explicitly listed) is 1.

COLUMN_WIDTHS = {'': 1, 'Identifier': 20, 'Name': 45, 'Size': 10, 'Status': 12, 'Unzipped Size': 10}

A dictionary specifying how wide each column should be, in characters. The default width (for columns not explicitly listed) is specified by DEFAULT_COLUMN_WIDTH.

DEFAULT_COLUMN_WIDTH = 30

The default width for columns that are not explicitly listed in COLUMN_WIDTHS.

HELP = 'This tool can be used to download a variety of corpora and models\nthat can be used with NLTK.  Each corpus or model is distributed\nin a single zip file, known as a "package file."  You can\ndownload packages individually, or you can download pre-defined\ncollections of packages.\n\nWhen you download a package, it will be saved to the "download\ndirectory."  A default download directory is chosen when you run\n\nthe downloader; but you may also select a different download\ndirectory.  On Windows, the default download directory is\n\n\n"package."\n\nThe NLTK downloader can be used to download a variety of corpora,\nmodels, and other data packages.\n\nKeyboard shortcuts::\n  [return]\t Download\n  [up]\t Select previous package\n  [down]\t Select next package\n  [left]\t Select previous tab\n  [right]\t Select next tab\n'
INITIAL_COLUMNS = ['', 'Identifier', 'Name', 'Size', 'Status']

The set of columns that should be displayed by default.

__init__(dataserver, use_threads=True)[source]
about(*e)[source]
c = 'Status'
destroy(*e)[source]
help(*e)[source]
mainloop(*args, **kwargs)[source]
class nltk.downloader.DownloaderMessage[source]

Bases: object

A status message object, used by incr_download to communicate its progress.
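A consumer of these messages might tally them by type. The sketch below assumes that incr_download() yields message objects one at a time and that ErrorMessage stores its constructor argument as a message attribute; both are assumptions, and summarize is a hypothetical helper that dispatches on class names so it can run without NLTK.

```python
from collections import Counter


def summarize(messages):
    """Count messages by class name and collect any error texts."""
    counts, errors = Counter(), []
    for msg in messages:
        kind = type(msg).__name__
        counts[kind] += 1
        if kind == "ErrorMessage":
            # Assumes the constructor argument is stored as `.message`.
            errors.append(getattr(msg, "message", None))
    return counts, errors
```

With NLTK installed, summarize(Downloader().incr_download('treebank')) would report how many messages of each kind were produced.
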

class nltk.downloader.DownloaderShell[source]

Bases: object

__init__(dataserver)[source]
run()[source]
class nltk.downloader.ErrorMessage[source]

Bases: DownloaderMessage

Data server encountered an error.

__init__(package, message)[source]
class nltk.downloader.FinishCollectionMessage[source]

Bases: DownloaderMessage

Data server has finished working on a collection of packages.

__init__(collection)[source]
class nltk.downloader.FinishDownloadMessage[source]

Bases: DownloaderMessage

Data server has finished downloading a package.

__init__(package)[source]
class nltk.downloader.FinishPackageMessage[source]

Bases: DownloaderMessage

Data server has finished working on a package.

__init__(package)[source]
class nltk.downloader.FinishUnzipMessage[source]

Bases: DownloaderMessage

Data server has finished unzipping a package.

__init__(package)[source]
class nltk.downloader.Package[source]

Bases: object

A directory entry for a downloadable package. These entries are extracted from the XML index file that is downloaded by Downloader. Each package consists of a single file; but if that file is a zip file, then it can be automatically decompressed when the package is installed.

__init__(id, url, name=None, subdir='', size=None, unzipped_size=None, checksum=None, svn_revision=None, copyright='Unknown', contact='Unknown', license='Unknown', author='Unknown', unzip=True, **kw)[source]
author

Author of this package.

checksum

The MD5 checksum of the package file.

contact

Name & email of the person who should be contacted with questions about this package.

copyright

Copyright holder for this package.

filename

The filename that should be used for this package’s file. It is formed by joining self.subdir with self.id, and using the same extension as url.
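The rule for forming the filename can be written out as a short sketch. package_filename is a hypothetical helper and the example URL is made up; only the joining rule itself comes from the description above.

```python
import os


def package_filename(subdir, package_id, url):
    """Join subdir with package_id, reusing the extension found in url."""
    ext = os.path.splitext(url.rsplit("/", 1)[-1])[1]
    return os.path.join(subdir, package_id + ext)
```

For instance, a package with id 'treebank', subdir 'corpora', and a URL ending in treebank.zip yields corpora/treebank.zip (with the platform's path separator).
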

static fromxml(xml)[source]
id

A unique identifier for this package.

license

License information for this package.

name

A string name for this package.

size

The filesize (in bytes) of the package file.

subdir

The subdirectory where this package should be installed. E.g., 'corpora' or 'taggers'.

svn_revision

A subversion revision number for this package.

unzip

A flag indicating whether this corpus should be unzipped by default.

unzipped_size

The total filesize of the files contained in the package’s zipfile.

url

A URL that can be used to download this package’s file.

class nltk.downloader.ProgressMessage[source]

Bases: DownloaderMessage

Indicates how much progress the data server has made.

__init__(progress)[source]
class nltk.downloader.SelectDownloadDirMessage[source]

Bases: DownloaderMessage

Indicates what download directory the data server is using.

__init__(download_dir)[source]
class nltk.downloader.StaleMessage[source]

Bases: DownloaderMessage

The package download file is out-of-date or corrupt.

__init__(package)[source]
class nltk.downloader.StartCollectionMessage[source]

Bases: DownloaderMessage

Data server has started working on a collection of packages.

__init__(collection)[source]
class nltk.downloader.StartDownloadMessage[source]

Bases: DownloaderMessage

Data server has started downloading a package.

__init__(package)[source]
class nltk.downloader.StartPackageMessage[source]

Bases: DownloaderMessage

Data server has started working on a package.

__init__(package)[source]
class nltk.downloader.StartUnzipMessage[source]

Bases: DownloaderMessage

Data server has started unzipping a package.

__init__(package)[source]
class nltk.downloader.UpToDateMessage[source]

Bases: DownloaderMessage

The package download file is already up-to-date.

__init__(package)[source]
nltk.downloader.build_index(root, base_url)[source]

Create a new data.xml index file by combining the XML description files for various packages and collections. root should be the path to a directory containing the package XML and zip files, and the collection XML files. The root directory is expected to have the following subdirectories:

root/
  packages/ .................. subdirectory for packages
    corpora/ ................. zip & xml files for corpora
    grammars/ ................ zip & xml files for grammars
    taggers/ ................. zip & xml files for taggers
    tokenizers/ .............. zip & xml files for tokenizers
    etc.
  collections/ ............... xml files for collections

For each package, there should be two files: package.zip (where package is the package name) which contains the package itself as a compressed zip file; and package.xml, which is an xml description of the package. The zipfile package.zip should expand to a single subdirectory named package/. The base filename package must match the identifier given in the package’s xml file.

For each collection, there should be a single file collection.xml describing the collection, where collection is the name of the collection.

All identifiers (for both packages and collections) must be unique.
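Before building an index, it may help to verify that every package has both of its files. The check below is a sketch that assumes the directory layout described above; unpaired_packages is a hypothetical helper and is not part of NLTK.

```python
import os


def unpaired_packages(root):
    """Return package paths under root/packages/* lacking a .zip or .xml file."""
    problems = []
    packages_dir = os.path.join(root, "packages")
    for subdir in sorted(os.listdir(packages_dir)):
        path = os.path.join(packages_dir, subdir)
        if not os.path.isdir(path):
            continue
        # Map each base filename to the set of extensions found for it.
        stems = {}
        for name in os.listdir(path):
            stem, ext = os.path.splitext(name)
            if ext in (".zip", ".xml"):
                stems.setdefault(stem, set()).add(ext)
        for stem, exts in sorted(stems.items()):
            if exts != {".zip", ".xml"}:
                problems.append(os.path.join(subdir, stem))
    return problems
```
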

nltk.downloader.download(info_or_id=None, download_dir=None, quiet=False, force=False, prefix='[nltk_data] ', halt_on_error=True, raise_on_error=False, print_error_to=sys.stderr)
nltk.downloader.download_gui()[source]
nltk.downloader.download_shell()[source]
nltk.downloader.md5_hexdigest(file)[source]

Calculate and return the MD5 checksum for a given file. file may either be a filename or an open stream.
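An equivalent computation can be sketched with hashlib. md5_of is a hypothetical stand-in, not NLTK's implementation; it accepts either a filename or a binary stream, and reading in chunks is a choice made for this sketch.

```python
import hashlib


def md5_of(file, chunk_size=1 << 16):
    """Return the hex MD5 digest of a filename or an open binary stream."""
    if isinstance(file, str):
        with open(file, "rb") as stream:
            return md5_of(stream, chunk_size)
    digest = hashlib.md5()
    # Read in chunks so that large package files are not loaded at once.
    for chunk in iter(lambda: file.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

The result can be compared with a package's checksum attribute to detect a corrupt download.
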

nltk.downloader.unzip(filename, root, verbose=True)[source]

Extract the contents of the zip file filename into the directory root.
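The same operation can be sketched with the standard zipfile module. extract_zip is a hypothetical stand-in and omits the progress output that the verbose flag suggests.

```python
import os
import zipfile


def extract_zip(filename, root):
    """Extract the zip file filename into root and return the member names."""
    os.makedirs(root, exist_ok=True)
    with zipfile.ZipFile(filename) as archive:
        archive.extractall(root)
        return archive.namelist()
```
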

nltk.downloader.update()[source]