nltk.collections module

class nltk.collections.OrderedDict[source]

Bases: dict

__init__(data=None, **kwargs)[source]
clear() → None. Remove all items from D.[source]
copy() → a shallow copy of D[source]
items() → a set-like object providing a view on D's items[source]
keys() → a set-like object providing a view on D's keys[source]
popitem()[source]

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, failobj=None)[source]

Insert key with a value of failobj if key is not in the dictionary.

Return the value for key if key is in the dictionary, else failobj.

update([E, ]**F) → None. Update D from dict/iterable E and F.[source]

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k].
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v.
In either case, this is followed by: for k in F: D[k] = F[k].

values() → an object providing a view on D's values[source]
class nltk.collections.AbstractLazySequence[source]

Bases: object

An abstract base class for read-only sequences whose values are computed as needed. Lazy sequences act like tuples – they can be indexed, sliced, and iterated over; but they may not be modified.

The most common application of lazy sequences in NLTK is for corpus view objects, which provide access to the contents of a corpus without loading the entire corpus into memory, by loading pieces of the corpus from disk as needed.

The result of modifying a mutable element of a lazy sequence is undefined. In particular, the modifications made to the element may or may not persist, depending on whether and when the lazy sequence caches that element’s value or reconstructs it from scratch.

Subclasses are required to define two methods: __len__() and iterate_from().
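A minimal sketch of such a subclass (the class name LazySquares is hypothetical, chosen only for illustration; it defines just the two required methods and inherits everything else):

```python
from nltk.collections import AbstractLazySequence

class LazySquares(AbstractLazySequence):
    """Hypothetical subclass: lazily yields the squares 0, 1, 4, ..."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        # Required: the total number of values in the sequence.
        return self._n

    def iterate_from(self, start):
        # Required: generate values on demand, beginning at `start`.
        for i in range(start, self._n):
            yield i * i

squares = LazySquares(5)
```

Indexing, slicing, and iteration then come for free through the inherited AbstractLazySequence machinery, e.g. `squares[2]` and `list(squares)`.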

iterate_from(start)[source]

Return an iterator that generates the tokens in the corpus file underlying this corpus view, starting at the token number start. If start>=len(self), then this iterator will generate no tokens.

count(value)[source]

Return the number of times this list contains value.

index(value, start=None, stop=None)[source]

Return the index of the first occurrence of value in this list that is greater than or equal to start and less than stop. Negative start and stop values are treated like negative slice bounds – i.e., they count from the end of the list.
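As an illustrative sketch (using LazyMap, described below, simply to obtain a lazy sequence to search):

```python
from nltk.collections import LazyMap

# A lazy sequence of strings; str is applied on access.
seq = LazyMap(str, [10, 20, 30, 20])

print(seq.count('20'))           # number of occurrences: 2
print(seq.index('20'))           # first occurrence: 1
print(seq.index('20', start=2))  # first occurrence at or after position 2: 3
</imports>```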

class nltk.collections.LazySubsequence[source]

Bases: nltk.collections.AbstractLazySequence

A subsequence produced by slicing a lazy sequence. This slice keeps a reference to its source sequence, and generates its values by looking them up in the source sequence.

MIN_SIZE = 100

The minimum size for which lazy slices should be created. If LazySubsequence() is called with a subsequence that is shorter than MIN_SIZE, then a tuple will be returned instead.

static __new__(cls, source, start, stop)[source]

Construct a new slice from a given underlying sequence. The start and stop indices should be absolute indices – i.e., they should not be negative (for indexing from the back of a list) or greater than the length of source.

__init__(source, start, stop)[source]
iterate_from(start)[source]

Return an iterator that generates the tokens in the corpus file underlying this corpus view, starting at the token number start. If start>=len(self), then this iterator will generate no tokens.

class nltk.collections.LazyConcatenation[source]

Bases: nltk.collections.AbstractLazySequence

A lazy sequence formed by concatenating a list of lists. This underlying list of lists may itself be lazy. LazyConcatenation maintains an index that it uses to keep track of the relationship between offsets in the concatenated lists and offsets in the sublists.
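For example, concatenating a small (non-lazy) list of lists:

```python
from nltk.collections import LazyConcatenation

nested = [[1, 2], [3], [], [4, 5, 6]]
flat = LazyConcatenation(nested)

print(len(flat))   # 6
print(flat[3])     # 4: the offset is mapped through the maintained index
print(list(flat))  # [1, 2, 3, 4, 5, 6]
```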

__init__(list_of_lists)[source]
iterate_from(start_index)[source]

Return an iterator that generates the values of this lazy sequence, starting at position start_index. If start_index>=len(self), then this iterator will generate no values.

class nltk.collections.LazyMap[source]

Bases: nltk.collections.AbstractLazySequence

A lazy sequence whose elements are formed by applying a given function to each element in one or more underlying lists. The function is applied lazily – i.e., when you read a value from the list, LazyMap will calculate that value by applying its function to the underlying lists’ value(s). LazyMap is essentially a lazy version of the Python primitive function map. In particular, the following two expressions are equivalent:

>>> from nltk.collections import LazyMap
>>> function = str
>>> sequence = [1,2,3]
>>> list(map(function, sequence))
['1', '2', '3']
>>> list(LazyMap(function, sequence))
['1', '2', '3']

Like Python 2’s map primitive (but unlike Python 3’s, which stops at the shortest input), if the source lists do not have equal size, then the value None will be supplied for the ‘missing’ elements.

Lazy maps can be useful for conserving memory, in cases where individual values take up a lot of space. This is especially true if the underlying list’s values are constructed lazily, as is the case with many corpus readers.

A typical example of a use case for this class is performing feature detection on the tokens in a corpus. Since featuresets are encoded as dictionaries, which can take up a lot of memory, using a LazyMap can significantly reduce memory usage when training and running classifiers.
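A sketch of that use case (the feature detector token_features is hypothetical, not part of NLTK):

```python
from nltk.collections import LazyMap

tokens = ['The', 'quick', 'brown', 'fox']

def token_features(tok):
    # Hypothetical feature detector: builds one featureset dict per token.
    return {'lower': tok.lower(), 'is_title': tok.istitle(), 'len': len(tok)}

# Featuresets are computed on access, not all at once, so the full
# list of dicts never has to be held in memory simultaneously.
featuresets = LazyMap(token_features, tokens)

print(featuresets[0])  # {'lower': 'the', 'is_title': True, 'len': 3}
```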

__init__(function, *lists, **config)[source]
Parameters
  • function – The function that should be applied to elements of lists. It should take as many arguments as there are lists.

  • lists – The underlying lists.

  • cache_size – Determines the size of the cache used by this lazy map. (default=5)

iterate_from(index)[source]

Return an iterator that generates the values of this lazy sequence, starting at position index. If index>=len(self), then this iterator will generate no values.

class nltk.collections.LazyZip[source]

Bases: nltk.collections.LazyMap

A lazy sequence whose elements are tuples, each containing the i-th element from each of the argument sequences. The returned list is truncated in length to the length of the shortest argument sequence. The tuples are constructed lazily – i.e., when you read a value from the list, LazyZip will calculate that value by forming a tuple from the i-th element of each of the argument sequences.

LazyZip is essentially a lazy version of the Python primitive function zip. In particular, an evaluated LazyZip is equivalent to a zip:

>>> from nltk.collections import LazyZip
>>> sequence1, sequence2 = [1, 2, 3], ['a', 'b', 'c']
>>> list(zip(sequence1, sequence2))
[(1, 'a'), (2, 'b'), (3, 'c')]
>>> list(LazyZip(sequence1, sequence2))
[(1, 'a'), (2, 'b'), (3, 'c')]
>>> sequences = [sequence1, sequence2, [6,7,8,9]]
>>> list(zip(*sequences)) == list(LazyZip(*sequences))
True

Lazy zips can be useful for conserving memory in cases where the argument sequences are particularly long.

A typical example of a use case for this class is combining long sequences of gold standard and predicted values in a classification or tagging task in order to calculate accuracy. By constructing tuples lazily and avoiding the creation of an additional long sequence, memory usage can be significantly reduced.
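A sketch of that use case (the tag sequences here are hypothetical sample data):

```python
from nltk.collections import LazyZip

# Hypothetical gold-standard and predicted tag sequences.
gold = ['DT', 'NN', 'VB', 'DT', 'NN']
pred = ['DT', 'NN', 'JJ', 'DT', 'NN']

# Tuples are formed lazily, one pair at a time, so no third
# full-length sequence of pairs is ever materialized.
pairs = LazyZip(gold, pred)
correct = sum(1 for g, p in pairs if g == p)

print(correct / len(pairs))  # 0.8
```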

__init__(*lists)[source]
Parameters

lists (list(list)) – the underlying lists

iterate_from(index)[source]

Return an iterator that generates the values of this lazy sequence, starting at position index. If index>=len(self), then this iterator will generate no values.

class nltk.collections.LazyEnumerate[source]

Bases: nltk.collections.LazyZip

A lazy sequence whose elements are tuples, each containing a count (from zero) and a value yielded by the underlying sequence. LazyEnumerate is useful for obtaining an indexed list. The tuples are constructed lazily – i.e., when you read a value from the list, LazyEnumerate will calculate that value by forming a tuple from the count of the i-th element and the i-th element of the underlying sequence.

LazyEnumerate is essentially a lazy version of the Python primitive function enumerate. In particular, the following two expressions are equivalent:

>>> from nltk.collections import LazyEnumerate
>>> sequence = ['first', 'second', 'third']
>>> list(enumerate(sequence))
[(0, 'first'), (1, 'second'), (2, 'third')]
>>> list(LazyEnumerate(sequence))
[(0, 'first'), (1, 'second'), (2, 'third')]

Lazy enumerations can be useful for conserving memory in cases where the argument sequences are particularly long.

A typical example of a use case for this class is obtaining an indexed list for a long sequence of values. By constructing tuples lazily and avoiding the creation of an additional long sequence, memory usage can be significantly reduced.

__init__(lst)[source]
Parameters

lst (list) – the underlying list

class nltk.collections.LazyIteratorList[source]

Bases: nltk.collections.AbstractLazySequence

Wraps an iterator, loading its elements on demand and making them subscriptable. __repr__ displays only the first few elements.
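For example, wrapping a plain iterator (which by itself supports neither indexing nor len):

```python
from nltk.collections import LazyIteratorList

lazy = LazyIteratorList(iter(range(100)))

print(lazy[5])    # 5: consumes the iterator up to index 5, caching as it goes
print(lazy[2])    # 2: served from the cache; the iterator is not re-consumed
print(len(lazy))  # 100: forces consumption of the remaining elements
```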

__init__(it, known_len=None)[source]
iterate_from(start)[source]

Create a new iterator over this list starting at the given offset.

class nltk.collections.Trie[source]

Bases: dict

A Trie implementation for strings

LEAF = True
__init__(strings=None)[source]

Builds a Trie object, which is built around a dict.

If strings is provided, each string in the list is added to the Trie. Otherwise, an empty Trie is constructed.

Parameters

strings (list(str)) – List of strings to insert into the trie (Default is None)

insert(string)[source]

Inserts string into the Trie

Parameters

string (str) – String to insert into the trie

Example

>>> from nltk.collections import Trie
>>> trie = Trie(["abc", "def"])
>>> expected = {'a': {'b': {'c': {True: None}}},
...             'd': {'e': {'f': {True: None}}}}
>>> trie == expected
True
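Since a Trie is just a nested dict with LEAF markers, lookups can be sketched as a character-by-character walk. The helper trie_contains below is hypothetical, not part of the Trie API:

```python
from nltk.collections import Trie

def trie_contains(trie, s):
    """Hypothetical helper: walk the nested dict one character at a
    time; a word is present iff the walk ends at a LEAF marker."""
    node = trie
    for ch in s:
        if ch not in node:  # membership test first, to avoid mutating the trie
            return False
        node = node[ch]
    return Trie.LEAF in node

trie = Trie(['abc', 'abd'])
print(trie_contains(trie, 'abc'))  # True: full word present
print(trie_contains(trie, 'ab'))   # False: prefix only, no LEAF marker
```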