nltk.util module

class nltk.util.Index[source]

Bases: defaultdict

__init__(pairs)[source]
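
A minimal usage sketch (assuming Index groups the (key, value) pairs into lists keyed by key, as its defaultdict base suggests):

>>> from nltk.util import Index
>>> idx = Index([('a', 1), ('b', 2), ('a', 3)])
>>> idx['a']
[1, 3]
>>> idx['b']
[2]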
nltk.util.acyclic_branches_depth_first(tree, children=<built-in function iter>, depth=-1, cut_mark=None, traversed=None, verbose=False)[source]
Parameters:
  • tree – the tree root

  • children – a function taking as argument a tree node

  • depth – the maximum depth of the search

  • cut_mark – the mark to add when cycles are truncated

  • traversed – the set of traversed nodes

  • verbose – to print warnings when cycles are discarded

Returns:

the tree in depth-first order

Adapted from acyclic_depth_first() above: traverse the nodes of a tree in depth-first order, discarding eventual cycles within the same branch but keeping duplicate paths in different branches. Add cut_mark (when defined) if cycles were truncated.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.

Catches only cycles within the same branch, but keeps duplicate paths from different branches:

>>> import nltk
>>> from nltk.util import acyclic_branches_depth_first as tree
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(tree(wn.synset('certified.a.01'), lambda s:sorted(s.also_sees()), cut_mark='...', depth=4))
[Synset('certified.a.01'),
 [Synset('authorized.a.01'),
  [Synset('lawful.a.01'),
   [Synset('legal.a.01'),
    "Cycle(Synset('lawful.a.01'),0,...)",
    [Synset('legitimate.a.01'), '...']],
   [Synset('straight.a.06'),
    [Synset('honest.a.01'), '...'],
    "Cycle(Synset('lawful.a.01'),0,...)"]],
  [Synset('legitimate.a.01'),
   "Cycle(Synset('authorized.a.01'),1,...)",
   [Synset('legal.a.01'),
    [Synset('lawful.a.01'), '...'],
    "Cycle(Synset('legitimate.a.01'),0,...)"],
   [Synset('valid.a.01'),
    "Cycle(Synset('legitimate.a.01'),0,...)",
    [Synset('reasonable.a.01'), '...']]],
  [Synset('official.a.01'), "Cycle(Synset('authorized.a.01'),1,...)"]],
 [Synset('documented.a.01')]]
nltk.util.acyclic_breadth_first(tree, children=<built-in function iter>, maxdepth=-1, verbose=False)[source]
Parameters:
  • tree – the tree root

  • children – a function taking as argument a tree node

  • maxdepth – to limit the search depth

  • verbose – to print warnings when cycles are discarded

Returns:

the tree in breadth-first order

Adapted from breadth_first() above, to discard cycles.

Traverse the nodes of a tree in breadth-first order, discarding eventual cycles.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.
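
A brief illustrative sketch, on a small hypothetical graph whose children point back to the root:

>>> from nltk.util import acyclic_breadth_first
>>> graph = {'A': ['B', 'C'], 'B': ['A'], 'C': ['A']}
>>> list(acyclic_breadth_first('A', lambda node: graph[node]))
['A', 'B', 'C']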

nltk.util.acyclic_depth_first(tree, children=<built-in function iter>, depth=-1, cut_mark=None, traversed=None, verbose=False)[source]
Parameters:
  • tree – the tree root

  • children – a function taking as argument a tree node

  • depth – the maximum depth of the search

  • cut_mark – the mark to add when cycles are truncated

  • traversed – the set of traversed nodes

  • verbose – to print warnings when cycles are discarded

Returns:

the tree in depth-first order

Traverse the nodes of a tree in depth-first order, discarding eventual cycles within any branch, adding cut_mark (when specified) if cycles were truncated. The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.

Catches all cycles:

>>> import nltk
>>> from nltk.util import acyclic_depth_first as acyclic_tree
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(acyclic_tree(wn.synset('dog.n.01'), lambda s:sorted(s.hypernyms()),cut_mark='...'))
[Synset('dog.n.01'),
 [Synset('canine.n.02'),
  [Synset('carnivore.n.01'),
   [Synset('placental.n.01'),
    [Synset('mammal.n.01'),
     [Synset('vertebrate.n.01'),
      [Synset('chordate.n.01'),
       [Synset('animal.n.01'),
        [Synset('organism.n.01'),
         [Synset('living_thing.n.01'),
          [Synset('whole.n.02'),
           [Synset('object.n.01'),
            [Synset('physical_entity.n.01'),
             [Synset('entity.n.01')]]]]]]]]]]]]],
 [Synset('domestic_animal.n.01'), "Cycle(Synset('animal.n.01'),-3,...)"]]
nltk.util.acyclic_dic2tree(node, dic)[source]
Parameters:
  • node – the root node

  • dic – the dictionary of children

Convert the acyclic dictionary ‘dic’, whose keys are nodes and whose values are lists of children, to an output tree suitable for pprint(), starting at root ‘node’, with subtrees as nested lists.
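
For example, with a hypothetical dictionary of children:

>>> from nltk.util import acyclic_dic2tree
>>> acyclic_dic2tree('a', {'a': ['b', 'c'], 'b': [], 'c': ['d'], 'd': []})
['a', ['b'], ['c', ['d']]]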

nltk.util.bigrams(sequence, **kwargs)[source]

Return the bigrams generated from a sequence of items, as an iterator. For example:

>>> from nltk.util import bigrams
>>> list(bigrams([1,2,3,4,5]))
[(1, 2), (2, 3), (3, 4), (4, 5)]

Wrap with list for a list version of this function.

Parameters:

sequence (sequence or iter) – the source data to be converted into bigrams

Return type:

iter(tuple)

nltk.util.binary_search_file(file, key, cache=None, cacheDepth=-1)[source]

Return the line from the file whose first word is key. Searches through a sorted file using the binary search algorithm.

Parameters:
  • file (file) – the file to be searched through.

  • key (str) – the identifier we are searching for.
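
A minimal usage sketch (the file name and contents are hypothetical; the file must be sorted on its first whitespace-delimited word):

>>> from nltk.util import binary_search_file
>>> with open('sorted_index.txt', 'w') as f:
...     _ = f.write('apple 1\nbanana 2\ncherry 3\n')
>>> line = binary_search_file(open('sorted_index.txt'), 'banana')
>>> line.split()
['banana', '2']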

nltk.util.breadth_first(tree, children=<built-in function iter>, maxdepth=-1)[source]

Traverse the nodes of a tree in breadth-first order. (No check for cycles.) The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.
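
For example, with a hypothetical tree given as a dictionary of children:

>>> from nltk.util import breadth_first
>>> tree = {'A': ['B', 'C'], 'B': ['D'], 'C': [], 'D': []}
>>> list(breadth_first('A', lambda node: tree[node]))
['A', 'B', 'C', 'D']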

nltk.util.choose(n, k)[source]

This function is a fast way to calculate binomial coefficients, commonly known as nCk, i.e. the number of combinations of n things taken k at a time. (https://en.wikipedia.org/wiki/Binomial_coefficient).

This is equivalent to scipy.special.comb() with long integer computation, but this implementation is faster; see https://github.com/nltk/nltk/issues/1181

>>> choose(4, 2)
6
>>> choose(6, 2)
15
Parameters:
  • n (int) – The number of things.

  • k (int) – The number of things taken at a time.

nltk.util.clean_html(html)[source]
nltk.util.clean_url(url)[source]
nltk.util.cut_string(s, width=70)[source]

Cut off and return a given width of a string.

Return the same as s[:width] if width >= 0, or s[-width:] if width < 0, as long as s has no Unicode combining characters. If s does contain combining characters, the cut is adjusted so that the returned string’s visible width matches the requested width.

Parameters:
  • s (str) – the string to cut

  • width (int) – the display_width
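
For example, assuming the plain s[:width] behaviour described above for ASCII input:

>>> from nltk.util import cut_string
>>> cut_string('The quick brown fox', 9)
'The quick'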

nltk.util.edge_closure(tree, children=<built-in function iter>, maxdepth=-1, verbose=False)[source]
Parameters:
  • tree – the tree root

  • children – a function taking as argument a tree node

  • maxdepth – to limit the search depth

  • verbose – to print warnings when cycles are discarded

Yield the edges of a graph in breadth-first order, discarding eventual cycles. The first argument should be the start node; children should be a function taking as argument a graph node and returning an iterator of the node’s children.

>>> from nltk.util import edge_closure
>>> print(list(edge_closure('A', lambda node:{'A':['B','C'], 'B':'C', 'C':'B'}[node])))
[('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'B')]
nltk.util.edges2dot(edges, shapes=None, attr=None)[source]
Parameters:
  • edges – the set (or list) of edges of a directed graph.

  • shapes – dictionary of strings that trigger a specified shape.

  • attr – dictionary with global graph attributes

Returns:

a representation of ‘edges’ as a string in the DOT graph language, which can be converted to an image by the ‘dot’ program from the Graphviz package, or by nltk.parse.dependencygraph.dot2img(dot_string).

>>> import nltk
>>> from nltk.util import edges2dot
>>> print(edges2dot([('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'B')]))
digraph G {
"A" -> "B";
"A" -> "C";
"B" -> "C";
"C" -> "B";
}
nltk.util.elementtree_indent(elem, level=0)[source]

Recursive function to indent an ElementTree._ElementInterface used for pretty printing. Run indent on elem and then output in the normal way.

Parameters:
  • elem (ElementTree._ElementInterface) – the element to be indented; it will be modified in place.

  • level (nonnegative integer) – level of indentation for this element

Return type:

ElementTree._ElementInterface

Returns:

Contents of elem indented to reflect its structure
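
A minimal sketch of the intended usage (the XML snippet is hypothetical; the element is modified in place):

>>> import xml.etree.ElementTree as ET
>>> from nltk.util import elementtree_indent
>>> elem = ET.fromstring('<doc><a>1</a><b>2</b></doc>')
>>> _ = elementtree_indent(elem)
>>> print(ET.tostring(elem, encoding='unicode').strip())
<doc>
  <a>1</a>
  <b>2</b>
</doc>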

nltk.util.everygrams(sequence, min_len=1, max_len=-1, pad_left=False, pad_right=False, **kwargs)[source]

Returns all possible ngrams generated from a sequence of items, as an iterator.

>>> sent = 'a b c'.split()
Output of the current version:
>>> list(everygrams(sent))
[('a',), ('a', 'b'), ('a', 'b', 'c'), ('b',), ('b', 'c'), ('c',)]
The output order of older versions can be recovered by sorting on n-gram length:
>>> sorted(everygrams(sent), key=len)
[('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
>>> list(everygrams(sent, max_len=2))
[('a',), ('a', 'b'), ('b',), ('b', 'c'), ('c',)]
Parameters:
  • sequence (sequence or iter) – the source data to be converted into ngrams. If max_len is not provided, this sequence will be loaded into memory

  • min_len (int) – minimum length of the ngrams, aka. n-gram order/degree of ngram

  • max_len (int) – maximum length of the ngrams (set to length of sequence by default)

  • pad_left (bool) – whether the ngrams should be left-padded

  • pad_right (bool) – whether the ngrams should be right-padded

Return type:

iter(tuple)

nltk.util.filestring(f)[source]
nltk.util.flatten(*args)[source]

Flatten a list.

>>> from nltk.util import flatten
>>> flatten(1, 2, ['b', 'a' , ['c', 'd']], 3)
[1, 2, 'b', 'a', 'c', 'd', 3]
Parameters:

args – items and lists to be combined into a single list

Return type:

list

nltk.util.guess_encoding(data)[source]

Given a byte string, attempt to decode it. Tries the standard ‘UTF8’ and ‘latin-1’ encodings, plus several gathered from locale information.

The calling program must first call:

locale.setlocale(locale.LC_ALL, '')

If successful it returns (decoded_unicode, successful_encoding). If unsuccessful it raises a UnicodeError.
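
A minimal sketch (the sample data is hypothetical; the reported encoding name may vary by platform):

>>> import locale
>>> _ = locale.setlocale(locale.LC_ALL, '')
>>> from nltk.util import guess_encoding
>>> text, encoding = guess_encoding(b'hello world')
>>> text
'hello world'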

nltk.util.in_idle()[source]

Return True if this function is run within IDLE. Tkinter programs that are run in IDLE should never call Tk.mainloop, so this function should be used to gate all calls to Tk.mainloop.

Warning:

This function works by checking sys.stdin. If the user has modified sys.stdin, then it may return incorrect results.

Return type:

bool

nltk.util.invert_dict(d)[source]
nltk.util.invert_graph(graph)[source]

Inverts a directed graph.

Parameters:

graph (dict(set)) – the graph, represented as a dictionary of sets

Returns:

the inverted graph

Return type:

dict(set)
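
For example (the items are sorted here only to make the output deterministic):

>>> from nltk.util import invert_graph
>>> sorted(invert_graph({'A': {'B'}, 'B': {'C'}, 'C': {'A'}}).items())
[('A', {'C'}), ('B', {'A'}), ('C', {'B'})]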

nltk.util.ngrams(sequence, n, **kwargs)[source]

Return the ngrams generated from a sequence of items, as an iterator. For example:

>>> from nltk.util import ngrams
>>> list(ngrams([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Wrap with list for a list version of this function. Set pad_left or pad_right to true in order to get additional ngrams:

>>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
Parameters:
  • sequence (sequence or iter) – the source data to be converted into ngrams

  • n (int) – the degree of the ngrams

  • pad_left (bool) – whether the ngrams should be left-padded

  • pad_right (bool) – whether the ngrams should be right-padded

  • left_pad_symbol (any) – the symbol to use for left padding (default is None)

  • right_pad_symbol (any) – the symbol to use for right padding (default is None)

Return type:

sequence or iter

nltk.util.pad_sequence(sequence, n, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)[source]

Returns a padded sequence of items before ngram extraction.

>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
['<s>', 1, 2, 3, 4, 5, '</s>']
>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
['<s>', 1, 2, 3, 4, 5]
>>> list(pad_sequence([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[1, 2, 3, 4, 5, '</s>']
Parameters:
  • sequence (sequence or iter) – the source data to be padded

  • n (int) – the degree of the ngrams

  • pad_left (bool) – whether the ngrams should be left-padded

  • pad_right (bool) – whether the ngrams should be right-padded

  • left_pad_symbol (any) – the symbol to use for left padding (default is None)

  • right_pad_symbol (any) – the symbol to use for right padding (default is None)

Return type:

sequence or iter

nltk.util.pairwise(iterable)[source]

s -> (s0,s1), (s1,s2), (s2, s3), …
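
For example, following the s -> (s0, s1), (s1, s2), … pattern above:

>>> from nltk.util import pairwise
>>> list(pairwise([1, 2, 3, 4]))
[(1, 2), (2, 3), (3, 4)]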

nltk.util.parallelize_preprocess(func, iterator, processes, progress_bar=False)[source]
nltk.util.pr(data, start=0, end=None)[source]

Pretty print a sequence of data items

Parameters:
  • data (sequence or iter) – the data stream to print

  • start (int) – the start position

  • end (int) – the end position

nltk.util.print_string(s, width=70)[source]

Pretty print a string, breaking lines on whitespace

Parameters:
  • s (str) – the string to print, consisting of words and spaces

  • width (int) – the display width

nltk.util.re_show(regexp, string, left='{', right='}')[source]

Return a string with markers surrounding the matched substrings. Search string for substrings matching regexp and wrap the matches with the left and right delimiters (braces by default). This is convenient for learning about regular expressions.

Parameters:
  • regexp (str) – The regular expression.

  • string (str) – The string being matched.

  • left (str) – The left delimiter (printed before the matched substring)

  • right (str) – The right delimiter (printed after the matched substring)

Return type:

str
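
A brief example (the marked-up string is printed):

>>> from nltk.util import re_show
>>> re_show('ing|ed', 'matching tested strings')
match{ing} test{ed} str{ing}s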

nltk.util.set_proxy(proxy, user=None, password='')[source]

Set the HTTP proxy for Python to download through.

If proxy is None then tries to set proxy from environment or system settings.

Parameters:
  • proxy – The HTTP proxy server to use. For example: ‘http://proxy.example.com:3128/

  • user – The username to authenticate with. Use None to disable authentication.

  • password – The password to authenticate with.

nltk.util.skipgrams(sequence, n, k, **kwargs)[source]

Returns all possible skipgrams generated from a sequence of items, as an iterator. Skipgrams are ngrams that allow tokens to be skipped. Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
Parameters:
  • sequence (sequence or iter) – the source data to be converted into skipgrams

  • n (int) – the degree of the ngrams

  • k (int) – the skip distance

Return type:

iter(tuple)

nltk.util.tokenwrap(tokens, separator=' ', width=70)[source]

Pretty print a list of text tokens, breaking lines on whitespace

Parameters:
  • tokens (list) – the tokens to print

  • separator (str) – the string to use to separate tokens

  • width (int) – the display width (default=70)
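
For example, assuming standard textwrap-style line breaking:

>>> from nltk.util import tokenwrap
>>> print(tokenwrap('the quick brown fox jumps over the lazy dog'.split(), width=20))
the quick brown fox
jumps over the lazy
dog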

nltk.util.transitive_closure(graph, reflexive=False)[source]

Calculate the transitive closure of a directed graph, optionally the reflexive transitive closure.

The algorithm is a slight modification of the “Marking Algorithm” of Ioannidis & Ramakrishnan (1998) “Efficient Transitive Closure Algorithms”.

Parameters:
  • graph (dict(set)) – the initial graph, represented as a dictionary of sets

  • reflexive (bool) – if set, also make the closure reflexive

Return type:

dict(set)
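
For example, on a small hypothetical graph:

>>> from nltk.util import transitive_closure
>>> transitive_closure({1: {2}, 2: {3}, 3: set()})
{1: {2, 3}, 2: {3}, 3: set()}
>>> transitive_closure({1: {2}, 2: {3}, 3: set()}, reflexive=True)
{1: {1, 2, 3}, 2: {2, 3}, 3: {3}}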

nltk.util.trigrams(sequence, **kwargs)[source]

Return the trigrams generated from a sequence of items, as an iterator. For example:

>>> from nltk.util import trigrams
>>> list(trigrams([1,2,3,4,5]))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Wrap with list for a list version of this function.

Parameters:

sequence (sequence or iter) – the source data to be converted into trigrams

Return type:

iter(tuple)

nltk.util.unique_list(xs)[source]
nltk.util.unweighted_minimum_spanning_dict(tree, children=<built-in function iter>)[source]
Parameters:
  • tree – the tree root

  • children – a function taking as argument a tree node

Output a dictionary representing a Minimum Spanning Tree (MST) of an unweighted graph, by traversing the nodes of a tree in breadth-first order, discarding eventual cycles.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.

>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> from nltk.util import unweighted_minimum_spanning_dict as umsd
>>> from pprint import pprint
>>> pprint(umsd(wn.synset('bound.a.01'), lambda s:sorted(s.also_sees())))
{Synset('bound.a.01'): [Synset('unfree.a.02')],
 Synset('classified.a.02'): [],
 Synset('confined.a.02'): [],
 Synset('dependent.a.01'): [],
 Synset('restricted.a.01'): [Synset('classified.a.02')],
 Synset('unfree.a.02'): [Synset('confined.a.02'),
                         Synset('dependent.a.01'),
                         Synset('restricted.a.01')]}
nltk.util.unweighted_minimum_spanning_digraph(tree, children=<built-in function iter>, shapes=None, attr=None)[source]
Parameters:
  • tree – the tree root

  • children – a function taking as argument a tree node

  • shapes – dictionary of strings that trigger a specified shape.

  • attr – dictionary with global graph attributes

Build a Minimum Spanning Tree (MST) of an unweighted graph, by traversing the nodes of a tree in breadth-first order, discarding eventual cycles.

Return a representation of this MST as a string in the DOT graph language, which can be converted to an image by the ‘dot’ program from the Graphviz package, or nltk.parse.dependencygraph.dot2img(dot_string).

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.

>>> import nltk
>>> wn=nltk.corpus.wordnet
>>> from nltk.util import unweighted_minimum_spanning_digraph as umsd
>>> print(umsd(wn.synset('bound.a.01'), lambda s:sorted(s.also_sees())))
digraph G {
"Synset('bound.a.01')" -> "Synset('unfree.a.02')";
"Synset('unfree.a.02')" -> "Synset('confined.a.02')";
"Synset('unfree.a.02')" -> "Synset('dependent.a.01')";
"Synset('unfree.a.02')" -> "Synset('restricted.a.01')";
"Synset('restricted.a.01')" -> "Synset('classified.a.02')";
}
nltk.util.unweighted_minimum_spanning_tree(tree, children=<built-in function iter>)[source]
Parameters:
  • tree – the tree root

  • children – a function taking as argument a tree node

Output a Minimum Spanning Tree (MST) of an unweighted graph, by traversing the nodes of a tree in breadth-first order, discarding eventual cycles.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.

>>> import nltk
>>> from nltk.util import unweighted_minimum_spanning_tree as mst
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(mst(wn.synset('bound.a.01'), lambda s:sorted(s.also_sees())))
[Synset('bound.a.01'),
 [Synset('unfree.a.02'),
  [Synset('confined.a.02')],
  [Synset('dependent.a.01')],
  [Synset('restricted.a.01'), [Synset('classified.a.02')]]]]