nltk.text module

This module brings together a variety of NLTK functionality for text analysis, and provides simple, interactive interfaces. Functionality includes: concordancing, collocation discovery, regular expression search over tokenized strings, and distributional similarity.

class nltk.text.ContextIndex[source]

Bases: object

A bidirectional index between words and their ‘contexts’ in a text. The context of a word is usually defined to be the words that occur in a fixed window around the word; but other definitions may also be used by providing a custom context function.
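The two directions of such an index can be illustrated with a minimal pure-Python sketch. It is not NLTK's implementation: the helper names `neighbour_context` and `build_context_index` are hypothetical, and the context is assumed to be the pair of lowercased neighbouring tokens, with `*START*` and `*END*` as padding markers.

```python
from collections import defaultdict


def neighbour_context(tokens, i):
    """Hypothetical context function: the lowercased left and right neighbours."""
    left = tokens[i - 1].lower() if i > 0 else "*START*"
    right = tokens[i + 1].lower() if i < len(tokens) - 1 else "*END*"
    return (left, right)


def build_context_index(tokens, context_func=neighbour_context):
    """Map each word to its contexts and each context to its words."""
    word_to_contexts = defaultdict(list)
    context_to_words = defaultdict(list)
    for i, token in enumerate(tokens):
        context = context_func(tokens, i)
        word_to_contexts[token].append(context)
        context_to_words[context].append(token)
    return word_to_contexts, context_to_words


tokens = "the cat sat on the mat while the dog sat on the rug".split()
word_to_contexts, context_to_words = build_context_index(tokens)

# Contexts that "cat" and "dog" have in common.
shared = set(word_to_contexts["cat"]) & set(word_to_contexts["dog"])
print(shared)
# Words that occur between "the" and "sat".
print(context_to_words[("the", "sat")])
```

Passing a different `context_func` (for example, one that returns only the preceding token) changes what counts as a context without touching the indexing loop.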

__init__(tokens, context_func=None, filter=None, key=<function ContextIndex.<lambda>>)[source]
tokens()[source]
Return type

list(str)

Returns

The document that this context index was created from.

word_similarity_dict(word)[source]

Return a dictionary mapping from words to ‘similarity scores,’ indicating how often these two words occur in the same context.

similar_words(word, n=20)[source]
common_contexts(words, fail_on_unknown=False)[source]

Find contexts where the specified words can all appear, and return a frequency distribution mapping each context to the number of times that context was used.

Parameters
  • words (list(str)) – The words whose shared contexts are sought

  • fail_on_unknown – If true, then raise a ValueError if any of the given words do not occur at all in the index.
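The behaviour described above can be sketched in plain Python, with `collections.Counter` standing in for a frequency distribution. The functions `contexts_by_word` and `common_contexts` below are hypothetical illustrations under the assumption that a context is the pair of neighbouring tokens; they are not NLTK's code.

```python
from collections import Counter, defaultdict


def contexts_by_word(tokens):
    """Collect (left, right) neighbour pairs for every token occurrence."""
    padded = ["*START*"] + [t.lower() for t in tokens] + ["*END*"]
    table = defaultdict(list)
    for i in range(1, len(padded) - 1):
        table[padded[i]].append((padded[i - 1], padded[i + 1]))
    return table


def common_contexts(table, words, fail_on_unknown=False):
    """Count the contexts that every word in `words` occurs in."""
    missing = [w for w in words if w not in table]
    if missing and fail_on_unknown:
        raise ValueError("word(s) not found: " + " ".join(missing))
    context_sets = [set(table.get(w, [])) for w in words]
    common = set.intersection(*context_sets) if context_sets else set()
    # Count every use of a shared context, across all of the given words.
    return Counter(c for w in words for c in table.get(w, []) if c in common)


tokens = (
    "the cat sat on the mat and the dog sat on the rug "
    "and the cat sat still"
).split()
table = contexts_by_word(tokens)
print(common_contexts(table, ["cat", "dog"]))
print(common_contexts(table, ["cat", "unicorn"]))
```

With `fail_on_unknown=True`, the second call would raise a `ValueError` instead of returning an empty counter.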

class nltk.text.ConcordanceIndex[source]

Bases: object

An index that can be used to look up the offset locations at which a given word occurs in a document.

__init__(tokens, key=<function ConcordanceIndex.<lambda>>)[source]

Construct a new concordance index.

Parameters
  • tokens – The document (list of tokens) that this concordance index was created from. This list can be used to access the context of a given word occurrence.

  • key – A function that maps each token to a normalized version that will be used as a key in the index. E.g., if you use key=lambda s:s.lower(), then the index will be case-insensitive.

tokens()[source]
Return type

list(str)

Returns

The document that this concordance index was created from.

offsets(word)[source]
Return type

list(int)

Returns

A list of the offset positions at which the given word occurs. If a key function was specified for the index, then the given word’s key will be looked up.
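The role of the key function can be shown with a small stand-in class. `OffsetIndex` is a hypothetical name used only for this sketch; it mirrors the idea of normalizing both the indexed tokens and the query, and is not NLTK's implementation.

```python
from collections import defaultdict


class OffsetIndex:
    """Hypothetical stand-in for a concordance-style offset index."""

    def __init__(self, tokens, key=lambda token: token):
        self._tokens = tokens
        self._key = key
        self._offsets = defaultdict(list)
        for position, token in enumerate(tokens):
            # Index each token under its normalized key.
            self._offsets[key(token)].append(position)

    def offsets(self, word):
        # Normalize the query with the same key before looking it up.
        return self._offsets.get(self._key(word), [])


tokens = ["The", "whale", "saw", "the", "ship"]
exact = OffsetIndex(tokens)
folded = OffsetIndex(tokens, key=lambda s: s.lower())

print(exact.offsets("the"))     # case-sensitive lookup
print(folded.offsets("The"))    # case-insensitive lookup
```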

find_concordance(word, width=80)[source]

Find all concordance lines given the query word.

Provided with a list of words, these will be found as a phrase.
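As a rough illustration of what a concordance line is, the following sketch centres each hit in a line of `width` characters and accepts either a single word or a list of words treated as a phrase. The function `concordance_lines` is hypothetical, and the exact layout NLTK produces may differ.

```python
def concordance_lines(tokens, query, width=40):
    """Return keyword-in-context lines for a word or a phrase (list of words)."""
    phrase = [query] if isinstance(query, str) else list(query)
    target = [w.lower() for w in phrase]
    n = len(target)
    lines = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] != target:
            continue
        hit = " ".join(tokens[i:i + n])
        # Characters available on each side of the hit.
        half = max((width - len(hit) - 2) // 2, 0)
        left = " ".join(tokens[:i])[-half:] if half else ""
        right = " ".join(tokens[i + n:])[:half]
        lines.append(f"{left:>{half}} {hit} {right:<{half}}")
    return lines


tokens = "Call me Ishmael . Some years ago I saw the whale and the whale saw me".split()
word_lines = concordance_lines(tokens, "whale", width=31)
phrase_lines = concordance_lines(tokens, ["the", "whale"], width=31)
for line in word_lines:
    print(line)
```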

print_concordance(word, width=80, lines=25)[source]

Print concordance lines given the query word.

Parameters
  • word (str or list) – The target word or phrase (a list of strings)

  • lines (int) – The number of lines to display (default=25)

  • width (int) – The width of each line, in characters (default=80)

  • save (bool) – The option to save the concordance.

class nltk.text.TokenSearcher[source]

Bases: object

A class that makes it easier to use regular expressions to search over tokenized strings. The tokenized string is converted to a string where tokens are marked with angle brackets – e.g., '<the><window><is><still><open>'. The regular expression passed to the findall() method is modified to treat angle brackets as non-capturing parentheses, in addition to matching the token boundaries; and to have '.' not match the angle brackets.
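The transformation described above can be sketched with the standard `re` module. The function `token_findall` is a hypothetical illustration of the idea (join tokens with angle brackets, rewrite the pattern, split the hits back into tokens); details of NLTK's own implementation may differ.

```python
import re


def token_findall(tokens, regexp):
    """Search a token list with a regexp whose tokens are written as <...>."""
    # '<the><window><is>' style representation of the token list.
    raw = "".join("<" + token + ">" for token in tokens)

    regexp = re.sub(r"\s", "", regexp)              # ignore whitespace
    regexp = re.sub(r"<", "(?:<(?:", regexp)        # '<' opens a non-capturing group
    regexp = re.sub(r">", ")>)", regexp)            # '>' closes it
    regexp = re.sub(r"(?<!\\)\.", "[^>]", regexp)   # '.' must not cross a token boundary

    hits = re.findall(regexp, raw)
    # Strip the outer brackets and split each hit back into tokens.
    return [hit[1:-1].split("><") for hit in hits]


tokens = "a nervous man met a brave man".split()
print(token_findall(tokens, "<a>(<.*>)<man>"))
print(token_findall(tokens, "<.*><man>"))
```

Ordinary parentheses in the pattern remain capturing groups, which is why the first query returns only the word between "a" and "man".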

__init__(tokens)[source]
findall(regexp)[source]

Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.

>>> from nltk.text import TokenSearcher
>>> print('hack'); from nltk.book import text1, text5, text9
hack...
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing
that; that that thing; through these than through; them that the;
through the thick; them that they; thought that the
Parameters

regexp (str) – A regular expression

class nltk.text.Text[source]

Bases: object

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.

A Text is typically initialized from a given document or corpus. E.g.:

>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
__init__(tokens, name=None)[source]

Create a Text object.

Parameters

tokens (sequence of str) – The source text.

concordance(word, width=79, lines=25)[source]

Prints a concordance for word with the specified context window. Word matching is not case-sensitive.

Parameters
  • word (str or list) – The target word or phrase (a list of strings)

  • width (int) – The width of each line, in characters (default=79)

  • lines (int) – The number of lines to display (default=25)

Seealso

ConcordanceIndex

concordance_list(word, width=79, lines=25)[source]

Generate a concordance for word with the specified context window. Word matching is not case-sensitive.

Parameters
  • word (str or list) – The target word or phrase (a list of strings)

  • width (int) – The width of each line, in characters (default=79)

  • lines (int) – The number of lines to display (default=25)

Seealso

ConcordanceIndex

collocation_list(num=20, window_size=2)[source]

Return collocations derived from the text, ignoring stopwords.

>>> from nltk.book import text4
>>> text4.collocation_list()[:2]
[('United', 'States'), ('fellow', 'citizens')]
Parameters
  • num (int) – The maximum number of collocations to return.

  • window_size (int) – The number of tokens spanned by a collocation (default=2)

Return type

list(tuple(str, str))
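To make the idea of "collocations, ignoring stopwords" concrete, here is a deliberately simplified sketch that ranks adjacent word pairs by raw frequency. This is a swapped-in technique for illustration: NLTK's own ranking may use a statistical association measure rather than raw counts. The function `frequent_bigrams` and the tiny `STOPWORDS` set are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not NLTK's stopword corpus).
STOPWORDS = {"the", "of", "and", "a", "to", "in"}


def frequent_bigrams(tokens, num=20, stopwords=STOPWORDS):
    """Return up to `num` repeated adjacent word pairs containing no stopword."""
    pairs = Counter(
        (w1, w2)
        for w1, w2 in zip(tokens, tokens[1:])
        if w1.lower() not in stopwords and w2.lower() not in stopwords
    )
    # Keep only pairs seen more than once, most frequent first.
    return [pair for pair, count in pairs.most_common(num) if count > 1]


tokens = (
    "the United States and the fellow citizens of the United States "
    "thank fellow citizens"
).split()
collocations = frequent_bigrams(tokens)
print(collocations)
```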

collocations(num=20, window_size=2)[source]

Print collocations derived from the text, ignoring stopwords.

>>> from nltk.book import text4
>>> text4.collocations() 
United States; fellow citizens; four years; ...
Parameters
  • num (int) – The maximum number of collocations to print.

  • window_size (int) – The number of tokens spanned by a collocation (default=2)

count(word)[source]

Count the number of times this word appears in the text.

index(word)[source]

Find the index of the first occurrence of the word in the text.

readability(method)[source]
similar(word, num=20)[source]

Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.

Parameters
  • word (str) – The word used to seed the similarity search

  • num (int) – The number of words to generate (default=20)

Seealso

ContextIndex.similar_words()

common_contexts(words, num=20)[source]

Find contexts where the specified words appear; list most frequent common contexts first.

Parameters
  • words (list(str)) – The words whose common contexts are sought

  • num (int) – The number of contexts to display (default=20)

Seealso

ContextIndex.common_contexts()

dispersion_plot(words)[source]

Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.

Parameters

words (list(str)) – The words to be plotted

Seealso

nltk.draw.dispersion_plot()

generate(length=100, text_seed=None, random_seed=42)[source]

Print random text, generated using a trigram language model. See also help(nltk.lm).

Parameters
  • length (int) – The length of text to generate (default=100)

  • text_seed (list(str)) – Generation can be conditioned on preceding context.

  • random_seed (int) – A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible. (default=42)
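The interplay of `text_seed` and `random_seed` can be illustrated with a minimal trigram generator written in plain Python. It is a stand-in for the idea only, not the `nltk.lm` model: `train_trigrams` and `generate` are hypothetical helpers, and the sketch assumes `text_seed` holds at least two tokens.

```python
import random
from collections import defaultdict


def train_trigrams(tokens):
    """Map each pair of consecutive words to the words observed after it."""
    model = defaultdict(list)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)].append(w3)
    return model


def generate(model, text_seed, length=10, random_seed=42):
    """Extend `text_seed` up to `length` tokens by sampling observed trigrams."""
    # Accept either a seed value or a ready-made random.Random instance.
    rng = random_seed if isinstance(random_seed, random.Random) else random.Random(random_seed)
    output = list(text_seed)
    while len(output) < length:
        followers = model.get(tuple(output[-2:]))
        if not followers:
            break  # unseen context: stop early
        # Repeated followers make frequent continuations proportionally likelier.
        output.append(rng.choice(followers))
    return output


tokens = "the cat sat on the mat and the cat ran on the mat".split()
model = train_trigrams(tokens)
first = generate(model, ["the", "cat"], length=8)
second = generate(model, ["the", "cat"], length=8)
print(" ".join(first))
```

Because both calls use the same seed, they produce the same sequence; passing a different `random_seed` may yield a different continuation.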

plot(*args)[source]

See documentation for FreqDist.plot()

Seealso

nltk.prob.FreqDist.plot()

vocab()[source]
Seealso

nltk.prob.FreqDist

findall(regexp)[source]

Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.

>>> print('hack'); from nltk.book import text1, text5, text9
hack...
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing
that; that that thing; through these than through; them that the;
through the thick; them that they; thought that the
Parameters

regexp (str) – A regular expression

class nltk.text.TextCollection[source]

Bases: nltk.text.Text

A collection of texts, which can be loaded with list of texts, or with a corpus consisting of one or more texts, and which supports counting, concordancing, collocation discovery, etc. Initialize a TextCollection as follows:

>>> import nltk.corpus
>>> from nltk.text import TextCollection
>>> print('hack'); from nltk.book import text1, text2, text3
hack...
>>> gutenberg = TextCollection(nltk.corpus.gutenberg)
>>> mytexts = TextCollection([text1, text2, text3])

Iterating over a TextCollection produces all the tokens of all the texts in order.

__init__(source)[source]

Create a TextCollection object.

Parameters

source – A list of texts, or a corpus consisting of one or more texts.

tf(term, text)[source]

The frequency of the term in text.

idf(term)[source]

The natural logarithm of the number of texts in the corpus divided by the number of texts that the term appears in. If a term does not appear in the corpus, 0.0 is returned.

tf_idf(term, text)[source]
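The three measures fit together as sketched below. These standalone functions are hypothetical illustrations, not the class methods themselves: they take the collection as an explicit `texts` argument, assume term frequency is normalized by text length, and assume idf is the natural logarithm of the document ratio, with tf-idf as the product of the two.

```python
from math import log


def tf(term, text):
    """Relative frequency of `term` within one text (a list of tokens)."""
    return text.count(term) / len(text)


def idf(term, texts):
    """Log of (number of texts / number of texts containing `term`), or 0.0."""
    matches = sum(1 for text in texts if term in text)
    return log(len(texts) / matches) if matches else 0.0


def tf_idf(term, text, texts):
    """Weight a term by its frequency in `text` and its rarity across `texts`."""
    return tf(term, text) * idf(term, texts)


texts = [
    ["whale", "sea", "whale"],
    ["sea", "ship"],
    ["ship", "crew"],
]
print(tf("whale", texts[0]))
print(idf("whale", texts))
print(tf_idf("whale", texts[0], texts))
```

A term found in every text gets an idf of `log(1)`, which is zero, so it contributes nothing to tf-idf regardless of how often it occurs.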