nltk.text.TextCollection

class nltk.text.TextCollection[source]

Bases: Text

A collection of texts, which can be loaded with a list of texts or with a corpus consisting of one or more texts, and which supports counting, concordancing, collocation discovery, etc. Initialize a TextCollection as follows:

>>> import nltk.corpus
>>> from nltk.text import TextCollection
>>> from nltk.book import text1, text2, text3
>>> gutenberg = TextCollection(nltk.corpus.gutenberg)
>>> mytexts = TextCollection([text1, text2, text3])

Iterating over a TextCollection produces all the tokens of all the texts in order.
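Because a TextCollection is itself a Text built from the concatenated tokens, the inherited Text methods operate over the whole collection at once. A minimal sketch, continuing the example above (results depend on the locally installed corpora, so output is not shown):

>>> n_tokens = len(mytexts)               # total number of tokens across text1, text2 and text3
>>> whale_hits = mytexts.count('whale')   # counting spans every text in the collection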

__init__(source)[source]

Create a TextCollection object.

Parameters

source (list of texts, or a corpus consisting of one or more texts) – The source from which the collection is built.

collocation_list(num=20, window_size=2)

Return collocations derived from the text, ignoring stopwords.

>>> from nltk.book import text4
>>> text4.collocation_list()[:2]
[('United', 'States'), ('fellow', 'citizens')]
Parameters
  • num (int) – The maximum number of collocations to return.

  • window_size (int) – The number of tokens spanned by a collocation (default=2)

Return type

list(tuple(str, str))
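The window_size parameter widens the span of tokens within which a pair may be counted as a collocation; a small sketch (the pairs returned depend on the text, so they are not shown here):

>>> from nltk.book import text4
>>> loose_pairs = text4.collocation_list(num=5, window_size=3)  # allow one intervening token between the pair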

collocations(num=20, window_size=2)

Print collocations derived from the text, ignoring stopwords.

>>> from nltk.book import text4
>>> text4.collocations() 
United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations
Parameters
  • num (int) – The maximum number of collocations to print.

  • window_size (int) – The number of tokens spanned by a collocation (default=2)

common_contexts(words, num=20)

Find contexts where the specified words appear; list most frequent common contexts first.

Parameters
  • words (list(str)) – The words used to seed the similarity search

  • num (int) – The number of common contexts to display (default=20)

Seealso

ContextIndex.common_contexts()
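A usage sketch in the style of the other examples (common_contexts() prints its results, and the contexts shown will depend on the text):

>>> from nltk.book import text2
>>> text2.common_contexts(['monstrous', 'very'])  # prints the contexts shared by both words, most frequent first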

concordance(word, width=79, lines=25)

Prints a concordance for word with the specified context window. Word matching is not case-sensitive.

Parameters
  • word (str or list) – The target word or phrase (a list of strings)

  • width (int) – The width of each line, in characters (default=79)

  • lines (int) – The number of lines to display (default=25)

Seealso

ConcordanceIndex

concordance_list(word, width=79, lines=25)

Generate a concordance for word with the specified context window. Word matching is not case-sensitive.

Parameters
  • word (str or list) – The target word or phrase (a list of strings)

  • width (int) – The width of each line, in characters (default=79)

  • lines (int) – The number of lines to display (default=25)

Seealso

ConcordanceIndex
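Assuming the current behaviour in which concordance_list() returns a list of ConcordanceLine results rather than printing them, a sketch of programmatic use (attribute names are those of ConcordanceLine):

>>> from nltk.book import text1
>>> hits = text1.concordance_list('whale', width=60, lines=5)  # collect matches instead of printing them
>>> first = hits[0].line                                       # formatted context line for the first match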

count(word)

Count the number of times this word appears in the text.

dispersion_plot(words)

Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.

Parameters

words (list(str)) – The words to be plotted

Seealso

nltk.draw.dispersion_plot()
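A usage sketch following the NLTK book example (this opens a matplotlib plot window rather than printing output):

>>> from nltk.book import text4
>>> text4.dispersion_plot(['citizens', 'democracy', 'freedom', 'duties', 'America'])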

findall(regexp)

Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.

>>> from nltk.book import text1, text5, text9
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing
that; that that thing; through these than through; them that the;
through the thick; them that they; thought that the
Parameters

regexp (str) – A regular expression

generate(length=100, text_seed=None, random_seed=42)

Print random text, generated using a trigram language model. See also help(nltk.lm).

Parameters
  • length (int) – The length of text to generate (default=100)

  • text_seed (list(str)) – Generation can be conditioned on preceding context.

  • random_seed (int) – A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible. (default=42)
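A sketch of seeded generation (generate() prints the text it produces and also returns it as a string; the words produced depend on the trigram model, so no output is reproduced here):

>>> from nltk.book import text4
>>> generated = text4.generate(length=20, text_seed=['The', 'people'], random_seed=3)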

index(word)

Find the index of the first occurrence of the word in the text.

plot(*args)

See documentation for FreqDist.plot()

Seealso

nltk.probability.FreqDist.plot()

readability(method)

similar(word, num=20)

Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.

Parameters
  • word (str) – The word used to seed the similarity search

  • num (int) – The number of similar words to display (default=20)

Seealso

ContextIndex.similar_words()
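A usage sketch from the NLTK book (similar() prints its results):

>>> from nltk.book import text1
>>> text1.similar('monstrous')  # prints words that occur in contexts similar to 'monstrous'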

vocab()

Seealso

nltk.probability.FreqDist

tf(term, text)[source]

The frequency of the term in text: the number of occurrences of term divided by the length of text.

idf(term)[source]

The natural logarithm of the number of texts in the corpus divided by the number of texts that the term appears in. If a term does not appear in the corpus, 0.0 is returned.

tf_idf(term, text)[source]

The product of tf(term, text) and idf(term).
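The three scoring methods combine over a collection; a minimal sketch using a collection like the one built in the class-level example (the scores are floats that depend on the loaded texts, so they are not shown):

>>> from nltk.book import text1, text2, text3
>>> from nltk.text import TextCollection
>>> mytexts = TextCollection([text1, text2, text3])
>>> freq = mytexts.tf('whale', text1)       # occurrences of 'whale' in text1 divided by the length of text1
>>> rarity = mytexts.idf('whale')           # log(number of texts / number of texts containing 'whale')
>>> score = mytexts.tf_idf('whale', text1)  # tf * idf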