nltk.text.Text

class nltk.text.Text[source]

Bases: object

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.

A Text is typically initialized from a given document or corpus. E.g.:

>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

__init__(tokens, name=None)[source]

Create a Text object.

Parameters

tokens (sequence of str) – The source text.

concordance(word, width=79, lines=25)[source]

Prints a concordance for word with the specified context window. Word matching is not case-sensitive.

Parameters
  • word (str or list) – The target word or phrase (a list of strings)

  • width (int) – The width of each line, in characters (default=79)

  • lines (int) – The number of lines to display (default=25)

Seealso

ConcordanceIndex

concordance_list(word, width=79, lines=25)[source]

Generate a concordance for word with the specified context window. Word matching is not case-sensitive.

Parameters
  • word (str or list) – The target word or phrase (a list of strings)

  • width (int) – The width of each line, in characters (default=79)

  • lines (int) – The number of lines to display (default=25)

Seealso

ConcordanceIndex
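
The following is a minimal sketch of how concordance_list() can be used when the hits are needed as data rather than printed output. The toy token list is an assumption made for illustration, and the field names offset, query and line are those of the ConcordanceLine named tuple as found in recent NLTK releases; older releases may differ.

```python
from nltk.text import Text

# Toy token list (illustrative only); any sequence of str tokens works.
tokens = "The whale swam away . Then the Whale dived .".split()
text = Text(tokens)

# Matching is case-insensitive, so both "whale" and "Whale" are found.
hits = text.concordance_list("whale", width=40, lines=5)
for hit in hits:
    print(hit.offset, hit.query, "|", hit.line)

offsets = [hit.offset for hit in hits]
queries = [hit.query for hit in hits]
```

Each hit keeps the token as it appears in the text, so the original casing can be recovered from query even though the lookup ignores case.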

collocation_list(num=20, window_size=2)[source]

Return collocations derived from the text, ignoring stopwords.

>>> from nltk.book import text4
>>> text4.collocation_list()[:2]
[('United', 'States'), ('fellow', 'citizens')]

Parameters
  • num (int) – The maximum number of collocations to return.

  • window_size (int) – The number of tokens spanned by a collocation (default=2)

Return type

list(tuple(str, str))
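
For programmatic use, the class description recommends calling the underlying analysis classes directly. A rough sketch of that approach with BigramCollocationFinder is given here; the toy tokens, the tiny stopword set and the frequency threshold of 2 are assumptions chosen for illustration (so that no corpus download is needed), not a statement of what collocation_list() does internally.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy token list (illustrative only).
tokens = "new york is big . i love new york ! they left new york".split()

# Hypothetical stand-in for a real stopword list.
stopwords = {"is", "i", "they"}

finder = BigramCollocationFinder.from_words(tokens, window_size=2)
# Keep only bigrams seen at least twice.
finder.apply_freq_filter(2)
# Drop bigrams containing very short tokens or stopwords.
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in stopwords)

# Rank the remaining bigrams by likelihood ratio and keep the best five.
best = finder.nbest(BigramAssocMeasures.likelihood_ratio, 5)
print(best)
```

In this toy text only one bigram occurs more than once, so the frequency filter alone determines the result.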

collocations(num=20, window_size=2)[source]

Print collocations derived from the text, ignoring stopwords.

>>> from nltk.book import text4
>>> text4.collocations() 
United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations

Parameters
  • num (int) – The maximum number of collocations to print.

  • window_size (int) – The number of tokens spanned by a collocation (default=2)

count(word)[source]

Count the number of times this word appears in the text.

index(word)[source]

Find the index of the first occurrence of the word in the text.
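
A short sketch of count() and index() on a toy token list (the tokens are an assumption made for illustration). In this run the lookups compare tokens exactly, so "The" and "the" are counted separately, unlike the case-insensitive concordance methods.

```python
from nltk.text import Text

# Toy token list (illustrative only).
tokens = "The cat sat on the mat".split()
text = Text(tokens)

n_lower = text.count("the")    # exact match on the token "the"
n_upper = text.count("The")    # exact match on the token "The"
first_sat = text.index("sat")  # zero-based position of the first "sat"
print(n_lower, n_upper, first_sat)
```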

readability(method)[source]

similar(word, num=20)[source]

Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.

Parameters
  • word (str) – The word used to seed the similarity search

  • num (int) – The number of words to generate (default=20)

Seealso

ContextIndex.similar_words()

common_contexts(words, num=20)[source]

Find contexts where the specified words appear; list most frequent common contexts first.

Parameters
  • words (list(str)) – The words whose shared contexts are to be found

  • num (int) – The number of contexts to display (default=20)

Seealso

ContextIndex.common_contexts()
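
Both similar() and common_contexts() print their results instead of returning them. The sketch below runs them on a toy token list (an assumption made for illustration) and captures standard output so the results can be inspected; the capture via contextlib is a convenience of this example, not part of the Text API.

```python
import contextlib
import io

from nltk.text import Text

# Toy token list in which "cat" and "dog" share the context the _ sat.
tokens = "the cat sat on the mat . the dog sat on the rug .".split()
text = Text(tokens)

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    text.similar("cat")                   # words seen in the same contexts
    text.common_contexts(["cat", "dog"])  # contexts shared by both words

printed = buffer.getvalue().split()
print(printed)
```

Contexts are printed as left_right pairs, where the underscore marks the position of the target word.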

dispersion_plot(words)[source]

Produce a plot showing the distribution of the words through the text. Requires pylab (part of matplotlib) to be installed.

Parameters

words (list(str)) – The words to be plotted

Seealso

nltk.draw.dispersion_plot()

generate(length=100, text_seed=None, random_seed=42)[source]

Print random text, generated using a trigram language model. See also help(nltk.lm).

Parameters
  • length (int) – The length of text to generate (default=100)

  • text_seed (list(str)) – Generation can be conditioned on preceding context.

  • random_seed (int) – A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible. (default=42)

plot(*args)[source]

See documentation for FreqDist.plot()

Seealso

nltk.prob.FreqDist.plot()

vocab()[source]

Seealso

nltk.prob.FreqDist
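
A minimal sketch of vocab(), assuming a toy token list: the returned object is a FreqDist, so the usual frequency-distribution methods are available on it.

```python
from nltk.text import Text

# Toy token list (illustrative only).
text = Text("the cat sat on the mat".split())

fd = text.vocab()        # FreqDist over the tokens of the text
top = fd.most_common(1)  # most frequent token with its count
total = fd.N()           # total number of tokens counted
print(top, total)
```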

findall(regexp)[source]

Find instances of the regular expression in the text. The text is a list of tokens, so a regexp pattern that matches a single token must be surrounded by angle brackets. E.g.:

>>> from nltk.book import text1, text5, text9
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing
that; that that thing; through these than through; them that the;
through the thick; them that they; thought that the

Parameters

regexp (str) – A regular expression
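
Since findall() prints its matches, a program that needs them as data can use nltk.text.TokenSearcher, which accepts the same angle-bracket syntax and returns lists of tokens. The sketch below assumes a toy token list chosen for illustration.

```python
from nltk.text import TokenSearcher

# Toy token list (illustrative only).
tokens = "the cat sat on the mat".split()
searcher = TokenSearcher(tokens)

# With a parenthesized group, only the grouped tokens are returned.
after_the = searcher.findall("<the>(<.*>)")

# Without a group, the whole matched token sequence is returned.
before_on = searcher.findall("<.*><on>")
print(after_the, before_on)
```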