nltk.tokenize.texttiling module

class nltk.tokenize.texttiling.TextTilingTokenizer

Bases: TokenizerI

Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.

The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the similarity method used, a similarity score is assigned to each gap between pseudosentences. The algorithm then detects the peak differences between these scores and marks them as boundaries. Finally, the boundaries are normalized to the closest paragraph break and the segmented text is returned.
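The gap-scoring step described above can be illustrated with a minimal, pure-Python sketch. It is not NLTK's implementation: the function name `gap_scores`, the regex-based word splitting, and the absence of stopword filtering are all assumptions made for brevity. It scores each gap by the cosine similarity of the word counts in the blocks on either side, in the spirit of the block comparison method.

```python
import math
import re
from collections import Counter


def gap_scores(text, w=5, k=2):
    """Score each pseudosentence gap by cosine similarity of adjacent blocks.

    Hypothetical helper for illustration only; not part of NLTK.
    """
    words = re.findall(r"[a-z]+", text.lower())
    # Split the word stream into pseudosentences of w words each.
    pseudo = [words[i:i + w] for i in range(0, len(words), w)]
    scores = []
    for gap in range(1, len(pseudo)):
        # Up to k pseudosentences on each side of the gap form the two blocks.
        left = Counter(t for s in pseudo[max(0, gap - k):gap] for t in s)
        right = Counter(t for s in pseudo[gap:gap + k] for t in s)
        dot = sum(left[t] * right[t] for t in left)
        norm = math.sqrt(sum(v * v for v in left.values())
                         * sum(v * v for v in right.values()))
        scores.append(dot / norm if norm else 0.0)
    return scores
```

A low score at a gap means the vocabulary on the two sides barely overlaps, which is the signal the later steps treat as a candidate subtopic shift.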

Parameters:

w (int) – Pseudosentence size
k (int) – Size (in sentences) of the block used in the block comparison method
similarity_method (constant) – The method used for determining similarity scores: BLOCK_COMPARISON (default) or VOCABULARY_INTRODUCTION.
stopwords (list(str)) – A list of stopwords that are filtered out (defaults to NLTK’s stopwords corpus)
smoothing_method (constant) – The method used for smoothing the score plot: DEFAULT_SMOOTHING (default)
smoothing_width (int) – The width of the window used by the smoothing method
smoothing_rounds (int) – The number of smoothing passes
cutoff_policy (constant) – The policy used to determine the number of boundaries: HC (default) or LC
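To make the role of cutoff_policy more concrete, here is a simplified sketch of how depth scores and a cutoff could turn gap scores into boundary flags. The helper names `depth_scores` and `boundaries` are hypothetical, and the thresholds (mean minus half a standard deviation for HC, mean minus a full standard deviation for LC) are an assumption based on the usual description of TextTiling. The actual tokenizer applies further rules, so treat this as an illustration rather than a reimplementation.

```python
from statistics import mean, pstdev


def depth_scores(scores):
    """Depth of each gap: climb to the nearest peak on both sides."""
    depths = []
    for i, s in enumerate(scores):
        left = s
        for v in reversed(scores[:i]):
            if v < left:
                break
            left = v
        right = s
        for v in scores[i + 1:]:
            if v < right:
                break
            right = v
        depths.append((left - s) + (right - s))
    return depths


def boundaries(depths, policy="HC"):
    """Flag gaps whose depth exceeds the cutoff (1 = boundary, 0 = none).

    Assumed cutoffs: HC = mean - stdev / 2, LC = mean - stdev.
    """
    avg, sd = mean(depths), pstdev(depths)
    cutoff = avg - sd / 2 if policy == "HC" else avg - sd
    # The d > 0 guard (an addition of this sketch) skips gaps with no valley.
    return [1 if d > cutoff and d > 0 else 0 for d in depths]
```

Under these assumptions, HC sets the higher threshold and so yields the same or fewer boundaries than LC for a given list of depth scores.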

>>> from nltk.corpus import brown
>>> from nltk.tokenize.texttiling import TextTilingTokenizer
>>> tt = TextTilingTokenizer(demo_mode=True)
>>> text = brown.raw()[:4000]
>>> s, ss, d, b = tt.tokenize(text)
>>> b
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]

__init__(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)

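tokenize(text): Return a tokenized copy of text, where each “token” represents a separate topic. With demo_mode=True, the gap scores, smoothed scores, depth scores, and boundary flags are returned instead, as in the example above.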

class nltk.tokenize.texttiling.TokenSequence

Bases: object

A token list with its original length and its index

__init__(index, wrdindex_list, original_length=None)

class nltk.tokenize.texttiling.TokenTableField

Bases: object

A field in the token table holding parameters for each token, used later in the process

__init__(first_pos, ts_occurences, total_count=1, par_count=1, last_par=0, last_tok_seq=None)

nltk.tokenize.texttiling.demo(text=None)

nltk.tokenize.texttiling.smooth(x, window_len=11, window='flat')

Smooth the data using a window of the requested size.

This method is based on the convolution of a scaled window with the signal. The signal is prepared by adding reflected copies of itself (of the window size) at both ends, so that transient effects are minimized at the beginning and end of the output signal.

Parameters:

x – the input signal
window_len – the dimension of the smoothing window; should be an odd integer
window – the type of window: one of ‘flat’, ‘hanning’, ‘hamming’, ‘bartlett’, or ‘blackman’. A flat window produces moving-average smoothing.

Returns:

the smoothed signal

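Example:

from numpy import arange, sin
from numpy.random import randn
from nltk.tokenize.texttiling import smooth
t = arange(-2, 2, 0.1)  # arange takes a step; linspace's third argument is a sample count
x = sin(t) + randn(len(t)) * 0.1
y = smooth(x)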
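The padding-and-averaging idea described for the flat window can also be sketched without NumPy. The function below, `smooth_flat`, is a hypothetical helper rather than NLTK's code: it mirrors samples around each end of the signal (a simplification of the reflection that smooth performs) and then takes a plain moving average, so the output has the same length as the input.

```python
def smooth_flat(signal, window_len=3):
    """Moving-average smoothing with mirrored padding at both ends.

    Illustrative sketch only; the padding scheme is an assumption and
    differs in detail from nltk.tokenize.texttiling.smooth.
    """
    if window_len < 3:
        return list(signal)
    if window_len % 2 == 0:
        raise ValueError("window_len should be an odd integer")
    if len(signal) < window_len:
        raise ValueError("signal must be at least as long as the window")
    half = window_len // 2
    # Mirror `half` samples around each end, excluding the end sample itself.
    left = list(signal[half:0:-1])
    right = list(signal[-2:-half - 2:-1])
    padded = left + list(signal) + right
    # Each output value is the mean of the window centred on that sample.
    return [sum(padded[i:i + window_len]) / window_len
            for i in range(len(signal))]
```

Because the padding extends the signal by half a window on each side, every sample has a full window centred on it and no edge values are dropped.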

See also: numpy.hanning, numpy.hamming, numpy.bartlett, numpy.blackman, numpy.convolve, scipy.signal.lfilter

TODO: the window parameter could accept the window itself as an array, instead of only a string naming it