nltk.tokenize.punkt module

Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

The NLTK data package includes a pre-trained Punkt tokenizer for English.

>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

(Note that whitespace from the original text, including newlines, is retained in the output.)

Punctuation following sentences is also included by default (from NLTK 3.0 onwards). It can be excluded with the realign_boundaries flag.

>>> text = '''
... (How does it deal with this parenthesis?)  "It should be part of the
... previous sentence." "(And the same with this one.)" ('And this one!')
... "('(And (this)) '?)" [(and this. )]
... '''
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip())))
(How does it deal with this parenthesis?)
-----
"It should be part of the
previous sentence."
-----
"(And the same with this one.)"
-----
('And this one!')
-----
"('(And (this)) '?)"
-----
[(and this. )]
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip(), realign_boundaries=False)))
(How does it deal with this parenthesis?
-----
)  "It should be part of the
previous sentence.
-----
" "(And the same with this one.
-----
)" ('And this one!
-----
')
"('(And (this)) '?
-----
)" [(and this.
-----
)]

However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text.
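As a minimal sketch of training on domain text (assuming NLTK is installed), the snippet below passes an invented corpus string to the constructor. The corpus is far too small to be representative, so what the model learns from it should not be taken as typical.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Invented stand-in for a domain corpus; real training text should be much larger.
corpus = (
    "The lab is run by Prof. Adams. Prof. Adams hired Prof. Baker last year. "
    "Students say Prof. Baker and Prof. Adams work well together. "
    "Prof. Baker teaches statistics. Prof. Adams teaches syntax. "
    "Both labs publish often. Both labs share equipment."
)

# Passing text to the constructor trains the model on that text.
tokenizer = PunktSentenceTokenizer(corpus)

sentences = tokenizer.tokenize("Ask Prof. Adams about it. She will know.")
for sentence in sentences:
    print(sentence)
```

Whether "Prof." is learned as an abbreviation depends on the statistics of the training text, so the printed split may vary with the corpus.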

PunktTrainer learns parameters such as a list of abbreviations (without supervision) from portions of text. Using a PunktTrainer directly allows for incremental training and modification of the hyper-parameters used to decide what is considered an abbreviation, etc.

The algorithm for this tokenizer is described in:

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
  Boundary Detection.  Computational Linguistics 32: 485-525.
class nltk.tokenize.punkt.PunktLanguageVars[source]

Bases: object

Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. An extension of this class may modify its properties to suit a language other than English; an instance can then be passed as an argument to PunktSentenceTokenizer and PunktTrainer constructors.

sent_end_chars = ('.', '?', '!')

Characters which are candidates for sentence boundaries.
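One way to adapt the tokenizer might look like the following sketch, which subclasses PunktLanguageVars and overrides sent_end_chars. The class name SemicolonLanguageVars and the choice of ';' are hypothetical, picked only to make the effect visible on English text.

```python
from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer


class SemicolonLanguageVars(PunktLanguageVars):
    # Hypothetical variant: also treat ';' as a candidate sentence boundary.
    sent_end_chars = ('.', '?', '!', ';')


text = "We arrived late; the hall was empty. Nobody waited."

# Untrained tokenizers: no abbreviations, collocations or sentence starters known.
default_sentences = PunktSentenceTokenizer().tokenize(text)
custom_sentences = PunktSentenceTokenizer(
    lang_vars=SemicolonLanguageVars()
).tokenize(text)

print(default_sentences)
print(custom_sentences)
```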

internal_punctuation = ',:;'

Sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token.

re_boundary_realignment = re.compile('["\\\')\\]}]+?(?:\\s+|(?=--)|$)', re.MULTILINE)

Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).
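The effect of this pattern can be illustrated with the standard re module alone. The sketch below is a simplified illustration rather than NLTK's own implementation: realign is a hypothetical helper that moves closing punctuation matched at the start of one piece onto the end of the previous piece.

```python
import re

# Same pattern as the re_boundary_realignment value shown above.
RE_BOUNDARY_REALIGNMENT = re.compile(
    r'["\')\]}]+?(?:\s+|(?=--)|$)', re.MULTILINE
)


def realign(pieces):
    """Hypothetical helper: attach leading closing punctuation of each
    piece to the piece before it."""
    result = []
    consumed = 0  # leading characters of the current piece already moved back
    for i, piece in enumerate(pieces):
        piece = piece[consumed:]
        consumed = 0
        if i + 1 < len(pieces):
            match = RE_BOUNDARY_REALIGNMENT.match(pieces[i + 1])
            if match:
                # Append the closing punctuation, dropping trailing whitespace.
                piece += match.group(0).strip()
                consumed = match.end()
        if piece:
            result.append(piece)
    return result


raw_pieces = ['(Is this inside?', ') "Yes, it is.', '" Done.']
realigned = realign(raw_pieces)
print(realigned)
```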

word_tokenize(s)[source]

Tokenize a string to split off punctuation other than periods.
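A short sketch of calling this method on a default PunktLanguageVars instance (assuming NLTK is installed). The sample string is invented; the point is that a word-final period stays attached to its word while other punctuation becomes a separate token.

```python
from nltk.tokenize.punkt import PunktLanguageVars

lang_vars = PunktLanguageVars()
tokens = lang_vars.word_tokenize("Hello world. How are you?")

# The period remains part of "world."; the question mark is split off.
print(tokens)
```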

period_context_re()[source]

Compiles and returns a regular expression to find contexts including possible sentence boundaries.

class nltk.tokenize.punkt.PunktParameters[source]

Bases: object

Stores data used to perform sentence boundary detection with Punkt.

__init__()[source]
abbrev_types

A set of word types for known abbreviations.
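As a sketch of supplying abbreviations by hand instead of training, the snippet below fills abbrev_types directly and passes the parameters object to PunktSentenceTokenizer. It assumes abbreviation types are stored lowercased and without the trailing period; the sample sentence is invented.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

params = PunktParameters()
# Assumption: types are lowercase and omit the final period.
params.abbrev_types.update({"dr", "mr"})

text = "Dr. Watson met Mr. Holmes. They talked."

with_abbrevs = PunktSentenceTokenizer(params).tokenize(text)
without_abbrevs = PunktSentenceTokenizer().tokenize(text)

print(with_abbrevs)
print(without_abbrevs)
```

With no abbreviations known, every period-final token is treated as a sentence break, so the second list contains more, shorter pieces.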

collocations

A set of word type tuples for known common collocations where the first word ends in a period. E.g., (‘S.’, ‘Bach’) is a common collocation in a text that discusses ‘Johann S. Bach’. These count as negative evidence for sentence boundaries.

sent_starters

A set of word types for words that often appear at the beginning of sentences.

ortho_context

A dictionary mapping word types to the set of orthographic contexts that word type appears in. Contexts are represented by adding orthographic context flags: …

clear_abbrevs()[source]
clear_collocations()[source]
clear_sent_starters()[source]
clear_ortho_context()[source]
add_ortho_context(typ, flag)[source]
class nltk.tokenize.punkt.PunktToken[source]

Bases: object

Stores a token of text with annotations produced during sentence boundary detection.

__init__(tok, **params)[source]
tok
type
period_final
property type_no_period

The type with its final period removed if it has one.

property type_no_sentperiod

The type with its final period removed if it is marked as a sentence break.

property first_upper

True if the token’s first character is uppercase.

property first_lower

True if the token’s first character is lowercase.

property first_case
property is_ellipsis

True if the token text is that of an ellipsis.

property is_number

True if the token text is that of a number.

property is_initial

True if the token text is that of an initial.

property is_alpha

True if the token text is all alphabetic.

property is_non_punct

True if the token is either a number or is alphabetic.

parastart
linestart
sentbreak
abbr
ellipsis
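The properties above can be explored by constructing tokens directly, as in this small sketch (assuming NLTK is installed). Several of the is_* properties may return a truthy match object rather than a strict boolean, so the sketch wraps them in bool().

```python
from nltk.tokenize.punkt import PunktToken

abbrev = PunktToken("Mr.")
initial = PunktToken("S.")
number = PunktToken("42")
dots = PunktToken("...")
# Annotations such as sentbreak can be passed as keyword arguments.
marked = PunktToken("Mr.", sentbreak=True)

print(abbrev.type, abbrev.type_no_period, abbrev.period_final)  # mr. mr True
print(abbrev.first_case)                                        # upper
print(bool(initial.is_initial), bool(dots.is_ellipsis))         # True True
print(number.type, bool(number.is_number))                      # ##number## True
# The period is only stripped when the token is marked as a sentence break.
print(abbrev.type_no_sentperiod, marked.type_no_sentperiod)     # mr. mr
```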
class nltk.tokenize.punkt.PunktBaseClass[source]

Bases: object

Includes common components of PunktTrainer and PunktSentenceTokenizer.

__init__(lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>, params=None)[source]
class nltk.tokenize.punkt.PunktTrainer[source]

Bases: nltk.tokenize.punkt.PunktBaseClass

Learns parameters used in Punkt sentence boundary detection.

__init__(train_text=None, verbose=False, lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]
get_params()[source]

Calculates and returns parameters for sentence boundary detection as derived from training.

ABBREV = 0.3

Cut-off value for deciding whether a ‘token’ is an abbreviation.

IGNORE_ABBREV_PENALTY = False

Allows the abbreviation penalty heuristic to be disabled. The heuristic exponentially disadvantages words that are sometimes found without a final period.

ABBREV_BACKOFF = 5

Upper cut-off for Mikheev’s (2002) abbreviation detection algorithm.

COLLOCATION = 7.88

Minimum log-likelihood value that two tokens need in order to be considered a collocation.

SENT_STARTER = 30

Minimum log-likelihood value that a token needs in order to be considered a frequent sentence starter.

INCLUDE_ALL_COLLOCS = False

If True, includes as potential collocations all word pairs where the first word ends in a period. It may be useful in corpora where there is a lot of variation that makes abbreviations like Mr difficult to identify.

INCLUDE_ABBREV_COLLOCS = False

If True, includes as potential collocations all word pairs where the first word is an abbreviation. Such collocations override the orthographic heuristic, but not the sentence starter heuristic. This is overridden by INCLUDE_ALL_COLLOCS, and if both are False, only collocations with initials and ordinals are considered.

MIN_COLLOC_FREQ = 1

Sets a lower bound on the number of times a bigram needs to appear before it can be considered a collocation, in addition to log-likelihood statistics. This is useful when INCLUDE_ALL_COLLOCS is True.
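The constants above can be adjusted before training. The sketch below sets them as instance attributes so that only this trainer is affected; the values chosen and the corpus are invented for illustration and are not recommendations.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Invented sample text; a realistic corpus would be much larger.
corpus = (
    "Dr. Lee opened the clinic in May. Dr. Lee sees patients daily. "
    "Patients trust Dr. Lee. The clinic closes at noon. "
    "The clinic reopens on Monday. Dr. Lee is away until then."
)

trainer = PunktTrainer()
# Instance attributes shadow the class-level defaults for this trainer only.
trainer.INCLUDE_ALL_COLLOCS = True
trainer.MIN_COLLOC_FREQ = 2
trainer.train(corpus)  # finalize defaults to True

params = trainer.get_params()
tokenizer = PunktSentenceTokenizer(params)

print(sorted(params.abbrev_types))
print(sorted(params.collocations))
print(tokenizer.tokenize("Ask Dr. Lee on Monday. The clinic is closed today."))
```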

train(text, verbose=False, finalize=True)[source]

Collects training data from a given text. If finalize is True, it will determine all the parameters for sentence boundary detection. If not, this will be delayed until get_params() or finalize_training() is called. If verbose is True, abbreviations found will be listed.

train_tokens(tokens, verbose=False, finalize=True)[source]

Collects training data from a given list of tokens.

finalize_training(verbose=False)[source]

Uses data that has been gathered in training to determine likely collocations and sentence starters.
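Incremental training could be sketched as follows: each batch is passed to train() with finalize=False, and finalize_training() is called once at the end. The batches are invented placeholders for text that would normally be read from files or a stream.

```python
from nltk.tokenize.punkt import (
    PunktParameters,
    PunktSentenceTokenizer,
    PunktTrainer,
)

# Invented batches standing in for chunks of a larger corpus.
batches = [
    "Mr. Gray called at nine. Mr. Gray left a message. Nobody answered.",
    "The message was short. Mr. Gray will call again. Nobody minds.",
]

trainer = PunktTrainer()
for batch in batches:
    # Gather statistics only; defer collocation and sentence-starter decisions.
    trainer.train(batch, finalize=False)

trainer.finalize_training()
params = trainer.get_params()

tokenizer = PunktSentenceTokenizer(params)
sentences = tokenizer.tokenize("Mr. Gray called. Nobody answered.")
print(sentences)
```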

freq_threshold(ortho_thresh=2, type_thresh=2, colloc_thres=2, sentstart_thresh=2)[source]

Allows memory use to be reduced after much training by removing data about rare tokens that are unlikely to have a statistical effect with further training. Entries that occur at least as often as the given thresholds are retained.

find_abbrev_types()[source]

Recalculates abbreviations from type frequencies alone, without relying on any prior determination of abbreviations. Abbreviations that would otherwise be found as “rare” are not included.

class nltk.tokenize.punkt.PunktSentenceTokenizer[source]

Bases: nltk.tokenize.punkt.PunktBaseClass, nltk.tokenize.api.TokenizerI

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.

__init__(train_text=None, verbose=False, lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]

train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.

train(train_text, verbose=False)[source]

Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.

tokenize(text, realign_boundaries=True)[source]

Given a text, returns a list of the sentences in that text.

debug_decisions(text)[source]

Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.

See format_debug_decision() to help make this output readable.
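A brief sketch of inspecting decisions with an untrained tokenizer (assuming NLTK is installed). The sample text is invented, and the exact layout of the formatted report may differ between NLTK versions.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, format_debug_decision

tokenizer = PunktSentenceTokenizer()  # untrained: no abbreviations known

# One dict is yielded per candidate sentence boundary in the text.
decisions = list(tokenizer.debug_decisions("Hello world. How are you?"))
reports = [format_debug_decision(decision) for decision in decisions]

for report in reports:
    print(report)
```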

span_tokenize(text, realign_boundaries=True)[source]

Given a text, generates (start, end) spans of sentences in the text.
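The spans can be used to slice the original string, which is handy when character offsets must be preserved. A small sketch with an untrained tokenizer and an invented sample text:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
text = "Hello world. How are you?"

spans = list(tokenizer.span_tokenize(text))
print(spans)  # [(0, 12), (13, 25)]

# Slicing the text with each span recovers the sentence strings.
sentences = [text[start:end] for start, end in spans]
print(sentences)
```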

sentences_from_text(text, realign_boundaries=True)[source]

Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, each sentence includes any closing punctuation that follows the period.

text_contains_sentbreak(text)[source]

Returns True if the given text includes a sentence break.
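A minimal sketch of this check with an untrained tokenizer; both sample strings are invented.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

# A period-final token followed by more text counts as a break.
has_break = tokenizer.text_contains_sentbreak("It rained. We stayed in")
# No candidate boundary at all.
no_break = tokenizer.text_contains_sentbreak("It rained all day")

print(has_break, no_break)  # True False
```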

sentences_from_text_legacy(text)[source]

Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.

sentences_from_tokens(tokens)[source]

Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.
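When the text is already split into word tokens, the grouping might be used as in this sketch. The token list is hand-written for illustration, with periods left attached to the preceding word.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()  # untrained: no abbreviations known

# Hand-written tokens; periods stay attached to the word they follow.
tokens = ["It", "rained.", "We", "stayed", "in."]

sentences = list(tokenizer.sentences_from_tokens(tokens))
print(sentences)  # [['It', 'rained.'], ['We', 'stayed', 'in.']]
```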

dump(tokens)[source]
PUNCTUATION = (';', ':', ',', '.', '!', '?')
nltk.tokenize.punkt.format_debug_decision(d)[source]
nltk.tokenize.punkt.demo(text, tok_cls=<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>, train_cls=<class 'nltk.tokenize.punkt.PunktTrainer'>)[source]

Builds a Punkt model and applies it to the same text.