nltk.tokenize.punkt module

Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

The NLTK data package includes a pre-trained Punkt tokenizer for English.

>>> from nltk.tokenize import PunktTokenizer
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... '''
>>> sent_detector = PunktTokenizer()
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

(Note that whitespace from the original text, including newlines, is retained in the output.)

Punctuation following sentences is also included by default (from NLTK 3.0 onwards). It can be excluded with the realign_boundaries flag.

>>> text = '''
... (How does it deal with this parenthesis?)  "It should be part of the
... previous sentence." "(And the same with this one.)" ('And this one!')
... "('(And (this)) '?)" [(and this. )]
... '''
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip())))
(How does it deal with this parenthesis?)
-----
"It should be part of the
previous sentence."
-----
"(And the same with this one.)"
-----
('And this one!')
-----
"('(And (this)) '?)"
-----
[(and this. )]
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip(), realign_boundaries=False)))
(How does it deal with this parenthesis?
-----
)  "It should be part of the
previous sentence.
-----
" "(And the same with this one.
-----
)" ('And this one!
-----
')
"('(And (this)) '?
-----
)" [(and this.
-----
)]

However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text.
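
For example, a tokenizer can be trained directly on in-domain text (a minimal sketch; domain_corpus.txt is a hypothetical file of plaintext from the target domain):

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> corpus_text = open('domain_corpus.txt').read()  # hypothetical in-domain plaintext
>>> domain_tokenizer = PunktSentenceTokenizer(corpus_text)
>>> sentences = domain_tokenizer.tokenize('Dr. Brown arrived. He sat down.')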

PunktTrainer learns parameters such as a list of abbreviations (without supervision) from portions of text. Using a PunktTrainer directly allows for incremental training and modification of the hyper-parameters used to decide what is considered an abbreviation, etc.
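
A sketch of incremental training (first_batch and second_batch are placeholder strings of training text):

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
>>> trainer = PunktTrainer()
>>> trainer.train(first_batch, finalize=False)   # accumulate counts only
>>> trainer.train(second_batch, finalize=False)
>>> trainer.finalize_training(verbose=True)      # decide abbreviations, collocations, etc.
>>> tokenizer = PunktSentenceTokenizer(trainer.get_params())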

The algorithm for this tokenizer is described in:

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
  Boundary Detection.  Computational Linguistics 32: 485-525.

class nltk.tokenize.punkt.PunktBaseClass[source]

Bases: object

Includes common components of PunktTrainer and PunktSentenceTokenizer.

__init__(lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>, params=None)[source]

class nltk.tokenize.punkt.PunktLanguageVars[source]

Bases: object

Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. An extension of this class may modify its properties to suit a language other than English; an instance can then be passed as an argument to PunktSentenceTokenizer and PunktTrainer constructors.
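
For instance, a subclass can declare additional sentence-ending characters (an illustrative sketch, here adding the Devanagari danda as a boundary candidate):

>>> from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer
>>> class HindiLanguageVars(PunktLanguageVars):
...     sent_end_chars = ('.', '?', '!', '\u0964')  # '\u0964' is the danda (।)
>>> tokenizer = PunktSentenceTokenizer(lang_vars=HindiLanguageVars())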

internal_punctuation = ',:;'

Sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token.

period_context_re()[source]

Compiles and returns a regular expression to find contexts including possible sentence boundaries.

re_boundary_realignment = re.compile('["\\\')\\]}]+?(?:\\s+|(?=--)|$)', re.MULTILINE)

Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).

sent_end_chars = ('.', '?', '!')

Characters which are candidates for sentence boundaries.

word_tokenize(s)[source]

Tokenize a string to split off punctuation other than periods.
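
A quick illustration (output shown as expected: periods stay attached, other punctuation is split off):

>>> from nltk.tokenize.punkt import PunktLanguageVars
>>> PunktLanguageVars().word_tokenize('Mr. Smith, hello!')
['Mr.', 'Smith', ',', 'hello', '!']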

class nltk.tokenize.punkt.PunktParameters[source]

Bases: object

Stores data used to perform sentence boundary detection with Punkt.

__init__()[source]
abbrev_types

A set of word types for known abbreviations.

add_ortho_context(typ, flag)[source]
clear_abbrevs()[source]
clear_collocations()[source]
clear_ortho_context()[source]
clear_sent_starters()[source]
collocations

A set of word type tuples for known common collocations where the first word ends in a period. E.g., (‘S.’, ‘Bach’) is a common collocation in a text that discusses ‘Johann S. Bach’. These count as negative evidence for sentence boundaries.

ortho_context

A dictionary mapping word types to the set of orthographic contexts that word type appears in. Contexts are represented by adding orthographic context flags: …

sent_starters

A set of word types for words that often appear at the beginning of sentences.
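
The learned attributes above can be inspected on any parameters object, e.g. one returned by PunktTrainer.get_params() (a sketch; trainer is assumed to be an already-trained PunktTrainer):

>>> params = trainer.get_params()
>>> len(params.abbrev_types), len(params.collocations), len(params.sent_starters)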

class nltk.tokenize.punkt.PunktSentenceTokenizer[source]

Bases: PunktBaseClass, TokenizerI

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.

PUNCTUATION = (';', ':', ',', '.', '!', '?')
__init__(train_text=None, verbose=False, lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]

train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.

debug_decisions(text: str) → Iterator[Dict[str, Any]][source]

Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.

See format_debug_decision() to help make this output readable.

Parameters:

text (str)

Return type:

Iterator[Dict[str, Any]]
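
A usage sketch, pairing the output with format_debug_decision():

>>> from nltk.tokenize.punkt import PunktTokenizer, format_debug_decision
>>> sent_detector = PunktTokenizer()
>>> for decision in sent_detector.debug_decisions('Mr. Smith left. He returned.'):
...     print(format_debug_decision(decision))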

dump(tokens: Iterator[PunktToken]) → None[source]
Parameters:

tokens (Iterator[PunktToken])

Return type:

None

sentences_from_text(text: str, realign_boundaries: bool = True) → List[str][source]

Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, includes in the sentence the closing punctuation that follows the period.

Parameters:
  • text (str)

  • realign_boundaries (bool)

Return type:

List[str]

sentences_from_text_legacy(text: str) → Iterator[str][source]

Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.

Parameters:

text (str)

Return type:

Iterator[str]

sentences_from_tokens(tokens: Iterator[PunktToken]) → Iterator[PunktToken][source]

Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.

Parameters:

tokens (Iterator[PunktToken])

Return type:

Iterator[PunktToken]

span_tokenize(text: str, realign_boundaries: bool = True) → Iterator[Tuple[int, int]][source]

Given a text, generates (start, end) spans of sentences in the text.

Parameters:
  • text (str)

  • realign_boundaries (bool)

Return type:

Iterator[Tuple[int, int]]
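
For example, reusing the sent_detector from the introduction (spans are (start, end) character offsets; the expected output assumes the pre-trained English model splits after each period):

>>> list(sent_detector.span_tokenize('One sentence. Two sentence.'))
[(0, 13), (14, 27)]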

text_contains_sentbreak(text: str) → bool[source]

Returns True if the given text includes a sentence break.

Parameters:

text (str)

Return type:

bool

tokenize(text: str, realign_boundaries: bool = True) → List[str][source]

Given a text, returns a list of the sentences in that text.

Parameters:
  • text (str)

  • realign_boundaries (bool)

Return type:

List[str]

train(train_text, verbose=False)[source]

Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.

class nltk.tokenize.punkt.PunktToken[source]

Bases: object

Stores a token of text with annotations produced during sentence boundary detection.

__init__(tok, **params)[source]
abbr
ellipsis
property first_case
property first_lower

True if the token’s first character is lowercase.

property first_upper

True if the token’s first character is uppercase.

property is_alpha

True if the token text is all alphabetic.

property is_ellipsis

True if the token text is that of an ellipsis.

property is_initial

True if the token text is that of an initial.

property is_non_punct

True if the token is either a number or is alphabetic.

property is_number

True if the token text is that of a number.

linestart
parastart
period_final
sentbreak
tok
type
property type_no_period

The type with its final period removed if it has one.

property type_no_sentperiod

The type with its final period removed if it is marked as a sentence break.
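
A small illustration of the token annotations (output shown as expected):

>>> from nltk.tokenize.punkt import PunktToken
>>> t = PunktToken('Mr.')
>>> t.period_final
True
>>> t.type_no_period
'mr'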

class nltk.tokenize.punkt.PunktTokenizer[source]

Bases: PunktSentenceTokenizer

Punkt Sentence Tokenizer that loads/saves its parameters from/to data files.

__init__(lang='english')[source]

lang selects which pre-trained parameter set to load (see load_lang()).

load_lang(lang='english')[source]
save_params()[source]
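
A small usage sketch (assumes the punkt_tab data package is installed; 'german' is one of the packaged languages):

>>> from nltk.tokenize.punkt import PunktTokenizer
>>> sent_detector = PunktTokenizer()   # loads the pre-trained English parameters
>>> sent_detector.load_lang('german')  # switch to another packaged parameter set
>>> sentences = sent_detector.tokenize('Guten Tag. Wie geht es Ihnen?')
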
class nltk.tokenize.punkt.PunktTrainer[source]

Bases: PunktBaseClass

Learns parameters used in Punkt sentence boundary detection.

ABBREV = 0.3

Cut-off value for deciding whether a ‘token’ is an abbreviation.

ABBREV_BACKOFF = 5

Upper cut-off for Mikheev’s (2002) abbreviation detection algorithm.

COLLOCATION = 7.88

Minimal log-likelihood value that two tokens need to be considered a collocation.

IGNORE_ABBREV_PENALTY = False

Allows disabling of the abbreviation penalty heuristic, which exponentially disadvantages words that are sometimes found without a final period.

INCLUDE_ABBREV_COLLOCS = False

Includes as potential collocations all word pairs where the first word is an abbreviation. Such collocations override the orthographic heuristic, but not the sentence-starter heuristic. This setting is overridden by INCLUDE_ALL_COLLOCS; if both are False, only collocations with initials and ordinals are considered.

INCLUDE_ALL_COLLOCS = False

Includes as potential collocations all word pairs where the first word ends in a period. It may be useful in corpora where there is so much variation that abbreviations like ‘Mr’ are difficult to identify.

MIN_COLLOC_FREQ = 1

Sets a minimum bound on the number of times a bigram needs to appear before it can be considered a collocation, in addition to the log-likelihood statistics. This is useful when INCLUDE_ALL_COLLOCS is True; see the sketch after these attributes.

SENT_STARTER = 30

Minimal log-likelihood value that a token requires to be considered a frequent sentence starter.
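
These class attributes act as hyper-parameters and can be overridden on a trainer instance before training (a sketch; corpus_text is placeholder training text):

>>> from nltk.tokenize.punkt import PunktTrainer
>>> trainer = PunktTrainer()
>>> trainer.INCLUDE_ALL_COLLOCS = True   # consider every period-final word pair
>>> trainer.MIN_COLLOC_FREQ = 10         # ...but require at least 10 occurrences
>>> trainer.train(corpus_text)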

__init__(train_text=None, verbose=False, lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]
finalize_training(verbose=False)[source]

Uses data that has been gathered in training to determine likely collocations and sentence starters.

find_abbrev_types()[source]

Recalculates abbreviations given type frequencies, despite no prior determination of abbreviations. This fails to include abbreviations otherwise found as “rare”.

freq_threshold(ortho_thresh=2, type_thresh=2, colloc_thres=2, sentstart_thresh=2)[source]

Allows memory use to be reduced after much training by removing data about rare tokens that are unlikely to have a statistical effect with further training. Entries occurring above the given thresholds will be retained.

get_params()[source]

Calculates and returns parameters for sentence boundary detection as derived from training.

train(text, verbose=False, finalize=True)[source]

Collects training data from a given text. If finalize is True, it will determine all the parameters for sentence boundary detection. If not, this will be delayed until get_params() or finalize_training() is called. If verbose is True, abbreviations found will be listed.

train_tokens(tokens, verbose=False, finalize=True)[source]

Collects training data from a given list of tokens.

nltk.tokenize.punkt.demo(text, tok_cls=<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>, train_cls=<class 'nltk.tokenize.punkt.PunktTrainer'>)[source]

Builds a Punkt model and applies it to the same text.

nltk.tokenize.punkt.format_debug_decision(d)[source]
nltk.tokenize.punkt.load_punkt_params(lang_dir)[source]
nltk.tokenize.punkt.save_punkt_params(params, dir='/tmp/punkt_tab')[source]