nltk.tokenize.PunktSentenceTokenizer

class nltk.tokenize.PunktSentenceTokenizer[source]

Bases: PunktBaseClass, TokenizerI

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
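As a minimal sketch (sample text illustrative), the tokenizer can be instantiated without any training text, in which case it falls back on its default punctuation heuristics rather than learned abbreviations or collocations:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# No training text given: default parameters, no learned abbreviations.
tokenizer = PunktSentenceTokenizer()

sentences = tokenizer.tokenize("Hello there. How are you?")
print(sentences)  # ['Hello there.', 'How are you?']
```

For text containing abbreviations such as "Mr." or "Dr.", passing a representative training text to the constructor generally gives better boundaries.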

span_tokenize_sents(strings: List[str]) -> Iterator[List[Tuple[int, int]]]

Apply self.span_tokenize() to each element of strings. I.e.:

return [self.span_tokenize(s) for s in strings]

Parameters: strings (List[str])

Yields: List[Tuple[int, int]]

Return type: Iterator[List[Tuple[int, int]]]
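A brief usage sketch over a small batch of strings (sample inputs are illustrative); one list of (start, end) spans is yielded per input string:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
texts = ["One. Two.", "Three."]

# Each yielded element is the list of sentence spans for one input.
all_spans = list(tokenizer.span_tokenize_sents(texts))
print(all_spans)  # [[(0, 4), (5, 9)], [(0, 6)]]
```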

tokenize_sents(strings: List[str]) -> List[List[str]]

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]

Parameters: strings (List[str])

Return type: List[List[str]]
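A short sketch of batched tokenization (sample strings illustrative); the result contains one list of sentences per input string:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
texts = ["First here. Second here.", "Only one."]

# Equivalent to [tokenizer.tokenize(s) for s in texts].
result = tokenizer.tokenize_sents(texts)
print(result)  # [['First here.', 'Second here.'], ['Only one.']]
```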

__init__(train_text=None, verbose=False, lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]

train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.

train(train_text, verbose=False)[source]

Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.
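A sketch of retraining a tokenizer in place; the corpus below is a tiny stand-in, since Punkt's abbreviation and collocation statistics only become reliable on substantial amounts of text:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

# Each call to train() replaces any previously derived parameters.
corpus = (
    "Dr. Brown arrived early. Dr. Smith followed shortly after. "
    "The meeting with Dr. Jones ran long."
)
tokenizer.train(corpus)

print(tokenizer.tokenize("It rained. We left."))
```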

tokenize(text: str, realign_boundaries: bool = True) -> List[str][source]

Given a text, returns a list of the sentences in that text.

Parameters: text (str), realign_boundaries (bool)

Return type: List[str]
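A sketch of the realign_boundaries flag (sample text illustrative): with realignment on, which is the default, closing punctuation that follows a sentence-final period is folded back into the preceding sentence:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
text = '"Wait here." He left quickly.'

# Default: the closing quote stays with the first sentence.
sents = tokenizer.tokenize(text)
print(sents)  # ['"Wait here."', 'He left quickly.']

# Without realignment the quote is attached to the following slice.
print(tokenizer.tokenize(text, realign_boundaries=False))
```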

debug_decisions(text: str) -> Iterator[Dict[str, Any]][source]

Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.

See format_debug_decision() to help make this output readable.

Parameters: text (str)

Return type: Iterator[Dict[str, Any]]
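A sketch of inspecting the tokenizer's reasoning (sample text illustrative; the full set of keys in each dict is best checked against format_debug_decision()):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

# One dict is yielded per candidate period found in the text.
for decision in tokenizer.debug_decisions("Mr. Smith is here."):
    # 'text' holds the context around the period; 'break_decision'
    # records whether it was classified as a sentence break.
    print(decision["text"], "->", decision["break_decision"])
```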

span_tokenize(text: str, realign_boundaries: bool = True) -> Iterator[Tuple[int, int]][source]

Given a text, generates (start, end) spans of sentences in the text.

Parameters: text (str), realign_boundaries (bool)

Return type: Iterator[Tuple[int, int]]
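A short sketch (sample text illustrative); the spans index directly back into the original string, which is useful when the offsets matter as much as the sentences themselves:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
text = "One sentence. Another one."

spans = list(tokenizer.span_tokenize(text))
print(spans)  # [(0, 13), (14, 26)]

# Recover the sentences from the spans.
recovered = [text[start:end] for start, end in spans]
print(recovered)  # ['One sentence.', 'Another one.']
```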

sentences_from_text(text: str, realign_boundaries: bool = True) -> List[str][source]

Given a text, generates the sentences in that text by testing only candidate sentence breaks. If realign_boundaries is True, closing punctuation that follows a sentence-final period (such as quotation marks or parentheses) is included in the preceding sentence.

Parameters: text (str), realign_boundaries (bool)

Return type: List[str]
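A minimal sketch (sample text illustrative); tokenize() delegates to this method, so the two produce the same result:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

sents = tokenizer.sentences_from_text("It rained. We left.")
print(sents)  # ['It rained.', 'We left.']
```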

text_contains_sentbreak(text: str) -> bool[source]

Returns True if the given text includes a sentence break.

Parameters: text (str)

Return type: bool
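A quick sketch (sample strings illustrative):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

no_break = tokenizer.text_contains_sentbreak("no break here")
has_break = tokenizer.text_contains_sentbreak("One. Two.")
print(no_break, has_break)  # False True
```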

sentences_from_text_legacy(text: str) -> Iterator[str][source]

Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.

Parameters: text (str)

Return type: Iterator[str]

sentences_from_tokens(tokens: Iterator[PunktToken]) -> Iterator[PunktToken][source]

Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.

Parameters: tokens (Iterator[PunktToken])

Return type: Iterator[PunktToken]

dump(tokens: Iterator[PunktToken]) -> None[source]

Parameters: tokens (Iterator[PunktToken])

Return type: None

PUNCTUATION = (';', ':', ',', '.', '!', '?')