nltk.tokenize.PunktSentenceTokenizer¶
- class nltk.tokenize.PunktSentenceTokenizer[source]¶
Bases: PunktBaseClass, TokenizerI
A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
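For languages without a pre-trained model, the tokenizer can be trained directly from raw text via the constructor. A minimal sketch (the training corpus below is an invented stand-in for real data):

```python
# Train a Punkt model by passing raw text to the constructor, then use
# the resulting model to split new text into sentences.
# The corpus here is a hypothetical stand-in for a real training text.
from nltk.tokenize.punkt import PunktSentenceTokenizer

corpus = "The dog barked loudly. The cat ran away. The bird sang a song."
tokenizer = PunktSentenceTokenizer(corpus)  # derives parameters from corpus

sentences = tokenizer.tokenize("Tests passed. We shipped the release.")
```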
- span_tokenize_sents(strings: List[str]) Iterator[List[Tuple[int, int]]]¶
Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings]
- Yield
List[Tuple[int, int]]
- Parameters
strings (List[str]) –
- Return type
Iterator[List[Tuple[int, int]]]
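A minimal sketch of batch span tokenization (untrained tokenizer, invented input strings); each yielded list holds the (start, end) offsets for one input string:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()  # default, untrained parameters
docs = ["One sentence. Another one.", "Just one here."]

# One list of (start, end) spans per input string.
all_spans = list(tokenizer.span_tokenize_sents(docs))
first_doc_sents = [docs[0][start:end] for start, end in all_spans[0]]
```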
- tokenize_sents(strings: List[str]) List[List[str]]¶
Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings]
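A minimal sketch of batch tokenization over invented input strings:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
docs = ["First one. Second one.", "Only one sentence here."]

# One list of sentences per input string.
batches = tokenizer.tokenize_sents(docs)
```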
- __init__(train_text=None, verbose=False, lang_vars=None, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]¶
train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.
- train(train_text, verbose=False)[source]¶
Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.
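A minimal sketch (invented corpus). Note that train() returns the derived parameter set; passing train_text to the constructor is the usual way to have the parameters stored on the tokenizer itself:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

corpus = "The dog barked loudly. The cat ran away. The bird sang a song."

# Training at construction time derives the parameters and stores them.
tokenizer = PunktSentenceTokenizer(train_text=corpus)

# train() itself returns the derived PunktParameters object.
params = tokenizer.train(corpus)
```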
- tokenize(text: str, realign_boundaries: bool = True) List[str] [source]¶
Given a text, returns a list of the sentences in that text.
- Parameters
text (str) –
realign_boundaries (bool) –
- Return type
List[str]
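A minimal sketch showing the default realign_boundaries=True behavior, which keeps closing punctuation (such as a quote after a period) attached to its sentence; the input text is invented:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
text = '"It works." She smiled.'

sentences = tokenizer.tokenize(text)  # realign_boundaries defaults to True
```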
- debug_decisions(text: str) Iterator[Dict[str, Any]] [source]¶
Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.
See format_debug_decision() to help make this output readable.
- Parameters
text (str) –
- Return type
Iterator[Dict[str, Any]]
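A minimal sketch of inspecting the decisions on an invented text, formatted with the module-level format_debug_decision() helper:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, format_debug_decision

tokenizer = PunktSentenceTokenizer()

# One dict per candidate period, explaining the break/no-break decision.
decisions = list(tokenizer.debug_decisions("It ran. It stopped. It slept."))
report = "\n".join(format_debug_decision(d) for d in decisions)
```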
- span_tokenize(text: str, realign_boundaries: bool = True) Iterator[Tuple[int, int]] [source]¶
Given a text, generates (start, end) spans of sentences in the text.
- Parameters
text (str) –
realign_boundaries (bool) –
- Return type
Iterator[Tuple[int, int]]
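A minimal sketch (invented text): because the spans are (start, end) offsets into the original string, slicing with them recovers the exact sentence substrings:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
text = "Spans are lazy. They come from a generator."

spans = list(tokenizer.span_tokenize(text))          # [(start, end), ...]
recovered = [text[start:end] for start, end in spans]
```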
- sentences_from_text(text: str, realign_boundaries: bool = True) List[str] [source]¶
Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, closing punctuation that follows the period is included in the sentence.
- Parameters
text (str) –
realign_boundaries (bool) –
- Return type
List[str]
- text_contains_sentbreak(text: str) bool [source]¶
Returns True if the given text includes a sentence break.
- Parameters
text (str) –
- Return type
bool
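A minimal sketch (invented inputs) of using this as a cheap check before running a full tokenization:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()

has_break = tokenizer.text_contains_sentbreak("Two sentences. Right here.")
no_break = tokenizer.text_contains_sentbreak("a fragment with no break")
```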
- sentences_from_text_legacy(text: str) Iterator[str] [source]¶
Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.
- Parameters
text (str) –
- Return type
Iterator[str]
- sentences_from_tokens(tokens: Iterator[PunktToken]) Iterator[PunktToken] [source]¶
Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.
- Parameters
tokens (Iterator[PunktToken]) –
- Return type
Iterator[PunktToken]
- dump(tokens: Iterator[PunktToken]) None [source]¶
- Parameters
tokens (Iterator[PunktToken]) –
- Return type
None
- PUNCTUATION = (';', ':', ',', '.', '!', '?')¶