nltk.tokenize.legality_principle module

The Legality Principle is a language-agnostic principle maintaining that syllable onsets and codas (the beginnings and ends of syllables, not including the vowel) are legal only if they are found as word onsets or codas in the language. The English word "admit" must then be syllabified as "ad-mit", since "dm" is not found word-initially in English (Bartlett et al.). This principle was first proposed in Daniel Kahn’s 1976 dissertation, "Syllable-based generalizations in English phonology".

Kahn further argues that there is a "strong tendency to syllabify in such a way that initial clusters are of maximal length, consistent with the general constraints on word-initial consonant clusters." Consequently, among the legal onsets, the longest one is preferred; this is known as "Onset Maximization".
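As a rough illustration of Onset Maximization, the sketch below splits a consonant cluster that sits between two vowels into a coda and the longest legal onset. The helper name `split_cluster` and the hand-written onset sets are hypothetical and are not part of NLTK.

```python
def split_cluster(cluster, legal_onsets):
    """Split an intervocalic consonant cluster into (coda, onset).

    Hypothetical helper, not NLTK's code: tries the longest candidate
    onset first and shortens it until it is legal (or empty).
    """
    for i in range(len(cluster) + 1):
        onset = cluster[i:]
        # An empty onset is always acceptable as a last resort.
        if onset == "" or onset in legal_onsets:
            return cluster[:i], onset


# "dm" is not a legal onset, so the "d" closes the previous syllable.
print(split_cluster("dm", {"d", "m"}))
# "str" is legal as a whole, so the entire cluster starts the next syllable.
print(split_cluster("str", {"s", "t", "r", "st", "tr", "str"}))
```

Trying the longest candidate first is what makes the split prefer maximal onsets over merely legal ones.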

The default implementation assumes an English vowel set, but the vowels attribute can be set to IPA or any other alphabet’s vowel set to suit the use case. Both a valid set of vowels and a text corpus of words in the language are necessary to determine legal onsets and subsequently syllabify words.

The legality principle with onset maximization is a universal syllabification algorithm, but that does not mean it performs equally well across languages. Bartlett et al. (2009) is a good benchmark for English accuracy when using IPA (p. 311).

References:

Otto Jespersen. 1904. Lehrbuch der Phonetik. Leipzig, Teubner. Chapter 13, Silbe, pp. 185-203.
Theo Vennemann. 1972. On the Theory of Syllabic Phonology. p. 11.
Daniel Kahn. 1976. Syllable-based generalizations in English phonology. PhD dissertation, MIT.
Elisabeth Selkirk. 1984. On the major class features and syllable theory. In Aronoff & Oehrle (eds.) Language Sound Structure: Studies in Phonology. Cambridge, MIT Press. pp. 107-136.
Jeremy Goslin and Ulrich Frauenfelder. 2001. A comparison of theoretical and human syllabification. Language and Speech, 44:409–436.
Susan Bartlett, et al. 2009. On the Syllabification of Phonemes. In HLT-NAACL. pp. 308-316.
Christopher Hench. 2017. Resonances in Middle High German: New Methodologies in Prosody. UC Berkeley.

class nltk.tokenize.legality_principle.LegalitySyllableTokenizer

Bases: TokenizerI

Syllabifies words based on the Legality Principle and Onset Maximization.

>>> from nltk.tokenize import LegalitySyllableTokenizer
>>> from nltk import word_tokenize
>>> from nltk.corpus import words
>>> text = "This is a wonderful sentence."
>>> text_words = word_tokenize(text)
>>> LP = LegalitySyllableTokenizer(words.words())
>>> [LP.tokenize(word) for word in text_words]
[['This'], ['is'], ['a'], ['won', 'der', 'ful'], ['sen', 'ten', 'ce'], ['.']]

__init__(tokenized_source_text, vowels='aeiouy', legal_frequency_threshold=0.001)

Parameters

tokenized_source_text (list(str)) – List of valid tokens in the language
vowels (str) – Vowels of the language, or their IPA representation
legal_frequency_threshold (float) – Minimum frequency an onset must exceed, among all onsets found in the source text, to be considered a legal onset

find_legal_onsets(words)

Gathers all onsets and then returns only those above the frequency threshold.

Parameters: words (list(str)) – List of words in a language
Returns: Set of legal onsets
Return type: set(str)
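The thresholding step can be sketched as follows. This is a minimal sketch under stated assumptions, not NLTK's implementation: the function name `frequent_onsets` is hypothetical, and dividing each count by the total number of observed onsets is an assumed normalisation that may differ from what the library does internally.

```python
from collections import Counter


def frequent_onsets(onsets, threshold=0.001):
    """Return the onsets whose relative frequency exceeds the threshold.

    Hypothetical helper: `onsets` is a list with one onset string per
    word. Normalising by the total count is an assumption.
    """
    counts = Counter(onsets)
    total = sum(counts.values())
    return {onset for onset, count in counts.items() if count / total > threshold}


# With a high threshold only the dominant onset survives.
observed = ["st", "st", "st", "tr", "dm"]
print(frequent_onsets(observed, threshold=0.25))
```

A threshold of this kind keeps rare onsets, such as those from loanwords or typos in the source text, from being treated as legal.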

onset(word)

Returns the initial consonant cluster of a word, i.e. all characters up to the first vowel.

Parameters: word (str) – Single word or token
Returns: String of characters of onset
Return type: str
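The idea of collecting characters up to the first vowel can be illustrated with a small standalone function. The name `leading_consonants` is hypothetical, and lowercasing the input and returning the whole word when it contains no vowel are assumptions of this sketch.

```python
def leading_consonants(word, vowels="aeiouy"):
    """Collect the characters of `word` that precede its first vowel.

    Hypothetical helper for illustration; the word is lowercased first,
    and a word without any vowel is returned in full.
    """
    prefix = ""
    for char in word.lower():
        if char in vowels:
            break
        prefix += char
    return prefix


for example in ["string", "apple", "Three"]:
    print(example, "->", repr(leading_consonants(example)))
```

Passing a different `vowels` string, for example a set of IPA vowel symbols, changes where the onset ends without any other change to the logic.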

tokenize(token)

Apply the Legality Principle in combination with Onset Maximization to return a list of syllables.

Parameters: token (str) – Single word or token
Returns: Single word or token broken up into syllables
Return type: list(str)
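For intuition, a whole word can be syllabified with a simplified standalone sketch like the one below. It is not NLTK's implementation: the function `syllabify` and the hand-picked onset sets are hypothetical, the output is lowercased, and word-initial and word-final consonant runs are simply attached to the first and last syllable.

```python
import re


def syllabify(word, legal_onsets, vowels="aeiouy"):
    """Split `word` into syllables using legal onsets and onset maximization.

    Simplified, hypothetical sketch: the word is cut into alternating
    vowel and consonant runs, and each consonant run between two vowel
    runs is divided into a coda plus the longest legal onset.
    """
    v = re.escape(vowels)
    runs = re.findall(f"[{v}]+|[^{v}]+", word.lower())
    syllables = [""]
    for i, run in enumerate(runs):
        is_vowel_run = run[0] in vowels
        is_edge = i == 0 or i == len(runs) - 1
        if is_vowel_run or is_edge:
            # Nuclei, the word-initial onset and the word-final coda
            # all stay in the current syllable.
            syllables[-1] += run
            continue
        # Earliest cut point gives the longest legal onset.
        cut = next(
            k for k in range(len(run) + 1)
            if k == len(run) or run[k:] in legal_onsets
        )
        syllables[-1] += run[:cut]
        syllables.append(run[cut:])
    return syllables


print(syllabify("admit", {"d", "m", "t"}))
print(syllabify("wonderful", {"w", "d", "f", "n", "r", "l"}))
```

In practice the set of legal onsets would be derived from a corpus rather than written by hand, as the class does with its source text.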
