nltk.tokenize.sonority_sequencing module

The Sonority Sequencing Principle (SSP) is a language agnostic algorithm proposed by Otto Jesperson in 1904. The sonorous quality of a phoneme is judged by the openness of the lips. Syllable breaks occur before troughs in sonority. For more on the SSP see Selkirk (1984).

The default implementation uses the English alphabet, but the sonority_hiearchy can be modified to IPA or any other alphabet for the use-case. The SSP is a universal syllabification algorithm, but that does not mean it performs equally across languages. Bartlett et al. (2009) is a good benchmark for English accuracy if utilizing IPA (pg. 311).

Importantly, if a custom hierarchy is supplied and vowels span across more than one level, they should be given separately to the vowels class attribute.

References: - Otto Jespersen. 1904. Lehrbuch der Phonetik.

Leipzig, Teubner. Chapter 13, Silbe, pp. 185-203.

  • Elisabeth Selkirk. 1984. On the major class features and syllable theory. In Aronoff & Oehrle (eds.) Language Sound Structure: Studies in Phonology. Cambridge, MIT Press. pp. 107-136.

  • Susan Bartlett, et al. 2009. On the Syllabification of Phonemes. In HLT-NAACL. pp. 308-316.

class nltk.tokenize.sonority_sequencing.SyllableTokenizer[source]

Bases: nltk.tokenize.api.TokenizerI

Syllabifies words based on the Sonority Sequencing Principle (SSP).

>>> from nltk.tokenize import SyllableTokenizer
>>> from nltk import word_tokenize
>>> SSP = SyllableTokenizer()
>>> SSP.tokenize('justification')
['jus', 'ti', 'fi', 'ca', 'tion']
>>> text = "This is a foobar-like sentence."
>>> [SSP.tokenize(token) for token in word_tokenize(text)]
[['This'], ['is'], ['a'], ['foo', 'bar', '-', 'li', 'ke'], ['sen', 'ten', 'ce'], ['.']]
__init__(lang='en', sonority_hierarchy=False)[source]
Parameters
  • lang (str) – Language parameter, default is English, ‘en’

  • sonority_hierarchy (list(str)) – Sonority hierarchy according to the Sonority Sequencing Principle.

assign_values(token)[source]

Assigns each phoneme its value from the sonority hierarchy. Note: Sentence/text has to be tokenized first.

Parameters

token (str) – Single word or token

Returns

List of tuples, first element is character/phoneme and second is the soronity value.

Return type

list(tuple(str, int))

validate_syllables(syllable_list)[source]

Ensures each syllable has at least one vowel. If the following syllable doesn’t have vowel, add it to the current one.

Parameters

syllable_list (list(str)) – Single word or token broken up into syllables.

Returns

Single word or token broken up into syllables (with added syllables if necessary)

Return type

list(str)

tokenize(token)[source]

Apply the SSP to return a list of syllables. Note: Sentence/text has to be tokenized first.

Parameters

token (str) – Single word or token

Return syllable_list

Single word or token broken up into syllables.

Return type

list(str)