nltk.corpus.reader.wordlist module

class nltk.corpus.reader.wordlist.MWAPPDBCorpusReader[source]

Bases: WordListCorpusReader

This class reads the list of word pairs from the subset of lexical pairs of the Paraphrase Database (PPDB) XXXL used in the Monolingual Word Alignment (MWA) algorithm described in Sultan et al. (2014a, 2014b, 2015).

The original source of the full PPDB corpus can be found at https://www.cis.upenn.edu/~ccb/ppdb/

Returns:

a list of tuples of similar lexical terms.

entries(fileids='ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs')[source]
Returns:

a tuple of synonym word pairs.

mwa_ppdb_xxxl_file = 'ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs'
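
As a rough illustration of the shape of the data described above, the sketch below turns a few lines of text into a list of word-pair tuples. It assumes one tab-separated pair per line, which this page does not state; the sample text and the helper name parse_pairs are hypothetical and not part of NLTK.

```python
def parse_pairs(raw_text):
    """Split each non-blank line on tabs and return a list of tuples."""
    pairs = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        pairs.append(tuple(line.split("\t")))
    return pairs


# Hypothetical sample in the assumed tab-separated layout.
sample = "big\tlarge\n\nquick\tfast\n"
pairs = parse_pairs(sample)
print(pairs)
```

With the real corpus installed, the equivalent data would come from the reader's entries() method rather than from a hand-written parser.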
class nltk.corpus.reader.wordlist.NonbreakingPrefixesCorpusReader[source]

Bases: WordListCorpusReader

This class reads the nonbreaking-prefix text files from the Moses Machine Translation toolkit. These lists are used in the Python port of the Moses word tokenizer.

available_langs = {'ca': 'ca', 'catalan': 'ca', 'cs': 'cs', 'czech': 'cs', 'de': 'de', 'dutch': 'nl', 'el': 'el', 'en': 'en', 'english': 'en', 'es': 'es', 'fi': 'fi', 'finnish': 'fi', 'fr': 'fr', 'french': 'fr', 'german': 'de', 'greek': 'el', 'hu': 'hu', 'hungarian': 'hu', 'icelandic': 'is', 'is': 'is', 'it': 'it', 'italian': 'it', 'latvian': 'lv', 'lv': 'lv', 'nl': 'nl', 'pl': 'pl', 'polish': 'pl', 'portuguese': 'pt', 'pt': 'pt', 'ro': 'ro', 'romanian': 'ro', 'ru': 'ru', 'russian': 'ru', 'sk': 'sk', 'sl': 'sl', 'slovak': 'sk', 'slovenian': 'sl', 'spanish': 'es', 'sv': 'sv', 'swedish': 'sv', 'ta': 'ta', 'tamil': 'ta'}
words(lang=None, fileids=None, ignore_lines_startswith='#')[source]

This method returns a list of nonbreaking prefixes for the specified language(s).

>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')[:10] == [u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J']
True
>>> nbp.words('ta')[:5] == [u'அ', u'ஆ', u'இ', u'ஈ', u'உ']
True
Returns:

a list of words for the specified language(s).
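
The available_langs mapping accepts both language names and two-letter codes, and ignore_lines_startswith defaults to '#'. The sketch below shows one plausible reading of those two pieces: resolving an alias to a code, and dropping comment lines from a prefix file. It uses only a subset of the mapping; the helper names resolve_lang and filter_prefixes and the sample text are hypothetical, and this is not NLTK's implementation.

```python
# A subset of the available_langs mapping from this page.
AVAILABLE_LANGS = {
    "en": "en", "english": "en",
    "de": "de", "german": "de",
    "ta": "ta", "tamil": "ta",
}


def resolve_lang(lang):
    """Map a language name or code to its two-letter code (None if unknown)."""
    return AVAILABLE_LANGS.get(lang)


def filter_prefixes(raw_text, ignore_lines_startswith="#"):
    """Keep non-blank lines that do not start with the ignored prefix."""
    return [
        line
        for line in raw_text.splitlines()
        if line.strip() and not line.startswith(ignore_lines_startswith)
    ]


# Hypothetical file content: a comment, three prefixes, and a blank line.
sample = "# comment\nA\nB\n\nMr\n"
prefixes = filter_prefixes(sample)
print(resolve_lang("german"), prefixes)
```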

class nltk.corpus.reader.wordlist.SwadeshCorpusReader[source]

Bases: WordListCorpusReader

entries(fileids=None)[source]
Returns:

a tuple of words for the specified fileids.
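
One way to picture the result of entries() is as parallel word lists combined position by position, so that each entry holds one word per fileid. The sketch below assumes every list has one word per line in the same concept order; the word lists and the helper name align_wordlists are hypothetical illustrations, not the reader's actual code.

```python
def align_wordlists(*wordlists):
    """Zip parallel word lists into one tuple per position."""
    return list(zip(*wordlists))


# Hypothetical parallel lists for two languages.
english = ["I", "you", "we"]
french = ["je", "tu", "nous"]

entries = align_wordlists(english, french)
print(entries)
```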

class nltk.corpus.reader.wordlist.UnicharsCorpusReader[source]

Bases: WordListCorpusReader

This class is used to read lists of characters from the Perl Unicode Properties (see https://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from https://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm

available_categories = ['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'IsUpper', 'Line_Separator', 'Number', 'Open_Punctuation', 'Punctuation', 'Separator', 'Symbol']
chars(category=None, fileids=None)[source]

This method returns a list of characters from the Perl Unicode Properties. They are useful when porting Perl tokenizers to Python.

>>> from nltk.corpus import perluniprops as pup
>>> pup.chars('Open_Punctuation')[:5] == [u'(', u'[', u'{', u'༺', u'༼']
True
>>> pup.chars('Currency_Symbol')[:5] == [u'$', u'¢', u'£', u'¤', u'¥']
True
>>> pup.available_categories
['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'IsUpper', 'Line_Separator', 'Number', 'Open_Punctuation', 'Punctuation', 'Separator', 'Symbol']
Returns:

a list of characters for the specified Unicode character category.
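
To illustrate why such character lists help when porting a Perl tokenizer, the sketch below builds a regular-expression character class from a short list of opening punctuation marks and pads each match with spaces. The five characters are taken from the doctest on this page rather than loaded from the corpus, and the helper name pad_open_punctuation is a hypothetical example, not an NLTK function.

```python
import re

# The first five 'Open_Punctuation' characters shown in the doctest.
open_punct = ["(", "[", "{", "༺", "༼"]

# Escape each character so that '[' and friends are safe inside a class.
open_punct_class = "[" + "".join(re.escape(ch) for ch in open_punct) + "]"
pad_open = re.compile("(" + open_punct_class + ")")


def pad_open_punctuation(text):
    """Surround every opening punctuation mark with single spaces."""
    return pad_open.sub(r" \1 ", text)


padded = pad_open_punctuation("f(x)[0]")
print(padded)
```

In a real port, the list would typically come from chars('Open_Punctuation') so that the full Unicode category is covered.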

class nltk.corpus.reader.wordlist.WordListCorpusReader[source]

Bases: CorpusReader

List of words, one per line. Blank lines are ignored.

words(fileids=None, ignore_lines_startswith='\n')[source]
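
The class description says the files hold one word per line and that blank lines are ignored. A minimal sketch of that reading logic follows, under the assumption that lines beginning with ignore_lines_startswith are also dropped; the helper name read_wordlist and the sample text are hypothetical, and this is not presented as NLTK's own implementation.

```python
def read_wordlist(raw_text, ignore_lines_startswith="\n"):
    """Return one word per non-blank line, skipping lines with the given prefix."""
    words = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue  # blank lines are ignored
        if line.startswith(ignore_lines_startswith):
            continue  # e.g. comment lines when the prefix is '#'
        words.append(line)
    return words


# Hypothetical word list with a blank line and a comment line.
sample = "apple\n\nbanana\n#comment\ncherry\n"

all_lines = read_wordlist(sample)
no_comments = read_wordlist(sample, ignore_lines_startswith="#")
print(all_lines, no_comments)
```

With the default prefix of '\n', no line is filtered in this sketch once the text has been split into lines, so only blank lines are removed.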