nltk.corpus.reader.udhr module

Corpus reader for the Universal Declaration of Human Rights (UDHR) corpus. Most of its work is assigning the correct character encoding to each file.

class nltk.corpus.reader.udhr.UdhrCorpusReader

Bases: PlaintextCorpusReader

ENCODINGS = [('.*-Latin1$', 'latin-1'), ('.*-Hebrew$', 'hebrew'), ('.*-Arabic$', 'cp1256'), ('Czech_Cesky-UTF8', 'cp1250'), ('Polish-Latin2', 'cp1250'), ('Polish_Polski-Latin2', 'cp1250'), ('.*-Cyrillic$', 'cyrillic'), ('.*-SJIS$', 'SJIS'), ('.*-GB2312$', 'GB2312'), ('.*-Latin2$', 'ISO-8859-2'), ('.*-Greek$', 'greek'), ('.*-UTF8$', 'utf-8'), ('Hungarian_Magyar-Unicode', 'utf-16-le'), ('Amahuaca', 'latin1'), ('Turkish_Turkce-Turkish', 'latin5'), ('Lithuanian_Lietuviskai-Baltic', 'latin4'), ('Japanese_Nihongo-EUC', 'EUC-JP'), ('Japanese_Nihongo-JIS', 'iso2022_jp'), ('Chinese_Mandarin-HZ', 'hz'), ('Abkhaz\\-Cyrillic\\+Abkh', 'cp1251')]
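Each ENCODINGS entry pairs a regular expression with a codec name. The sketch below shows one plausible way such a table can be resolved: first match wins, using `re.match`. It assumes that lookup rule rather than quoting NLTK's implementation, and `resolve_encoding` is a hypothetical helper, not part of the NLTK API. The table is a small subset of the entries above and the fileids are illustrative.

```python
import re

# A subset of the (pattern, encoding) pairs listed in ENCODINGS, in the same order.
ENCODINGS = [
    (".*-Latin1$", "latin-1"),
    ("Czech_Cesky-UTF8", "cp1250"),
    ("Polish-Latin2", "cp1250"),
    (".*-Latin2$", "ISO-8859-2"),
    (".*-UTF8$", "utf-8"),
]


def resolve_encoding(fileid, table=ENCODINGS):
    """Return the encoding of the first pattern that matches fileid, else None.

    Hypothetical helper: assumes a first-match-wins rule over the table.
    """
    for pattern, encoding in table:
        if re.match(pattern, fileid):
            return encoding
    return None


for fileid in ["English-Latin1", "Czech_Cesky-UTF8", "Slovak-Latin2", "Unlisted"]:
    print(fileid, "->", resolve_encoding(fileid))
```

Under this rule the order of the table matters: 'Czech_Cesky-UTF8' is listed before '.*-UTF8$', so that file resolves to 'cp1250' even though its name ends in UTF8.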
SKIP = {'Amharic-Afenegus6..60375', 'Armenian-DallakHelv', 'Azeri_Azerbaijani_Cyrillic-Az.Times.Cyr.Normal0117', 'Azeri_Azerbaijani_Latin-Az.Times.Lat0117', 'Bhojpuri-Agra', 'Burmese_Myanmar-UTF8', 'Burmese_Myanmar-WinResearcher', 'Chinese_Mandarin-HZ', 'Chinese_Mandarin-UTF8', 'Czech-Latin2-err', 'Esperanto-T61', 'Gujarati-UTF8', 'Hungarian_Magyar-Unicode', 'Japanese_Nihongo-JIS', 'Lao-UTF8', 'Magahi-Agra', 'Magahi-UTF8', 'Marathi-UTF8', 'Navaho_Dine-Navajo-Navaho-font', 'Russian_Russky-UTF8~', 'Tamil-UTF8', 'Tigrinya_Tigrigna-VG2Main', 'Vietnamese-TCVN', 'Vietnamese-VIQR', 'Vietnamese-VPS'}
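SKIP names files that the reader leaves out of the corpus. As a rough illustration, the sketch below filters a list of candidate fileids against a subset of SKIP. The `select_fileids` helper and the rule that also drops README and dot-files are assumptions made for this example, not documented behaviour.

```python
import re

# A subset of the SKIP set above. These three names also appear in ENCODINGS.
SKIP = {"Chinese_Mandarin-HZ", "Hungarian_Magyar-Unicode", "Japanese_Nihongo-JIS"}

# Assumption for this sketch: README and hidden files are not corpus documents.
NON_DOCUMENT = re.compile(r"README|\.")


def select_fileids(candidates, skip=SKIP):
    """Keep candidates that are neither non-documents nor listed in skip."""
    return [
        fileid
        for fileid in candidates
        if not NON_DOCUMENT.match(fileid) and fileid not in skip
    ]


candidates = [
    "README",
    ".hidden",
    "English-Latin1",
    "Chinese_Mandarin-HZ",
    "Japanese_Nihongo-EUC",
]
print(select_fileids(candidates))
```

A file can have an ENCODINGS entry and still be skipped: 'Chinese_Mandarin-HZ', 'Hungarian_Magyar-Unicode', and 'Japanese_Nihongo-JIS' appear in both lists.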
__init__(root='udhr')

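Construct a new plaintext corpus reader for a set of documents located at the given root directory. This description and the parameter list below are inherited from PlaintextCorpusReader; the UdhrCorpusReader signature shown above takes only root. Example usage (for the base class):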

>>> from nltk.corpus.reader import PlaintextCorpusReader
>>> root = '/usr/local/share/nltk_data/corpora/webtext/'
>>> reader = PlaintextCorpusReader(root, r'.*\.txt')

Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

  • word_tokenizer – Tokenizer for breaking sentences or paragraphs into words.

  • sent_tokenizer – Tokenizer for breaking paragraphs into sentences.

  • para_block_reader – The block reader used to divide the corpus into paragraph blocks.