nltk.corpus.reader.crubadan module

An NLTK interface for the n-gram statistics gathered from the corpora for each language using An Crubadan.

There are multiple potential applications for the data, but this reader was created for use in language identification.
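As a rough illustration of that use case, the sketch below compares character trigram profiles with an out-of-place rank distance, one common approach to n-gram language identification. It uses only the standard library; `trigrams`, `out_of_place`, `guess_language`, and the two miniature profiles are hypothetical names and made-up data, not part of NLTK or of the An Crubadan data set.

```python
from collections import Counter


def trigrams(text):
    """Count overlapping character trigrams in text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def out_of_place(sample, profile):
    """Sum of rank differences between two trigram frequency profiles.

    Trigrams missing from the profile receive the maximum penalty.
    """
    sample_ranks = {g: r for r, (g, _) in enumerate(sample.most_common())}
    profile_ranks = {g: r for r, (g, _) in enumerate(profile.most_common())}
    max_penalty = len(profile_ranks)
    return sum(
        abs(rank - profile_ranks[g]) if g in profile_ranks else max_penalty
        for g, rank in sample_ranks.items()
    )


def guess_language(text, profiles):
    """Return the key of the profile closest to the text."""
    sample = trigrams(text)
    return min(profiles, key=lambda lang: out_of_place(sample, profiles[lang]))


# Made-up miniature profiles keyed by ISO 639-3 code (illustration only).
profiles = {
    "eng": trigrams("the quick brown fox jumps over the lazy dog and the cat"),
    "deu": trigrams("der schnelle braune fuchs springt weit und der hund schaut zu"),
}
print(guess_language("the dog and the fox", profiles))
```

In practice the per-language profiles would come from the reader rather than from toy sentences, since a FreqDist is a subclass of `collections.Counter` and can be ranked the same way.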

For details about An Crubadan, this data, and its potential uses, see: http://borel.slu.edu/crubadan/index.html

class nltk.corpus.reader.crubadan.CrubadanCorpusReader

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader used to access the per-language An Crubadan n-gram files.

__init__(root, fileids, encoding='utf8', tagset=None)
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
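The four accepted forms of encoding can be summarised in a short sketch. `resolve_encoding` is a hypothetical helper written for illustration, not NLTK's implementation, and the file names are made up; it also assumes that list patterns are tested with `re.match`, which anchors at the start of the identifier.

```python
import re


def resolve_encoding(encoding, file_id):
    """Pick the encoding for file_id following the rules listed above.

    Returns None when the file should be read as raw byte strings.
    Hypothetical helper for illustration only.
    """
    # A string applies to every file; None means bytes for every file.
    if encoding is None or isinstance(encoding, str):
        return encoding
    # A dictionary maps file identifiers to encoding names.
    if isinstance(encoding, dict):
        return encoding.get(file_id)
    # Otherwise treat it as a list of (regexp, encoding) tuples;
    # the first pattern that matches wins.
    for pattern, name in encoding:
        if re.match(pattern, file_id):
            return name
    return None


print(resolve_encoding("utf8", "en-3grams.txt"))
print(resolve_encoding({"table.txt": "latin-1"}, "en-3grams.txt"))
print(resolve_encoding([(r".*-3grams\.txt", "utf8"), (r".*", "ascii")], "table.txt"))
```

The second call returns None because the identifier is absent from the dictionary, which corresponds to the byte-string fallback described above.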

lang_freq(lang)

Return the n-gram FreqDist for a specific language, given its ISO 639-3 language code.

langs()

Return a list of supported languages as ISO 639-3 codes.
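A minimal sketch of how langs() and lang_freq() might be combined. `top_ngrams` and `StubReader` are hypothetical names, and the counts are made up so that the example runs without the corpus installed; the only assumption about the reader is that the returned FreqDist supports `most_common()`, as `collections.Counter` does.

```python
from collections import Counter


def top_ngrams(reader, lang, n=3):
    """Return the n most frequent n-grams for an ISO 639-3 code.

    Returns an empty list when the reader does not list the language.
    """
    if lang not in reader.langs():
        return []
    return reader.lang_freq(lang).most_common(n)


class StubReader:
    """Stand-in with made-up counts so the sketch runs without the corpus."""

    _data = {"eng": Counter({"the": 50, "and": 30, "ing": 20, "ion": 10})}

    def langs(self):
        return list(self._data)

    def lang_freq(self, lang):
        return self._data[lang]


print(top_ngrams(StubReader(), "eng", n=2))  # [('the', 50), ('and', 30)]
print(top_ngrams(StubReader(), "xxx"))       # []
```

Checking langs() first keeps the handling of unsupported codes explicit instead of relying on whatever lang_freq() does with them. With the crubadan data installed, a real CrubadanCorpusReader instance could be passed in place of the stub.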

iso_to_crubadan(lang)

Return the internal Crubadan code for a given ISO 639-3 code.

crubadan_to_iso(lang)

Return the ISO 639-3 code for a given internal Crubadan code.
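The two conversion methods can be paired to normalise a code of either kind. The sketch below is illustrative only: `to_iso` and `StubReader` are hypothetical, the two-row mapping is made up, and it assumes that both lookups return None for codes they do not recognise.

```python
def to_iso(reader, code):
    """Return an ISO 639-3 code for either kind of input code, or None.

    Assumes both lookups return None for codes they do not know.
    """
    if reader.iso_to_crubadan(code) is not None:
        return code  # already a known ISO 639-3 code
    return reader.crubadan_to_iso(code)


class StubReader:
    """Stand-in with a made-up mapping of (Crubadan code, ISO 639-3 code)."""

    _rows = [("en", "eng"), ("ga", "gle")]

    def iso_to_crubadan(self, lang):
        return next((c for c, iso in self._rows if iso == lang), None)

    def crubadan_to_iso(self, lang):
        return next((iso for c, iso in self._rows if c == lang), None)


reader = StubReader()
print(to_iso(reader, "gle"))  # gle
print(to_iso(reader, "ga"))   # gle
print(to_iso(reader, "zz"))   # None
```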