nltk.corpus.reader.panlex_swadesh module

class nltk.corpus.reader.panlex_swadesh.PanlexLanguage

Bases: tuple

PanlexLanguage(panlex_uid, iso639, iso639_type, script, name, langvar_uid)

static __new__(_cls, panlex_uid, iso639, iso639_type, script, name, langvar_uid)

Create new instance of PanlexLanguage(panlex_uid, iso639, iso639_type, script, name, langvar_uid)

iso639

Alias for field number 1

iso639_type

Alias for field number 2

langvar_uid

Alias for field number 5

name

Alias for field number 4

panlex_uid

Alias for field number 0

script

Alias for field number 3
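
The field numbers above are positions in the underlying tuple, so each field can be read by name or by index. A minimal sketch, using a stand-in built with collections.namedtuple rather than the NLTK class itself; the field values are invented for illustration and are not taken from the corpus:

```python
from collections import namedtuple

# Stand-in with the same field order as the documented class (assumption:
# this mirrors the layout only, it is not the NLTK implementation).
PanlexLanguage = namedtuple(
    "PanlexLanguage",
    ["panlex_uid", "iso639", "iso639_type", "script", "name", "langvar_uid"],
)

# Illustrative values only.
lang = PanlexLanguage("eng-000", "eng", "I", "Latn", "English", "187")

# Named access and positional access refer to the same slots.
print(lang.iso639 == lang[1])       # field number 1
print(lang.langvar_uid == lang[5])  # field number 5
```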

class nltk.corpus.reader.panlex_swadesh.PanlexSwadeshCorpusReader

Bases: WordListCorpusReader

A corpus reader for the PanLex Swadesh list, which is described in:

David Kamholz, Jonathan Pool, and Susan M. Colowick (2014). PanLex: Building a Resource for Panlingual Lexical Translation. In LREC. http://www.lrec-conf.org/proceedings/lrec2014/pdf/1029_Paper.pdf

License: CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/legalcode

__init__(*args, **kwargs)
Parameters:
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
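
The rules for the encoding parameter can be sketched as a small lookup function. This is an illustration of the documented behaviour, not NLTK's own code: resolve_encoding is a hypothetical helper, and matching the regexp with re.match (anchored at the start of the file identifier) is an assumption. A return value of None stands for "process as byte strings".

```python
import re


def resolve_encoding(encoding, file_id):
    """Hypothetical helper: pick the encoding name for file_id, or None."""
    # None, or a single string that applies to every file.
    if encoding is None or isinstance(encoding, str):
        return encoding
    # Dictionary: direct lookup, None when file_id is absent.
    if isinstance(encoding, dict):
        return encoding.get(file_id)
    # List of (regexp, encoding) tuples: the first matching tuple wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):  # assumption: match from the start
            return name
    return None


rules = [(r"swadesh110/.*\.txt", "utf8"), (r".*", "latin-1")]
print(resolve_encoding(rules, "swadesh110/eng-000.txt"))
```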

entries(fileids=None)
Returns:

a list of tuples, one per position in the word lists; each tuple holds the word at that position from each of the specified fileids.
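
One way to picture the result, assuming the reader aligns the per-file word lists position by position (as the parent Swadesh-style readers do). The word lists below are made up for illustration and the variable names are hypothetical:

```python
# Two invented word lists, one per file, in the same concept order.
wordlists = [
    ["I", "you", "we"],     # e.g. words from an English file
    ["ich", "du", "wir"],   # e.g. words from a German file
]

# Transpose: one tuple per concept, one element per file.
aligned = list(zip(*wordlists))

print(aligned[0])  # ('I', 'ich')
```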

get_languages()

get_macrolanguages()

language_codes()

license()

Return the contents of the corpus LICENSE file, if it exists.

words_by_iso639(iso63_code)
Returns:

a list of lists of strings

words_by_lang(lang_code)
Returns:

a list of lists of strings
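
The nested return shape can be illustrated with a small sketch. It assumes that each line of a language file represents one concept and that alternative words for that concept are separated by tabs; split_concepts and the sample lines are hypothetical and do not come from the corpus:

```python
def split_concepts(lines):
    """Hypothetical helper: turn tab-separated lines into lists of words."""
    return [line.split("\t") for line in lines]


# Invented sample lines: one concept per line, synonyms separated by tabs.
sample_lines = ["I", "big\tlarge", "small\tlittle"]
rows = split_concepts(sample_lines)

print(rows[1])  # ['big', 'large']
```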