nltk.corpus.reader.panlex_lite module

CorpusReader for PanLex Lite, a stripped-down version of PanLex distributed as an SQLite database. See the README.txt in the panlex_lite corpus directory for more information on PanLex Lite.

class nltk.corpus.reader.panlex_lite.PanLexLiteCorpusReader[source]

Bases: nltk.corpus.reader.api.CorpusReader

MEANING_Q = """
        SELECT dnx2.mn, dnx2.uq, dnx2.ap, dnx2.ui, ex2.tt, ex2.lv
        FROM dnx
        JOIN ex ON (ex.ex = dnx.ex)
        JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)
        JOIN ex ex2 ON (ex2.ex = dnx2.ex)
        WHERE dnx.ex != dnx2.ex AND ex.tt = ? AND ex.lv = ?
        ORDER BY dnx2.uq DESC
    """
TRANSLATION_Q = """
        SELECT s.tt, sum(s.uq) AS trq FROM (
            SELECT ex2.tt, max(dnx.uq) AS uq
            FROM dnx
            JOIN ex ON (ex.ex = dnx.ex)
            JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)
            JOIN ex ex2 ON (ex2.ex = dnx2.ex)
            WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?
            GROUP BY ex2.tt, dnx.ui
        ) s
        GROUP BY s.tt
        ORDER BY trq DESC, s.tt
    """
__init__(root)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – Not part of the __init__(root) signature shown above; this entry and the ones that follow come from the base CorpusReader documentation. A list of the files that make up this corpus. This list can be specified either explicitly, as a list of strings, or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default Unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-Unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-Unicode byte strings.

    • None: the file contents of all files will be processed using non-Unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

language_varieties(lc=None)[source]

Return a list of PanLex language varieties.

Parameters

lc – ISO 639 alpha-3 code. If specified, filters returned varieties by this code. If unspecified, all varieties are returned.

Returns

the specified language varieties as a list of tuples. The first element of each tuple is the language variety’s seven-character uniform identifier, and the second element is its default name.

Return type

list(tuple)
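The (uniform identifier, default name) tuples described above convert naturally into a lookup table. The following is a minimal sketch, not part of NLTK: StubReader is a hypothetical stand-in for a PanLexLiteCorpusReader so that the example runs without the corpus installed, its rows are illustrative sample data, and its prefix-based filtering is only an assumption about how lc relates to the identifiers.

```python
class StubReader:
    """Hypothetical stand-in for PanLexLiteCorpusReader (sample data only)."""

    _rows = [
        ("eng-000", "English"),
        ("fra-000", "français"),
        ("deu-000", "Deutsch"),
    ]

    def language_varieties(self, lc=None):
        # Assumption: a variety matches lc when its identifier starts with "<lc>-".
        if lc is None:
            return list(self._rows)
        return [row for row in self._rows if row[0].startswith(lc + "-")]


def variety_names(reader, lc=None):
    """Map each uniform identifier to its default name."""
    return {uid: name for uid, name in reader.language_varieties(lc)}


all_names = variety_names(StubReader())
eng_names = variety_names(StubReader(), "eng")
print(eng_names)
```

With a real reader, the same helper would be called with the reader object in place of StubReader().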

meanings(expr_uid, expr_tt)[source]

Return a list of meanings for an expression.

Parameters
  • expr_uid – the expression’s language variety, as a seven-character uniform identifier.

  • expr_tt – the expression’s text.

Returns

a list of Meaning objects.

Return type

list(Meaning)
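To see what kind of rows MEANING_Q produces, the query can be run against a toy in-memory SQLite database. This is a sketch under stated assumptions: the table and column names are taken from the query text, but the column types, the integer language-variety ids (10, 20, 30), and every row are invented for illustration and do not reflect the real PanLex Lite data. The grouping step at the end is one plausible way to collect rows per meaning, not necessarily what the reader does internally.

```python
import sqlite3

MEANING_Q = """
    SELECT dnx2.mn, dnx2.uq, dnx2.ap, dnx2.ui, ex2.tt, ex2.lv
    FROM dnx
    JOIN ex ON (ex.ex = dnx.ex)
    JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)
    JOIN ex ex2 ON (ex2.ex = dnx2.ex)
    WHERE dnx.ex != dnx2.ex AND ex.tt = ? AND ex.lv = ?
    ORDER BY dnx2.uq DESC
"""

conn = sqlite3.connect(":memory:")
# Assumed toy schema: only the columns the query refers to.
conn.executescript("""
    CREATE TABLE ex  (ex INTEGER PRIMARY KEY, lv INTEGER, tt TEXT);
    CREATE TABLE dnx (mn INTEGER, ex INTEGER, uq INTEGER, ap INTEGER, ui INTEGER);
""")
# Expressions: (expression id, language variety id, text).
conn.executemany("INSERT INTO ex VALUES (?, ?, ?)", [
    (1, 10, "dog"), (2, 20, "chien"), (3, 30, "Hund"), (4, 20, "clebs"),
])
# Denotations: (meaning id, expression id, quality, source id, source group id).
conn.executemany("INSERT INTO dnx VALUES (?, ?, ?, ?, ?)", [
    (100, 1, 7, 1, 1), (100, 2, 7, 1, 1), (100, 3, 7, 1, 1),
    (200, 1, 5, 2, 2), (200, 4, 5, 2, 2),
])

# Group the other expressions of each meaning by language variety id.
grouped = {}
for mn, uq, ap, ui, tt, lv in conn.execute(MEANING_Q, ("dog", 10)):
    grouped.setdefault(mn, {}).setdefault(lv, []).append(tt)

print(grouped)
conn.close()
```

Each result row pairs a meaning that contains the queried expression with one of the other expressions in that meaning, which is why the queried text itself does not appear in the output.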

translations(from_uid, from_tt, to_uid)[source]

Return a list of translations for an expression into a single language variety.

Parameters
  • from_uid – the source expression’s language variety, as a seven-character uniform identifier.

  • from_tt – the source expression’s text.

  • to_uid – the target language variety, as a seven-character uniform identifier.

Returns

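a list of translation tuples. The first element of each tuple is the expression text and the second element is the translation quality. Following the ORDER BY clause of TRANSLATION_Q, tuples are sorted by descending quality and then by expression text.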

Return type

list(tuple)
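The translation quality computed by TRANSLATION_Q can be traced on a toy in-memory SQLite database: within each source group (ui) the highest quality is kept, and those per-group maxima are summed for each target text. This is a sketch under stated assumptions: the table and column names come from the query text, while the column types, the integer language-variety ids (10 and 20), and all rows are invented for illustration.

```python
import sqlite3

TRANSLATION_Q = """
    SELECT s.tt, sum(s.uq) AS trq FROM (
        SELECT ex2.tt, max(dnx.uq) AS uq
        FROM dnx
        JOIN ex ON (ex.ex = dnx.ex)
        JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)
        JOIN ex ex2 ON (ex2.ex = dnx2.ex)
        WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?
        GROUP BY ex2.tt, dnx.ui
    ) s
    GROUP BY s.tt
    ORDER BY trq DESC, s.tt
"""

conn = sqlite3.connect(":memory:")
# Assumed toy schema: only the columns the query refers to.
conn.executescript("""
    CREATE TABLE ex  (ex INTEGER PRIMARY KEY, lv INTEGER, tt TEXT);
    CREATE TABLE dnx (mn INTEGER, ex INTEGER, uq INTEGER, ap INTEGER, ui INTEGER);
""")
# Expressions: (expression id, language variety id, text).
conn.executemany("INSERT INTO ex VALUES (?, ?, ?)", [
    (1, 10, "dog"), (2, 20, "chien"), (3, 20, "clebs"),
])
# Denotations: (meaning id, expression id, quality, source id, source group id).
conn.executemany("INSERT INTO dnx VALUES (?, ?, ?, ?, ?)", [
    (100, 1, 7, 1, 1), (100, 2, 7, 1, 1),                     # group 1, quality 7
    (101, 1, 5, 2, 1), (101, 2, 5, 2, 1),                     # group 1, quality 5
    (200, 1, 4, 3, 2), (200, 2, 4, 3, 2), (200, 3, 4, 3, 2),  # group 2, quality 4
])

# Placeholder order follows the WHERE clause: source lv, source text, target lv.
rows = conn.execute(TRANSLATION_Q, (10, "dog", 20)).fetchall()
print(rows)  # [('chien', 11), ('clebs', 4)]
conn.close()
```

Here "chien" scores 11 because group 1 contributes max(7, 5) = 7 and group 2 contributes 4, while "clebs" is attested only in group 2 and scores 4.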

class nltk.corpus.reader.panlex_lite.Meaning[source]

Bases: dict

Represents a single PanLex meaning. A meaning is a translation set derived from a single source.

__init__(mn, attr)[source]
id()[source]
Returns

the meaning’s id.

Return type

int

quality()[source]
Returns

the quality of the meaning’s source (0=worst, 9=best).

Return type

int

source()[source]
Returns

the meaning’s source id.

Return type

int

source_group()[source]
Returns

the meaning’s source group id.

Return type

int

expressions()[source]
Returns

the meaning’s expressions as a dictionary whose keys are language variety uniform identifiers and whose values are lists of expression texts.

Return type

dict
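The accessor methods documented above can be combined to rank expressions by the quality of the meanings they appear in. The sketch below is not NLTK code: ToyMeaning is a hypothetical dict subclass that only mimics the id(), quality(), and expressions() accessors, its key names are invented, and the sample meanings are illustrative.

```python
class ToyMeaning(dict):
    """Hypothetical stand-in mimicking the documented Meaning accessors."""

    def id(self):
        return self["id"]

    def quality(self):
        return self["quality"]

    def expressions(self):
        return self["expressions"]


def expressions_by_quality(meanings, uid):
    """Collect expression texts for one language variety, best meanings first."""
    seen, ordered = set(), []
    for meaning in sorted(meanings, key=lambda m: m.quality(), reverse=True):
        for text in meaning.expressions().get(uid, []):
            if text not in seen:  # keep only the first (best-quality) occurrence
                seen.add(text)
                ordered.append(text)
    return ordered


meanings = [
    ToyMeaning(id=200, quality=5,
               expressions={"eng-000": ["dog"], "fra-000": ["clebs", "chien"]}),
    ToyMeaning(id=100, quality=7,
               expressions={"eng-000": ["dog"], "fra-000": ["chien"],
                            "deu-000": ["Hund"]}),
]

french = expressions_by_quality(meanings, "fra-000")
print(french)  # ['chien', 'clebs']
```

Because the helper relies only on quality() and expressions(), it should also accept the list returned by meanings(), assuming those objects behave as documented.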