nltk.corpus.reader.panlex_lite module

CorpusReader for PanLex Lite, a stripped-down version of PanLex distributed as an SQLite database. See the README.txt in the panlex_lite corpus directory for more information on PanLex Lite.

class nltk.corpus.reader.panlex_lite.Meaning

Bases: dict

Represents a single PanLex meaning. A meaning is a translation set derived from a single source.

__init__(mn, attr)

expressions()

Returns: the meaning’s expressions as a dictionary whose keys are language variety uniform identifiers and whose values are lists of expression texts.
Return type: dict

id()

Returns: the meaning’s id.
Return type: int

quality()

Returns: the quality of the meaning’s source (0 = worst, 9 = best).
Return type: int

source()

Returns: the meaning’s source id.
Return type: int

source_group()

Returns: the meaning’s source group id.
Return type: int

class nltk.corpus.reader.panlex_lite.PanLexLiteCorpusReader

Bases: CorpusReader

MEANING_Q = '\n SELECT dnx2.mn, dnx2.uq, dnx2.ap, dnx2.ui, ex2.tt, ex2.lv\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.tt = ? AND ex.lv = ?\n ORDER BY dnx2.uq DESC\n '

TRANSLATION_Q = '\n SELECT s.tt, sum(s.uq) AS trq FROM (\n SELECT ex2.tt, max(dnx.uq) AS uq\n FROM dnx\n JOIN ex ON (ex.ex = dnx.ex)\n JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?\n GROUP BY ex2.tt, dnx.ui\n ) s\n GROUP BY s.tt\n ORDER BY trq DESC, s.tt\n '
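
To see what TRANSLATION_Q computes, the query can be run against a toy in-memory SQLite database. This is only a sketch: the table layouts (`ex` with columns `ex`, `lv`, `tt`; `dnx` with columns `mn`, `ex`, `uq`, `ap`, `ui`) are inferred from the column names used in the query, and the rows and numeric ids are invented for illustration. The query itself takes numeric `lv` values rather than uniform identifiers.

```python
import sqlite3

# Same SQL as TRANSLATION_Q above, written as a multi-line string.
TRANSLATION_Q = """
    SELECT s.tt, sum(s.uq) AS trq FROM (
        SELECT ex2.tt, max(dnx.uq) AS uq
        FROM dnx
        JOIN ex ON (ex.ex = dnx.ex)
        JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)
        JOIN ex ex2 ON (ex2.ex = dnx2.ex)
        WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?
        GROUP BY ex2.tt, dnx.ui
    ) s
    GROUP BY s.tt
    ORDER BY trq DESC, s.tt
"""

conn = sqlite3.connect(":memory:")

# Assumed minimal schema, inferred from the columns the query references.
conn.executescript("""
    CREATE TABLE ex (ex INTEGER PRIMARY KEY, lv INTEGER, tt TEXT);
    CREATE TABLE dnx (mn INTEGER, ex INTEGER, uq INTEGER, ap INTEGER, ui INTEGER);
""")

# Invented data: lv 1 and lv 2 stand for two language varieties.
conn.executemany("INSERT INTO ex VALUES (?, ?, ?)", [
    (1, 1, "dog"),
    (2, 2, "chien"),
    (3, 2, "clebs"),
])
conn.executemany("INSERT INTO dnx VALUES (?, ?, ?, ?, ?)", [
    # (mn, ex, uq, ap, ui)
    (10, 1, 5, 100, 1000), (10, 2, 5, 100, 1000),  # source group 1000: dog / chien
    (11, 1, 7, 101, 1001), (11, 2, 7, 101, 1001),  # source group 1001: dog / chien
    (12, 1, 7, 101, 1001), (12, 3, 7, 101, 1001),  # source group 1001: dog / clebs
])

# Parameter order follows the placeholders: source lv, source text, target lv.
rows = conn.execute(TRANSLATION_Q, (1, "dog", 2)).fetchall()
print(rows)
```

The inner query keeps the best quality per translation text and source group, and the outer query sums those values, so a translation attested by several source groups ranks higher than one attested by a single group.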

__init__(root)

Parameters:

root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

language_varieties(lc=None)

Return a list of PanLex language varieties.

Parameters: lc – ISO 639 alpha-3 code. If specified, filters returned varieties by this code. If unspecified, all varieties are returned.
Returns: the specified language varieties as a list of tuples. The first element is the language variety’s seven-character uniform identifier, and the second element is its default name.
Return type: list(tuple)
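
The documented return shape and the effect of `lc` can be mimicked on a plain list. This is a sketch, not the reader's implementation: `filter_by_language_code` is a hypothetical helper, the sample tuples are placeholders, and it assumes a uniform identifier consists of the three-letter language code, a hyphen, and a three-digit variety code (for example `eng-000`).

```python
# Placeholder (uid, default name) tuples in the documented return shape.
varieties = [
    ("eng-000", "English"),
    ("eng-004", "placeholder name A"),
    ("fra-000", "placeholder name B"),
]


def filter_by_language_code(varieties, lc=None):
    """Hypothetical helper: keep varieties whose uid starts with code lc."""
    if lc is None:
        # No code given: return every variety, as the docs describe.
        return list(varieties)
    # Assumption: the part of the uid before the hyphen is the language code.
    return [(uid, name) for uid, name in varieties if uid.split("-")[0] == lc]


print(filter_by_language_code(varieties, "eng"))
print(len(filter_by_language_code(varieties)))
```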

meanings(expr_uid, expr_tt)

Return a list of meanings for an expression.

Parameters:

expr_uid – the expression’s language variety, as a seven-character uniform identifier.
expr_tt – the expression’s text.

Returns:

a list of Meaning objects.

Return type:

list(Meaning)
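
One way rows shaped like the MEANING_Q result (`mn`, `uq`, `ap`, `ui`, `tt`, `lv`) could be grouped into per-meaning dictionaries is sketched below. It is an illustration under assumptions, not the reader's code: `group_meaning_rows` and the `lv_to_uid` mapping are hypothetical, the dictionary keys simply reuse the column names, and the rows are invented.

```python
def group_meaning_rows(rows, lv_to_uid, expr_uid, expr_tt):
    """Hypothetical helper: group (mn, uq, ap, ui, tt, lv) rows by meaning id."""
    meanings = {}
    for mn, uq, ap, ui, tt, lv in rows:
        # The first row for a meaning creates its entry, seeded with the
        # queried expression itself (the query excludes it from the results).
        info = meanings.setdefault(
            mn,
            {"mn": mn, "uq": uq, "ap": ap, "ui": ui, "ex": {expr_uid: [expr_tt]}},
        )
        # Collect expression texts under their language variety identifier.
        info["ex"].setdefault(lv_to_uid[lv], []).append(tt)
    return list(meanings.values())


# Invented rows, already ordered by quality (uq) descending like the query.
rows = [
    (11, 7, 101, 1001, "chien", 2),
    (10, 5, 100, 1000, "chien", 2),
    (10, 5, 100, 1000, "Hund", 3),
]
lv_to_uid = {1: "eng-000", 2: "fra-000", 3: "deu-000"}  # assumed mapping

meanings = group_meaning_rows(rows, lv_to_uid, "eng-000", "dog")
for m in meanings:
    print(m["mn"], m["uq"], m["ex"])
```

Each resulting dictionary holds the same pieces of information that the Meaning accessors expose: an id, a quality, a source id, a source group id, and the expressions keyed by language variety.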

translations(from_uid, from_tt, to_uid)

Return a list of translations for an expression into a single language variety.

Parameters:

from_uid – the source expression’s language variety, as a seven-character uniform identifier.
from_tt – the source expression’s text.
to_uid – the target language variety, as a seven-character uniform identifier.

Returns:

a list of translation tuples. The first element is the expression text and the second element is the translation quality.

Return type:

list(tuple)
