nltk.corpus.reader.bcp47 module

class nltk.corpus.reader.bcp47.BCP47CorpusReader[source]

Bases: CorpusReader

Parse BCP-47 composite language tags

Supports all the main subtags, and the ‘u-sd’ extension:

>>> from nltk.corpus import bcp47
>>> bcp47.name('oc-gascon-u-sd-fr64')
'Occitan (post 1500): Gascon: Pyrénées-Atlantiques'

Can load a conversion table to Wikidata Q-codes:

>>> bcp47.load_wiki_q()
>>> bcp47.wiki_q['en-GI-spanglis']
'Q79388'

__init__(root, fileids)[source]

Read the BCP-47 database

data_dict(records)[source]

Convert the BCP-47 language subtag registry to a dictionary
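
The kind of conversion described here can be sketched as follows. This is a minimal illustration, not NLTK's implementation: it assumes the registry's plain-text layout of "Key: value" lines grouped into records separated by `%%`, and the helper name `registry_to_dict`, the nested layout of the result, and the hand-written `SAMPLE` text (including its date) are all hypothetical.

```python
def registry_to_dict(text):
    """Group registry records by type, then by subtag (illustrative only)."""
    table = {}
    for record in text.split("%%"):
        fields = {}
        for line in record.strip().splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                # Keep every value: a record may repeat a field such as Description.
                fields.setdefault(key, []).append(value)
        # Records without both fields (e.g. the file-date header) are skipped.
        if "Type" in fields and "Subtag" in fields:
            table.setdefault(fields["Type"][0], {})[fields["Subtag"][0]] = fields
    return table


# Hand-written sample in the registry's record layout (not real registry data).
SAMPLE = """File-Date: 2024-01-01
%%
Type: language
Subtag: ca
Description: Catalan
Description: Valencian
%%
Type: script
Subtag: Latn
Description: Latin
"""

registry = registry_to_dict(SAMPLE)
print(registry["language"]["ca"]["Description"])  # prints ['Catalan', 'Valencian']
```

Keeping each field as a list preserves repeated fields, which is why a later step may need to pick a single value for display.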

lang2str(lg_record)[source]

Concatenate subtag values

load_wiki_q()[source]

Load conversion table to Wikidata Q-codes (only if needed)

morphology()[source]

name(tag)[source]

Convert a BCP-47 tag to a colon-separated string of subtag names

>>> from nltk.corpus import bcp47
>>> bcp47.name('ca-Latn-ES-valencia')
'Catalan: Latin: Spain: Valencian'

parse_tag(tag)[source]

Convert a BCP-47 tag to a dictionary of labelled subtags
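
To illustrate what "labelled subtags" means, here is a rough standalone sketch that labels the subtags of a simple tag by position and shape. It is not the `parse_tag` implementation and its return format may differ from NLTK's: the function `parse_simple_tag` and the `PATTERNS` table are hypothetical, and the sketch ignores extended language subtags, extensions such as 'u-sd', private-use subtags, and grandfathered tags.

```python
import re

# Simplified subtag shapes in their required order: language, script,
# region, variant. Everything else in the syntax is left out of this sketch.
PATTERNS = [
    ("language", re.compile(r"[a-z]{2,3}$", re.I)),
    ("script", re.compile(r"[a-z]{4}$", re.I)),
    ("region", re.compile(r"(?:[a-z]{2}|[0-9]{3})$", re.I)),
    ("variant", re.compile(r"(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})$", re.I)),
]


def parse_simple_tag(tag):
    """Label each subtag of a simple tag by its position and shape."""
    labelled = {}
    slot = 0
    for subtag in tag.split("-"):
        # Advance through the ordered slots until one accepts this subtag.
        while slot < len(PATTERNS) and not PATTERNS[slot][1].match(subtag):
            slot += 1
        if slot == len(PATTERNS):
            raise ValueError(f"unexpected subtag: {subtag!r}")
        label = PATTERNS[slot][0]
        if label == "variant":
            # A tag may carry several variants, so collect them in a list.
            labelled.setdefault(label, []).append(subtag)
        else:
            labelled[label] = subtag
            slot += 1
    return labelled


print(parse_simple_tag("ca-Latn-ES-valencia"))
```

Because the slots are tried in order, an omitted script or region is simply skipped: under this sketch 'oc-gascon' yields a language and a variant only.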

subdiv_dict(subdivs)[source]

Convert the CLDR subdivisions list to a dictionary

val2str(val)[source]

Return only first value
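
A minimal sketch of the "first value" idea, assuming a subtag's description may be stored either as a single string or as a list of alternatives. The helper `first_value` is hypothetical and only mirrors the one-line description above, not the actual method body.

```python
def first_value(val):
    """Return the first item of a list, or the value itself otherwise."""
    if isinstance(val, list):
        return val[0]
    return val


# Joining the chosen names with ": " gives a colon-separated string in the
# style of the name() examples.
names = [first_value(["Catalan", "Valencian"]), first_value("Latin")]
print(": ".join(names))  # prints Catalan: Latin
```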

wiki_dict(lines)[source]

Convert Wikidata list of Q-codes to a BCP-47 dictionary