Sample usage for crubadan
Crubadan Corpus Reader
Crubadan is an NLTK corpus reader for ngram files provided by the Crubadan project. It supports several languages.
>>> from nltk.corpus import crubadan
>>> crubadan.langs()
['abk', 'abn',..., 'zpa', 'zul']
Language code mapping and helper methods
The web crawler that generates the 3-gram frequencies works at the level of “writing systems” rather than languages. Writing systems are assigned internal two- or three-letter codes that must be mapped to the standard ISO 639-3 codes. For more information, refer to the README in the nltk_data/crubadan folder after installing the corpus.
To translate an ISO 639-3 code to a Crubadan code:
>>> crubadan.iso_to_crubadan('eng')
'en'
>>> crubadan.iso_to_crubadan('fra')
'fr'
>>> crubadan.iso_to_crubadan('aaa')
In the other direction, to translate a Crubadan code to its ISO 639-3 code:
>>> crubadan.crubadan_to_iso('en')
'eng'
>>> crubadan.crubadan_to_iso('fr')
'fra'
>>> crubadan.crubadan_to_iso('aa')
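The lookups for 'aaa' and 'aa' above print nothing, which in a doctest means the call returned None. The following is a minimal sketch of that two-way lookup, not the reader's actual implementation. The two-entry table and the `*_sketch` function names are hypothetical and exist only for illustration; the real mapping ships with the corpus data.

```python
# Hypothetical excerpt of the mapping table (assumption: the real table
# is loaded from the corpus data and is far larger).
CRUBADAN_TO_ISO = {"en": "eng", "fr": "fra"}

# Invert the table once so lookups work in both directions.
ISO_TO_CRUBADAN = {iso: code for code, iso in CRUBADAN_TO_ISO.items()}


def iso_to_crubadan_sketch(iso_code):
    """Return the Crubadan code for an ISO 639-3 code, or None if unknown."""
    return ISO_TO_CRUBADAN.get(iso_code)


def crubadan_to_iso_sketch(crubadan_code):
    """Return the ISO 639-3 code for a Crubadan code, or None if unknown."""
    return CRUBADAN_TO_ISO.get(crubadan_code)
```

Because an unknown code yields None rather than an exception, callers should check the result before passing it on.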
Accessing ngram frequencies
On initialization the reader will create a dictionary of every language supported by the Crubadan project, mapping each ISO 639-3 language code to its corresponding ngram frequency distribution.
You can access the FreqDist for an individual language, and the ngrams within it, as follows:
>>> english_fd = crubadan.lang_freq('eng')
>>> english_fd['the']
728135
The example above retrieves the FreqDist for English and returns the frequency of the ngram ‘the’. An ngram that isn’t found within the language will return 0:
>>> english_fd['sometest']
0
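The zero for a missing ngram is ordinary counter behaviour: in NLTK 3, FreqDist is a subclass of `collections.Counter`. The sketch below uses a plain `Counter` so it runs without the corpus data. It assumes the ngrams are character trigrams, and `char_trigrams` is a hypothetical helper, not part of the reader.

```python
from collections import Counter


def char_trigrams(text):
    """Return every overlapping three-character window of text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


# Count the trigrams of a tiny sample string.
counts = Counter(char_trigrams("the then"))

seen = counts["the"]      # trigram present in the sample
missing = counts["xyz"]   # absent trigram: Counter returns 0, no KeyError
```

Looking up a missing key does not add it to the counter, so such probes leave the distribution unchanged.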
A language that isn’t supported will raise an exception:
>>> crubadan.lang_freq('elvish')
Traceback (most recent call last):
...
RuntimeError: Unsupported language.
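Code that takes language codes from user input may want to handle this exception instead of letting it propagate. A minimal sketch follows, under the assumption that falling back to an empty distribution is acceptable. `lang_freq_or_empty` and `FakeReader` are hypothetical names; `FakeReader` stands in for the real reader so the example runs without the corpus data.

```python
from collections import Counter


def lang_freq_or_empty(reader, lang):
    """Return reader.lang_freq(lang), or an empty Counter if unsupported."""
    try:
        return reader.lang_freq(lang)
    except RuntimeError:
        # The reader signals an unsupported language with RuntimeError.
        return Counter()


class FakeReader:
    """Hypothetical stand-in that mimics the behaviour shown above."""

    _data = {"eng": Counter({"the": 728135})}

    def lang_freq(self, lang):
        if lang not in self._data:
            raise RuntimeError("Unsupported language.")
        return self._data[lang]


english = lang_freq_or_empty(FakeReader(), "eng")
elvish = lang_freq_or_empty(FakeReader(), "elvish")
```

With the real reader, passing `crubadan` in place of `FakeReader()` should behave the same way, provided the corpus data is installed.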