nltk.classify.textcat module

A module for language identification using the TextCat algorithm, an implementation of the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, “N-Gram-Based Text Categorization”.

The algorithm takes advantage of Zipf’s law: it uses n-gram frequencies to profile both the known languages and the text yet to be identified, then compares the profiles using a distance measure.
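The comparison step can be illustrated with a minimal, pure-Python sketch of a rank-based "out-of-place" distance. This is not the module's own code: the helper names `rank_profile` and `out_of_place` are hypothetical, and the penalty for an n-gram missing from the language profile (here, the size of that profile) is an assumption made for illustration.

```python
from collections import Counter


def rank_profile(ngrams):
    """Map each n-gram to its frequency rank (0 = most frequent)."""
    counts = Counter(ngrams)
    # most_common() sorts by descending count; ties keep first-seen order.
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common())}


def out_of_place(text_ranks, lang_ranks, max_penalty=None):
    """Sum, over the text's n-grams, how far each rank is from the language's rank."""
    if max_penalty is None:
        # Assumed penalty for n-grams the language profile has never seen.
        max_penalty = len(lang_ranks)
    total = 0
    for gram, rank in text_ranks.items():
        if gram in lang_ranks:
            total += abs(lang_ranks[gram] - rank)
        else:
            total += max_penalty
    return total
```

A smaller total means the text's n-gram ranking is closer to the language's ranking; two identical profiles give a distance of 0.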

Language n-grams are provided by the “An Crubadan” project. A corpus reader was created separately to read those files.

For details regarding the algorithm, see: https://www.let.rug.nl/~vannoord/TextCat/textcat.pdf

For details about An Crubadan, see: https://borel.slu.edu/crubadan/index.html

class nltk.classify.textcat.TextCat

Bases: object

__init__()

calc_dist(lang, trigram, text_profile): Calculate the “out-of-place” measure between the text profile and a language profile for a single trigram

fingerprints = {}

guess_language(text): Find the language with the minimum distance to the text and return its ISO 639-3 code

lang_dists(text): Calculate the “out-of-place” measure between the text and all languages
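Once a distance has been computed for every language, choosing the guess amounts to taking the minimum. The sketch below assumes the distances are held in a plain dict keyed by ISO 639-3 code; the function name `closest_language` and the numbers in the usage lines are hypothetical and not taken from the module.

```python
def closest_language(distances):
    """Return the language code whose distance to the text is smallest."""
    if not distances:
        raise ValueError("no language distances to compare")
    # min() over the keys, ordered by each key's distance value.
    return min(distances, key=distances.get)


# Made-up distances, for illustration only.
example_distances = {"eng": 120, "deu": 340, "fra": 275}
best = closest_language(example_distances)
```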

last_distances = {}

profile(text): Create FreqDist of trigrams within text
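A trigram profile of this kind can be sketched without NLTK by using `collections.Counter` in place of `FreqDist`. The whitespace tokenization and the `<` and `>` word-boundary markers below are assumptions chosen for illustration, and `trigram_profile` is a hypothetical name rather than the module's method.

```python
from collections import Counter

# Assumed word-boundary markers, so that word-initial and word-final
# trigrams are distinguishable from word-internal ones.
START, END = "<", ">"


def trigram_profile(text):
    """Count character trigrams over each whitespace-separated token."""
    counts = Counter()
    for token in text.split():
        padded = START + token + END
        # Slide a window of width 3 across the padded token.
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts
```

For example, `trigram_profile("the cat")` counts `<th`, `the`, `he>`, `<ca`, `cat`, and `at>` once each.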

remove_punctuation(text): Get rid of punctuation except apostrophes
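One way to get this behaviour with the standard library alone is to drop every character whose Unicode general category is a punctuation class, while keeping the ASCII apostrophe. This is a sketch under that assumption, not the module's implementation; `strip_punctuation` is a hypothetical name.

```python
import unicodedata


def strip_punctuation(text):
    """Remove Unicode punctuation characters, keeping the ASCII apostrophe."""
    return "".join(
        ch
        for ch in text
        # Unicode general categories starting with "P" are punctuation classes.
        if ch == "'" or not unicodedata.category(ch).startswith("P")
    )
```

Note that only the ASCII apostrophe `'` is preserved here; typographic apostrophes such as `’` fall in a punctuation category and are removed by this sketch.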

nltk.classify.textcat.demo()
