nltk.tag package¶

Submodules¶

Module contents¶

NLTK Taggers

This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.

A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (tag, token). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):

>>> tagged_tok = ('fly', 'NN')

An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:

>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) 
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

A Russian tagger is also available if you specify lang=”rus”. It uses the Russian National Corpus tagset:

>>> pos_tag(word_tokenize("Илья оторопел и дважды перечитал бумажку."), lang='rus')    
[('Илья', 'S'), ('оторопел', 'V'), ('и', 'CONJ'), ('дважды', 'ADV'), ('перечитал', 'V'),
('бумажку', 'S'), ('.', 'NONLEX')]

This package defines several taggers, which take a list of tokens, assign a tag to each one, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:

>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
...     print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None

Note that words that the tagger has not seen during training receive a tag of None.

We evaluate a tagger on data that was not seen during training:

>>> round(tagger.accuracy(brown.tagged_sents(categories='news')[500:600]), 3)
0.735

For more information, please consult chapter 5 of the NLTK Book.

isort:skip_file

nltk.tag.pos_tag(tokens, tagset=None, lang='eng')[source]¶

Use NLTK’s currently recommended part of speech tagger to tag the given list of tokens.

>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) 
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal') 
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]

NB. Use pos_tag_sents() for efficient tagging of more than one sentence.

Parameters:

tokens (list(str)) – Sequence of tokens to be tagged
tagset (str) – the tagset to be used, e.g. universal, wsj, brown
lang (str) – the ISO 639 code of the language, e.g. ‘eng’ for English, ‘rus’ for Russian

Returns:

The tagged tokens

Return type:

list(tuple(str, str))

nltk.tag.pos_tag_sents(sentences, tagset=None, lang='eng')[source]¶

Use NLTK’s currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.

Parameters:

sentences (list(list(str))) – List of sentences to be tagged
tagset (str) – the tagset to be used, e.g. universal, wsj, brown
lang (str) – the ISO 639 code of the language, e.g. ‘eng’ for English, ‘rus’ for Russian

Returns:

The list of tagged sentences

Return type:

list(list(tuple(str, str)))