nltk.tag.crf module

A module for POS tagging using CRFSuite

class nltk.tag.crf.CRFTagger[source]

Bases: nltk.tag.api.TaggerI

A POS tagger built on CRFSuite (https://pypi.python.org/pypi/python-crfsuite).

>>> from nltk.tag import CRFTagger
>>> ct = CRFTagger()
>>> train_data = [[('University','Noun'), ('is','Verb'), ('a','Det'), ('good','Adj'), ('place','Noun')],
... [('dog','Noun'),('eat','Verb'),('meat','Noun')]]
>>> ct.train(train_data,'model.crf.tagger')
>>> ct.tag_sents([['dog','is','good'], ['Cat','eat','meat']])
[[('dog', 'Noun'), ('is', 'Verb'), ('good', 'Adj')], [('Cat', 'Noun'), ('eat', 'Verb'), ('meat', 'Noun')]]
>>> gold_sentences = [[('dog','Noun'),('is','Verb'),('good','Adj')] , [('Cat','Noun'),('eat','Verb'), ('meat','Noun')]]
>>> ct.evaluate(gold_sentences)
1.0

Setting learned model file

>>> ct = CRFTagger()
>>> ct.set_model_file('model.crf.tagger')
>>> ct.evaluate(gold_sentences)
1.0

__init__(feature_func=None, verbose=False, training_opt={})[source]

Initialize the CRFSuite tagger

Parameters
  • feature_func – The function that extracts features for each token of a sentence. It should take two parameters, tokens and index, and return the features extracted at position index of the tokens list. See the built-in _get_features function for more detail.

  • verbose (boolean) – whether to output debugging messages during training.

  • training_opt (dictionary) – python-crfsuite training options
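As an illustration of the feature_func contract described above, here is a minimal sketch of a custom extractor. The name simple_features and the feature labels are hypothetical, and the sketch assumes that features may be returned as a list of strings; check the built-in _get_features function for the exact convention before relying on it.

```python
def simple_features(tokens, idx):
    """Return string features for tokens[idx] (hypothetical extractor)."""
    token = tokens[idx]
    features = ["WORD_" + token.lower()]

    # Surface-form cues.
    if token[:1].isupper():
        features.append("CAPITALIZED")
    if any(ch.isdigit() for ch in token):
        features.append("HAS_DIGIT")

    # Suffixes of one to three characters, always shorter than the token.
    for n in (1, 2, 3):
        if len(token) > n:
            features.append("SUF_" + token[-n:].lower())

    # Minimal context: the previous word and sentence-boundary markers.
    if idx > 0:
        features.append("PREV_" + tokens[idx - 1].lower())
    else:
        features.append("BOS")
    if idx == len(tokens) - 1:
        features.append("EOS")
    return features


# Assumed usage (requires nltk and python-crfsuite):
# from nltk.tag import CRFTagger
# ct = CRFTagger(feature_func=simple_features)
```

The function only looks at tokens and idx, so it can be tested on plain lists before any model is trained.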

Set of possible training options (using LBFGS training algorithm).
‘feature.minfreq’

The minimum frequency of features.

‘feature.possible_states’

Force the generation of possible state features.

‘feature.possible_transitions’

Force the generation of possible transition features.

‘c1’

Coefficient for L1 regularization.

‘c2’

Coefficient for L2 regularization.

‘max_iterations’

The maximum number of iterations for L-BFGS optimization.

‘num_memories’

The number of limited memories for approximating the inverse Hessian matrix.

‘epsilon’

Epsilon for testing the convergence of the objective.

‘period’

The number of iterations over which the stopping criterion is tested.

‘delta’

The threshold for the stopping criterion; L-BFGS optimization stops when the improvement of the log likelihood over the last ‘period’ iterations is no greater than this threshold.

‘linesearch’

The line search algorithm used in L-BFGS updates:

  • ‘MoreThuente’: More and Thuente’s method,

  • ‘Backtracking’: Backtracking method with regular Wolfe condition,

  • ‘StrongBacktracking’: Backtracking method with strong Wolfe condition

‘max_linesearch’

The maximum number of trials for the line search algorithm.
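The options above are passed as a plain dictionary. The sketch below builds one and checks it for misspelled names before use. The values are illustrative only, not recommended defaults, and find_option_problems is a hypothetical helper that is not part of NLTK or python-crfsuite.

```python
# Option names taken from the list above.
KNOWN_OPTIONS = {
    "feature.minfreq", "feature.possible_states", "feature.possible_transitions",
    "c1", "c2", "max_iterations", "num_memories", "epsilon", "period", "delta",
    "linesearch", "max_linesearch",
}
LINESEARCH_CHOICES = {"MoreThuente", "Backtracking", "StrongBacktracking"}


def find_option_problems(opts):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = ["unknown option: " + name
                for name in sorted(set(opts) - KNOWN_OPTIONS)]
    linesearch = opts.get("linesearch")
    if linesearch is not None and linesearch not in LINESEARCH_CHOICES:
        problems.append("unsupported linesearch: " + str(linesearch))
    return problems


# Illustrative values only.
training_opt = {
    "feature.minfreq": 2,
    "c1": 0.1,
    "c2": 0.01,
    "max_iterations": 100,
    "linesearch": "Backtracking",
}

# Assumed usage (requires nltk and python-crfsuite):
# from nltk.tag import CRFTagger
# if not find_option_problems(training_opt):
#     ct = CRFTagger(training_opt=training_opt)
```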

set_model_file(model_file)[source]

tag_sents(sents)[source]

Tag a list of sentences. NB: before using this function, the user should specify the model file by either

  • training a new model with the train function, or

  • loading a pre-trained model with the set_model_file function.

Parameters

sents – list of sentences to tag.

Returns

list of tagged sentences.

Return type

list(list(tuple(str,str)))
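The nested return type makes it straightforward to compare tagger output with gold annotations. The following sketch computes token-level accuracy by hand, which is the kind of score the evaluate call in the class example reports. Here token_accuracy is a hypothetical helper, not an NLTK function.

```python
def token_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for pred_sent, gold_sent in zip(predicted, gold):
        if len(pred_sent) != len(gold_sent):
            raise ValueError("sentence lengths differ")
        for (_, pred_tag), (_, gold_tag) in zip(pred_sent, gold_sent):
            correct += int(pred_tag == gold_tag)
            total += 1
    return correct / total if total else 0.0


# Shapes match the tag_sents return type: list(list(tuple(str,str))).
gold = [[("dog", "Noun"), ("is", "Verb"), ("good", "Adj")],
        [("Cat", "Noun"), ("eat", "Verb"), ("meat", "Noun")]]
predicted = [[("dog", "Noun"), ("is", "Verb"), ("good", "Adj")],
             [("Cat", "Noun"), ("eat", "Noun"), ("meat", "Noun")]]
score = token_accuracy(predicted, gold)  # five of six tags agree
```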

train(train_data, model_file)[source]

Train the CRF tagger using CRFSuite.

Parameters
  • train_data (list(list(tuple(str,str)))) – the list of annotated sentences.

  • model_file – the file to which the trained model will be saved.
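Training data often starts out as text rather than nested tuples. As a sketch of how the expected list(list(tuple(str,str))) shape might be produced, the code below parses lines of word/Tag items. The input format and the parse_tagged_line helper are assumptions made for illustration.

```python
def parse_tagged_line(line, sep="/"):
    """Turn 'dog/Noun eat/Verb' into [('dog', 'Noun'), ('eat', 'Verb')]."""
    pairs = []
    for item in line.split():
        # Split on the last separator so that words containing it stay intact.
        word, found, tag = item.rpartition(sep)
        if not found or not word or not tag:
            raise ValueError("expected word%stag, got %r" % (sep, item))
        pairs.append((word, tag))
    return pairs


raw_lines = [
    "University/Noun is/Verb a/Det good/Adj place/Noun",
    "dog/Noun eat/Verb meat/Noun",
]
train_data = [parse_tagged_line(line) for line in raw_lines]

# Assumed usage (requires nltk and python-crfsuite):
# from nltk.tag import CRFTagger
# ct = CRFTagger()
# ct.train(train_data, 'model.crf.tagger')
```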

tag(tokens)[source]

Tag a sentence using the Python CRFSuite tagger. NB: before using this function, the user should specify the model file by either

  • training a new model with the train function, or

  • loading a pre-trained model with the set_model_file function.

Parameters

tokens – list of tokens to tag.

Returns

list of tagged tokens.

Return type

list(tuple(str,str))
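Since tag and tag_sents both need a model to be set first, a small guard around set_model_file can make the failure mode explicit. The sketch below is one possible approach: load_tagger is a hypothetical helper, and the file check is defensive, because this page does not say how set_model_file itself behaves when the file is missing.

```python
import os


def load_tagger(model_path):
    """Return a CRFTagger bound to model_path, or None if the file is missing."""
    if not os.path.isfile(model_path):
        return None
    # Imported lazily so the check above works even without nltk installed.
    from nltk.tag import CRFTagger  # requires nltk and python-crfsuite
    ct = CRFTagger()
    ct.set_model_file(model_path)
    return ct


tagger = load_tagger("model.crf.tagger")
if tagger is not None:
    print(tagger.tag(["dog", "is", "good"]))
```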