nltk.tag.hunpos module

A module for interfacing with the HunPos open-source POS-tagger.

class nltk.tag.hunpos.HunposTagger[source]

Bases: TaggerI

A class for pos tagging with HunPos. The input is the paths to:
  • a model trained on training data

  • (optionally) the path to the hunpos-tag binary

  • (optionally) the encoding of the training data (default: ISO-8859-1)

Check whether the required “hunpos-tag” binary is available:

>>> from nltk.test.setup_fixt import check_binary
>>> check_binary('hunpos-tag')
>>> from nltk.tag import HunposTagger
>>> ht = HunposTagger('en_wsj.model')
>>> ht.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'NN'), ('swallow', 'VB'), ('?', '.')]
>>> ht.close()

This class communicates with the hunpos-tag binary via pipes. When the tagger object is no longer needed, the close() method should be called to free system resources. The class supports the context manager interface; if used in a with statement, the close() method is invoked automatically:

>>> with HunposTagger('en_wsj.model') as ht:
...     ht.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'NN'), ('swallow', 'VB'), ('?', '.')]
__init__(path_to_model, path_to_bin=None, encoding='ISO-8859-1', verbose=False)[source]

Starts the hunpos-tag executable and establishes a connection with it.

  • path_to_model – The model file.

  • path_to_bin – The hunpos-tag binary.

  • encoding

    The encoding used by the model. Unicode tokens passed to the tag() and tag_sents() methods are converted to this charset when they are sent to hunpos-tag. The default is ISO-8859-1 (Latin-1).

    This parameter is ignored for str tokens, which are sent as-is. The caller must ensure that tokens are encoded in the right charset.


Closes the pipe to the hunpos executable.


Tags a single sentence: a list of words. The tokens should not contain any newline characters.