NLTK Python 2.x - 3.x Compatibility Layer

NLTK comes with a Python 2.x/3.x compatibility layer, nltk.compat (which is loosely based on six):

>>> from nltk import compat
>>> compat.PY3
>>> compat.integer_types
(<type 'int'>, <type 'long'>)
>>> compat.string_types
(<type 'basestring'>,)
>>> # and so on


Under Python 2.x __str__ and __repr__ methods must return bytestrings.

@python_2_unicode_compatible decorator allows writing these methods in a way compatible with Python 3.x:

  1. wrap a class with this decorator,
  2. define __str__ and __repr__ methods returning unicode text (that's what they must return under Python 3.x),

and they would be fixed under Python 2.x to return byte strings:

>>> from nltk.compat import python_2_unicode_compatible

>>> @python_2_unicode_compatible
... class Foo(object):
...     def __str__(self):
...         return u'__str__ is called'
...     def __repr__(self):
...         return u'__repr__ is called'

>>> foo = Foo()
>>> foo.__str__().__class__
<type 'str'>
>>> foo.__repr__().__class__
<type 'str'>
>>> print(foo)
__str__ is called
>>> foo
__repr__ is called

Original versions of __str__ and __repr__ are available as __unicode__ and unicode_repr:

>>> foo.__unicode__().__class__
<type 'unicode'>
>>> foo.unicode_repr().__class__
<type 'unicode'>
>>> unicode(foo)
u'__str__ is called'
>>> foo.unicode_repr()
u'__repr__ is called'

There is no need to wrap a subclass with @python_2_unicode_compatible if it doesn't override __str__ and __repr__:

>>> class Bar(Foo):
...     pass
>>> bar = Bar()
>>> bar.__str__().__class__
<type 'str'>

However, if a subclass overrides __str__ or __repr__, wrap it again:

>>> class BadBaz(Foo):
...     def __str__(self):
...         return u'Baz.__str__'
>>> baz = BadBaz()
>>> baz.__str__().__class__  # this is incorrect!
<type 'unicode'>

>>> @python_2_unicode_compatible
... class GoodBaz(Foo):
...     def __str__(self):
...         return u'Baz.__str__'
>>> baz = GoodBaz()
>>> baz.__str__().__class__
<type 'str'>
>>> baz.__unicode__().__class__
<type 'unicode'>

Applying @python_2_unicode_compatible to a subclass shouldn't break methods that was not overridden:

>>> baz.__repr__().__class__
<type 'str'>
>>> baz.unicode_repr().__class__
<type 'unicode'>


Under Python 3.x repr(unicode_string) doesn't have a leading "u" letter.

nltk.compat.unicode_repr function may be used instead of repr and "%r" % obj to make the output more consistent under Python 2.x and 3.x:

>>> from nltk.compat import unicode_repr
>>> print(repr(u"test"))
>>> print(unicode_repr(u"test"))

It may be also used to get an original unescaped repr (as unicode) of objects which class was fixed by @python_2_unicode_compatible decorator:

>>> @python_2_unicode_compatible
... class Foo(object):
...     def __repr__(self):
...         return u'<Foo: foo>'

>>> foo = Foo()
>>> repr(foo)
'<Foo: foo>'
>>> unicode_repr(foo)
u'<Foo: foo>'

For other objects it returns the same value as repr:

>>> unicode_repr(5)

It may be a good idea to use unicode_repr instead of %r string formatting specifier inside __repr__ or __str__ methods of classes fixed by @python_2_unicode_compatible to make the output consistent between Python 2.x and 3.x.