NLTK Python 3 support

The following text is not a general comprehensive Python 3.x porting guide; it provides some information about the approach using for NLTK Python 3 port.

Porting Strategy

NLTK is being ported to Python 3 using single codebase strategy: NLTK should work from a single codebase in Python 2.x and 3.x.

Python 2.5 compatibility is dropped in order to take advantage of new __future__ imports, b bytestring marker, new except Exception as e syntax and better standard library compatibility.

General notes

There are good existing guides for writing Python 2.x - 3.x compatible code, e.g.

Take a look at them to have an idea how the approach works and what is changed in Python 3.

nltk.compat

nltk.compat module is loosely based on a great six library. It provides simple utilities for wrapping over differences between Python 2 and Python 3. Moved imports, removed/renamed builtins, type names differences and support functions are there.

Note

We don’t use six directly because it didn’t work well bundled at the time the port was started (this was a bug in six that is fixed now), and NLTK needs extra custom 2+3 helpers anyway.

map vs imap, items vs iteritems, ...

A number of Python builtins and builtin methods returns lists under Python 2.x and iterators under Python 3.x. There are 3 possible ways to workaround this:

  1. use non-iterator versions of functions and methods under Python 3.x (e.g. cast zip result to list);
  2. convert Python 2.x code to iterator versions (e.g. replace zip with itertools.izip when possible);
  3. let the code behave different under Python 2.x and Python 3.x.

In this NLTK port (1) and (3) methods are used; (3) is preferred. This way there are no breaking interface changes for Python 2.x code and Python 3.x code remains idiomatic (it is surprising for a dict subclass items method to return a list under Python 3.x).

Existing code that uses NLTK will have to be ported from Python 2.x to 3.x anyway so I think such interface changes are acceptable.

Unicode support

Fixing corpora readers

Previously, many corpora readers returned byte strings. In a Python 3.x branch the correct encodings are provided for all corpora and all corpora readers are now returning unicode.

__repr__ and __str__

Under Python 2.x obj.__repr__ and obj.__str__ must return byte strings, while under Python 3.x they must return unicode strings.

To make things worse, terminals are tricky, and under Python 2.x there is no hassle-free encoding that obj.__repr__ and obj.__str__ may use except for plain 7 bit ASCII.

Should I link a blog post (http://kmike.ru/python-with-strings-attached/) or extract some text from it to make the statement about encodings more reasoned?

In NLTK most classes with custom __repr__ and/or __str__ should use nltk.compat.python_2_unicode_compatible decorator. It works this way:

  1. Class should be defined with __repr__ and __str__ methods returning unicode (that’s Python 3.x semantics);
  2. under Python 2.x the decorator fixes __repr__ and __str__ to return bytestrings;
  3. under both Python 2.x and 3.x the decorator creates __unicode__ method (which is an original __str__) and unicode_repr method (which is an original __repr__).

Under Python 2.x __repr__ method returns an escaped version of the unicode value and __str__ returns a transliterated version. For transliteration Unidecode, text-unidecode or a basic “accent remover” may be used, depending on what packages are installed.

In order to write unicode-aware __str__ and __repr__, the following approach may be used:

  1. from __future__ import unicode_literals is added to a top of file;
  2. str(something) should be replaced with "%s" % something when used (maybe indirectly) inside __str__ or __repr__;
  3. repr(something) and "%r" % something should be replaced with unicode_repr(something) and "%s" % unicode_repr(something) when used (maybe indirectly) inside __str__ or __repr__.

Doctests porting notes

NLTK test suite is mostly doctest-based. Most usual rules apply to porting doctests code. But ther are some issues that make the process harder, so in order to make doctests work under Python 2.x and Python 3.x extra tricks are needed.

__future__ imports

Python’s doctest runner doesn’t support __future__ imports. They are executed but has no effect in doctests’ code. These imports are quite useful for making code Python 2.x + 3.x compatible so there are some methods to overcome the limitation.

  • from __future__ import print_function: it may seem the import works because print(foo) works under python 2.x; but it works only because (foo) == foo; print(foo, bar) prints tuple; print(foo, sep=' ') raises an exception. In order to make print() work this future import is injected to all doctests’ globals within NLTK test suite (implementation: nltk.test.doctest_nose_plugin.DoctestPluginHelper). So NLTK’s doctests shouldn’t import print_function but they should assume this import is in effect.

  • from __future__ import unicode_literals: there is no sane way to use non-ascii constants in doctests under python 2.x (see http://bugs.python.org/issue1293741 ); doctests with non-ascii constants should be better rewritten as unittests or as doctests without non-ascii constants.

    Tests may use variables with unicode values though. In order to print such values and have the same output under python 2 and python 3 the following trick may be used:

    >>> print(unicode_value.encode('unicode-escape').decode('ascii'))
    

    But it may be a better idea to avoid this trick and rewrite the test to unittest format instead.

  • from __future__ import division: it is usually not hard to cast results to int or float to have the same semantics under python 2 and 3.

Unicode strings __repr__

Representation of unicode strings is different in Python 2.x and Python 3.x even if they contain only ascii characters.

Python 2.x:

>>> x = b'foo'.decode('ascii')
>>> x
u'foo'

Python 3.x:

>>> x = b'foo'.decode('ascii')
>>> x
'foo'

(Note the missing ‘u’ in Python 3 example).

In order to simplify things NLTK’s custom doctest runner (see nltk.test.doctest_nose_plugin.DoctestPluginHelper) doesn’t take ‘u’’s into account; it considers u’foo’ and ‘foo’ equal; developer is free to write u’foo’ or ‘foo’.

This is not absolutely correct but if this distinction is important then doctest should be converted to unittest.

There are other possible fixes for this issue but they all make doctests less readable. For example, for single variables print may be used. Python 2.x:

>>> print(x)
foo

Python 3.x:

>>> print(x)
foo

This won’t help with container types. Python 2.x:

>>> print([x, x])
[u'foo', u'foo']

Possible fixes for lists are:

>>> for txt in [x, x]:
...     print(x)
foo
foo

or:

>>> print(", ".join([x, x]))
foo, foo

Float values representation

The exact representation of float values may vary across Python interpreters (this is not only a Python 3.x - specific issue). So instead of this:

>>> recall
0.8888888888889

write this:

>>> print(recall)
0.88888888888...

Porting tools

python-modernize

python-modernize script was used for tedious parts of python3 porting. Take a look at the docs for more information. The process was:

  • Run NLTK test suite under Python 2.x;
  • fix one specific aspect of NLTK by running one of python-modernize fixers on NLTK source code;
  • take a look at changes python-modernize proposes, fix stupid things;
  • run NLTK test suite again under Python 2.x and make sure there are no regressions.

After python-modernize code wouldn’t be necessary Python 3.x compatible but further porting would be easier and there shouldn’t be 2.x regressions.

2to3

Doctest porting may be tedious, there is a lot of search/replace work (e.g. print foo -> print(foo) or raise Exception, e -> raise Exception as e). In order to overcome this 2to3 utility was used, e.g.:

$ 2to3 -d -f print nltk/test/*.doctest

Fixers were applied one-by-one, test suite was executed before and after fixing.