The first weeks of GSoC are coming to an end, so let’s take some time to reflect on the overall progress during the first phase of the coding period.
The following decisions have been taken in phase one:
- We will aim for a single code base for Pattern that supports both Python 2.7 and Python 3.5+. Notably, this involves dropping support for Python 2.6 and earlier.
- We will aim to write forward-compatible code, i.e. code that handles Python 3 as the default and Python 2 as the exception. This requires some effort but will hopefully make the code more readable in the long term and make it easy to drop Python 2 support entirely at some point.
- Wherever possible, we will avoid using the `six` module since it tends to obfuscate the source code. We will however make use of the `future` package wherever suitable.
- I will not touch the `master` branch on the `clips/pattern` GitHub repository, but decided to commit changes working towards a stable Python 2.7 version to the `development` branch. Apart from this, I am mostly working on the `python3` branch, which will incrementally build towards the Python 2/3 code base. Two minor branches, one of them `wordnet`, have been created to rip out vendorized libraries and help with moving away from
The following lists the steps taken over the last few weeks, in roughly chronological order:
May 30 – June 11
- I set up Travis CI, a continuous integration platform that helps us keep track of which unit tests pass or fail for different Python versions. Every time changes are committed to one of the branches, Travis CI runs the unit tests, shows a build matrix on the project’s status page and lists the log of all unit tests.
- In the current version, Pattern comes bundled with many libraries that are directly integrated into the Pattern code base, especially in the `pattern.web` module. However, this should be discouraged since it requires keeping up with the development of each library individually and merging upstream changes back into Pattern, which is quite laborious. Since we nowadays have decent setup procedures that can resolve dependencies, these modules should be entirely removed from the code base and added as external dependencies to `setup.py`. Specifically, the following bundled libraries were removed from the code base and now remain only as external dependencies:
- There used to be a `<>` operator in Python 2 which is no longer available in Python 3. I replaced all occurrences with the equivalent `!=` (i.e. not equal) operator.
- In Python 3, only absolute imports and explicit relative imports are supported. I adapted a good part of the `import` statements in various modules.
- There were some changes to the way numerals are handled by the interpreter. Numbers with leading zeros, e.g. `01`, are unsupported in Python 3, as are explicit long integer declarations such as `1L`. I adapted the code base accordingly.
- Python 3 removes one of the two Python 2 ways to catch exceptions, `except SomeException, e:`, in favor of the universal `except SomeException as e:`. Similarly, when raising exceptions, `raise SomeException, "Something is wrong!"` is no longer valid and must be written as `raise SomeException("Something is wrong!")`. I adapted the code base accordingly.
- Some of the packages or functions in the standard library have been renamed or refactored in some other way, e.g. `htmlentitydefs`. In general, Python 3 provides a more consistent naming scheme. I adapted the source code to deal with this, either using `try: ... except:` around `import` statements, or making use of the
- Furthermore, in Python 3 functions like `range()`, `zip()` and `map()` no longer return lists: `zip()` and `map()` return lazy iterators, and `range()` returns a lazy sequence object. `reduce()` must be imported separately from `functools`. This required some code refactoring since iterators can neither be indexed nor sliced.
- I did a bit of community work on GitHub, closing resolved or ancient issues or pull requests and opening some issues to address more recent developments. I plan to expand on this during the next two periods.
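Several of the syntax and library changes listed above can be illustrated in one small sketch that runs unchanged on Python 2.7 and Python 3. It is not taken from the Pattern code base: `parse_port`, `first_pairs` and `total` are hypothetical helpers made up for illustration, and `html.entities` is the Python 3 name of the `htmlentitydefs` module.

```python
# Renamed standard-library module: try the Python 3 name first and fall
# back to the Python 2 name, so that Python 3 is the default code path.
try:
    from html.entities import name2codepoint   # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

# reduce() is no longer a builtin in Python 3, but functools.reduce
# is available on both Python 2.7 and Python 3.
from functools import reduce

# Octal literals need the 0o prefix; a bare leading zero (e.g. 017)
# is a syntax error in Python 3.
DEFAULT_MODE = 0o17


def parse_port(text):
    """Hypothetical helper showing the portable exception syntax."""
    try:
        port = int(text)
    except ValueError as e:  # Python 2 only: except ValueError, e:
        # Python 2 only: raise ValueError, "Invalid port"
        raise ValueError("Invalid port %r (%s)" % (text, e))
    if port != 0:  # Python 2 only: port <> 0
        return port
    return None


def first_pairs(keys, values, n):
    """Hypothetical helper returning the first n (key, value) pairs."""
    # zip() returns an iterator in Python 3, so it has to be turned
    # into a list before it can be sliced or indexed.
    return list(zip(keys, values))[:n]


def total(numbers):
    """Hypothetical helper summing numbers with functools.reduce."""
    return reduce(lambda a, b: a + b, numbers, 0)
```

Trying the Python 3 import first and treating Python 2 as the fallback keeps the code in line with the forward-compatible approach described earlier: once Python 2 support is dropped, only the `except ImportError` branch has to go.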
June 12 – June 25
- The `sorted()` function no longer accepts custom comparison functions with the `cmp` keyword in Python 3. Instead, one must move over to using key functions. There is a helper function in `functools` which can deal with this quite easily.
- `Node` objects could not be added to sets because they became unhashable in Python 3: a class that overrides the `__eq__()` method without also defining `__hash__()` loses the default hash. The solution was to simply specify `__hash__ = object.__hash__` in the class definition to explicitly use the default hashing procedure.
- In Python 3, the `__getslice__()` method for slicing is no longer supported. Instead, everything is deferred to `__getitem__()`. I had to do some code refactoring to account for this, mostly for the
- I refactored the unit test `test_db.py` to do the initialization work (mostly setting up the MySQL/MariaDB database handle) in a slightly different way. This is because when running the test with `pytest`, the initialization sometimes failed, resulting in failing unit tests due to a closed database handle or similar problems.
- I noticed that the `MySQLdb` package is not available on Python 3, so some of the tests in `test_db.py` were not actually discovered until after the refactoring. However, there is a package called `mysqlclient` which can substitute `MySQLdb` and supports both Python 2 and Python 3.
- I added an option for `pytest` to report code coverage information.
- I made
Since `pywordnet` is deprecated (since 2001!) and has been integrated into `nltk`, I refactored the code in `pattern/text/en/wordnet/__init__.py` to support the new interface, which has changed extensively. This is still work in progress as of today…
- I decided to move over to `pytest` for unit testing since it has become the de facto standard over the last few years and `nose` has been deprecated for some time now. This does not require any refactoring right now because `pytest` is able to discover and run the classical `unittest` test cases. At some point it might be desirable to port all of the unit tests to `pytest`, but this is not a high priority right now.
- As of right now, the following modules have been ported to Python 3:
- In the upcoming weeks, I will work towards porting the two juicy modules, one of them `pattern.web`, which both require a lot of unicode handling.
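The three porting issues from this period that concern `sorted()`, hashing and slicing can be sketched as follows. This is a minimal illustration under stated assumptions, not code from Pattern: `functools.cmp_to_key` is assumed to be the helper meant above, and the `Node` and `Sentence` classes are simplified, hypothetical stand-ins.

```python
from functools import cmp_to_key


def by_length_then_alpha(a, b):
    """Old-style comparison function: negative, zero or positive result."""
    if len(a) != len(b):
        return len(a) - len(b)
    return (a > b) - (a < b)  # stands in for the removed cmp() builtin


words = ["pear", "fig", "apple", "kiwi"]
# Python 2 only: sorted(words, cmp=by_length_then_alpha)
ordered = sorted(words, key=cmp_to_key(by_length_then_alpha))


class Node(object):
    """Hypothetical stand-in, not Pattern's actual Node class."""

    def __init__(self, id):
        self.id = id

    def __eq__(self, other):
        return isinstance(other, Node) and self.id == other.id

    def __ne__(self, other):  # Python 2 does not derive this from __eq__
        return not self.__eq__(other)

    # Defining __eq__ sets __hash__ to None in Python 3, so the default
    # identity-based hash is restored explicitly.
    __hash__ = object.__hash__


class Sentence(object):
    """Hypothetical container that handles slices in __getitem__."""

    def __init__(self, words):
        self.words = list(words)

    def __len__(self):
        return len(self.words)

    def __getitem__(self, index):
        # Python 3 passes a slice object here instead of calling __getslice__.
        if isinstance(index, slice):
            return Sentence(self.words[index])
        return self.words[index]
```

One consequence of reusing `object.__hash__` is worth keeping in mind: the hash is based on object identity, so two distinct `Node` objects that compare equal may still both end up in the same set.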