Overview of Projects

This page shows (in no particular order) some projects that I have worked on. Some of my projects are published under open source licenses. Most of them are available on GitHub.

Machine Learning

Photometric Variability Classifiers

This work assesses the predictive performance of different automated, supervised Machine Learning approaches for the classification of astrophysical variables based on their photometric variability. It is the report for my Bachelor of Science (B.Sc.) thesis in Physics.


  • Abstract: This work assesses the predictive performance of different automated, supervised Machine Learning approaches for the classification of astrophysical variables based on their photometric variability. We extract \(64\) ad–hoc features in \(R_E\), \(B_E\) and \(R_E − B_E\) from \(32683\) EROS–2 light curves of known periodic, semi–periodic and aperiodic sources in the Large Magellanic Cloud (LMC). To characterize the periodicity of the signals, we make use of both the Lomb–Scargle and the conditional entropy (CE) algorithms for period–finding. In this context, we present a fast Python/Cython implementation of the CE algorithm. To provide further separation of quasars in feature space, we implement the structure function to quantify each source's intrinsic stochastic variability. Using a training set containing labels for \(9\) superclasses and \(25\) subclasses provided by Kim et al. [2014], we train three different models on the extracted features: Support Vector Machines (SVMs), Random Forest (RF) and Gradient Boosted Trees (GBT). We optimize the models' hyperparameters for the average, weighted \(F_1\)–score for superclasses and subclasses by performing a grid search using \(5\)–fold cross–validation. We find that the decision tree based models, RF and GBT, outperform the SVM in both superclass and subclass classification. The highest scores are achieved by the GBT classifier with an average, weighted \(F_1\)–score of \((98.43 \pm 0.07) \, \%\) for superclass classification and \((86.30 \pm 0.37) \, \%\) for subclass classification. (A minimal sketch of the CE period search follows after this list.)
  • Supervisors: PD Dr. Coryn Bailer–Jones, Dr. Kester Smith, Dr. Dae–Won Kim
  • Institution: Max Planck Institute for Astronomy (MPIA)
  • Submission: July 2015
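
The conditional entropy period search mentioned in the abstract folds each light curve at a series of trial periods, bins the resulting phase–magnitude plane and keeps the period that minimizes the conditional entropy \(H(m \mid \phi)\). Below is a minimal NumPy sketch of that idea; it is an illustration only, not the optimized Cython implementation from the thesis, and the function names, bin counts and trial grid are assumptions.

```python
import numpy as np

def conditional_entropy(time, mag, period, phase_bins=10, mag_bins=10):
    """Conditional entropy H(m | phi) of a light curve folded at `period`.

    Plain NumPy illustration of the CE period-finding idea (Graham et al. 2013),
    not the optimized Cython code from the thesis.
    """
    phase = (time / period) % 1.0                     # fold at the trial period
    m = (mag - mag.min()) / (mag.max() - mag.min())   # normalize magnitudes to [0, 1]

    # Joint distribution p(phi, m) on a regular grid.
    counts, _, _ = np.histogram2d(phase, m, bins=(phase_bins, mag_bins),
                                  range=((0, 1), (0, 1)))
    p_joint = counts / counts.sum()
    p_phase = p_joint.sum(axis=1, keepdims=True)      # marginal p(phi)

    # H(m | phi) = sum_ij p(phi_i, m_j) * ln( p(phi_i) / p(phi_i, m_j) )
    mask = p_joint > 0
    ratio = np.broadcast_to(p_phase, p_joint.shape)[mask] / p_joint[mask]
    return float(np.sum(p_joint[mask] * np.log(ratio)))

def best_period(time, mag, trial_periods):
    """Return the trial period with the lowest conditional entropy."""
    ce = [conditional_entropy(time, mag, p) for p in trial_periods]
    return trial_periods[int(np.argmin(ce))]
```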

Automated recognition of \(\LaTeX\) math

We have developed a prototype that transforms images of typeset, low-complexity mathematical expressions back into the \(\LaTeX\) markup language. This work was our final-project submission for the Advanced Machine Learning lecture at the University of Heidelberg in the summer term of 2016.


More information will follow soon...

Gender Classification on Human Facial Images

In this project I developed a system capable of separating frontal images of women and men in feature space, i.e. distinguishing female faces from male faces. For the feature engineering, I focused on two different methods for dimensionality reduction: the Eigenfaces approach based on Principal Component Analysis (PCA) and the Fisherfaces approach based on Linear Discriminant Analysis (LDA). This work was part of our final-project submission for the Machine Learning lecture at the University of Heidelberg in the autumn term of 2014.


The objective of this work was to discriminate male faces from female faces, given images taken from a full frontal position, very much like the ones found in passports or other identification cards. The data set used for this task was the aligned version of the Labeled Faces in the Wild (LFW) data set by Huang et al., consisting of \(13200\) grayscale images of \(250 \times 250\) pixels showing \(5000\) different public figures scraped from the internet. To be able to use this data set for the problem at hand, I had to take care of the following:

  • The LFW data set comes with images and names, but does not provide any information about gender. To map all names to a gender, I made use of the Genderize.io RESTful API, which provides a probabilistic classification based on social network data. The results obtained from this service were very good, and after discarding predictions with a confidence \(< 0.9\) they were flawless.
  • The second problem was that the data set was not balanced, i.e. there were far more male images than female images. To overcome this, I simply doubled the number of female images using a horizontal reflection.
  • Since I used the aligned data set by Wolf et al., the faces can be expected to be nearly centered with the eyes on a horizontal line. To cut down on background clutter, I decided to take only the \(90 \times 90\) pixels in the center into account (see the sketch after this list).
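
A rough illustration of these two preprocessing steps, the center crop and the horizontal-flip augmentation of the female images, as a minimal NumPy sketch; the function names and array shapes are assumptions, not the original project code.

```python
import numpy as np

def center_crop(image, size=90):
    """Crop the central `size` x `size` patch from an aligned face image."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

def augment_females(female_images):
    """Double the number of female samples by adding horizontally mirrored
    copies, which balances the male/female classes."""
    mirrored = [np.fliplr(img) for img in female_images]
    return list(female_images) + mirrored

# Example: crop a single 250x250 LFW-a image down to its central 90x90 pixels.
face = np.zeros((250, 250))          # stand-in for one grayscale LFW-a image
patch = center_crop(face)            # shape (90, 90)
```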

I used this opportunity to review two common methods for dimensionality reduction that are well suited to the problem:

  • Principal Component Analysis (PCA) is one of the classics of unsupervised ML. It performs dimensionality reduction, i.e. it tries to find a lower–dimensional representation of the data, using a linear transformation of the features that maximizes the variance of the projected data. Specifically, it diagonalizes the covariance matrix to obtain a set of eigenvectors that form a basis for a more favorable vector space in which to represent our data. Out of these eigenvectors, we select only a subset that we call principal components, containing the eigenvectors associated with the largest eigenvalues. The eigenvectors are images themselves, and when applied to facial recognition they are often called Eigenfaces. We can then project every input image into this eigenspace to obtain a much smaller number of coefficients, or features, to work with.
  • Since PCA does not know anything about the classes and merely tries to come up with features that are uncorrelated and have a high variance, it usually performs poorly for classification tasks: the resulting features aren't necessarily helpful in discriminating the classes. For example, the first principal component could correspond to the lighting conditions of the images, which in general carries no information about gender. To overcome this, Linear Discriminant Analysis (LDA) finds a set of new features by forming linear combinations of the original features in such a way that the classes are separated as far as possible in feature space. In other words: in contrast to PCA, LDA explicitly tries to model the differences between the classes. For this reason, LDA is more suitable for this particular problem and usually yields much better results on real–world data sets (see the sketch after this list).
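
A minimal scikit-learn sketch of the two projections on stand-in data; in the actual project the feature matrix would hold the flattened \(90 \times 90\) center crops, and the number of principal components shown here is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in data: in the project, X holds the flattened 90x90 center crops
# (one row per image) and y the gender labels obtained via Genderize.io.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 90 * 90))
y = rng.integers(0, 2, size=200)          # 0 = male, 1 = female (illustrative)

# Eigenfaces: unsupervised projection onto the directions of largest variance.
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)              # shape (200, 50)

# Fisherfaces: supervised projection that maximizes class separability.
# With two classes, LDA yields a single discriminant direction.
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)           # shape (200, 1)
```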

I tried different approaches to the classification, ranging from very simple methods like comparing the input image to the mean male and female images in terms of Euclidean distance, to classical supervised ML algorithms such as SVMs, \(k\)–NN and Naive Bayes predicting on the lower–dimensional representations obtained by either PCA or LDA. The best result was achieved by the Fisherface SVM classifier with an average accuracy of \(> 90 \, \%\) in a \(5\)–fold cross–validation.
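
A minimal sketch of that best-performing setup, an SVM evaluated with \(5\)–fold cross–validation on LDA features, again on stand-in data; the SVM kernel and parameters are assumptions, not the values from the original project.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in data; in the project, X are flattened face crops and y the genders.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 90 * 90))
y = rng.integers(0, 2, size=200)

# "Fisherface SVM": project onto the LDA direction, then classify with an SVM.
clf = make_pipeline(LinearDiscriminantAnalysis(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```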


Open–Source Projects

Language Learning Toolkit (lltk)

The Language Learning Toolkit (LLTK) combines several approaches such as natural language processing and web scraping to perform a variety of tasks useful for (human) language learning. This includes:

  • Part-of-speech tagging (POS) supported by Pattern
  • Phonetic transcriptions in accordance with the International Phonetic Alphabet (IPA)
  • Audiosamples (Forvo, Google Translate)
  • Textsamples/Sample sentences (Tatoeba)
  • Visual representations of a given word using Google Images
  • Conjugation of verbs (Present, Perfect, Past, Pluperfect, Future) supported by Verbix
  • Pluralization of nouns (accuracy depending on the language)
  • Indefinite and definite articles for nouns (accuracy depending on the language)
  • Comparative and superlative for adjectives
  • Basic gender detection for nouns
  • License: GNU Lesser General Public License (LGPL)
  • Version: 0.2.4
GitHub

Koko

Koko is a web application that uses the Language Learning Toolkit (LLTK) as a backend to allow for automatic generation of flashcards for Anki SRS.

The underlying idea is to be able to add new cards to your Anki language learning decks within minutes without having to worry about downloading audiosamples or looking up example sentences.

Koko is based on Flask and queries LLTK-RESTful.

The name is inspired by Koko the gorilla, who learned to communicate using American Sign Language (ASL) and is able to understand roughly 2000 words of spoken English.

GitHub

Batyr

Batyr is a web application that can be used to improve listening skills. It uses the Tatoeba database of example sentences, and presents the user with audiosamples – one sentence at a time.

You can try Batyr online at lltk.de.

The name is inspired by Batyr the elephant.

  • License: GNU Affero General Public License (AGPL)
  • Version: 0.2.0
Launch! GitHub

lltk-RESTful

lltk-RESTful is a RESTful API for the Language Learning Toolkit (LLTK). It is based on Flask and delivers JSON documents by default.

  • License: GNU Affero General Public License (AGPL)
  • Version: 0.2.0
GitHub

aescrypt

aescrypt is a small command line tool to encrypt/decrypt files and directories with the well-established AES-256-CBC algorithm.
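
For illustration only, here is a minimal Python sketch of AES-256-CBC encryption using the Python cryptography package. It shows the cipher mode the tool builds on, not aescrypt's actual implementation, key derivation or file format.

```python
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_aes_256_cbc(plaintext: bytes, key: bytes) -> bytes:
    """Encrypt `plaintext` with AES-256-CBC and return IV || ciphertext."""
    iv = os.urandom(16)                                    # fresh random IV per message
    padder = padding.PKCS7(128).padder()                   # CBC operates on full 16-byte blocks
    padded = padder.update(plaintext) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + encryptor.update(padded) + encryptor.finalize()

ciphertext = encrypt_aes_256_cbc(b"hello world", os.urandom(32))  # 32-byte key = AES-256
```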

  • License: GNU General Public License (GPLv3)
  • Version: 0.1.0
GitHub

moodle-download

A simplistic downloader for the Moodle e-learning platform.


moodle-download.py is a small Python script to download a set of specified files from the Moodle e-learning platform. You can create profiles (simple JSON files) that provide information on the courses and use regular expressions to determine which files should be downloaded and where they belong.

So far this has only been tested with the e-learning platform (v2.9) at the University of Heidelberg. In principle, this could work for other Moodle platforms as well, but some modifications might be necessary.

GitHub

ricoh-upload

Upload files to RICOH printers (e.g. http://drucker.uni-hd.de) using the web interface.


  • License: GNU General Public License (GPLv3)
GitHub