DeepCpG is a deep neural network for predicting DNA methylation in multiple cells. DeepCpG has a modular architecture consisting of three parts: a recurrent CpG module that accounts for correlations between CpG sites within and across cells, a convolutional DNA module that extracts patterns from a wide DNA sequence window, and a joint module that integrates the evidence from the CpG and DNA modules to predict the methylation state of a target CpG site in multiple cells. DeepCpG yields accurate predictions, enables the discovery of DNA sequence motifs that are associated with DNA methylation states and cell-to-cell variability, and can be used to analyze the effect of single-nucleotide mutations on DNA methylation. DeepCpG is implemented in Python and publicly available.
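To give a rough feel for what a convolutional DNA module operates on, here is a minimal NumPy sketch, not DeepCpG's actual code. It assumes the DNA window is one-hot encoded and scanned with a single fixed filter; the names `one_hot`, `conv1d_relu`, and `cg_filter`, as well as the toy window, are hypothetical and chosen only for illustration. A real model would learn many wider filters from data.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix; unknown bases (e.g. N) become all-zero rows."""
    X = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        j = ALPHABET.find(base)
        if j >= 0:
            X[i, j] = 1.0
    return X

def conv1d_relu(X, filt):
    """Slide one (width, 4) filter along the sequence and apply a ReLU to each score."""
    width = filt.shape[0]
    scores = np.array([np.sum(X[i:i + width] * filt)
                       for i in range(len(X) - width + 1)])
    return np.maximum(scores, 0.0)

# Toy sequence window (assumption: real windows are much wider).
window = "TTACGCGATTCGAAT"

# Hypothetical hand-set filter that responds maximally to the dinucleotide "CG".
cg_filter = one_hot("CG")

activations = conv1d_relu(one_hot(window), cg_filter)

# Positions where both filter rows match, i.e. where a CpG starts.
cpg_positions = np.flatnonzero(activations == 2.0)
```

Here the filter is fixed by hand so that the result is easy to check; in a trained network the filter weights are parameters, and the resulting activations are what motif analysis would inspect.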
Google Summer of Code 2015
I was a Google Summer of Code 2015 student, supported by the Python Software Foundation. I developed a module for interactively visualizing compute graphs, which is now part of the deep learning library Theano! Have a look at my blog for more details!
Factor analysis (FA) is a method for dimensionality reduction, similar to principal component analysis (PCA), singular value decomposition (SVD), or independent component analysis (ICA). Applications include visualization, image compression, and feature learning. A mixture of factor analysers consists of several factor analysers and allows both dimensionality reduction and clustering. Variational Bayesian learning of the model parameters reduces overfitting compared with maximum likelihood methods such as expectation maximization (EM), and makes it possible to learn the dimensionality of the lower-dimensional subspace by automatic relevance determination (ARD).
I developed vbmfa, a Python package for variational matrix factorization, which is available on PyPI and GitHub!
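The following is a minimal NumPy sketch of a single, plain factor analyser with known parameters. It is not the vbmfa API and does no variational learning. It assumes the standard FA model x = W z + mu + eps with z ~ N(0, I) and diagonal noise variances psi, and shows how the low-dimensional representation is obtained as the posterior mean of z. The function names and toy dimensions are hypothetical.

```python
import numpy as np

def sample_fa(n_samples, W, mu, psi, rng):
    """Draw samples from x = W z + mu + eps, z ~ N(0, I), eps ~ N(0, diag(psi))."""
    n_features, n_factors = W.shape
    z = rng.standard_normal((n_samples, n_factors))
    eps = rng.standard_normal((n_samples, n_features)) * np.sqrt(psi)
    return z @ W.T + mu + eps, z

def posterior_mean(X, W, mu, psi):
    """E[z | x] = (I + W^T Psi^-1 W)^-1 W^T Psi^-1 (x - mu), computed for each row of X."""
    n_factors = W.shape[1]
    Wt_psi_inv = W.T / psi                    # W^T Psi^-1, shape (k, d)
    G = np.eye(n_factors) + Wt_psi_inv @ W    # posterior precision, shape (k, k)
    return np.linalg.solve(G, Wt_psi_inv @ (X - mu).T).T

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 2))   # loading matrix: 10 observed dims, 2 factors
mu = np.zeros(10)                  # data mean
psi = np.full(10, 0.01)            # small per-dimension noise variances

X, z_true = sample_fa(500, W, mu, psi, rng)
z_hat = posterior_mean(X, W, mu, psi)   # 2-D representation of the 10-D data
```

With low noise, `z_hat` closely tracks the factors that generated the data. A mixture of factor analysers would hold one such (W, mu) pair per cluster, and a variational Bayesian treatment would place priors on the parameters instead of fixing them as done here.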
A tool for protein sequence searching with a two-fold higher sensitivity than the popular search tool BLAST at the same runtime. CS-BLAST exploits the sequence context of an amino acid to infer more accurate mutation probabilities than would be possible by looking at the amino acid alone. I extended CS-BLAST with a model based on conditional random fields for predicting amino acid mutation probabilities from sequence context, and showed that it outperforms BLAST on a challenging test set. CS-BLAST is implemented in C++ and publicly available.
- Discriminative modelling of context-specific amino acid substitution probabilities, Bioinformatics, 2012
- Sequence context-specific profiles for homology searching, PNAS, 2009
- Bioinformatics Toolkit
- Source code
An interactive web server for bioinformatics research that offers access to a great variety of bioinformatics tools for protein sequence analysis. The server is heavily used by researchers worldwide and processes over 500 jobs per day on average. I maintained and extended the Bioinformatics Toolkit for almost two years.
An algorithm for predicting functionally important positions in protein sequences. For a given input protein, HHfuncs searches a database for similar proteins with known biological function and uses them to predict functionally relevant sites in the input protein. HHfuncs ranked among the top methods in CASP9 and CASP10, the international competitions for protein structure prediction.