DeepCpG is a deep neural network for predicting DNA methylation in single cells. DeepCpG has a modular architecture, consisting of a recurrent CpG module to account for correlations between CpG sites within and across cells, a convolutional DNA module to extract patterns from a wide DNA sequence window, and a Joint module that integrates the evidence from the CpG and DNA modules to predict the methylation state of multiple cells for a target CpG site. DeepCpG yields accurate predictions, enables the discovery of DNA sequence motifs that are associated with DNA methylation states as well as cell-to-cell variability, and can be used for analyzing the effect of single-nucleotide mutations on DNA methylation. DeepCpG is implemented in Python and publicly available.
Google Summer of Code 2015
I participated in Google Summer of Code 2015, supported by the Python Software Foundation. I developed a module for interactively visualizing computational graphs in a web browser, which was later integrated into the deep learning library Theano. More information can be found in my blog posts.
Factor analysis (FA) is a method for dimensionality reduction, similar to principal component analysis (PCA), singular value decomposition (SVD), and independent component analysis (ICA). Applications include visualization, image compression, and feature learning. A mixture of factor analyzers consists of several factor analyzers, enabling both dimensionality reduction and clustering. Variational Bayesian learning of model parameters reduces overfitting compared with maximum likelihood methods such as expectation maximization (EM), and enables learning the dimensionality of the lower-dimensional subspace by automatic relevance determination (ARD).
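To make the idea concrete, here is a minimal NumPy sketch of the generative model behind a single factor analyzer: each observation is a linear map of a few latent factors plus independent per-feature noise, so the implied covariance is a low-rank term plus a diagonal term. This is only an illustration under assumed toy parameters. The function names, dimensions, and noise levels are made up for this example, and it covers neither the mixture model nor variational Bayesian learning, nor the VBMFA API.

```python
import numpy as np


def sample_factor_model(n_samples, W, mu, psi, rng):
    """Draw x = W z + mu + eps with z ~ N(0, I) and eps ~ N(0, diag(psi))."""
    n_features, n_factors = W.shape
    z = rng.standard_normal((n_samples, n_factors))            # latent factors
    eps = rng.standard_normal((n_samples, n_features)) * np.sqrt(psi)  # per-feature noise
    return z @ W.T + mu + eps, z


def model_covariance(W, psi):
    """Covariance implied by the factor model: low-rank part plus diagonal noise."""
    return W @ W.T + np.diag(psi)


# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n_features, n_factors = 5, 2
W = rng.standard_normal((n_features, n_factors))   # factor loading matrix
mu = np.zeros(n_features)                          # data mean
psi = np.full(n_features, 0.1)                     # diagonal noise variances

X, Z = sample_factor_model(50_000, W, mu, psi, rng)

# With many samples, the empirical covariance approaches W W^T + diag(psi).
empirical_cov = np.cov(X, rowvar=False)
max_abs_error = np.abs(empirical_cov - model_covariance(W, psi)).max()
print(f"largest covariance deviation: {max_abs_error:.4f}")
```

Here five observed features are explained by two latent factors. Learning a factor analyzer amounts to recovering `W` and `psi` from data such as `X`.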
I developed VBMFA, a Python package for variational matrix factorization, which is available on PyPI and GitHub!
CS-BLAST is a tool for protein sequence searching with twofold higher sensitivity than the popular search tool BLAST at the same runtime. CS-BLAST exploits the context of an amino acid to infer more accurate mutation probabilities than would be possible by looking at the amino acid alone. I extended CS-BLAST with a model based on conditional random fields for predicting amino acid mutation probabilities from sequence contexts, and showed that it outperforms BLAST on a challenging test set. CS-BLAST is implemented in C++ and publicly available.
- Discriminative modelling of context-specific amino acid substitution probabilities, Bioinformatics, 2012
- Sequence context-specific profiles for homology searching, PNAS, 2009
- Bioinformatics Toolkit
- Source code
An interactive web server for bioinformatics research that offers access to a great variety of bioinformatics tools for protein sequence analysis. The server is heavily used by researchers worldwide and processes over 500 jobs per day on average. I maintained and extended the Bioinformatics Toolkit for almost two years.
HHfuncs is an algorithm for predicting functional sites in proteins. For a given input protein sequence, HHfuncs searches a database for similar proteins with known biological function, which are used to predict functionally relevant sites in the input protein. HHfuncs achieved state-of-the-art results in the international competitions for protein structure prediction CASP9 and CASP10.