We work on answering fundamental biological questions using methods rooted in computer science. Among the areas we are currently exploring are the aberrant structure of cancer genomes and the causes of genetic disorders and diseases such as autism. The computer science aspect of our research ranges from proving theorems about mathematical structures of the genome to analyzing data from clinical samples. We work closely with biological collaborators, both globally and locally here at Penn State.

Recently, the main focus of our group has been on methods for genome assembly and for structural and copy number variation detection, as well as other aspects of next-generation sequencing data analysis. We are also interested in a variety of computer science areas, such as phylogenetics, graph theory, computational complexity, online algorithms, and networking.

We highlight some recent projects:

Comparison of multiple genomes is at the foundation of much of evolutionary and functional biology, but methods that can handle the scale of today’s data are still limited. De Bruijn graph-based methods are highly promising, but they rely on a graph compaction step which has so far been shown to scale only to 7 human genomes. Using improved algorithms and multi-threading, our new TwoPaCo tool can perform this step for 100 human genomes in a day on a typical server. Our paper “TwoPaCo: An efficient algorithm to build the compacted de Bruijn graph from many complete genomes” appears in Bioinformatics.
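The compaction step exploits the fact that the maximal non-branching paths of the de Bruijn graph are delimited by junction k-mers: k-mers preceded or followed by more than one distinct character somewhere in the input. Below is a minimal, memory-naive sketch of junction finding (the function name is ours; TwoPaCo itself uses Bloom filters and multi-threading to make this step scale to many complete genomes):

```python
from collections import defaultdict

def find_junctions(genomes, k):
    """Identify junction k-mers: those preceded or followed by more than
    one distinct character across all input genomes. Junctions delimit
    the maximal non-branching paths of the de Bruijn graph, i.e., the
    nodes of the compacted graph."""
    preds = defaultdict(set)  # k-mer -> set of characters preceding it
    succs = defaultdict(set)  # k-mer -> set of characters following it
    for seq in genomes:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if i > 0:
                preds[kmer].add(seq[i - 1])
            if i + k < len(seq):
                succs[kmer].add(seq[i + k])
    return {km for km in set(preds) | set(succs)
            if len(preds[km]) > 1 or len(succs[km]) > 1}

print(sorted(find_junctions(["ACGTACGTT", "ACGTTACGT"], 3)))
```

Once the junctions are known, every stretch of sequence between two consecutive junction positions collapses into a single node of the compacted graph.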
The de Bruijn graph plays an important role in bioinformatics, most notably in fragment assembly, but its construction remains a bottleneck for large genomes. In collaboration with Rayan Chikhi and Antoine Limasset, we have developed an algorithm for the important compaction step that achieves an order-of-magnitude improvement in time and memory. BCALM 2 can compact a dataset of 13 billion k-mers from the 20 Gbp White spruce genome in under 7 hours, using only 39 GB of RAM. Our paper "Compacting de Bruijn graphs from sequencing data quickly and in low memory" appears in ISMB 2016.
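What makes this compaction frugal is that it can be distributed: k-mers are partitioned into buckets keyed by minimizers, in such a way that every (k-1)-overlap between two k-mers is witnessed inside at least one common bucket; each bucket is then compacted independently (and in parallel), and unitigs spanning buckets are glued in a final pass. Here is a rough sketch of that partitioning invariant, assuming for simplicity a plain lexicographic minimizer order (the tool uses a more carefully chosen ordering):

```python
from collections import defaultdict

def minimizer(s, m):
    # lexicographically smallest m-mer of s (a deliberate simplification)
    return min(s[i:i + m] for i in range(len(s) - m + 1))

def partition(kmers, m):
    """Assign each k-mer to the buckets of the minimizers of its
    (k-1)-prefix and (k-1)-suffix. If a suffix of u equals a prefix of
    v, both k-mers land in the bucket of that (k-1)-mer's minimizer,
    so the overlap can be compacted locally within that bucket."""
    buckets = defaultdict(set)
    for km in kmers:
        for key in {minimizer(km[:-1], m), minimizer(km[1:], m)}:
            buckets[key].add(km)
    return buckets

for key, bucket in sorted(partition({"ACGTA", "CGTAC", "GTACG"}, 3).items()):
    print(key, sorted(bucket))
```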
The first stage of reconstructing a genome from sequencing data is to identify contigs -- long sequences that are guaranteed to be present in the genome. Contigs are typically generated using the simple and efficient unitig algorithm. But is this the best we can do? In a collaboration with Alexandru I. Tomescu, we characterize all the sequences that can be safely output and give a polynomial-time algorithm for finding them. Our paper "Safe and complete contig assembly via omnitigs" will be presented at RECOMB 2016.
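To make the baseline concrete, here is a short sketch of our own of the unitig rule over an abstract assembly digraph (the representation and names are illustrative): a node extends the path of its unique predecessor only when that predecessor also has a unique successor, which is what guarantees the path occurs in every reconstruction of the genome.

```python
from collections import defaultdict

def unitigs(arcs):
    """Enumerate the maximal unitigs (non-branching paths) of a digraph
    given as (u, v) arc pairs. Assumes the graph is acyclic or has a
    branching node on every cycle, which keeps the sketch simple."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)
    def extends(v):  # v continues the unitig of its unique predecessor
        return len(pred[v]) == 1 and len(succ[pred[v][0]]) == 1
    paths = []
    for v in set(succ) | set(pred):
        if extends(v):
            continue  # v is internal to some unitig; not a start node
        path = [v]
        while len(succ[path[-1]]) == 1 and extends(succ[path[-1]][0]):
            path.append(succ[path[-1]][0])
        paths.append(path)
    return paths

# A "bubble": two parallel branches between nodes 1 and 4.
print(unitigs([(1, 2), (1, 3), (2, 4), (3, 4)]))
```

In this bubble every node is its own unitig, yet the walk 1, 2, 4 is still safe: any reconstruction that uses the arc from 1 to 2 must continue to 4, since 4 is the only successor of 2. Walks like this, which the unitig algorithm misses, are exactly what the omnitig framework captures.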
An overlap digraph is constructed from a set of reads by assigning an arc between two reads if and only if they overlap. How are the digraphs generated in this way limited by the read length? To answer this question, we introduce and study a graph parameter called readability. The readability of a digraph D is the minimum read length for which there exists a set of reads whose overlap digraph is D. Our paper with Sofya Raskhodnikova and Martin Milanic, "On the readability of overlap digraphs", appears in CPM 2015.
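As a concrete (and simplified) reading of the definition, the sketch below draws an arc between two read indices whenever the longest proper suffix-prefix overlap meets a minimum length threshold; the threshold parameter is our simplification, and the paper pins down the exact overlap rule used:

```python
def overlap(u, v):
    # length of the longest proper suffix of u equal to a proper prefix of v
    for length in range(min(len(u), len(v)) - 1, 0, -1):
        if u[-length:] == v[:length]:
            return length
    return 0

def overlap_digraph(reads, min_overlap):
    """Arcs of the overlap digraph over read indices: i -> j whenever
    read i overlaps read j by at least min_overlap characters. The
    readability of a digraph D asks for the smallest read length at
    which some set of reads produces exactly D this way."""
    return [(i, j) for i, u in enumerate(reads)
                   for j, v in enumerate(reads)
                   if i != j and overlap(u, v) >= min_overlap]

print(overlap_digraph(["ACGT", "CGTA", "GTAC"], 3))  # [(0, 1), (1, 2)]
```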
One of the most important parameters dictating the quality of genome assemblies is the k-mer size. Making the correct choice has, for the most part, been more art than science, often requiring days of compute time to try various putative values. We have developed a tool, KmerGenie, that can efficiently determine the best value in a fraction of the time and in an automated manner. Our paper, titled "Informed and automated k-mer size selection for genome assembly", won the Best Paper award at HiTSeq 2013.
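To give a flavor of the signal involved, the sketch below scores each candidate k with a crude proxy: the number of distinct k-mers seen at least twice, treating singletons as likely sequencing errors. Both the proxy and the names are our own illustration; KmerGenie instead fits a statistical model to the full k-mer abundance histogram for each k:

```python
from collections import Counter

def genomic_kmers(reads, k, min_count=2):
    """Crude estimate of the number of distinct genomic k-mers: count
    distinct k-mers occurring at least min_count times, discarding
    singletons as probable errors."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return sum(1 for c in counts.values() if c >= min_count)

def best_k(reads, candidates):
    # choose the k maximizing the estimated number of genomic k-mers
    return max(candidates, key=lambda k: genomic_kmers(reads, k))

reads = ["ACGTACGTGACGT"] * 5  # toy error-free reads at 5x coverage
print(best_k(reads, [3, 5, 7]))  # prints 5
```

In this toy example k=3 loses because repeats collapse into the same k-mers, while k=7 loses because longer k-mers are simply fewer; finding the sweet spot in between, on real data with sequencing errors, is what the tool automates.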