Our group works on developing computer science techniques for analysis of biological data and on answering fundamental biological questions using such methods.Our research ranges from 1) defining and studying problems in theoretical computer science that are inspired by biology (see this for example), 2) applying ideas and techniques from theoretical computer science to design algorithms and data structures for biological problems (see
this for example), 3) turning algorithmic ideas into user-facing bioinformatics software (see this for example), and 4) applying bioinformatics software to answer fundamental biological and medical questions (see this for example).

A recent focus of our lab is on developing scalable algorithms for big biological datasets. The ubiquity of sequencing technologies has created a scalability bottleneck in both the number and size of genomes being analyzed. We highlight some recent projects:

Comparisons of multiple genomes is at the foundation of much of evolutionary and functional biology, but methods to handle the scale of today’s data are still limited. De Bruijn graph-based methods are highly promising, but they rely on a graph compaction step which has only been shown to scale up to 7 human genomes. Using improved algorithms and multi-threading, our new TwoPaCo tool can perform this step for 100 in a day, on a typical server. Our paper

“TwoPaCo: An efficient algorithm to build the compacted de Bruijn graph from many complete genomes” appears in Bioinformatics.

The de Bruijn graph plays an important role in bioinformatics and fragment assembly,
but its construction remains a bottleneck for large genomes.
In collaboration with

Rayan Chikhi and Antoine Limasset, we have developed an algorithm for the important compaction step that achieves an order of magnitude improvement in time and memory.
BCALM 2 can compact a dataset of 13 billion k-mers from the 20 Gbp White spruce genome in under 7 hours and only 39 GB of RAM.
Our paper "

Compacting de Bruijn graphs from sequencing data quickly and in low memory"
appears in

ISMB 2016.

The first stage of reconstructing a genome from sequencing data is to identify contigs -- long sequences that are guaranteed to be in the genome. Contigs are typically generated using the simple and efficient unitig algorithm. But is this the best we can do? In a collaboration with Alexandru I. Tomescu, we characterize all the sequences that can be safely output and give a poly-time algorithm for finding them. Our paper "

Safe and complete contig assembly via omnitigs" was presented at

RECOMB 2016.

An overlap digraph is constructed from a set of reads by assigning an arc between two reads if and only if they overlap.
How are the digraphs generated in this way limited by the read-length?
To answer this question, we introduce and study a graph parameter called readability.
The readability of digraph D is the minimum read-length needed such that
there exists a set of reads such that their overlap digraph is D.
Our paper, in collaboration with Sofya Raskhodnikova and Martin Milanic,
"

On the readability of overlap digraphs"
appears in

CPM 2015.

One of the most important parameters that dictates the quality of genome assemblies
is the k-mer size.
Making the correct choice has for the most part been more art than science,
often requiring days of compute time to try various putative values.
We have developed a tool,

KmerGenie,
that can efficiently determine the best value in a fraction of time and in an automated manner.
Our paper, titled "

Informed and automated k-mer size selection for genome assembly",
won the Best Paper award at

HiTSeq 2013.

Principal Investigator
Location
Affiliations