Our group works on developing computer science techniques for analysis of biological data and on answering fundamental biological questions using such methods. Our research ranges from 1) defining and studying problems in theoretical computer science that are inspired by biology (see this for example), 2) applying ideas and techniques from theoretical computer science to design algorithms and data structures for biological problems (see this for example), 3) turning algorithmic ideas into user-facing bioinformatics software (see this for example), and 4) applying bioinformatics software to answer fundamental biological and medical questions (see this for example). A recent focus of our lab is on developing scalable algorithms for big biological datasets. The ubiquity of sequencing technologies has created a scalability bottleneck in both the number and size of genomes being analyzed.
Comparisons of multiple genomes is at the foundation of much of evolutionary and functional biology, but methods to handle the scale of today’s data are still limited. De Bruijn graph-based methods are highly promising, but they rely on a graph compaction step which has only been shown to scale up to 7 human genomes. Using improved algorithms and multi-threading, we’ve developed scalable methods to both construct the de Bruijn graph (TwoPaCo) and a multiple whole-genome alignment (SibeliaZ). Our tools can build the graph for 100 human genomes and do an alignment of 16 mice genomes in less than a day on a typical server.
A significant portion of genes in vertebrates are present in multiple highly similar copies, but existing techniques are not able to reconstruct them at nucleotide-level precision. In collaboration with the Makova lab, we have developed an algorithm to recover these sequences from targeted PacBio data. IsoCon is able to detect rare transcripts that differ by as little as one base pair from 100-fold more abundant transcripts and has allowed us to detect an unprecedented number of novel transcripts from the human Y ampliconic gene families. The paper and software are now available.
The ubiquity of sequencing technologies has resulted in the exponential growth of data repositories such as the Sequencing Read Archive (SRA). These databases offer a treasure trove of potential information, but searching them broadly has remained infeasible until a recent breakthrough. We build upon this with the AllSome and then the HowDe Sequence Bloom Tree data structures and software that significantly reduces time to index and search these databases.
Small genetic variants are known to be responsible for many genetic disorders and their analysis with next-generation sequencing data plays an important role in biomedical research and clinical applications. For bedside genomics, speed becomes crucial, and our tool, VarGeno, can genotype small variants from a human dataset in under an hour. Meanwhile, studies that compare small variants across individuals are complicated by the fact that the same variant can be represented in a myriad of ways. Our tool, VarMatch, allows more accurate comparison of small variant datasets using the memory of a desktop computer.
The de Bruijn graph plays an important role in bioinformatics and fragment assembly, but its construction remains a bottleneck for large genomes. In collaboration with Rayan Chikhi and Antoine Limasset, we have developed an algorithm for the important compaction step that achieves an order of magnitude improvement in time and memory. BCALM 2 can compact a dataset of 13 billion k-mers from the 20 Gbp White spruce genome in under 7 hours and only 39 GB of RAM. Our paper “Compacting de Bruijn graphs from sequencing data quickly and in low memory” appears in ISMB 2016.
The first stage of reconstructing a genome from sequencing data is to identify contigs — long sequences that are guaranteed to be in the genome. Contigs are typically generated using the simple and efficient unitig algorithm. But is this the best we can do? In a collaboration with Alexandru I. Tomescu, we characterize all the sequences that can be safely output and give a poly-time algorithm for finding them. Our paper “Safe and complete contig assembly via omnitigs” was presented at RECOMB 2016.
An overlap digraph is constructed from a set of reads by assigning an arc between two reads if and only if they overlap. How are the digraphs generated in this way limited by the read-length? To answer this question, we introduce and study a graph parameter called readability. The readability of digraph D is the minimum read-length needed such that there exists a set of reads such that their overlap digraph is D. Our paper, in collaboration with Sofya Raskhodnikova and Martin Milanic, “On the readability of overlap digraphs” appears in CPM 2015.
One of the most important parameters that dictates the quality of genome assemblies is the k-mer size. Making the correct choice has for the most part been more art than science, often requiring days of compute time to try various putative values. We have developed a tool,KmerGenie, that can efficiently determine the best value in a fraction of time and in an automated manner. Our paper, titled “Informed and automated k-mer size selection for genome assembly“, won the Best Paper award at HiTSeq 2013.