All the king's horses
Srinivas Aluru
Sequencing genomes is a path to new discoveries in plant sciences, be the findings pieces of molecular machinery or targets for trait enhancement.
Now new generation DNA sequencing, using massively parallel sequencing technologies, is reducing costs and accelerating sequence data acquisition rates. But it comes hand-in-glove with a new problem - how to assemble and decode all the information.
To sequence genomes, you first break them down and then, like Humpty Dumpty, put the pieces back together again. Normally the genome is randomly cut into a set of DNA fragments - each 700 to 1,000 base pairs long - which are then sequenced. Next, long overlapping sections are matched sequentially and the original genome is reassembled.
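The overlap-and-merge idea can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions (error-free fragments, a single strand, a tiny made-up sequence), not the method used by any particular assembler; the function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left that overlaps; stop with several pieces
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Made-up fragments cut from the toy sequence "ATGCGTACGTTAGC".
pieces = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(pieces))  # ['ATGCGTACGTTAGC']
```

With long fragments the overlaps are long and usually unambiguous, so even this naive greedy strategy recovers the toy sequence.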
New generation DNA sequencing relies upon simultaneously sequencing millions of short DNA pieces, each 36 to 75 base pairs in length. So though this method is fast, reassembly is problematic.
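A rough calculation suggests why shorter pieces are harder to fit together. The sketch below assumes, purely for illustration, that the four bases are equally likely and independent at every position, which real genomes are not; the function names are hypothetical.

```python
def chance_match_probability(k: int) -> float:
    """Probability that two unrelated k-base stretches agree at every position,
    assuming (unrealistically) uniform, independent bases."""
    return 0.25 ** k


def expected_chance_overlaps(n_reads: int, k: int) -> float:
    """Expected number of ordered read pairs whose k-base end and start
    match purely by chance, under the same assumption."""
    return n_reads * (n_reads - 1) * chance_match_probability(k)
```

Each extra base of overlap cuts the chance of a coincidental match by a factor of four. A 36-base read can never offer an overlap longer than 35 bases, while the number of read pairs to compare grows with the square of the number of reads.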
"The probability of overlaps matching with high confidence is greatly diminished with short reads," explains Srinivas Aluru, the Ross Martin Mehl and Marylyne Munas Mehl Professor of Computer Engineering in the Department of Electrical and Computer Engineering.
Aluru, with support from the institute's Innovative Grants Program, is developing a software program that assists genome reassembly and accommodates assemblies in which the ends of many pieces closely resemble the starts of others - a quandary hammered home for researchers reassembling the maize genome, which proved to be 65 to 80 percent repetitive.
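The trouble repeats cause can be seen with a toy example in Python. The sequence below is invented for illustration: the same eight-base stretch appears twice with different neighbors. The helper `find_all` is a hypothetical name, and exact string matching stands in for the far more tolerant matching real tools need.

```python
def find_all(genome: str, read: str):
    """Return every position at which `read` occurs exactly in `genome`."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Invented toy genome: the repeat "ACGTTTGA" occurs twice.
repeat = "ACGTTTGA"
genome = "CC" + repeat + "TT" + repeat + "GG"

print(find_all(genome, "GTTTG"))      # [4, 14]  short read inside the repeat: two candidate origins
print(find_all(genome, "TTTGATTAC"))  # [5]      longer read reaching past the repeat: one origin
```

A read that sits entirely inside a repeat could have come from either copy. Only a read long enough to reach into the unique neighboring sequence pins down its origin.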
"One big problem is repetitive sequence reads - genomic co-location," explains Aluru. "How do we find where in the genome they come from when overlaps are not always from the same genomic region?"
To appreciate the sheer number and complexity of the reads the latest technology produces, imagine navigating across Lake Superior from Sault Ste. Marie to Thunder Bay in a kayak - with only a single data set of thousands of short navigational reads to go by. If an exact geographical location must be matched to each paddle stroke along the entire journey, then even one misplaced coordinate steers the kayaker off course, with no navigational information for reorientation.
Aluru's program, called YAGA (yet another genome assembler), works by coordinating potentially thousands of high-performance supercomputer processors simultaneously.
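Coordinating that many processors depends in part on dividing the work evenly. The Python sketch below shows one simple way to split a batch of reads into near-equal shares; it is an assumption made for illustration, not a description of how YAGA distributes its work, and `split_evenly` is a hypothetical name.

```python
def split_evenly(n_items: int, n_workers: int):
    """Split items 0..n_items-1 into contiguous (start, end) ranges,
    one per worker, whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)  # first `extra` workers take one more
        ranges.append((start, start + size))
        start += size
    return ranges


print(split_evenly(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Equal shares are only a starting point. If some reads take longer to process than others, workers still finish at different times, which is part of what makes the coordination difficult.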
"This becomes a very large optimization project," says Aluru. "Like building a house - if one person does all the work it takes a long time, but there is no need to coordinate planning. If 1,000 people are working, they must interface at the correct time, doing equal amounts of work and completing their tasks at once."
YAGA analyzes millions of paired reads for vital clues and predicts the one logical path through the short reads that is most likely to be the genome sequence.
"We will be trying to apply the YAGA program to various real biology projects, such as sequencing transcriptomes and genomes of other crop grasses," says Aluru. "In the process, we will discover what we need to do better, helping further development."



