Electrical Engineering and Computer Sciences

Preliminary Studies on de novo Assembly with Short Reads

Nanheng Wu

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2009-172
December 15, 2009

http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-172.pdf

Recent developments in next-generation sequencing present new computational challenges for assembly algorithms. Any effective, practical de novo assembly algorithm must cope with short read lengths, base-calling errors, and enormous data volumes. In this report we present our efforts to address these challenges in de novo assembly with short reads. Specifically, we show that quality scores contain vital information and that algorithms achieve better results when they use them. We also show that error-correction preprocessing can make de novo assembly algorithms more tolerant of base-calling errors. Finally, we present a novel parallel algorithm that clusters sequence reads based on overlap information, and we show that it has the potential to scale to millions of reads efficiently.
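To make the idea of clustering reads by overlap concrete, here is a minimal serial sketch in Python. It is not the parallel algorithm from the report, whose details are not given on this page. It assumes exact suffix-prefix matches of at least `min_len` bases, ignores quality scores and base-calling errors, and uses a quadratic all-pairs comparison that would not scale to millions of reads. The names `overlap_len` and `cluster_reads` are hypothetical.

```python
def overlap_len(a, b, min_len):
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if no such overlap has at least `min_len` bases."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def cluster_reads(reads, min_len=3):
    """Group read indices into clusters: two reads share a cluster when
    they are linked by a chain of exact suffix-prefix overlaps.

    Assumption: exact matching only; this is an illustration, not the
    report's method."""
    parent = list(range(len(reads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: O(n^2) pairs, fine only for a toy example.
    for i in range(len(reads)):
        for j in range(len(reads)):
            if i != j and overlap_len(reads[i], reads[j], min_len) > 0:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    reads = ["ACGTAC", "TACGGA", "GGATTT", "CCCCCC"]
    print(cluster_reads(reads, min_len=3))
```

In this toy input the first three reads chain together through the overlaps `TAC` and `GGA`, while the last read shares no overlap and forms its own cluster. A practical version would replace the all-pairs loop with an index (for example, hashing k-mers) and tolerate mismatches.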

Advisor: Satish Rao


BibTeX citation:

@mastersthesis{Wu:EECS-2009-172,
    Author = {Wu, Nanheng},
    Editor = {Rao, Satish and Song, Yun S.},
    Title = {Preliminary Studies on de novo Assembly with Short Reads},
    School = {EECS Department, University of California, Berkeley},
    Year = {2009},
    Month = {Dec},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-172.html},
    Number = {UCB/EECS-2009-172},
    Abstract = {Recent developments in next-generation sequencing present new computational challenges for assembly algorithms. Any effective, practical de novo assembly algorithm must cope with short read lengths, base-calling errors, and enormous data volumes. In this report we present our efforts to address these challenges in de novo assembly with short reads. Specifically, we show that quality scores contain vital information and that algorithms achieve better results when they use them. We also show that error-correction preprocessing can make de novo assembly algorithms more tolerant of base-calling errors. Finally, we present a novel parallel algorithm that clusters sequence reads based on overlap information, and we show that it has the potential to scale to millions of reads efficiently.}
}

EndNote citation:

%0 Thesis
%A Wu, Nanheng
%E Rao, Satish
%E Song, Yun S.
%T Preliminary Studies on de novo Assembly with Short Reads
%I EECS Department, University of California, Berkeley
%D 2009
%8 December 15
%@ UCB/EECS-2009-172
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-172.html
%F Wu:EECS-2009-172