Ariel Shaul Schwartz

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2007-39

March 28, 2007

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-39.pdf

Comparative genomics analyses, such as genome annotation, phylogenetic inference, protein structure prediction and protein function prediction, rely heavily on multiple sequence alignment (MSA), which is essentially a homology prediction problem at the character level over multiple sequences. Alignments are also crucial in other settings, including natural language processing, as seen in word alignment in machine translation, or in the problem of citation entity alignment that is introduced in this dissertation. Fundamental as it is, MSA is a hard computational problem, and producing accurate MSAs remains challenging. In this thesis we develop new posterior decoding based methods for multiple alignments of discrete characters. We begin by discussing biological sequence alignment, and describe a new alignment accuracy measure (AMA), which is based on a metric for the space of alignments. In the case of pairwise sequence alignment, it is possible to find an alignment that maximizes the expected AMA value using posterior decoding. We then describe an extension of the pairwise alignment method to MSA using a novel approach that is based on a poset representation of multiple alignments, and an efficient online topological ordering algorithm. This leads to a sequence annealing algorithm, which is an incremental method for building MSAs one match at a time. The sequence annealing algorithm outperforms all existing algorithms on benchmark test sets of protein sequence alignments. It simultaneously achieves high recall and high precision, dramatically reducing the number of incorrectly aligned residues in comparison to other programs. The method can adjust the recall/precision trade off, and can reliably identify homologous regions among protein sequences. Finally, we demonstrate that similar posterior decoding methods can be applied to the problem of multiple word alignment of citation sentences (citances) in the bioscience literature. We extend the definition of AMA to many-to-many word alignments, and develop a posterior decoding algorithm for multiple citance alignment.

Advisors: Lior Pachter


BibTeX citation:

@phdthesis{Schwartz:EECS-2007-39,
    Author= {Schwartz, Ariel Shaul},
    Title= {Posterior Decoding Methods for Optimization and Accuracy Control of Multiple Alignments},
    School= {EECS Department, University of California, Berkeley},
    Year= {2007},
    Month= {Mar},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-39.html},
    Number= {UCB/EECS-2007-39},
    Abstract= {Comparative genomics analyses, such as genome annotation, phylogenetic inference, protein structure prediction and protein function prediction, rely heavily on multiple sequence alignment (MSA), which is essentially a homology prediction problem at the character level over multiple sequences. Alignments are also crucial in other settings, including natural language processing, as seen in word alignment in machine translation, or in the problem of citation entity alignment that is introduced in this dissertation. Fundamental as it is, MSA is a hard computational problem, and producing accurate MSAs remains challenging. In this thesis we develop new posterior decoding based methods for multiple alignments of discrete characters. We begin by discussing biological sequence alignment, and describe a new alignment accuracy measure (AMA), which is based on a metric for the space of alignments. In the case of pairwise sequence alignment, it is possible to find an alignment that maximizes the expected AMA value using posterior decoding. We then describe an extension of the pairwise alignment method to MSA using a novel approach that is based on a poset representation of multiple alignments, and an efficient online topological ordering algorithm. This leads to a sequence annealing algorithm, which is an incremental method for building MSAs one match at a time. The sequence annealing algorithm outperforms all existing algorithms on benchmark test sets of protein sequence alignments. It simultaneously achieves high recall and high precision, dramatically reducing the number of incorrectly aligned residues in comparison to other programs. The method can adjust the recall/precision trade off, and can reliably identify homologous regions among protein sequences. Finally, we demonstrate that similar posterior decoding methods can be applied to the problem of multiple word alignment of citation sentences (citances) in the bioscience literature. We extend the definition of AMA to many-to-many word alignments, and develop a posterior decoding algorithm for multiple citance alignment.},
}

EndNote citation:

%0 Thesis
%A Schwartz, Ariel Shaul 
%T Posterior Decoding Methods for Optimization and Accuracy Control of Multiple Alignments
%I EECS Department, University of California, Berkeley
%D 2007
%8 March 28
%@ UCB/EECS-2007-39
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-39.html
%F Schwartz:EECS-2007-39