# Fully Distributed EM for Very Large Datasets

### Jason Wolfe, Aria Delier Haghighi and Daniel Klein

###
EECS Department

University of California, Berkeley

Technical Report No. UCB/EECS-2007-178

December 22, 2007

### http://www.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-178.pdf

In EM and related algorithms, E-step computations distribute easily, because data items are independent given parameters. For very large data sets, however, even storing all of the parameters in a single node for the M-step can be prohibitive. We present a framework which exploits parameter sparsity to fully distribute the entire EM procedure. Each node interacts with only the subset of parameters relevant to its data, sending messages to other nodes along a junction-tree topology. We demonstrate the effectiveness of our framework over a MapReduce approach, on two tasks: word alignment for machine translation, and LDA for topic modeling.

BibTeX citation:

@techreport{Wolfe:EECS-2007-178, Author = {Wolfe, Jason and Haghighi, Aria Delier and Klein, Daniel}, Title = {Fully Distributed EM for Very Large Datasets}, Institution = {EECS Department, University of California, Berkeley}, Year = {2007}, Month = {Dec}, URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-178.html}, Number = {UCB/EECS-2007-178}, Abstract = {In EM and related algorithms, E-step computations distribute easily, because data items are independent given parameters. For very large data sets, however, even storing all of the parameters in a single node for the M-step can be prohibitive. We present a framework which exploits parameter sparsity to fully distribute the entire EM procedure. Each node interacts with only the subset of parameters relevant to its data, sending messages to other nodes along a junction-tree topology. We demonstrate the effectiveness of our framework over a MapReduce approach, on two tasks: word alignment for machine translation, and LDA for topic modeling.} }

EndNote citation:

%0 Report %A Wolfe, Jason %A Haghighi, Aria Delier %A Klein, Daniel %T Fully Distributed EM for Very Large Datasets %I EECS Department, University of California, Berkeley %D 2007 %8 December 22 %@ UCB/EECS-2007-178 %U http://www.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-178.html %F Wolfe:EECS-2007-178