Submission of CS267 Assignment 0

Yangqing Jia, jiayq@eecs.berkeley.edu

Bio

I am a second-year graduate student in EECS at Berkeley, focusing on Artificial Intelligence, especially computer vision and related machine learning algorithms. As vision applications grow increasingly large-scale, using a single machine is no longer an acceptable approach. Most methods, such as object detection and image-level latent topic models, can benefit from parallel computers in multiple ways. During the last semester we developed a general-purpose framework that enabled us to easily distribute independent jobs over the machines in the vision research group (see the report). However, I would like to gain a more detailed understanding of parallel computing, especially on the programming side. I have been trying out CUDA for our research, and am eager to learn more about parallel programming techniques such as OpenMP and GPU computing.

Application: Latent Dirichlet Allocation

Latent Dirichlet Allocation, or LDA, sometimes also called a latent topic model, has been extensively explored as a latent variable model for multinomial data, which is the common case when we consider a document as a bag of words (ignoring word order). It has proven effective in text analysis, and has been extended to computer vision by treating an image as a bag of discretized (usually via vector quantization) visual “words”.

In general, a topic is a distribution over words specifying the probability that a word appears when the topic is present. For example, a topic that semantically relates to nature would assign high probability to words such as “sea” and “birds”. An LDA model consists of a set of K topics, with Phi_{ij} indicating the probability of word i given topic j, and a hyperparameter alpha, the parameter of the Dirichlet prior over the topic proportions of a document. To generate a document, we first sample a topic distribution theta from the Dirichlet prior Dir(theta|alpha). Then, for each word position n in the document, we first sample a topic z_n from the topic distribution theta, and then sample the word w_n from the distribution Phi_{\cdot z_n}, i.e., the word distribution of topic z_n. We will not discuss the mathematical details of LDA here; interested readers can find more in the following paper:

David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. JMLR, 3(Jan):993-1022, 2003.

A shorter description can be found on the Wikipedia page.
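
The generative process described above can be sketched in a few lines of Python. This is only an illustration under assumed toy values: the vocabulary, the matrix PHI, the vector ALPHA, and all function names are hypothetical and not taken from our implementation.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Return an index drawn with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(phi, alpha, length, rng):
    """Generate one document; phi[i][j] is the probability of word i given topic j."""
    num_words = len(phi)
    theta = sample_dirichlet(alpha, rng)          # topic proportions of this document
    topics, words = [], []
    for _ in range(length):
        z = sample_categorical(theta, rng)        # topic for this word position
        column = [phi[i][z] for i in range(num_words)]
        w = sample_categorical(column, rng)       # word drawn from the chosen topic
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical toy model: 4 words, K = 2 topics (each column of PHI sums to 1).
VOCAB = ["sea", "birds", "stock", "market"]
PHI = [
    [0.5, 0.0],
    [0.5, 0.0],
    [0.0, 0.5],
    [0.0, 0.5],
]
ALPHA = [0.5, 0.5]

theta, topics, words = generate_document(PHI, ALPHA, 10, random.Random(0))
print([VOCAB[w] for w in words])
```

In this toy model the "nature" topic can only emit "sea" and "birds", so every generated word is consistent with the topic sampled for its position.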

To apply LDA to any specific task, we need to find the set of topic-word distributions Phi_{ij}. A widely adopted method is Gibbs sampling: we start with a random topic assignment for each word in each document, estimate an initial guess of Phi, and then sample new topic assignments based on this guess of Phi. The estimation and sampling steps iterate until stochastic convergence. (Note: our definition of Gibbs sampling differs slightly from the original Gibbs sampling. This enables parallel computation at the cost of theoretical correctness. In practice we did not observe any performance difference in terms of perplexity or downstream classification accuracy.)
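
A minimal sketch of this alternating scheme is given below. It is not our actual code, and several details are assumptions made for illustration: the smoothing constants alpha and beta, the function names, and in particular the choice to weight each topic by Phi times the document's topic counts from the previous iteration. Because Phi and those counts are held fixed during a sampling step, every word can be resampled independently.

```python
import random

def estimate_phi(docs, assignments, num_topics, vocab_size, beta=0.01):
    """Estimate phi[w][k] = P(word w | topic k) from the current assignments."""
    counts = [[beta] * num_topics for _ in range(vocab_size)]
    for doc, zs in zip(docs, assignments):
        for w, z in zip(doc, zs):
            counts[w][z] += 1.0
    totals = [sum(counts[w][k] for w in range(vocab_size)) for k in range(num_topics)]
    return [[counts[w][k] / totals[k] for k in range(num_topics)]
            for w in range(vocab_size)]

def resample(docs, assignments, phi, num_topics, alpha, rng):
    """Draw new topics for all words using a fixed phi and the old assignments."""
    new_assignments = []
    for doc, zs in zip(docs, assignments):
        doc_counts = [alpha] * num_topics         # smoothed document-topic counts
        for z in zs:
            doc_counts[z] += 1.0
        new_zs = []
        for w in doc:                             # each word is independent here
            weights = [phi[w][k] * doc_counts[k] for k in range(num_topics)]
            new_zs.append(rng.choices(range(num_topics), weights=weights, k=1)[0])
        new_assignments.append(new_zs)
    return new_assignments

def fit(docs, num_topics, vocab_size, iterations=50, alpha=0.1, seed=0):
    """Alternate between estimating phi and resampling all topic assignments."""
    rng = random.Random(seed)
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for _ in range(iterations):
        phi = estimate_phi(docs, assignments, num_topics, vocab_size)
        assignments = resample(docs, assignments, phi, num_topics, alpha, rng)
    return estimate_phi(docs, assignments, num_topics, vocab_size), assignments

# Hypothetical toy corpus: documents are lists of word ids in a 4-word vocabulary.
DOCS = [[0, 1, 0, 1, 1, 0], [2, 3, 3, 2, 2, 3]]
phi, assignments = fit(DOCS, num_topics=2, vocab_size=4, iterations=20)
```

The inner loop over words in resample is the part that maps naturally onto one thread per word, since no iteration reads a value written by another.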

Parallel LDA

Several methods have been proposed to parallelize LDA, ranging from message-passing algorithms to vector platforms. The corresponding programming tools also vary. Here we review the parallel LDA we implemented during the last semester using Nvidia GPU computing.

Specifically, we used C with CUDA to write the parallel LDA program. The key motivation is simple: the Gibbs sampling can be carried out in parallel (under our definition of Gibbs sampling). Thus, in each sampling step, we take advantage of GPU computing to sample topic assignments for each word in each document in parallel, and then collect the sufficient statistics to obtain Phi. We used only one computer with one GPU mounted (when the code was written, we did not find any way to run a program simultaneously on two GPUs).

To speed up the code, we used several tricks, such as aligning the code as much as possible and exploiting the memory hierarchy (the GPU has global memory as well as shared memory exclusive to individual cores, the latter being faster than the former). We did benefit a lot from the GPU architecture: the GPU code runs about 20 times faster than the CPU version. (Our CPU is a 2.4GHz Core 2 processor, and our GPU has 8 cores, each able to execute 30 threads in parallel.) It did not reach the peak performance of the GPU, though. We suspect that the reason lies in the estimation step: after each sampling step, we need to collect the sufficient statistics, i.e., the number of times each document-topic pair and topic-word pair appears. This leads to a large number of atomic operations. Although these are only atomic adds, they seem to have greatly affected the overall run time, and we are still improving the code for better performance.
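
To make the bottleneck concrete, the collection of sufficient statistics can be sketched sequentially in Python as below. The function name and the toy data are hypothetical. In a GPU kernel with one thread per word, each of the two increments would presumably become an atomic add (such as CUDA's atomicAdd), because many threads may update the same counter at the same time.

```python
def collect_counts(docs, assignments, num_topics, vocab_size):
    """Count document-topic pairs and topic-word pairs from the assignments."""
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    for d, (doc, zs) in enumerate(zip(docs, assignments)):
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1      # shared counter: one per (document, topic)
            topic_word[z][w] += 1     # shared counter: one per (topic, word)
    return doc_topic, topic_word

# Hypothetical toy data: word ids per document and their sampled topics.
docs = [[0, 1, 1], [2, 3, 2, 2]]
assignments = [[0, 0, 1], [1, 1, 1, 0]]

doc_topic, topic_word = collect_counts(docs, assignments, num_topics=2, vocab_size=4)
print(doc_topic)    # [[2, 1], [1, 3]]
print(topic_word)   # [[1, 1, 1, 0], [0, 1, 2, 1]]
```

With only K topics, all words of a document compete for K document-topic counters, which suggests why contention on these adds can be significant.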