(Fall 2004 - Culler & Shewchuk): "Problem 1 You are writing a parallelizing Fortran compiler for a distributed memory computer whose interconnection network has high bandwidth but large message latency. Your task is to create an efficient implementation of array statements of the following form. A[1:n] = B[1:n] The notation "1:n" is interpreted as an implicit loop over the array elements from 1 to n. (a) If the array A is distributed equally among the processors in a block distribution, and B is distributed equally in a cyclic distribution, what inter-processor communication takes place? What other data distribution schemes are frequently used? (b) Now consider array statements of the following form. A[X[1:n]] = B[1:n] A[1:n] = B[Y[1:n]] A[X[1:n]] = B[Y[1:n]] The good news is that these statements typically appear within loops that are iterated many times, and the arrays X and Y almost never change from iteration to iteration. How will your compiler make these statements execute as quickly as possible? (c) Suppose one of the array statements above appears alone inside a loop that iterates a thousand times. How can improve the ratio of computation to communication? Problem 2. You have a computation of the form FORALL i = 1 to n F(i) where n is not known at compile time. (a) Describe the run time mechanisms that might be used to distribute this computation over processors on a shared address space machine? (b) In a message passing model (say on a distributed memory machine). (c) What serialization do your mechanisms introduce? (What is the potential speedup?) (d) What are the sources of load imbalance? (e) What techniques could be used to improve the load balance. (f) In general, what trade-offs are faced in balancing load? (g) What general techniques can be used? Problem 3 (a) You are computing a one-dimensional FFT on a distributed-memory machine. How do you suggest parallelizing the FFT? What communication is involved in your scheme? As you scale the problem or machine up or down, where does your approach run into problems. (b) It used to be very popular to build machines with a hypercube internconnection network. (Can you name a couple?) How ca you take advantage of the machine topology to make the FFT as fast as possible? Under what conditions does this pay off? (c) In view of what you know about the LogP model, why do you think so few architectures are built in hypercube topologies anymore?"
(Spring 2004 - Demmel & Shewchuk): "Consider a one-dimensional nonlinear PDE on a domain Omega = [0, 1] with Dirichlet boundary conditions f(0) = f(1) = 0. The domain is divided into n = 2^k identical elements with n + 1 nodes. We wish to find an approximate solution to the PDE using either finite elements or finite differences, whichever you're more comfortable with. We use multigrid to solve the associated linear equations. (a) Describe the matrices that appear in the method, including matrices that are defined implicitly (e.g. the restriction operator). Roughly speaking, what are the contents of these matrices? Why? (b) Which part of the multigrid cycle is most difficult to parallelize on a shared memory multiprocessor? How would you suggest parallelizing it? Each iterate (estimated solution) can be expressed as a Fourier expansion. In principle, we could convert an iterate to its corresponding Fourier coefficients or back at any time by a matrix multiply, though in practice we would use an FFT. Suppose we transform the linear system so it is expressed in term of Fourier coefficients. We want to use multigrid to solve for the Fourier coefficients, and we stay in the Fourier domain throughout the multigrid cycle. The restriction operator and other elements of multigrid do not need to have _exactly_ the same effect as their space-domain counterparts; you can change them slightly to better suit the spectral domain, but they should accomplish the same goals in principle. (c) What do the matrices discussed in part a look like now? (For reasonable implementation choices.) (d) Which part of the spectral multigrid cycle is most difficult to a shared memory multiprocessor? How would you suggest parallelizing it? Does it make any sense to choose the spectral domain for this problem?"
(Fall 1999 - Demmel & Shewchuk): "Q1: You are given a bunch of fish swimming around in a very shallow pond, where for some reason each pair of fish attracts one with a force f = 1/r^3 , r = distance between fish In other, they attract weakly, but not so weakly that you can ignore distant fish. How would you compute the forces on each fish as efficiently as possible, assuming that you wanted justa good approximation? What would you do if the force law changed to f = { 1/r^3 , if r >= 1 { 2-1/r , if r < 1 ? Q2: Consider parallelizing two large 3D numerical grid-based simulations. One uses a uniform regular cubic grid, and the other uses a very unstructured 3D grid distributed among processors using a graph partitioner like Chaco or Metis. The simulation alternates between a computation phase, in which each vertex of the graph receives a new value based on the values of its neighbors, and a communication phase, in which each vertex gathers the values of its neighbors. (a) How should we structure the computation on the grid if we want to overlap computation and communication as much as possible? (b) If the unstructured simulation is run on a shared memory multiprocessor, is the need for a graph partitioner eliminated (or reduced)? Why? (c) If the simulations are run on a network of workstations, how do the structured and unstructured simulations differ in the demands they place on the latency and bandwidth of the communication network? Q3: Graph partitioning is a method for parallelizing and load balancing a variety of applications. Define the graph partitioning problems, and say how it could be used in the following situations. In particular, what is the graph that we partition, and how does a good partition impact the parallelism or complexity of the algorithm? Conjugate gradient method to solve Ax=b for large sparse A; Sparse Choleksy to solve Ax=b; Sparse GEPP to solve Ax=b, where A is a sparse nonsymmetric matrix; A term-document matrix is a matrix with one row for each document in a collection, and one column for each "interesting" term that appears in the document. A(i,j) is the number of times term j appreas in document i. The goal is to partition the documents and partition the terms to discover whether there are subsets of documents which share many terms, and so are likely to be closely related. Now we will ask about some graph partitioning algorithms. They come in two flavors, depending on whether one has cooreindate information associated with each vertex, which is often the case in physical problems. Suppose coordinate information is available, and that most edges tend to be between physically nearby vertices. What is an efficient partitioning algorithm? Suppose coordinate information is not available? What is an efficient partitioning algorithm? Let In(G) be the incidence matrix of an undirected graph G(V,E), i.e., an |N|-by-|E| matrix with one row per vertex and one column per edge. If E=(i,j) is edge k, then In(G) (i,k)=+1 and In(G) (j,k)=-1. Then the Laplacian of G is defined as L(G) = In(G)*(In(G))^T. What do the eigenvalues of L(G) tell you about how G can be partitioned?"
(Spring 1999 - Demmel & Culler): "Q1: Critical design issues in parallel systems often arise at several levels - at the machine level, at the level of programming primitives, at the application logic level - perhaps in subtly different forms. Let's take one issue, say deadlock. Give examples of how deadlock arises in each of these three levels, in a shared-address space framework and also in a message passing framework. What about race conditions? Q2: Parallel processing often deals with making trade-offs among opposing costs, such as load balance vs. communication. Often, different phases of a computation will optimize the tradeoffs differently. For example, consider a basic particle-in-cell (PIC) code. It normally has two phases. Each particle contributes to the field values at the corners of the mesh. Each field value contributes force on each of the particles, causing them to move. The time step is such that the fastest particles only move one cell length per timestep. Discuss how you would partition the mesh and how you would partition the particles. Q3: Given a message passing program, it is very simple to emulate it on a shared address space (SAS) programming model. How? Given a SAS program, can you give a general means of emulating it on a message passing system? What do you require in the message passing system? How might you optimize from that general solution? Q4: This question considers tradeoffs in implementing a variation of the sharks and fish problem: sharks and fish with gravity. We will present a sequence of slightly different situations, and ask which algorithm you would use in each case, and why: (a) Suppose there are only fish. What algorithms are available for this problem? What is their sequential and parallel complexity? Can you briefly say how they work? (b) Which algotihm would you choose if the number of fish, f, is very small? Very large? Why? (c) Now suppose there are some fish and some sharks, all of which are similar in mass. How does the algorithm change? (d) Now suppose fish are very light (like asteroids) so that their gravitational interactions can be ignored, but sharks are heavy (like planets) and so that fish-shark and shark-shark interactions are important. (e) Suppose instead that there are many fish and just a few sharks. How does the algorithm change? (f) Supopse next that communication is very expensive and there are many more fish than sharks. How does the algorithm change?"
August 2000