next up previous
Next: Changes to the Up: Implementation Previous: Implementation

Data Layout

Our matrix diagonalization algorithm has basically two different kinds of computationally intensive operations: The 3D Fourier transform associated with the LPO, and the matrix products between the various sets of vectors. For the data layout, we had two different possibilities:

  1. Assigning certain components of the vectors to particular processors, e.g. processor 0 holds the components 0...100 of all vectors, processor 1 holds the components 101...200 etc. The matrix products between sets of vectors can be done with very little communication, but the LPO operation requires communication because of the 3D-FFT involved.
  2. Assigning certain vectors to certain processors, e.g. all the vectors labeled 1,2,3 would be stored on processor 0, the ones with labels 4,5,6 on processor 1 etc. We decided against this layout, because load balancing smaller problem sizes would be hard to do, e.g. there might be more processors than there are vectors. Also, on a distributed memory hardware, the minimum memory requirements for a single node would be about 10 to 1000 times higher than for the other layout.

As shown in equation (2), a single vector is composed of all the Fourier grid points lying within a sphere of radius . To balance the load, we did a block column cyclic layout with a certain blocking factor, chopping the sphere into slices as shown in the figure below for two processors.

Of all vectors, each processor holds the components that fall on one of the planes it owns. For each vector, it aligns all the components it owns in a linear array. By arranging the linear arrays of vectors of the same kind into a contiguous array (e.g. all eigenvectors into a matrix denoted by x), the dot products between vectors can be done as matrix products by level 3 BLAS (Basic Linear Algebra subroutine) calls, which we found to run at up to 200 MFLOPs on the Power Challenge.

To perform the LPO on a single vector, the linear array representing the components lying within the sphere has to be unfolded onto a cubic grid as shown in the figure above. (The grid is actually not cubic, but can have different edge lengths, and the angles between the edges need not be 90 degrees). The center of the sphere is located at the corner (0,0,0) of the grid, and the parts other than the first octant are wrapped around. After unfolding, the inverse and forward 3D-FFT for the LPO can be performed. Finally, only the components lying inside the sphere are picked up and folded into the linear array to form the result vector.



next up previous
Next: Changes to the Up: Implementation Previous: Implementation