Our matrix diagonalization algorithm has basically two different kinds of computationally intensive operations: The 3D Fourier transform associated with the LPO, and the matrix products between the various sets of vectors. For the data layout, we had two different possibilities:
As shown in equation (2), a single vector is composed of
all the Fourier grid points lying within a sphere of radius
. To balance the load, we did a block column cyclic
layout with a certain blocking factor, chopping the sphere into slices as shown
in the figure below for two processors.
Of all vectors, each processor holds the components that fall on one of the planes it owns. For each vector, it aligns all the components it owns in a linear array. By arranging the linear arrays of vectors of the same kind into a contiguous array (e.g. all eigenvectors into a matrix denoted by x), the dot products between vectors can be done as matrix products by level 3 BLAS (Basic Linear Algebra subroutine) calls, which we found to run at up to 200 MFLOPs on the Power Challenge.
To perform the LPO on a single vector, the linear array representing the components lying within the sphere has to be unfolded onto a cubic grid as shown in the figure above. (The grid is actually not cubic, but can have different edge lengths, and the angles between the edges need not be 90 degrees). The center of the sphere is located at the corner (0,0,0) of the grid, and the parts other than the first octant are wrapped around. After unfolding, the inverse and forward 3D-FFT for the LPO can be performed. Finally, only the components lying inside the sphere are picked up and folded into the linear array to form the result vector.