The surface plots give an overall picture of the accuracy of our timing model. The upper surface is the measured run time and the lower surface is the run time predicted by our model. The model tends to underestimate the run time, but it tracks the measured performance reasonably well.
We believe that our inverse 3D-FFT ran slower than predicted because the timing measurements were taken on a non-dedicated machine. The SGI Power Challenge at NCSA was always loaded with a mix of serial and parallel jobs. Studies have shown that parallel applications are especially affected by a non-dedicated environment, because competing jobs throw the parallel processes out of synchronization. The following graphs show that the gap between our measured and predicted run times widens as the number of processors increases from one, which supports this explanation.
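The comparison described above can be sketched in a few lines of Python. This is only an illustration of how such a gap might be computed, not the procedure we used to produce the graphs. The function names are hypothetical, the choice of a relative (rather than absolute) gap is an assumption, and the timing values in the usage lines are placeholders rather than measurements from this study.

```python
def prediction_gaps(measured, predicted):
    """Return {processors: (absolute_gap, relative_gap)}.

    measured, predicted: dicts mapping processor count -> run time (seconds).
    The relative gap is expressed as a fraction of the measured run time.
    """
    gaps = {}
    for procs in sorted(measured):
        m = measured[procs]
        p = predicted[procs]
        gaps[procs] = (m - p, (m - p) / m)
    return gaps


def gap_grows_with_processors(gaps):
    """True if the relative gap never shrinks as the processor count rises."""
    ordered = [gaps[procs][1] for procs in sorted(gaps)]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


# Placeholder values for illustration only; NOT data from this study.
measured = {1: 10.0, 2: 6.0, 4: 4.0}
predicted = {1: 9.5, 2: 5.0, 4: 3.0}

gaps = prediction_gaps(measured, predicted)
widening = gap_grows_with_processors(gaps)
```

A positive gap means the model underestimated the run time. Using the relative gap keeps runs of different lengths comparable across processor counts.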
Notice that the run time for Problem 4 is not significantly worse than that for the two smaller problems. This is because the grid size in this case is 64x64x64, which is especially favorable for the inverse 1D-FFT, presumably because 64 is a power of two, the case that FFT algorithms typically handle most efficiently. Observe also that our model seems less accurate for the larger problem sizes. We believe this is because larger problems involve more communication and, with PVM running over sockets, communication is especially sensitive to OS scheduling. In short, we think that the discrepancy between observed and predicted performance stems from the fact that our model does not account for the effects of a non-dedicated environment.
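To make the remark about the 64x64x64 grid concrete, the sketch below checks whether a grid dimension is a power of two and counts the 1D transforms needed along each axis. It assumes the 3D transform is computed as successive 1D-FFTs along each axis and that the 1D routine is a radix-2 algorithm. Both are assumptions made for illustration, and the function names are hypothetical.

```python
def is_power_of_two(n):
    """True if n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def radix2_stages(n):
    """Number of butterfly stages a radix-2 FFT of length n needs (log2 n)."""
    if not is_power_of_two(n):
        raise ValueError("radix-2 FFT needs a power-of-two length")
    return n.bit_length() - 1


def lines_per_axis(nx, ny, nz):
    """Number of 1D transforms along each axis of an nx x ny x nz grid."""
    return {"x": ny * nz, "y": nx * nz, "z": nx * ny}


stages = radix2_stages(64)         # 6 stages for a length-64 transform
lines = lines_per_axis(64, 64, 64)  # 4096 one-dimensional transforms per axis
```

A dimension that is not a power of two, such as 48, cannot use the pure radix-2 path and would need a mixed-radix or padded transform, which is one plausible reason a 64-point grid performs comparatively well.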