120 GFLOPS in Matrix-Matrix Multiply using DirectX 9.0
January 31, 2007
This is a matrix-matrix multiply for the GPU that follows the ideas
outlined in the ATI CTM publications
but is written in DirectX 9.0.
(CTM was previously known as DPVM.)
In particular, it uses 4x4 blocking to reduce the bandwidth requirements and fetch-4
for higher-bandwidth cache access. The resulting performance is similar to that
achieved with CTM.
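The effect of 4x4 blocking on bandwidth can be illustrated on the CPU. The following is a minimal sketch in plain C++, not the shader code from the package; the function names matmul_naive and matmul_blocked4 are hypothetical, and the matrices are assumed to be square, row-major, and of a dimension divisible by 4. Each step of the inner loop reads 4 values of A and 4 values of B and performs 16 multiply-adds, that is, 0.5 reads per multiply-add instead of the 2 reads per multiply-add of an unblocked dot product.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Plain triple loop, used only as a reference to check the blocked version.
std::vector<float> matmul_naive(const std::vector<float>& a,
                                const std::vector<float>& b, std::size_t n) {
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// 4x4-blocked multiply of row-major n x n matrices.
// Assumption: n is a multiple of 4 (no edge handling in this sketch).
std::vector<float> matmul_blocked4(const std::vector<float>& a,
                                   const std::vector<float>& b, std::size_t n) {
    assert(n % 4 == 0);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i0 = 0; i0 < n; i0 += 4) {
        for (std::size_t j0 = 0; j0 < n; j0 += 4) {
            float acc[4][4] = {};  // one 4x4 block of C, kept in registers
            for (std::size_t k = 0; k < n; ++k) {
                float av[4], bv[4];
                // 8 reads per step: a 4-element column slice of A
                // and a 4-element row slice of B.
                for (std::size_t r = 0; r < 4; ++r) av[r] = a[(i0 + r) * n + k];
                for (std::size_t s = 0; s < 4; ++s) bv[s] = b[k * n + j0 + s];
                // 16 multiply-adds per step: rank-1 update of the block.
                for (std::size_t r = 0; r < 4; ++r)
                    for (std::size_t s = 0; s < 4; ++s)
                        acc[r][s] += av[r] * bv[s];
            }
            // Write the finished block back to C once.
            for (std::size_t r = 0; r < 4; ++r)
                for (std::size_t s = 0; s < 4; ++s)
                    c[(i0 + r) * n + (j0 + s)] = acc[r][s];
        }
    }
    return c;
}
```

On the GPU, the same idea would presumably map the four-element slices onto four-component texture fetches, but that mapping is an assumption about the package and is not shown here.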
The following package includes the source code, written in C++ using the
Windows API and DirectX 9.0, and the project files for Microsoft Visual C++ 6.0 and
Microsoft Visual Studio 2005. The code was tested only on an ATI Radeon X1900 XT.
Source code (2007-01-31)
The computation is performed in GPU memory.
If the matrices are not already in GPU memory, the inputs are transferred there and the result is transferred back.
The resulting performance for square matrices is presented in Figures 1, 2 and 3.
The testing platform was a 2.8GHz Pentium 4 with an ATI Radeon X1900 XT.
Figure 1: Computational rates achieved.
Figure 2: Transfer bandwidths for uploading the matrices to the GPU memory and
downloading the result.
Figure 3: Breakdown of the runtime when matrices are in main memory. Runtime
has three stages: "uploading" the input matrices to the GPU
memory, "computing" on the GPU, and "downloading" the result back to the main memory.
The stages are timed individually, which causes a slight discrepancy between the
total observed time and the sum of the individual stage times. This difference is
about 1% for dimensions above 400, and peaks at 10% at a dimension of about 300.
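The quantities in the figures can be related by simple arithmetic. The following sketch assumes the conventional count of 2n^3 floating-point operations for an n x n product, which the page does not state explicitly; the struct StageTimes and the function names are hypothetical and do not come from the package. It derives a computational rate from a measured time and expresses the gap between the total observed time and the sum of the stage times as a percentage.

```cpp
#include <cmath>

// Hypothetical container for the three individually timed stages, in seconds.
struct StageTimes {
    double upload;
    double compute;
    double download;
};

// Rate in GFLOPS for an n x n product, assuming 2*n^3 operations.
// Pass the compute time for a GPU-only rate, or the total time for the
// rate seen when the matrices start and end in main memory.
double gflops(double n, double seconds) {
    return 2.0 * n * n * n / seconds / 1e9;
}

// Sum of the individually timed stages.
double sum_of_stages(const StageTimes& t) {
    return t.upload + t.compute + t.download;
}

// Gap between the total observed time and the stage sum,
// as a percentage of the total observed time.
double discrepancy_percent(const StageTimes& t, double total_seconds) {
    return 100.0 * std::fabs(total_seconds - sum_of_stages(t)) / total_seconds;
}
```

For example, under this convention a 1000 x 1000 product counts as 2e9 operations, so completing it in one second would correspond to 2 GFLOPS.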
The Radeon X1900 driver tends to run the GPU at lower clock rates (known as "2D mode").
The ATITool utility was used to restore the advertised clock rates.