CS 267: Lecture 4, Jan 25, 1996

Data Parallel Programming in CM Fortran and Matlab

The execution model for Connection Machine Fortran, or CM Fortran, or just CMF for short, is to execute a single program text sequentially. Parallelism arises by operating on arrays, with each array element conceptually being processed in parallel. Any interprocessor communication is implicit in the array operation. One may think of a "host processor" actually executing the program, occasionally sending instructions to the parallel processors for them to execute in lockstep on the parts of arrays they have in their local memories.

In future versions of this lecture, when High Performance Fortran (HPF) becomes locally available, we will substitute HPF for CMF in this lecture. They are broadly similar languages.

Let us compare the Matlab and CMF solution of the first Sharks and Fish problem, fish swimming in a current. (You should pop up the source code for these two programs in separate windows.) As you can see, they are quite similar.

The main data structures are arrays, like fishp, with the i-th entry describing the i-th fish. fishp(i) is the position of the i-th fish in the 2-dimensional plane, represented as a complex number. (The main data structures in Matlab are arrays of complex numbers).

For example, the statement

      fishp = fishp + dt * fishv
in both programs means to multiply the array of fish velocities, fishv, by the time step dt, to add it to the array of fish positions, fishp, and replace fishp by its new value. (This is a simple approximation, called Euler's method, used to solve Newton's equations of motion.) The arrays must be conformable, i.e. have the same size and shape, in order to be operated on together in this way. The sequential equivalent of this would be
     for i = 1 to nfish (number of fish)
         fishp(i) = fishp(i) + dt*fishv(i)
     end
The CMF compiler will automatically assume that array operation are to be done in parallel, provided the arrays are declared to reside on more than one processor, as we now explain.

Data Layout. Whether or not a statement will be executed in parallel depends on where it is "laid out" in the machine. An array spread across all p processors can hope to enjoy p-fold parallelism in operations, whereas one stored only on the "host" processor will run sequentially. Briefly, scalar data and arrays which do not participate in any array operations (like A=B+C) reside on the host, and the rest are spread over the machine. There are also compiler directives to help insure that arrays are laid out the way the user wants them to be.

The simplest example is

      REAL A(64), B(64), C(64)
CMF$  LAYOUT A( :NEWS ), B( :NEWS ), C( :NEWS )
      A = B + C          
In this code fragment, the CMF$ LAYOUT compiler directive is used to tell CMF where to store the arrays A, B, C. Suppose to we have a 64 processor machine. Then this statement implies that the i-th processor memory stores A(i), B(i), and C(i). Therefore, the assignment statement A=B+C can execute entirely in parallel.

Here is a slightly more complicated example.

      REAL A(64,8), B(64,8), C(64,8)
CMF$  LAYOUT A( :NEWS, :SERIAL ), B( :NEWS, :SERIAL ), C( :SERIAL, :NEWS )
      A = B
      A = C
Again suppose to we have a 64 processor machine. There are (at least!) two natural ways to store a 64-by-8 array:
    Mem 0     Mem 1            ...                Mem 63

    A(1,1)    A(2,1)           ...                A(64,1)
    A(1,2)    A(2,2)           ...                A(64,2)
     ...       ...             ...
    A(1,8)    A(2,8)           ...                A(64,8)
and
    Mem 0     Mem 1    ...   Mem 7    Mem 8  ...   Mem 63

    C(1,1)    C(1,2)   ...   C(1,8)    xxx   ...    xxx
    C(2,1)    C(2,2)   ...   C(2,8)    xxx   ...    xxx
     ...       ...            ...
    C(64,1)   C(64,2)  ...   C(64,8)   xxx   ...    xxx
The former is specified by A( :NEWS, :SERIAL ), which says the first subscript should be stored with different values corresponding to different parallel memories, and different values of the second subscript should be stored sequentially in the same processor memory. Thus, there is parallelism available in the :NEWS direction, but not in the :SERIAL direction. :NEWS is a historical term from the CM-2, where processors were arranged in a rectangular grid and connected to their nearest processors in the North-East-West-South directions. (This is a good example of a language designed to reflect the architectural details of a particular machine, and not changing even though the CM-5 has its processors connected in a completely different topology.)

Since A and B have the same layout, the statement A=B is perfectly parallel. On the other hand, A=C amounts essentially to a matrix transpose, and requires a large amount of communication. Thus, it is possible for two very similar and simple assignments, A=B and A=C, to take vastly different amounts of time. This is an early indication that while programming in CMF (or HPF) may result in simple programs, understading their performance may be difficult. So when optimizing programs for performance, it is essential to make sure data layouts are chosen to minimize communication. When in doubt, calling CMF_describe_array(A) will print out a detailed description of the layout of the argument array.

Continuing our comparison of the Matlab and CMF codes for the first Sharks_and_Fish problem, consider the updating of the mean square velocity

  Matlab:   mnsqvel = [mnsqvel,sqrt(sum(abs(fishv).^2)/length(fishp))];

  CMF   :   fishspeed = abs(fishv)
            mnsqvel = sqrt(sum(fishspeed*fishspeed)/nfish)
We ignore the fact that mnsqvel is a scalar in CMF and an array in Matlab, and concentrate on the computation on the right-hand-side. In both cases, the intrinsic function abs() is applied to a whole array fishv, and then the array elements are summed by the intrinsic function sum(), which reduces to a scalar. Operations like sum(), prod(), max(), min() are called reduction operations because they reduce arrays to scalars.

Now we compare the subroutines, both called current, which compute the current at each fish position. The Matlab solution uses a function which takes an array argument (fishp), and returns an array result (the force on each fish), whereas the CMF solution uses a subroutine which has one input array argument, (fishp), and one output array argument (force).

  Matlab:      X = fishp*sqrt(-1);
               X = 3*X ./ max(abs(X),.01) - fishp;

  CMF   :      force = (3,0)*(fishp*ii)/(max(abs(fishp),0.01)) - fishp
This illustrates the use of the intrinsic abs(), reduction operator max(), with a scalar argument and array argument, and various array operations.

The vector field represented by force (or X) is computed as follows: fishp*sqrt(-1) ( = fishp*ii ) rotates the vectors pointing from the origin to each fish by 90 degrees counterclockwise. These vectors are then scaled to have length 3 (and going to 0 near the origin). By itself, this force would spin the fish around the origin rapidly, with their momentum carrying them quickly away from the origin. To keep them from spiralling way too quickly, we subtract the vector fishp, which attracts fish back to the origin. It is easy to modify this function to get fish to move in other patterns. For example, what multiple of fishp should we subtract so that the fish move in perfect circles, rather than spiralling inward or outward? Is Euler's algorithm an accurate enough approximation of the true motion to compute these circles accurately?

Graphics is handled in different ways by the two systems. CMF uses a 256-by-256 integer array "show" to indicate the presense or absence of fish in a window -zoom <= x,y <= zoom. Each fish position is appropriately scaled to form an integer array x of x-coordinates in the range from 1 to 256, and and integer array y of y-coordinates in the range from 1 to 256, after which only corresponding entries of "show" are set to one. This is done using the parallel construct "for all":

      x = INT(( real(fishp)+zoom)/(zoom)*(m/2))
      y = INT((aimag(fishp)+zoom)/(zoom)*(m/2))
      show = 0
      forall(j=1:nfish) show(x(j),y(j)) = 1
The forall statement in the above code fragment means that the assignment show(x(j),y(j))=1 should be execute for all values of j from 1 to nfish in parallel. Note that we must have some restrictions on the use of forall; what would it mean if x(1)=x(2) and y(1)=y(2), so that show(x(j),y(j)) were assigned to twice, perhaps by different values? For correct and consistent results, this case must be avoided (by the programmer). The compiler permits only simple arithmetic operations and intrinsics to be used on the left-hand-side of statements inside forall constructs.

Now we go on to discuss some other operation supported in CMF.

Array Constructors.

   A = 0                  !  scalar extension to an array
   B = [1,2,3,4]          !  array constructor
   X = [1:n]              !  real sequence 1.0, 2.0, ..., n
   I = [0:100:4]          !  integer sequence 0, 4, 8, ... 100
                          !  note difference from Matlab syntax: (0:4:100)
   C = [ 50[1], 50[2,3] ] !  50 1s followed by 50 pairs of 2,3
   call CMF_Random(A)     !  fill A with random numbers

Conditional Operation. The "where" statement is used to assign just to selected array entries, as shown in the examples below. Only assignment statements are permitted in the body of a "where". "Where"s may not be nested. "Forall" as well as most intrinsic functions take optional boolean mask arguments.

   For example,

       force = (3,0)*(fishp*ii)/(max(abs(fishp),0.01)) - fishp

   could be written

       dist = .01
       where (abs(fishp) > dis) dist = abs(fishp)
       force = (3,0)*(fishp*ii)/dist - fishp

   or

       dist = .01
       far = abs(fishp) > .01
       where (far) dist = abs(fishp)
       force = (3,0)*(fishp*ii)/dist - fishp

   or  

       where ( abs(fishp) .ge. .01 ) 
            dist = abs(fishp)
       elsewhere
            dist = .01
       endwhere
       force = (3,0)*(fishp*ii)/dist - fishp

   or

       forall ( j = 1:nfish, abs(fishp(j)) > .01 )
             force(j) = (3,0)*(fishp(j)*ii)/abs(fishp(j) - fishp(j)

Array Sections. A portion of an array is defined by a triplet in each dimension. It may appear wherever an array is used.

       A(3)              ! third element
       A(1:5)            ! first 5 element
       A(:5)             ! same
       A(1:10:2)         ! odd elements in order
       A(10:-2:2)        ! even elements in reverse order
       A(1:2,3:5)        ! 2-by-3 subblock
       A(1,:)            ! first row
       A(:,j)            ! jth column

Implicit Communication. Operations on conformable array sections may require interprocessor communication. For example, if an assignment statement combines data stored on different processors, these data items will have to be moved to the same processor before the operations can be performed. The choice of when and which data to move is made by the compiler, invisibly to the user. This makes programming seem easy, but it also means that two slightly different assignment statements may generate very different amounts of communication, and so take very different times to execute, as illustrate above with the statements A=B and A=C. Here are some other way to express implicit communication.

       A( 1:10, : ) = B( 1:10, : ) + B( 11:20, : )
					  ! add rows 1:10 and 11:20 of B, to get A
       DA( 1:n-1 ) = ( A( 1:n-1 ) - A( 2:n ) ) / dt    
					  ! a common "finite difference" operation
       C( :, 1:5:2 ) = C( :, 2:6:2 )      ! shift noncontiguous sections
       D = D( 10:1:-1 )                   ! permutation (reverse)

       A = [1,0,2,0,0,0,4]
       I = [1,3,7]
       B = A(I)                           ! B = [1,2,4], a "gather" operation on A
       C(I) = B                           ! C = A, a "scatter" operation on B
					  ! no duplicates in I are permitted for scattering

       D = A([ 1, 1, 3, 3 ])              ! entries of A are replicated in D

       B = CSHIFT( A, 1 )                 ! B = [4,1,0,2,0,0,0], circular shift
       B = EOSHIFT( A, -1 )               ! B = [0,2,0,0,0,4,0], end-off shift
       B = TRANSPOSE( H )                 ! matrix transpose
       B = SPREAD(A,2,3)                  ! if B is 3-by-7 then
                                          ! B = [ 1,0,2,0,0,0,4 ]
                                          !     [ 1,0,2,0,0,0,4 ]
                                          !     [ 1,0,2,0,0,0,4 ]
To illustrate some of these operations, let us compare the Matlab and CMF solution of the second Sharks and Fish problem, fish moving with gravity. (You should pop up the source code for these two programs in separate windows.) As you can see, they are again quite similar.

The main difference between simulating gravity and current is this: With current, the force on each fish is independent of the force on any other fish, and so can be computed very simply in parallel for constant work per fish. With gravity, the force on each fish depends on the locations of all the other fish:

   for i = 1 to number_of_fish
       force_on_fish(i) = 0
       for j = 1 to number_of_fish
           if (j .ne. i)
               force_on_fish(i) += force on fish i from fish j
           endif
       endfor
   endfor
This straightforward algorithm requires all fish to communicate with all other fish, and does work growing proportionally to the number_of_fish per fish (or O(number_of_fish^2) overall). Later, we will examine faster algorithms that only do a logarithmic amount of work or less per fish, but for now we will stick with the simple algorithm.
  Matlab:
      nfish = length(fishp);
      perm = [nfish,(1:nfish-1)];	! perm=[ nfish, 1, 2, ... , nfish-1 ]
      force = 0;
      fishpp=fishp;
      fishmp=fishm;
      for k=1:nfish-1,
         fishpp = fishpp(perm);		! do a circular shift of fishpp
         fishmp = fishmp(perm);		! do a circular shift of fishmp
         dif = fishpp - fishp;
         force = force + fishmp .* fishm .* dif ./ max(abs(dif).^2,1e-6) ;
      end

  CMF:
      force = 0
      fishpp = fishp
      fishmp = fishm
      do k=1, nfish-1
        fishpp(1:nfish) = cshift(fishpp(1:nfish), DIM=1, SHIFT=-1)
        fishmp(1:nfish) = cshift(fishmp(1:nfish), DIM=1, SHIFT=-1)
        dif = fishpp - fishp
        force = force + fishmp * fishm * dif / (abs(dif)*abs(dif))
      enddo
Both algorithm work as follows. The fish positions fishp and fish masses fishm are copied into other arrays fishpp and fishmp, respectively. Then fishpp and fishmp are "rotated" nfish-1 times, so that each fish in fishp is aligned once with each fish in fishpp. For example, after step j, fishpp(i) contains the position of fishp(i+j), where i+j wraps around to 1 modulo nfish. This lets each fish "visit" every other fish to make the necessary force calculation. The rotation is done in Matlab by subscripting by the array perm of rotated subscripts. This could be done in CMF as well, but it is faster to use the built in circular shift operations cshift.

Scan, or "Parallel Prefix" operations. We discuss these briefly here, and return to them in a later lecture on algorithms.

    B(1) = A(1)
    do i=2,5
       B(i) = B(i-1)+A(i)
    enddo
          ! forward running sum computed sequentially

    forall (i=1:5) B(i) = SUM( A( 1:i ) )
          ! forward running sum computed with forall

    CMF_SCAN_ADD( B, A, ... )
          ! forward running sum computed using scan operation



    FACT(1) = 1
    do i=2,5
       FACT(i) = FACT(i-1)*i
    enddo
          ! factorial computed sequentially

    INTS = [1:n]
    forall (i=1:n) FACT( i ) = PRODUCT( INTS( 1:i ) )
          ! factorial computed with forall

    CMF_SCAN_MUL( FACT, INT, ... )
          ! factorial computed using scan operation

Libraries and Intrinsics. CM Fortran has a very large library of routines available for moving data and doing numerical computations. If you can find an appropriate library routine, it will often run faster than coding it oneself. CMSSL (CM Scientific Subroutine Library) contains many routines for basic linear algebra (BLAS), solving systems of linear equations, finding eigenvalues, doing FFTs, etc., and is highly optimized for the CM-5. We will discuss algorithms used by CMSSL and similar libraries later in the course.

Documentation for the CM-5 may be found in the course reference material, or by logging in to rodin and typing cmview.