CS 267: Lecture 7, Feb 6, 1996

Parallel Programming with Split-C

Split-C was designed at Berkeley, and is intended for distributed memory multiprocessors. It is a small SPMD extension to C, and meant to support programming in data parallel, message passing and shared memory styles. Like C, Split-C is "close" to the machine, so understanding performance is relatively easy. Split-C is portable, and runs on the

  • Thinking Machines CM-5,
  • Intel Paragon,
  • IBM SP-2,
  • Meiko CS-2,
  • Cray T3D,
  • Sun multiprocessors (e.g. a quad-processor SS-10 or SS-20 running Solaris), and
  • NOW (network of workstations).

    The best document from which to learn Split-C is the tutorial Introduction to Split-C. There is a debugger available as well: Mantis.

    Extensions of Split-C to include features of multithreading (as introduced in Lecture 6) and C++ classes are under development, and will be released soon.

    We begin with a general discussion of Split-C features, and then discuss the solution to Sharks & Fish problem 1 in detail. The most important features of Split-C are

  • An SPMD programming style. There is one program text executed by all processors.
  • A 2-dimensional address space for the entire machine's memory. Every processor can access every memory location via addresses of the form (processor number, local address). Thus, we may view the machine memory as a 2D array with one row per processor, and one column per local memory location. For example, in the following figure we have shaded in location (1,4).

  • Global pointers. These pointers are global addresses of the form just described, (processor number, local address), and can be used much as regular C pointers are used. For example, the assignment
    *local_pointer = *global_pointer
    
    gets the data pointed to by the global_pointer, wherever it resides, and stores the value at the location indicated by local_pointer.
  • Spread Arrays. These are 2- (or more) dimensional arrays that are stored across the processor memories. For example, A[i][j] may refer to word j on processor i. Spread arrays and global pointers together support a kind of shared memory programming style.
  • Split phase assignment. In the above example, "*local_pointer = *global_pointer", execution of the statement must complete before the program continues. If this requires communication with another processor, the processor remains idle until *global_pointer is fetched. It is possible to overlap computation and communication by beginning this operation, doing other useful work, and waiting later for it to complete. This is done as follows:
          *local_pointer := *global_pointer
          ... other work not requiring *local_pointer ...
          synch()
    
    The "split-phase" assignment operator := initiates the communication, and synch() waits until it is complete.
  • Atomic Operations are short subroutines which are guaranteed to be executed by one processor at a time. They provide an implementation of mutual exclusion, and the body of the subroutine is called a critical section.
  • A library, including extensive reduction operations, bulk memory moves, etc.

    Pointers to Global Data

    There are actually three kinds of pointers in Split-C:
  • local pointers,
  • global pointers, and
  • spread pointers.

    Local pointers are standard C pointers, and refer to data only on the local processor. The other two kinds can point to any word in any processor's memory, and consist of a pair (processor number, local pointer). Spread pointers are associated with spread arrays, and will be discussed below. Here are some simple examples to show how global pointers work. First, pointers are declared as follows:
        int *Pl, *Pl1, *Pl2;                          /*  local pointers   */
        int *global Pg, *global Pg1, *global Pg2;     /*  global pointers  */
        int *spread Ps, *spread Ps1, *spread Ps2;     /*  spread pointers  */
    
    The following assignment sends messages to fetch the data pointed to by Pg1 and Pg2, brings them back, and stores their sum locally:
        *Pl = *Pg1 + *Pg2
    
    Execution does not continue until the entire operation is complete. Note that the program on the processors owning the data pointed to by Pg1 and Pg2 does not have to cooperate in this communication in any explicit way. Thus it is very much like a shared memory operation, even though it is implemented on a distributed memory machine: in effect the processors owning the remote data are briefly interrupted, the data is fetched and sent back to the requesting processor, and the owners continue. In particular, there is no notion of matched sends and receives as in a message passing programming style. Rather than a send or receive, the operation performed is called a get, to emphasize that the processor owning the data need not anticipate the request for data.

    The following assignment stores data from local memory into a remote location:

        *Pg = *Pl
    
    As before, the processor owning the remote data need not anticipate the arrival of the message containing the new value. This operation is called a put.
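
    As a small illustration (a sketch, not from the lecture; the one-word-per-processor spread array and the cast to a global pointer borrow the style used in the Atomic Operations section below), each processor can put its own rank into a slot owned by its right-hand neighbor, without that neighbor doing anything special:

        static int slot[PROCS]:: ;            /* one int per processor                 */
        splitc_main()
        {
            int *global Pg = (int *global)(slot + (MYPROC+1) % PROCS);
            *Pg = MYPROC;                     /* put: the neighbor need not cooperate  */
            barrier();                        /* afterwards the local slot[MYPROC]
                                                 holds the left-hand neighbor's rank   */
        }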

    Global pointers permit us to construct distributed data structures which span the whole machine. For example, the following defines a binary tree that spans processors. The nodes of this tree can reside on any processor, and traversing the tree in the usual fashion, following pointers to child nodes, works without change.

         typedef struct global_tree *global gt_ptr;
         typedef struct global_tree {
             int value;
             gt_ptr left_child;
             gt_ptr right_child;
         } g_tree;
    
    We will discuss how to design good distributed data structures later when we discuss the Multipol library.
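
    For instance, a search routine written for an ordinary local tree carries over almost verbatim. The sketch below is not from the lecture; it assumes the declarations above, that the tree is ordered as a binary search tree, and that a global struct pointer may be dereferenced with -> (subject to the restrictions noted at the end of this section). Each dereference simply becomes a get when the node is remote:

         int tree_contains(gt_ptr node, int key)
         {
             while (node) {                          /* stop at an empty subtree             */
                 int v = node->value;                /* a get if node lives on another processor */
                 if (v == key) return 1;
                 node = (key < v) ? node->left_child /* follow global pointers as usual      */
                                  : node->right_child;
             }
             return 0;
         }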

    Global pointers offer us the ability to write more complicated and flexible programs, but they also introduce new kinds of bugs. The following code illustrates a race condition, where the answer depends on which processor executes "faster". Initially, processor 3 owns the data pointed to by the global pointer i, and its value is 0:

            Processor 1             Processor 2
            *i = *i + 1             *i = *i + 2
            barrier()               barrier()
            print 'i=', *i
    
    It is possible to print out i=1, i=2 or i=3, depending on the order in which the 4 global accesses to *i occur. For example, if
      processor 1 gets *i (=0)
      processor 2 gets *i (=0)
      processor 1 puts *i (=0+1=1)
      processor 2 puts *i (=0+2=2)
    
    then processor 1 will print "i=2". We will discuss programming styles and techniques that attempt to avoid this kind of bug.

    A more interesting example of a potential race condition is a job queue, a data structure for distributing chunks of work of unpredictable sizes to different processors. We will discuss this example below after we present more features of Split-C.

    Global pointers may be incremented like local pointers: if Pg = (processor,offset), then Pg+1 = (processor,offset+1). This lets one index through a remote part of a data structure. Spread pointers differ from global pointers only in this respect: if Ps = (processor,offset), then

       Ps+1 = (processor+1, offset)   if processor < PROCS-1, or
            = (0, offset+1)           if processor = PROCS-1
    
    where PROCS is the number of processors. In other words, viewing the memory as a 2D array, with one row per processor and one column per local memory location, incrementing Pg moves the pointer across a row, and incrementing Ps moves the pointer down a column. Incrementing Ps past the end of a column moves Ps to the top of the next column.
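
    As a small sketch (not from the lecture), the following fragment walks one column: it reads the same local offset on each processor in turn by repeatedly incrementing a spread pointer. The one-word-per-processor declaration follows the style used later in these notes, and we assume, as in the cast used there, that the spread array name behaves as a spread pointer:

        static int val[PROCS]:: ;          /* one int per processor, same offset everywhere */
        int column_sum()
        {
            int *spread Ps = val;          /* (processor 0, offset of val)                  */
            int sum = 0, k;
            for (k = 0; k < PROCS; k++) {
                sum = sum + *Ps;           /* a get from processor k                        */
                Ps = Ps + 1;               /* same offset, next processor (down the column) */
            }
            return sum;
        }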

    The local part of a global or spread pointer may be extracted using the function to_local.

    Only local pointers may be used to point to procedures; neither global nor spread pointers may be used this way. There are also some mild restrictions on the use of dereferenced global and spread pointers; see the last section of the Split-C tutorial.

    Spread Arrays and Spread Pointers

    A spread array is declared to exist across all processor memories, and is referenced the same way by all processors. For example,
        static int A[PROCS]::[10]
    
    declares an array of 10 integers in each processor's memory. The double colon is called the spreader, and indicates that subscripts to its left index across processors, while subscripts to its right index within a processor. So, for example, A[i][j] is stored at location to_local(A)+j on processor i. In other words, the 10 words on each processor reside at the same local memory locations.
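
    As a sketch of what this means in practice (not from the lecture), each processor can fill its own row of A with purely local writes, while reading another processor's row turns into a get:

        static int A[PROCS]::[10];
        splitc_main()
        {
            int j, from_neighbor;
            for (j = 0; j < 10; j++)
                A[MYPROC][j] = MYPROC;                  /* local writes: row MYPROC is local */
            barrier();                                  /* wait until every row is written   */
            from_neighbor = A[(MYPROC+1) % PROCS][0];   /* remote read (a get)               */
        }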

    The declaration

        static double A[PROCS][m]::[b][b]
    
    declares a total of PROCS*m*b^2 double precision words. You may think of this as PROCS*m groups of b^2 doubles being allocated to the processors in round robin fashion. The memory per processor is b^2*m double words. A[i][j][k][l] is stored in processor
         (i*m+j) mod PROCS, 
    
    and at offset
         to_local(A) + b^2*floor( (i*m+j)/PROCS ) + k*b+l
    
    In the figure below, we illustrate the layout of A[4][3]::[8][8] on 4 processors. Each wide light-gray rectangle represents 8*8=64 double words. The two wide dark-gray rectangles represent wasted space. The two thin medium-gray rectangles are the very first word, A[0][0][0][0], and A[1][2][7][7], respectively.
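
    The layout formulas can be summarized in a small helper routine. The locate() function below is purely illustrative (it is not part of Split-C); it merely restates, in C, where A[i][j][k][l] of the declaration above lives:

        /* For "static double A[PROCS][m]::[b][b]": block (i,j) of b*b doubles is dealt
           round robin, so A[i][j][k][l] lives on processor (i*m+j) mod PROCS, in block
           number (i*m+j)/PROCS past to_local(A), at position k*b+l within that block. */
        void locate(int i, int j, int k, int l, int m, int b,
                    int *proc, int *doubles_past_to_local)
        {
            int block = i*m + j;
            *proc = block % PROCS;
            *doubles_past_to_local = (block / PROCS)*b*b + k*b + l;
        }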

    In addition to declaring static spread arrays, one can malloc them:

       int *spread w = all_spread_malloc(10, sizeof(int))
    
    This is a synchronous, or blocking, subroutine call (like the first kind of send and receive we discussed in Lecture 6): all processors must participate, and each waits until every processor has made the call, so they should call it at about the same time to avoid sitting idle. The value returned in w is a pointer to the first word of the array on processor 0:
         w = (0, local address of first word on processor 0).
    

    (A nonblocking version, int *spread w = spread_malloc(10, sizeof(int)), executes on just one processor, but allocates the same space as before, on all processors. Some internal locking is needed to prevent allocating the same memory twice, or even deadlock. However, this only works on the CM-5 implementation and its use is discouraged.)

    Split Phase Assignment

    The split phases referred to are the initiation of a remote read (or write), and blocking until its completion. This is indicated by the assignment operator ":=". The statement
         c := *global_pointer                 ...   c is a local variable
         ... other work not involving c ...
         synch()
         b = b + c                            ...   b is a local variable
    
    initiates a get of the data pointed to by global_pointer, does other useful work, and only waits for c's arrival when c is really needed, by calling synch(). This is also called prefetching, and permits communication (getting c) and computation to run in parallel. The statement
         *global_pointer := b
    
    similarly launches a put of the local data b into the remote location global_pointer, and immediately continues computing. One can also wait until an acknowledgement is received from the processor receiving b, by calling synch().

    Being able to initiate a remote read (or get) and remote write (or put), go on to do other useful work while the network is busy delivering the message and returning any response, and only waiting for completion when necessary, offers several speedup opportunities.

  • It allows one to compute and communicate in parallel, as illustrated by the above example; prefetching in this way hides the latency of the communication network.
  • Split-phase assignment lets one do many communications in parallel, if this is supported by the network (it often is). For example,
       /* lxn and sum are local variables; Gxn is a global pointer */
       lx1 := *Gx1             
       lx2 := *Gx2
       lx3 := *Gx3
       lx4 := *Gx4
       synch()
       sum = lx1 + lx2 + lx3 + lx4
    
    can have up to 4 gets running in parallel in the network, and hides the latency of all but the last one.
  • By avoiding the need to have processors synchronize on a send and receive, idle time spent waiting for another processor to send or receive data can be avoided; a processor simply gets the data when it needs it.
  • The total number of messages in the system is decreased compared to using send and receive. A synchronous send and receive actually requires 3 messages (see the figure below), of which only the last contains the data. In contrast, a put requires one message carrying the data plus one acknowledgement, and a get similarly requires just 2 messages instead of 3. For small messages, this is 2/3 as much message traffic. This is illustrated in the figure. Here, time is the vertical axis in each picture, and the types of arrows indicate what the processor is doing during that time.
  • Instead of synching on all outstanding puts and gets, it is possible to synch just on a selected subset of puts and gets, by associating a counter just with those puts and gets of interest. The counter is automatically incremented whenever a designated put or get is initiated, and automatically decremented when an acknowledgement is received, so one can test if all have been acknowledged by comparing the counter to zero. See section 10.5 of Introduction to Split-C for details.

    The freedom afforded by split-phase assignment also opens the door to new kinds of bugs. The following example illustrates a loss of sequential memory consistency. Sequential consistency means that the outcome of the parallel program is consistent with some interleaved sequential execution of the PROCS different sequential programs. For example, if there are two processors, where processor 1 executes instructions instr1.1, instr1.2, instr1.3, ... in that order, and processor 2 similarly executes instr2.1, instr2.2, instr2.3, ... in order, then the parallel program must be equivalent to executing both sets of instructions in some interleaved order such that instri.j is executed before instri.(j+1). The following are examples of consistent and inconsistent orderings:

        Consistent      Inconsistent
         instr1.1         instr1.1
         instr2.1         instr2.2   *out of order
         instr1.2         instr1.2
         instr2.2         instr2.1   *out of order
         instr1.3         instr1.3
         instr2.3         instr2.3
         ...              ...
    
    Sequential consistency, or having the machine execute your instructions in the order you intended, is obviously an important tool if you want to predict what your program will do by looking at it. Sequential consistency can be lost, and bugs introduced, when the program mistakenly assumes that the network delivers messages in the order in which they were sent, when in fact the network (like the post office) does not guarantee this.

    For example, consider the following program, where data and data_ready_flag are global pointers to data owned by processor 2, both of which are initially zero:

            Processor 1              Processor 2
            *data := 1               while (*data_ready_flag != 1) {/* wait for data*/}
            *data_ready_flag := 1    print 'data=',*data
    
    From Processor 1's point of view, first *data is set to 1, then the *data_ready_flag is set. But Processor 2 may print either data=0 or data=1, depending on which message from Processor 1 is delivered first. If data=0 is printed, this is not sequentially consistent with the order in which Processor 1 has executed its instructions, and probably will result in a bug. Note that this bug is nondeterministic, i.e. it may or may not occur on any particular run, because it is timing dependent. These are among the hardest bugs to find!
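
    One way to get the intended behavior (a sketch, not from the lecture) is for Processor 1 to wait until the put of *data has been acknowledged before it sets the flag:

            Processor 1                  Processor 2
            *data := 1                   while (*data_ready_flag != 1) {/* wait for data*/}
            synch()                      print 'data=',*data
            *data_ready_flag := 1
    
    Here synch() does not return until the put of *data has been acknowledged, so Processor 2 can only observe *data_ready_flag == 1 after *data has arrived. (Using the blocking assignment "=" for *data would have the same effect.)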

    This sort of hazard is not an artifact of Split-C, but in fact occurs when programming several shared memory machines with caches, as discussed in Lecture 3. So it is a fact of life in parallel computing.

    In addition to the split phase assignments put (global := local) and get (local := global), there is one more called store, which is written

         global :- local
    
    The difference between store and put is that store provides no acknowledgement to the sender of the receipt, whereas put does. This is illustrated by the last figure above. Thus, store reduces yet further the total number of messages in the network, which means the network can spend yet more time sending useful data rather than acknowledgements.

    To be able to use the data stored on a processor, one still needs to know whether it has arrived. There are two ways to do this. The simplest way is to call store_sync(n), which waits until n bytes have been stored in the memory of the processor executing store_sync(n). This presumes the parallel algorithm is designed so that one knows how much data to expect.

    For example, the following code fragment stores the transpose of the spread array A in B:

      static int A[n]::[n], B[n]::[n];
      for (i = 0; i < n; i++)
         B[MYPROC][i] = A[i][MYPROC];
    
    This is a slow implementation because use of "=" in the assignment means at most one communication occurs at a time. We may improve this by replacing "=" by ":=":
      static int A[n]::[n], B[n]::[n];
      for (i = 0; i < n; i++)
         B[MYPROC][i] := A[i][MYPROC];
      sync();
    
    Now there can be several communications going on simultaneously, and one only has to wait for one's own part of B to arrive before continuing. But there are still twice as many messages in the network as necessary, leading us to
      static int A[n]::[n], B[n]::[n];
      for (i = 0; i < n; i++)
         B[i][MYPROC] :- A[MYPROC][i];
      store_sync(n*sizeof(int));        /* wait for the n ints destined for this processor */
    
    Now the number of messages in the system is minimal, and each processor continues computing as soon as the n values destined for it (n*sizeof(int) bytes) have arrived.

    But this code still has a serious bottleneck: the first thing all the processors try to do is send a message to processor 0, which owns B[0][MYPROC] for all MYPROC. This means processor 0 is a serial bottleneck, followed by processor 1, and so on. It is better to have message destinations evenly distributed across all processors. Note that in the following code fragment, for each value of i, the stores issued by the different processors all have different processor destinations:

      static int A[n]::[n], B[n]::[n];
      for (i = 0; i < n; i++)
         B[(i+MYPROC) % n][MYPROC] :- A[MYPROC][(i+MYPROC) % n];
      store_sync(n*sizeof(int));
    

    It is also possible to ask if all store operations on all processors have completed by calling all_store_sync(). This functions as a barrier, which all processors must execute before continuing. Its efficient implementation depends on some special network hardware on the CM-5 (to be discussed briefly later), and will not necessarily be fast on other machines without similar hardware.

    Atomic Operations

    Consider again the following example, which illustrates a race condition:
            Processor 1             Processor 2
            *i = *i + 1             *i = *i + 2
            barrier()               barrier()
            print 'i=', *i
    
    Here *i is a global pointer pointing to a location on processor 3. Recall that either 'i=1', 'i=2' or 'i=3' may be printed, depending on the order in which the 4 memory accesses occur (2 gets of *i and 2 puts to *i). To avoid this, we encapsulate the incrementation of *i in an atomic operation, which guarantees that only one processor may increment *i at a time.
       static int x[PROCS]:: ;
       void add(int *i, int incr)
       {
          *i = *i + incr;                      /* body of the critical section */
       }
       splitc_main()
       {
           int *global i = (int *global)(x+3); /* make sure i points to x[3] */
           if ( MYPROC == 3 ) *i = 0;          /* initialize *i */
           barrier();                          /* after this, all processors see *i=0 */
           if ( MYPROC == 1 )
              atomic( add, i, 1 );             /* executed only by processor 1 */
           else if ( MYPROC == 2 )
              atomic( add, i, 2 );             /* executed only by processor 2 */
       }
    
    Atomic(procedure, arg1, arg2) permits exactly one processor at a time to execute procedure(arg1, arg2). Other processors calling Atomic(procedure, arg1, arg2) at the same time queue up, and are permitted to execute procedure(arg1, arg2) one at a time. Atomic procedures should be short and simple (since they are by design a serial bottleneck), and are subject to a number of restrictions described in section 8.1 of Introduction to Split-C. Computer science students who have studied operating systems will be familiar with this approach, which is called mutual exclusion, since one processor executing the atomic procedure excludes all others. The body of the atomic procedure is also sometimes called a critical section.

    Here is a particularly useful application of an atomic operation, called a job queue. A job queue keeps a list of jobs to be farmed out to idle processors. The jobs have unpredictable running times, so if one were to simply assign an equal number of jobs to each processor, some processors might finish long before others and so remain idle. This unfortunate situation is called load imbalance, and is clearly an inefficient use of the machine. The job queue tries to avoid load imbalance by keeping a list of available jobs and giving a new one to each processor after it finishes the last one it was doing. The job queue is a simple example of a load balancing technique, and we will study several others. It assumes all the jobs can be executed independently, so it doesn't matter which processor executes which job.

    The simplest (and incorrect) implementation of a job queue one could imagine is this:

       static int x[PROCS]:: ;
       splitc_main()
       {
           int job_number;
           int *global cnt = (int *global)(x);  /* make sure cnt points to x[0] */
           if ( MYPROC == 0 ) *cnt = 100;       /* initialize *cnt to number of
                                                   jobs initially available */
           barrier();
           while ( *cnt > 0 )                   /* while jobs remain to do */
           {
               job_number = *cnt;               /* get number of next available job */
               *cnt = *cnt - 1;                 /* remove job from job queue */
               work(job_number);                /* do job associated with job_number */
            }
        }
    
    The trouble with this naive implementation is that two or more processors may get *cnt at about the same time and get the same job_number to do. This can be avoided by decrementing *cnt in a critical section:
       static int x[PROCS]:: ;
       void fetch_and_add_atomic(int proc, void *return_val, int *addr, int incr_val )
       {
          int tmp = *addr;
          *addr = *addr + incr_val;
          atomic_return_i(tmp);
       }
    
       int fetch_and_add( int *global addr, int incr_val )
       {
           return atomic_i( fetch_and_add_atomic, addr, incr_val );
       }
    
       splitc_main()
       {
           int job_number;
           int *global cnt = (int *global)(x);     /* make sure cnt points to x[0] */
           if ( MYPROC == 0 ) *cnt = 100;          /* initialize *cnt to number of
                                                      jobs initially available */
           barrier();
           while ( *cnt > 0 )                      /* while jobs remain to do */
           {
               job_number = fetch_and_add(cnt,-1); /* get number of next available job */
               work(job_number);                   /* do job associated with job_number */
           }
       }
    
    Fetch_and_add(addr, incr_val) atomically increments *addr by incr_val and returns its old value.

    Split-C Library Overview

    In addition to the functions described above, there are a great many others available. Here are some examples:
  • bulk_read, bulk_write, bulk_get, bulk_put, and bulk_store play the roles of local = global, global = local, local := global, global := local and global :- local, respectively, on blocks of data larger than one word. Use of these routines can save overhead on communication of blocks of data.
  • is_sync returns a boolean indicating whether all outstanding puts and gets have completed, without blocking.
  • When the above procedures are appended with _ctr, then synchronization can be done on a user-specified subset of get, put or store operations, rather than all of them.
  • all_spread_free frees space allocated by all_spread_malloc.
  • all_reduce_to_one_add returns the sum of a set of p numbers spread across the processors (a small usage sketch follows this list). Many other reduction operations are available.
  • all_scan_add computes the parallel prefix sum of p numbers spread across the processors. Many other parallel prefix operations are available.
  • get_seconds() returns the value of a timer in seconds.
  • g_strlen and many other global string manipulation procedures are available.
  • fetch_and_add (which was used to implement the job-queue above), exchange, test_and_set, and cmp_and_swap are provided as atomic operations.
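
    As a small usage sketch combining the reduction and timing routines listed above (not from the lecture; beyond what is stated above, details such as the sum being available on processor 0 are assumptions here):

       #include <stdio.h>

       splitc_main()
       {
           double t0 = get_seconds();
           int total = all_reduce_to_one_add(MYPROC + 1);    /* sum of 1..PROCS           */
           on_one { printf("total = %d in %f seconds\n",
                           total, get_seconds() - t0); }     /* printed by processor 0    */
       }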

    A quick look at Sharks and Fish in Split-C

    A copy of the Split-C solution to the first Sharks and Fish problem is available online; the same directory contains the other files used in the Split-C solution.

    Find the string "splitc_main" to examine the main procedure. The fish are spread among the processors in a spread array allocated by

         fish_t *spread fishes = all_spread_malloc(NFISH, sizeof(fish_t));
    
    Here fish_t is a struct (defined near the top of the file) containing the position, velocity and mass of a fish, and NFISH = 10000 is a constant. The next line
         int num_fish = my_elements(NFISH)
    
    uses an intrinsic function to return the number of fish stored on the local processor. Then
         fish_t *fish_list = (fish_t *)&fishes[MYPROC];
    
    provides a local pointer to the beginning of the local fish. In other words, the local fish are stored contiguously as fish_list[0] through fish_list[num_fish-1], occupying num_fish*sizeof(fish_t) bytes of local memory.
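
    Since these are ordinary local pointers, the local fish can be processed with plain C. In the sketch below (not from the actual program) the field name mass is an assumption; the lecture only says that fish_t holds a fish's position, velocity and mass:

         double local_mass = 0.0;
         int i;
         for (i = 0; i < num_fish; i++)
             local_mass += fish_list[i].mass;     /* purely local accesses */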

    The rest of the main routine calls

        all_init_fish (num_fish, fish_list) to initialize the local fish,
        all_do_display(num_fish, fish_list) to display the local fish periodically, and
        all_move_fish (num_fish, fish_list, dt, &max_acc, &max_speed, &sum_speed_sq)
            to move the local fish and return their maximum acceleration, etc.
    
    The global reduction operation
     
        max_acc = all_reduce_to_all_dmax(max_acc);
    
    reduces all the local maximum accelerations to a global maximum. The other two all_reduce_to_all calls are analogous.

    all_init_fish() does purely local work, so let us next examine all_do_display. This routine first declares a spread pointer map, and then calls

        map = all_calculate_display(num_fish, fish_list);
    
    to compute the 2D image of the fish in a spread array and return a spread pointer to it in map. The map is displayed by calling
        all_display_fish(map);
    
    to pass the map to the host (which handles the X-window display), doing a barrier to make sure all the processors have passed their data to the host, and then having only processor 0 display the data via
         on_one {X_show();}
    

    all_calculate_display works as follows. On the first call, a spread array of size DISPLAY_SIZE-by-DISPLAY_SIZE (256-by-256) is allocated, with map pointing to it. The statements

        for_my_1d(i, DISPLAY_SIZE*DISPLAY_SIZE) { 
            map[i] = 0; /* blue */ }
        barrier();
    
    have each processor initialize its local entries of the spread array to 0 (blue water, i.e. no fish). The Split-C macro for_my_1d loops over just those values of i from 0 to DISPLAY_SIZE*DISPLAY_SIZE-1 such that map[i] is stored locally.

    The next loop runs over all locally stored fish, computes the scaled coordinates (x_disp,y_disp) of each fish, where 0 <= x_disp, y_disp < DISPLAY_SIZE, and atomically adds 1 to map[x_disp,y_disp] to indicate the presence of a fish at that location (we use poetic license here by addressing map as a 2D array, whereas the code addresses it as a 1D array). The final all_atomic_sync waits until all the processors have finished updating map.

    After all_calculate_display returns in all_do_display, map contains an image of the current fish positions, with map[x_disp,y_disp] containing the number of fish at scaled location (x_disp,y_disp). Next, all_do_display calls all_display_fish(map) to transfer the data to the host. The first time all_display_fish is called it has processor 0 initialize the X-window interface, allocates a spread array called old_map to keep a copy of the map from the previous time step, and initializes old_map to 0 (empty). Then, all_display_fish has each processor compare the part of map it owns to the corresponding part of old_map, which it also owns, and if they differ it transfers the new map data to the host for X-window display. This minimizes the number of messages the host has to handle, a serial bottleneck. Finally, map is copied to old_map for the next step.

    Procedure all_move_fish does all the work of moving the fish, and is purely local.