# CS267 Assignment 2: Parallelize Particle Simulation

## Overview

The purpose of this assignment is to introduce programming in shared and distributed memory models.

Your goal is to parallelize a toy particle simulator (similar particle simulators are used in mechanics, biology, astronomy, etc.) that reproduces the behavior shown in the following animation:

The range of interaction forces is limited as shown in grey for a selected particle. Density is set sufficiently low so that given n particles, only O(n) interactions are expected.
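One common way to exploit the limited interaction range is to sort particles into square cells at least as wide as the cutoff, so each particle only inspects its own cell and the eight surrounding ones. The sketch below illustrates that idea on a simplified problem (counting close pairs rather than computing forces). The names `Point` and `count_close_pairs`, and the `size` and `cutoff` parameters, are hypothetical and do not come from the supplied code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical minimal particle: binning only needs positions.
struct Point { double x, y; };

// Count pairs closer than `cutoff` in the square domain [0, size) x [0, size).
// Particles are first dropped into square cells whose side is at least `cutoff`
// (whenever size >= cutoff), so every neighbor of a particle lies in the 3x3
// block of cells around it.
long count_close_pairs(const std::vector<Point>& p, double size, double cutoff) {
    const int ncell = std::max(1, static_cast<int>(size / cutoff));
    const double cell = size / ncell;  // side length of one cell
    auto cell_of = [&](double v) {
        return std::min(std::max(static_cast<int>(v / cell), 0), ncell - 1);
    };

    // Binning pass: O(n).
    std::vector<std::vector<int>> bins(static_cast<std::size_t>(ncell) * ncell);
    const int n = static_cast<int>(p.size());
    for (int i = 0; i < n; ++i)
        bins[cell_of(p[i].y) * ncell + cell_of(p[i].x)].push_back(i);

    // Interaction pass: each particle visits at most 9 cells.
    long pairs = 0;
    for (int i = 0; i < n; ++i) {
        const int cx = cell_of(p[i].x), cy = cell_of(p[i].y);
        for (int by = std::max(cy - 1, 0); by <= std::min(cy + 1, ncell - 1); ++by)
            for (int bx = std::max(cx - 1, 0); bx <= std::min(cx + 1, ncell - 1); ++bx)
                for (int j : bins[by * ncell + bx]) {
                    if (j <= i) continue;  // count each pair once
                    const double dx = p[i].x - p[j].x, dy = p[i].y - p[j].y;
                    if (dx * dx + dy * dy < cutoff * cutoff) ++pairs;
                }
    }
    return pairs;
}
```

With a bounded number of particles per cell, both passes do O(n) work, which is where the expected O(n) interaction count comes from.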

Suppose we have a code that runs in time T = O(n) on a single processor. Then we'd hope to run in time T/p when using p processors. We'd like you to write parallel codes that approach these expectations.
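The T/p expectation can be checked directly from measured run times. The helper below is a small sketch that computes speedup and parallel efficiency; the names `Scaling` and `strong_scaling` are hypothetical, and the timings are assumed to be wall-clock seconds for the same problem size.

```cpp
#include <cassert>

// Speedup and parallel efficiency for a fixed problem size.
struct Scaling {
    double speedup;     // t_serial / t_parallel; ideal value is p
    double efficiency;  // speedup / p; ideal value is 1
};

Scaling strong_scaling(double t_serial, double t_parallel, int p) {
    assert(t_parallel > 0.0 && p > 0);
    const double s = t_serial / t_parallel;
    return {s, s / p};
}
```

For example, a run that takes 10 s serially and 4 s on 4 processors has a speedup of 2.5 and an efficiency of 0.625.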

Remember to omit the -o option in the timing runs that you use for these calculations and plots.

## Reproducible Simulations

This year there is a significant new element to this homework. We have discovered over the years that students often have code that gets the wrong answer, but they time it anyway. That leaves us with a bit of detective work to determine whether the discrepancy is an artifact of floating-point errors accumulating in the simulation (which we consider OK) or the result of a real bug. For example, code that never actually moves the particles in parallel can be very fast, but you can see the problem. Not finding your neighbors properly can likewise lead to very fast code that is just wrong. Different parallelization approaches can also have different accuracies.

In addition to this problem for us, it becomes a problem for the students. They start with correct slow code, create correct faster code, and somewhere along the way in their optimizations they stop getting the right answer. Some discover that they have a bug, but when did it all go wrong?

So, to help everyone, this year students are expected to reproduce the results of the serial implementation almost exactly. By "almost" we mean equivalence to single precision, even though the program executes in double precision. We write the output file in single precision and use the diff command on your output files to determine whether all digits of accuracy are there for every particle for the entire simulation. To make this possible, a new data member, globalID, has been added to the particle_t struct. This is the particle's position in the original list of all particles, and it allows you to write out your particles in the same order as the original serial code. Use the output of serial to check your solutions for 500 and 10,000 particles. You can achieve perfect binary correctness, but it is hard to do this and keep high performance (consider that a stretch goal).
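As a rough illustration of writing particles back in their original order, the sketch below sorts a copy of the particle list by globalID and formats positions after rounding them to single precision. The `Particle` struct is a hypothetical stand-in for particle_t (only the globalID member is named in this handout), and both `format_in_original_order` and the `%g` output format are assumptions rather than what the supplied common.cpp does.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for particle_t; only globalID is named in the handout.
struct Particle { double x, y; int globalID; };

// Return one "x y" line per particle, ordered by globalID, with positions
// rounded to float before printing.
std::string format_in_original_order(std::vector<Particle> p) {  // by value: sort a copy
    std::sort(p.begin(), p.end(),
              [](const Particle& a, const Particle& b) { return a.globalID < b.globalID; });
    std::string out;
    char line[64];
    for (const Particle& q : p) {
        std::snprintf(line, sizeof line, "%g %g\n",
                      static_cast<float>(q.x), static_cast<float>(q.y));
        out += line;
    }
    return out;
}
```

Because the ordering depends only on globalID, the text is the same no matter how the particles were distributed among threads or ranks, which is what makes a plain diff against the serial output meaningful.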

## Source Code

You may start with the serial and parallel implementations supplied below. All of them run in O(n^2) time, which is unacceptably inefficient.

• serial.cpp: a serial implementation
• openmp.cpp: a shared memory parallel implementation done using OpenMP
• pthreads.cpp: a shared memory parallel implementation done using pthreads (if you prefer it over OpenMP)
• mpi.cpp: a distributed memory parallel implementation done using MPI
• common.cpp, common.h: an implementation of common functionality, such as I/O, numerics and timing
• Makefile: a makefile that should work on all NERSC clusters if you uncomment appropriate lines
• job-franklin-serial, job-franklin-pthreads4, job-franklin-openmp4, job-franklin-mpi4, job-hopper-serial, job-hopper-pthreads24, job-hopper-openmp24, job-hopper-mpi24: sample batch files to launch jobs on Franklin and Hopper. Use qsub to submit on Franklin or Hopper.
• particles.tar: all above files in one tarball

You are welcome to use any NERSC cluster in this assignment. If you wish to build it on other systems, you might need a custom implementation of pthread barrier, such as: pthread_barrier.c, pthread_barrier.h.

You may consider using the following visualization program to check the correctness of the result produced by your code: Linux/Mac version (requires SDL), Windows version.

## Submission

You may work in groups of 2 or 3. One person in your group should be a non-CS student, but otherwise you're responsible for finding a group. After you have chosen a group, please come to the GSI office hours to discuss the distribution of work among team members. For this part of Homework 2, you need to submit three executables. At a minimum, create one serial code that runs in O(n) time, one distributed memory implementation (MPI) that runs in O(n) time with, hopefully, O(n/p) scaling, and one shared memory implementation (PThreads or OpenMP) whose performance is similar to that of your MPI code (or better on a single node).

Email your report and source code to cs267.brian@gmail.com. We need to be able to build and execute your implementations for you to receive credit. Your submission should be a zip or tar file of a directory that contains your report, your Makefiles and your source code. Spell out in your report which Makefile targets we are to build for the different parts of your report.

Here is the list of items you might show in your report:

• A plot in log-log scale that shows that your serial and parallel codes run in O(n) time and a description of the data structures that you used to achieve it.
• A plot in log-linear scale that shows your performance as a percent of peak performance for different numbers of processors. You can use a tool like Craypat to tell you how many flops are performed for different sizes of n.
• A description of the synchronization you used in the shared memory implementation.
• A description of the communication you used in the distributed memory implementation.
• A description of the design choices that you tried and how they affected the performance.
• Speedup plots that show how closely your parallel codes approach the idealized p-times speedup and a discussion on whether it is possible to do better.
• Where does the time go? Consider breaking down the runtime into computation time, synchronization time and/or communication time. How do they scale with p?
• A discussion on using pthreads, OpenMP and MPI.
You should also undertake one stretch goal for yourselves in the homework. This could be creating both a PThreads and an OpenMP implementation and comparing them, or combining MPI with OpenMP as a hybrid parallel code.
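For the "where does the time go?" item in the report list, one simple approach is to accumulate wall-clock time per phase. The class below is a minimal sketch using the C++ standard library clock; `PhaseTimer` and its phase names are hypothetical. In an MPI or OpenMP code you would more likely read the clock with MPI_Wtime or omp_get_wtime, and a threaded code would need one timer per thread.

```cpp
#include <chrono>
#include <map>
#include <string>

// Accumulates elapsed wall-clock seconds under a named phase.
// Not thread-safe: use one instance per thread or rank.
class PhaseTimer {
    using clock = std::chrono::steady_clock;
    std::map<std::string, double> total_;

public:
    // Run `work` and add its elapsed time to `phase`.
    template <class F>
    void run(const std::string& phase, F&& work) {
        const auto t0 = clock::now();
        work();
        total_[phase] += std::chrono::duration<double>(clock::now() - t0).count();
    }

    // Total seconds recorded for `phase`, or 0 if it never ran.
    double seconds(const std::string& phase) const {
        const auto it = total_.find(phase);
        return it == total_.end() ? 0.0 : it->second;
    }
};
```

A time step could then be wrapped as `timer.run("compute", [&] { /* force loop */ });` and `timer.run("communicate", [&] { /* exchange */ });`, with the totals reported for each value of p.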

## Resources

• Programming in shared and distributed memory models was introduced in Lectures 6 and 7, which are available at the course website.
• Shared memory implementations may require using locks, which are available as omp_lock_t in OpenMP (requires omp.h) and pthread_mutex_t in pthreads (requires pthread.h).
• You may consider using atomic operations such as __sync_lock_test_and_set with the GNU compiler. This syntax changes between compilers.
• Distributed memory implementation may benefit from overlapping communication and computation that is provided by nonblocking MPI routines such as MPI_Isend and MPI_Irecv.
• Other useful resources: pthreads tutorial, OpenMP tutorial, OpenMP specifications and MPI specifications.
• It can be very useful to use a performance measuring tool in this homework. Parallel profiling is a complicated business but there are a couple of tools that can help.
• IPM is a profiling tool that you add to the link command in your Makefile (after you module load ipm); it then substitutes instrumented versions of your MPI calls into your program.
• TAU (Tuning and Analysis Utilities) is a source code instrumentation system to gather profiling information. You need module load tau to access these capabilities. This system can profile MPI, OpenMP and PThread code, and mixtures, but it has a learning curve.
• HPCToolkit is a sampling profiler for parallel programs. You need module load hpctoolkit. You can install the hpcviewer on your own computer for offline analysis, or use the one on NERSC by using the NX client to get X windows displayed back to your own machine.
• If you are using TAU or HPCToolkit you should run in your $SCRATCH directory which has faster disk access to the compute nodes (profilers can generate big profile files).
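As an illustration of the atomic operation mentioned in the list above, the sketch below builds a small test-and-set spinlock from the GNU builtins __sync_lock_test_and_set and __sync_lock_release. The `SpinLock` type is hypothetical, and the builtins are GCC-specific (Clang also accepts them); other compilers use different syntax.

```cpp
// Minimal test-and-set spinlock built on GCC's __sync builtins.
struct SpinLock {
    volatile int flag = 0;  // 0 = free, 1 = held

    // Atomically write 1 and return true only if the lock was free before.
    bool try_lock() { return __sync_lock_test_and_set(&flag, 1) == 0; }

    // Spin until the lock is acquired.
    void lock() {
        while (!try_lock()) { /* busy-wait */ }
    }

    // Atomically write 0, releasing the lock.
    void unlock() { __sync_lock_release(&flag); }
};
```

A lock like this could guard one bin or one particle during force accumulation. Whether it beats omp_lock_t or pthread_mutex_t depends on contention, so it is worth measuring rather than assuming.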

# Part 2: GPU

## Overview

You will also be running this assignment on GPUs. You have access to Dirac, an experimental GPU cluster at NERSC. Each node has an NVIDIA Tesla C2050, as well as two quad-core CPUs (see the NERSC Dirac webpage for more detailed information). You access the Dirac subsystem by logging into carver.nersc.gov and using specific qsub directives.

## Source Code

We will provide a naive O(n^2) GPU implementation, similar to the openmp, pthreads, and MPI codes listed above. It will be your task to make the necessary algorithmic changes and machine optimizations to achieve favorable performance across a range of problem sizes.

## Help

It may help to have a clean O(n) serial CPU implementation as a reference. If you feel this will help you, please e-mail the GSIs after Part 1 is due and we can provide this.

As with Part 1, you can check the correctness of your algorithm by comparing your solution to the serial implementation for up to 100 timesteps. At that point your solution should be correct up to the tenth decimal place with any order of summation, but it will not be correct if you have missed a particle interaction.
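A tolerance-based comparison along these lines might look like the sketch below. The function name `agree_to_tolerance` and the flat coordinate arrays are hypothetical, and the default tolerance of 1e-10 is only one reading of "the tenth decimal place".

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True if both coordinate lists have the same length and every pair of
// entries differs by at most `tol`. A NaN in either list counts as a mismatch.
bool agree_to_tolerance(const std::vector<double>& reference,
                        const std::vector<double>& candidate,
                        double tol = 1e-10) {
    assert(tol >= 0.0);
    if (reference.size() != candidate.size()) return false;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double diff = std::fabs(reference[i] - candidate[i]);
        if (!(diff <= tol)) return false;  // written this way so NaN fails
    }
    return true;
}
```

Both lists need to be in the same particle order for the comparison to be meaningful.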

## Submission

Please include a section in your report detailing your GPU implementation, as well as its performance over varying numbers of particles. Here is the list of items you might show in your report:

• A plot in log-log scale that shows the performance of your code versus the naive GPU code
• A description of any synchronization needed
• A description of any GPU-specific optimizations you tried
• A discussion on the strengths and weaknesses of CUDA and the current GPU architecture

## GPU Resources:
