The purpose of this assignment is to introduce programming in shared and distributed memory models.
Your goal is to parallelize a toy particle simulator (similar particle simulators are used in mechanics, biology, astronomy, etc.) that reproduces the behavior shown in the following animation:
The range of interaction forces is limited as shown in grey for a selected particle. Density is set sufficiently low so that given n particles, only O(n) interactions are expected.
Suppose we have a code that runs in time T = O(n) on a single processor. Then we'd hope to run in time T/p when using p processors. We'd like you to write parallel codes that approach these expectations.
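The T/p expectation above can be turned into two numbers that are easy to plot: speedup and parallel efficiency. The following is a minimal sketch in C; the function names `speedup` and `efficiency` are illustrative assumptions and are not part of the supplied code.

```c
#include <assert.h>

/* Speedup of a parallel run relative to the serial run: S = T_serial / T_parallel.
 * (Hypothetical helper, not part of the supplied code.) */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: E = S / p. A value of 1.0 means the run took exactly T/p. */
double efficiency(double t_serial, double t_parallel, int p)
{
    assert(p > 0);
    return speedup(t_serial, t_parallel) / p;
}
```

For example, a serial time of 8 s and a time of 2 s on 4 processors gives a speedup of 4 and an efficiency of 1.0, which is the ideal case.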
Remember to leave off the -o option in the timing runs you use for these calculations and plots.
This year there is a significant new element to this homework. We have discovered over the years that students often have code that gets the wrong answer, but they time it anyway. That leaves us with a bit of detective work to determine whether the discrepancy is an artifact of floating-point arithmetic errors accumulating in the simulation (which we consider OK) or the result of a real bug. For example, your code might not actually move the particles in parallel (you can make very fast code this way, but you can see the problem), or it might not find neighbors properly, which also leads to very fast code that is just wrong. Different parallelization approaches can also have different accuracies.
In addition to this problem for us, it becomes a problem for the students. They start with correct slow code, create correct faster code, and somewhere along the way in their optimizations they stop getting the right answer. Some discover that they have a bug, but when did it all go wrong?
So, to help everyone, this year students are expected to reproduce the results of the serial implementation almost exactly. By "almost" we mean equivalence to single precision, even though the program executes in double precision. We create the output file in single precision and use the diff command on your output files to determine whether all digits of accuracy are there for every particle for the entire simulation. To make this possible, a new data member, globalID, has been added to the particle_t struct. This is the particle's position in the original list of all particles, and it allows you to write out your particles in the same order as the original serial code. Use the output of serial to check your solutions for 500 and 10,000 particles. You can achieve perfect binary correctness, but it is hard to do this and keep high performance (consider that a stretch goal).
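One way to restore the serial ordering before writing output is to sort the particles by globalID and print each coordinate after narrowing it to single precision. The sketch below assumes a reduced, hypothetical version of particle_t (only x, y, and globalID) and an assumed output format of one "x y" pair per line; the helper names are illustrative, and the real struct and the real save routine may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical reduced particle: the real particle_t has more members.
 * Only globalID is confirmed by the assignment text. */
typedef struct {
    double x, y;
    int globalID;   /* position in the original list of all particles */
} particle_t;

/* qsort comparator: ascending globalID. */
static int by_global_id(const void *a, const void *b)
{
    const particle_t *pa = a, *pb = b;
    return (pa->globalID > pb->globalID) - (pa->globalID < pb->globalID);
}

/* Put particles back into the order of the original serial list. */
void sort_by_global_id(particle_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    qsort(p, n, sizeof *p, by_global_id);
}

/* Format one particle with its coordinates rounded to single precision.
 * The "%.7g %.7g" layout is an assumption, not the course's file format.
 * Returns the number of characters written (excluding the terminator). */
int format_particle(char *buf, size_t size, const particle_t *p)
{
    return snprintf(buf, size, "%.7g %.7g\n", (float)p->x, (float)p->y);
}
```

Because every rank or thread may hold particles in a different order, sorting (or scattering by globalID into a preallocated array) just before output keeps the file comparable with diff.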
You may start with the serial and parallel implementations supplied below. All of them run in O(n^2) time, which is unacceptably inefficient.
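A common way to get from O(n^2) to O(n) with a limited interaction range is to bin particles into a grid of cells at least as wide as the cutoff, so that each particle only needs to be tested against its own cell and the eight neighboring cells. The sketch below is one possible approach, not the supplied code: the type `point_t`, the function names, and the assumption of a square domain of side `size` starting at the origin are all illustrative.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical minimal particle; the real particle_t has more members. */
typedef struct { double x, y; } point_t;

/* Number of cells per side so that each cell is at least `cutoff` wide. */
int cells_per_side(double size, double cutoff)
{
    assert(size > 0.0 && cutoff > 0.0);
    int n = (int)(size / cutoff);
    return n > 0 ? n : 1;
}

/* Row-major index of the cell containing (x, y) in a size-by-size domain. */
int cell_of(double x, double y, double size, int ncell)
{
    int cx = (int)(x / size * ncell);
    int cy = (int)(y / size * ncell);
    if (cx < 0) cx = 0;
    if (cx >= ncell) cx = ncell - 1;   /* clamp particles sitting on the far wall */
    if (cy < 0) cy = 0;
    if (cy >= ncell) cy = ncell - 1;
    return cy * ncell + cx;
}

/*
 * Counting sort of particle indices by cell, in O(n + ncell^2) time.
 * On return, the particles in cell c are items[start[c]] .. items[start[c+1]-1].
 * `start` must hold ncell*ncell + 1 ints and `items` must hold n ints.
 */
void build_cells(const point_t *p, int n, double size, int ncell,
                 int *start, int *items)
{
    int total = ncell * ncell;
    for (int c = 0; c <= total; c++) start[c] = 0;
    for (int i = 0; i < n; i++)                      /* count particles per cell */
        start[cell_of(p[i].x, p[i].y, size, ncell) + 1]++;
    for (int c = 0; c < total; c++) start[c + 1] += start[c];   /* prefix sum */

    int *fill = malloc(total * sizeof *fill);        /* next free slot per cell */
    assert(fill != NULL);
    for (int c = 0; c < total; c++) fill[c] = start[c];
    for (int i = 0; i < n; i++)
        items[fill[cell_of(p[i].x, p[i].y, size, ncell)]++] = i;
    free(fill);
}
```

Since the density is low enough that each cell holds O(1) particles on average, rebuilding the cells and scanning the nine surrounding cells for every particle costs O(n) per timestep under these assumptions.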
You are welcome to use any NERSC cluster in this assignment. If you wish to build it on other systems, you might need a custom implementation of pthread barrier, such as: pthread_barrier.c, pthread_barrier.h.
You may consider using the following visualization program to check the correctness of the result produced by your code: Linux/Mac version (requires SDL), Windows version.
You may work in groups of 2 or 3. One person in your group should be a non-CS student, but otherwise you're responsible for finding a group. After you have chosen a group, please come to the GSI office hours to discuss the distribution of work among team members.
For this part of Homework 2, you need to submit three executables. At a minimum, create one serial code that runs in O(n) time, one distributed memory implementation (MPI) that runs in O(n) time with, hopefully, O(n/p) scaling, and one shared memory implementation (PThreads or OpenMP) whose performance is similar to that of your MPI code (or better on a single node).
Email your report and source code to email@example.com. We need to be able to build and execute your implementations for you to receive credit. Submit a zip or tar file of a directory that contains your report, your Makefiles, and your source code. Spell out in your report which Makefile targets we should build for the different parts of your report.
Here is the list of items you might show in your report:
You will also be running this assignment on GPUs. You have access to Dirac, an experimental GPU cluster at NERSC. Each node has an NVIDIA Tesla C2050 as well as two quad-core CPUs (see the NERSC Dirac webpage for more detailed information). You access the Dirac subsystem by logging into carver.nersc.gov and using specific qsub directives.
We will provide a naive O(n^2) GPU implementation, similar to the OpenMP, PThreads, and MPI codes listed above. It will be your task to make the necessary algorithmic changes and machine optimizations to achieve favorable performance across a range of problem sizes.
It may help to have a clean O(n) serial CPU implementation as a reference. If you feel this will help you, please e-mail the GSIs after Part 1 is due and we can provide this.
As with Part 1, you can check the correctness of your algorithm by comparing your solution to the serial implementation up to 100 timesteps. At that point your solution should still be correct up to the tenth decimal place with any order of summation, but it will not be if you have missed a particle interaction.
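A check of this kind can be sketched as a maximum absolute difference between two coordinate arrays, compared against a tolerance. Reading "tenth decimal place" as a tolerance of 1e-10 is an assumption here, and the function names are hypothetical; how the GPU and serial positions get into plain arrays is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute difference between corresponding entries of a and b. */
double max_abs_diff(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = a[i] - b[i];
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}

/* Returns 1 when every entry agrees to within tol (for example 1e-10), else 0. */
int positions_agree(const double *a, const double *b, size_t n, double tol)
{
    return max_abs_diff(a, b, n) <= tol;
}
```

Differences from reordered summation should stay well under such a tolerance after 100 timesteps, while a missed interaction typically shows up as a much larger discrepancy.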
Please include a section in your report detailing your GPU implementation, as well as its performance over varying numbers of particles. Here is the list of items you might show in your report: