CS 194-2: Computing resources



Accounts

NOTE: please do NOT e-mail support@millennium to get your class account. We will get you an account if you don't have one already.

Batch scheduling

Ever notice how your computer slows down more and more when you open more and more applications? This is especially true for computationally intensive applications. How can you tell if all your optimization efforts are worth anything, if the computer is running all kinds of random programs at the same time as your super-optimized library? This is why people sometimes set up computers to use a batch scheduler, also called a batch queue.

A batch scheduler is a program running on a "master" or "front-end" server that controls computational jobs on a cluster of "worker" or "compute" nodes. The master ensures that each worker is running at most one job at a time. You can log into the master and submit jobs to the queue. However, you can't log into the compute nodes and run jobs directly; they are protected from logins that would disturb running computations. Furthermore, the scheduler can impose limits on the computational resources demanded by your jobs, like memory, disk, or runtime. This protects the cluster from broken or malicious programs, and also ensures a fair distribution of resources among the jobs.

We'll be using clusters with batch schedulers as much as possible in this class, as they ensure consistent and fair timings. Don't worry, they aren't hard to use, although different clusters may use different scheduler commands. For machines that use the PBS system, like the PSI cluster, we provide a "skeleton" script with blanks that you can fill in to run your homeworks. An online tutorial explains how to submit jobs to the PBS queue.
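
To give a rough idea of what a PBS job script looks like, here is a minimal sketch. It is NOT the course's skeleton script: the job name, the resource requests (nodes=1:ppn=8, a 10-minute walltime) and the placeholder workload are all assumptions made for illustration, and the resource syntax accepted on PSI or Citris may differ, so trust the real skeleton's comments over this.

```shell
#!/bin/bash
# Illustrative PBS job script; NOT the course's official skeleton.
# Lines starting with "#PBS" are read by the scheduler and ignored by the shell.
#PBS -N hw_timing
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:10:00
#PBS -j oe
# -N names the job, -l requests resources (assumed values), and
# -j oe merges the job's stderr into its stdout file.

# PBS sets PBS_O_WORKDIR to the directory where the job was submitted.
# Fall back to the current directory so the script also runs outside the queue.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Placeholder workload (hypothetical): replace with your own program.
job_output="job ran on node $(uname -n)"
echo "$job_output"
```

With PBS, a script like this is typically submitted from the front-end node with "qsub" followed by the script's file name, and "qstat" lists the jobs in the queue.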

For NERSC machines, each cluster may use a different queue system. You should refer to their (excellent) online tutorials, such as this one for Bassi.

Interactive jobs

Interactive jobs are those that don't go through a batch scheduler. They can be handy for debugging a code that doesn't take too long to run. You usually shouldn't just run them like any program, because then they will run on the front-end node, to which 80 gazillion people are logged in. They will be very annoyed if your compute-intensive program prevents them from editing their files and running their own jobs! (They can also use the ps command to find out who you are, so don't be surprised if they send you angry e-mails.) Instead, there might be special ways you can tell the front-end node to run your job on one of the dedicated compute nodes, to which nobody can log in. Not all clusters support interactive jobs; in particular, PSI and Citris do not. (We consider this a good thing, because interactive jobs were the only way to use some of these machines in the past, and they made it hard to get good timings.)

Some of the NERSC machines support interactive jobs, but actually these are sent to the batch scheduler too! The scheduler has a special "interactive" class of jobs, and interactive jobs are scheduled in this class. You run them like interactive jobs, but they are scheduled like batch jobs, and also charged like batch jobs. On NERSC machines, you might also want to use the "debug" job class for debugging, as some NERSC machines have compute nodes specifically dedicated to debug jobs. This may give you a much shorter turnaround if you're in the thick of debugging a tricky code. Debug jobs are submitted like batch jobs, but with a different job class. The batch script you write will specify the job class. Debug jobs are also charged, so be stingy with debugging on NERSC clusters.

Millennium project

The UC Berkeley Millennium project offers a number of parallel machines for research use. They have generously offered their machines and administration expertise for your educational use, as we don't yet have an educational cluster set up. Please be nice and don't flood the cluster with unnecessary jobs, as it will interfere with other users' research work!

PSI batch cluster

One of Millennium's newer machines is the PSI cluster. It has modern multicore or SMP 64-bit x86 processors, and some nodes have special low-latency network hardware. Part of the machine is configured for interactive use, and part is for batch jobs. We'll be using the batch part of the cluster, as it provides more accurate timings. To use PSI, SSH into the front-end node, zen.millennium.berkeley.edu. You can find an example batch scheduling script for PSI here.

Note that the PSI cluster has two kinds of nodes, which you can see on the cluster's home page. One kind has 8 CPUs per node, but slow internode connections; the other kind has 2 CPUs per node, but fast internode connections. You should prefer the ones with 8 CPUs per node for assignments involving threads, as threading alone can't take advantage of the network. The batch scheduling script can be used to select one or the other kind of node; just read its comments and you'll learn how.

Citris batch cluster

Millennium also offers another batch cluster, which is an Itanium-2 based machine called Citris. Each node has two processors. You can access it by logging into the front-end node, grapefruit.millennium.berkeley.edu. It uses the same (PBS) batch scheduling system as PSI does. The Millennium sysadmins tell us that the load on Citris is quite light, so you should consider using it, especially for debugging, sequential, and MPI jobs.

Millennium software

Setting the PATH

In order to access useful programs that are installed on Millennium, you should add some directories to your PATH, MANPATH, and LD_LIBRARY_PATH environment variables. If you are a bash user, here's how you might do this in your .bashrc startup script:

if [ -d /usr/mill/bin ]; then
    export PATH=/usr/mill/bin:$PATH
fi
if [ -d /usr/mill/man ]; then
    export MANPATH=/usr/mill/man:$MANPATH
fi
if [ -d /usr/mill/lib ]; then
    export LD_LIBRARY_PATH=/usr/mill/lib:/usr/mill/pkg/intel_cc/lib:$LD_LIBRARY_PATH
fi

WARNING: As of 30 August 2007, I found out that your default shell might be csh or tcsh instead of bash. If you don't want to use (t)csh, you should e-mail inst@eecs or help@eecs to get your shell changed. If you are a (t)csh user, you might try putting the following in your .cshrc file (caveat lector: I'm NOT a csh user, so I don't know if this works):

# (t)csh stops on an undefined variable, so give these empty defaults first
if (! $?MANPATH) setenv MANPATH ""
if (! $?LD_LIBRARY_PATH) setenv LD_LIBRARY_PATH ""
if (-d /usr/mill/bin) set path = ( /usr/mill/bin $path )
if (-d /usr/mill/man) setenv MANPATH "/usr/mill/man:${MANPATH}"
if (-d /usr/mill/lib) setenv LD_LIBRARY_PATH "/usr/mill/lib:/usr/mill/pkg/intel_cc/lib:${LD_LIBRARY_PATH}"

Later on, when we do Java assignments, we may have you add other directories and aliases to your startup script. I describe how to do that below.
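
If your startup script gets sourced more than once (for example, by nested shells), the same directory can end up in a variable several times. One possible way around that is a small helper like the sketch below. The function name prepend_dir is made up for this example, and it is only a suggestion, not something the Millennium setup requires. MANPATH is deliberately left out, because some man implementations give an empty entry in MANPATH a special meaning.

```shell
# prepend_dir VAR DIR (hypothetical helper): put DIR at the front of the
# colon-separated list stored in VAR, but only if DIR exists and is not
# already in the list.
prepend_dir() {
    [ -d "$2" ] || return 0            # skip directories that don't exist
    eval "_cur=\${$1-}"                # current value of VAR (empty if unset)
    case ":$_cur:" in
        *":$2:"*) return 0 ;;          # already present: nothing to do
    esac
    eval "$1=\$2\${_cur:+:\$_cur}"     # DIR, then the old value if non-empty
    export "$1"
}

prepend_dir PATH /usr/mill/bin
prepend_dir LD_LIBRARY_PATH /usr/mill/lib
prepend_dir LD_LIBRARY_PATH /usr/mill/pkg/intel_cc/lib
```

Calling the helper twice with the same directory leaves the variable unchanged the second time.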

Accessing Java

The Java JDK contains both the Java runtime (for running Java programs) and Java development tools (e.g., the bytecode compiler "javac"). By default, the "java" and "javac" commands may be mapped to GNU's Java JDK. It supports an earlier version of Java (1.4) than that which we need for our homeworks. Thus, on the Millennium machines, you should use one of the various versions of the Java JDK in the "/usr/mill/pkg/" directory. I usually access these by setting aliases in my .bashrc script (.cshrc for (t)csh users). Make sure to use only those Java JDK versions that support Java 1.5 or newer (this is only a problem on Citris). You can run "java -version" to get the name and version number. If you want to use the same JDK for both Citris and PSI, you can add the following to your .bashrc, if you are a bash user (mfh 30 Aug 2007: corrected from parentheses to curly brackets by a helpful student, many thanks!):

if [ -d /usr/mill/pkg/jrockit ]; then
    export JAVA_ROOTDIR=/usr/mill/pkg/jrockit
    alias javac="${JAVA_ROOTDIR}/bin/javac"
    alias java="${JAVA_ROOTDIR}/bin/java"
fi

If you are a (t)csh user, try the following:

if (-d /usr/mill/pkg/jrockit) then
    setenv JAVA_ROOTDIR "/usr/mill/pkg/jrockit"
    alias javac "${JAVA_ROOTDIR}/bin/javac"
    alias java "${JAVA_ROOTDIR}/bin/java"
endif

If you want to experiment with different JDKs, take a look in the "/usr/mill/pkg/" directory. Note that things marked "x86" or the like won't work on Citris, because Citris has Itanium 2 processors, which can't run x86 executables. Likewise, things marked "ia64" (Itanium) won't work on x86 machines such as PSI. For example, on PSI, I use Sun's Java 1.6 JDK by adding the following to my .bashrc (I'm a bash user):

export JAVA_ROOTDIR=/usr/mill/pkg/jdk1.6.0_02
alias javac="${JAVA_ROOTDIR}/bin/javac"
alias java="${JAVA_ROOTDIR}/bin/java"

For (t)csh users, the following would go in your .cshrc:

setenv JAVA_ROOTDIR "/usr/mill/pkg/jdk1.6.0_02"
alias javac "${JAVA_ROOTDIR}/bin/javac"
alias java "${JAVA_ROOTDIR}/bin/java"

The Millennium sysadmins usually make sure that if you can see it in "/usr/mill/pkg/" and it's not specifically marked to be for an incompatible processor, then you can run it. If you do pick something built for the wrong processor, it will fail right away when you try to run it.
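
If you share one startup script between PSI and Citris, you could let the script choose a JDK based on the processor type. The sketch below is one way to do that in a bash startup script. The function name pick_jdk is made up, and it assumes that "uname -m" prints "ia64" on Citris and "x86_64" (or something like "i686") on PSI, which I have not checked on those machines.

```shell
# pick_jdk ARCH (hypothetical helper): print the JDK directory to use for
# a given "uname -m" value, or nothing if the architecture is unknown.
pick_jdk() {
    case "$1" in
        ia64)
            # Itanium (Citris): use the JDK suggested above for both clusters.
            echo /usr/mill/pkg/jrockit ;;
        x86_64|i?86)
            # x86 (PSI): Sun's Java 1.6 JDK, as in the example above.
            echo /usr/mill/pkg/jdk1.6.0_02 ;;
        *)
            echo "" ;;
    esac
}

JAVA_ROOTDIR=$(pick_jdk "$(uname -m)")
if [ -n "$JAVA_ROOTDIR" ] && [ -d "$JAVA_ROOTDIR" ]; then
    export JAVA_ROOTDIR
    alias javac="${JAVA_ROOTDIR}/bin/javac"
    alias java="${JAVA_ROOTDIR}/bin/java"
else
    # No matching JDK on this machine: leave the default java alone.
    unset JAVA_ROOTDIR
fi
```

Either way, run "java -version" afterwards to confirm which JDK you actually got.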

Java performance timings

When you report performance results in Java, it's important that you tell us the name and version of the Java compiler and runtime. Is it Sun's? IBM's? GNU's? Use "javac -version" to get the compiler version, and "java -version" to get the runtime version.
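
To make this less error-prone, you could capture both version strings with a small script and paste its output into your report. The sketch below is only a suggestion, and the function name report_java_versions is made up. It uses JAVA_ROOTDIR when that variable is set, because shell aliases like the ones above are generally not expanded inside scripts, and it merges stderr into stdout because these tools may print their version on stderr.

```shell
# report_java_versions (hypothetical helper): print the version banner of
# the Java compiler and runtime, or a note if a tool cannot be found.
report_java_versions() {
    for tool in javac java; do
        if [ -n "${JAVA_ROOTDIR:-}" ] && [ -x "$JAVA_ROOTDIR/bin/$tool" ]; then
            cmd="$JAVA_ROOTDIR/bin/$tool"          # JDK chosen in the startup script
        else
            cmd=$(command -v "$tool" 2>/dev/null) || cmd=""   # fall back to PATH
        fi
        if [ -n "$cmd" ]; then
            echo "== $tool -version ($cmd) =="
            "$cmd" -version 2>&1 || echo "(could not run $cmd)"
        else
            echo "== $tool: not found =="
        fi
    done
}

report_java_versions
```

Redirect the output to a file if you want to keep it with your timing results.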

Learning about your machine

On machines that run Linux, you can generally figure out the type and number of processors by reading the (human-readable) file "/proc/cpuinfo". This may be useful for checking that your batch scheduler script gets you the kind of nodes that you want.
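
For example, a few lines like the following sketch, placed at the top of a batch script, would record what kind of node the job landed on. The function name count_processors is made up, and the exact field names in "/proc/cpuinfo" vary somewhat between processor types and kernel versions, so treat the patterns as assumptions.

```shell
# count_processors FILE (hypothetical helper): count the logical processors
# listed in a cpuinfo-style file (one "processor" entry per logical CPU).
count_processors() {
    grep -c '^processor' "$1"
}

cpuinfo=/proc/cpuinfo
if [ -r "$cpuinfo" ]; then
    echo "logical processors: $(count_processors "$cpuinfo")"
    # List the distinct processor model strings, if the kernel reports them.
    grep '^model name' "$cpuinfo" | sort | uniq -c
fi
```

Note that the count is of logical processors as the kernel sees them, which is what matters when you decide how many threads to run.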

NERSC machines

NERSC has generously donated some computer time for our class. If necessary, we'll let you know which machines you can use and how to use them. Stay tuned for more information here.

NERSC computer time is limited

NERSC computer time is a limited resource. It's allocated for the entire CS 194-2 course, not for each user. If you burn through all of it, then nobody else can use NERSC machines for the whole semester! We have to ask for a grant for each block of time that we allocate from them. This is very annoying for us and for them, so please don't make us do it!

Conserving NERSC allocation time

There are many ways you can avoid using up all of the NERSC time allocation. First, you can do all your debugging on non-NERSC machines, and only use the NERSC machines once you know that your code works. Second, you can submit your jobs as low priority; this charges only half the hours of regular jobs. This website explains how NERSC charges for jobs. Finally, you can try using "cheaper" NERSC machines, especially for debugging. Bassi's jobs are charged at six times the rate of Seaborg's jobs, for example. This is because Seaborg is much slower, but slow or not, it has 16 processors per node, twice as many as Bassi has per node. Seaborg also uses the same OS and compiler as Bassi, so it's easy to port codes between the two.
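
To see how much these choices matter, here is a back-of-the-envelope sketch that uses only the two ratios mentioned above (low priority costs half, and Bassi costs six times as much as Seaborg). It is NOT NERSC's actual charging formula, which also depends on things like how many processors you use. The function name relative_charge is made up, and the result is in made-up units where one regular-priority hour on Seaborg costs 1.

```shell
# relative_charge HOURS MACHINE PRIORITY (hypothetical helper)
# Prints the charge relative to one regular-priority hour on Seaborg,
# for jobs of the same size. Uses only the ratios quoted in the text.
relative_charge() {
    case "$2" in
        bassi)   machine_factor=6 ;;
        seaborg) machine_factor=1 ;;
        *) echo "unknown machine: $2" >&2; return 1 ;;
    esac
    case "$3" in
        regular) priority_factor=1 ;;
        low)     priority_factor=0.5 ;;
        *) echo "unknown priority: $3" >&2; return 1 ;;
    esac
    awk -v h="$1" -v m="$machine_factor" -v p="$priority_factor" \
        'BEGIN { print h * m * p }'
}

relative_charge 2 bassi regular    # prints 12
relative_charge 2 bassi low        # prints 6
relative_charge 2 seaborg low      # prints 1
```

In other words, under these assumptions a two-hour low-priority debugging run on Seaborg costs a twelfth of the same run at regular priority on Bassi.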

NERSC allocation details

In case you need to know, the NERSC allocation number is "mp309". I don't think you'll need to know this, because it should be the only allocation to which you have access. If you ever need to talk to NERSC people, e.g., to change your password, then they might refer to the allocation as belonging to CS267; it actually does, but we're also using it for CS194-2 this semester. You can use NERSC's "NIM" web interface to check on the status of the allocation; NERSC's website offers help on that.

NERSC software

NERSC's web pages for each cluster explain quite well how to use the various software packages. NERSC uses the "modules" system for fixing up the path and other environment variables, so generally you won't have to mess with changing environment variables in your shell's startup scripts.

Working on your home machine

This tutorial explains how to do coding and debugging on your home Windows machine by using Cygwin.


Last updated 08 Sep 2007.