Storage Hierarchy Management for Scientific Computing

Ethan Leo Miller

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-95-872
April 1995

http://www2.eecs.berkeley.edu/Pubs/TechRpts/1995/CSD-95-872.pdf

Scientific computation has always been one of the driving forces behind the design of computer systems. As a result, many advances in CPU architecture were first developed for high-speed supercomputer systems, keeping them among the fastest computers in the world. However, little research has been done on storing the vast quantities of data that scientists manipulate on these powerful computers.

This thesis first characterizes scientists' usage of a multi-terabyte tertiary storage system attached to a high-speed computer. The analysis finds that the number of files and average file size have both increased by several orders of magnitude since 1980. The study also finds that integration of tertiary storage with secondary storage is critical. Many of the accesses to files stored on tape could easily have been avoided had scientists seen a unified view of the mass storage hierarchy instead of the two separate views of the system studied. This finding was a major motivation for the design of the RAMA file system.

The remainder of the thesis describes the design and simulation of a massively parallel processor (MPP) file system that is simple, easy to use, and integrates well with tertiary storage. MPPs are increasingly used for scientific computation, yet their file systems require great attention to detail to get acceptable performance. Worse, a program that performs well on one machine may perform poorly on a similar machine with a slightly different file system. RAMA solves this problem by pseudo-randomly distributing data to a disk attached to each processor, making performance independent of program usage patterns. It does this without sacrificing the high performance that scientific users demand, as shown by simulations comparing the performance of RAMA and a striped file system on both real and synthetic benchmarks. Additionally, RAMA can be easily integrated with tertiary storage systems, providing a unified view of the file system spanning both disk and tape systems. RAMA's ease of use and simplicity of design make it an ideal choice for the massively parallel computers used by the scientific community.
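
To illustrate why pseudo-random placement can make performance independent of access pattern, the following is a minimal sketch, not RAMA's actual algorithm. The abstract does not specify the hash function or the block layout, so the use of SHA-256, the function names `hashed_node` and `striped_node`, the file identifier 7, and the node count of 16 are all assumptions made for illustration. The sketch compares a strided access pattern under round-robin striping and under hashed placement.

```python
import hashlib
from collections import Counter


def hashed_node(file_id: int, block: int, num_nodes: int) -> int:
    """Choose a node by hashing (file_id, block).

    Hypothetical placement rule: SHA-256 stands in for whatever
    pseudo-random function the real system uses.
    """
    key = f"{file_id}:{block}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def striped_node(block: int, num_nodes: int) -> int:
    """Round-robin striping: block i is stored on node i mod num_nodes."""
    return block % num_nodes


NUM_NODES = 16  # assumed machine size, one disk per node

# A strided pattern: the program touches every 16th block of one file.
# With 16 nodes, this stride is the worst case for round-robin striping.
blocks = range(0, NUM_NODES * 256, NUM_NODES)

striped_load = Counter(striped_node(b, NUM_NODES) for b in blocks)
hashed_load = Counter(hashed_node(7, b, NUM_NODES) for b in blocks)

print("striped: nodes used =", len(striped_load),
      "busiest node =", max(striped_load.values()))
print("hashed:  nodes used =", len(hashed_load),
      "busiest node =", max(hashed_load.values()))
```

Under striping, every request in this pattern lands on a single disk, so the program sees the bandwidth of one disk. Under hashed placement the same requests are spread over many disks, and changing the stride does not change that, which is the property the abstract describes as performance being independent of program usage patterns.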

Advisor: Randy H. Katz


BibTeX citation:

@phdthesis{Miller:CSD-95-872,
    Author = {Miller, Ethan Leo},
    Title = {Storage Hierarchy Management for Scientific Computing},
    School = {EECS Department, University of California, Berkeley},
    Year = {1995},
    Month = {Apr},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1995/5206.html},
    Number = {UCB/CSD-95-872},
    Abstract = {Scientific computation has always been one of the driving forces behind the design of computer systems. As a result, many advances in CPU architecture were first developed for high-speed supercomputer systems, keeping them among the fastest computers in the world. However, little research has been done on storing the vast quantities of data that scientists manipulate on these powerful computers.
    This thesis first characterizes scientists' usage of a multi-terabyte tertiary storage system attached to a high-speed computer. The analysis finds that the number of files and average file size have both increased by several orders of magnitude since 1980. The study also finds that integration of tertiary storage with secondary storage is critical. Many of the accesses to files stored on tape could easily have been avoided had scientists seen a unified view of the mass storage hierarchy instead of the two separate views of the system studied. This finding was a major motivation for the design of the RAMA file system.
    The remainder of the thesis describes the design and simulation of a massively parallel processor (MPP) file system that is simple, easy to use, and integrates well with tertiary storage. MPPs are increasingly used for scientific computation, yet their file systems require great attention to detail to get acceptable performance. Worse, a program that performs well on one machine may perform poorly on a similar machine with a slightly different file system. RAMA solves this problem by pseudo-randomly distributing data to a disk attached to each processor, making performance independent of program usage patterns. It does this without sacrificing the high performance that scientific users demand, as shown by simulations comparing the performance of RAMA and a striped file system on both real and synthetic benchmarks. Additionally, RAMA can be easily integrated with tertiary storage systems, providing a unified view of the file system spanning both disk and tape systems. RAMA's ease of use and simplicity of design make it an ideal choice for the massively parallel computers used by the scientific community.}
}

EndNote citation:

%0 Thesis
%A Miller, Ethan Leo
%T Storage Hierarchy Management for Scientific Computing
%I EECS Department, University of California, Berkeley
%D 1995
%@ UCB/CSD-95-872
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1995/5206.html
%F Miller:CSD-95-872