James Demmel and Peter Ahrens and Hong Diep Nguyen

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2016-121

June 18, 2016

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-121.pdf

We define reproducibility to mean getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should ideally not change the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with the nonassociativity of floating point addition, makes attaining reproducibility a challenge even for simple operations like summing a vector of numbers, let alone more complicated operations like the Basic Linear Algebra Subprograms (BLAS). We describe an algorithm that computes a reproducible sum of floating point numbers, independent of the order of summation. The algorithm depends only on a subset of the IEEE Floating Point Standard 754-2008. It is communication-optimal, in the sense that it makes just one pass over the data in the sequential case, or one reduction operation in the parallel case, using an "accumulator" represented by just 6 floating point words (more can be used if higher precision is desired). The arithmetic cost with a 6-word accumulator is 7n floating point additions to sum n words, and (in IEEE double precision) the final error bound can be as small as 10^(-8) times the error bound for conventional summation. We describe the basic summation algorithm, the software infrastructure used to build reproducible BLAS (ReproBLAS), and performance results. For example, when computing the dot product of 4096 double precision floating point numbers, we observe a 4x slowdown compared to the Intel Math Kernel Library (MKL) running on an Intel Core i7-2600 CPU at 3.4 GHz with 256 KB of L2 cache.
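To illustrate the underlying principle, here is a minimal C sketch of order-independent summation by exponent alignment: every summand is first rounded to a common bit boundary derived from the largest magnitude in the data, after which all additions are exact and therefore yield the same bits for any summation order. This simplified variant makes two passes over the data; the report's algorithm achieves a single pass using the 6-word binned accumulator. The function name and constants below are illustrative assumptions, not the ReproBLAS API.

#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified two-pass sketch of order-independent summation via
 * exponent alignment; the report's one-pass algorithm generalizes
 * this idea with a 6-word binned accumulator.  The name
 * reproducible_sum is illustrative, not the ReproBLAS API. */
double reproducible_sum(const double *x, size_t n)
{
    /* Pass 1: the largest magnitude fixes the alignment boundary. */
    double amax = 0.0;
    for (size_t i = 0; i < n; i++)
        if (fabs(x[i]) > amax)
            amax = fabs(x[i]);
    if (amax == 0.0)
        return 0.0;

    /* Splitting constant C = 1.5 * 2^(k+26): adding and then
     * subtracting it rounds each x[i] to a multiple of 2^(k-26).
     * The rounded values then add exactly in double precision for
     * n up to about 2^26, so the result is the same for every
     * summation order. */
    int k;
    frexp(amax, &k);              /* amax = m * 2^k with 0.5 <= m < 1 */
    double C = 1.5 * ldexp(1.0, k + 26);

    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        volatile double t = x[i] + C; /* volatile blocks reassociation/FMA */
        s += t - C;                   /* exact addition of aligned values */
    }
    return s;
}

int main(void)
{
    double x[4096];
    for (int i = 0; i < 4096; i++)
        x[i] = sin(0.001 * i);
    printf("%.17g\n", reproducible_sum(x, 4096));
    /* Permuting x and rerunning prints bitwise identical output. */
    return 0;
}

Compiled with a C99 compiler (linking with -lm), permuting the input array before rerunning produces bitwise identical output, which is the reproducibility property the report targets; the paper's one-pass scheme avoids the separate maximum-finding pass that this sketch requires.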


BibTeX citation:

@techreport{Demmel:EECS-2016-121,
    Author= {Demmel, James and Ahrens, Peter and Nguyen, Hong Diep},
    Title= {Efficient Reproducible Floating Point Summation and BLAS},
    Year= {2016},
    Month= {Jun},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-121.html},
    Number= {UCB/EECS-2016-121},
    Abstract= {We define reproducibility to mean getting bitwise identical
results from multiple runs of the same program, perhaps
with different hardware resources or other changes that
should ideally not change the answer. Many users depend on
reproducibility for debugging or correctness. However,
dynamic scheduling of parallel computing resources, 
combined with nonassociativity of floating point addition,
makes attaining reproducibility a challenge even for simple
operations like summing a vector of numbers, or more
complicated operations like the Basic Linear Algebra
Subprograms (BLAS).
We describe an algorithm that computes a reproducible sum 
of floating point numbers, independent of the order of 
summation. The algorithm depends only on a subset of 
the IEEE Floating Point Standard 754-2008. It is 
communication-optimal, in the sense that it does just one 
pass over the data in the sequential case, or one reduction
operation in the parallel case, requiring an 
``accumulator'' represented by just 6 floating point words
(more can be used if higher precision is desired). 
The arithmetic cost with a 6-word accumulator is 7n
floating point additions to sum n words, and (in IEEE
double precision) the final error bound can be as small as
10^(-8) times the error bound for conventional summation.
We describe the basic summation algorithm,
the software infrastructure used to build reproducible BLAS
(ReproBLAS), and performance results. For example, when
computing the dot product of 4096 double precision floating
point numbers, we observe a 4x slowdown compared to the
Intel Math Kernel Library (MKL) running on an Intel Core
i7-2600 CPU at 3.4 GHz with 256 KB of L2 cache.},
}

EndNote citation:

%0 Report
%A Demmel, James 
%A Ahrens, Peter 
%A Nguyen, Hong Diep 
%T Efficient Reproducible Floating Point Summation and BLAS
%I EECS Department, University of California, Berkeley
%D 2016
%8 June 18
%@ UCB/EECS-2016-121
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-121.html
%F Demmel:EECS-2016-121