Accurate Floating Point Summation

James Demmel and Yozo Hida

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-02-1180
May 2002

http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/CSD-02-1180.pdf

We present and analyze several simple algorithms for accurately summing n floating point numbers s_1, ..., s_n, independent of how much cancellation occurs in the sum S = s_1 + ... + s_n. Let f be the number of significant bits in the summands s_i. We assume a register is available with F > f significant bits. Then assuming that (1) n <= floor( 2^(F-f) / (1-2^(-f)) ) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, simply summing the s_i in decreasing order of magnitude yields S rounded to within just over 1.5 units in its last place. If S = 0, then it is computed exactly. If we increase n slightly to floor( 2^(F-f) / (1-2^(-f)) ) + 3, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only (F >= 2f). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as n <= 33. If double extended is used, n can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.
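As an illustrative sketch (not the report's own code), the decreasing-magnitude summation strategy can be expressed in Python, whose floats are IEEE doubles (F = 53). The dot-product helper below is a hypothetical illustration of the report's n <= 33 example: products of 24-bit single precision significands need f = 48 bits, so floor(2^(53-48) / (1 - 2^(-48))) + 1 = 33.

```python
def sorted_sum(xs):
    """Sum floating point numbers in decreasing order of magnitude.

    A minimal sketch of the strategy analyzed in the report: when the
    accumulator carries F > f significant bits and n satisfies the
    bound n <= floor(2^(F-f) / (1 - 2^(-f))) + 1, summing from largest
    to smallest magnitude yields the sum to just over 1.5 ulps.
    """
    # Sort by magnitude, largest first, then accumulate left to right.
    return sum(sorted(xs, key=abs, reverse=True))

def dot_single_in_double(x, y):
    """Dot product of single-precision-valued vectors accumulated in double.

    Hypothetical helper: each product of f = 24-bit significands fits in
    f = 48 bits, so with F = 53 the report's bound guarantees ~1.5-ulp
    accuracy for up to 33 terms.
    """
    return sorted_sum([a * b for a, b in zip(x, y)])
```

For contrast, a naive left-to-right sum of [1e16, 1.0, -1e16] loses the 1.0 entirely (1e16 + 1.0 rounds back to 1e16), while summing in decreasing order of magnitude cancels the large terms first and recovers it.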


BibTeX citation:

@techreport{Demmel:CSD-02-1180,
    Author = {Demmel, James and Hida, Yozo},
    Title = {Accurate Floating Point Summation},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2002},
    Month = {May},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/5662.html},
    Number = {UCB/CSD-02-1180},
    Abstract = {We present and analyze several simple algorithms for accurately summing <i>n</i> floating point numbers <i>s_1</i>, ..., <i>s_n</i>, independent of how much cancellation occurs in the sum <i>S</i> = <i>s_1</i> + ... + <i>s_n</i>. Let <i>f</i> be the number of significant bits in the summands <i>s_i</i>. We assume a register is available with <i>F</i> > <i>f</i> significant bits. Then assuming that (1) <i>n</i> <= floor( 2^(F-f) / (1-2^(-f)) ) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, simply summing the <i>s_i</i> in decreasing order of magnitude yields <i>S</i> rounded to within just over 1.5 units in its last place. If <i>S</i> = 0, then it is computed exactly. If we increase <i>n</i> slightly to floor( 2^(F-f) / (1-2^(-f)) ) + 3, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only (<i>F</i> >= 2<i>f</i>). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as <i>n</i> <= 33. If double extended is used, <i>n</i> can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.}
}

EndNote citation:

%0 Report
%A Demmel, James
%A Hida, Yozo
%T Accurate Floating Point Summation
%I EECS Department, University of California, Berkeley
%D 2002
%@ UCB/CSD-02-1180
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/5662.html
%F Demmel:CSD-02-1180