Electrical Engineering
      and Computer Sciences

Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

Accurate Floating Point Summation

James Demmel and Yozo Hida

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-02-1180
May 2002

http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/CSD-02-1180.pdf

We present and analyze several simple algorithms for accurately summing n floating point numbers, independent of how much cancellation occurs in the sum. Let f be the number of significant bits in the sum ( si). We assume a register is available with F > f significant bits. Then assuming that (1) n <= floor( 2^(F-f) / (1-2^(-f)) ) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, then simply summing the si in decreasing order of magnitude yields S rounded to within just over 1.5 units in its last place. If S = 0, then it is computed exactly. If we increase n slightly to floor( 2^(F-f) / (1-2^(-f)) ) + 3, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only ( F >= 2 f). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as n <= 33. If double extended is used n can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.


BibTeX citation:

@techreport{Demmel:CSD-02-1180,
    Author = {Demmel, James and Hida, Yozo},
    Title = {Accurate Floating Point Summation},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2002},
    Month = {May},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/5662.html},
    Number = {UCB/CSD-02-1180},
    Abstract = {We present and analyze several simple algorithms for accurately summing <i>n</i> floating point numbers, independent of how much cancellation occurs in the sum. Let <i>f</i> be the number of significant bits in the sum (<i>si</i>).  We assume a register is available with <i>F</i> > <i>f</i> significant bits. Then assuming that (1) <i>n</i> <= floor( 2^(F-f) / (1-2^(-f)) ) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, then simply summing the <i>si</i> in decreasing order of magnitude yields <i>S</i> rounded to within just over 1.5 units in its last place. If <i>S</i> = 0, then it is computed exactly. If we increase <i>n</i> slightly to floor( 2^(F-f) / (1-2^(-f)) ) + 3, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only (<i>F</i> >= 2<i>f</i>).  We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as <i>n</i> <= 33. If double extended is used <i>n</i> can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.}
}

EndNote citation:

%0 Report
%A Demmel, James
%A Hida, Yozo
%T Accurate Floating Point Summation
%I EECS Department, University of California, Berkeley
%D 2002
%@ UCB/CSD-02-1180
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/5662.html
%F Demmel:CSD-02-1180