Accurate Floating Point Summation
James Demmel and Yozo Hida
EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-02-1180
May 2002
http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/CSD-02-1180.pdf
We present and analyze several simple algorithms for accurately summing n floating point numbers, independent of how much cancellation occurs in the sum. Let f be the number of significant bits in the sum (si). We assume a register is available with F > f significant bits. Then assuming that (1) n <= floor( 2^(F-f) / (1-2^(-f)) ) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, then simply summing the si in decreasing order of magnitude yields S rounded to within just over 1.5 units in its last place. If S = 0, then it is computed exactly. If we increase n slightly to floor( 2^(F-f) / (1-2^(-f)) ) + 3, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only (F >= 2f). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as n <= 33. If double extended is used n can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.
BibTeX citation:
@techreport{Demmel:CSD-02-1180,
Author = {Demmel, James and Hida, Yozo},
Title = {Accurate Floating Point Summation},
Institution = {EECS Department, University of California, Berkeley},
Year = {2002},
Month = {May},
URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/5662.html},
Number = {UCB/CSD-02-1180}
}
EndNote citation:
%0 Report %A Demmel, James %A Hida, Yozo %T Accurate Floating Point Summation %I EECS Department, University of California, Berkeley %D 2002 %@ UCB/CSD-02-1180 %U http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/5662.html %F Demmel:CSD-02-1180
