# Accurate Floating Point Summation

### James Demmel and Yozo Hida

EECS Department

University of California, Berkeley

Technical Report No. UCB/CSD-02-1180

May 2002

### http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/CSD-02-1180.pdf

We present and analyze several simple algorithms for accurately summing *n* floating point numbers *s_i*, independent of how much cancellation occurs in the sum. Let *f* be the number of significant bits in the summands *s_i*. We assume a register is available with *F* > *f* significant bits. Then, assuming that (1) *n* <= floor( 2^(F-f) / (1-2^(-f)) ) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, simply summing the *s_i* in decreasing order of magnitude yields the sum *S* rounded to within just over 1.5 units in its last place. If *S* = 0, it is computed exactly. If *n* is increased slightly to floor( 2^(F-f) / (1-2^(-f)) ) + 3, all accuracy can be lost. This result extends work of Priest and others, who considered double precision only (*F* >= 2*f*). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as *n* <= 33; if double extended is used, *n* can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.
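The basic scheme the abstract describes can be sketched in a few lines of Python. This is not code from the report; the function names are ours, and we simulate single-precision (*f* = 24) summands via `struct` while using Python's native double (*F* = 53) as the wide accumulator:

```python
import struct

def to_f32(x):
    """Round a Python float to IEEE single precision (simulates a float32 summand)."""
    return struct.unpack('f', struct.pack('f', x))[0]

def sorted_sum(values):
    """Sum single-precision values in decreasing order of magnitude,
    accumulating in a double-precision (wider) register."""
    ordered = sorted(values, key=abs, reverse=True)
    acc = 0.0  # Python float = IEEE double, the F-bit register
    for x in ordered:
        acc += x  # each addition rounds to nearest in double
    return acc

def naive_f32_sum(values):
    """For contrast: sum in the given order, rounding every step to single."""
    acc = 0.0
    for x in values:
        acc = to_f32(acc + x)
    return acc

# An example with heavy cancellation: the two large terms cancel exactly,
# so the small terms determine the sum.
vals = [to_f32(v) for v in [1e8, 1.0, -1e8, 0.5]]
print(sorted_sum(vals))     # 1.5 (correct)
print(naive_f32_sum(vals))  # 0.5 (the 1.0 was absorbed into 1e8)
```

Here *n* = 4, well under the bound of 33 for single-precision data summed in double, so the theorem guarantees the sorted result is accurate to nearly 1.5 ulps; in this small example it is in fact exact.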

BibTeX citation:

```
@techreport{Demmel:CSD-02-1180,
    Author = {Demmel, James and Hida, Yozo},
    Title = {Accurate Floating Point Summation},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2002},
    Month = {May},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/5662.html},
    Number = {UCB/CSD-02-1180}
}
```

EndNote citation:

```
%0 Report
%A Demmel, James
%A Hida, Yozo
%T Accurate Floating Point Summation
%I EECS Department, University of California, Berkeley
%D 2002
%@ UCB/CSD-02-1180
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2002/5662.html
%F Demmel:CSD-02-1180
```