Absorption in Double-Double Arithmetic
In case you can't tell, I've been following a double-double rabbit-hole recently. Those previous two posts were about performance, but this one is about semantics. Specifically, it's about using double-double arithmetic as an oracle for standard double-precision arithmetic, and a specific issue, absorption, that comes up.
Double-double oracles
I have a dream of "safe" numerical programming, where any time you do something funky with floating-point, an exception is raised and then you go and fix the problem. I call this a "floating-point debugger", though maybe Santosh's "sanitizer" terminology is better.
These things exist—I've published one—but, as Bhargav's recent ExplaniFloat paper pointed out, they don't work super well. Specifically, existing debuggers work by pairing each floating-point computation with an "oracle" computation in higher precision. If that higher precision is MPFR with 512 or 2048 bits, then the debugger has few false positives but is super slow, 100s or 1000s of times slower than normal execution. On the other hand, if you use double-double arithmetic as your oracle, as in EFTSanitizer, you get lots of false positives and negatives, though the plus is a reasonable overhead of 10× or so. Still, it seems to me to be the obvious way forward; 1000-bit arithmetic won't be fast no matter how clever we are.
Specifically, this post is about false negatives. These are a real issue! ExplaniFloat calls out this example:
sqrt(x + 1) - sqrt(x) with x = 1e+100
Let's see what happens with a double-double oracle. With a double-double oracle, you track and update two values for each floating-point variable: its floating-point value, and the residual, which is the difference between that floating-point value and its true real value. So for example, when you do x + 1, the result is 1e100, because the 1 part is rounded off. Then the residual is 1. The tradition is to write a value plus residual as 1e100 + 1, which is a little confusing, so I'll adopt the notation 1e100 [+ 1].
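The residual of an addition can be computed exactly with the standard TwoSum error-free transformation (this sketch is mine, not taken from any particular debugger):

```python
def two_sum(a, b):
    """Knuth's TwoSum: return fl(a + b) and the exact rounding error."""
    s = a + b
    bb = s - a                        # the part of b that made it into s
    err = (a - (s - bb)) + (b - bb)   # what got rounded off
    return s, err

s, r = two_sum(1e100, 1.0)
print(s, r)  # 1e+100 1.0, i.e. 1e100 [+ 1]
```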
Now let's see how the computation proceeds. The next step is to compute sqrt(x). The floating-point result is 1e50, though keep in mind that (as is standard) I don't actually mean the real number 1e50, I mean the closest floating-point number to that real number. The exact real value of that floating-point number is

100000000000000007629769841091887003294964970946560

Ok, so that's the computed result. What's the residual? Computing this directly is tricky, so just trust me here: it is -6.834625285603891e+33. So our double-double oracle value for sqrt(x) is 1e50 [- 6.834625285603891e+33].
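Or, if you don't want to trust me, you can check the residual with Python's decimal module (a quick sketch; 80 digits of precision is plenty here):

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 80

x = 1e100                     # the double nearest 1e100
fl = math.sqrt(x)             # the floating-point sqrt, the "1e50" above
residual = Decimal(x).sqrt() - Decimal(fl)   # true value minus float value
print(float(residual))        # ≈ -6.834625285603891e+33
```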
Now let's look at sqrt(x + 1). This is slightly different from sqrt(x) because even though both x and x + 1 have the same floating-point value, they have different residuals. But we can still compute things, and it turns out that the double-double value for sqrt(x + 1) is … drum roll … 1e50 [- 6.834625285603891e+33]. It's the same! Odd, ok. Finally, when we do the subtraction, we get a final result of 0 [+ 0]. But this is bad: it says that the computed floating-point result, 0, is exact (the + 0 bit says the real number is the same). This leads to a false negative.
What goes wrong
Why does this happen? Here it's worth borrowing the condition number formalism of ATOMU. This formalism describes numerical issues like so:
- Every floating-point operation rounds, which introduces some error, called the introduced error and named μ. More specifically, operation \(i\) has introduced error \(\mu_i\).
- Every floating-point dataflow (a function taking an argument) amplifies that error by a condition number, named Γ. More specifically, operation \(i\) multiplies its \(j\)th argument's error by \(\Gamma_{i, j}\).
- The error of a computation is equal to the sum of all of the μs that flow to it, multiplied by all the Γs they see along the way, along all possible paths.
In ATOMU the introduced errors and amplification factors are relative, but you can do the same with absolute errors, which is what you need to analyze double-double oracles.
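Concretely, for the running example, written with absolute errors (my notation, not ATOMU's): the subtraction's condition numbers are \(\pm 1\), and the square root's is its derivative, so

\[ \mathrm{err} \approx \Gamma_{\sqrt{\cdot}}\,\mu_{+} + \mu_{\sqrt{x+1}} - \mu_{\sqrt{x}}, \qquad \Gamma_{\sqrt{\cdot}} = \frac{1}{2\sqrt{x}} \approx 5 \times 10^{-51}. \]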
What I like about this framework is that, if a computation has high error, you can specifically identify the introduced error and path of condition numbers that produced it. (There may be multiple contributing ones, of course.)
Looking at our sqrt(x + 1) - sqrt(x) example, there are three significant μ terms: the addition and the two square roots. None of the Γ terms are significant. The final error is, basically, \((\mu_1 + \mu_2) - \mu_3\).1 Now, it turns out that \(\mu_2\) is much larger in magnitude than \(\mu_1\), so \(\mu_1 + \mu_2\) computes to just \(\mu_2\). And then it turns out that \(\mu_2 = \mu_3\), which is why the final residual is 0. If you did the math without rounding error, you'd get \(\mu_1\), which is roughly \(5 \times 10^{-51}\); but our residual is stored in a double, so it can't represent both \(\mu_1\) and \(\mu_2\) at the same time.
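You can see the absorption directly in double arithmetic (plugging in the values from the walkthrough above):

```python
mu1 = 5e-51                   # the addition's error, after sqrt's amplification
mu2 = -6.834625285603891e+33  # sqrt(x + 1)'s introduced error
mu3 = -6.834625285603891e+33  # sqrt(x)'s introduced error

assert mu1 + mu2 == mu2           # mu1 is absorbed: far below mu2's ulp
assert (mu1 + mu2) - mu3 == 0.0   # so the residual comes out exactly 0
```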
Absorption
I call this problem "absorption". Let me make the situation sharper. When you have a binary operation like addition or division, the error is the sum of three sources: the introduced error, the amplified error of the first argument, and the amplified error of the second argument. But imagine all three sources have radically different orders of magnitude: then the "sum" is instead a "maximum". In this case every term's error is the result of a single introduced error μ, multiplied by all of the Γs along some one route to the output. And sometimes the errors cancel away and you're accidentally left with no error, leading to a false negative.
This suggests to me a solution. Instead of adding the errors, pick exactly one error to propagate. The default presumably is taking the maximum error—but in some cases it'd be better to propagate some other, smaller error. In this sqrt(x + 1) - sqrt(x) case, the sqrt(x + 1) term should choose to propagate its input error instead of its introduced error, and the sqrt(x) term should choose to propagate no error instead of its introduced error. Then we'd have:
x + 1       = 1e100 [+ 1]
sqrt(x + 1) = 1e50  [+ 5e-51]
sqrt(x)     = 1e50  [+ 0]
result      = 0     [+ 5e-51]
Critically, which error to propagate is now a free choice of our debugger. I'm not sure how it should make this choice—my student Yumeng is testing one option right now—but I think this new freedom of choice has the potential to give us double-double debugging without the false negatives, which would be, I think, nearly perfect? Fingers crossed, and stay tuned!
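As a toy sketch of that walkthrough (the per-operation choices are hard-coded here; a real debugger would have to compute them somehow):

```python
import math

# (value, residual) pairs; each operation propagates ONE error source
x      = (1e100, 0.0)
x1     = (x[0] + 1, 1.0)                    # x + 1: keep the rounded-off 1
gamma  = 1 / (2 * math.sqrt(x1[0]))         # sqrt's amplification, 1/(2*sqrt(x))
s_x1   = (math.sqrt(x1[0]), gamma * x1[1])  # sqrt(x+1): propagate input error
s_x    = (math.sqrt(x[0]), 0.0)             # sqrt(x): propagate no error
result = (s_x1[0] - s_x[0], s_x1[1] - s_x[1])
print(result)  # value 0.0, residual ≈ 5e-51
```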
Footnotes:
1. I am dropping the Γ terms, even though in reality they are there. Trust me!