Pavel Panchekha


Share under CC-BY-SA.

An Accurate Quadratic Formula

As maintainer of Racket's math library, I sometimes put my numerical methods knowledge to a practical test. For example, a few days ago, I took a look at Racket's quadratic-solutions function, which returns the roots of a quadratic formula. Humorously, I've used the quadratic formula as an example in dozens of talks and in the Herbie paper. Yet it was more complex than I expected!

The quadratic formula

The quadratic formula returns all roots \(x\) of the equation \(a x^2 + b x + c = 0\). You probably learned it in high school; the formula looks something like this:

\[ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

You can turn that formula into code, and Racket versions up to 8.2 did just that to implement quadratic-solutions. But it turns out that it will be inaccurate for many inputs, due to both cancellation and overflow. J.A. Søgaard noted this problem in issue #16, and so my task was fixing the accuracy issues.

Before I get into the fixes, let me say a bit about my goals. The math library generally does not provide accuracy guarantees. So the goal is not to meet a fixed ULP bound. But I would like the math library to stay within some small number of ULPs. Similarly, avoiding overflow in all possible cases is difficult, and I'm willing to have overflow occur for extreme inputs—but I'd like that to be a binade or so of overflow. (One way of formalizing this is to imagine a parameterized floating point with variable mantissa and exponent size. I want a asymptotically constant error bounds and an asymptotically constant number of binades that cause overflow, but I'm flexible on the exact constant.)

Let me also add that W. Kahan has a note on this topic, which normally is the gold standard, but Kahan avoids discussing overflow, so I had to come up with some new tricks anyway.


The first issue that occurs with the formula above is cancellation. Consider, for now, positive \(b\) values much larger in magnitude than \(a\) and \(c\); then \(\sqrt{b^2 - 4 a c} \approx b\), so the numerator of the quadratic formula is adding two quantities of similar magnitudes and opposite signs.

This is a classic problem in numerics. The issue is that when you add two quantities of similar magnitudes but opposite signs, the most significant bits are all cancelled out. So even if the inputs are pretty accurate, their most accurate bits are thrown away, and the answer can be much less accurate than the inputs. (In extreme cases, you might throw away all the bits and return \(0\) even when the true answer is something else.)

To fix an issue like this, we have two options: algebraic rearrangement, and higher precision. In this case, algebraic rearrangement works, and when it does it's usually the better path. The rearrangement works like this.

Recall that we're solving the equation \(a x^2 + b x + c = 0\). Divide both sides by \(x^2\). 1 [1 This is valid if \(x\) is nonzero. If \(x\) can be zero, then \(c = 0\), so test for this, and if so return \(0\) and \(-b/a\).] This yields \(a + b \frac1x + c \frac1{x^2} = 0\), which is the quadratic formula again, but in \(\frac1x\) and with \(a\) and \(c\) switching places. In other words, if \(Q\) is the quadratic-solutions function, we have:

\[ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} = Q(a, b, c) = \frac1{Q(c, b, a)} = \frac{2c}{-b \mp \sqrt{b^2 - 4ac}} \]

Importantly, this \(\frac1x\) trick switches the identities of the two roots (the + root becomes the - one and vice versa). And adding two numbers of similar magnitudes and the same sign isn't an accuracy problem, so we can just use the left-hand formula for one root and the right-hand formula for the other root. This avoids cancellation.


Cancellation is always a problem for one of the two roots of the quadratic—pretty much no matter the values of \(a\), \(b\), and \(c\). Overflow is a less common problem: it only shows up if \(a\), $b, or \(c\) is larger in magnitude than \(10^{154}\) or so, or smaller than \(10^{-154}\) or so.2 [2 I'm using the same word "overflow" for both overflow and underflow, but actually due to gradual underflow their behavior is subtly different.] Still, there's no reason to have a function that breaks for some inputs! It should either throw an error or return a good answer, and in this case we can return a good answer.

The overflow happens because \(\sqrt{b^2 - 4ac}\) can overflow even when \(a\), \(b\), \(c\), and the answer do not, basically because this formula has an output of size \(O(b)\) from inputs of size \(O(b)\) but computes an intermediate value of size \(O(b^2)\). The solution to this is tricky, and resolves into two cases based on the sign of \(ac\).

If \(ac\) is positive, we are computing \(\sqrt{b^2 - 4 |a| |c|}\), and either the \(b^2\) or \(|a||c|\) may overflow. We'll fix this by imagining it to be a difference of squares: \(\sqrt{b^2 - \sqrt{4 |a| |c|}^2}\). The inner square root can be computed without overflow: \(x = 2 \sqrt{|a|} \sqrt{|c|}\). The outer square root, meanwhile, can avoid overflow by using the difference of squares formula: \(\sqrt{|b| - x}\sqrt{|b| + x}\). The \(|b| - x\) and \(|b| + x\) terms won't overflow (except for maybe one binade), and since they are square rooted before multiplying, the multiplcation can't overflow either.

If \(ac\) is negative we have a different issue. In this case we are computing \(\sqrt{b^2 + 4 |a| |c|}\). We can still compute \(x = 2 \sqrt{|a|} \sqrt{|b|}\) like above, but now we want a sum of square \(\sqrt{b^2 + x^2}\). Luckily, most math libraries, including Racket's, have a handy hypot function to compute exactly this. (hypot typically sorts its inputs by magnitude and, if \(b\) is larger, computes \(|b| \sqrt{1 + (\frac{x}{b})^2}\). The term inside the square root is between 1 and 2, since \(b\) is larger than \(x\), so no overflow can occur.)

Both of these tricks do cost a few extra ULPs of accuracy (effectively, each sqrt call moves one bit from exponent to mantissa, and so has to lose a bit of mantissa to make room for it) and they are slower (we replace one sqrt call with two or three, depending on sign) but they result in an asymptotically constant number of error bits for most inputs and an asymptotically constant number of binades that overflow.

A subtle cancellation

I said the difference of squares trick avoids cancellation for "most inputs" because there is a subtle but potentially important case where cancellation does occur: the subtraction \(|b| - x\). In some rare cases, this step is subtracting values of similar magnitude and same sign.

Now, it's not strictly speaking the subtraction that is at fault: subtracting values with nearby magnitudes is exact.3 [3 This is called Sterbenz subtraction.] It's more that errors in the inputs of the subtraction are magnified. In this case, the \(|b|\) input is computed exactly based on an input from the user, so here there's no error to magnify. But \(x\) is computed with error—it is the square root of $4ac$—and that could lead to problems. In extreme cases, where \(b \approx \sqrt{4ac}\) but the two are not exactly equal, you could have \(|b| - x\) compute to zero instead of a small positive number—which would correspond to mistakenly thinking the quadratic has one root instead of two.

How big a problem could this be? After all, almost any computation can result in small errors! Well, for a problem to occur, \(b\) and \(x\) would have to have the same size, which I'll suggestively write \(O(b)\) even though there's nothing asympotic going on. Typically we expect that \(x\) will have \(O(b \epsilon)\) error, where \(\epsilon\) is the machine epsilon$, and then the subtraction will be exact and also yield \(O(b \epsilon)\) error. The square root will then have \(O(\sqrt{b}\sqrt{\epsilon})\) error, and it will be multiplied by \(\sqrt{|b| + x}\), which will be \(O(\sqrt{b})\). So the full \(\sqrt{|b| + x} \sqrt{|b| - x|}\) term will compute to roughly 0, with \(O(b \sqrt{\epsilon})\) error. Added or subtracted from \(-b\), this will produce a relative error of \(O(\sqrt{\epsilon})\).

Since \(\epsilon\) is small, this is not super scary. But it means invalidating roughly half of the mantissa bits, which is not quite as good as the guarantee we want (a constant number of bad bits).

In this case there is no clever rearrangement that we can use—at least, I haven't thought of one. But we can use more precision. Unfortunately, I don't have access to quad-precision floating point or anything cool like that in the Racket math library. And I don't want to take recourse to exact rational arithmetic or arbitrary precision (which Racket does support) because it'll be very slow. Instead we can fake higher precision with lower precision—something often called double-double tricks. Let's look at how that works.

Let's say that \(x\) is the computed square root of \(4ac\). It has some error \(e\). Naturally, \(x + e = \sqrt{4 a c}\). Then \(x^2 + 2 x e + e^2 = 4 a c\). We know that \(e\) is small relative to \(x\), so the \(e^2\) term is unimportant. Ignoring it, we get \(e = (4 a c - x^2) / (2 x)\), or (rearranging a bit) \(e = (\frac{4ac}{x} - x) / 2\).4 [4 You can see the relationship to the Heron's formula for the square root; this is one way to derive that formula.]

Now instead of computing \(|b| - x\), we can compute \((|b| - x) - e\) instead. But we need to avoid overflow in this formula. Luckily, \(x\) is already going to be pretty close to the square root of \(4ac\), so we know that either \(a/x\) or \(c/x\) doesn't overflow. So we can compute \(\frac{4ac}x\) as \(4a \frac{c}x\) or \(4c\frac{a}{x}\), thereby avoiding overflow.

Finally we need to compute this \(e\) term accurately. You can do this with FMA tricks. Basically, to compute \(c/x\), you first compute it by normal floating-point division; call the result \(y\). It will equal \(c/x\) plus some error term \(e_y\). Now do fma(y, c, -x), which computes \(c y - x\) with only one rounding operation. That gives you \(c e_y\), within half an ULP. Divide that by \(c\) and you have \(e_y\) within an ulp. So now you have \(c/x\) as two floating-point values. Next you can compute \(z = 2a y\) and its error \(e_z\) via fma(-a, y, z). You have to add on the error \(2 a e_y\). Anyway, the point is that using FMA tricks you can compute the \(e\) term accurately, which lets you compute \(b - \sqrt{ac}\) to \(O(b \epsilon^2)\) accuracy, which ultimately leads to a constant number of error bits.


Putting all these pieces together, I am somewhat confident that we can bound error and overflow risk. To put a number on it, Herbie rates the final solution as averaging 0.3 bits of error—pretty good compared to 34.0 for the basic quadratic. Of course, I also checked what Herbie can do automaticaly in this situation. It can automatically fix the cancellation issue but only has partial solutions to the overflow issue (and thus doesn't even run into the subtle cancellation), scoring its own attempt at 8.8 bits of error. (Of course we're working on improving Herbie!) I don't have a formal worst-case bound—I tried using FPTaylor but right now some unrelated issues keep it from working. I did add some unit tests for various extreme cases, and it seems to behave admirably. Anyway, I've committed some version of the above, and hopefully from now on Racket users can expect a high-quality quadratic formula function!



This is valid if \(x\) is nonzero. If \(x\) can be zero, then \(c = 0\), so test for this, and if so return \(0\) and \(-b/a\).


I'm using the same word "overflow" for both overflow and underflow, but actually due to gradual underflow their behavior is subtly different.


This is called Sterbenz subtraction.


You can see the relationship to the Heron's formula for the square root; this is one way to derive that formula.