Understanding Error Taylor Series
FPTaylor is a big advance in sound floating-point error estimation;1 [1 Alexey Solovyev, the primary author of FPTaylor, notes that the inspiration for FPTaylor came from earlier projects like Rosa and Daisy, and the Taylor series technique itself goes back further.] but it is little-known even in the field. I've been working with error Taylor series recently and think they're the bees knees, so I want them to be more broadly understood.
Error at a point
Suppose you have a floating-point expression \(E\) over some variables \(x, y\), and you want to figure out how much rounding error it incurs. The rounding error differs for different inputs, so let's also fix a specific input \(x, y\), and let's say we want to measure absolute error.2 [2 Though error Taylor series also work for relative or other kinds of error] In this case the correct answer is some number, the difference between the true mathematical answer \(E[x, y]\) and the floating-point version, which I'll write \(R(E)[x, y]\).3 [3 The mnemonic is that I'm "rounding" the expression.]
\[ \mathcal{E} = E[x, y] - R(E)[x, y] \]
So far so good, but \(E[x, y]\), being a real-number computation, is not directly computable, so we want some other way of estimating the error. So perhaps instead we'd like to use some analytical techniques, except that \(R(E)[x, y]\) involves a bunch of floating-point computation, which is some ugly discrete thing that math tools don't work well against.
So the traditional solution to this is to apply an error model, which replaces the deterministic but tricky floating-point operations with non-deterministic overapproximations. For example, one classic approximation is \[ R(f)(x, y) \in f(x, y) (1 + \varepsilon) + \delta \] where ε and \(\delta\) are fresh variables bounded by some constants that depend on your precision. For simplicity I'm going to call both of these new variables "εs".
To be clear, this isn't the only possible error model. You can drop the \(\delta\) term, which will get you a simpler model but one that doesn't handle subnormals. But then, this model doesn't handle infinities or NaN! And you can do better, for example by explicitly consider the ulp size of \(f(x, y)\). The explicit point of an error model is to trade off analyzability for accuracy: it introduces more behaviors than pure floating-point computation, which reduces accuracy, but the overall space of behaviors is more regular, making them easier to analyze.
So if we apply this error model to \(R(E)\) we end up with a new expression \(E^*\) which has variables \(x\) and \(y\) (for which we've picked specific values) but also a bunch of new free ε terms:
\[ \mathcal{E} \in E[x, y] - E^*[x, y, \varepsilon_1, \varepsilon_2, \ldots] \]
Since we don't know the values of the εs, I've replaced the equal sign with set inclusion, to emphasize that the right hand side is actually a whole set of possible errors that only contains the true error.
Usually what we're after is an upper bound on \(\mathcal{E}\), and we can achieve that by finding the maximum of the right hand side. But that can be tricky: with even a modestly-large expression \(E\), \(E^*\) will have dozens of variables to optimize over. And \(E^*\) is in general non-linear, so you suddenly find yourself with a high-dimensional nonlinear optimization problem. That's not good!
The Taylor trick
The error Taylor series trick lets you skip the optimization over all of the εs. Since that means skipping a high-dimensional nonlinear optimization problem, it dramatically simplifies the error estimation problem.
The trick hinges on the fact that while the εs are free variables, they are bounded to be pretty small. So the non-linearity is not very important—higher-order terms just can't matter much because the deviation is pretty small. So we should find a linear approximation to \(E^*\).
Let's work through it. Taylor series represent a function \(f(x)\) as: \[ f(x) \approx f(0) + f'(0) x + o(x^2) \] We'll ignore the higher-order terms entirely, but in principle you can bound them if you need a sound bound for rounding error. In practice, that sound bound almost never matters because again the nonlinearities don't matter much because the εs can only vary over a really small range.
In this case we want to take the Taylor series in each of the εs, so we will get something like:4 [4 Alexey points out that technically FPTaylor doesn't use derivatives, but a generalization, because some error models aren't smooth enough for regular derivatives.]
\begin{align*} & E^*[x, y, \varepsilon_1, \varepsilon_2, \ldots] = &E^*[x, y, 0, 0, \ldots] + \sum_{n} \varepsilon_n \frac{\partial E^*}{\partial \varepsilon_n}[x, y, 0, 0, \ldots] \end{align*}Now, the first term is just \(E^*\) with all the εs set to 0. But in our error model, that reduces \(E^*\) to \(E\), so we have:
\begin{align*} \mathcal{E} &\in E[x, y] - \left(E^*[x, y, 0, 0, \ldots] + \sum_{n} \varepsilon_n \frac{\partial E^*}{\partial \varepsilon_n}[x, y, 0, 0, \ldots]\right) \\ &= -\sum_{n} \varepsilon_n \frac{\partial E^*}{\partial \varepsilon_n}[x, y, 0, 0, \ldots] \end{align*}Now, this last term still depends on the values of the ε variables, but only because we multiply by it: the partial derivative factor substitutes 0 for its value so is independent of ε. Now, recall that we're probably trying to bound the value of \(\mathcal{E}\) from above. To make that term as big as possible, we need to make each multiplier \(\varepsilon_n\) as big as possible. And those are bounded in magnitude by some constant, let's call it \(\epsilon_n\), so to get a bound we need to set \(\varepsilon_n\) to \(\pm \epsilon_n\). And we should just choose the sign so it cancels out the sign of the partial derivative factor, so we can just write that as:
\[ \mathcal{E} \le \sum_{n} \epsilon_n \left| \frac{\partial E^*}{\partial \varepsilon_n}[x, y, 0, 0, \ldots] \right| \]
Note that this expression now has no free variables (for now we're holding \(x\) and \(y\) fixed), so it doesn't involve any optimization. It's just an expression that you can evaluate!
Summary and uses
In summary, when you use an error Taylor series, you go through a series of transformations. You start with an expression \(E\) and an input \(x, y\). Then, you mix in an error model, resulting in an expression \(E^*\), which has vastly more variables than \(E\) did. The Taylor trick then transforms this into a new expression \[ \sum_n \epsilon_n \left| \frac{\partial E^*}{\partial \varepsilon_n}[x, y, 0, 0, \ldots] \right| \] which only has \(x\) and \(y\) as inputs and which is a bound on the error \(\mathcal{E}\).
What you do with this bound afterwards is kind of up to you. One option is to actually use global optimization to find the worst error over all inputs \(x\) and \(y\). Another option is to use this error estimate for tuning. You could even use it for sampling-based error estimation.
A common misconception I see is that error Taylor series require global optimization. That happens because people think of error Taylor series in terms of bounding floating-point error over a range. Yes—that involves global optimization over that range. But if you think about it on a point-by-point basis, the error Taylor series is just a simple expression that you can calculate that estimates floating-point error. The global optimization comes from the task you're applying error Taylor series to, not something inherent in the technique.
Footnotes:
Alexey Solovyev, the primary author of FPTaylor, notes that the inspiration for FPTaylor came from earlier projects like Rosa and Daisy, and the Taylor series technique itself goes back further.
Though error Taylor series also work for relative or other kinds of error
The mnemonic is that I'm "rounding" the expression.
Alexey points out that technically FPTaylor doesn't use derivatives, but a generalization, because some error models aren't smooth enough for regular derivatives.