Logarithms of Taylor Expansions
I’ve already described how Herbie computes series expansions to produce polynomial approximations to floating-point programs. There, I described how expansions of exponentials and the trigonometric functions worked, but I didn’t have a way to expand logarithms and thus powers. I eventually solved the problem with a purpose-built symbolic differentiator, which worked well enough, but I kept feeling that the solution was inelegant. Then recently, Jon Schneider, a college buddy of mind, wrote in with a simple an elegant solution that was also somehow the only obvious thing to do—how had it slipped my mind? I’d like to describe both my ugly and Jon’s elegant solution here.
Symbolic differentiation
Differentiating a mathematical function is a simple, mechanical process that’s easily programmed into a computer. Since a series expansion is defined1 [1 Or maybe this is a theorem for you; regardless. Note that this is actually a simplified case; Herbie uses bounded Laurent series internally, not Taylor series. But the principle is the same.] by \[ f(x) = f(0) + f'(0) x + f''(0) x / 2! + \cdots \] it is sufficient to compute derivatives in order to do so. But as I described in the previous article, Herbie doesn’t compute Taylor series by symbolic differentiation. Instead, it computes Taylor series bottom-up: it first computes the Taylor series of leaf nodes, and then for each function application \(f(g(x))\) computes the series expansion from the series for \(g(x)\).
So if I’m looking to support logarithms in Herbie, I need a way to compute the $n$-th derivative of \(\log(f(x))\) at 0, in terms of \(f(0)\), \(f'(0)\), \(f''(0)\), and so on. To do this I can compute the derivative of \(\log(f(x))\) symbolically as many times as necessary. For example, by the chain rule, \(D[\log(f(x))](0) = f'(0) / f(0)\), and you can differentiate \(f'(x) / f(x)\) again to get the second derivative, and so on.
As you do this, you quickly find a pattern. No matter how many derivatives you take, the result is always a sum of terms, each of which is a product of derivatives of \(f\) to various powers. For example, \(D^2[\log(f(x))](0)\) is the sum of two terms, the first of which is the product of the zeroth derivative of \(f\) to the $-1$-st power and the second derivative of \(f\) to the first power. We can represent each term as a coefficient (one in the case above) and a list of powers for each derivative. Let’s write \([e_0, e_1, e_2, \ldots]\) for list, and then we can write any derivative of \(\log(f(x))\) as a sum of these bracket terms. We start at the first derivative of \(\log(f(x))\), which is \([-1, 1]\) (later terms are implicitly zero).
The generalized chain rule
Now let’s consider how differentiation acts on these terms. There is a useful trick here for avoiding the bog of quotient and multiplication rules, which I’ve heard called the generalized chain rule. Suppose you want to differentiate some expression where the variable being differentiated, \(x\), appears multiple times. For example, suppose we want to differentiate \(f'(x) / f(x)\). Then, do the differentiation once for each instance of \(x\), treating all other instances and constants, and sum the results together. In this case, we’d first differentiate “in the top \(x\)”, which gives us \(f''(x) / f(x)\) (the bottom \(f(x)\) is just a constant, so it just stays there), and then the bottom \(x\), which gives \(-f'(x) f'(x) / f(x)^2\) (the \(f'(x)\) is just a constant).
This gives us an easy way to differentiate the \([c ; e_0, e_1, \ldots]\) terms from \(\log(f(x))\). Each nonzero entry \(e_i\) stands for a factor of \(f^{(i)}(x)^{e_i}\), which differentiates to \(e_i f^{(i+1)}(x) f^{(i)}(x)^{e_i-1}\). All of the other \(e_i\) entries can be taken to stand for constant factors, so this just has the effect of incrementing \(e_{i+1}\) and decrementing \(e_i\) (and moving an \(e_i\) coefficient out front). We can do this to each \(e_i\) and sum the results.
So, for example, if we take \([-1, 1]\), which is the first derivative of \(\log(f(x))\), we can transform this into \(-1 [-2, 2]\) or into \(1 [-1, 0, 1]\). Sum those, and you get \([-1, 0, 1] - [-2, 2]\), which is exactly the result we expect. This mechanical procedure, which does nothing more than manipulate lists, is very easy to program, and runs pretty fast.
Oh! You might wonder why the generalized chain rule works. The proof is simple. Take a function \(f(x)\) with two instances of \(x\), and imagine it a function \(g\) of two different variables, \(x_1\) and \(x_2\). We can imagine that \(g\) that takes a vector of the \(x_1\) and \(x_2\) values as input: \(g([x_1, x_2])\). Now, if we send in the value \(x\) for \(x_1\) and \(x_2\) in \(g\), that will be the same as the \(f\) we started with; that is, write \(\mathbf{x}(x)\) for the function \(x \mapsto [x, x]\), and then \(g(\mathbf{x}(x)) = f(x)\).
We can now use the chain rule for vector-valued functions on \(D[f(x)] = D[g(\mathbf{x}(x))]\):
\begin{align*} D[g(\mathbf{x}(x))] &= (\partial_1 g)(\mathbf{x}(x)) D[\mathbf{x}(x)_1] + (\partial_2 g)(\mathbf{x}(x)) D[\mathbf{x}(x)_2] \\ &= (\partial_1 g)([x, x]) + (\partial_2 g)([x, x]) \end{align*}This last term is precisely the “derivative of \(f\) in first \(x\) variable” plus the “derivative of \(f\) in the second \(x\) variable”.
What I didn’t like about this way of taking logs of Taylor series
The algorithm above worked pretty well: it was plenty fast, it got the right answer, and the code was short. But it felt inelegant. Herbie evaluated exponential and trigonometric functions of Taylor series, after all, thanks to an elegant mathematical trick. Why did logs just go through a stripped-down symbolic differentiator? Now, you may say, no matter how elegant the trick, the results are the same—why did I care, if the code was fast, correct, and easy to maintain? But part of the reason I’m writing Herbie is because it’s fun; and finding neat mathematical tricks is part of that fun. I felt that I was missing out.
But after I wrote the code above, I put the matter away in my mind, because I had something working, and I had other things I needed to do. So when my college friend Jon Schneider emailed me out of the blue with a solution, I was surprised and pleased. At first, I didn’t actually understand what he meant, which made me feel all the dumber when I realized that his solution was simple, clean, and the one obvious thing to do. How had I missed it?
Jon’s solution
Consider again the problem: we want to compute a Taylor series for \(\log(a_0 + a_1 x + a_2 x^2 + \cdots)\). Well, first of all, \(a_0\) can’t be zero, or there is no Taylor series (since \(\log(x)\) diverges near zero). Alright, but if \(a_0\) is nonzero, we can divide by it: \[ \log(a_0 + a_1 x + a_2 x^2 + \cdots) = \log(a_0) + \log(1 + (a_1 + a_2 x + \cdots) x / a_0). \] Let’s write \(y\) for \(a_1 + a_2 x + \cdots\). The above is \(\log(a_0) + \log(1 + y x / a_0)\); apply the Taylor expansion for \(\log(1 + x)\) to get \[ \log(a_0) + \frac{y x}{a_0} - \frac{y^2 x^2}{2 a_0} + \dotsc \] If we expand this out fully (substituting in \(y\)), we’ll get the final Taylor series. But importantly, the only the first \(n\) terms contribute to the coefficient in front of the \(x^n\) term. So if we want to compute any particular coefficient of the final Taylor expansion, we just need to compute a finite sum.
If we actually go through the effort to expand this out, we find that the coefficient of \(x^n\) will involve summing, for each \(i < n\), the \(x^i\) coefficient in \(y^{n-i}\). Finding this coefficient is a matter of finding integer powers of \(y\), which you can do by repeated multiplication. This is, I think, a much more elegant technique, and has the benefit of requiring less code and being clearer. It’s also just a nice trick.
Footnotes:
Or maybe this is a theorem for you; regardless. Note that this is actually a simplified case; Herbie uses bounded Laurent series internally, not Taylor series. But the principle is the same.