Chain Rule
The Chain Rule is an essential theorem of calculus: it tells us how to differentiate the composition of two functions.
Theorem
The theorem states that if $h(x) = f(g(x))$, then
$$h'(x) = f'(g(x))\,g'(x)$$
wherever those expressions make sense.
For example, if $h(x) = \sin(x^2)$, $f(x) = \sin(x)$, and $g(x) = x^2$, then
$$h'(x) = f'(g(x))\,g'(x) = \cos(x^2)\cdot 2x.$$
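The same computation can also be written in Leibniz notation (the intermediate variable $u$ is introduced here just for illustration): with $u = g(x) = x^2$ and $y = f(u) = \sin(u)$,
$$\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx} = \cos(u)\cdot 2x = 2x\cos(x^2).$$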
Here are some more precise statements for the single-variable and multi-variable cases.
Single variable Chain Rule
Let each of $I$ and $J$ be an open interval, and suppose $g : I \to \mathbb{R}$ and $f : J \to \mathbb{R}$. Let $g$ be such that $g(I) \subseteq J$. Let $x_0 \in I$. If $g$ is differentiable at $x_0$, and $f$ is differentiable at $g(x_0)$, then $f \circ g$ is differentiable at $x_0$, and
$$(f \circ g)'(x_0) = f'(g(x_0))\,g'(x_0).$$
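As a concrete instance of this statement (using the same illustrative functions as in the example above): take $I = J = \mathbb{R}$, $g(x) = x^2$, and $f(u) = \sin(u)$. Then $g(I) = [0, \infty) \subseteq J$, $g$ is differentiable at every $x_0 \in I$, and $f$ is differentiable at $g(x_0)$, so the theorem gives
$$(f \circ g)'(x_0) = f'(g(x_0))\,g'(x_0) = \cos(x_0^2)\cdot 2x_0.$$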
Multi-dimensional Chain Rule
Let $g : U \to \mathbb{R}^m$ and $f : V \to \mathbb{R}^p$, where $U \subseteq \mathbb{R}^n$ and $V \subseteq \mathbb{R}^m$ are open sets. (Here each of $n$, $m$, and $p$ is a positive integer.) Let $g$ be such that $g(U) \subseteq V$. Let $x_0 \in U$. If $g$ is differentiable at $x_0$, and $f$ is differentiable at $g(x_0)$, then $f \circ g$ is differentiable at $x_0$ and
$$D(f \circ g)(x_0) = Df(g(x_0))\,Dg(x_0).$$
(Here, each of $D(f \circ g)(x_0)$, $Df(g(x_0))$, and $Dg(x_0)$ is a matrix.)
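For example (the particular functions here are chosen just for illustration): take $n = m = 2$, $p = 1$, $g(x, y) = (x^2,\, xy)$, and $f(u, v) = u + \sin(v)$. Then
$$Dg(x,y) = \begin{pmatrix} 2x & 0 \\ y & x \end{pmatrix}, \qquad Df(u,v) = \begin{pmatrix} 1 & \cos(v) \end{pmatrix},$$
so the Chain Rule gives
$$D(f \circ g)(x,y) = Df(g(x,y))\,Dg(x,y) = \begin{pmatrix} 1 & \cos(xy) \end{pmatrix}\begin{pmatrix} 2x & 0 \\ y & x \end{pmatrix} = \begin{pmatrix} 2x + y\cos(xy) & x\cos(xy) \end{pmatrix},$$
which agrees with differentiating $(f \circ g)(x,y) = x^2 + \sin(xy)$ directly.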
Intuitive Explanation
The single-variable Chain Rule is often explained by pointing out that
$$\frac{f(g(x+h)) - f(g(x))}{h} = \frac{f(g(x+h)) - f(g(x))}{g(x+h) - g(x)} \cdot \frac{g(x+h) - g(x)}{h}.$$
The first term on the right approaches $f'(g(x))$, and the second term on the right approaches $g'(x)$, as $h$ approaches $0$. This can be made into a rigorous proof. (But we do have to worry about the possibility that $g(x+h) - g(x) = 0$, in which case we would be dividing by $0$.)
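For instance (one simple case where this happens): if $g$ is a constant function, then $g(x+h) - g(x) = 0$ for every $h$, so the middle expression above is undefined, even though the Chain Rule itself still holds, both sides being $0$.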
This explanation of the chain rule fails in the multi-dimensional case, because in the multi-dimensional case $g(x+h) - g(x)$ is a vector, as is $h$, and we can't divide by a vector.
There's another way to look at it:
Suppose a function $F$ is differentiable at $a$, and $v$ is "small". Question: How much does $F$ change when its input changes from $a$ to $a + v$? (In other words, what is $F(a+v) - F(a)$?) Answer: approximately $DF(a)\,v$. This is true in the multi-dimensional case as well as in the single-variable case.
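As a quick one-variable illustration (with $F(x) = x^2$ chosen just for concreteness): here $DF(a) = 2a$, and
$$F(a+v) - F(a) = (a+v)^2 - a^2 = 2av + v^2 \approx 2av = DF(a)\,v,$$
since the leftover term $v^2$ is much smaller than $v$ when $v$ is small.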
Suppose that (as above) $g$ is differentiable at $x_0$ and $f$ is differentiable at $g(x_0)$, and $v$ is "small", and someone asks you how much $f \circ g$ changes when its input changes from $x_0$ to $x_0 + v$. That is the same as asking how much $f$ changes when its input changes from $g(x_0)$ to $g(x_0 + v)$, which is the same as asking how much $f$ changes when its input changes from $g(x_0)$ to $g(x_0) + w$, where $w = g(x_0 + v) - g(x_0)$. And what is the answer to this question? The answer is: approximately, $Df(g(x_0))\,w$.
We must still determine $w$, that is, how much $g$ changes when its input changes from $x_0$ to $x_0 + v$. Answer: approximately $Dg(x_0)\,v$.
Therefore, the amount that $f \circ g$ changes when its input changes from $x_0$ to $x_0 + v$ is approximately $Df(g(x_0))\,Dg(x_0)\,v$.
We know that $D(f \circ g)(x_0)$ is supposed to be a matrix (or number, in the single-variable case) such that $D(f \circ g)(x_0)\,v$ is a good approximation to $(f \circ g)(x_0 + v) - (f \circ g)(x_0)$. Thus, it seems that $Df(g(x_0))\,Dg(x_0)$ is a good candidate for being the matrix (or number) that $D(f \circ g)(x_0)$ is supposed to be.
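In symbols, the whole approximation argument can be condensed into one line:
$$(f \circ g)(x_0 + v) - (f \circ g)(x_0) = f(g(x_0 + v)) - f(g(x_0)) \approx Df(g(x_0))\big(g(x_0+v) - g(x_0)\big) \approx Df(g(x_0))\,Dg(x_0)\,v.$$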
This can be made into a rigorous proof. The standard proof of the multi-dimensional chain rule can be thought of in this way.
Proof
The following is a proof of the multi-variable Chain Rule. It's a "rigorized" version of the intuitive argument given above.
This proof uses the following fact: Assume $F : \Omega \to \mathbb{R}^m$, where $\Omega \subseteq \mathbb{R}^n$ is open, and $x_0 \in \Omega$. Then $F$ is differentiable at $x_0$ if and only if there exists an $m$ by $n$ matrix $A$ such that the "error" function
$$E(h) = F(x_0 + h) - F(x_0) - A\,h$$
has the property that $\frac{\|E(h)\|}{\|h\|}$ approaches $0$ as $h$ approaches $0$. (In fact, this can be taken as a definition of the statement "$F$ is differentiable at $x_0$.") If such a matrix $A$ exists, then it is unique, and it is called $DF(x_0)$. Intuitively, the fact that $\frac{\|E(h)\|}{\|h\|}$ approaches $0$ as $h$ approaches $0$ just means that $F(x_0 + h)$ is approximated well by $F(x_0) + A\,h$.
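To see this definition in action, consider the function $F : \mathbb{R}^2 \to \mathbb{R}$ given by $F(x, y) = xy$ (chosen here just as an illustration), at a point $x_0 = (a, b)$. Taking $A = \begin{pmatrix} b & a \end{pmatrix}$, the error function for $h = (h_1, h_2)$ is
$$E(h) = (a + h_1)(b + h_2) - ab - (b\,h_1 + a\,h_2) = h_1 h_2,$$
and $\frac{\|E(h)\|}{\|h\|} = \frac{|h_1 h_2|}{\|h\|} \le \frac{\|h\|^2}{\|h\|} = \|h\|$, which approaches $0$ as $h$ approaches $0$. So $F$ is differentiable at $(a, b)$ and $DF(a, b) = \begin{pmatrix} b & a \end{pmatrix}$.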
Let $g : U \to \mathbb{R}^m$ and $f : V \to \mathbb{R}^p$, where $U \subseteq \mathbb{R}^n$ and $V \subseteq \mathbb{R}^m$ are open sets. (Here each of $n$, $m$, and $p$ is a positive integer.) Let $g$ be such that $g(U) \subseteq V$. Let $x_0 \in U$, and suppose that $g$ is differentiable at $x_0$ and $f$ is differentiable at $g(x_0)$.
In the intuitive argument, we stated that if $v$ is "small", then $g(x_0 + v) \approx g(x_0) + w$, where $w = Dg(x_0)\,v$. In this proof, we'll fix that statement up and make it rigorous. What we can say is, if $x_0 + v \in U$, then
$$g(x_0 + v) = g(x_0) + Dg(x_0)\,v + E_1(v),$$
where $E_1$ is a function which has the property that $\frac{\|E_1(v)\|}{\|v\|} \to 0$ as $v \to 0$.
In the intuitive argument, we said that $f(g(x_0) + w) \approx f(g(x_0)) + Df(g(x_0))\,w$. In this proof, we'll make that rigorous by saying
$$f(g(x_0) + w) = f(g(x_0)) + Df(g(x_0))\,w + E_2(w),$$
where $E_2$ has the property that $\frac{\|E_2(w)\|}{\|w\|} \to 0$ as $w \to 0$.
Putting these together (substituting $w = Dg(x_0)\,v + E_1(v)$ into the previous equation), we find that
$$f(g(x_0 + v)) = f(g(x_0)) + Df(g(x_0))\,Dg(x_0)\,v + E_3(v),$$
where we have taken that messy error term and called it $E_3(v)$; explicitly,
$$E_3(v) = Df(g(x_0))\,E_1(v) + E_2\big(Dg(x_0)\,v + E_1(v)\big).$$
Now, we need to show that $\frac{\|E_3(v)\|}{\|v\|} \to 0$ as $v \to 0$, in order to prove that $f \circ g$ is differentiable at $x_0$ and that $D(f \circ g)(x_0) = Df(g(x_0))\,Dg(x_0)$.
To finish off the proof, we need to look at $\frac{\|E_3(v)\|}{\|v\|}$ and "play around with it", so to speak. The conclusion can be reached with the help of the following fact: If $A$ is an $m$ by $n$ matrix, then there exists a number $c \geq 0$ such that $\|A\,v\| \le c\,\|v\|$ for all $v \in \mathbb{R}^n$.
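One way to see this fact (the particular constant exhibited here is just one choice that works): writing $A = (a_{ij})$ and applying the Cauchy-Schwarz inequality to each entry of $A\,v$,
$$\|A\,v\|^2 = \sum_{i=1}^{m}\Big(\sum_{j=1}^{n} a_{ij}\,v_j\Big)^2 \le \sum_{i=1}^{m}\Big(\sum_{j=1}^{n} a_{ij}^2\Big)\Big(\sum_{j=1}^{n} v_j^2\Big) = \Big(\sum_{i,j} a_{ij}^2\Big)\,\|v\|^2,$$
so $c = \Big(\sum_{i,j} a_{ij}^2\Big)^{1/2}$ works.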
Now,
$$\frac{\|E_3(v)\|}{\|v\|} \le \frac{\|Df(g(x_0))\,E_1(v)\|}{\|v\|} + \frac{\big\|E_2\big(Dg(x_0)\,v + E_1(v)\big)\big\|}{\|v\|}$$
by the triangle inequality.
We'll call the first term on the right here the "first error term" and the second term on the right the "second error term." If we can show that the "first error term" and the "second error term" each approach $0$ as $v \to 0$, then we'll be done.
For the "first error term", the fact above gives
$$\frac{\|Df(g(x_0))\,E_1(v)\|}{\|v\|} \le \|Df(g(x_0))\|\,\frac{\|E_1(v)\|}{\|v\|},$$
which approaches $0$ as $v \to 0$. So the "first error term" approaches $0$. That's good. (Here $\|Df(g(x_0))\|$ is the operator norm of the matrix $Df(g(x_0))$, the smallest constant $c$ that works in the fact above.)
Consider the "second error term", . On top we have the norm of
with a certain (slightly complicated) input. We know that
is supposed to be small, as long as its input is small. In fact, we know more than that. If you take
, and divide it by the norm of its input, then that quotient is also supposed to be small, as long as the input of
is small. This suggests an idea: divide by the norm of the input of
, and look at what we get. But to make up for the fact that we are dividing by the norm of the input of
, we will also have to multiply by the norm of the input of
.
The first term on the right should approach , and the second term on the right hopefully at least remains bounded, as
.
This idea is promising, but there is a problem with it. When we divide by the norm of the input of $E_2$, we may be dividing by $0$. The following trick resolves this difficulty.
We introduce a function $\eta$ such that $\eta(w)$ is equal to $\frac{\|E_2(w)\|}{\|w\|}$ if $w \neq 0$, and $\eta(w)$ is $0$ if $w = 0$. Then $\|E_2(w)\| = \eta(w)\,\|w\|$ for all $w$ (this holds at $w = 0$ as well, because $E_2(0) = 0$), and $\eta(w) \to 0$ as $w \to 0$.
Using $\eta$, the "second error term" can be written as
$$\frac{\big\|E_2\big(Dg(x_0)\,v + E_1(v)\big)\big\|}{\|v\|} = \eta\big(Dg(x_0)\,v + E_1(v)\big)\cdot\frac{\|Dg(x_0)\,v + E_1(v)\|}{\|v\|}.$$
Certainly $Dg(x_0)\,v \to 0$ as $v \to 0$. Also, since $\frac{\|E_1(v)\|}{\|v\|} \to 0$ as $v \to 0$, we know that $E_1(v) \to 0$ as $v \to 0$. So $Dg(x_0)\,v + E_1(v) \to 0$ as $v \to 0$, which means that $\eta\big(Dg(x_0)\,v + E_1(v)\big) \to 0$ as $v \to 0$.
Meanwhile,
$$\frac{\|Dg(x_0)\,v + E_1(v)\|}{\|v\|} \le \frac{\|Dg(x_0)\,v\|}{\|v\|} + \frac{\|E_1(v)\|}{\|v\|} \le \|Dg(x_0)\| + \frac{\|E_1(v)\|}{\|v\|}.$$
This remains bounded as $v \to 0$.
We have shown that the "second error term" is a product of one term that approaches $0$ and another term that remains bounded as $v \to 0$. Therefore, the "second error term" approaches $0$ as $v \to 0$. Since both error terms approach $0$, so does $\frac{\|E_3(v)\|}{\|v\|}$, which completes the proof.