
@(@\newcommand{\W}[1]{ \; #1 \; } \newcommand{\R}[1]{ {\rm #1} } \newcommand{\B}[1]{ {\bf #1} } \newcommand{\D}[2]{ \frac{\partial #1}{\partial #2} } \newcommand{\DD}[3]{ \frac{\partial^2 #1}{\partial #2 \partial #3} } \newcommand{\Dpow}[2]{ \frac{\partial^{#1}}{\partial {#2}^{#1}} } \newcommand{\dpow}[2]{ \frac{ {\rm d}^{#1}}{{\rm d}\, {#2}^{#1}} }@)@
AD Theory for Cholesky Factorization

Reference
See section 3.6 of Sebastian F. Walter's Ph.D. thesis, Structured Higher-Order Algorithmic Differentiation in the Forward and Reverse Mode with Application in Optimum Experimental Design, Humboldt-Universität zu Berlin, 2011.

Notation

Cholesky Factor
We are given a positive definite symmetric matrix @(@ A \in \B{R}^{n \times n} @)@ and a Cholesky factorization @[@ A = L L^\R{T} @]@ where @(@ L \in \B{R}^{n \times n} @)@ is lower triangular.

Taylor Coefficient
The matrix @(@ A @)@ is a function of a scalar argument @(@ t @)@. For @(@ k = 0 , \ldots , K @)@, we use @(@ A_k @)@ for the corresponding Taylor coefficients; i.e., @[@ A(t) = o( t^K ) + \sum_{k = 0}^K A_k t^k @]@ where @(@ o( t^K ) / t^K \rightarrow 0 @)@ as @(@ t \rightarrow 0 @)@. We use a similar notation for @(@ L(t) @)@.

Lower Triangular Part
For a square matrix @(@ C @)@, @(@ \R{lower} (C) @)@ is the lower triangular part of @(@ C @)@ (including the diagonal), @(@ \R{diag} (C) @)@ is the diagonal matrix with the same diagonal as @(@ C @)@, and @[@ \R{low} ( C ) = \R{lower} (C) - \frac{1}{2} \R{diag} (C) @]@
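
These operators are easy to state in code. The following is a minimal NumPy sketch (the helper names lower_tri, diag_mat, and low are assumptions of this example, chosen to match the text); the later sketches in this section reuse low:

    import numpy as np

    def lower_tri(C):
        # lower(C): the lower triangular part of C, including the diagonal
        return np.tril(C)

    def diag_mat(C):
        # diag(C): the diagonal matrix with the same diagonal as C
        return np.diag(np.diag(C))

    def low(C):
        # low(C) = lower(C) - diag(C) / 2
        return lower_tri(C) - 0.5 * diag_mat(C)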

Forward Mode
For Taylor coefficient order @(@ k = 0 , \ldots , K @)@, the coefficients @(@ A_k \in \B{R}^{n \times n} @)@ are symmetric and satisfy the equation @[@ A_k = \sum_{\ell=0}^k L_\ell L_{k-\ell}^\R{T} @]@ In the case where @(@ k = 0 @)@, this reduces to @[@ A_0 = L_0 L_0^\R{T} @]@ and the value of @(@ L_0 @)@ can be computed using the Cholesky factorization. In the case where @(@ k > 0 @)@, @[@ A_k = L_k L_0^\R{T} + L_0 L_k^\R{T} + B_k @]@ where @[@ B_k = \sum_{\ell=1}^{k-1} L_\ell L_{k-\ell}^\R{T} @]@ Note that @(@ B_k @)@ is defined in terms of Taylor coefficients of @(@ L(t) @)@ that have order less than @(@ k @)@. We also note that @[@ L_0^{-1} ( A_k - B_k ) L_0^\R{-T} = L_0^{-1} L_k + L_k^\R{T} L_0^\R{-T} @]@ The first matrix on the right hand side is lower triangular, the second is upper triangular, and their diagonals are equal. It follows that @[@ L_0^{-1} L_k = \R{low} [ L_0^{-1} ( A_k - B_k ) L_0^\R{-T} ] @]@ @[@ L_k = L_0 \R{low} [ L_0^{-1} ( A_k - B_k ) L_0^\R{-T} ] @]@ This expresses @(@ L_k @)@ in terms of the Taylor coefficients of @(@ A(t) @)@ and the lower order coefficients of @(@ L(t) @)@.
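
The recurrence above translates directly into code. Below is a minimal NumPy sketch under the assumptions of the previous example (cholesky_forward is a hypothetical name, not a CppAD routine); it maps the coefficients @(@ A_0 , \ldots , A_K @)@ to @(@ L_0 , \ldots , L_K @)@ using only one factorization and triangular solves with @(@ L_0 @)@:

    from scipy.linalg import solve_triangular

    def cholesky_forward(A_coef):
        # A_coef is the list A_0, ..., A_K of symmetric Taylor coefficients
        L0 = np.linalg.cholesky(A_coef[0])           # A_0 = L_0 L_0^T
        L_coef = [L0]
        for k in range(1, len(A_coef)):
            # B_k = sum_{ell=1}^{k-1} L_ell L_{k-ell}^T (empty sum when k = 1)
            B_k = sum(
                (L_coef[l] @ L_coef[k - l].T for l in range(1, k)),
                np.zeros_like(L0),
            )
            # C = L_0^{-1} (A_k - B_k) L_0^{-T}
            tmp = solve_triangular(L0, A_coef[k] - B_k, lower=True)
            C = solve_triangular(L0, tmp.T, lower=True).T
            # L_k = L_0 low(C)
            L_coef.append(L0 @ low(C))
        return L_coef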

Lemma 1
We use the notation @(@ \dot{C} @)@ for the derivative of a matrix valued function @(@ C(s) @)@ with respect to a scalar argument @(@ s @)@. We use the notation @(@ \bar{S} @)@ and @(@ \bar{L} @)@ for the partial derivatives of a scalar valued function @(@ \bar{F}( S, L) @)@ with respect to a symmetric matrix @(@ S @)@ and a lower triangular matrix @(@ L @)@. Define the scalar valued function @[@ \hat{F}( S ) = \bar{F} [ S , \hat{L} (S) ] @]@ We use @(@ \hat{S} @)@ for the total derivative of @(@ \hat{F} @)@ with respect to @(@ S @)@. Suppose that @(@ \hat{L} ( S ) @)@ is such that @[@ \dot{L} = L_0 \R{low} ( L_0^{-1} \dot{S} L_0^\R{-T} ) @]@ for any @(@ S(s) @)@. It follows that @[@ \hat{S} = \bar{S} + \frac{1}{2} ( M + M^\R{T} ) @]@ where @[@ M = L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} )^\R{T} L_0^{-1} @]@

Proof
@[@ \partial_s \hat{F} [ S(s) , L(s) ] = \R{tr} ( \bar{S}^\R{T} \dot{S} ) + \R{tr} ( \bar{L}^\R{T} \dot{L} ) @]@
@[@ \R{tr} ( \bar{L}^\R{T} \dot{L} ) = \R{tr} [ \bar{L}^\R{T} L_0 \R{low} ( L_0^{-1} \dot{S} L_0^\R{-T} ) ] @]@
@[@ = \R{tr} [ \R{low} ( L_0^{-1} \dot{S} L_0^\R{-T} )^\R{T} L_0^\R{T} \bar{L} ] @]@
@[@ = \R{tr} [ L_0^{-1} \dot{S} L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} ) ] @]@
@[@ = \R{tr} [ L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} ) L_0^{-1} \dot{S} ] @]@
where the third equality uses the identity @(@ \R{tr} [ \R{low}(C)^\R{T} D ] = \R{tr} [ C \, \R{low}(D) ] @)@, valid for symmetric @(@ C @)@, with @(@ C = L_0^{-1} \dot{S} L_0^\R{-T} @)@. Hence @[@ \partial_s \hat{F} [ S(s) , L(s) ] = \R{tr} ( \bar{S}^\R{T} \dot{S} ) + \R{tr} [ L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} ) L_0^{-1} \dot{S} ] @]@ We now consider the @(@ (i, j) @)@ component function, for a symmetric matrix @(@ S(s) @)@, defined by @[@ S_{k, \ell} (s) = \left\{ \begin{array}{ll} s & \R{if} \; k = i \; \R{and} \; \ell = j \\ s & \R{if} \; k = j \; \R{and} \; \ell = i \\ 0 & \R{otherwise} \end{array} \right\} @]@ For this choice @(@ \R{tr} ( \bar{S}^\R{T} \dot{S} ) = \bar{S}_{i,j} + \bar{S}_{j,i} @)@ and the second trace above equals @(@ M_{i,j} + M_{j,i} @)@, so @[@ \hat{S}_{i,j} + \hat{S}_{j,i} = \bar{S}_{i,j} + \bar{S}_{j,i} + M_{i,j} + M_{j,i} @]@ This shows that the formula in the lemma is correct for @(@ \hat{S}_{i,j} @)@ and @(@ \hat{S}_{j,i} @)@. This completes the proof because the component @(@ (i, j) @)@ was arbitrary.
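
As a numerical sanity check (not part of the original argument), one can take @(@ \hat{L}(S) @)@ to be the Cholesky factor of @(@ S @)@, which satisfies the hypothesis of the lemma (see the case @(@ k = 0 @)@ below), use a linear @(@ \bar{F} @)@ so that @(@ \bar{S} @)@ and @(@ \bar{L} @)@ are constant, and compare the formula for @(@ \hat{S} @)@ with a central difference along a symmetric direction. A sketch, reusing low and solve_triangular from the earlier examples (all other names here are hypothetical):

    rng = np.random.default_rng(0)
    n = 4
    S0 = rng.standard_normal((n, n))
    S0 = S0 @ S0.T + n * np.eye(n)                 # positive definite
    S_bar = rng.standard_normal((n, n))            # partial w.r.t. S
    L_bar = np.tril(rng.standard_normal((n, n)))   # partial w.r.t. L
    V = rng.standard_normal((n, n))
    V = V + V.T                                    # symmetric direction

    def F_hat(S):
        # linear F_bar(S, L) evaluated at L = Cholesky factor of S
        return np.sum(S_bar * S) + np.sum(L_bar * np.linalg.cholesky(S))

    L0 = np.linalg.cholesky(S0)
    # M = L_0^{-T} low(L_0^T L_bar)^T L_0^{-1}
    Y = solve_triangular(L0.T, low(L0.T @ L_bar).T, lower=False)
    M = solve_triangular(L0.T, Y.T, lower=False).T
    S_hat = S_bar + 0.5 * (M + M.T)

    eps = 1e-6
    fd = (F_hat(S0 + eps * V) - F_hat(S0 - eps * V)) / (2 * eps)
    assert abs(np.sum(S_hat * V) - fd) < 1e-5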

Lemma 2
We use the same assumptions as in Lemma 1 except that the matrix @(@ S @)@ is lower triangular (instead of symmetric). It follows that @[@ \hat{S} = \bar{S} + \R{lower}(M) @]@ where @[@ M = L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} ) L_0^{-1} @]@ The proof of this lemma is identical to the proof of Lemma 1, except that the component function, for @(@ i \geq j @)@, is defined by @[@ S_{k, \ell} (s) = \left\{ \begin{array}{ll} s & \R{if} \; k = i \; \R{and} \; \ell = j \\ 0 & \R{otherwise} \end{array} \right\} @]@
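
Lemma 2 can be checked the same way. Continuing the sketch above, the map @(@ \hat{L}(S) = L_0 \R{low} ( L_0^{-1} S L_0^\R{-T} ) @)@ is linear, so it satisfies the hypothesis on @(@ \dot{L} @)@ exactly, and the direction @(@ V @)@ is now lower triangular (again, every name here is hypothetical):

    L0 = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)

    def L_hat(S):
        # L_hat(S) = L_0 low(L_0^{-1} S L_0^{-T})
        tmp = solve_triangular(L0, S, lower=True)
        return L0 @ low(solve_triangular(L0, tmp.T, lower=True).T)

    def F_hat_2(S):
        return np.sum(S_bar * S) + np.sum(L_bar * L_hat(S))

    V = np.tril(rng.standard_normal((n, n)))       # lower triangular direction
    # M = L_0^{-T} low(L_0^T L_bar) L_0^{-1}
    Y = solve_triangular(L0.T, low(L0.T @ L_bar), lower=False)
    M = solve_triangular(L0.T, Y.T, lower=False).T
    S_hat = S_bar + np.tril(M)                     # lemma: S_bar + lower(M)

    fd = (F_hat_2(eps * V) - F_hat_2(-eps * V)) / (2 * eps)
    assert abs(np.sum(S_hat * V) - fd) < 1e-5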

Reverse Mode

Case k = 0
For the case @(@ k = 0 @)@, @[@ \dot{A}_0 = \dot{L}_0 L_0^\R{T} + L_0 \dot{L}_0^\R{T} @]@ @[@ L_0^{-1} \dot{A}_0 L_0^\R{-T} = L_0^{-1} \dot{L}_0 + \dot{L}_0^\R{T} L_0^\R{-T} @]@ @[@ \R{low} ( L_0^{-1} \dot{A}_0 L_0^\R{-T} ) = L_0^{-1} \dot{L}_0 @]@ @[@ \dot{L}_0 = L_0 \R{low} ( L_0^{-1} \dot{A}_0 L_0^\R{-T} ) @]@ It follows from Lemma 1 that @[@ \bar{A}_0 \stackrel{+}{=} \frac{1}{2} ( M + M^\R{T} ) @]@ where @[@ M = L_0^\R{-T} \R{low} ( L_0^\R{T} \bar{L}_0 )^\R{T} L_0^{-1} @]@ Here @(@ \bar{A}_0 @)@ on the right hand side of @(@ \stackrel{+}{=} @)@ is the partial before, and on the left hand side after, @(@ L_0 @)@ is removed from the scalar function dependency.
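
A minimal sketch of this update, assuming the low helper and the solve_triangular import from the earlier examples (the function name reverse_case_0 is hypothetical):

    def reverse_case_0(L0, L0_bar, A0_bar):
        # M = L_0^{-T} low(L_0^T L0_bar)^T L_0^{-1}
        Y = solve_triangular(L0.T, low(L0.T @ L0_bar).T, lower=False)
        M = solve_triangular(L0.T, Y.T, lower=False).T
        # A0_bar += (M + M^T) / 2
        return A0_bar + 0.5 * (M + M.T)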

Case k > 0
In the case where @(@ k > 0 @)@, @[@ A_k = L_k L_0^\R{T} + L_0 L_k^\R{T} + B_k @]@ where @(@ B_k @)@ is defined in terms of Taylor coefficients of @(@ L(t) @)@ that have order less than @(@ k @)@. It follows that @[@ \dot{L}_k L_0^\R{T} + L_0 \dot{L}_k^\R{T} = \dot{A}_k - \dot{B}_k - \dot{L}_0 L_k^\R{T} - L_k \dot{L}_0^\R{T} @]@ @[@ L_0^{-1} \dot{L}_k + \dot{L}_k^\R{T} L_0^\R{-T} = L_0^{-1} ( \dot{A}_k - \dot{B}_k - \dot{L}_0 L_k^\R{T} - L_k \dot{L}_0^\R{T} ) L_0^\R{-T} @]@ @[@ L_0^{-1} \dot{L}_k = \R{low} [ L_0^{-1} ( \dot{A}_k - \dot{B}_k - \dot{L}_0 L_k^\R{T} - L_k \dot{L}_0^\R{T} ) L_0^\R{-T} ] @]@ @[@ \dot{L}_k = L_0 \R{low} [ L_0^{-1} ( \dot{A}_k - \dot{B}_k - \dot{L}_0 L_k^\R{T} - L_k \dot{L}_0^\R{T} ) L_0^\R{-T} ] @]@ The matrix @(@ A_k @)@ is symmetric, so it follows from Lemma 1 that @[@ \bar{A}_k \stackrel{+}{=} \frac{1}{2} ( M_k + M_k^\R{T} ) @]@ where @[@ M_k = L_0^\R{-T} \R{low} ( L_0^\R{T} \bar{L}_k )^\R{T} L_0^{-1} @]@ The matrix @(@ B_k @)@ is also symmetric, and it enters the expression for @(@ \dot{L}_k @)@ only through the difference @(@ \dot{A}_k - \dot{B}_k @)@; hence @[@ \bar{B}_k = - \; \frac{1}{2} ( M_k + M_k^\R{T} ) @]@ We define the symmetric matrix @(@ C_k (s) @)@ by @[@ \dot{C}_k = \dot{L}_0 L_k^\R{T} + L_k \dot{L}_0^\R{T} @]@ and remove the dependency on @(@ C_k @)@ with @[@ \R{tr}( \bar{C}_k^\R{T} \dot{C}_k ) = \R{tr}( \bar{B}_k^\R{T} \dot{C}_k ) = \R{tr}( \bar{B}_k^\R{T} \dot{L}_0 L_k^\R{T} ) + \R{tr}( \bar{B}_k^\R{T} L_k \dot{L}_0^\R{T} ) @]@ @[@ = \R{tr}( L_k^\R{T} \bar{B}_k^\R{T} \dot{L}_0 ) + \R{tr}( L_k^\R{T} \bar{B}_k \dot{L}_0 ) @]@ @[@ = \R{tr}[ L_k^\R{T} ( \bar{B}_k + \bar{B}_k^\R{T} ) \dot{L}_0 ] @]@ Since @(@ \dot{L}_0 @)@ is lower triangular, removing @(@ C_k @)@ from the dependency results in the following update to @(@ \bar{L}_0 @)@: @[@ \bar{L}_0 \stackrel{+}{=} \R{lower} [ ( \bar{B}_k + \bar{B}_k^\R{T} ) L_k ] @]@ which, because @(@ \bar{B}_k @)@ is symmetric, is the same as @[@ \bar{L}_0 \stackrel{+}{=} 2 \; \R{lower} [ \bar{B}_k L_k ] @]@ We still need to remove @(@ B_k @)@ from the dependency. It follows from its definition that @[@ \dot{B}_k = \sum_{\ell=1}^{k-1} ( \dot{L}_\ell L_{k-\ell}^\R{T} + L_\ell \dot{L}_{k-\ell}^\R{T} ) @]@ @[@ \R{tr}( \bar{B}_k^\R{T} \dot{B}_k ) = \sum_{\ell=1}^{k-1} \R{tr}( \bar{B}_k^\R{T} \dot{L}_\ell L_{k-\ell}^\R{T} ) + \R{tr}( \bar{B}_k^\R{T} L_\ell \dot{L}_{k-\ell}^\R{T} ) @]@ @[@ = \sum_{\ell=1}^{k-1} \R{tr}( L_{k-\ell}^\R{T} \bar{B}_k^\R{T} \dot{L}_\ell ) + \sum_{\ell=1}^{k-1} \R{tr}( L_\ell^\R{T} \bar{B}_k \dot{L}_{k-\ell} ) @]@ We now use the fact that @(@ \bar{B}_k @)@ is symmetric to conclude @[@ \R{tr}( \bar{B}_k^\R{T} \dot{B}_k ) = 2 \sum_{\ell=1}^{k-1} \R{tr}( L_{k-\ell}^\R{T} \bar{B}_k^\R{T} \dot{L}_\ell ) @]@ Each of the @(@ \dot{L}_\ell @)@ matrices is lower triangular. Thus, removing @(@ B_k @)@ from the dependency results in the following update for @(@ \ell = 1 , \ldots , k-1 @)@: @[@ \bar{L}_\ell \stackrel{+}{=} 2 \; \R{lower}( \bar{B}_k L_{k-\ell} ) @]@
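
Collecting the updates above, the following sketch implements the step that removes @(@ L_k @)@ from the dependency, under the same assumptions as the earlier examples (reverse_case_k is a hypothetical name; Lbar and Abar are lists of partials, updated in place):

    def reverse_case_k(k, L_coef, Lbar, Abar):
        L0 = L_coef[0]
        # M_k = L_0^{-T} low(L_0^T Lbar_k)^T L_0^{-1}
        Y = solve_triangular(L0.T, low(L0.T @ Lbar[k]).T, lower=False)
        M_k = solve_triangular(L0.T, Y.T, lower=False).T
        sym = 0.5 * (M_k + M_k.T)
        Abar[k] += sym                    # Abar_k += (M_k + M_k^T) / 2
        Bbar_k = -sym                     # Bbar_k  = -(M_k + M_k^T) / 2
        # removing C_k: Lbar_0 += 2 lower(Bbar_k L_k)
        Lbar[0] += 2.0 * np.tril(Bbar_k @ L_coef[k])
        # removing B_k: Lbar_ell += 2 lower(Bbar_k L_{k - ell})
        for ell in range(1, k):
            Lbar[ell] += 2.0 * np.tril(Bbar_k @ L_coef[k - ell])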
Input File: omh/appendix/theory/cholesky.omh