Adaptive high-order splitting schemes for large-scale differential Riccati equations

We consider high-order splitting schemes for large-scale differential Riccati equations. Such equations arise in many different areas and are especially important within the field of optimal control. In the large-scale case, it is critical to employ structural properties of the matrix-valued solution, or the computational cost and storage requirements become infeasible. Our main contribution is therefore to formulate these high-order splitting schemes in an efficient way by utilizing a low-rank factorization. Previous results indicated that this was impossible for methods of order higher than 2, but our new approach overcomes these difficulties. In addition, we demonstrate that the proposed methods contain natural embedded error estimates. These may be used, e.g., for time step adaptivity, and our numerical experiments in this direction show promising results.


Introduction
We consider differential Riccati equations (DREs) of the form Ṗ = A T P + P A + Q − P SP, P (0) = P 0 , where the solution P (t) is matrix-valued and A, Q and S are given matrices.Such equations arise in many different areas, e.g. in optimal/robust control, optimal filtering, spectral factorizations, H ∞ -control, differential games, etc. [1,4,22,27].
A typical application is a linear quadratic regulator (LQR) problem, where one seeks to control the output y = Cx given the state equation ẋ = Ax + Bu by varying the input u.In the case of a finite time cost function, where R x and R u are given matrices, it is well known that the optimal input u * is given in state feedback form.In particular, u * (t) = −R −1 u B T P (T −t)x(t), where P is the solution to the DRE (1) with the specific matrices Q = C T R x C and S = BR −1 u B T .We note that the situation M ẋ = Ax + Bu can be handled in a straightforward way without explicitly inverting the mass matrix M , see e.g.[34].
In this paper, we are interested in the large-scale setting.Even if A ∈ R N ×N is sparse, the solution P is typically dense.Hence, a "large" dimension N is here considerably smaller than the number of components which would be considered large for a vector-valued ODE.A naive method that works well for the small-scale case would run into storage problems already for N = 10000 and be computationally expensive long before that.Recently, many non-naive methods have been proposed for DREs and similar problems, e.g.matrix-valued BDF and Rosenbrock methods [7,6], splitting schemes [34,26] and Krylov projection methods [23,14].The latter are a generalization of the Krylov approach to algebraic Riccati equations and Lyapunov equations [13,20,31].Other methods for such equations, like invariant subspace techniques [2,5,25], typically also generalize to the DRE case by using time-stepping methods of either one-or multi-step type.Further useful references may be found in the recent surveys [9,30].In general, all these methods rely on the fact that the dense solution possesses certain structure.In particular, the solution is positive semidefinite, and in all practical applications it also has low rank.This allows us to factorize P = ZZ T where Z is a matrix with many fewer columns than P .A main idea in all the algorithms listed above is then to only do computations on the factor Z and never actually form the product ZZ T .
Further, we are interested in different types of splitting schemes, since the equation has a natural division into two parts: Ṗ = FP + GP, where FP = A T P + P A + Q and GP = −P SP.
While the full problem is rather difficult, the subproblems Ṗ = FP, P (0) = P 0 , and (2) Ṗ = GP, P (0) = P 0 , are separately much easier and cheaper to solve.In fact, as demonstrated in [34] there exist closed-form expressions for the solutions to both subproblems that are amenable to low-rank computations.In the following, we will denote the solution operator to the full problem by T F +G and to the subproblems by T F and T G ; thus for example the solution to (2) at time t is given by T F (t)P 0 .
To introduce the simplest splitting schemes and our notation, we first discretize the time interval [0, T ] by n equidistant time steps of size h and set t j = jh.Then the approximation to T F +G (t j )P 0 by the Lie splitting scheme is given by S Lie (h) j P 0 , where That is, we switch back and forth between the affine subproblem and the nonlinear subproblem.A more accurate approximation is given by the Strang splitting scheme, defined by the time stepping operator In both cases, we may interchange the order of the F and G operators.For a more thorough introduction to splitting schemes in general, we refer to [21].
It can be shown as in [21] that the Lie splitting is first-order convergent and the Strang splitting second-order convergent, i.e. the errors satisfy In general one can also consider higher-order schemes, but so far this has not been done for DREs.This is due to the fact that multiplicative splitting schemes of the form ) require that some coefficients α j , β j are either negative or complex [10,19], which is not compatible with the low-rank implementation.
The first main contribution of this work is therefore to demonstrate that a new type of additive splitting schemes introduced in [12] allows for arbitrary high order schemes to be implemented efficiently in a low-rank DRE setting.These schemes are of the form and thus only utilize positive step sizes.A minor drawback is that the approximations are no longer guaranteed to be positive semi-definite, since the coefficients γ j may be negative.This prohibits the use of a ZZ T -factorization, and we therefore outline the changes necessary to instead consider a so-called LDL T -factorization (cf.[24]).
The second main contribution lies in the observation that these splitting schemes contain natural lower-order embedded methods, which allows for cheap and easy error estimation.We utilize this to construct high-order splitting schemes with adaptive time-stepping, i.e. the time steps h j = t j+1 − t j are no longer equidistant but chosen as large as possible while keeping the error below a given tolerance.Modifying the step size can greatly increase the efficiency, but only if the computational cost of changing the step size is small.We therefore outline which quantities can be precomputed or recomputed cheaply, and describe efficient updating strategies for the quantities that necessarily change with each step.
A brief outline of the paper is as follows.In Section 2 we state the basic assumptions on the given data and review the use of the ZZ T -and LDL Tfactorizations for low-order splitting schemes.The issues that arise when considering higher-order multiplicative splitting schemes are outlined in Section 3, wherein we also present the new type of additive schemes that eliminate these issues.Error estimates and different kinds of time step adaptivity are discussed in Section 4 and an algorithm summarizing the complete implementation is presented.In Section 5, several numerical experiments demonstrate the validity of the implementation, the efficiency of the methods and the use of adaptive time-stepping.Finally, we collect some conclusions in Section 6.

Low-rank factorizations
The first assumption we make on the problem data is the following: Assumption 1 The matrices A, Q, S and the initial condition P 0 all belong to R N ×N .In addition, Q, S and P 0 are symmetric and positive semi-definite.This implies the existence and uniqueness of a solution P to the DRE (1) such that P (t) is also symmetric and positive semi-definite for all t ≥ 0 [1, Theorem 4.1.6].An important example of when Assumption 1 is satisfied is the LQR setting from the introduction, with R x and R u both symmetric positive definite.Secondly, we assume that the solution has the low-rank property: To the author's knowledge there are currently no known useful criteria on the data in the DRE setting which guarantee that Assumption 2 is fulfilled.However, such low-rank structure is observed in all practical applications, e.g. in LQR problems where B ∈ R N ×m B and C ∈ R m C ×N with m B , m C N .Recently, some results in this direction has been established for algebraic Riccati equations, i.e. the stationary version of (1), in [5].These are generalizations of results for Lyapunov equations [3,33] and it seems likely that further generalizations to the DRE setting could be made.
Assumptions 1 and 2 imply that we can low-rank factorize P (t) = Z(t)Z(t) T and Q = qq T with Z(t) ∈ R N ×r and q ∈ R N ×r Q .Similarly, as demonstrated in [34] we can low-rank factorize also the approximations S Lie (h)P 0 and S Strang (h)P 0 .This is based on factorizing the exact solutions to the subproblems (2)-( 3), for which we have the closed-form expressions T F (h)P 0 = e hA T P 0 e hA + h 0 e sA T Qe sA ds and The latter expression quickly yields an explicit factorization while the former requires that the integral is approximated by a quadrature formula, whereafter column compression is applied.
Considering instead a so-called LDL T -factorization where L(t) ∈ R N ×r and D(t) ∈ R r×r is beneficial for many schemes [24], because it can decrease the amount of computations.This is true also for splitting schemes.Assuming that P 0 = LDL T and considering first the nonlinear subproblem, we have by use of a simplified version of the Woodbury matrix inversion formula [17].Thus L D LT is a low-rank factorization of the solution to the nonlinear subproblem, where L = L and D = (I + hDL T SL) −1 D. In contrast to the ZZ T situation, there exist matrices D and L such that I +hDL T SL is not invertible for all h.However, it certainly is for all h < 1/ρ(DL T SL), where ρ denotes the spectral radius, and therefore the step size can always be chosen such that D is well defined.We note that even for large time steps, this theoretical issue has not yet been observed in practice.We also note that this formulation is cheaper to compute than the corresponding ZZ T -factorization, since it is no longer necessary to compute a Cholesky factorization of the inverse.
Considering next the affine subproblem and assuming that where L 1 = e hA T L, L(s) = e sA T L Q and (s k , w k ) are the n Q nodes and weights of a quadrature formula.We choose the parameters such that the error in this approximation is negligible with respect to the splitting error; for a splitting scheme of order p we typically choose a quadrature formula of order p + 1.For efficiency, the structure of A (sparsity, bandedness, etc.) should be taken into account when computing the terms L 1 and L(s).In our tests, we simply use a 5th-order implicit Runge-Kutta method with a crude error estimate based on halving the internal step size.It seems likely, however, that an approach based on e.g.Krylov subspaces or the Leja point method (see e.g.[11]) would be even more efficient, especially if subspaces from previous steps can be (partially) reused.We note that these terms do not need to be computed to full precision, but like for the integral term their errors should be negligible in comparison to the splitting error.Then, similarly to the ZZ T -case, setting means that L D LT is a low-rank approximation of the solution to the affine subproblem.After forming L and D, column-compression should be applied to eliminate any unnecessary columns.We refer to [24] for an efficient way to do this.
3 High-order splitting schemes Let us now consider low-rank factorization of higher-order multiplicative splitting schemes like the Lie and Strang splitting schemes.Let with s and the coefficients {α k } s k=1 , {β k } s k=1 chosen such that S(h) is a splitting scheme of order p ≥ 3. Then the coefficients must include either negative or complex values [10,19].In the first case, computing e γhA T P 0 for such a negative coefficient γ corresponds to taking a negative time-step for the system ẋ = A T x.If A e.g.corresponds to a discretization of the Laplacian (a common application) we are thus solving the heat equation backwards in time, which is ill-posed.It is therefore only possible to consider the class of problems where A corresponds to the discretization of an analytic operator, but even in this case the evaluation of (I + hZ T SZ) −1 or (I + hDL T SL) −1 tends to yield step size restrictions.We therefore do not think that this is a worthwhile direction of research to pursue.
In the case of a ZZ T -factorization, a complex coefficient γ destroys the structure of I + γhZ T SZ and we can only factorize it in very special cases.Considering instead an LDL T -factorization leads to problems with complex arithmetic: If L and D are real, the approximation L D LT to T G (γh)LDL T will have L real but D complex-valued.Such input to the affine subproblem will then lead to both L and D being complex-valued.Once this is the case, we not only have to do computations fully in complex arithmetic but we also have issues with column compression since the complex values do not match the "transpose"-formulation. Switching instead to a complex LDL H -factorization results in similar issues.Like negative coefficients, using complex coefficients thus does not seem worthwhile.
However, the necessity of negative or complex coefficients only holds for the type of multiplicative splitting schemes mentioned above.Recently, a new type of additive splitting schemes was introduced in [12].These are either of the asymmetric type which are of order s if the coefficients γ 1 , . . ., γ s are chosen appropriately, and the symmetric type which are of order 2s.(We only consider the case of minimal number of stages here.One might of course add extra stages in order to improve the local error structure, but given the form of the schemes it would then make more sense to instead increase the order.)In both cases, the roles of F and G may be interchanged.
At first sight these methods may look computationally expensive.However, (as noted in [12]) if we have the possibility to work in parallel then taking one step with either method is only as expensive as taking s Lie splitting steps.More important is that they only require real, positive step sizes.This eliminates all the issues listed above, and allows us to consider splitting schemes for DREs of arbitrarily high order.
Because the coefficients {γ k } s k=1 may include negative values, using a ZZ Tfactorization to formulate these methods is impossible.However, instead using an LDL T -factorization is not only possible but rather straightforward after the preliminary work in the previous section.The only additional computational work is a column compression step after forming the linear combinations.In an optimized code, most of this work could additionally be done while waiting for the slowest processor that takes s steps to finish.Using a higher-order method also requires us to compute terms of the form e γhA T L more accurately (unless we also increase the step size h and thereby the error), and to use a higherorder quadrature formula to approximate the integral term in T F (h)P 0 .
We also note here that while the LDL T -factorization does not guarantee that the approximations are positive semi-definite, in practice this still seems to hold.This is likely due to the fact that the approximations are very close to the solution of the full problem, which is guaranteed to be positive semidefinite.

Time adaptivity
An additional major feature of the schemes (4) and ( 5) is the existence of natural embedded lower-order methods.This seems to have been overlooked by [12].For example, the scheme 2 is of order 2, and it obviously contains the first-order method T F (h)T G (h).This holds true for all the schemes, symmetric and asymmetric.In general, neglecting the last terms of the sum and using other coefficients {γ k } yields embedded methods of order p − 1 in the asymmetric case, and of order 2p − 2 in the symmetric case with p = 2, 3, . . ., s.Since these lower-order approximations are simply linear combinations of previously computed terms they are cheap to compute.In our case, the only extra computational effort is a column compression step.The embedded methods yield natural error estimates.For example, we have where Φ s and Φ s−1 are the principal error functions of the two methods.Thus the difference S s asym (h)P 0 − S s−1 asym (h)P 0 is a local error estimate of order s − 1.In the symmetric case we instead get an error estimate of order 2s − 2.
These error estimators may be used to control the size of h, with the aim of keeping the local error below a certain tolerance while minimizing the computational effort.There are many different kinds of such controllers, see e.g.[16,32].As an example, we choose a simple PI-controller which typically provides a smoother step size sequence than the commonly used deadbeat I-controller.It is given by [16,32] where h n is the n:th time step, e n is the error estimate at t n , TOL is the desired accuracy (tolerance) and is a safety factor .The parameters (k I , k P ) determine the characteristics of the controller such as responsiveness and robustness.In our numerical experiments we set = 0.9 and (k I , k P ) = (0.2/p, 0.2/p), where p is the order of the error estimate.These are similar to the values recommended for explicit Runge-Kutta methods when using the error per unit step strategy [15].Clearly, these are not optimal values for splitting schemes, but an in-depth investigation for a variety of typical problems is out of the scope of this paper.
The evaluation of T G (h)P 0 requires the same effort whether the step size is varying or not.Evaluating T F (h)P 0 , on the other hand, requires that the approximation of the integral term Qe sA ds is recomputed in every step, while it previously could be precomputed.We note that typically the rank of Q is sufficiently small in relation to the rank of the solution approximation that this extra computational cost is small and easily outweighed by the benefits of adaptivity.Nevertheless, we suggest here a strategy to decrease this cost further.We need to compute T where L(s) = e sA T L Q for given nodes s k and weights w k .The main idea now is to change only a few nodes (and thereby also the weights) in each step, such that the interval [0, h n ] is covered as evenly as possibly.For a quadrature rule of order p we need n Q = p + 1 nodes if we do not place the nodes optimally, in contrast to e.g.Gaussian quadrature which would need roughly half as many.However, by storing the computed matrices L(s k ) and keeping most of the nodes unchanged we will still decrease the overall computation cost.Thus, we define the initial nodes by s k = kh1 p for k = 0, . . ., p and compute the initial weights from with n = 1.Then, to update these nodes and weights given a new h n , we follow the procedure outlined in algorithmic form in Algorithm 1.(In Algorithm 1 and in the following, blkdiag denotes the block diagonal operator, i.e. it places its block arguments on the diagonal of an otherwise zero matrix.)Essentially, we add a node at h n if the interval increases, and then remove the node which makes the remaining sequence as close to equidistributed as possible.Similarly, if the interval decreases, we iteratively relocate the nodes that are outside the new interval to the midpoints of the largest gaps between the nodes in the new interval.In order to ensure that the nodes s k cover the interval [0, h n ] well, we recompute the whole sequence if the step size changes by more than 25%.We note that we could of course, in theory, store all the previously computed L(s k ) and use increasingly high-order quadrature formulae.However, this would yield a major increase in the storage requirements while having little effect on the overall accuracy.
Remark 1 We note that in e.g. a real-world optimal control problem, it is frequently the case that the state of the system is sampled at regular, predetermined intervals.The feedback control thus needs the solution of the corresponding DRE at these specific times.This suggests that a constant, matching step size should be employed, or that the adaptive step size is restricted.Neither approach is desirable; the former is inefficient compared to the adaptive approach, and the latter destroys the smooth time step sequence the PI controller is intended to provide.However, assuming that the exact solution is sufficiently regular, we may still use the more efficient adaptive time stepping and simply interpolate the computed approximations to find the values at the desired times.For example, assume that we use piecewise linear interpolation.Then on the interval [t n−1 , t n ], the error between the interpolant P I and the Set ŝk = s k , k = 0, . . ., p, and ŝp+1 = hn 6.
Remove the node ŝj such that d j = min if the new node ŝp+1 was removed then 8.
Set L(ŝ k ) to the matrices L(s k ) and L(hn) that match the nodes ŝk 12.
Find i such that d i = max Add a new node ŝj+l at ŝ0 /2 if i = 0, at (hn + ŝj+l−1 )/2 if i = j + l or at Reorder ŝk and L(ŝ k ) so that the nodes are increasing 24.
end for 25.
end if 26. end if 27. Compute new weights { ŵk } p k=0 from Equation ( 6 To estimate the second term, we can first use the available approximation Then by differentiating Equation (1) we get P = A T Ṗ + Ṗ A− Ṗ SP −P S Ṗ , from which we can estimate P (t n ).Both of these operations may be low-rank factorized; if Thus the norm of P may be estimated efficiently by two applications of A T and two column compression operations.This estimation may be incorporated into the step-size controller to automatically ensure that the interpolation error is bounded by a fixed tolerance.Whether this is costeffective or not is of course heavily dependent on the rank of the approximation, and thus of the problem data.
To actually compute the interpolant in a low-rank setting, we note that if so that this computation comes at the cost of one column compression step.This interpolation procedure is less straightforward if the setting is generalized to that of time-varying matrices.However, in that case the strategy of sampling the system at constant time intervals is also rather dubious.
Remark 2 It is enough if the terms involved in one step of the splitting methods are of the same accuracy as the local error.Therefore, the error estimates may additionally be used to determine the optimal tolerances for column compression and the actions of the matrix exponentials.As these quantities are obviously not independent, however, a proper implementation requires some care.We have not used this feature in our numerical experiments and instead rely on experience to choose reasonable tolerances.
Finally, we present the full procedure for approximating the solution to Equation (1) in algorithmic form in Algorithms 2-4.We consider only the symmetric case of the additive splitting schemes, since the asymmetric version is analogous; change the order of the error estimator from 2s − 2 to s − 1 and only use the L + or L − terms instead of both.When a step is rejected, it is likely that it is because the approximation of I Q (h n ) is poor.We thus first recompute the whole sequence of quadrature nodes and then retry the step with the same step size.Only if this also fails do we decrease the step size and proceed as normal.Compute in parallel L j ± and D j ± such that for j = 1, . . ., s, according to Algorithms 2 and 3. 7.
Form if tn + hn > T then 23.
end if 25. end while Output:

Numerical experiments
In order to verify the validity of the proposed splitting schemes, a number of numerical experiments were performed using MATLAB implementations of the presented algorithms.
Different norms may be used to measure the errors.In all our experiments, we consider relative errors at the final time, measured in the Frobenius norm.That is, if the approximation P n and a given reference approximation P ref both approximate the solution P (T ), the error is given by where • F denotes the Frobenius norm.

Order investigation, small-scale
As a first test, we demonstrate that the methods exhibit the expected orders of convergence when constant step sizes are used.For this, we consider a small-scale problem with N = 10 and take A, Q, S and P 0 to be random matrices with the latter three having rank 4. The small dimension of the problem means that we may compute a highly accurate reference approximation by unrolling the matrix-valued problem into a vector-valued problem of dimension N 2 and applying a standard method for ODEs.Here we utilize the MATLAB built-in function ode15s, which implements an adaptive variable-order multistep method, with an absolute tolerance of 10 −20 and a relative tolerance of 2.22 • 10 −14 (the minimum).
For this test, we consider the asymmetric splitting schemes (4) of orders 2 and 3, the symmetric schemes (5) of orders 2, 4, 6 and 8, as well as the 2nd-order Strang splitting.To compute terms of the form e hA T L, we use the 5th-order implicit Runge-Kutta scheme RadauIA [18, Chapter IV.5] and halve the step size until two subsequent approximations differ (relatively) by at most 10 −6 .This can clearly be done better, ideally with adaptive time stepping also on this level, but it is sufficient for our purposes.We set the column compression tolerance to 10 −16 so that it has no effect on the results.
The results are shown in Figure 1, where it can be seen that all the methods do, indeed, achieve the expected converge orders.However, a few comments are in order.First, the 3rd-order asymmetric scheme actually exhibits an order of convergence which is slightly larger than 3.This is not true in general and we interpret this as the structure of the error being favourable for this particular problem.Secondly, the errors for the 6th-and 8th-order methods level out around 10 −12 .This is due to round-off error accumulation in each step.Using a dense instead of low-rank factored version of the code, computing e hA explicitly and approximating I Q (h) to high accuracy gives similar results.The leveling out of all the error curves for large step sizes is due to leaving the asymptotic regime; for these step sizes also lower-order error terms influence the result.Thirdly, we note that the 2nd-order asymmetric method performs slightly better than both the Strang splitting and the 2nd-order symmetric method.However, since it is 50% more expensive if parallelization is not used, and even more so if it is, we clearly still prefer the symmetric method.Fig. 1 Errors plotted against step sizes for the problem defined in Section 5.1.We observe that all the methods exhibit the expected convergence orders until the round-off level is reached, except for very large step sizes.

Order investigation, larger-scale
We consider also a larger, real-world problem, arising from the optimal control of steel cooling [8,28].This is essentially a finite-element discretization of a semi-linear PDE given on a non-convex two-dimensional domain.It results in matrices The problem also involves a mass matrix, i.e. the state equation is M ẋ = Ax + Bu.We handle this without inverting M by straightforward modifications to the code as in [34].
Additionally, due to a scaling of the problem, a simulation time step of 1 second corresponds to a real time step of 10 −2 seconds.To avoid confusion, we work with the simulation time throughout, and therefore use a final time T = 4500.The exact solution to the problem is unavailable, and since the other currently existing methods are limited to low orders it is infeasible to use these to compute a sufficiently accurate reference approximation.Instead, we use the 8th-order symmetric splitting scheme itself for this, but with a step size half as large as the smallest step size for the actual approximations.In this experiment we do employ parallelization through use of MATLAB's parfor command, using 8 cores on a cluster built out of Intel 2650v3 CPUs.We restrict ourselves to the Strang splitting and the symmetric methods, since our tests indicate that these are typically more efficient than the asymmetric methods.We perform two tests, one with N = 371 and one with N = 1357.Fig. 2 Errors plotted against step sizes for the problem defined in Section 5.2.Left: N = 371.Right: N = 1357.We observe that the second-order methods show second-order behaviour for all step sizes used, while the higher-order methods suffer from order reduction.For the smaller problem size, we recapture the higher-order behaviour for the smallest step sizes, while the larger problem size requires even smaller step sizes before this happens.Regardless of this, the errors of the higher-order methods are significantly smaller than those of the second-order methods.
Except the time step size, the only varying parameter is the relative tolerance for computing the matrix exponential actions.In the smaller example, this is set to 10 −3 for the Strang splitting and 10 −3 , 10 −6 , 10 −8 and 10 −8 for the additive schemes of order 2, 4, 6 and 8, respectively.In the larger example, we take instead 10 −3 , 10 −3 , 10 −5 , 10 −6 and 10 −6 , respectively.The column compression tolerance is in all cases set to N , where is the machine epsilon.
Figure 2 shows the results, with N = 371 on the left and N = 1357 on the right.In the smaller example, we observe that the second-order methods behave as expected, while the higher-order methods only achieve their respective orders for small step sizes.The fact that the errors level out at around 10 −11 can be avoided by computing the matrix exponentials more accurately, but at additional cost.In the larger example, the situation is slightly worse in that neither of the higher-order methods reach their asymptotic regimes with the used step sizes.This issue may be due to a lack of regularity in the solution to the exact problem.As in the smaller example, we could eliminate the leveling out of the error by decreasing the tolerance for the matrix exponential actions.However, as the computation times required for these small errors are already rather long, we do not do this.In spite of these issues, we note that the higher-order methods still produce much smaller errors for all step sizes except the largest.Also included in Figure 2 are the corresponding errors for the secondorder Rosenbrock method proposed in [6].These computations were done using the MATLAB software M-MESS 1.0.1 [29], which implements the improved LDL T -formulation given in [24].We choose the parameters suggested in the example code for the steel cooling problem.For the smaller problem, we observe clear second-order convergence with errors of comparable size to the splitting schemes.The situation is similar in the larger problem, except that the error evens out for small step sizes, likely due to inner iterations not being computed accurately enough.
Fig. 4 Errors plotted against computation times for the problem defined in Section 5.1.We see that the lower-order methods are most efficient for high error levels, while the higherorder methods are most efficient for low error levels.For errors around 10 −4 , the efficiency of all the methods is comparable.
step size h is the only varying parameter.We observe that all the methods are roughly equivalent for high tolerances, while for error levels below 10 −4 , the symmetric methods outperform the others.For very small errors, the 6th-and 8th-order methods are clearly superior.This is in spite of the fact that parallelization was not used in this case (since the extra time spent on transferring data was much larger than the actual computation time).
The results for the steel cooling problem are shown in Figure 5.These are similar to the small-scale case in that the higher-order methods are more efficient for small errors while the Strang splitting is most efficient for large errors.The plot is slightly misleading, because the low matrix exponential tolerances required for the high-order methods to reach the smallest errors are not strictly required for the less accurate approximations.Similarly, the lowerorder methods would need to compute the matrix exponentials more accurately when the step size is further decreased.Thus the real cut-off point where the higher-order methods become more efficient lies somewhere between the error levels 10 −5 and 10 −7 .We also observe that the 8th-order method is superior to the 6th-order method for small errors.This is due to the parallelization: the cost of increasing the order 2 is equivalent to only one extra Lie splitting step, and one extra processor.Using even higher orders may thus be beneficial, but eventually the overhead costs incurred by the parallelization will dominate.
The strange kinks in the error curves require an explanation.For the Strang splitting this happens twice when N = 371, and on the latter occurrence the computation time even decreases slightly when the step size is decreased.This Fig. 5 Errors plotted against computation times for the problem defined in Section 5.2.Left: N = 371.Right: N = 1357.The lower-order methods are again most efficient for high error levels and vice versa, though the difference between the methods is much less than in Figure 4.
happens due to the way we compute the actions of the matrix exponentials: if the requested accuracy is not reached, the computation is repeated with twice as many sub-steps.Reducing the time step by a factor two makes this computation easier, and it may thus be that most of these computations require only half as many sub-steps as for the larger time step.With twice as many time steps, the total computation time is therefore roughly unchanged.
To provide an indication of what parts of the splitting schemes are expensive, all the methods were run through MATLAB's profile command while solving the steel cooling problem with N = 1357 and with either 40 or 640 time steps, corresponding to h = 112.5 or h = 7.0313.The tolerance for the e γhA T L-computations was set to 10 −6 in all cases.The results are shown in Table 2 and 3. We observe that, as expected, the evaluations of T G are essentially free in comparison to T F .The cost of the latter completely dominates the overall procedure.Along with the observation in the previous paragraph, this provides additional incentive for studying better implementation strategies for this basic operation.
Further, we observe that the relative cost of column compression increases with the order of the method, since L j ± and D j ± increase in size.In total, however, this cost is negligible, despite the fact that the ranks of the approximations are not very small.It should be noted here that due to the difficulties of accurately timing parallel code in this level of detail, a serial implementation was used.While this skews the ratios, the effect is very small because almost all column compressions originate from T F -evaluations (95% for the order 8 method and the smallest step size).As expected, the relative cost of the one-time computation of I Q (h) is higher for a small number of time steps, but even in the worst case it is measured in single digit percentages.With many time steps, the relative cost is negligible.
Finally, we note that Figure 5, like Figure 2, also includes the results for the second-order Rosenbrock method.While a fair comparison is difficult, and many parameters could be further fine-tuned for all the methods, these results clearly indicate that the splitting schemes constitute a competitive alternative to this class of methods.3 Computational time breakdown for the splitting schemes when applied to the steel cooling problem with N = 1357 and 640 time steps.Shown is the time spent on the given operation, divided by the total time for the integration (in percent).The numbers are not independent, e.g.computing I Q (h) requires several e γhA T L evaluations and column compressions.

Time adaptivity
Finally, we test the full time step adaptive code with the 4th-order symmetric splitting scheme.In Figure 6 we have plotted the results of using four different tolerances on the small-scale problem defined in Section 5.1.We plot both the error estimated by the method using the embedded method, and the actual error.The latter is computed by using the same method, but by taking 10 equidistant steps in each of the steps given by the adaptive code.We observe that the actual error is in all cases less than the estimated error, and the difference increases as the tolerance decreases.This is due to the fact that the error estimate is of a lower order than the actual method used.The effect is more pronounced here than usual, since in the symmetric case the accuracy of the estimate is 2 orders less than the method.In each figure we have also plotted the step sizes, and we see that the controller works well in finding the maximum possible step size.For the largest tolerance, the controller is too cautious and does not quite reach the tolerance until the simulation is over.Effects like this can (and should, this is one area we aim to pursue in the near future) be tuned by adjusting the parameters k I and k P .In Figure 7, we have repeated the same experiment but on the steel problem with N = 371 and only with the tolerance 10 −3 .Also in this case, the adaptiveness seems to work well -the maximum possible step size (given the tolerance) is quickly reached and after this it varies very little.In this case, the difference between the error estimate and the actual error is not as large Fig. 6 Computed error estimate, actual error and step size for each time step, when applying the adaptive 4th-order symmetric splitting scheme to the problem defined in Section 5.1.We consider error per unit step (EPUS), i.e. all errors are divided by the time step.The tolerances used are, from left to right, 10 −1 , 10 −2 and 10 −3 .We observe that the adaptivity finds the maximum step size such that the error estimate is equal to the tolerance.Due to the lower-order estimate, the actual error is in all cases less than the estimated error, and the difference increases as the step size decreases.The sudden drop in step size (and error) in the final step is necessary in order to exactly reach the final time T .as in the previous example.This is likely due to the order reductions observed in Section 5.2.
The left plot shows the results when Algorithm 1 is not used and the right plot when it is.In both cases, we used a column compression tolerance of 10 −8 , a relative tolerance of 10 −4 for the matrix exponential actions and quadrature of order 9 to compute I Q (h n ).When not using Algorithm 1 we use Gaussian quadrature rather than Newton-Cotes, and thus these computations only need 4 quadrature nodes compared to the updating formula which needs 10.Still, as demonstrated by the computational time breakdown in Table 4, the latter is more efficient because typically only one or even none of these nodes need to be updated in each step.In the current experiment, 111 steps were taken.Of these, 6 were rejected which required all 10 nodes to be updated.During the remaining 105 steps, a total of 10 nodes required an update.Error estimate (EPUS) Actual error (EPUS) h Fig. 7 Computed error estimate, actual error and step size for each time step, when applying the adaptive 4th-order symmetric splitting scheme with tolerance 10 −3 to the problem defined in Section 5.2.The method was applied either without (left) or with (right) Algorithm 1.We consider error per unit step (EPUS), i.e. all errors are divided by the time step.We observe that the adaptivity works rather well.4 Computational time breakdown for the adaptive splitting scheme when applied to the steel cooling experiment in Section 5.4, either with or without the use of Algorithm 1.

Method
The first column shows the relative computation times depending on this choice.The other columns show the time spent on the given operation, divided by the total time for the respective method.All numbers are in percent, and "CC" is an abbreviation for column compression.Only the numbers in the last two columns are independent.The remaining computation time was spent on (unoptimized) caching of matrices and general bookkeeping.

Conclusions
We have introduced a family of splitting schemes for differential Riccati equations which may be of arbitrarily high order, and shown that they may be implemented efficiently in a large-scale setting by utilizing the low-rank LDL Tfactorization.Our numerical experiments indicate that the higher-order methods are more efficient when high accuracy is desired, though this of course depends on the actual problem.In addition, we have demonstrated that these methods contain natural embedded error estimates, which e.g. may be used for time step adaptivity.While further research on appropriate controller parameters in this setting is required, experiments show that even a basic implementation gives promising results.

Algorithm 2 Algorithm 3 Algorithm 4
Computing the low-rank factorization of T G (h)P 0Input: Matrices S ∈ R N ×N , L 0 ∈ R N ×r and D 0 ∈ R r×r with P 0 = L 0 D 0 L T 0 , step size h 1. Compute D = (I + hD 0 L T 0 SL 0 ) −1 D 0 2. Set L = L 0 Output: Matrices L and D such that LDL T ≈ T G (h)P 0 Computing the low-rank factorization of T F (h)P 0 Input: Matrices L 0 ∈ R N ×r , D 0 ∈ R r×r such that P 0 = L 0 D 0 L T 0 , step size h, approximate low-rank factorization L I D I L T I of I Q (h) 1. Compute L = e hA T L 02.Form L = [ L, L I ] and D = D 0 0 0 D I and column-compress Output: Matrices L and D such that LDL T ≈ T F (h)P 0 Approximating the solution to Equation (1) ×r and D 0 ∈ R r×r such that P 0 = L 0 D 0 L T 0 Input: Desired method order 2s, coefficients {γ k } s k=1 for order 2s, coefficients {β k } s k=1 for order 2s − 2, initial time step h 1 , desired error tolerance TOL Input: Equidistant nodes s k and weights w k for a quadrature rule of order s + 1 on [0, h 1 ] 1. Set α k = γ k − β k for k = 1, . . ., s − 1 and αs = γs 2. Set k I = 0.2/(2s − 2), k P = 0.2/(2s − 2) 3. Set n = 1, tn = 0 and en = 0 4. while tn + hn ≤ T do 5. Low-rank approximate I Q (hn) ≈ L I D I L T I according to Algorithm 1, store the computed L(s k ) 6.

Table 2
Computational time breakdown for the splitting schemes when applied to the steel cooling problem with N = 1357 and 40 time steps.Shown is the time spent on the given operation, divided by the total time for the integration (in percent).The numbers are not independent, e.g.computing I Q (h) requires several e γhA T L evaluations and column compressions.
\ Operation Total time I Q (hn)T F e γhA T L CC