Machine Learning & Control

Machine Learning & Control#

Modern machine learning provides useful tools and perspectives for control theory. Framing control problems as data modeling tasks enables powerful function approximation, estimation, and optimization techniques from machine learning to be applied.

Learning-Based Control#

Learning-based control addresses the automated and data-driven generation or adaptation of elements of the controller formulation to improve control performance.
The learning setup can be diverse:
- Offline learning involves adapting the controller between trials or episodes while collecting data.
- Online learning adjusts the controller during closed-loop operation (e.g. repetitive tasks) or using data from one task execution.

System Identification#

System Identification is the process of constructing a model of a (dynamical) system from observations (measurements) of its inputs and outputs. System identification also includes the optimal design of experiments for efficiently generating informative data for fitting such models as well as model reduction.

In control engineering, the objective is to obtain a good performance of the closed-loop system, which is the one comprising the physical system, the feedback loop and the controller. This performance is typically achieved by designing the control law relying on a model of the system, which needs to be identified starting from experimental data. If the model identification procedure is aimed at control purposes, what really matters is not to obtain the best possible model that fits the data, as in the classical system identification approach, but to obtain a model satisfying enough for the closed-loop performance.

In control engineering, we typically rely on the mathematical modelling of the system from first principles and then optionally use parameter identification methods to determine the values of the mathematical model’s parameters.

In our case, we will not rely on that and instead prefer fitting a model based only on measured data.

Exercise 6 (Dynamic System Model Evaluation)

How do we evaluation the fitted model of the system?

Solution to ( 6

We could evaluate the model’s prediction on a held-out validation set.
Additionally we could evaluate the model’s long-term predictions in an open-loop manner.
On top of that, we could evaluate the model’s step[1] and impulse[2] responses from different initial states. This is especially useful for linear systems.
It is also valuable to look at what the model’s residuals and make sure there is no correlation with other available information such as the input.

Data Collection#

Before learning a model of the system, we need to collect either during the system’s operation i.e. online data collection, or inbetween episodes of the system’s operation i.e. offline data collection. The collected data has to be informative, i.e. comprehensive and be representative of the all behaviours of the system that we desire to capture.

Data informativity is a condition on the data under which it is possible to distinguish between different models in a (parametric) model class.

Commonly used input signals for system identification:

Step function
Pseudorandom binary sequence (PRBS)
Autoregressive, moving average process
Periodic signals: sum of sinusoids
Schroeder Sweep (Sum of phase shifted sinusoids)

\[ u(t) = \sum\limits_{k=1}^M A_k \cos(\frac{2\pi k \tau(t)}{T} + \phi_k), \quad t = 0, 1, \dots , N − 1 \]

Where:

\[\begin{split} \begin{array}{l} k = \sqrt{\frac{P}{M}}\\ \phi_1 = 0\\ \phi_k = \phi_{k-1} − \frac{\pi k^2}{N}, \quad k = 2, 3, \dots , M \end{array} \end{split}\]

With $M$ the number of harmonically related frequencies, $\phi_k$ the phase angles of the harmonic components to produce low Peak Factor (PF), $N$ the number of time steps, $P$ the total desired input power.

For certain systems in many safety-critical tasks such as space exploration or robot-assisted surgery we have to rely on human experts to control the system and avoid damaging it or its environment.

Cart#

cart_env = create_cart_environment(max_steps=500, goal_position=20, max_position=30)

../_images/5282ce099a43f90b7d36202ecca449defce66e551ee297cdf17ad09b598d3b5e.png

../_images/0ec50286a0a4767097e0e080c6aaeb548cc3506309bb8059e35009527f1adcd4.png

Let’s look at the phase portrait[3] instead.

../_images/a1722c7521cf8fe6e2e7ab02f831ac24949884aea593c06c6c08a4d6881b39b4.png

Let’s visualize the Power Spectral Density (PSD) of the input and state signals

../_images/34d8eb7da2554a3b54d1b75ea2f38988a7d72d13b93ea32067eaec9873e8ef19.png

Model Fitting#

Naive Deep Learning#

\[ \mathbf{x}_{t+1} = f_\theta(\mathbf{x}_t, \mathbf{u}_t) \]

Where $\theta$ are learnable parameters.

Issues with such an approach:

Domain shift from the training distribution to the real-world distribution.
Black-box.

Deep Learning with Nominal Model#

A real-world dynamical system can be described as:

\[ \mathbf{x}_{t+1} = \underbrace{f_n(\mathbf{x}_t, \mathbf{u}_t)}_{\text{nominal dynamics}} + \underbrace{g_r(\mathbf{x}_t, \mathbf{u}_t)}_{\text{residual dynamics}} + \underbrace{w_t}_{\text{disturbance}} \]

Where $f_n$ is the nominalsystem model, $g_r$ an additive residual term accommodating uncertainty and $w_t$ represents distburnances.

Many learning-based techniques make use of this explicit distinction by only learning the residual term.

An example of the use of this approach can be found in [SSOConnell+19].

Visual depiction of the ground effect, the complex aerodynamic effect between the drone and the ground, which is nonlinear, nonstationary, and very hard to model using standard system identification approaches. Adapted from Neural Control blog post. is a figure depicting the ground effect that is modelled using a neural network and Comparison of drone landing with and without neural network modelling of residual ground effect. Taken from Neural Control blog post. is a comparison of drone landings with and without neural network modeling.

../_images/60_neural_lander_diagram.png — Fig. 17 Visual depiction of the ground effect, the complex aerodynamic effect between the drone and the ground, which is nonlinear, nonstationary, and very hard to model using standard system identification approaches. *Adapted from Neural Control blog post*.#

../_images/60_neural_lander.gif — Fig. 18 Comparison of drone landing with and without neural network modelling of residual ground effect. *Taken from Neural Control blog post*.#

Issues with such an approach:

Requires nominal model.
Still susceptible to domain shift issues.

Sparse Identification of Nonlinear Dynamics (SINDy)#

The SINDY algorithm identifies fully nonlinear dynamical systems from measurement data. This relies on the fact that many dynamical systems have relatively few terms in the right hand side of the governing equations:

\[ \dot{\mathbf{x}} = f(\mathbf{x}) \]

To determine the function $f$ from data, we collect a time history of the state $\mathbf{x}$ and either measure the derivative $\dot{\mathbf{x}}$ or approximate it numerically. The data are sampled at several times $t_1, t_2, \dots, t_m$ and arranged into two matrices:

\[\begin{split} \mathbf{X} = \begin{bmatrix} \mathbf{x}^T(t_1)\\ \mathbf{x}^T(t_2)\\ \vdots\\ \mathbf{x}^T(t_m) \end{bmatrix} = \begin{bmatrix} x_1(t_1) & x_2(t_1) & \dots & x_n(t_1)\\ x_1(t_2) & x_2(t_2) & \dots & x_n(t_2)\\ \vdots & \vdots & \ddots & \vdots\\ x_1(t_m) & x_2(t_m) & \dots & x_n(t_m)\\ \end{bmatrix} \end{split}\]

\[\begin{split} \dot{\mathbf{X}} = \begin{bmatrix} \dot{\mathbf{x}}^T(t_1)\\ \dot{\mathbf{x}}^T(t_2)\\ \vdots\\ \dot{\mathbf{x}}^T(t_m) \end{bmatrix} = \begin{bmatrix} \dot{x}_1(t_1) & \dot{x}_2(t_1) & \dots & \dot{x}_n(t_1)\\ \dot{x}_1(t_2) & \dot{x}_2(t_2) & \dots & \dot{x}_n(t_2)\\ \vdots & \vdots & \ddots & \vdots\\ \dot{x}_1(t_m) & \dot{x}_2(t_m) & \dots & \dot{x}_n(t_m)\\ \end{bmatrix} \end{split}\]

Next, we construct a library $\Theta(\mathbf{X})$ consisting of candidate non-linear functions of the columns of $\mathbf{X}$. For example, \Theta(\mathbf{X})$ may consist of constant, polynomial, and trigonometric terms:

\[ \Theta(\mathbf{X}) = \begin{bmatrix} \mathbf{1} & \mathbf{X} & \mathbf{X}^{P_2} & \mathbf{X}^{P_3} & \dots & \sin(\mathbf{X}) & \cos(\mathbf{X}) \end{bmatrix} \]

Here, higher polynomials are denoted as $\mathbf{X}^{P_2}$, $\mathbf{X}^{P_3}$, etc., where $\mathbf{X}^{P_2}$ denotes the quadratic nonlinearities in the state x:

\[\begin{split} \mathbf{X}^{P_2} = \begin{bmatrix} x_1^2(t_1) & x_1(t_1)x_2(t_1) & \dots & x_n^2(t_1)\\ x_1^2(t_2) & x_1(t_2)x_2(t_2) & \dots & x_n^2(t_2)\\ \vdots & \vdots & \ddots & \vdots\\ x_1^2(t_m) & x_1(t_m)x_2(t_m) & \dots & x_n^2(t_m)\\ \end{bmatrix} \end{split}\]

Each column of $\Theta(\mathbf{X})$ represents a candidate function for the right-hand side of the system’s ODE. There is tremendous freedom in choosing the entries in this matrix of nonlinearities.

Because we assume that only a few of these nonlinearities are active in each row of $f$, we may set up a sparse regression problem to determine the sparse vectors of coefficients $\Xi = \begin{bmatrix} \xi_1 ^ \xi_2 & \dots & \xi_n \end{bmatrix}$ that determine which nonlinearities are active:

\[ \dot{\mathbf{X}} = \Theta(\mathbf{X})\Xi \]

This is illustrated in Schematic of the SINDy algorithm, demonstrated on the Lorenz equations. Data are collected from the system, including a time history of the states \mathbf{X} and derivatives \dot{\mathbf{X}}. Next, a library of nonlinear functions of the states, \Theta(\mathbf{X}), is constructed. This is then used to find the fewest terms needed to satisfy \dot{\mathbf{X}} = \Theta(\mathbf{X})\Xi. The few entries in the vectors of \Xi, solved for by sparse regression, denote the relevant terms in the right-hand side of the dynamics. Taken from brunton_discovering_2016. Each column $\xi_k$ of $Xi$ is a sparse vector of coefficients determining which terms are active in the right-hand side for one of the row equations:

\[ \dot{x}_k = f_k(x) = \Theta(x^T)\xi_k \]

../_images/70_sindy_diagram.svg — Fig. 19 Schematic of the SINDy algorithm, demonstrated on the Lorenz equations. Data are collected from the system, including a time history of the states $\mathbf{X}$ and derivatives $\dot{\mathbf{X}}$. Next, a library of nonlinear functions of the states, $\Theta(\mathbf{X})$, is constructed. This is then used to find the fewest terms needed to satisfy $\dot{\mathbf{X}} = \Theta(\mathbf{X})\Xi$. The few entries in the vectors of $\Xi$, solved for by sparse regression, denote the relevant terms in the right-hand side of the dynamics. *Taken from [BPK16a]*#

SINDy with Control (SINDyc)#

SINDYc generalizes the SINDY method to include inputs and control. In particular, it considers the nonlinear dynamical system with inputs $\mathbf{u}$:

\[ \dot{\mathbf{x}} = f(\mathbf{x}, \mathbf{u}) \]

The SINDY algorithm is readily generalized to include actuation, as this merely requires building a larger library $\Theta(\mathbf{X}, \mathbf{U})$ of candidate functions that include $u$; these functions can include nonlinear cross terms in $\mathbf{x}$ and $\mathbf{u}$. This extension requires measurements of the state $\mathbf{x}$ as well as the input signal $\mathbf{u}$.

../_images/70_sindy_with_control_diagram.svg — Fig. 20 Schematic of the SINDy with control algorithm. Active terms in a library of candidate nonlinearities are selected via sparse regression. *Taken from [FKK+21]*.#

Koopman Operator#

Given $\mathcal{F}$ a space of functions $g: \Omega \rightarrow \mathbb{C}$, and $\Omega$ the state space of our dynamical system. The Koopman operator is defined on a suitable domain $\mathcal{D}(\mathcal{K}) \subset \mathcal{F}$ via the composition formula:

\[ [\mathcal{K}g](\mathbf{x}) = [g \circ f](\mathbf{x}) = g(f(\mathbf{x})), \quad g \in \mathcal{D}(\mathcal{K}) \]

Where $\mathbf{x}_{t+1} = f(\mathbf{x}_t)$

The functions $g$, referred to as observables, serve as tools for indirectly measuring the state of the system. Specifically, $g(x_t)$ indirectly measures the state $x_t$.

In this context, $[\mathcal{K}g](\mathbf{x}_t) = g(f(\mathbf{x}_t)) = g(\mathbf{x}_{t+1})$ represents the measurement of the state one time step ahead of $g(\mathbf{x}_t)$. This process effectively captures the dynamic progression of the system.

The key property of the Koopman operator $\mathcal{K}$ is its linearity. This linearity holds irrespective of whether the system’s dynamics are linear or nonlinear. Consequently, the spectral properties of $\mathcal{K}$ become a powerful tool in analyzing the dynamical system’s behavior.

if $g \in \mathcal{F}$ is an eigenfunction of $\mathcal{K}$ with eigenvalue $\lambda$, then:

\[ g(\mathbf{x}_t) = [\mathcal{K}^t g](\mathbf{x}_0) = λ^t g(\mathbf{x}_0), \quad \forall n \in \mathbb{N}. \]

One of the most useful features of Koopman operators is the Koopman Mode Decomposition (KMD). The KMD expresses the state $\mathbf{x}$ or an observable $g(\mathbf{x})$ as a linear combination of dominant coherent structures. It can be considered a diagonalization of the Koopman operator.

As a result, the KMD is invaluable for tasks such as dimensionality and model reduction. It generalizes the space-time separation of variables typically achieved through the Fourier transformor singular value decomposition (SVD).

Dynamic Mode Decomposition (DMD)#

Dynamic Mode Decomposition (DMD) is a popular data-driven analysis technique used to decompose complex, nonlinear systems into a set of coherent structures (also called DMD modes), that grow, decay, and/or oscillate in time revealing underlying patterns and dynamics through spectral analysis. In other words, the DMD converts a dynamical system into a superposition of modes whose dynamics are governed by eigenvalues.

The simplest and historically first interpretation of DMD is as a linear regression.

We consider a discrete-time dynamical systems represented as:

\[ \mathbf{x}_{t+1} = f(\mathbf{x}_t), \quad n = 0, 1, 2, \dots, \]

Given discrete-time snapshots of the system:

\[ \{\mathbf{x}^{(m)}, \mathbf{y}^{(m)}\}^M_{m=1}, \quad \text{s.t.} \quad \mathbf{y}^{(m)} = f(\mathbf{x}^{(m)}), \quad m = 1, \dots , M. \]

We define the snapshot matrices $\mathbf{X}, \mathbf{Y} \in \mathbb{C}^{d\times M}$ as:

\[ \mathbf{X} = \begin{bmatrix}x^{(1)} & x^{(2)} & \dots & x^{(M)} \end{bmatrix} \]

\[ \mathbf{Y} = \begin{bmatrix}y^{(1)} & y^{(2)} & \dots & y^{(M)} \end{bmatrix} \]

We seek a matrix $\mathbf{K}_{\text{DMD}}$ such that $\mathbf{Y} \approx \mathbf{K}_{\text{DMD}} \mathbf{X}$. We can think of this as constructing a linear and approximate dynamical system.

To find a suitable matrix KDMD, we consider the minimization problem:

(1)#\[ \underset{\mathbf{K}_{\text{DMD}} \in \mathbb{C}^{d\times d} }{\min} \left\lVert \mathbf{Y} − \mathbf{K}_{\text{DMD}} \mathbf{X} \right\rVert_F , \]

where $\left\lVert . \right\rVert_F$ denotes the Frobenius norm[4]. Similar optimization problems will be at the heart of the various DMD-type algorithms we consider in this review. A solution to the problem in (1) is:

\[ \mathbf{K}_{\text{DMD}} = \mathbf{Y} \mathbf{X}^{+} ∈ \mathbb{C}^{d\times d}, \]

where $^{+}$ denotes the Moore–Penrose pseudoinverse.

In practice, this is computed using the Singular Value Decomposition (SVD) as follows:

\[\begin{split} \begin{array}{ll} \mathbf{X} \approx U \Sigma V^∗ & \text{(truncated SVD of rank r)}\\ \tilde{\mathbf{K}}_{\text{DMD}} = U^∗\mathbf{Y} V \Sigma^{-1} & \text{(Compute compression)}\\ \tilde{\mathbf{K}}_{\text{DMD}} W = W \Lambda & \text{(Compute eigendecomposition)}\\ \phi = YV\Sigma^{-1}W & \text{(Compute the modes)}\\ \end{array} \end{split}\]

The core goal of DMD is to apply linear algebra and spectral techniques to the analysis, prediction, and control of nonlinear dynamical systems. However, DMD often faces several challenges that have been a driving force for the many versions of the DMD algorithm that have appeared.

Generally speaking, the error of DMD and its approximate KMD can be split into three types:

The projection error is due to projecting/truncating the Koopman operator onto a finite-dimensional space of observables. This is linked to the issue of closure and lack of (or lack of knowledge of) non-trivial finite-dimensional Koopman invariant subspaces.
The estimation error is due to estimating the matrices that represent the projected Koopman operator from a finite set of potentially noisy trajectory data.
Numerical errors (e.g., roundoff, stability, further compression, etc.) incurred when processing the finite DMD matrix.

Variants#

Table 1 Summary of some DMD methods. *Adapted from [Col23]*.#
DMD Method	Challenges Overcome	Key Insight/Development
Forward-Backward DMD	Sensor noise bias.	Take geometric mean of forward and backward propagators for the data.
Total Least-Squares DMD	Sensor noise bias.	Replace least-squares problem with total least-squares problem.
Optimized DMD Bagging Optimized DMD	Sensor noise bias. Optimal collective processing of snapshots.	Exponential fitting problem, solve using variable projection method. Statistical bagging sampling strategy.
Compressed Sensing	Computational efficiency. Temporal or spatial undersampling.	Unitary invariance of DMD extended to settings of compressed sensing (e.g., RIP, sparsity-promoting regularizers).
Randomized DMD	Computational efficiency. Memory usage.	Sketch data matrix for computations in reduced-dimensional space.
Multiresolution DMD	Multiscale dynamics.	Filtered decomposition across scales.
DMD with Control	Separation of unforced dynamics and actuation.	Generalized regression for globally linear control framework.
Extended DMD	Nonlinear observables.	Arbitrary (nonlinear) dictionaries, recasting of DMD as a Galerkin method.
Physics-Informed DMD	Preserving structure of dynamical systems. Numerous instances given in general framework.	Restrict the least-squares optimization to lie on a matrix manifold.

DMD with Control (DMDc)#

../_images/70_dmdc_overview.svg — Fig. 22 Overview of Dynamic Mode Decomposition with Control (DMDc). *Adapted from [PBK16]*.#

One of the most successful applications of the Koopman operator framework lies in control with demonstrated successes in various challenging appli- cations. These include fluid dynamics, robotics, power grids, biology, and chemical processes.

The key point is that Koopman operators represent nonlinear dynamics within a globally linear framework. This approach leads to tractable convex optimization problems and circumvents theoretical and computational limitations associated with nonlinearity. Moreover, it is amenable to data-driven, model-free approaches.

DMDc extends DMD to disambiguate between unforced dynamics and the effect of actuation.

The DMD regression is generalized to:

\[ \mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t) \approx A \mathbf{x}_t + B \mathbf{u}_t \]

where $A \in \mathbb{C}^{d\times d}$ and $B \in \mathbb{C}^{d\times q} are unknown matrices.

Snapshot triplets of the form $\{\mathbf{x}^{(m)}, \mathbf{y}^{(m)}, \mathbf{u}^{(m)}\}^M_{m=1}$ are collected, where we assume that:

\[ \mathbf{y}^{(m)} \approx f(\mathbf{x}^{(m)}, \mathbf{u}^{(m)}), \quad m = 1, \dots , M. \]

The control portion of the snapshots is arranged into the matrix $\Upsilon = \begin{pmatrix}u^{(1)} & u^{(2)} & \dots & u^{(M)}\end{pmatrix}$.

The optimization problem in (1) is replaced by

(2)#\[\begin{split} \underset{A, B}{\min} \left\lVert \mathbf{Y} − \begin{pmatrix}A & B \end{pmatrix}\Omega \right\rVert_F^2,\\ \text{where} \quad \Omega = \begin{pmatrix}\mathbf{X} \\ \Upsilon\end{pmatrix} \end{split}\]

A solution is given as $\begin{pmatrix}A & B \end{pmatrix} = \mathbf{Y} \Omega^{+}$.

Machine Learning & Control

Contents

Machine Learning & Control#

Learning-Based Control#

System Identification#

Data Collection#

Cart#

Model Fitting#

Naive Deep Learning#

Deep Learning with Nominal Model#

Sparse Identification of Nonlinear Dynamics (SINDy)#

SINDy with Control (SINDyc)#

Koopman Operator#

Dynamic Mode Decomposition (DMD)#

Variants#

DMD with Control (DMDc)#

Comparison#

Application to Systems#

Cart#

Data#

SINDYc#

DMDc#

Inverted Pendulum#

Exercise#