MPC and AlphaZero

AlphaZero is a computer program developed by artificial intelligence research company DeepMind to master the games of chess, shogi and go.

It is built from three core pieces:

  • Value Function (Neural Network) to estimate the optimal cost-to-go for any given state.

  • Policy (Neural Network) to determine the action to take at a given state.

  • Monte Carlo Tree Search (MCTS) to simulate and search for the best plan.

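To make the interplay of these three pieces concrete, here is a minimal, hedged sketch of a PUCT-style selection step on a toy integer game, with placeholder classes standing in for the value and policy networks. The class names, the toy game, and the exploration constant are assumptions made for illustration, not AlphaZero's actual implementation.

```python
import math


class ValueNet:
    """Placeholder for the trained value network: state -> estimated value."""
    def evaluate(self, state):
        return -abs(state - 5) / 5.0  # pretend the goal is to reach state 5


class PolicyNet:
    """Placeholder for the trained policy network: state -> action priors."""
    def priors(self, state, actions):
        return {a: 1.0 / len(actions) for a in actions}  # uniform priors


def mcts(root, value_net, policy_net, actions=(-1, +1), n_sims=200, c_puct=1.4):
    """Tiny MCTS: expand one step from the root, evaluate the resulting states
    with the value network, and select actions by a PUCT-style score that
    combines the running value estimate with the policy priors."""
    stats = {a: {"n": 0, "w": 0.0} for a in actions}
    priors = policy_net.priors(root, actions)
    for _ in range(n_sims):
        total_n = sum(s["n"] for s in stats.values()) + 1

        def puct(a):
            q = stats[a]["w"] / stats[a]["n"] if stats[a]["n"] else 0.0
            u = c_puct * priors[a] * math.sqrt(total_n) / (1 + stats[a]["n"])
            return q + u

        a = max(actions, key=puct)
        leaf_value = value_net.evaluate(root + a)  # value of the state after action a
        stats[a]["n"] += 1
        stats[a]["w"] += leaf_value
    return max(actions, key=lambda a: stats[a]["n"])  # most-visited action


print(mcts(root=0, value_net=ValueNet(), policy_net=PolicyNet()))  # -> 1
```
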
[Figure: 50_alphazero_online_play.png]

Fig. 16 Illustration of an on-line player such as the one used in AlphaGo, AlphaZero, and Tesauro’s backgammon program. At a given position, it generates a lookahead tree of multiple moves up to some depth, then runs the off-line-trained player for some more moves, and evaluates the effect of the remaining moves using the position evaluator of the off-line player.

In AlphaZero, the policy and value networks are trained off-line using an approximate version of the fundamental DP algorithm of policy iteration. A separate on-line player is used to select moves, based on multistep lookahead minimization and a terminal position evaluator that was trained using experience with the off-line player.

This approach performs better than using the off-line policy directly because the long lookahead minimization corrects for the inevitable imperfections of the neural-network-trained off-line player and of the position evaluator/terminal cost approximation.

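As a hedged illustration of this on-line player, the sketch below combines multistep lookahead minimization, a truncated rollout with an "off-line" policy, and a terminal cost from an "off-line" value function, on a toy scalar problem. The model, stage cost, and policy/value stand-ins are assumptions made for the example, not any particular game's implementation.

```python
import itertools


# Toy deterministic model: the state is an integer and the goal is the origin.
def step(s, a):          # model of the game/system dynamics
    return s + a


def stage_cost(s, a):    # cost incurred at each step
    return abs(s) + 0.1 * abs(a)


def offline_policy(s):   # stands in for the neural-network-trained policy
    return -1 if s > 0 else (1 if s < 0 else 0)


def offline_value(s):    # stands in for the neural-network position evaluator
    return abs(s)


def online_player(state, actions=(-1, 0, 1), lookahead=2, rollout=3):
    """l-step lookahead minimization, m-step truncated rollout with the
    off-line policy, and a terminal cost from the off-line value function."""
    def tail_cost(s):
        cost = 0.0
        for _ in range(rollout):           # truncated rollout
            a = offline_policy(s)
            cost += stage_cost(s, a)
            s = step(s, a)
        return cost + offline_value(s)     # terminal cost approximation

    best_a, best_cost = None, float("inf")
    for seq in itertools.product(actions, repeat=lookahead):  # lookahead tree
        s, cost = state, 0.0
        for a in seq:
            cost += stage_cost(s, a)
            s = step(s, a)
        cost += tail_cost(s)
        if cost < best_cost:
            best_a, best_cost = seq[0], cost
    return best_a                          # only the first move is played


print(online_player(state=4))  # -> -1: step toward the origin
```
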
In model predictive control (MPC), there is no off-line training; instead, the system’s model is used directly for the on-line rollout. The control interval corresponds to the number of steps in the lookahead minimization, while the prediction interval corresponds to the total number of steps in the lookahead minimization plus the truncated rollout.
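
The following is a minimal MPC sketch under this correspondence, assuming a scalar linear model, a quadratic stage cost, and a zero-input rollout over the remainder of the prediction interval; the horizons, discrete control set, and numbers are arbitrary choices for illustration, not a prescribed design.

```python
import itertools


def model(x, u):     # known system model, used directly on-line
    return 0.9 * x + u


def cost(x, u):      # quadratic stage cost
    return x ** 2 + 0.1 * u ** 2


def mpc_action(x, controls=(-1.0, 0.0, 1.0), control_horizon=2, prediction_horizon=5):
    """Minimize over the control interval, then roll the model out with u = 0
    for the remainder of the prediction interval; return the first control."""
    rollout_steps = prediction_horizon - control_horizon
    best_u, best_cost = None, float("inf")
    for seq in itertools.product(controls, repeat=control_horizon):
        xk, total = x, 0.0
        for u in seq:                        # lookahead minimization steps
            total += cost(xk, u)
            xk = model(xk, u)
        for _ in range(rollout_steps):       # truncated rollout with the model
            total += cost(xk, 0.0)
            xk = model(xk, 0.0)
        if total < best_cost:
            best_u, best_cost = seq[0], total
    return best_u


# Receding-horizon loop: re-solve at every step and apply only the first control.
x = 3.0
for t in range(5):
    u = mpc_action(x)
    x = model(x, u)
    print(f"t={t}  u={u:+.1f}  x={x:.3f}")
```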