How can an algorithm play games?

Recently, AlphaGo made news as the first computer program to beat a world champion at the game of Go by 4 games to 1. Even more recently, it beat the player with highest ELO rating 3 games to none. And even more recently, AlphaZero, the next step in the AlphaGo evolution, was able to not only beat AlphaGo at Go but beat able to beat top programs at chess and shogi as well.

How does an algorithm play a game? In this post, I want to explore some of the methods and ideas used in an accessible format. In a follow up post, I hope to explore how computers *reason* about games. So, let’s begin!

Let’s start off with a game you probably mastered years ago – Tic Tac Toe. If both players play optimally in Tic Tac Toe, the game ends in a draw. You probably know how to play this game perfectly, guaranteeing at least a draw and sometimes even winning against players who make mistakes. But here’s a different task for you – how do you explain to a computer how to replicate your results?

Your first method might be to supply the program with a list of possible states the board can be in and then manually tell it what move to play in each situation. After all, as far as a modern computer is concerned, the number of possible legal configurations of a Tic Tac Toe board is rather small. The notion of having a function that maps states of the board to actions will be our notion of a *strategy*.

Well, that algorithm is rather unsatisfying. After thinking for a bit, you may even codify the results a bit to clean the algorithm up a bit. In the end, you can reach this known optimal algorithm for tic tac toe play, which I am going to paraphrase from the Wikipedia page for Tic Tac Toe. Here it is in pseudocode.

On your turn, do the highest possible ranked step (lower number means higher rank):

- Win (if it’s possible to win, then complete your line)
- Block (block the opponent’s imminent threat to win)
- Fork (create two of your lines at once that are both ready to win)
- Block a fork (deny your opponent a fork by either creating two in a row and forcing a defense or taking the fork spot)
- Center (take the center)
- Opposite corner (take a corner opposite to a corner your opponent has taken)
- Empty Corner (take an empty corner)
- Empty Side (take an empty side)

It turns out that this is enough to play a perfect game of tic tac toe, which you may have already been using implicitly. This is our first and most unsatisfying way to create a game playing algorithm – an explicit description of a winning strategy. This is only possible when dealing with the simplest games, as here we relied heavily on the fact that there are not many options available to the player on each turn of Tic Tac Toe. How many steps would be involved in a list that described the perfect strategy for chess?

Well, now we have our algorithm for Tic Tac Toe. To develop our second algorithm, let’s delve a bit into formal game theory.

Zermelo’s theorem states that every finite, two player, perfect information game without an element of chance either one player has a winning strategy or both players have a strategy that forces a draw. Finite means the game has a finite number of configurations, and perfect information means that no element of the game is hidden from any player at any time. The original theorem reasons about infinite sequences within finite games, but for our more practical game playing approach we won’t consider this. So, does this apply to complicated games like Chess and Go? Yes, indeed it does, but unfortunately it doesn’t say whether white wins, black wins, or the game ends in a draw – it just says that the game is determined from the starting position. This is a useful tool when *reasoning *about games, but not so if we want to decide what moves to play to win.

The proof of the simple version of this theorem (no infinite sequences) leads us to our second algorithm – backwards induction.

Since we are dealing with two players, let us call them red and blue for simplicity. Now imagine a game ‘unraveled’ in the following way to create a mathematical tree:

The root node of the tree is the starting configuration of the game, and we denote who plays in this position as well. Since the turns are alternating, this is all the turn specification we need to do.

Each child of the root node represents one move away from the starting configuration. In chess, we would start with the starting configuration of chess with a note that the white player moves first. Then, we would make a child node for *each possible move *and denote these nodes by the resulting board. So there would be a node in which a pawn was moved to e4, one for the pawn moved to d4, one in which the left knight moves to the left, etc etc.

Now, continue this game tree until we reach a node in which the game has ended – either white or black has given checkmate or the game has ended in a draw. Now, look at all these “ending” nodes where a result has been given. Let’s go one to one move before this.

Since we have evaluated all these nodes, let’s have go to one move before, and say that it is black to play in this situation before this. Black will choose the move that maximizes his payoff, meaning that he will choose to move to a win if possible, a draw if a win is not possible, and a loss node if it can’t be avoided. This gives an evaluation to this ‘one move before’ node, even though it didn’t have an evaluation before. If we do this the whole way up, we eventually evaluate the starting position, and now finding the optimal strategy is a pathfinding problem. This lets us both reason about the game and discover a winning strategy.

Although this means that the computer is the one computing the winning strategy and not us, the cost is enormous. For Tic Tac Toe alone, a naïve approach (one that does not consider symmetry) would start with a root node, that root node would have 9 children, *each of those children would have 8 children, each of those 8 children would have 7 children, *so on so forth. That’s an enormous structure for just Tic Tac Toe! If we tried that kind of approach with chess, we would get a time out. If we tried it with Go, our computer would probably revolt.

Let’s generalize that argument we gave regarding the simplified Zermelo’s theorem. Let’s just assign a value of +1 for the white player winning, 0 for a draw, and -1 for the white player losing. Then, we can do the same type of reasoning as described before but with a convenient notation – at every turn black tries to minimize the score while white tries to maximize it. In fact, this can be generalized to any set of numbers, not just 1,0, and -1 – this is what is called the MiniMax algorithm.

Here is a small game, poorly drawn in Paint with my laptop trackpad. At the end, the bottom nodes, we are given the payoffs 17, -3 ,2 and 4. These represent the payoffs for the first player. Since the second player will be the one who decides between the 17 and the -3 and the 2 and the 4, he will minimize this value and choose the -3 and the 2. Therefore, to the first player these nodes are worth -3 and 2, respectively. Since the first player will want to maximize on his turn, he will choose the final result of 2 between the 2 and the -3. Therefore, at the start position, we say player 1 has a value of 2, and that’s the end value of this game. If we were to repeat this type of example with 1,0 and -1 we could pivot back to our original argument by saying who wins the game given optimal play.

We can improve this algorithm and get to our first usable algorithm – Alpha Beta pruning. Using Alpha Beta pruning, you could write a chess engine that could probably beat you (unless you are rather good at chess), representing the first time in this article we talked about an algorithm capable of surpassing your own abilities as a human.

Alpha refers to the best possible explored value for the maximizer, and Beta refers to the best possible explored value for the minimizer. The improvement comes from the following – nodes will only be explored when they either present the option of beating Alpha for the maximizer or undercutting Beta for the minimizer.

For a small game tree this might not seem too useful but imagine that we had a way of partially evaluating the game. In chess, this would take the form of a function that maps chess positions to numbers, the higher they are, the more advantaged white. Novice chess players are already taught a simple rule of thumb regarding this, where a pawn is worth 1, a knight/bishop is worth 3, a rook is worth 5 and a queen is worth 9. Of course, this completely ignores crucial factors of chess such as position. Armed with a suitable evaluation function and data based on previous games from masters, Deep Blue was able to use MiniMax to beat the then world champion of chess, Gary Kasparov in 1997. It did this by looking 12 moves ahead and applying its evaluation function to the resulting boards. So that’s pretty good!

But not good enough… It turns out that even with our powerful MiniMax logic and optimizations like Alpha Beta to make it easier to compute, games like Go are still way too big to compute effectively using these types of algorithms.

Let’s ask an important question now: what if we stopped playing perfectly, and just aimed for playing as good as possible? All the algorithms described have offered a surefire way to compute the best move, the only error that could possibly be introduced is through an imperfect partial evaluation function. But, other than that, we’ve always been describing a way to pick the best possible move based on the information we have. Let’s spice things up and add some randomness and imperfections to our play.

AlphaGo and AlphaZero both used machine learning techniques. I’ll only provide a brief, high level view on these techniques, because, honestly, why read it from me when you can read it directly from DeepMind itself? They’ve already written accessible blog posts on the topic, which you can find here. https://deepmind.com/research/alphago/

The first cool thing that AlphaGo did was use Monte Carlo Tree search. The basic idea behind MCTS is that we should evaluate nodes using a random process. Given a node representing a move, let’s simulate a bunch of random moves until the game reaches the end, and note whether we won, lost, or drew. Now, let’s do that a lot of times and create a winning “fraction”. This is our node evaluation, and we’ll only expand our search tree when we get favorable results.

In addition to this, AlphaGo used two neural networks – one that predicted the winner of the game and one that chose a good move to play. A neural network is a model designed to mimic the structure of neurons in the brain. It consists of multiple “perceptrons” linked together. A perceptron takes as input a linear combination of some inputs (even the result of other perceptrons) and “fires” an output if they exceed a certain threshold. By layering many of these together, DeepMind was able to train a neural network to evaluate positions based on data from previously played games and one that could guess the next move taken. These two networks worked together to create the monumental result of beating Lee Sedol, a world champion in Go.

AlphaZero did something entirely different. It did not use previous data (hence, the name “Zero”) and only learned from playing itself many times and figuring out what worked and what didn’t for itself. It did this through a technique called reinforcement learning. Basic reinforcement learning is based off of something called a Markov decision process. The very basic idea is this: at each game state, our player has multiple options available to it, and probabilities assigned to each of those actions. It then picks one based on probability and earns a reward. The goal of the player is to maximize its reward. If playing a move leads to something unfavorable, it should adjust the probability of that move so that it is less likely to happen and vice versa. A very basic implementation of this is MENACE, the Machine Educable Noughts and Crosses Engine, which we hope to explore in a later article. You can read about it and explore the basic principles of reinforcement learning here – it’s an article discussing both a physical and Python re-implementation of MENACE. http://chalkdustmagazine.com/features/menace-machine-educable-noughts-crosses-engine/

Our virtual implementation will also be put on this blog sometime in the future, so you can check that out, too. UPDATE: it’s now here! https://bogobogosort.wordpress.com/2018/04/04/m-e-n-a-c-e/

These techniques are also being used in imperfect information games such as poker – note that when we moved to the domain of machine learning, we dropped the perfect information requirement. If you want to read more about that, check out DeepStack.

*Senthil Rajasekaran*