M.E.N.A.C.E.

In this post, Sameer and I will introduce a Python implementation of MENACE, the Machine Educable Noughts and Crosses Engine – Noughts and Crosses is another name for Tic Tac Toe.

Sameer spearheaded the code here, making use of a really great Tic Tac Toe environment he had made previously. Check out the GitHub repo here: https://github.com/eskeype/GamePlayer

We’ll present some code here, but the formatting is less than ideal due to WordPress. The code shown is just meant to give intuition for what parts of the working code look like; it’s not meant to be run.

So, let’s talk about what MENACE is. In 1962, Martin Gardner published an article about teaching a machine made from matchboxes how to play a game called ‘hexapawn’. Hexapawn is an extremely simple game played on a 3×3 chessboard. On the first row, white is given three pawns, and on the last row black is given three pawns. The object of the game is to either advance one of your pawns to the far end, capture all of the enemy’s pawns, or put your opponent in a position where they cannot move. Take a look: http://cs.williams.edu/~freund/cs136-073/GardnerHexapawn.pdf

Martin Gardner proposed that all sufficiently excited readers try this for themselves: get 24 matchboxes representing the 24 states of the game and place beads in them. The machine then plays as follows: in a state S, open the matchbox labeled S and pull a bead from it at random; the bead you pull determines the machine’s move. If the machine ends up losing, take one bead of the move type it made out of each matchbox it used, so it is less likely to make that move in the future. If it won, do the exact opposite and reward it by placing one more bead of that type in each matchbox it used. In the paper linked above, you can see that it took 11 losses before the hexapawn machine was able to play a perfect game.
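
To make the bookkeeping concrete, here is a minimal sketch of that matchbox update rule in Python. This is only an illustration, not code from our repo: the state representation and the legal_moves helper are placeholders, the starting bead counts are simplified to one bead per legal move, and I’ve added a guard so a box never runs completely empty.

import random

boxes = {}  # state -> list of beads; each bead names a legal move in that state

def machine_move(state, legal_moves):
    if state not in boxes:
        boxes[state] = list(legal_moves(state))  # seed the box: one bead per legal move
    return random.choice(boxes[state])           # drawing a bead picks the move

def update(history, outcome):
    # history: list of (state, move) pairs the machine played; outcome: "win", "loss", or "draw"
    for state, move in history:
        if outcome == "win":
            boxes[state].append(move)                      # reward: add a bead of that type
        elif outcome == "loss" and len(boxes[state]) > 1:  # guard so a box never empties
            boxes[state].remove(move)                      # punish: take one bead of that type out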

MENACE is pretty much the exact same idea, but with Tic Tac Toe, not hexapawn. The idea of MENACE was first conceived by Donald Michie in the 1960s. Both represent a rudimentary version of reinforcement learning, the powerful artificial intelligence technique behind the success of DeepMind’s AlphaZero (and a lot of other stuff).

So what does a matchbox look like in Python? In fact, LeetCode problem 381 gives us a good version. Sameer and I both wrote solutions to this problem separately; mine can be found on my GitHub: https://github.com/pijel/assorted-algos/blob/master/RandomizedCollection.py

Here’s what Sameer’s looks like:



import random

class RandomizedCollection:
    """
    RandomizedCollection -> new empty RandomizedCollection
    """

    def __init__(self):
        """
        Initialize your data structure here.
        """
        self._element_list = []         # collection elements
        self._element_to_indicies = {}  # maps each element to the set of indices holding it

    def insert(self, val):
        """
        Inserts a value to the collection. Returns true if the collection did not already contain the specified element.
        :type val: int
        :rtype: bool
        """
        self._element_list.append(val)

        if val not in self._element_to_indicies:
            self._element_to_indicies[val] = set()

        self._element_to_indicies[val].add(len(self._element_list) - 1)

    def remove(self, val):
        """
        Removes a value from the collection. Returns true if the collection contained the specified element.
        :type val: int
        :rtype: bool
        """
        if val not in self._element_to_indicies:
            return False

        # move the last element into the slot we are vacating, then drop the tail
        val_ind = self._element_to_indicies[val].pop()

        self._element_list[val_ind] = self._element_list[-1]
        self._element_to_indicies[self._element_list[-1]].add(val_ind)
        self._element_to_indicies[self._element_list[-1]].remove(len(self._element_list) - 1)

        self._element_list.pop()

        if len(self._element_to_indicies[val]) == 0:
            del self._element_to_indicies[val]

        return True

    def getRandom(self):
        """
        Get a random element from the collection.
        :rtype: int
        """
        return random.choice(self._element_list)

    def empty(self):
        return len(self._element_list) == 0

    def __str__(self):
        return str(self._element_list)

I recommend viewing the code on GitHub, as unfortunately WordPress tends to wrap lines where it shouldn’t, which is absolutely deadly in Python. The specific GitHub link for this code can be found here:
https://github.com/eskeype/GamePlayer/blob/master/src/RandomizedCollection.py

This is our Python ‘matchbox’: a data structure that can insert, delete, and retrieve a random element in O(1) average time. It works by storing both a list of the elements actually in the collection and a dictionary that keeps track of the indices where each element can be found. The idea is that we store the potential positions to play in these randomized collections and use getRandom to represent drawing a bead from that matchbox. insert and remove will both be used when it is time to train the machine, by either punishing the moves it made or rewarding them.
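
As a quick illustration of how we use it (hypothetical board positions, not code from the repo), a fresh matchbox for a board whose empty squares are 2, 5, and 8 would be handled roughly like this:

box = RandomizedCollection()
for square in (2, 5, 8):   # one bead for each legal move of some hypothetical board
    box.insert(square)

move = box.getRandom()     # draw a bead: a legal move, weighted by how many beads it has
# after the game ends: box.insert(move) to reward the move, or box.remove(move) to punish it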

This is basically the key to the implementation. In the GitHub repo you can see files for Match.py, which organizes the match between two players, TicTacToe.py, which defines the rules of Tic Tac Toe, and two other files – PerfectPlayer.py and OtherPlayers.py. OtherPlayers.py provides a human player and a random player, which simply picks uniformly at random among the valid moves. The Perfect Player uses a full game tree and backwards induction to always play an optimal move, but it shows some irregularities – it may, for example, prefer to create a fork instead of finishing a line and winning, since both are winning states to the Perfect Player. The Perfect Player will never lose; the best you can manage against it is a draw.
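
PerfectPlayer.py is the place to look for the real thing, but as a rough sketch of what backwards induction over the game tree looks like, something along these lines computes the value of a position for the player about to move (the winner, valid_moves, and apply_move helpers here are hypothetical stand-ins for what TicTacToe.py provides, and players are numbered 1 and 2):

def game_value(board, player):
    """Value of the position for the player about to move: 1 win, 0 draw, -1 loss."""
    w = winner(board)                   # hypothetical helper: returns 1, 2, or None
    if w is not None:
        return 1 if w == player else -1
    moves = valid_moves(board)
    if not moves:
        return 0                        # board full with no winner: a draw
    # backwards induction: our value is the best we can do against an optimal opponent
    return max(-game_value(apply_move(board, m, player), 3 - player) for m in moves)

def best_move(board, player):
    return max(valid_moves(board),
               key=lambda m: -game_value(apply_move(board, m, player), 3 - player))

Since every winning continuation gets the same value here, a player built this way has no reason to prefer an immediate win over a fork, which is exactly the kind of irregularity mentioned above.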

And finally, we have the Matchbox Player itself. The main component behind the Matchbox Player is a dictionary called move_distribution, maintained in an external file through the use of pickle (though we may change it to cPickle later). move_distribution maps states of the Tic Tac Toe board to randomized collections, and it’s updated in Match.py, so we don’t read and write the file every time we play a game. Intuitively, move_distribution represents the matchboxes and their contents. The Matchbox Player keeps track of its move history as a list of (tuple, integer) pairs, recording each board it saw and the move it played there. This is essential for its training.
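
The real player is in the repo, but in spirit the move selection and bookkeeping look something like this (a sketch with hypothetical helper names; new_matchbox stands for the seeding code shown further below):

import pickle

def choose_move(self, board):
    state = tuple(board)                                      # hashable key for the current board
    if state not in self.move_distribution:
        self.move_distribution[state] = new_matchbox(board)   # never-seen state: fresh matchbox
    move = self.move_distribution[state].getRandom()          # draw a bead from that matchbox
    self.move_history.append((state, move))                   # (board, move) pair, consumed by train()
    return move

def save_distribution(move_distribution, path="move_distribution"):
    # Match.py persists the shared dictionary with pickle once per session,
    # rather than on every single game
    with open(path, "wb") as f:
        pickle.dump(move_distribution, f)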



def train(self, winner):
    if winner == self.player_id:
        # reward
        for move in self.move_history:
            self.move_distribution[move[0]].insert(move[1])

    elif winner == 3 - self.player_id:
        # punish
        for move in self.move_history:
            self.move_distribution[move[0]].remove(move[1])


Every player technically has a train function, but only the Matchbox Player’s does anything other than “pass”. This is the meat of the Matchbox Player. At the end of every game we go through the moves it made and either add a bead for each one (if it won) or remove one (if it lost), so it is more or less likely to play them again. If the Matchbox Player enters a state it’s never seen before, it makes a new randomized collection to play with, as in the snippet below.


				
valid_moves_list = valid_moves(board)

for i in valid_moves_list:
    for j in range(len(valid_moves_list)):
        distribution.insert(i)

For each valid move, it puts in as many beads as there are valid moves, just like the original MENACE. Because late-game boards have fewer valid moves, their boxes start with fewer beads, so endgame moves are punished and rewarded much harder than early-game moves. For example, the opening board has 9 valid moves and so starts with 9 beads per move (81 in total), while a board with only 2 valid moves starts with just 2 beads per move (4 in total), where adding or removing a single bead swings the probabilities far more.

Once again, to understand the code I really recommend viewing it in its entirety on GitHub: https://github.com/eskeype/GamePlayer

Those are the quick details of how the Matchbox Player was designed. Now, let’s talk about how it performs! We mentioned before that if the Matchbox Player sees a state it’s never seen before, it will create a randomized collection with an equal probability of selecting each valid move. That means that with no training, our Matchbox Player is essentially a Random Player.

For ending states of Tic Tac Toe, counting boards that are rotations or reflections of each other only once, there are 91 distinct states in which X wins, 44 distinct states in which O wins, and 3 draws, where X denotes the first player. That means that, between two Random Players, X will have a decisive advantage. Let’s first pit two Random Players against each other and see how that turns out.

In a 100-game bout between Random Players, we get 53 wins for X and 11 draws. This seems to be fairly typical: in repeated attempts we hover around 50–60 wins for X and roughly 10 draws. Over 100,000 matches, we saw on the order of 58K wins and 12K draws.

After training for 10,000 matches, our Matchbox Player, playing as X against the Random Player, was able to win 80 games out of 100 and draw 14 times. That means it still lost 6 times, but it has come a long way from being a random player. Running it repeatedly gave similar results.

Against the Perfect Player, the Matchbox Player that trained for 10K games against a Random Player manages 37 draws, more than twice the 15 or so we see from the Random Player. It’s still nowhere near the league of the Perfect Player.
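
For reference, the tallies above come from loops like the following (a sketch; Match.py does the real bookkeeping, and play_game here is a hypothetical stand-in that returns the winner’s id, or 0 for a draw):

def evaluate(player_x, player_o, games=100):
    results = {"X wins": 0, "O wins": 0, "draws": 0}
    for _ in range(games):
        outcome = play_game(player_x, player_o)   # hypothetical: 1, 2, or 0 for a draw
        if outcome == 1:
            results["X wins"] += 1
        elif outcome == 2:
            results["O wins"] += 1
        else:
            results["draws"] += 1
    return results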

Our training is very slow, and we’re not sure exactly how we should speed it up. One would expect that playing 10K games of Tic Tac Toe would be enough for mastery, but with our current algorithm there’s still room for sizeable improvement. One thing that many Tic Tac Toe programs do that we do not is account for symmetry when looking at boards; this may be a factor behind the slow training. There’s still a lot to consider!
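
If we did want to exploit symmetry, one standard trick is to map every board to a canonical representative of its symmetry class before looking it up, so that all eight rotations and reflections share a single matchbox. A sketch, assuming the board is stored as a tuple of 9 cells in row-major order (not something our code currently does):

# the eight symmetries of the 3x3 board, written as index permutations (row-major order)
SYMMETRIES = [
    (0, 1, 2, 3, 4, 5, 6, 7, 8),   # identity
    (6, 3, 0, 7, 4, 1, 8, 5, 2),   # rotate 90 degrees
    (8, 7, 6, 5, 4, 3, 2, 1, 0),   # rotate 180 degrees
    (2, 5, 8, 1, 4, 7, 0, 3, 6),   # rotate 270 degrees
    (2, 1, 0, 5, 4, 3, 8, 7, 6),   # reflect left-right
    (6, 7, 8, 3, 4, 5, 0, 1, 2),   # reflect top-bottom
    (0, 3, 6, 1, 4, 7, 2, 5, 8),   # reflect about the main diagonal
    (8, 5, 2, 7, 4, 1, 6, 3, 0),   # reflect about the anti-diagonal
]

def canonical(board):
    """Lexicographically smallest of the eight symmetric variants: one key per symmetry class."""
    return min(tuple(board[i] for i in perm) for perm in SYMMETRIES)

The wrinkle is that a move chosen in the canonical frame has to be mapped back through the matching permutation before it is played, so this is not entirely free to bolt on.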

Of course, this algorithm is not the best. If we are literally going to build a dictionary of states, then what stops us from using backwards induction, as mentioned in a previous blog post? With those kinds of resources, we could easily implement something like the Perfect Player, which plays perfectly.
https://bogobogosort.wordpress.com/2018/03/20/how-algorithms-play-games/

But consider what it did: using no prior knowledge whatsoever and an algorithm that barely mentioned tic tac toe itself, the Matchbox Player was able to learn something about the game for itself. It wasn’t able to do it perfectly, but it was able to convincingly beat its previous iteration. I think that’s pretty cool.

Sameer Punjal and Senthil Rajasekaran
