Saturday, October 1, 2016



AlphaGo is an AI developed by Google DeepMind that recently became the first machine to beat a top-level human Go player.


AlphaToe is an attempt to apply the same techniques used in AlphaGo to Tic-Tac-Toe. Why? I hear you ask; after all, Tic-Tac-Toe is a very simple game and can be solved using basic min-max.

Because it's a good platform for experimenting with some of the AlphaGo techniques, which, it turns out, do work at this scale. Also, the neural networks involved can be trained on my laptop in under an hour, as opposed to the weeks on an array of supercomputers that AlphaGo required.

The project is written in Python using TensorFlow; the GitHub repo is here and contains code for each step that AlphaGo used in its learning. It also contains code for Connect 4 and the ability to build games of Tic-Tac-Toe on larger boards.

Here is a sneak peek at how it did in the 3x3 game. In this graph it is training as first player and gets to an 85% win rate against a random opponent after 300,000 games.

I will do a longer write-up of this at some point, but in the meantime here is a talk I did about AlphaToe at a recent DataScienceFestival event in London, which gives a broad overview of the project:


Thursday, May 26, 2016

Using Net2Net to speed up network training

When training neural networks there are 2 things that combine to make life frustrating:
  1. Neural networks can take an insane amount of time to train.
  2. How well a network is able to learn can be hugely affected by the choice of hyperparameters (here mainly the number of layers and the number of nodes per layer, but also learning rate, activation functions, etc.), and without training a network in full you can only guess at which choices are better.
If a network could be trained quickly, number 2 wouldn't really matter: we could just do a grid search (or even particle swarm optimization, or maybe Bayesian optimization) to run through lots of different possibilities and select the hyperparameters with the best results. But for something like reinforcement learning in computer games the time to train is counted in days, so you'd better hope your first guess was good...

My current research is around ways to get neural networks to adjust their size automatically, so that if there isn't sufficient capacity in a network it will in some way detect this and resize itself. So far my success has been (very) limited, but while working on that I thought I would share this paper: Net2Net: Accelerating Learning via Knowledge Transfer, which has a good, simple approach to resizing networks manually while keeping their activations unchanged.

I have posted a numpy implementation of it here on Github.

Being able to manually resize a trained network can give big savings in training time, because when searching through hyperparameter options you can start off with a small, partially trained network and see how adding extra hidden nodes or layers affects test results.

Net2Net comprises two algorithms: Net2WiderNet, which adds nodes to a layer, and Net2DeeperNet, which adds new layers. The code for Net2WiderNet in numpy looks like this:
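Here is a minimal numpy sketch of the idea (function and parameter names here are illustrative, not the library's exact API; the full implementation is in the GitHub repo linked above):

```python
import numpy as np

def net2wider(weights1, bias1, weights2, noise_std=0.01):
    """Add one node to a hidden layer while keeping the network's output
    (almost) unchanged. weights1/bias1 feed into the layer being widened;
    weights2 leads out of it."""
    # pick a random existing node to clone
    index = np.random.randint(weights1.shape[1])

    # clone its incoming weights and bias, with a little noise so the two
    # copies can drift apart during later training
    new_input_weights = weights1[:, index] + np.random.normal(scale=noise_std, size=weights1.shape[0])
    new_weights1 = np.hstack([weights1, new_input_weights[:, None]])
    new_bias1 = np.append(bias1, bias1[index])

    # halve the outgoing weights of both the original node and its clone,
    # so the input to the next layer is unchanged
    new_weights2 = np.vstack([weights2, weights2[index, :] * 0.5])
    new_weights2[index, :] *= 0.5
    return new_weights1, new_bias1, new_weights2
```

With noise_std set to zero the widened network's output is exactly identical to the original's.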

This creates the weights and biases for a layer one node wider than the existing one. To increase the size by more nodes, simply do this multiple times (note the finished library on GitHub has the parameter new_layer_size to set exactly how big you want it). The new node is a clone of a random node from the same layer. The original node and its copy then have their outputs to the next layer halved, so that the overall output from the network is unchanged.

How Net2WiderNet extends a layer with 2 hidden nodes to have 3

Unfortunately, if 2 nodes in the same layer have exactly the same parameters then their activations will always be identical, which means their back-propagated errors will always be identical, they will update in the same way, and their activations will remain the same; nothing is gained by adding the new node. To stop this happening, a small amount of noise is injected into the new node, giving the two nodes the potential to move further and further apart as they train.

Net2DeeperNet is quite simple: it creates an identity layer, then adds a small amount of noise. The network activation is only unchanged if the new layer is linear, because otherwise the activation function's non-linearity will alter the output. So bear in mind that if you have an activation function on your new layer (and you almost certainly will), the network output will change and performance will be worse until it has gone through some amount of training.
Here is the code:
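A minimal numpy sketch of Net2DeeperNet (names are illustrative; see the GitHub repo for the full version):

```python
import numpy as np

def net2deeper(layer_size, noise_std=0.01):
    """Create weights and biases for a new layer initialized as a (noisy)
    identity matrix, so inserting it after an existing layer leaves a
    linear network's activation almost unchanged."""
    weights = np.eye(layer_size) + np.random.normal(scale=noise_std, size=(layer_size, layer_size))
    bias = np.zeros(layer_size)
    return weights, bias
```

With noise_std of zero and no activation function, passing data through the new layer returns it exactly unchanged.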


Usage in TensorFlow

This technique could be used in any neural network library/framework, but here is how you might use it in TensorFlow.

In this example we first train a minimal network, with 100 hidden nodes in each of the first and second layers, for 75 epochs. Then we do a grid search over different numbers of hidden nodes, training each for 50 epochs, to see which leads to the best test accuracy.

Here are the final results for the different numbers of hidden nodes:

1st layer | 2nd layer | Train accuracy | Test accuracy

Note: don't take this as the best choice for MNIST; results could still be improved by longer training, dropout to stop overfitting, batch normalization, etc.

Sunday, May 15, 2016

PyDataLondon 2016

Last week I gave a talk at PyDataLondon 2016, hosted at the Bloomberg offices in central London. If you don't know anything about PyData, it is a community of Python data science enthusiasts that runs various meetups and conferences across the world. If you're interested in that sort of thing and they are running something near you, I would highly recommend checking it out.

Below is the YouTube video for my talk and this is the associated GitHub, which includes all the example code.

The complete collection of talks from the conference is here. The standard across the board was very high, but if you only have time to watch a few, here are some from those I saw that you might find interesting.

Vincent D Warmerdam - The Duct Tape of Heroes Bayesian statistics

Bayesian statistics is a fascinating subject with many applications. If you're trying to understand deep learning, at a certain point research papers such as Auto-Encoding Variational Bayes and Auxiliary Deep Generative Models will stop making any kind of sense unless you have a good understanding of Bayesian statistics (and even if you do it can still be a struggle). This video works as a good introduction to the subject. His blog is also quite good.

Geoffrey French & Calvin Giles - Deep learning tutorial - advanced techniques

This has a good overview of useful techniques, mostly around computer vision (though they could be applied in other areas), such as computing the saliency of inputs in determining a classification, and getting good classifications when there is only limited labelled data.

Ricardo Pio Monti - Modelling a text corpus using Deep Boltzmann Machines in python

This gives a good explanation of how a Restricted/Deep Boltzmann Machine works and then shows an interesting application where a Deep Boltzmann Machine was used to cluster groups of research papers.

Monday, May 2, 2016

Mini-Pong and Half-Pong

I'm going to be giving a talk/tutorial at PyDataLondon 2016 on Friday the 6th of May. If you're in London that weekend I would recommend going; there are going to be lots of interesting talks, and if you do go please say hi.

My talk is going to be a hands-on session on how to build a Pong-playing AI using Q-learning, step by step. Unfortunately, training the agents even for very simple games still takes ages, and I really wanted to have something training while I do the talk, so I've built two little games that I hope should train a bit faster.


This is a version of Pong with some of the visual noise stripped out: no on-screen score, no lines around the board. Also, when you start it you can pass args for the screen width and height, and the gameplay should scale with these. This means you can run it as an 80x80 (or even 40x40) screen and save having to downsize the image when processing.


This is an even kinder game than Pong. There is only the player's paddle, and you get points just for hitting the other side of the screen. I've found that if you fiddle with the parameters you can start to see reasonable performance within an hour of training (results may vary, massively). That said, even after significant training the kinds of results I see are some way off how well Google DeepMind report doing. Possibly they are using other tricks not reported in the paper, or just lots of hyperparameter tuning, or there are still more bugs in my implementation (entirely possible; if anyone finds any please submit a fix).

I've also checked in some checkpoints of a trained Half-Pong player, if anyone just wants to quickly see it running. Simply run this from the examples directory.

It performs significantly better than random, though still looks pretty bad compared to a human. 

Distance from building our future robot overlords: still significant.

Sunday, March 13, 2016

Deep-Q learning Pong with Tensorflow and PyGame

In a previous post we built a framework for running learning agents against PyGame. Now we'll try to build something in it that can learn to play Pong.

We will be aided in this quest by two trusty friends: TensorFlow, Google's recently released numerical computation library, and this paper on reinforcement learning for Atari games by DeepMind. I'm going to assume some knowledge of TensorFlow here; if you don't know much about it, it's quite similar to Theano and here is a good starting point for learning.


  • You will need Python 2 or 3 installed.
  • You will need to install PyGame which can be obtained here.
  • You will need to install TensorFlow which can be grabbed here.
  • You will need PyGamePlayer which can be cloned from the GitHub here.
If you want to skip to the end, the completed deep Q agent is here in the PyGamePlayer project. The rest of this post will deal with why it works and how to build it.


If you read the DeepMind paper you will find this definition of the Q function:
Q function
Let's try to understand it bit by bit. Imagine an agent trying to find its way out of a maze. At each step he knows his current location, s in the equation above, and can take an action, a, moving one square in any direction, unless blocked by a wall. If he gets to the exit he receives a reward and is moved to a new random square in the maze. The reward is represented by the r in the equation. The task Q-learning aims to solve is learning the most efficient way to navigate the maze to get the greatest reward.

Bunny agent attempts to locate carrot reward

If the agent were to start by moving around the maze randomly, he would eventually hit the reward, which would tell him its location. He could then easily learn the rule that if your state is the reward square then you get a reward. He can also learn that if you are in any square adjacent to the reward square and take the action of moving towards it, you will get the reward. Thus he knows exactly the reward associated with those actions and can prioritize them over other actions.

But if he just chooses the action with the biggest immediate reward, the agent won't get far, because for most squares the reward is zero. This is where the max Q*(s', a') bit of the equation comes in. We judge the reward we get from an action not just by the reward of the state it puts us in, but also by the best (max) reward we could get from the actions available to us in that state. The gamma symbol γ is a constant between 0 and 1 that acts as a discount on rewards in the future, so an action that gets the reward now is judged better than an action that gives the reward 2 turns from now.

The function Q* represents the ideal Q function; in most complex cases it is impossible to calculate exactly, so we use a function approximator, Q(s, a; θ). When a machine learning paper references a function approximator it is (almost always) talking about a neural net. These nets in Q-learning are often referred to as Q-nets. The θ symbol in the Q function represents the parameters (weights and biases) of our net. In order to train our net we will need a loss function, which is defined as:
Loss function

y here is the expected reward of the state, using the parameters of our Q-function from iteration i-1. Here is an example of running a Q-function in TensorFlow. In this example we use the simplest state space possible: just an array of states with a reward for each, where the agent's actions are moving to adjacent states:

Python Q-learning with TensorFlow
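Here is a self-contained sketch of that setup (the post's actual example uses TensorFlow; this version uses a plain numpy table, and the names and constants are illustrative):

```python
import numpy as np

np.random.seed(42)  # for reproducibility

# the simplest state space possible: a line of states with a reward for
# being in each one; the agent's actions move it to adjacent states
states = [0.0, 0.0, 0.0, 0.0, 1.0]
num_states = len(states)
num_actions = 2               # 0 = move left, 1 = move right
FUTURE_REWARD_DISCOUNT = 0.9  # the gamma from the equation above

q_table = np.zeros((num_states, num_actions))

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(num_states - 1, state + 1)
    return next_state, states[next_state]

for _ in range(5000):
    # visit random state/action pairs; the environment is deterministic,
    # so we can simply overwrite the table entry with the Bellman target
    state = np.random.randint(num_states)
    action = np.random.randint(num_actions)
    next_state, reward = step(state, action)
    q_table[state, action] = reward + FUTURE_REWARD_DISCOUNT * np.max(q_table[next_state])

# the learned policy is to always move right, towards the reward
```

After training, moving right has a higher Q-value than moving left in every state, because the discount γ makes reward now worth more than reward later.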

Setting up the agent in PyGamePlayer

Create a new file in your current workspace, which should have the PyGamePlayer project in it (or simply create a new file in the examples directory of PyGamePlayer). Then create a new class that inherits from the PongPlayer class. This will handle getting the environment feedback for us: it gives a reward whenever the player's score increases and punishes whenever the opponent's score increases. We will also add a main here to run it.


If you run this you will see the player moving to the bottom of the screen as the Pong AI mercilessly destroys him. More intelligence is needed, so we will override the get_keys_pressed method to actually do some real work. Also, as a first step, because the Pong screen is quite big and I'm guessing none of us has a supercomputer, let's compress the screen image so it's not quite so tough on our GPU.
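As a sketch of the kind of compression meant here (the exact approach in the project may differ; this simply block-averages the pixels and binarizes with numpy):

```python
import numpy as np

def compress_screen(screen_array, size=80):
    """Downsample an RGB screen buffer to a size x size binary image by
    block-averaging. A rough sketch; real code might use an image
    library's resize instead."""
    grey = screen_array.mean(axis=2)               # collapse the colour channels
    h, w = grey.shape
    grey = grey[:h - h % size, :w - w % size]      # trim so blocks divide evenly
    bh, bw = grey.shape[0] // size, grey.shape[1] // size
    blocks = grey.reshape(size, bh, size, bw).mean(axis=(1, 3))
    return (blocks > blocks.mean()).astype(np.float32)  # keep only black/white
```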


How do we apply Q-Learning to Pong?

Q-learning makes plenty of sense in a maze scenario, but how do we apply it to Pong? The Q-function actions are simply the key-press options: up, down, or no key pressed. The state could be the screen, but the problem is that even after compression our state is still huge; also, Pong is a dynamic game, and you can't just look at a static frame and know what's going on, most importantly which direction the ball is moving.

We will want our input to be not just the current frame but the last few frames, say 4. 80 times 80 pixels is 6400; times 4 frames that's 25,600 data points, and each can be in 2 states (black or white), meaning there are 2 to the power of 25,600 possible screen states. Slightly too many for any computer to reasonably deal with.

This is where the deep bit of deep Q comes in. We will use deep convolutional nets (for a good write-up of these try here) to compress that huge screen space into a smaller space of just 512 floats, and then learn our Q-function from that output.

So first let's create our convolutional network with TensorFlow:

Create network

Now we will use the exact same technique we used for the simple Q-Learning example above, but this time the state will be a collection of the last 4 frames of the game and there will be 3 possible actions.

This is how you train the network:
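The core of the training step, computing the y values from the loss above for a mini-batch, can be sketched like this (the constant and names here are illustrative, not the project's exact ones):

```python
import numpy as np

FUTURE_REWARD_DISCOUNT = 0.99  # the gamma from the Q equation (illustrative value)

def q_learning_targets(rewards, terminals, next_q_values):
    """Compute the y values for a mini-batch: the reward received now, plus
    the discounted best predicted reward from the next state (unless the
    game ended there). next_q_values is the net's output for each next state."""
    targets = np.array(rewards, dtype=np.float64)
    for i, terminal in enumerate(terminals):
        if not terminal:
            targets[i] += FUTURE_REWARD_DISCOUNT * np.max(next_q_values[i])
    return targets
```

These targets are then regressed against the network's Q-value predictions for the actions actually taken.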

And getting the chosen action looks like this:

So get_keys_pressed needs to be changed to store these observations:
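A sketch of the kind of change meant: store each transition in a bounded replay memory and sample random mini-batches from it for training (constants and names here are illustrative):

```python
import random
from collections import deque

MEMORY_SIZE = 500000     # cap on stored transitions (illustrative)
MINI_BATCH_SIZE = 100    # transitions sampled per training step (illustrative)

observations = deque(maxlen=MEMORY_SIZE)

def store_transition(last_state, action, reward, current_state, terminal):
    # each observation is one (s, a, r, s', terminal) tuple
    observations.append((last_state, action, reward, current_state, terminal))

def sample_mini_batch():
    # sample without replacement; fewer if memory is not yet full
    return random.sample(list(observations), min(len(observations), MINI_BATCH_SIZE))
```

Training on random samples from this memory, rather than on consecutive frames, breaks up the strong correlation between successive observations.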

The normal training time for something like this, even with a good GPU, is in the order of days. But even if you were to train the current agent for days it would still perform pretty poorly. The reason is that if we start using the Q-function to determine our actions, it will initially be exploring the space with very poor weights. It is very likely to find some simple action that leads to a small improvement and get stuck in a local minimum doing that.

What we want is to delay using our weights until the agent has a good understanding of the space in which it exists. A good way to initially explore the space is to move randomly, then over time slowly mix in more and more moves chosen by the agent until eventually the agent is in full control.

Add this to the get_keys_pressed method.
And then make the choose_next_action method this:
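A sketch of how that annealed exploration might look (the constants here are illustrative, not the project's exact values):

```python
import numpy as np

INITIAL_RANDOM_ACTION_PROB = 1.0   # start fully random
FINAL_RANDOM_ACTION_PROB = 0.05    # still explore a little once trained
EXPLORE_STEPS = 500000             # steps over which to anneal

def choose_next_action(q_values, steps_so_far):
    """Epsilon-greedy action choice with annealing: begin with purely
    random moves, then slowly hand control over to the network's argmax."""
    random_prob = max(
        FINAL_RANDOM_ACTION_PROB,
        INITIAL_RANDOM_ACTION_PROB
        - steps_so_far * (INITIAL_RANDOM_ACTION_PROB - FINAL_RANDOM_ACTION_PROB) / EXPLORE_STEPS)
    if np.random.random() < random_prob:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))
```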
And so now, huzzah, we have a Pong AI!

The PyGamePlayer project:
The complete code for this example is here
Also, I've now added the games mini-pong and half-pong, which should be quicker to train against if you want to try out modifications.
Further, here is a video of a talk I've done on this subject.

Sunday, February 28, 2016

Book review: The Phoenix Project

This book is about IT project management, but unlike most management books, instead of being a series of case studies it is written like a novel. The main character, Bill, is suddenly promoted from head of middleware technology to CTO at a time when the company, Parts Unlimited, is falling apart at the seams.

What I found most surprising was that the book actually worked as a novel. I found myself genuinely caring about Bill and the people he worked with, and getting sucked into the drama of delivering the project.

Also, for a book about IT project management it's amazing what a page-turner it is, especially in the early sections where everything is going wrong. I found myself wondering how I would deal with his different problems, which made me really wish there were a Deathtrap Dungeon-style "choose your own adventure" book based around IT project management:

  • Your CEO is demanding you release the project at the pre-agreed date though you have major worries that it will not scale in production, do you:
    • Accept his demands and release the change: turn to page 72
    • Refuse to do the release: turn to page 143
Maybe one day...

As a developer I also found it interesting to learn how other aspects of the IT organization work. There is quite a lot of detail on what a CIO, CEO and head of sales actually do.
Also, though I'm sure the transformations they make in the book are possible, and many companies have successfully made them, I found the timescale for these changes a little improbable, and it seemed too easy to get buy-in from people who were initially so sceptical.

Things to take away from the book
  • In order to manage anything you need to know what work is actually happening.
  • If a system has a constraint, improving anything but that constraint will not actually make things better. In the book a single operations guy is the constraint for half the IT organization.
  • IT should be integrated into all aspects of an organization not seen as an external department in itself.
  • Operations and development need to be integrated.
  • Limit work in progress.
  • Understand The Three Ways:
    • Systems thinking: The flow of work from business requirements through development and ops to the customer.
    • Amplify feedback loops: Make sure issues found in any part of the process are fed back to where they can be fixed. If ops had a problem it may have started with dev. If dev had a problem maybe it needs to go back to the requirements.
    • Culture of continual experimentation and learning: Take the time to improve things, it will be worth it.
  • For IT there are 4 types of work:
    • Business projects
    • Internal projects
    • Operational changes
    • Unplanned work

Tuesday, December 29, 2015

How to run learning agents against PyGame

PyGame is the best-supported library for programming games in Python. There are thousands of open-source games that have been built with it.
I wanted to do some reinforcement learning neural networks in games and PyGame seemed the best choice.
But I was a bit disappointed that most examples involved hacking the original game files.
I wanted to see if I could write a general framework for running learning agents in PyGame that would require zero touching of the games files.

If you are interested in writing your own implementation of this then read on, but if you just want to dive straight in and start playing with your own learning agents I've uploaded a project here which contains examples of setting up Pong and Tetris.

Getting started

  • You will need Python 2 or 3 installed.
  • You will need to install PyGame, which can be obtained here. You can also install it via pip, but this can be tricky on Windows.
  • You will need numpy, downloadable from here, or again it can be installed via pip.
  • You will also need a game of some sort to try and set up. Here is a good simple pong game.

The plan

There are 3 things a learning agent needs to be able to play a game:
  1. An array of all the pixels on the screen whenever it is updated.
  2. The ability to output a set of key presses back into the game.
  3. A feedback value to tell it when it is doing well/badly.
We will tackle them in order.

Grabbing the screen buffer

In PyGame there is a handy method that will give us a 3-dimensional numpy array containing the colour of every pixel on screen:
 screen = pygame.surfarray.array3d(pygame.display.get_surface())  

We could write our learning agent by hacking the game file and inserting this line into the main loop after the screen has been updated. But a better way to do it (and a way that allows us to have zero touches of the game file) is to intercept any calls to the pygame.display.update method and grab the buffer then, like so:
 import pygame  
 import pygame.surfarray  

 # function that we can give two functions to and will return us a new function that calls both
 def function_combine(screen_update_func, our_intercepting_func):  
   def wrap(*args, **kwargs):  
     screen_update_func(*args, **kwargs) # call the screen update func we intercepted so the screen buffer is updated  
     our_intercepting_func() # call our own function to get the screen buffer  
   return wrap  

 def on_screen_update():  
   surface_array = pygame.surfarray.array3d(pygame.display.get_surface())  
   print("We got the screen array")  

 # set our on_screen_update function to always get called whenever the screen updated  
 pygame.display.update = function_combine(pygame.display.update, on_screen_update)  
 # FYI the screen can also be modified via flip, so this might be needed for some games  
 pygame.display.flip = function_combine(pygame.display.flip, on_screen_update)  

You can try this out by inserting this code before you start your game and it will print out the screen buffer as it comes in.

Intercepting key presses

The normal method for detecting key presses in PyGame is this:
 events = pygame.event.get()  

So we can intercept it and have it return our learning agents key presses:
 import pygame  
 from pygame.constants import K_DOWN, KEYDOWN  

 def function_intercept(intercepted_func, intercepting_func):  
   def wrap(*args, **kwargs):  
     # call the function we are intercepting and get its result  
     real_results = intercepted_func(*args, **kwargs)
     # call our own function and return our new results  
     new_results = intercepting_func(real_results, *args, **kwargs)  
     return new_results  
   return wrap  

 def just_press_down_key(actual_events, *args, **kwargs):  
   return [pygame.event.Event(KEYDOWN, {"key": K_DOWN})]  

 pygame.event.get = function_intercept(pygame.event.get, just_press_down_key)  

I should also warn you that the pygame.event.get method can be called with args to filter out which events are needed. If you're running a game that uses these, you will either need to handle them or just use my complete implementation here.

Getting feedback to the player

The final piece of the puzzle is handling the feedback/reward from the game. Unfortunately there is no standard way of doing scoring in PyGame, so this will always require some amount of going through the game code, but it can still be done with zero touches.

For the Pong game the scores are stored in two global variables, bar1_score and bar2_score, which can be imported. Our reward is when the score changes in our favour.

 last_bar1_score = last_bar2_score = 0  

 def get_feedback():  
   global last_bar1_score, last_bar2_score  

   # import must be done inside the method because otherwise importing would cause the game to start playing  
   from games.pong import bar1_score, bar2_score  
   # get the difference in score between this and the last run  
   score_change = (bar1_score - last_bar1_score) - (bar2_score - last_bar2_score)  
   last_bar1_score = bar1_score  
   last_bar2_score = bar2_score  
   return score_change  

But for other games, such as Tetris, there may not be a globally scoped score variable we can grab. There may instead be a method or set of methods that we know are good/bad, such as player_takes_damage, level_up or kill_enemy. We can use our function_intercept code from before to grab these. Here is an example in Tetris, using the result of removeCompleteLines to reward our agent:
 import tetris  

 new_reward = 0.0
 def add_removed_lines_to_reward(lines_removed, *args, **kwargs):  
   global new_reward  
   new_reward += lines_removed  
   return lines_removed  

 tetris.removeCompleteLines = function_intercept(tetris.removeCompleteLines, add_removed_lines_to_reward)  

 def get_reward():  
   global new_reward  
   temp = new_reward  
   new_reward = 0.0  
   return temp  

Dealing with frame rates

One final issue you may need to consider is that the learning agent will significantly impact the execution speed of the game. In a lot of games the physics is scaled by the elapsed time since the last update; if your agent takes 1 second to process a single frame in Pong, then by the next update loop the ball will have already passed off the screen. The agent may also struggle to learn if there is significant variance in the movement between frames.
This can be handled by intercepting the pygame.time.get_ticks method and pygame.time.Clock in the same way as we have the other functions. See this file for details.
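The idea behind that interception can be sketched like this (a simplified stand-in; PyGamePlayer's real version also wraps pygame.time.Clock, and the constant here is illustrative):

```python
FIXED_FPS = 10  # pretend the game always runs at exactly 10 frames per second
_frame_count = [0]

def fixed_get_ticks():
    """Stand-in for pygame.time.get_ticks: advances a constant amount per
    call, so the game's physics sees a steady frame rate no matter how long
    the agent spends thinking."""
    _frame_count[0] += 1
    return _frame_count[0] * (1000 // FIXED_FPS)

# usage: pygame.time.get_ticks = fixed_get_ticks
```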

Pong in PyGamePlayer

Now all that remains is to stitch all those parts together and plug in the learning agent. In my project I've chosen to do this in a class, but it would be fine as a script.
Below is an example of the full thing, set up to learn against Pong using the PyGamePlayer module. The PongPlayer simply needs to inherit from the PyGamePlayer class and implement the get_keys_pressed and get_feedback methods; the framework handles everything else.
 from pygame.constants import K_DOWN  
 from pygame_player import PyGamePlayer  

 class PongPlayer(PyGamePlayer):  
   def __init__(self):  
     """Example class for playing Pong"""  
     super(PongPlayer, self).__init__(force_game_fps=10) # we want to run at 10 frames per second
     self.last_bar1_score = 0.0  
     self.last_bar2_score = 0.0  

   def get_keys_pressed(self, screen_array, feedback):  
     # The code for running the actual learning agent would go here, with the screen_array and feedback as the inputs
     # and an output for each key in the game, activating them if they are over some threshold.
     return [K_DOWN]  

   def get_feedback(self):  
     # import must be done here because otherwise importing would cause the game to start playing  
     from games.pong import bar1_score, bar2_score  

     # get the difference in score between this and the last run  
     score_change = (bar1_score - self.last_bar1_score) - (bar2_score - self.last_bar2_score)  
     self.last_bar1_score = bar1_score  
     self.last_bar2_score = bar2_score  
     return score_change

 if __name__ == '__main__':  
   player = PongPlayer()  
   # importing pong will start the game playing  
   import games.pong  

So huzzah! We now have the world's worst Pong AI.

In my next post I'll go through writing a good reinforcement learning agent for this.
If you have any questions/corrections please don't hesitate to contact me.

Full source code here.