Saturday, October 1, 2016



AlphaGo is an AI developed by Google DeepMind that recently became the first machine to beat a top-level human Go player.


AlphaToe is an attempt to apply the same techniques used in AlphaGo to Tic-Tac-Toe. Why, I hear you ask? Tic-tac-toe is a very simple game that can be solved using basic min-max.

Because it's a good platform to experiment with some of the AlphaGo techniques, which it turns out work at this scale. Also, the neural networks involved can be trained on my laptop in under an hour, as opposed to the weeks on an array of supercomputers that AlphaGo required.
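For reference, the basic min-max mentioned above really is only a few lines on a 3x3 board. This is a generic sketch (the board encoding of 1/-1/0 per cell is my assumption, not the AlphaToe representation):

```python
# Minimal min-max for 3x3 tic-tac-toe. Board is a tuple of 9 cells:
# 1 = first player, -1 = second player, 0 = empty.

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def minimax(board, player):
    """Return the game value (1, 0 or -1) with `player` to move."""
    w = winner(board)
    if w != 0:
        return w
    moves = [i for i in range(9) if board[i] == 0]
    if not moves:
        return 0  # board full: draw
    values = []
    for m in moves:
        child = list(board)
        child[m] = player
        values.append(minimax(tuple(child), -player))
    return max(values) if player == 1 else min(values)

# Tic-tac-toe is a draw under perfect play:
print(minimax((0,) * 9, 1))  # → 0
```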

The project is written in Python using TensorFlow; the GitHub repo is here and contains code for each step that AlphaGo used in its learning. It also contains code for Connect 4 and the ability to build games of Tic-Tac-Toe on larger boards.

Here is a sneak peek at how it did in the 3x3 game. In this graph it is training as first player and gets to an 85% win rate against a random opponent after 300,000 games.

I will do a longer write-up of this at some point, but in the meantime here is a talk I gave about AlphaToe at a recent Data Science Festival event in London, which gives a broad overview of the project:



  1. Interesting project! It would be nice if you could add a very basic explanation on how to run it.
    I tried to copy the ""-file to the root folder, but when I ran it, it gave me this:
    File "/home/hb9/projects/python/AlphaToe/", line 73, in save_network
    pickle.dump(variable_values, f)
    TypeError: write() argument must be str, not bytes

    1. Hi hb9, glad you're interested :). My current best explanation is given in the video of the talk, but I've now added more docs to each of the files, which should help a bit more. I'm very surprised by your error; it runs fine for me. Can you send me the complete stack trace, as that error appears to be deeper in the pickle code? What version of Python are you running? I've tested on Python 2.7.

    2. Hey, Python 3.5 still throws that error. I just tried it using 2.7 and now it's finally learning. Still, it would be good to mention that one has to copy the script file to the root folder and run it there (necessary for me at least). I would write some how-to-run steps in the readme file.

      You asked for the stack trace, here it is:
      hb9@rocket:~/projects/python/AlphaToe$ python3
      loading pre-existing network
      Traceback (most recent call last):
      File "", line 52, in
      load_network(session, variables, NETWORK_FILE_PATH)
      File "/home/hb9/projects/python/AlphaToe/", line 86, in load_network
      variable_values = pickle.load(f)
      TypeError: a bytes-like object is required, not 'str'

    3. OK, I've worked out what was wrong and fixed it. In Python 3+ pickle expects the file to be opened as a binary stream. The new check-in has this fixed, and I've added a bit more explanation (though it could still do with more). Hope this helps.
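      The fix described boils down to opening the pickle file in binary mode. A minimal sketch, assuming simplified save_network/load_network helpers (the real AlphaToe versions also handle TensorFlow sessions and variables):

```python
import pickle

def save_network(variable_values, file_path):
    # 'wb': pickle writes bytes, so Python 3 needs a binary stream;
    # text mode ('w') raises "write() argument must be str, not bytes"
    with open(file_path, 'wb') as f:
        pickle.dump(variable_values, f)

def load_network(file_path):
    # 'rb' for the same reason when reading back
    with open(file_path, 'rb') as f:
        return pickle.load(f)

save_network({'layer1': [0.1, 0.2]}, 'current_network.p')
print(load_network('current_network.p'))  # → {'layer1': [0.1, 0.2]}
```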

  2. Hi, Daniel,

    Thanks for sharing your interesting project. I first ran the script to generate the policy network, specifying NETWORK_FILE_PATH = 'current_network.p'. But I found the win_rate value in the output is quite low. Is it because some of the games end in a tie?

    episode: 997000 win_rate: 0.106
    episode: 998000 win_rate: 0.109
    episode: 999000 win_rate: 0.113

    When I continued to execute the 2nd step (python, I got the errors listed below. Do you know what causes them? Thanks!

    Instructions for updating:
    Use `tf.global_variables_initializer` instead.
    Traceback (most recent call last):
    File "", line 66, in
    load_network(session, reinforcement_variables, REINFORCEMENT_NETWORK_PATH)
    File "/home/hd_songm/AlphaToe/common/", line 103, in load_network
    Either delete the network file to train a new network from scratch or change the in memory network to match that dimensions of the one in the file""" % (file_path, ex))
    ValueError: Tried to load network file current_network.p with different architecture from the in memory network.
    Error was Dimension 0 in both shapes must be equal, but are 9 and 16 for 'Assign' (op: 'Assign') with input shapes: [9,100], [16,300].
    Either delete the network file to train a new network from scratch or change the in memory network to match that dimensions of the one in the file

    1. Hi Minghu, sorry for the slow response; I've been quite busy.

      I've just had a chance to re-test this. The big problem was that I had been trying to get the network working on a 4x4 board, and to get it working there I had set the learning rate to 1e-6, which it seems is far too low to learn anything. Putting it back up to 1e-4 makes everything work smoothly again. I've done this in my last commit.

      Also, the other error you got is because the hidden nodes I had set up differed between the value network and the policy network. I've now fixed this so they both run with 3 hidden layers of 100 nodes each, and it seems to be working now.
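      The shape mismatch in the error above ([9,100] vs [16,300]) is just the first weight matrix of two differently-sized networks. Here is a sketch of the idea behind the fix, with assumed names (the real code builds TensorFlow graphs from a layout like this):

```python
# Share one hidden-layer layout between both networks so a saved file
# always matches the in-memory architecture. HIDDEN_LAYERS and
# layer_shapes are illustrative names, not the AlphaToe identifiers.
HIDDEN_LAYERS = (100, 100, 100)  # 3 hidden layers of 100 nodes each

def layer_shapes(input_size, hidden_layers, output_size):
    sizes = (input_size,) + tuple(hidden_layers) + (output_size,)
    return [(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

# 3x3 board: 9 inputs, 9 move outputs for the policy network
policy = layer_shapes(9, HIDDEN_LAYERS, 9)
# value network: same board input, single scalar output
value = layer_shapes(9, HIDDEN_LAYERS, 1)
print(policy)  # → [(9, 100), (100, 100), (100, 100), (100, 9)]
# A 4x4 board with 300-node layers would give a first matrix of
# (16, 300), which is exactly the [9,100] vs [16,300] clash above.
```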

      My apologies for leaving it in a bad state. Hope it works for you now :)

  3. Hi Daniel, I'm very impressed by the progress of AlphaGo, and I decided to understand more about how it works, starting from AlphaToe.

    I built a human_player to test the goodness of the learning, and a sort of smart_player using some of the perfect-player rules, just to see whether playing against a smarter player leads to faster learning.

    I noticed some things.

    valid_only=False in stochastic moves is too disadvantageous against a human or smart player, so I decided to try valid_only=True.
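    For anyone wondering what valid_only=True amounts to, here is a stand-in sketch of the assumed semantics (masking the network's move probabilities to empty squares and renormalising; not the AlphaToe code itself):

```python
import numpy as np

def stochastic_valid_move(probs, board, rng=np.random.default_rng(0)):
    # Zero out the probability of every occupied square, then
    # renormalise so we only ever sample a legal move.
    mask = np.array([1.0 if cell == 0 else 0.0 for cell in board])
    masked = probs * mask
    masked /= masked.sum()
    return rng.choice(len(board), p=masked)

board = [1, -1, 0, 0, 0, 0, 0, 0, 0]  # squares 0 and 1 are taken
probs = np.full(9, 1.0 / 9)           # uniform network output
move = stochastic_valid_move(probs, board)
print(board[move] == 0)  # → True (never an occupied square)
```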

    After a few tens of thousands of iterations the log argument in the loss function goes to 0 and makes the weights NaN. I solved it by clipping to a small value (1e-10) and it works (I don't know exactly what impact this has on the policy gradient and the NN).
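    The clipping fix, as a NumPy stand-in for the TensorFlow loss (1e-10 is the clip value mentioned above; the function shape is my assumption, not the project's code):

```python
import numpy as np

def policy_gradient_loss(probs, actions, rewards):
    # Without clipping, log(0) = -inf and the gradients become NaN.
    clipped = np.clip(probs, 1e-10, 1.0)
    log_probs = np.log(clipped[np.arange(len(actions)), actions])
    return -np.mean(log_probs * rewards)

probs = np.array([[0.0, 1.0], [0.5, 0.5]])  # first row has a zero prob
loss = policy_gradient_loss(probs, np.array([0, 1]), np.array([1.0, 1.0]))
print(np.isfinite(loss))  # → True
```

    In TensorFlow the same idea is a clip on the tensor fed to the log (e.g. with tf.clip_by_value) before the loss is computed.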

    Learning against the (not so) smart player I implemented leads to almost a 25% win rate, mainly due to the ability to reach a draw (obviously), even after a million episodes. But against a human the network is still too weak. I didn't play with the hyperparameters or with the NN layers.

    Do you have any suggestions on how to make the NN smarter against a human player? Working on the hyperparameters? Playing against a perfect player? What else?

    Thanks in advance and thanks for the great job you're doing!
    Matteo Gelosa (matteo AT

    1. Hi Matteo

      Thanks for your interest; could you possibly submit a PR to the project with your fix?
      As for making the machine better, part of the problem here is that AlphaGo used this deep network method in combination with Monte Carlo Tree Search, but for a game as simple as tic-tac-toe, that would mean just using MCTS.

    You could try building a more complicated game, say 15x15 tic-tac-toe with 5 in a row, and training to play that with MCTS?
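    To give a flavour of the Monte Carlo direction without the full UCT machinery, here is a flat Monte Carlo move chooser (random playouts per candidate move) on the 3x3 board; a toy sketch, not the suggested 15x15 setup and not the AlphaGo algorithm:

```python
import random

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def rollout(board, player):
    # Play random moves to the end; return 1, -1 or 0 (draw).
    board = list(board)
    while True:
        w = winner(board)
        if w != 0:
            return w
        empties = [i for i in range(9) if board[i] == 0]
        if not empties:
            return 0
        board[random.choice(empties)] = player
        player = -player

def monte_carlo_move(board, player, playouts=200):
    # Score each legal move by the total rollout result from the
    # mover's point of view, and pick the best.
    best_move, best_value = None, None
    for m in (i for i in range(9) if board[i] == 0):
        child = list(board)
        child[m] = player
        value = sum(player * rollout(child, -player) for _ in range(playouts))
        if best_value is None or value > best_value:
            best_move, best_value = m, value
    return best_move

# With two in a row, the rollouts pick the immediate winning square:
print(monte_carlo_move([1, 1, 0, -1, -1, 0, 0, 0, 0], 1))  # → 2
```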

  4. Hi Daniel

    First of all, great project! I am fascinated with AlphaGo and the doors it opens. Your project is a great way of learning how it all works.

    Using the (without any core modifications), I printed some additional statistics after 1 million episodes. The last 10k episodes (from 990,000 to 1,000,000) yielded the following values:
    Wins: 8102
    Losses (legal): 554
    Losses (illegal move): 1344
    Draws: 0

    From what I understood from your video, the win ratio could never be 1 because in a normal Tic-Tac-Toe game, even with a perfect algorithm, there is a possibility of ending in a tie (which makes sense). However, I was surprised to see that after a million episodes there were absolutely no draws, and there were still lost games against a random opponent.

    If it is not because of the game's nature, why does it stop learning at a win ratio of 80%?

    I tried playing against the network after 1M episodes and it is still quite easy! I can win without much effort. Why is that?

    After that, I tried with but the results were not better, since it was still easy for me to win. Then I tried the but using the MinMax algorithm as the opponent player (first with a maximum depth of 3 and then 6). Both approaches were no better than the previous attempts.

    Even though it could not always win, I expected the network to at least become unbeatable. I don't understand this :(