Playing Shinko with reinforcement learning
All the code can be found here: https://github.com/doranbae/shinko
Shinko is a mobile puzzle game where you clear numbered blocks by adding to them until they sum to 5.
Shinko has a few specific features.
At the start of each game, you are given rows of boxes holding random numbers drawn from 1 to 4. I will call this arrangement the matrix.
You are also given another set of boxes (three at a time). I will call them noxes (Number BOXES). You can add these noxes to the outermost layer of the matrix.
For example, suppose a nox is 3. If you add this nox to a 2 in the matrix, then 3 + 2 = 5.
When the addition (nox to matrix cell) makes 5, that cell disappears from the matrix.
Another case is when the result of the addition is bigger than 5. For example, if you add a nox 4 to a 2 in the matrix, the result is 6 (4 + 2 = 6). In this case, the remainder of 1 (whatever exceeds 5) breaks off. The original 2 disappears (because adding 4 to it reached 5 or more), but a 1 remains in its place, so the total count of numbers in the matrix does not change. This acts as a difficulty or penalty in the game.
Your goal is to make every element in the matrix disappear using the smallest number of noxes.
You can see up to 3 upcoming noxes, so you can strategically make your moves anticipating what noxes you will have in the near future.
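To make these rules concrete, here is a minimal sketch of a single move in plain Python. The function name and return convention are mine for illustration, not the game's or the repo's.

def apply_nox(matrix_value, nox):
    # Adding a nox to a cell: the cell clears when the sum reaches 5,
    # and anything above 5 is left behind as a new, smaller number.
    total = matrix_value + nox
    if total < 5:
        return total                                  # cell keeps growing toward 5
    remainder = total - 5                             # e.g. 4 + 2 = 6 leaves a 1 behind
    return remainder if remainder > 0 else None       # None: the cell disappears

For instance, apply_nox(2, 3) returns None (the cell is cleared), while apply_nox(2, 4) returns 1 (the 2 is consumed but a 1 stays in its place).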
Reinforcement learning to play the game
After training the RL model, I compared the result with a vanilla model. The vanilla model is programmed to find the best action based on the current reward only; the RL model is trained to find the best action based on current and future rewards. As a result (hold thy breath!), the vanilla model performs better in terms of the number of games won. Sadly, my RL model is not good enough yet. However, one interesting thing the RL model can do is choose an action in anticipation of a future reward. This will be discussed later (see Vanilla vs. RL: strategic action trait).
Game playing module
This script plays Shinko based on a simple logic of addition. For the sake of simplicity, it modifies the original mobile game by removing the splitting feature. Instead of allowing a move such as 2 + 5 = 7 (where a remainder of 2 would be left behind), any move whose sum would exceed 5 is prohibited. self.valid_actions keeps track of which moves are valid on each turn. Here, an action refers to an index into the matrix: if action = 7, the player chooses to add the nox to flattened matrix index 7.
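As a hedged sketch (with my own illustrative names, and ignoring the outer-layer restriction for brevity), the valid-move bookkeeping for this simplified version could look like this: a flattened index is playable only if the current nox does not push that cell past 5.

def get_valid_actions(flat_matrix, nox):
    # indices of non-empty cells where adding the nox stays at or below 5
    return [i for i, cell in enumerate(flat_matrix) if cell > 0 and cell + nox <= 5]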
The logic of playShinko_vanilla.py is simple. The machine plays the game by:
Looking at the next 1 nox only
Among the valid moves, scoring each candidate with the calculation 5 - matrix - nox
If the result of this calculation is 0, the nox is guaranteed to make that cell sum to exactly 5
Choosing the smallest non-negative result as its best move (sketched in code below)
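This selection rule can be sketched in a few lines of Python; the function and variable names here are illustrative, not copied from playShinko_vanilla.py.

def choose_vanilla_action(flat_matrix, nox, valid_actions):
    # score each valid index by how far the cell would still be from 5
    scores = {a: 5 - flat_matrix[a] - nox for a in valid_actions}
    non_negative = {a: s for a, s in scores.items() if s >= 0}
    if not non_negative:
        return None                                    # no legal move with this nox
    return min(non_negative, key=non_negative.get)     # 0 (an exact 5) wins if available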
To see a sample of the vanilla play, execute the file with Python.
Reinforcement learning module
The reinforcement learning code is divided into two components:
This file follows the same game features as playShinko_vanilla.py, but with a few new or altered configurations to enable reinforcement learning. Most notably, it pre-processes the input data for the neural network built with Keras.
In order to feed the current state and the noxes together to the neural network, I transform the input data into a 3 by n (= matrix width) array, since the Shinko agent must be able to anticipate future values.
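The post does not spell out the exact transform, so the following is a purely illustrative guess at an encoding consistent with the batch_input_shape = (3, flat_matrix_length) used in the Keras model below: one row per visible nox, each row describing how far every cell is from 5 under that nox.

import numpy as np

def encode_state(flat_matrix, noxes):
    # hypothetical encoding: one feature row per visible nox
    flat = np.asarray(flat_matrix, dtype=float)
    rows = [5.0 - flat - nox for nox in noxes]    # distance to 5 under each nox
    return np.stack(rows)                         # shape: (3, flat_matrix_length)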
Using Keras, I built a neural network and trained it to play Shinko.
from keras.models import Sequential
from keras.layers import InputLayer, Dense

def qtable_nn_model():
    # build model
    model = Sequential()
    model.add(InputLayer(batch_input_shape=(3, flat_matrix_length)))
    model.add(Dense(30, activation='sigmoid'))
    model.add(Dense(flat_matrix_length, activation='linear'))
    # compile model
    model.compile(loss='mse', optimizer='adam', metrics=['mae'])
    return model
When training, I experimented with the number of environments to train with. Here, an environment refers to the initial matrix the game starts from. I tried up to 20 different environments, as more than 20 took too long to train on my MacBook Pro. With a more powerful setup in the cloud, I expect I could be more aggressive with the training.
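The repository has the full training loop; the snippet below is only a minimal sketch of the kind of epsilon-greedy Q-learning update commonly used with a Keras Q-network like the one above. The env object, its reset/step/valid_actions interface, and the hyperparameter values are assumptions for illustration, not the repo's exact API.

import numpy as np

gamma, epsilon = 0.95, 0.1    # illustrative discount and exploration settings

def train_one_episode(model, env):
    state = env.reset()                                # assumed shape: (3, flat_matrix_length)
    done = False
    while not done:
        q_values = model.predict(state, verbose=0)
        if np.random.rand() < epsilon:
            action = int(np.random.choice(env.valid_actions))                  # explore
        else:
            action = max(env.valid_actions, key=lambda a: q_values[0][a])      # exploit
        next_state, reward, done = env.step(action)
        target = q_values.copy()
        future = 0.0 if done else float(np.max(model.predict(next_state, verbose=0)))
        target[0][action] = reward + gamma * future    # Bellman backup on the chosen action
        model.fit(state, target, epochs=1, verbose=0)
        state = next_state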
After training is over, you need to save the model. I am using Keras's built-in method, which I learned from here.
# save the model
model_json = trained_model.to_json()
with open('model.json', 'w') as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
trained_model.save_weights("model.h5")
print("Saved model to disk")
It is time to do the testing. Since Shinko has no labeled dataset, I cannot report an accuracy figure. Instead, I evaluated how well the model plays Shinko compared to the vanilla model. The vanilla model is programmed to find the best action based on the current reward; the RL model is trained to find the best action based on current and future rewards.
The first thing to do is to load the model.
from keras.models import model_from_json

# load trained model by loading json and creating model
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model.h5")
print("Loaded model from disk")
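Once the weights are back in memory, picking a move is just an argmax over the predicted Q-values. A small illustrative helper, assuming the state is encoded the same way as during training:

import numpy as np

def choose_action(model, state):
    # state: the (3, flat_matrix_length) array fed to the network during training
    q_values = model.predict(state, verbose=0)
    return int(np.argmax(q_values[0]))    # flattened matrix index to play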
I want to evaluate the model on two aspects: (1) whether it wins more games; and (2) whether it can anticipate future rewards to play strategically in the current state.
I played 300 random games, meaning 300 rounds of Shinko with 300 different environments. The reason I gave the RL model unseen environments is to avoid overfitting; I want it to be able to play new random games as well.
After only the first few thousand training iterations, the RL model was able to strategically choose an action based on future rewards. For example, here is what happened during testing.
I made it play one of the random games, and on its 3rd move it was given noxes of 3, 1, 2 (in that order) with a matrix of [[3,2,2,1,2],[1,3,2,1,5]]. The Shinko agent had to apply the nox 3 to the matrix, and it chose action 8 (the cell holding a 1). From a current-reward perspective alone this does not make much sense at first, because 3 + 1 is not 5 but 4. However, remember what comes after the 3: a 1. That next nox can turn the 4 into a 5, and that is precisely the action the model chose.
If it were the vanilla model, it would probably have chosen index 4 or 7 to earn an immediate reward by making a 5. Sadly, I guess this strategic behavior alone is not enough to beat the game.
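Working through that example in plain Python makes the arithmetic explicit (the matrix and noxes are copied from above):

matrix = [[3, 2, 2, 1, 2],
          [1, 3, 2, 1, 5]]
flat = [v for row in matrix for v in row]    # [3, 2, 2, 1, 2, 1, 3, 2, 1, 5]
noxes = [3, 1, 2]

# Greedy choice: indices 4 and 7 both hold a 2, and 3 + 2 = 5 right away.
print(flat[4] + noxes[0], flat[7] + noxes[0])    # 5 5

# RL choice: index 8 holds a 1, so 3 + 1 = 4, and the next nox (1) finishes the 5.
print(flat[8] + noxes[0])                        # 4
print(flat[8] + noxes[0] + noxes[1])             # 5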
A few ideas on how to improve the model performance
Add more layers to the network
Better reward strategy
All the code can be found here: https://github.com/doranbae/shinko