AIs Battle Royale: The ultimate snake

In a previous post, we developed two AIs to play the game of Snake: the first implements a heuristic algorithm to select its moves, while the second consists of a deep neural network trained with a reinforcement learning algorithm (Deep Q-Learning). Afterward, in a second post, we used a genetic algorithm to create successive generations of snakes, where the best individuals of each generation were chosen to reproduce and generate the next offspring. In this new post, we are going to develop a new kind of AI based on a deep search algorithm, with the goal of pushing beyond the limits of the previous ones, and finally let the four AIs compete in the arena to find out which one will be the winner of the Snake AIs Battle Royale tournament.

Deep Search Algorithm

As introduced earlier, our new AI uses a deep search algorithm that exhaustively searches for the best path the snake can follow to reach its next food safe and sound. It is a recursive algorithm based on the idea of backtracking: it explores all possible paths of a tree and returns the one leading to the highest reward. The search visits the nodes of the tree by moving forward whenever possible, and by backtracking otherwise. In our case, a node corresponds to a state of the environment, while the depth corresponds to the maximum length of the path explored so far. Here, backtracking means that when we are moving forward and there are no more nodes along the current path, we move backwards along the same path to find nodes still to traverse. All the nodes on the current branch are visited until every one of them has been traversed, after which the next path is selected. Each node has its own value, which can be positive or negative, and at the end of the search the path with the maximum value is chosen, where the value of a path is simply the sum of the values of the nodes along it. Such algorithms are useful for solving optimization problems: these problems can have many possible solutions, each with a value, and we wish to find a solution with the optimal (minimum or maximum) value. Such a solution is just one optimal solution to the problem, since several solutions may achieve the optimal value.
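To make the idea concrete, here is a minimal, self-contained sketch of backtracking over a small tree of valued nodes. It is only a toy illustration of the principle, not the Snake code, which follows later:

    # Toy example of backtracking: each node has a value and a list of children;
    # the function returns the best total value reachable from that node.
    def best_path_value(node_value, children):
        if not children:                # leaf reached: nothing left to explore
            return node_value
        best = float('-inf')
        for child_value, grandchildren in children:
            # step forward into the child, then implicitly backtrack on return
            best = max(best, best_path_value(child_value, grandchildren))
        return node_value + best

    # root with two branches: (5 -> -2 or 7) and (1 -> 10); best path is 0+5+7
    tree = [(5, [(-2, []), (7, [])]), (1, [(10, [])])]
    print(best_path_value(0, tree))     # prints 12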

Algorithm implementation

Given its brute-force nature, the algorithm could theoretically simulate all possible future moves and always find the best one. Nevertheless, existing computers have limited computation speed and, unless we are willing to wait until the end of the Universe, we would never find it. Thus, our algorithm is not going to explore all possible paths, but will stop once it reaches a fixed level of recursion depth. We will simply call this parameter depth. Considering a snake moving in a 20×20 arena with four possible directions at each step, the algorithm would need to explore 4^400 paths before finding the best one. However, the real number of paths to explore is significantly lower if we consider that once the snake dies in a sub-path, that sub-path generates no further sub-paths. In particular, a snake longer than one unit moving in the direction opposite to its current direction would automatically die by eating itself. This consideration alone reduces the number of paths to explore to 3^400. Therefore, reasonable depth values range from 5 to 20, provided our hardware is fast enough to support them, with the latter exploring all paths of length 20 (3^20 paths) before choosing a move. Another important parameter is the state of the environment, that is, a matrix representation of the game where each of the 22×22 cells can assume three different values:

  1. Empty (value 0);
  2. Wall or snake's part (value 1);
  3. Food (value 2).

The size is 22 rather than 20 because the matrix also includes the walls surrounding the arena. player_x and player_y are the coordinates of the head of the snake in the environment matrix, and initial_depth is a constant remembering the original value of depth. A quick sketch of such a matrix follows below, after which we dive into the algorithm implementation by analyzing its pseudo-code step by step.
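Here is a minimal sketch of how such a state matrix could be built. The function name and layout are our own assumptions for illustration; the real representation lives in the repository code.

    # Sketch (assumed names, not the original code) of a 22x22 state matrix
    # where 0 = empty, 1 = wall or snake's part, 2 = food.
    import numpy as np

    def build_state(snake_cells, food_x, food_y, size=20):
        state = np.zeros((size + 2, size + 2), dtype=int)
        state[0, :] = state[-1, :] = 1      # top and bottom walls
        state[:, 0] = state[:, -1] = 1      # left and right walls
        for x, y in snake_cells:            # cells occupied by the snake
            state[y, x] = 1
        state[food_y, food_x] = 2           # the food to reach
        return state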

def deep_search_move(state, player_x, player_y, depth=10, initial_depth=10):

    # end of search: the maximum recursion depth has been reached
    if depth == 0:
        return 40 - distance_closest_food(player_x, player_y)

    # end of search: a collision with a wall or with the snake's body occurred
    if state[player_y, player_x] == 1:
        return -1000 + (initial_depth - depth)

As we can see from the code above, if the recursion terminates because the maximum depth has been reached, the function returns a value proportional to the proximity to the food. In this way, paths terminating closer to the food get a better score and are more likely to be chosen. If instead the search ends because a collision occurred, it returns a negative value indicating that the explored path turned out to be a bad one. The tweak of adding initial_depth - depth to the base value of -1000 slightly rewards those paths that take more steps before causing the snake's death.
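The helper distance_closest_food is left abstract in the pseudo-code. A plausible sketch, assuming a Manhattan distance and a globally accessible state matrix (both assumptions on our part, with 40 being roughly the largest distance reachable inside the arena), could look like this:

    # Hypothetical helper: Manhattan distance from the snake's head to the
    # nearest food cell; 'state' is assumed to be accessible (e.g. module-level).
    def distance_closest_food(player_x, player_y):
        food_cells = [(x, y)
                      for y in range(state.shape[0])
                      for x in range(state.shape[1])
                      if state[y, x] == 2]
        return min(abs(player_x - fx) + abs(player_y - fy)
                   for fx, fy in food_cells)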

Moving on to the next part of the algorithm, we consider the consequences of being in a particular state. For this purpose, a bonus or malus is added to the current state's value when specific conditions are met. For example, if a food has been eaten, a bonus of 100 is added to mark the explored path as a good one. This time we subtract the value initial_depth - depth so as to prefer those paths that reach the food in the fewest steps. Conversely, a malus of -80 points is applied when a dangerous situation is encountered, making such paths less likely to be chosen. An example of a dangerous situation is shown below, together with the code for this section.

The snake has encountered a dangerous situation: choosing to turn right would unavoidably cause a game over.

    # bonus or malus reward awarded when some conditions are met
    add_reward = 0

    # during the exploration a food has been eaten (BONUS)
    if state[player_y, player_x] == 2:
        add_reward = 100 - (initial_depth - depth)

    # during the exploration a dangerous situation has been encountered (MALUS)
    if danger(player_x, player_y):
        add_reward -= 80
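Like distance_closest_food, the danger check is not spelled out in the pseudo-code. One simple proxy (our own assumption, not the repository's definition) is to flag positions from which most neighbouring cells are already blocked:

    # Hypothetical danger check: a position is considered risky when at least
    # three of its four neighbours are occupied by a wall or by the snake.
    def danger(player_x, player_y):
        neighbours = [(player_x + 1, player_y), (player_x - 1, player_y),
                      (player_x, player_y - 1), (player_x, player_y + 1)]
        blocked = sum(1 for x, y in neighbours if state[y, x] == 1)
        return blocked >= 3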

The last part of the algorithm has a typical recursive search structure. It keeps interacting with the environment and explores further sub-paths by simulating a move toward each of the four possible directions. Finally, the original value of the current position is restored and the function returns the value corresponding to the direction belonging to the best path explored so far, possibly modified by the bonus or malus earned in the current state.

    # array storing the value for each direction
    reward = [-1000, -1000, -1000, -1000]  # right, left, up, down

    # remember the original value of the position the snake has just moved onto
    old_state_value = state[player_y, player_x]

    # mark the position as not empty, since it is now occupied by the snake's head
    state[player_y, player_x] = 1

    # explore the subproblems and obtain the value for each direction
    reward[right] = deep_search_move(state, player_x + 1, player_y, depth - 1, initial_depth)
    reward[left] = deep_search_move(state, player_x - 1, player_y, depth - 1, initial_depth)
    reward[up] = deep_search_move(state, player_x, player_y - 1, depth - 1, initial_depth)
    reward[down] = deep_search_move(state, player_x, player_y + 1, depth - 1, initial_depth)

    # search finished: restore the original value of the current position
    state[player_y, player_x] = old_state_value

    # return the value of the best direction plus the bonus or malus modifier
    return reward[argmax(reward)] + add_reward

The real code derived from this pseudo-code can be found in the GitHub repository linked at the end of the post.
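At the top level, these returned values still need to be turned into an actual move. A possible wrapper, hypothetical and not taken from the repository, would evaluate the four neighbouring cells of the head and move toward the most promising one:

    # Hypothetical top-level wrapper: score each direction and pick the best one.
    def choose_direction(state, head_x, head_y, depth=10):
        candidates = {
            'right': (head_x + 1, head_y),
            'left':  (head_x - 1, head_y),
            'up':    (head_x, head_y - 1),
            'down':  (head_x, head_y + 1),
        }
        scores = {direction: deep_search_move(state, x, y, depth, depth)
                  for direction, (x, y) in candidates.items()}
        return max(scores, key=scores.get)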

Algorithm evaluation

As we did for the other AIs, we are going to test a single snake controlled by our new algorithm in the arena in order to evaluate its performance. As expected, this new AI completely smashed the records set by the previous AIs, reaching an outstanding high score of 87 points and an average of 58.5 with a depth parameter of just 8. The video below shows the amazing run where the snake controlled by the deep search algorithm (purple) set the new length record. Can a human-controlled snake do better?

Snake AIs Battle Royale

The Battle Royale tournament is ready to begin, but first, let's compare the max scores and average scores of each AI in a table to sum up their performances.

Metric           Green Snake (Hard-coded)   Yellow Snake (RL)   Blue Snake (GA)   Purple Snake (DS)
Average score    35                         27.5                23.25             58.5
Max score        62                         52                  55                87

Average score and highest score achieved by the four different AIs over 100 games

And now let's make them compete in the arena! As a reminder, the snake controlled by the hard-coded policy is the green one, the deep reinforcement learning (RL) snake is the yellow one, the genetic algorithm (GA) snake is the blue one and, finally, the purple one is the snake controlled by the deep search (DS) algorithm.

“Even when they band together, the weak are nothing before true power.”

Purple Snake

Even when competing in the arena, the purple snake steals the scene, overwhelming its rivals. However, its longer computation time makes the game drastically slower, so this kind of algorithm is not well suited to real-time play. One way to create a stronger AI would be to train the reinforcement learning snake on all the pixels composing the game screen instead of a short binary array, replacing the dense network with a convolutional one. Training this kind of network would require days on GPUs to reach competitive levels, but it seems a promising path if we want an AI that is both fast and able to achieve super-human performance at the same time.
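As a rough idea of what such a network could look like, here is a sketch of one possible convolutional Q-network taking the raw game screen as input. It is only an architectural sketch under our own assumptions (3-channel frames in, 4 actions out), not a tested model:

    # Sketch of a possible convolutional Q-network over raw pixels (assumed layout).
    import torch.nn as nn

    class ConvQNet(nn.Module):
        def __init__(self, n_actions=4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.LazyLinear(256), nn.ReLU(),  # infers the flattened feature size
                nn.Linear(256, n_actions),      # one Q-value per direction
            )

        def forward(self, x):                   # x: batch of game-screen frames
            return self.head(self.features(x))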
