
PGSS Team Project Research Paper

Blokus Duo: Using Deep Search Algorithms to Explore Turn-Based Game Strategy

Jake Frischmann, Etash Jhanji, Micheal Huang, Sebastian Liu, Alden Bressler, Delia Brown

Pennsylvania Governor's School for the Sciences (PGSS) · 2024

This paper evaluates several algorithmic approaches to playing Blokus Duo, a two-player variant of the strategy game Blokus. We implemented and compared four methods: minimax, Monte Carlo Tree Search (MCTS), MCTS enhanced with a convolutional neural network (CNN), and Proximal Policy Optimization (PPO). Each algorithm was trained and tested through large-scale self-play simulations and evaluated against both random agents and human players. Our experiments show that minimax occasionally performed well against humans but was computationally impractical due to the game's large branching factor. PPO achieved moderate success against random play but failed to develop meaningful long-term strategies. Standard MCTS proved stronger but was consistently outperformed by MCTS with CNN, which achieved win rates of 66% as player 1 and 48% as player 2 against the standard MCTS agent. Importantly, the CNN-enhanced MCTS was the only model able to consistently defeat experienced human players. All code was written in Python using Jupyter notebooks, with training and evaluation results stored and shared via GitHub.

A. The Game of Blokus

Blokus is a turn-based strategy board game in which players attempt to place their inventory of 21 unique polyominoes onto a limited grid. Each set consists of pieces ranging from one to five squares in various configurations (Figure 1).

| No. | Piece name | Squares |
|-----|------------|---------|
| 1   | i1         | 1       |
| 2   | i2         | 2       |
| 3   | i3         | 3       |
| 4   | 4-line     | 4       |
| 5   | 5-line     | 5       |
| 6   | z4         | 4       |
| 7   | t4         | 4       |
| 8   | l4         | 4       |
| 9   | square     | 4       |
| 10  | w          | 5       |
| 11  | p          | 5       |
| 12  | f          | 5       |
| 13  | t5         | 5       |
| 14  | X          | 5       |
| 15  | z5         | 5       |
| 16  | v5         | 5       |
| 17  | u          | 5       |
| 18  | v3         | 3       |
| 19  | n          | 5       |
| 20  | y          | 5       |
| 21  | l5         | 5       |
Figure 1: Labeled Inventory of 21 Polyomino Pieces

Players take turns placing one piece onto the board. Aside from their starting move, a player can only place pieces by connecting a corner of the piece they are playing to a corner of one of their previously placed pieces. Additionally, the edges of the placed piece may not touch the edges of any placed piece of the same color. Each piece may also be flipped or rotated in 90-degree increments before placement. The game ends when both players are unable to play more pieces. The player with the greatest total squares placed on the board wins.

Blokus Duo is the two-player version used for this project. It features a 14×14 board (vs. the traditional 20×20), with starting positions at (5, 10) and (10, 5).
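The placement rules above reduce to a simple legality check. The following sketch is illustrative only (the function name, the 2D integer-grid board encoding, and the argument conventions are assumptions, not the project's actual code):

```python
def is_legal_placement(board, cells, player, first_move_corner=None):
    """Check the Blokus corner/edge rules for placing a piece occupying
    `cells` (a list of (x, y) squares) of `player`'s color on `board`
    (a 14x14 grid of ints indexed board[y][x]; 0 = empty)."""
    corner_touch = False
    for x, y in cells:
        # Every square must be on the board and unoccupied.
        if not (0 <= x < 14 and 0 <= y < 14) or board[y][x] != 0:
            return False
        # Edges may never touch a piece of the same color.
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < 14 and 0 <= ny < 14 and board[ny][nx] == player:
                return False
        # At least one corner must touch a piece of the same color.
        for dx, dy in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < 14 and 0 <= ny < 14 and board[ny][nx] == player:
                corner_touch = True
    if first_move_corner is not None:
        # The opening move must instead cover the player's starting square.
        return first_move_corner in cells
    return corner_touch
```

Enumerating all legal moves then amounts to running this check over every piece, orientation, and board offset.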

B. Coding Environments & Language

The project was implemented in Python, using packages such as pandas, numpy, TensorFlow/Keras, and PyTorch/Stable-Baselines3. Development occurred on Replit (collaborative) and Visual Studio Code (data collection). Source control was managed with Git; code is available on GitHub.

C. App Intro, User Interface & Playing the Game

A PyGame application was built to demonstrate results. A human plays on a 14×14 grid against a selected algorithm. The human's pieces appear in orange; the computer's in purple. The player types a piece number, drags it to the desired location, and rotates it with the R key before committing the placement. Corner dots of the appropriate color indicate available future moves for each player. Scores (total squares placed) are shown at the bottom-left.

D. Optimizing Play: Project Aims

1. Optimization Introduction

The challenge was to see whether an algorithmic approach could create a theoretically optimal Blokus player. Historical precedents include IBM's Deep Blue (chess, 1997) and DeepMind's AlphaGo and AlphaZero, all of which combined tree search (minimax or MCTS) with a learned or hand-tuned evaluation. Blokus has not been solved, and its branching factor is extreme: roughly 508 legal moves at move 1, ~2,500 at move 2, and ~3,800 at move 3. For comparison, chess offers roughly 20–60 moves per turn and Go about 250.

2. General Strategies in Human Play

After extensive human playtesting, the team identified three primary strategic goals:

  1. Create more opportunities: increase available corner count.
  2. Limit opponent moves: block the opponent's corner access and enclose territory.
  3. Play large pieces early: 5-square pieces are hard to fit but worth most points; play them before the board fills up.

Advanced community-developed strategies (e.g., the Barasona Opening in 4-player Blokus) informed the team's understanding but were not explicitly encoded due to complexity.

3. Translation into Computer Algorithms

The human strategies were quantified into weighted scoring metrics: squares placed, available corners, corner direction count, opponent corner blocking, distance from opponent's starting position, and piece inventory difficulty. Nine of twelve candidate weights proved effective; three were removed as counterproductive.
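In code, such an evaluation reduces to a weighted linear combination of board features. A minimal sketch follows; the feature names and weight values here are illustrative placeholders, since the paper does not publish its nine tuned weights:

```python
# Illustrative weights only -- the project's tuned values are not published.
WEIGHTS = {
    "squares_placed":    1.0,   # raw score already on the board
    "available_corners": 0.5,   # future placement opportunities
    "opponent_corners": -0.5,   # deny the opponent opportunities
    "distance_from_opp": 0.1,   # push toward the opponent's territory
}

def board_score(features, weights=WEIGHTS):
    """Linear board evaluation: a weighted sum of hand-crafted features."""
    return sum(weights[name] * value for name, value in features.items())
```

Tuning then consists of adjusting the weight vector and dropping features that prove counterproductive, as the team did with three of its twelve candidates.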

A. Minimax

1. Overview

Minimax is a search algorithm that evaluates all possible moves (up to a fixed depth), selecting the move that maximizes the current player's score while minimizing the opponent's. The name reflects these dual objectives.

To illustrate the principle, the paper uses the "Game of Eight," where players alternately add 1, 2, or 3 to a running total and the player reaching exactly 8 wins. A full decision tree can be drawn and optimal play determined by propagating terminal values back up the tree. Applied to Blokus, the same principle holds but the tree cannot be fully expanded. Instead, a depth limit of 3 and a scoring heuristic substitute for terminal values.
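The Game of Eight is small enough to search exhaustively, which makes it a convenient illustration of the minimax principle. A minimal sketch (the function name and the +1/−1 return convention are our own):

```python
def best_move(total, target=8):
    """Exact minimax for the Game of Eight: players alternately add 1, 2,
    or 3 to `total`, and whoever reaches `target` exactly wins.
    Returns (value, move) for the player to act: value is +1 if that
    player can force a win, -1 otherwise."""
    for move in (1, 2, 3):
        if total + move == target:
            return 1, move                      # immediate winning move
    best = (-1, 1)                              # assume a loss until refuted
    for move in (1, 2, 3):
        if total + move < target:
            opp_value, _ = best_move(total + move)
            if -opp_value > best[0]:            # opponent's gain is our loss
                best = (-opp_value, move)
    return best
```

Running it shows that totals divisible by 4 are losing for the player to move, so the first player loses the Game of Eight under optimal play. In Blokus, the same propagation applies, but the tree is truncated at depth 3 and leaves are scored heuristically.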

2. Implementation

The algorithm searched 3 moves deep and evaluated board states using 9 weighted features. A recursive lookahead method generated child states, computed their scores, and propagated the best value upward. Legal moves were shuffled before evaluation to break symmetry.

Appendix A1: Minimax (smartTurn): board.py

# Algorithmic method to determine the best possible move for a given board
# state. This iteration uses minimax.
def smartTurn(self, level):
    move_list = self.calculateLegalMoves(only_fives_rounds=4)
    random.shuffle(move_list)
    best_val = -10001
    best_move = []
    for my_move in move_list:
        tempBoard = copy.deepcopy(self)
        tempBoard.place_piece(my_move)
        if tempBoard.checkWin(tempBoard):
            return (my_move)
        val = self.lookahead(tempBoard, level)
        if val > best_val:
            best_val = val
            best_move = copy.copy(my_move)
    return best_move

Appendix A2: Lookahead: board.py

# Method to search through possible future moves. Recursively searches
# until its depth limit is reached, where it returns the value of the board state.
def lookahead(self, board, depth):
    if depth == 0:
        return board.calculateBoardScore_dots(board)
    else:
        board.switchPlayer()
        move_list = board.calculateLegalMoves(only_fives_rounds=10)
        best_val = -9999
        for my_move in move_list:
            tempBoard = copy.deepcopy(board)
            tempBoard.place_piece(my_move)
            if board.checkWin(tempBoard):
                return (-10000)
            val = tempBoard.lookahead(tempBoard, depth - 1)
            if val > best_val:
                best_val = val
        return (-1 * best_val)

3. Strengths & Limitations

Minimax is straightforward to implement, and the scoring weights are directly interpretable. However, it has three major flaws: the scoring function is human-designed (and therefore imperfect), it evaluates every move including clearly weak ones (wasted compute), and even a 3-move lookahead is too slow in Python given the game's branching factor.

4. Alpha-Beta Pruning

Alpha-beta pruning skips branches that cannot influence the final decision. By maintaining α (max lower bound) and β (min upper bound) and cutting off nodes outside this window, the algorithm achieved a ~4× speedup while producing identical move quality. Despite this improvement, minimax remained too slow and too weak against experienced humans.
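Alpha-beta pruning can be illustrated generically, independent of Blokus. In this minimal sketch, `children` and `evaluate` are caller-supplied stand-ins for the project's move generator and board scorer:

```python
def alphabeta(state, depth, alpha, beta, maximizing, children, evaluate):
    """Minimax with alpha-beta pruning. `children(state)` returns successor
    states; `evaluate(state)` scores leaves from the maximizer's view."""
    succ = children(state)
    if depth == 0 or not succ:
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for child in succ:
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, children, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:
                break            # beta cutoff: opponent will avoid this branch
        return value
    else:
        value = float("inf")
        for child in succ:
            value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                         True, children, evaluate))
            beta = min(beta, value)
            if beta <= alpha:
                break            # alpha cutoff: maximizer will avoid this branch
        return value
```

The cutoffs never change the chosen move, only the amount of work done, which is consistent with the identical move quality the team observed.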


B. Monte Carlo Tree Search (MCTS)

1. Structure

MCTS consists of four steps repeated per simulation:

  1. Selection. Traverse the tree from the root, choosing child nodes via the UCB (Upper Confidence Bound) score until an unexplored node is reached. UCB balances exploration (visiting uncertain nodes) and exploitation (revisiting promising nodes) via a tunable constant C.
    UCB(s) = Q(s)/N(s) + C · √(ln(N(parent)) / N(s))
  2. Expansion. Add the selected node's legal moves as children. Each child stores the move tuple [x, y, piece, orientation] and the resulting board state.
  3. Simulation (Evaluation). Estimate the win probability at the expanded node using a board scoring function (first ~4 moves) or win/loss terminal values (deeper play). Reaching terminal states was a key optimization in MCTS v3.
  4. Backpropagation. Update Q and N values for all nodes on the selection path. The sign alternates because moves along the path alternate between players.

Typical simulation counts ranged from 500 to 4,000 per turn.
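The UCB rule from the selection step translates directly to code. In this sketch, C = 1.4 is a common default rather than the paper's tuned constant, and unvisited children receive an infinite score so that each is tried at least once:

```python
import math

def ucb(q, n, parent_n, c=1.4):
    """Upper Confidence Bound for a child node with total value q,
    visit count n, and a parent visited parent_n times.
    First term: exploitation (average value); second: exploration bonus."""
    if n == 0:
        return math.inf          # force at least one visit per child
    return q / n + c * math.sqrt(math.log(parent_n) / n)
```

Selection then simply descends the tree, at each level picking the child with the highest UCB score.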

2. Implementation Iterations

| Version | Key Change | Performance |
|---------|------------|-------------|
| v1 | Initial implementation; flawed UCB scoring; over-explored promising nodes | Worse than expected; not tested on humans |
| v2 | Fixed UCB; improved data structures; ~1,000 sims/turn | Won against beginners; 94% vs. random; 79% vs. minimax |
| v3 | Near-terminal evaluation; ~100× faster; 30–35 tree layers deep | 60–70% vs. expert humans at 700 sims; wins even at 50 sims |

3. Strengths & Limitations

MCTS v3 beat the best human players tested and handled the large action space better than minimax. Limitations include: large memory requirements at high simulation counts, incomplete tree coverage, and evaluation still partially dependent on a human-crafted heuristic.


C. MCTS with Convolutional Neural Networks (CNNs)

Two CNNs were added to address MCTS's remaining weaknesses:

1. Value Network

Estimates a win-likelihood value in [−1, 1] from a board state. Built with Keras Sequential layers. The network was intentionally undertrained to roughly 28% accuracy (via high dropout): enough signal to guide the search without disrupting the exploration-exploitation balance. It partially replaced the hand-crafted board scoring function; after integration, the final evaluation averaged the value network's output with the heuristic score.

2. Policy Network

Outputs a probability distribution over all 32,928 action indices. Trained via supervised learning on data from hundreds of thousands of simulated games. At expansion time, children receive these probabilities instead of equal weights. After action masking (illegal moves → probability 0), legal probabilities are re-normalized additively (equal increases rather than proportional scaling) so no legal move has zero probability.
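The described masking-and-additive-renormalization step can be sketched as follows (the function name is ours, and the sketch assumes the network's output sums to 1 over all 32,928 actions):

```python
import numpy as np

def mask_and_renormalize(probs, legal_mask):
    """Zero out illegal actions, then redistribute the removed probability
    mass additively: every legal action gets an equal share of it, rather
    than being scaled proportionally."""
    masked = np.where(legal_mask, probs, 0.0)   # illegal moves -> 0
    missing = 1.0 - masked.sum()                # mass lost to illegal moves
    n_legal = legal_mask.sum()
    return masked + legal_mask * (missing / n_legal)
```

The additive share means a legal move the network assigned near-zero probability still receives a nonzero prior, keeping it reachable during tree expansion.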

3. Network Architecture (Policy Network V1)

| Layer | Type | Output Shape | Parameters |
|-------|------|--------------|------------|
| conv2d_3 | Conv2D | (None, 14, 14, 32) | 1,184 |
| batch_normalization_5 | BatchNorm | (None, 14, 14, 32) | 128 |
| conv2d_4 | Conv2D | (None, 9, 9, 128) | 147,584 |
| batch_normalization_6 | BatchNorm | (None, 9, 9, 128) | 512 |
| conv2d_5 | Conv2D | (None, 7, 7, 64) | 73,792 |
| flatten_1 | Flatten | (None, 3,136) | 0 |
| dense_3 | Dense | (None, 4,096) | 12,849,152 |
| activation_5 + bn + dropout | | (None, 4,096) | 16,384 |
| dense_4 | Dense | (None, 4,096) | 16,781,312 |
| activation_6 + dropout + bn | | (None, 4,096) | 16,384 |
| dense_5 | Dense | (None, 8,182) | 33,521,654 |
| activation_7 + dropout + bn | | (None, 8,182) | 32,728 |
| dense_6 | Dense (output) | (None, 32,928) | 269,449,824 |
Appendix A3: Policy Network V1 Layer Composition (Keras)

4. Data Gathering

Hundreds of thousands of simulations were run by distributing across an entire computer lab (46 machines). Collected per-turn data included board state, legal moves, move probabilities, reward, weights, simulation count, model type, and remaining inventory. Board augmentation (flipping for the opposing player's perspective) effectively doubled the dataset.

Table 1: Data Collected per Turn

| Field | Example / Description |
|-------|-----------------------|
| Board | [[0,0,0,...], ...]: 14×14 grid encoded as integers |
| Moves | [[3,4,14,5], ...]: list of legal (x, y, piece, orientation) tuples |
| Move Probabilities | 32,928-element softmax output from the policy network |
| Reward | 1 (win), 0 (draw), −1 (loss) |
| Weights | Board scoring function weight vector |
| No. of Simulations | e.g. 1,345 |
| Model Type | monteCarlo, randomTurn, playSmart_v2, etc. |
| Inventory | Remaining piece IDs for the current player, e.g. [3,4,7,13,19] |

5. Comparison with AlphaGo

AlphaGo uses MCTS augmented with policy and value networks but trains them entirely from win/loss terminal feedback, with no hand-crafted heuristic. The team's approach is similar in structure but uses a human-designed board scorer alongside the CNN. A fully AlphaGo-style approach was infeasible due to limited compute and the absence of a sufficiently large expert-game corpus.


D. Proximal Policy Optimization (PPO)

1. Overview

PPO (Schulman et al., 2017) is a policy-gradient RL algorithm that uses clipped surrogate objectives and multiple epochs of minibatch SGD to improve stability and sample efficiency. It is well-suited for large, continuous action spaces.

2. Implementation

The environment uses PettingZoo's AECEnv for sequential two-agent interaction and Stable-Baselines3's MaskablePPO for training. The observation space is a 238-element integer vector (196 board cells + 42 inventory bits). The action space is Discrete(32,928). An action mask binary vector (32,928 bits) is passed alongside each observation to zero-out illegal actions at the policy level.

Table 2: Observation Space of Blokus Environment

| Feature | Domain | Length | Notes |
|---------|--------|--------|-------|
| Board | 0: empty, 1: agent block, 2: opponent block | 196 | 14×14 grid |
| Pieces (agent) | 0: used, 1: available | 21 | Current agent's inventory |
| Pieces (opponent) | 0: used, 1: available | 21 | Opponent's inventory |
| Action Mask | 0: illegal, 1: legal | 32,928 | Passed separately |

Table 3: Action Space of Blokus Environment

| Feature | Domain | Length | Notes |
|---------|--------|--------|-------|
| X-coordinate | 0–13 | 14 | |
| Y-coordinate | 0–13 | 14 | |
| Piece | 0–20 | 21 | Masked as inventory shrinks |
| Orientation | 0–7 | 8 | Largely masked (duplicates) |
| Total | | 32,928 | 14 × 14 × 21 × 8 |
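Assuming the factor order listed in Table 3 (x, then y, then piece, then orientation), a flat index into Discrete(32,928) can be encoded and decoded as follows. This is a sketch; the project's actual index layout may differ:

```python
def encode_action(x, y, piece, orient):
    """Flatten (x, y, piece, orientation) into one index in the
    Discrete(32,928) action space, using the 14 x 14 x 21 x 8 factoring."""
    return ((x * 14 + y) * 21 + piece) * 8 + orient

def decode_action(index):
    """Invert encode_action: recover (x, y, piece, orientation)."""
    index, orient = divmod(index, 8)
    index, piece = divmod(index, 21)
    x, y = divmod(index, 14)
    return x, y, piece, orient
```

The action mask is then just a 32,928-bit vector indexed by the same encoding.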

Reward design: +1 on win, −1 on loss. Per-piece-size rewards were tried but encouraged greedy large-piece play rather than strategic positioning. A −50 illegal-move penalty was found counterproductive once action masking was in place.

Hyperparameters: lr = 0.001, ent_coef = 0.005. Training ran for ~100,000 timesteps (~4,000 games), reaching 98.38% win rate vs. random as Player 1.

3. Strengths & Limitations

PPO is fast at inference and versatile. However, its need for many more games (likely hundreds of thousands) to develop strategic depth meant it underperformed relative to MCTS-based methods given the available training budget. In theory, additional training time could allow PPO to surpass MCTS.

First-Player Advantage

In 5,000 random-agent simulations, Player 1 won significantly more often than Player 2:

Table 4: Random Simulation Win Results (n = 5,000)
| | P1 Wins | P2 Wins | Draws | Total |
|---|---------|---------|-------|-------|
| Count | 2,635 | 2,024 | 341 | 5,000 |
| Rate | 52.70% | 40.48% | 6.82% | 100% |
| Avg. score | 51.87 | 50.29 | | |
| Avg. win margin | +4.56 | +2.98 | | |

A χ² goodness-of-fit test confirms the distribution differs significantly from 50/50 (p < 0.05), and independent z-tests confirm each player's win rate differs significantly from 50%.
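The χ² statistic can be reproduced from the table's counts. A sketch, assuming draws are excluded and the null hypothesis is a 50/50 split of decisive games (the paper does not spell out its exact test setup):

```python
def chi_square_gof(observed, expected):
    """Pearson chi-square goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

wins = [2635, 2024]                    # P1 and P2 win counts from Table 4
expected = [sum(wins) / 2] * 2         # 50/50 null over decisive games
stat = chi_square_gof(wins, expected)  # ~80, far above the df = 1,
                                       # p = 0.05 critical value of 3.84
```

A statistic this far past the critical value is consistent with the paper's p < 0.05 conclusion of a real first-player advantage.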

PPO Learning Curve

Win rate of the PPO model (as Player 1) vs. a random agent, recorded every ~10,000 timesteps (Figure 11). The curve is generally positive with an outlier dip at ~40,000 steps attributable to an exploration detour. Best checkpoint: 98.38% at ~100,000 timesteps.

Figure 11: Win rate of the PPO model (Player 1) vs. a random agent over training timesteps, 0 to ~100,000 (approximate reconstruction of the paper's Figure 11).

Head-to-Head Matchup Table

~2,500,000 simulations across all non-PPO matchups. PPO matchups limited to ~50 games.

| P1 \ P2 | Random | Minimax | PPO | MCTS | MCTS w/ CNN |
|---------|--------|---------|-----|------|-------------|
| Random | 52.70 | | | | |
| Minimax | 90.63 | 98.44 | | | |
| PPO | 98.38 | † | 61.29 | | |
| MCTS | 100.00 | 98.82 | 100.00 | 0.00 | 8.65 |
| MCTS w/ CNN | 100.00 | 96.35 | 51.55 | 66.09 | |

Table 5: Player 1 win rate (%) against Player 2. † PPO vs. minimax omitted: both are deterministic, so the matchup always produces the same game. MCTS w/ CNN vs. itself was not measured; empty cells were not reported.

Minimax, despite its simplicity, performed well in random-agent benchmarks and against beginner-level humans. Even with minimal training, MCTS with CNN produced the highest win rates of all methods. MCTS without CNN had competitive win rates and both MCTS variants were able to defeat experienced humans. PPO performed poorly relative to search-based methods, likely due to insufficient training data and hyperparameter optimization. Despite strong performance against random opponents, human and cross-agent play exposed PPO's strategic limitations.

Future directions include:

  • Integrate PPO as the second-phase (self-play reinforcement) initializer for the MCTS + CNN policy network, completing the AlphaGo-style training loop.
  • Code optimizations for faster simulation, improving user experience and enabling more training.
  • Explore more complex architectures (e.g., AlexNet) trained on larger datasets.
  • Evaluate advanced agents against a wider pool of human players across skill levels.

All project code is available at github.com/JJFrisch/BlokusDuo and its various branches.

The team thanks project advisor Dr. Neil Simonetti and teaching assistant Tripp Hoover. Human testers Ogechi Anyamele and Ashley Driesbach provided invaluable game evaluation. Mr. Andrew McGuier provided additional compute (lab access). Heartfelt gratitude to the PGSS Alumni Association and all PGSS faculty and staff.

  1. "Blokus Discoveries." gottfriedville.net. link
  2. "Deep Blue." ibm.com. link
  3. "Kasparov vs. Deep Blue | The Match That Changed History." chess.com, Oct. 2018.
  4. "AlphaGo." Google DeepMind. link
  5. Silver, David, et al. "AlphaZero: Shedding new light on chess, shogi, and Go." Google DeepMind, Dec. 2018.
  6. "The Barasona Opening." BoardGameGeek.
  7. "Common Beginner Mistakes." Pentolla.com. link
  8. "Algorithms - Minimax." cs.stanford.edu. link
  9. "MCTS/UCT in solving real-life problems." ResearchGate, Figure 1 of MCTS survey.
  10. Browne, Cameron, et al. "A survey of monte carlo tree search methods." IEEE Transactions on Computational Intelligence and AI in Games, Vol. 4, No. 1, March 2012.
  11. Schulman, John, et al. "Proximal Policy Optimization Algorithms." arXiv, Aug. 2017. arxiv.org/abs/1707.06347
  12. "CNN Architecture – Detailed Explanation." InterviewBit, June 2022.
  13. Terry, J., et al. "PettingZoo: Gym for multi-agent reinforcement learning." NeurIPS, Vol. 34, pp. 15032–15043, 2021.
  14. Hill et al. "Stable Baselines." GitHub. link
  15. Zouitine, Adil. "Masking in Deep Reinforcement Learning." boring-guy.sh, June 2022.