ML/AI · Case Study
Blokus Duo: Deep Search & Reinforcement Learning for a Combinatorial Board Game
A research project comparing four algorithmic approaches: Minimax, MCTS, CNN-augmented MCTS, and PPO, applied to Blokus Duo, a two-player polyomino strategy game with a branching factor that dwarfs Chess and Go.
Role: ML engineer + research author | Year: 2025
Abstract
Project Overview
This paper evaluates several algorithmic approaches to playing Blokus Duo, a two-player variant of the strategy game Blokus. We implemented and compared four methods: minimax, Monte Carlo Tree Search (MCTS), MCTS enhanced with a convolutional neural network (CNN), and Proximal Policy Optimization (PPO). Each algorithm was trained and tested through large-scale self-play simulations and evaluated against both random agents and human players. Our experiments show that minimax occasionally performed well against humans but was computationally impractical due to the game's large branching factor. PPO achieved moderate success against random play but failed to develop meaningful long-term strategies. Standard MCTS proved stronger but was consistently outperformed by MCTS with CNN, which achieved win rates of 66% as player 1 and 48% as player 2 against the standard MCTS agent. Importantly, the CNN-enhanced MCTS was the only model able to consistently defeat experienced human players. All code was written in Python using Jupyter notebooks, with training and evaluation results stored and shared via GitHub.
Background & Game Description
What is Blokus Duo?
Blokus Duo is a two-player variant of the polyomino placement game Blokus. Each player owns 21 unique pieces, ranging from a single square up to five-square "pentominoes," and must place them on a 14×14 board, one per turn.
Every placement after the first must touch a corner of one of the player's own previously placed pieces but must not share an edge with any of their own pieces. Pieces may be freely rotated and flipped. The player who places the most total squares wins.
Player 1 starts at position (5, 10) and Player 2 at (10, 5) on the 14×14 board.
Why is it hard for AI?
Blokus Duo has a staggering branching factor. At the start there are ~508 possible moves; by the second turn, ~2,500; and by the third turn, ~3,800. Compare this to Chess (~20–60 per turn) and Go (~250 per turn).
A conservative estimate of only 500 possible moves over 15 turns yields more possible games than atoms in the human body, meaning exhaustive search is completely infeasible.
Winning requires long-horizon spatial reasoning: securing territory, blocking opponent corners, and managing piece difficulty across the entire game arc.
Board Layout (14×14)
Each cell is either empty (0), occupied by Player 1 (1), or occupied by
Player 2 (2). The starting positions are marked ★.
■ Player 1 territory (teal) ■ Player 2 territory (orange) ★ Starting position
Piece Inventory
Each player holds 21 unique polyominoes, ranging from 1 to 5 squares. The project code assigns them IDs 1–21:
| ID | Name | Squares | Orientations |
|---|---|---|---|
| 1 | i1 | 1 | 1 |
| 2 | i2 | 2 | 2 |
| 3 | i3 | 3 | 2 |
| 4 | quadruple line | 4 | 2 |
| 5 | quintuple line | 5 | 2 |
| 6 | z4 | 4 | 4 |
| 7 | t4 | 4 | 4 |
| 8 | l4 | 4 | 8 |
| 9 | square | 4 | 1 |
| 10 | w | 5 | 4 |
| 11 | p | 5 | 8 |
| 12 | f | 5 | 8 |
| 13 | t5 | 5 | 4 |
| 14 | x | 5 | 1 |
| 15 | z5 | 5 | 4 |
| 16 | v5 | 5 | 4 |
| 17 | u | 5 | 4 |
| 18 | v3 | 3 | 4 |
| 19 | n | 5 | 8 |
| 20 | y | 5 | 8 |
| 21 | l5 | 5 | 8 |
Environment & Code Architecture
The game engine is built around a central Board class (board.py) that manages all
game state. The PettingZoo BlokusEnv wrapper (PPO/blokusEnv.py) exposes the
environment to RL frameworks.
The board is a 14×14 list of integers. Each player tracks their available "possible squares," the corner positions where a new piece could legally be anchored, which are incrementally updated rather than recomputed from scratch each turn.
# Board is stored as a 14x14 grid of integers:
# 0 = empty | 1 = Player 1 | 2 = Player 2
self.board = [[0 for _ in range(s)] for _ in range(s)]
# Each player's reachable corner positions:
# format: [x, y, [NE_open, SE_open, SW_open, NW_open]]
self.possible_squares = [
[[4, 4, [True, True, True, True]]], # P1 starts at (5,10)
[[s-5, s-5, [True, True, True, True]]] # P2 starts at (10,5)
]
# Each player's remaining piece inventory (IDs 0-20)
self.inv = [
list(range(21)), # Player 1
list(range(21)) # Player 2
]
self.score = [0, 0] # tiles placed by each playerMove generation is the performance-critical path. Rather than scanning the whole board, it iterates only over pre-tracked "possible squares" (corner attachment points) and checks each piece/orientation combination against placement rules.
def calculateLegalMoves(self, only_fives_rounds=0):
legal_placements = []
# Iterate only over proven corner attachment points
for x, y, [NE, SE, SW, NW] in self.possible_squares[self.turn - 1]:
if self.board[y][x] != 0:
continue # already occupied
if self.turn in self.getEdgesValues(x, y):
continue # would share an edge with own piece
for piece_num in self.inv[self.turn - 1]:
for orientation in piece_possible_orientations[piece_num]:
piece_def = pieces[piece_num][orientation]
# Try anchoring at each block of the piece
for dir_idx in range(4):
for pieceBlock in piece_def[dir_idx + 1]:
center = [x - pieceBlock[0], y - pieceBlock[1]]
# validate all squares of the piece from this center
# ... (bounds + edge + corner checks)
legal_placements.append([center[0], center[1],
piece_num, orientation])
return legal_placementsFor RL training the board is flattened to a 1-D integer array and concatenated with the piece inventory. The observation is always returned from the current agent's perspective. Player 2's view is a board-flipped mirror of Player 1's, so a single policy network can handle both roles.
# Observation space definition (per agent)
self.observation_spaces = {
i: spaces.Dict({
# 196 board cells (0/1/2) + 42 piece-availability bits
"observation": spaces.MultiDiscrete([3]*196 + [2]*42, dtype=np.int8),
# One binary flag per possible (x, y, piece, orientation) combo
"action_mask": spaces.MultiBinary(14 * 14 * 21 * 8),
})
for i in self.agents
}
def observe(self, agent):
# Map agent IDs -> board encoding perspective (1 or -1)
perspective = 1 if agent == 0 else -1
# Piece availability matrix: shape [2, 21], agent-first ordering
piecesMB = np.zeros([2, 21])
for piece in range(21):
if piece in self.board.inv[0]: piecesMB[0][piece] = 1
if piece in self.board.inv[1]: piecesMB[1][piece] = 1
if perspective == -1:
piecesMB = piecesMB[::-1] # swap so agent's own pieces come first
return {
# Flipped board (agent sees itself as "1") + inventory
"observation": np.concatenate((
np.array(self.board.get_flipped_board(perspective)).flatten(),
piecesMB.flatten()
)).astype(np.int8),
"action_mask": self.genActionMask(),
}The full action space has 14 × 14 × 21 × 8 = 32,928 discrete indices. The discreteToAction / actionToDiscrete pair provides a fast bijection between
integer indices and human-readable (x, y, piece, orientation) tuples.
# Total action space: 14 x-coords x 14 y-coords x 21 pieces x 8 orientations
ACTION_SPACE_SIZE = 14 * 14 * 21 * 8 # = 32,928
def discreteToAction(action: int) -> tuple:
"""Convert a flat integer index to (x, y, piece, orientation)."""
x = action // (14 * 21 * 8)
action %= (14 * 21 * 8)
y = action // (21 * 8)
action %= (21 * 8)
piece = action // 8
orientation = action % 8
return (x, y, piece, orientation)
def actionToDiscrete(action: tuple) -> int:
"""Convert (x, y, piece, orientation) back to a flat index."""
return (((action[0] * 14 + action[1]) * 21 + action[2]) * 8) + action[3]Models & RL Setup
Algorithm 1: Minimax
Depth-3 minimax with alpha-beta pruning and a hand-crafted board scoring function (9 weights). The scoring considers available corners, corner direction counts, opponent corner blocking, raw score, and remaining piece difficulty.
Alpha-beta pruning gave a ~4× speedup but couldn't overcome the game's extreme branching factor. Minimax can win against beginners but cannot consistently beat experienced humans.
Algorithm 2: MCTS v3
Monte Carlo Tree Search with Selection → Expansion → Simulation → Backpropagation. Runs 500–4,000 simulations per turn. The UCB score balances exploration vs. exploitation via a tunable constant.
Version 3 reached terminal states fast enough for win/loss feedback instead of relying solely on the board heuristic. Achieved 60–70% win rate against expert human players at ~700 simulations.
Algorithm 3: MCTS + CNN
Two convolutional networks augment the base MCTS: a value network estimates win probability P(win) ∈ [−1, 1] from a board state, replacing (partially) the hand-crafted heuristic; a policy network outputs a probability distribution over moves to focus tree expansion.
Value Network (Keras Sequential)
| Layer | Output | Params |
|---|---|---|
| Conv2D (32 filters, 3×3) | 14×14×32 | 1,184 |
| BatchNorm | 14×14×32 | 128 |
| Conv2D (64 filters, 3×3) | 12×12×64 | 18,496 |
| Flatten | 9,216 | 0 |
| Dense + Dropout | 256 | 2,359,552 |
| Dense (output) | 1 (tanh) | 257 |
Accuracy peaked at ~28%, intentionally loose to preserve MCTS exploration balance.
Policy Network V1 (Keras Sequential)
| Layer | Output | Params |
|---|---|---|
| Conv2D (32) | 14×14×32 | 1,184 |
| BatchNorm | — | 128 |
| Conv2D (128) | 9×9×128 | 147,584 |
| Conv2D (64) | 7×7×64 | 73,792 |
| Flatten → Dense (4096) | 4,096 | 12,849,152 |
| Dense (4096) + Dropout | 4,096 | 16,781,312 |
| Dense (8182) + Dropout | 8,182 | 33,521,654 |
| Dense (32928) | 32,928 | 269,449,824 |
Action masking zeroes illegal moves then re-normalizes probabilities.
Algorithm 4: PPO (MaskablePPO via Stable-Baselines 3)
The RL environment wraps Board inside a PettingZoo AECEnv, training both agents
sequentially. Stable-Baselines3's MaskablePPO applies the action mask at every step so
illegal moves are never sampled.
# Training entry point (PPO/aecTrain.py)
model = MaskablePPO(
MaskableActorCriticPolicy,
env, # SB3ActionMaskWrapper(BlokusEnv)
verbose=1,
device='cuda',
learning_rate=0.001, # Adam optimizer
ent_coef=0.005, # entropy bonus to encourage exploration
)
# Load checkpoint if continuing a training run
if i > 0:
model = model.load(f"final_models/final_model_{i-1}.zip")
model.set_env(env)
model.learn(total_timesteps=steps) # ~100,000 steps ~= 4,000 games
model.save(f"final_models/final_model_{i}.zip") Hyperparameters
- Learning rate: 0.001
- Entropy coef: 0.005
- Timesteps: ~100,000 (~4,000 games)
- Reward: +1 win / −1 loss
- Action masking: MultiBinary, 32,928 elements
Observation Space
- Board: 14×14 = 196 cells (0/1/2)
- Inventory: 2×21 = 42 binary flags
- Total obs: 238 discrete elements
- Action space: Discrete(32,928)
- Perspective: Agent-relative (flipped for P2)
Experiments & Results
Results come from ~2.5 million simulations across all non-PPO matchups, plus extensive human evaluation. The PPO model was evaluated separately (~50 games against non-random algorithms due to time constraints).
First-Player Advantage (Random vs. Random, n = 5,000)
Even with neither player having any strategy, playing first confers a statistically significant advantage. A χ² test confirms the observed win distribution differs from a 50/50 null hypothesis (p < 0.05).
Head-to-Head Win Rates (Player 1 vs Player 2, %)
Rows are Player 1, columns are Player 2. All values are Player 1's win rate. The PPO vs. Minimax cell is omitted because both are deterministic and produce identical games regardless of sample size.
| P1 \ P2 | Random | Minimax | PPO | MCTS | MCTS + CNN |
|---|---|---|---|---|---|
| Random | 52.7 | — | — | — | — |
| Minimax | 90.6 | 98.4 | — | — | — |
| PPO | 98.4 | — | — | 61.3 | — |
| MCTS | 100.0 | 98.8 | 100.0 | 0.0 | 8.7 |
| MCTS + CNN | 100.0 | 98.2 | 51.6 | 66.1 | — |
MCTS + CNN is the strongest agent: 66% win rate as P1 vs. MCTS, and the only model to consistently defeat experienced human players.
PPO Win Rate vs. Random Agent Over Training
Win rate recorded every ~10,000 timesteps. Growth is generally monotone with one dip at ~40,000 steps (likely a poor exploration episode). Best checkpoint: 98.38% at ~100,000 steps.
Approximate reconstruction from paper Figure 11. Dip at ~40k is a known exploration artifact.
Win Rate vs. Random Baseline
All agents as Player 1 vs. a random opponent, showing the progression of algorithmic strength.
Discussion & Future Work
Key Takeaways
- MCTS + CNN was the only model to beat expert human players consistently, validating the hypothesis that neural value/policy estimation meaningfully improves search quality.
- PPO underperformed relative to search-based methods with the given training budget; ~4,000 self-play games were insufficient for the 32,928-dimensional action space.
- First-mover advantage is real and significant. Even in random play, Player 1 wins 53% of the time with a larger average margin.
- The value network accuracy of ~28% was intentionally imperfect: higher accuracy disrupted MCTS's exploration-exploitation balance and hurt game performance.
Hardest Engineering Challenges
- Branching factor at scale. Generating legal moves efficiently without full board scans required maintaining incremental "possible squares" lists, a non-trivial bookkeeping problem.
- Action masking with 32,928 actions. Training PPO without masking produced garbage policy gradients; implementing PettingZoo + SB3 masking correctly took significant iteration.
- Evaluation validity. With 2.5M simulations, it was easy to overfit comparisons to specific random seeds. Running many short batches and averaging was essential.
- Sparse terminal rewards. Most Blokus games take 20–40 turns, making reward credit assignment a genuine challenge for the PPO model.
Future Directions
Short term
- Use PPO as a second-phase policy network initializer for MCTS + CNN (AlphaGo-style self-play loop)
- Runtime optimization: profile board.py and move generation for 10–100× speedup in simulation
- Larger evaluation dataset against human players across skill levels
Longer term
- Full AlphaZero-style training on expert game data (needs larger compute)
- AlexNet or ResNet backbone for the value / policy heads
- Curriculum learning: start with small piece subsets and scale up
- Generalize to full 4-player Blokus (20×20 board, 4 inventories)