← Back to projects

ML/AI · Case Study

Blokus Duo: Deep Search & Reinforcement Learning for a Combinatorial Board Game

A research project comparing four algorithmic approaches: Minimax, MCTS, CNN-augmented MCTS, and PPO, applied to Blokus Duo, a two-player polyomino strategy game with a branching factor that dwarfs Chess and Go.

  • Python
  • PyTorch
  • Keras
  • PettingZoo
  • Stable-Baselines3
  • PPO
  • MCTS
  • CNN
  • Gymnasium

Role: ML engineer + research author  |  Year: 2025

Abstract

Project Overview

This paper evaluates several algorithmic approaches to playing Blokus Duo, a two-player variant of the strategy game Blokus. We implemented and compared four methods: minimax, Monte Carlo Tree Search (MCTS), MCTS enhanced with a convolutional neural network (CNN), and Proximal Policy Optimization (PPO). Each algorithm was trained and tested through large-scale self-play simulations and evaluated against both random agents and human players. Our experiments show that minimax occasionally performed well against humans but was computationally impractical due to the game's large branching factor. PPO achieved moderate success against random play but failed to develop meaningful long-term strategies. Standard MCTS proved stronger but was consistently outperformed by MCTS with CNN, which achieved win rates of 66% as player 1 and 48% as player 2 against the standard MCTS agent. Importantly, the CNN-enhanced MCTS was the only model able to consistently defeat experienced human players. All code was written in Python using Jupyter notebooks, with training and evaluation results stored and shared via GitHub.

Jake Frischmann, Etash Jhanji, Micheal Huang, Sebastian Liu, Alden Bressler, Delia Brown · PGSS Team Project Research Paper

Background & Game Description

What is Blokus Duo?

Blokus Duo is a two-player variant of the polyomino placement game Blokus. Each player owns 21 unique pieces, ranging from a single square up to five-square "pentominoes," and must place them on a 14×14 board, one per turn.

Every placement after the first must touch a corner of one of the player's own previously placed pieces but must not share an edge with any of their own pieces. Pieces may be freely rotated and flipped. The player who places the most total squares wins.

Player 1 starts at position (5, 10) and Player 2 at (10, 5) on the 14×14 board.

Why is it hard for AI?

Blokus Duo has a staggering branching factor. At the start there are ~508 possible moves; by the second turn, ~2,500; and by the third turn, ~3,800. Compare this to Chess (~20–60 per turn) and Go (~250 per turn).

A conservative estimate of only 500 possible moves over 15 turns yields more possible games than atoms in the human body, meaning exhaustive search is completely infeasible.

Winning requires long-horizon spatial reasoning: securing territory, blocking opponent corners, and managing piece difficulty across the entire game arc.

Board Layout (14×14)

Each cell is either empty (0), occupied by Player 1 (1), or occupied by Player 2 (2). The starting positions are marked ★.

Player 1 territory (teal)   Player 2 territory (orange)   ★ Starting position

Piece Inventory

Each player holds 21 unique polyominoes, ranging from 1 to 5 squares. The project code assigns them IDs 1–21:

IDNameSquaresOrientations
1i111
2i222
3i332
4quadruple line42
5quintuple line52
6z444
7t444
8l448
9square41
10w54
11p58
12f58
13t554
14x51
15z554
16v554
17u54
18v334
19n58
20y58
21l558

Environment & Code Architecture

The game engine is built around a central Board class (board.py) that manages all game state. The PettingZoo BlokusEnv wrapper (PPO/blokusEnv.py) exposes the environment to RL frameworks.

board.py · Board.__init__ Source: ~/repos/BlokusDuo/board.py

The board is a 14×14 list of integers. Each player tracks their available "possible squares," the corner positions where a new piece could legally be anchored, which are incrementally updated rather than recomputed from scratch each turn.

# Board is stored as a 14x14 grid of integers:
#   0 = empty  |  1 = Player 1  |  2 = Player 2
self.board = [[0 for _ in range(s)] for _ in range(s)]

# Each player's reachable corner positions:
# format: [x, y, [NE_open, SE_open, SW_open, NW_open]]
self.possible_squares = [
    [[4, 4, [True, True, True, True]]],    # P1 starts at (5,10)
    [[s-5, s-5, [True, True, True, True]]] # P2 starts at (10,5)
]

# Each player's remaining piece inventory (IDs 0-20)
self.inv = [
    list(range(21)),  # Player 1
    list(range(21))   # Player 2
]

self.score = [0, 0]   # tiles placed by each player
board.py · calculateLegalMoves Source: ~/repos/BlokusDuo/board.py

Move generation is the performance-critical path. Rather than scanning the whole board, it iterates only over pre-tracked "possible squares" (corner attachment points) and checks each piece/orientation combination against placement rules.

def calculateLegalMoves(self, only_fives_rounds=0):
    legal_placements = []
    # Iterate only over proven corner attachment points
    for x, y, [NE, SE, SW, NW] in self.possible_squares[self.turn - 1]:
        if self.board[y][x] != 0:
            continue   # already occupied
        if self.turn in self.getEdgesValues(x, y):
            continue   # would share an edge with own piece

        for piece_num in self.inv[self.turn - 1]:
            for orientation in piece_possible_orientations[piece_num]:
                piece_def = pieces[piece_num][orientation]
                # Try anchoring at each block of the piece
                for dir_idx in range(4):
                    for pieceBlock in piece_def[dir_idx + 1]:
                        center = [x - pieceBlock[0], y - pieceBlock[1]]
                        # validate all squares of the piece from this center
                        # ... (bounds + edge + corner checks)
                        legal_placements.append([center[0], center[1],
                                                 piece_num, orientation])
    return legal_placements
PPO/blokusEnv.py · BlokusEnv.observe Source: ~/repos/BlokusDuo/PPO/blokusEnv.py

For RL training the board is flattened to a 1-D integer array and concatenated with the piece inventory. The observation is always returned from the current agent's perspective. Player 2's view is a board-flipped mirror of Player 1's, so a single policy network can handle both roles.

# Observation space definition (per agent)
self.observation_spaces = {
    i: spaces.Dict({
        # 196 board cells (0/1/2) + 42 piece-availability bits
        "observation": spaces.MultiDiscrete([3]*196 + [2]*42, dtype=np.int8),
        # One binary flag per possible (x, y, piece, orientation) combo
        "action_mask": spaces.MultiBinary(14 * 14 * 21 * 8),
    })
    for i in self.agents
}

def observe(self, agent):
    # Map agent IDs -> board encoding perspective (1 or -1)
    perspective = 1 if agent == 0 else -1
    # Piece availability matrix: shape [2, 21], agent-first ordering
    piecesMB = np.zeros([2, 21])
    for piece in range(21):
        if piece in self.board.inv[0]: piecesMB[0][piece] = 1
        if piece in self.board.inv[1]: piecesMB[1][piece] = 1
    if perspective == -1:
        piecesMB = piecesMB[::-1]   # swap so agent's own pieces come first

    return {
        # Flipped board (agent sees itself as "1") + inventory
        "observation": np.concatenate((
            np.array(self.board.get_flipped_board(perspective)).flatten(),
            piecesMB.flatten()
        )).astype(np.int8),
        "action_mask": self.genActionMask(),
    }
PPO/blokusEnv.py · Action encoding Source: ~/repos/BlokusDuo/PPO/blokusEnv.py

The full action space has 14 × 14 × 21 × 8 = 32,928 discrete indices. The discreteToAction / actionToDiscrete pair provides a fast bijection between integer indices and human-readable (x, y, piece, orientation) tuples.

# Total action space: 14 x-coords x 14 y-coords x 21 pieces x 8 orientations
ACTION_SPACE_SIZE = 14 * 14 * 21 * 8  # = 32,928

def discreteToAction(action: int) -> tuple:
    """Convert a flat integer index to (x, y, piece, orientation)."""
    x = action // (14 * 21 * 8)
    action %= (14 * 21 * 8)
    y = action // (21 * 8)
    action %= (21 * 8)
    piece = action // 8
    orientation = action % 8
    return (x, y, piece, orientation)

def actionToDiscrete(action: tuple) -> int:
    """Convert (x, y, piece, orientation) back to a flat index."""
    return (((action[0] * 14 + action[1]) * 21 + action[2]) * 8) + action[3]

Models & RL Setup

Algorithm 1: Minimax

Depth-3 minimax with alpha-beta pruning and a hand-crafted board scoring function (9 weights). The scoring considers available corners, corner direction counts, opponent corner blocking, raw score, and remaining piece difficulty.

Alpha-beta pruning gave a ~4× speedup but couldn't overcome the game's extreme branching factor. Minimax can win against beginners but cannot consistently beat experienced humans.

Algorithm 2: MCTS v3

Monte Carlo Tree Search with Selection → Expansion → Simulation → Backpropagation. Runs 500–4,000 simulations per turn. The UCB score balances exploration vs. exploitation via a tunable constant.

Version 3 reached terminal states fast enough for win/loss feedback instead of relying solely on the board heuristic. Achieved 60–70% win rate against expert human players at ~700 simulations.

Algorithm 3: MCTS + CNN

Two convolutional networks augment the base MCTS: a value network estimates win probability P(win) ∈ [−1, 1] from a board state, replacing (partially) the hand-crafted heuristic; a policy network outputs a probability distribution over moves to focus tree expansion.

Value Network (Keras Sequential)

LayerOutputParams
Conv2D (32 filters, 3×3)14×14×321,184
BatchNorm14×14×32128
Conv2D (64 filters, 3×3)12×12×6418,496
Flatten9,2160
Dense + Dropout2562,359,552
Dense (output)1 (tanh)257

Accuracy peaked at ~28%, intentionally loose to preserve MCTS exploration balance.

Policy Network V1 (Keras Sequential)

LayerOutputParams
Conv2D (32)14×14×321,184
BatchNorm128
Conv2D (128)9×9×128147,584
Conv2D (64)7×7×6473,792
Flatten → Dense (4096)4,09612,849,152
Dense (4096) + Dropout4,09616,781,312
Dense (8182) + Dropout8,18233,521,654
Dense (32928)32,928269,449,824

Action masking zeroes illegal moves then re-normalizes probabilities.

Algorithm 4: PPO (MaskablePPO via Stable-Baselines 3)

The RL environment wraps Board inside a PettingZoo AECEnv, training both agents sequentially. Stable-Baselines3's MaskablePPO applies the action mask at every step so illegal moves are never sampled.

# Training entry point (PPO/aecTrain.py)
model = MaskablePPO(
    MaskableActorCriticPolicy,
    env,                         # SB3ActionMaskWrapper(BlokusEnv)
    verbose=1,
    device='cuda',
    learning_rate=0.001,         # Adam optimizer
    ent_coef=0.005,              # entropy bonus to encourage exploration
)

# Load checkpoint if continuing a training run
if i > 0:
    model = model.load(f"final_models/final_model_{i-1}.zip")
    model.set_env(env)

model.learn(total_timesteps=steps)   # ~100,000 steps ~= 4,000 games
model.save(f"final_models/final_model_{i}.zip")

Hyperparameters

  • Learning rate: 0.001
  • Entropy coef: 0.005
  • Timesteps: ~100,000 (~4,000 games)
  • Reward: +1 win / −1 loss
  • Action masking: MultiBinary, 32,928 elements

Observation Space

  • Board: 14×14 = 196 cells (0/1/2)
  • Inventory: 2×21 = 42 binary flags
  • Total obs: 238 discrete elements
  • Action space: Discrete(32,928)
  • Perspective: Agent-relative (flipped for P2)

Experiments & Results

Results come from ~2.5 million simulations across all non-PPO matchups, plus extensive human evaluation. The PPO model was evaluated separately (~50 games against non-random algorithms due to time constraints).

First-Player Advantage (Random vs. Random, n = 5,000)

Even with neither player having any strategy, playing first confers a statistically significant advantage. A χ² test confirms the observed win distribution differs from a 50/50 null hypothesis (p < 0.05).

52.7%P1 Win Rate
40.5%P2 Win Rate
6.8%Draws
51.9 / 50.3Avg Score P1 / P2
+4.6P1 avg "win by"
+3.0P2 avg "win by"

Head-to-Head Win Rates (Player 1 vs Player 2, %)

Rows are Player 1, columns are Player 2. All values are Player 1's win rate. The PPO vs. Minimax cell is omitted because both are deterministic and produce identical games regardless of sample size.

P1 \ P2RandomMinimaxPPOMCTSMCTS + CNN
Random52.7
Minimax90.698.4
PPO98.461.3
MCTS100.098.8100.00.08.7
MCTS + CNN100.098.251.666.1

MCTS + CNN is the strongest agent: 66% win rate as P1 vs. MCTS, and the only model to consistently defeat experienced human players.

PPO Win Rate vs. Random Agent Over Training

Win rate recorded every ~10,000 timesteps. Growth is generally monotone with one dip at ~40,000 steps (likely a poor exploration episode). Best checkpoint: 98.38% at ~100,000 steps.

0%25%50%75%100%010k20k30k40k50k60k70k80k90k100kPPO Win Rate vs. Random Agenttimesteps

Approximate reconstruction from paper Figure 11. Dip at ~40k is a known exploration artifact.

Win Rate vs. Random Baseline

All agents as Player 1 vs. a random opponent, showing the progression of algorithmic strength.

Random
52.7%
Minimax
90.6%
PPO
98.4%
MCTS
100%
MCTS + CNN
100%

Discussion & Future Work

Key Takeaways

  • MCTS + CNN was the only model to beat expert human players consistently, validating the hypothesis that neural value/policy estimation meaningfully improves search quality.
  • PPO underperformed relative to search-based methods with the given training budget; ~4,000 self-play games were insufficient for the 32,928-dimensional action space.
  • First-mover advantage is real and significant. Even in random play, Player 1 wins 53% of the time with a larger average margin.
  • The value network accuracy of ~28% was intentionally imperfect: higher accuracy disrupted MCTS's exploration-exploitation balance and hurt game performance.

Hardest Engineering Challenges

  • Branching factor at scale. Generating legal moves efficiently without full board scans required maintaining incremental "possible squares" lists, a non-trivial bookkeeping problem.
  • Action masking with 32,928 actions. Training PPO without masking produced garbage policy gradients; implementing PettingZoo + SB3 masking correctly took significant iteration.
  • Evaluation validity. With 2.5M simulations, it was easy to overfit comparisons to specific random seeds. Running many short batches and averaging was essential.
  • Sparse terminal rewards. Most Blokus games take 20–40 turns, making reward credit assignment a genuine challenge for the PPO model.

Future Directions

Short term

  • Use PPO as a second-phase policy network initializer for MCTS + CNN (AlphaGo-style self-play loop)
  • Runtime optimization: profile board.py and move generation for 10–100× speedup in simulation
  • Larger evaluation dataset against human players across skill levels

Longer term

  • Full AlphaZero-style training on expert game data (needs larger compute)
  • AlexNet or ResNet backbone for the value / policy heads
  • Curriculum learning: start with small piece subsets and scale up
  • Generalize to full 4-player Blokus (20×20 board, 4 inventories)