← Back to writing

ML/RL

What I Learned Building PPO + MCTS for Blokus Duo

2026-02-02 · 8 min

A practical writeup on training self-play agents, evaluating them honestly, and figuring out where search actually helped.

The idea of combining reinforcement learning with search is not new. AlphaGo and its successors proved that pairing a policy network with Monte Carlo Tree Search produces stronger agents than either approach alone. What that combination actually looks like in implementation, and what can go wrong along the way, is harder to learn from a paper than from building it yourself.

I built a self-play training framework for Blokus Duo as a research project, combining Proximal Policy Optimization with MCTS and a convolutional policy-value network. This is what I learned.

The Blokus Duo Problem

Blokus Duo is played on a 14×14 board. Each player has 21 polyomino pieces, shapes ranging from a single square up to five-cell pentominoes. You place pieces touching your own pieces diagonally but not orthogonally, and the goal is to cover as much board space as possible before neither player can place any more pieces.

The action space is large and structured. Each piece has multiple rotation and reflection variants, and each variant can be placed anywhere that satisfies the adjacency rules. Early in the game there are hundreds of legal moves; late in the game the board fills and options compress. The game does not have reversals. Once a piece is placed it is gone, which means long-horizon consequences matter from the first move.

This combination of a large early action space, permanent placement decisions, and a sparse terminal reward makes naive deep RL fragile. Policy gradient methods will learn something, but without lookahead they consistently develop weak positional intuitions. The policy learns to place pieces, not to control territory.

PPO and MCTS: What Each One Does

PPO improves the agent's policy over many games by collecting rollouts, estimating advantages, and updating the network with a clipped objective that prevents large destabilizing updates. It is well-suited for this problem because self-play generates unlimited training data and the clipping makes training stable enough to run long experiments.

The weakness of PPO alone is that it learns from aggregate game outcomes rather than from the local quality of individual positions. A policy trained purely with PPO develops reasonable general intuitions but struggles with positions that require looking several moves ahead to evaluate correctly.

MCTS addresses this directly. Given a position, MCTS runs forward simulations using the current policy as a prior for move selection, accumulates value estimates from the policy-value network across many simulations, and returns a more informed action distribution. The result is a local planner that the policy network cannot replicate from learned weights alone, especially early in training when the policy is weak.

The board state is encoded as stacked binary planes representing piece ownership and empty space, fed into a convolutional network with separate policy and value heads. The policy head outputs a distribution over legal actions; the value head predicts the current player's expected score differential.

The Training Discipline Problem

Most of the implementation friction was not in the model architecture. It was in evaluation.

Reinforcement learning is deceptive about progress. A policy can appear to improve in self-play, with win rates shifting and loss curves trending down, but these signals only measure relative performance against recent versions of itself. If the training distribution degrades, or if both agents adopt a stable but weak equilibrium, the metrics look fine while actual strategic quality stagnates.

I dealt with this by maintaining a fixed pool of reference agents: random play, greedy by cells covered, and a frozen early-training checkpoint. Every fixed number of training steps, the current agent played a block of games against each reference. Progress against random and greedy is easy to fake through distributional shift; progress against the frozen checkpoint is harder to fake. This gave me a baseline that remained meaningful as training ran.

I also treated MCTS simulation counts as a training variable. Early in training, high simulation counts during data collection produce noisy targets because the policy being used as a prior is weak. The MCTS tree is only as good as the estimates feeding it. Starting with low simulation counts and scaling them up as the policy strengthened produced more stable early training than using high counts throughout.

The Feedback Loop

The most useful frame I developed was treating search and learning as a feedback loop rather than as separate components.

As PPO training stabilizes, the policy becomes a better prior for MCTS, and the search explores more relevant branches rather than wasting simulations on clearly weak moves. Simultaneously, as MCTS quality improves, the action distributions it produces during data collection become more informative training targets for the policy update. Each component gradually lifts the other.

The instability point in this loop is when the policy improves faster than the value network can track, or vice versa. I observed this as episodes of erratic evaluation loss, with the value head making confident but wrong predictions about positions that had shifted out of the training distribution. The fix was not architectural; it was slowing the policy update rate relative to the value update rate, giving the value function time to track the shifting distribution.

What Transferred

The strategic patterns that emerged were distinct from baseline heuristics. Agents trained with this system developed preferences for corner-anchored openings, used the diagonal adjacency rule to reserve space rather than immediately filling it, and responded to opponent territory patterns in ways that greedy-by-coverage play does not. Whether that constitutes "understanding" strategy or pattern matching at sufficient depth is a philosophical question; from an evaluation standpoint the behavior was measurably different.

The bigger lesson was methodological: rigorous evaluation and stable training loops matter more than model architecture choices at this scale. The architecture I used was conventional. The evaluation discipline was not, and that is what made the results meaningful and communicable.