Counterfactual regret minimization is a technique used in imperfect-information game AI. In perfect-information games like chess and go, each player knows the entire state of the game at all times. In imperfect-information games, each player is missing information. Imperfect-information games are important since real-life strategic applications are often more easily modeled as imperfect-information games. Applications include security models, economic models, auctions, negotiations, and more.

However, imperfect-information games are harder to solve than perfect-information games. In a perfect-information game, each subtree’s solution is independent of the rest, so you can solve the game with a single traversal of the game tree. In an imperfect-information game, subtrees are interdependent: the best strategy in one part of the tree depends on the strategies used elsewhere. An imperfect-information game with the same state-complexity as a perfect-information game is therefore more difficult to solve.

Poker is the classic example of an imperfect-information game. Most recent research on imperfect-information games has focused on poker. It’s the perfect proving ground. To humans, poker is so difficult to reason about that it is most often categorized as a game of complete luck. This is wrong. In fact, heads-up limit hold’em (HU LHE) has been solved [1]. In 2017, two AIs defeated professional poker players over a statistically significant sample for the first time in heads-up no-limit hold’em (HU NLHE). This was a significant event. The version of HU NLHE played in the ACPC has 10^{164} game states [2]. To put that into perspective, there are estimated to be 10^{82} atoms in the universe. If you had one trillion computers each crunching one trillion states per second for the next one trillion years you’d still have only reached 10^{43} states. We may never truly solve a game with a state-complexity of 10^{164}.

Instead, recent AIs create strategies with near-zero exploitability. We assume that poker is a two-player zero-sum game where players alternate positions. Under these assumptions, a near-zero exploitability strategy can do no worse than approximately break-even, even against an opponent who can see your cards and see the future. A set of strategies with zero exploitability is a Nash equilibrium; strategies with near-zero exploitability form an approximate Nash equilibrium.
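To see what exploitability means concretely, here is a small illustrative sketch (not from the original post) using matching pennies, a two-player zero-sum game: the equilibrium strategy concedes nothing to a best-responding opponent, while any deviation can be exploited.

```python
import numpy as np

# Matching pennies payoff matrix for the row player:
# actions are (heads, tails); row wins 1 on a match, loses 1 otherwise.
payoff = np.array([[1, -1],
                   [-1, 1]])

def exploitability(row_strategy):
    # Value the row player's fixed mixed strategy concedes to a
    # best-responding column player.
    values = row_strategy @ payoff  # row EV against each pure column action
    return -min(values)

print(round(float(exploitability(np.array([0.5, 0.5]))), 3))  # 0.0, a Nash equilibrium
print(round(float(exploitability(np.array([0.6, 0.4]))), 3))  # 0.2, exploitable
```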

One minor note is that a Nash equilibrium is, in one sense, not an optimal strategy in poker. Current poker AIs treat each hand as a separate game, similar to a poker cash game as opposed to a tournament. This form of poker does not have a binary outcome; instead, a Nash equilibrium sets a lower bound on how much you can lose. It is a defensive strategy. Two players both following Nash equilibrium strategies would lose money over time, since real poker is not truly zero-sum: the house takes a rake. That being said, neural network and deep reinforcement learning approaches have had limited success here, while Nash equilibrium approximations have been shown to perform very well in practice.

Counterfactual regret minimization (CFR) is an algorithm that approximates a Nash equilibrium. All modern poker AIs use a variant of CFR; the simplest variant, and the one discussed here, is Vanilla CFR. Intuitively, CFR works by repeatedly playing against itself while minimizing regret. Regret is a numeric value describing how much you regret a decision. It can be positive, negative, or zero: a positive regret indicates that you would rather have taken a different action, a negative regret indicates that you are happy with your decision, and a zero regret indicates indifference. CFR minimizes regret over many iterations until the average strategy over all iterations converges. The average strategy is the approximated Nash equilibrium. If the average overall regret of both players is less than ε, then their average strategies form a 2ε-equilibrium, meaning each strategy can be exploited by at most 2ε [3].
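As a toy illustration (hypothetical numbers, not part of the Kuhn Poker implementation below), suppose a two-action decision where checking earned 1 chip but betting would have earned 3. Regret matching then chooses the next strategy in proportion to the positive regrets:

```python
import numpy as np

# Regret of each action = (what it would have earned) - (what we earned).
# We earned 1 by checking; betting would have earned 3.
regret_sum = np.array([1 - 1, 3 - 1], dtype=float)  # [check, bet] -> [0, 2]

# Regret matching: play actions in proportion to their positive regret.
positive = np.maximum(regret_sum, 0)
if positive.sum() > 0:
    strategy = positive / positive.sum()
else:
    # No positive regret: fall back to the uniform strategy.
    strategy = np.repeat(1 / len(regret_sum), len(regret_sum))

print(strategy)  # [0. 1.]: all weight on the regretted (bet) action
```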

Regret minimization algorithms have existed for a long time. However, regret minimization algorithms work best when the game can be represented as a matrix, also known as normal-form. Imperfect-information games like poker are not easily represented this way. And when they are, regret minimization algorithms are too slow. Poker is best represented as an imperfect-information extensive-form game. This means that the game is represented as a tree structure. Each node in the tree represents either a player’s decision, a chance event, or a terminal outcome. Edges represent actions taken.
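For contrast, here is a hypothetical sketch of a normal-form game: rock-paper-scissors reduces to a single payoff matrix, and the expected value of a pair of mixed strategies is just a matrix product. Poker's sequential betting and hidden cards do not compress this neatly.

```python
import numpy as np

# Rock-paper-scissors in normal form: payoff[i][j] is the row player's
# payoff when row plays i and column plays j (0=rock, 1=paper, 2=scissors).
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

# Expected value of mixed strategies p (row) and q (column) is p @ A @ q.
p = np.array([1/3, 1/3, 1/3])
q = np.array([1/3, 1/3, 1/3])
print(p @ payoff @ q)  # 0.0 at the uniform equilibrium
```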

CFR was created to minimize regret for extensive-form games. It does this by introducing a new value called counterfactual regret. The algorithm minimizes a separate counterfactual regret value for every information set, where an information set is a set of nodes in the game tree that are indistinguishable to a given player. In poker, an information set for a player contains all the nodes representing the different private card combinations an opponent can hold. Minimizing counterfactual regret for each information set works because the average overall regret has been proven to be less than or equal to the sum of the counterfactual regrets [3]. Therefore, minimizing counterfactual regret minimizes overall regret.

It’s easiest to see CFR in action by implementing it on a simpler form of poker. The canonical example is Kuhn Poker. In this form there are only three cards (a Jack, a Queen, and a King in our code). Each player is dealt a single card and antes 1 chip (an ante is a forced bet at the start of every hand). There is a single round of betting. Players can check, bet 1 chip, call, or fold. Below is an image of the complete game tree. The dotted lines connect nodes that are in the same information set.

The following implementation is written in Python 3. It is not written for performance; a practical CFR implementation is resource-intensive, which is why CFR implementations intended for NLHE poker variants are often written in C++. Instead, this implementation is optimized for readability. We use the NumPy library to make vector operations easier to read, although it makes the algorithm slower in this case.

```
1  import numpy as np
2
3  # Number of actions a player can take at a decision node.
4  _N_ACTIONS = 2
5  _N_CARDS = 3
```

First, we import NumPy and define constants for the number of cards and number of possible actions. In our algorithm each player only has two possible actions: check and bet. This simplifies our code. This is possible because a fold can be treated as a check, and a call can be treated as a bet without loss of generality.

```
 8  def main():
 9      """
10      Run iterations of counterfactual regret minimization algorithm.
11      """
12      i_map = {}  # map of information sets
13      n_iterations = 10000
14      expected_game_value = 0
15
16      for _ in range(n_iterations):
17          expected_game_value += cfr(i_map)
18          for _, v in i_map.items():
19              v.next_strategy()
20
21      expected_game_value /= n_iterations
22
23      display_results(expected_game_value, i_map)
```

`main()` is our entry point to the program. `i_map`, defined on line 12, is a dictionary (hash map) that contains information set objects keyed by a string. Each iteration of `cfr()` returns the expected game value for that iteration. We can approximate the true game value by averaging over all iterations. The true game value is the expected amount that a player will win while following the average strategy. On line 19 we calculate each information set's strategy for the next iteration.

```
26  def cfr(i_map, history="", card_1=-1, card_2=-1, pr_1=1, pr_2=1, pr_c=1):
27      """
28      Counterfactual regret minimization algorithm.
29
30      Parameters
31      ----------
32
33      i_map: dict
34          Dictionary of all information sets.
35      history : [{'r', 'c', 'b'}], str
36          A string representation of the game tree path we have taken.
37          Each character of the string represents a single action:
38
39          'r': random chance action
40          'c': check action
41          'b': bet action
42      card_1 : (0, 2), int
43          player A's card
44      card_2 : (0, 2), int
45          player B's card
46      pr_1 : (0, 1.0), float
47          The probability that player A reaches `history`.
48      pr_2 : (0, 1.0), float
49          The probability that player B reaches `history`.
50      pr_c: (0, 1.0), float
51          The probability contribution of chance events to reach `history`.
52      """
53      if is_chance_node(history):
54          return chance_util(i_map)
55
56      if is_terminal(history):
57          return terminal_util(history, card_1, card_2)
58
59      n = len(history)
60      is_player_1 = n % 2 == 0
61      info_set = get_info_set(i_map, card_1 if is_player_1 else card_2, history)
62
63      strategy = info_set.strategy
64      if is_player_1:
65          info_set.reach_pr += pr_1
66      else:
67          info_set.reach_pr += pr_2
68
69      # Counterfactual utility per action.
70      action_utils = np.zeros(_N_ACTIONS)
71
72      for i, action in enumerate(["c", "b"]):
73          next_history = history + action
74          if is_player_1:
75              action_utils[i] = -1 * cfr(i_map, next_history,
76                                         card_1, card_2,
77                                         pr_1 * strategy[i], pr_2, pr_c)
78          else:
79              action_utils[i] = -1 * cfr(i_map, next_history,
80                                         card_1, card_2,
81                                         pr_1, pr_2 * strategy[i], pr_c)
82
83      # Utility of information set.
84      util = sum(action_utils * strategy)
85      regrets = action_utils - util
86      if is_player_1:
87          info_set.regret_sum += pr_2 * pr_c * regrets
88      else:
89          info_set.regret_sum += pr_1 * pr_c * regrets
90
91      return util
```

`cfr()` contains the bulk of the logic. The function performs a recursive depth-first traversal of the game tree. In Vanilla CFR we traverse the entire game tree on every iteration. `cfr()` performs different tasks depending on whether the node is a chance node, a terminal node, or a decision node. It takes seven parameters:

- `i_map` is the hash map of information sets.
- `history` is a string that represents our current location in the game tree. Each character represents an action we have taken and an edge we have followed: ‘r’ is a random chance event, ‘c’ is a check action, and ‘b’ is a bet action.
- `card_1` is the private card of player one.
- `card_2` is the private card of player two.
- `pr_1` is player one’s contribution to the reach probability, the probability that we’ve reached `history`.
- `pr_2` is player two’s contribution to the reach probability.
- `pr_c` is the contribution of chance events to the reach probability.

At line 53, we check if we are at a chance node with `is_chance_node()`. If so, we call `chance_util()`.

```
 94  def is_chance_node(history):
 95      """
 96      Determine if we are at a chance node based on tree history.
 97      """
 98      return history == ""
 99
100
101  def chance_util(i_map):
102      expected_value = 0
103      n_possibilities = 6
104      for i in range(_N_CARDS):
105          for j in range(_N_CARDS):
106              if i != j:
107                  expected_value += cfr(i_map, "rr", i, j, 1, 1, 1/n_possibilities)
108      return expected_value/n_possibilities
```

`is_chance_node()` determines if we are at a chance node by checking for an empty history. This works because in Kuhn Poker chance events occur only at the start of the game. If `is_chance_node()` returns true, `chance_util()` enumerates all possible combinations of chance events: six in total, since there are three possible cards for player one and two remaining cards for player two. For each possibility, we recursively call `cfr()`. The next history is `"rr"` to represent the two random chance actions. Neither player has taken any actions, so their reach probabilities are `1`. Each chance event occurs with uniform probability `1/n_possibilities`. The expected value over all possibilities is the sum of the utilities of each possibility divided by the number of possibilities.

At line 56 we check if we have reached a terminal node of the game tree with `is_terminal()`.

```
112  def is_terminal(history):
113      """
114      Returns true if the history is a terminal history.
115      """
116      possibilities = {"rrcc": True, "rrcbc": True,
117                       "rrcbb": True, "rrbc": True, "rrbb": True}
118      return history in possibilities
119
120
121  def terminal_util(history, card_1, card_2):
122      """
123      Returns the utility of a terminal history.
124      """
125      n = len(history)
126      card_player = card_1 if n % 2 == 0 else card_2
127      card_opponent = card_2 if n % 2 == 0 else card_1
128
129      if history == "rrcbc" or history == "rrbc":
130          # Last player folded. The current player wins.
131          return 1
132      elif history == "rrcc":
133          # Showdown with no bets
134          return 1 if card_player > card_opponent else -1
135
136      # Showdown with 1 bet
137      assert(history == "rrcbb" or history == "rrbb")
138      return 2 if card_player > card_opponent else -2
```

`is_terminal()` checks if `history` is in the predefined set of terminal histories. If it is, `terminal_util()` returns the utility for the current player. Since players alternate turns, the current player is player one if the number of actions in `history` is even and player two otherwise. There are three cases when calculating the utility of a terminal history:

- If the last player folded, the current player wins 1 chip.
- If both players check during the betting round, the players showdown with a pot of 2 and the player with the higher card wins 1 chip.
- If a bet was made and called, the players showdown with a pot of 4 and the player with the higher card wins 2 chips.

If we reach line 59 we are at a decision node within an information set. On line 60 we check whose turn it is. On line 61 we retrieve the information set from `i_map` using `get_info_set()`.

```
141  def card_str(card):
142      if card == 0:
143          return "J"
144      elif card == 1:
145          return "Q"
146      return "K"
147
148
149  def get_info_set(i_map, card, history):
150      """
151      Retrieve information set from dictionary.
152      """
153      key = card_str(card) + " " + history
154      info_set = None
155
156      if key not in i_map:
157          info_set = InformationSet(key)
158          i_map[key] = info_set
159          return info_set
160
161      return i_map[key]
162
163
164  class InformationSet():
165      def __init__(self, key):
166          self.key = key
167          self.regret_sum = np.zeros(_N_ACTIONS)
168          self.strategy_sum = np.zeros(_N_ACTIONS)
169          self.strategy = np.repeat(1/_N_ACTIONS, _N_ACTIONS)
170          self.reach_pr = 0
171          self.reach_pr_sum = 0
172
173      def next_strategy(self):
174          self.strategy_sum += self.reach_pr * self.strategy
175          self.strategy = self.calc_strategy()
176          self.reach_pr_sum += self.reach_pr
177          self.reach_pr = 0
178
179      def calc_strategy(self):
180          """
181          Calculate current strategy from the sum of regret.
182          """
183          strategy = self.make_positive(self.regret_sum)
184          total = sum(strategy)
185          if total > 0:
186              strategy = strategy / total
187          else:
188              n = _N_ACTIONS
189              strategy = np.repeat(1/n, n)
190
191          return strategy
192
193      def get_average_strategy(self):
194          """
195          Calculate average strategy over all iterations. This is the
196          Nash equilibrium strategy.
197          """
198          strategy = self.strategy_sum / self.reach_pr_sum
199
200          # Purify to remove actions that are likely a mistake
201          strategy = np.where(strategy < 0.001, 0, strategy)
202
203          # Re-normalize
204          total = sum(strategy)
205          strategy /= total
206
207          return strategy
208
209      def make_positive(self, x):
210          return np.where(x > 0, x, 0)
211
212      def __str__(self):
213          strategies = ['{:03.2f}'.format(x)
214                        for x in self.get_average_strategy()]
215          return '{} {}'.format(self.key.ljust(6), strategies)
```

`i_map`’s keys are strings created by concatenating a player’s private card and the game tree history. We represent each information set as an instance of the `InformationSet` class. `get_info_set()` generates the key and retrieves the information set. If the information set is not already in the map, we create a new instance and insert it.

In Vanilla CFR each information set is visited multiple times per iteration. `InformationSet` stores `self.regret_sum`, `self.strategy_sum`, `self.strategy`, `self.reach_pr`, and `self.reach_pr_sum`. `self.regret_sum`, `self.strategy_sum`, and `self.strategy` are arrays indexed by action: check is index `0` and bet is index `1`. `self.regret_sum` is the sum of counterfactual regrets for each action over all visits, and is used to calculate the next strategy to try. `self.strategy_sum` is the sum of each visit’s strategy multiplied by the information set player’s reach probability; together with `self.reach_pr_sum` it is used to calculate the average strategy. `self.strategy` is the strategy for the current iteration, and `self.reach_pr` accumulates the probability of reaching the information set. As more iterations are performed, the average regret shrinks and the average strategy converges to a Nash equilibrium.

At the end of every iteration, `next_strategy()` updates `self.strategy_sum` and `self.strategy`, and resets `self.reach_pr`. The next strategy is generated by the `calc_strategy()` method. Intuitively, `calc_strategy()` chooses a strategy proportional to the positive elements of `self.regret_sum`; any action with negative regret is ignored. If no element of `self.regret_sum` is positive, the next strategy chooses an action uniformly at random.

On line 63 we retrieve the information set’s current strategy, which was calculated at the end of the previous iteration. On lines 64-67 we add the player’s reach probability to the information set’s reach probability, which is the sum of the reach probabilities of all histories in the information set. On lines 72-81 we recursively call `cfr()` for each possible action in the information set and store the results in an array of expected utilities. The history for the next recursive call is created by appending the action to `history`. The current player chooses action `i` with probability `strategy[i]`, so the recursive call scales the current player’s reach probability by `strategy[i]`. We multiply the expected utility returned by `cfr()` by `-1` because it is the utility from the next player’s perspective, and in a two-player zero-sum game one player’s utility is the negative of the other’s.

On line 84 we calculate the expected utility over all actions. On line 85 we calculate the regret for each action by subtracting `util` from each element of `action_utils`. On lines 87 and 89 we weight these regrets by the opponent’s reach probability and the chance contribution to obtain counterfactual regrets, which are added to the information set’s `self.regret_sum`. Finally, we return the history’s expected utility.

After all iterations are finished, `main()` calls `display_results()`.

```
227  def display_results(ev, i_map):
228      print('player 1 expected value: {}'.format(ev))
229      print('player 2 expected value: {}'.format(-1 * ev))
230
231      print()
232      print('player 1 strategies:')
233      sorted_items = sorted(i_map.items(), key=lambda x: x[0])
234      for _, v in filter(lambda x: len(x[0]) % 2 == 0, sorted_items):
235          print(v)
236      print()
237      print('player 2 strategies:')
238      for _, v in filter(lambda x: len(x[0]) % 2 == 1, sorted_items):
239          print(v)
```

`display_results()` prints each player’s information sets using the `__str__()` method of the `InformationSet` class. `__str__()` prints the information set key followed by the average strategy returned by `get_average_strategy()`, which calculates the average strategy from `self.strategy_sum` and `self.reach_pr_sum` and sets near-zero probabilities to zero.

Running the program for 10,000 iterations with `$ python cfr_vanilla.py` produces the output:

```
player 1 expected value: -0.05666979368427769
player 2 expected value: 0.05666979368427769

player 1 strategies:
J rr   ['0.79', '0.21']
J rrcb ['1.00', '0.00']
K rr   ['0.39', '0.61']
K rrcb ['0.00', '1.00']
Q rr   ['1.00', '0.00']
Q rrcb ['0.45', '0.55']

player 2 strategies:
J rrb  ['1.00', '0.00']
J rrc  ['0.67', '0.33']
K rrb  ['0.00', '1.00']
K rrc  ['0.00', '1.00']
Q rrb  ['0.66', '0.34']
Q rrc  ['1.00', '0.00']
```

If you run the program for 100,000 iterations it will produce a slightly better approximation. Some CFR implementations have run for up to 6 million core hours [6]. As the number of iterations approaches infinity, the overall average regret approaches zero and the average strategy approaches a Nash equilibrium.

```
player 1 expected value: -0.055439677199254814
player 2 expected value: 0.055439677199254814

player 1 strategies:
J rr   ['0.79', '0.21']
J rrcb ['1.00', '0.00']
K rr   ['0.38', '0.62']
K rrcb ['0.00', '1.00']
Q rr   ['1.00', '0.00']
Q rrcb ['0.46', '0.54']

player 2 strategies:
J rrb  ['1.00', '0.00']
J rrc  ['0.66', '0.34']
K rrb  ['0.00', '1.00']
K rrc  ['0.00', '1.00']
Q rrb  ['0.67', '0.33']
Q rrc  ['1.00', '0.00']
```

The game value of player one in Kuhn Poker is −1/18 ≈ −0.0556. After 100,000 iterations the algorithm approximates the game value to be -0.055439677199254814.
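A quick check (a small snippet of this write-up, not part of the program) that the approximation is close to the exact value of −1/18:

```python
# Exact game value of Kuhn Poker for player one.
true_value = -1 / 18

# Value approximated by CFR after 100,000 iterations (from the output above).
approximation = -0.055439677199254814

print(abs(approximation - true_value) < 0.001)  # True
```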

The strategies indicate that if player one has a J her first move should be to check 79% of the time and bet 21% of the time. If player two has a K and player one checked, he should bet 100% of the time. This is just one Nash equilibrium. Kuhn Poker has infinitely many. Vanilla CFR is a deterministic algorithm, but many variants are non-deterministic. A non-deterministic CFR may find a different Nash equilibrium.
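Kuhn Poker’s equilibria for player one are known to form a one-parameter family: bet first with the J with probability α, never bet first with the Q, and bet first with the K with probability 3α, for any α in [0, 1/3]. A rough sanity check of the 10,000-iteration output against this relationship (frequencies copied from the output above):

```python
# First-action bet frequencies read from the printed strategies.
alpha_j = 0.21  # P(bet | J) for player one
beta_k = 0.61   # P(bet | K) for player one

# In equilibrium, the K bet frequency should be about three times the J's.
print(abs(beta_k - 3 * alpha_j) < 0.05)  # True
```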

That’s Vanilla CFR. The algorithm is relatively simple. I hope that this post has helped you to understand the basics of the algorithm.

### Resources

- [1] Heads-up Limit Hold’em Poker is Solved
- [2] Measuring the Size of Large No-Limit Poker Games
- [3] Regret Minimization in Games with Incomplete Information
- [4] Efficient Nash Equilibrium Approximation through Monte Carlo Counterfactual Regret Minimization
- [5] An Introduction to Counterfactual Regret Minimization
- [6] Libratus: The Superhuman AI for No-Limit Poker
- [7] Strategy Purification and Thresholding: Effective Non-Equilibrium Approaches for Playing Large Games

### Code

```
  1  import numpy as np
  2
  3  # Number of actions a player can take at a decision node.
  4  _N_ACTIONS = 2
  5  _N_CARDS = 3
  6
  7
  8  def main():
  9      """
 10      Run iterations of counterfactual regret minimization algorithm.
 11      """
 12      i_map = {}  # map of information sets
 13      n_iterations = 10000
 14      expected_game_value = 0
 15
 16      for _ in range(n_iterations):
 17          expected_game_value += cfr(i_map)
 18          for _, v in i_map.items():
 19              v.next_strategy()
 20
 21      expected_game_value /= n_iterations
 22
 23      display_results(expected_game_value, i_map)
 24
 25
 26  def cfr(i_map, history="", card_1=-1, card_2=-1, pr_1=1, pr_2=1, pr_c=1):
 27      """
 28      Counterfactual regret minimization algorithm.
 29
 30      Parameters
 31      ----------
 32
 33      i_map: dict
 34          Dictionary of all information sets.
 35      history : [{'r', 'c', 'b'}], str
 36          A string representation of the game tree path we have taken.
 37          Each character of the string represents a single action:
 38
 39          'r': random chance action
 40          'c': check action
 41          'b': bet action
 42      card_1 : (0, 2), int
 43          player A's card
 44      card_2 : (0, 2), int
 45          player B's card
 46      pr_1 : (0, 1.0), float
 47          The probability that player A reaches `history`.
 48      pr_2 : (0, 1.0), float
 49          The probability that player B reaches `history`.
 50      pr_c: (0, 1.0), float
 51          The probability contribution of chance events to reach `history`.
 52      """
 53      if is_chance_node(history):
 54          return chance_util(i_map)
 55
 56      if is_terminal(history):
 57          return terminal_util(history, card_1, card_2)
 58
 59      n = len(history)
 60      is_player_1 = n % 2 == 0
 61      info_set = get_info_set(i_map, card_1 if is_player_1 else card_2, history)
 62
 63      strategy = info_set.strategy
 64      if is_player_1:
 65          info_set.reach_pr += pr_1
 66      else:
 67          info_set.reach_pr += pr_2
 68
 69      # Counterfactual utility per action.
 70      action_utils = np.zeros(_N_ACTIONS)
 71
 72      for i, action in enumerate(["c", "b"]):
 73          next_history = history + action
 74          if is_player_1:
 75              action_utils[i] = -1 * cfr(i_map, next_history,
 76                                         card_1, card_2,
 77                                         pr_1 * strategy[i], pr_2, pr_c)
 78          else:
 79              action_utils[i] = -1 * cfr(i_map, next_history,
 80                                         card_1, card_2,
 81                                         pr_1, pr_2 * strategy[i], pr_c)
 82
 83      # Utility of information set.
 84      util = sum(action_utils * strategy)
 85      regrets = action_utils - util
 86      if is_player_1:
 87          info_set.regret_sum += pr_2 * pr_c * regrets
 88      else:
 89          info_set.regret_sum += pr_1 * pr_c * regrets
 90
 91      return util
 92
 93
 94  def is_chance_node(history):
 95      """
 96      Determine if we are at a chance node based on tree history.
 97      """
 98      return history == ""
 99
100
101  def chance_util(i_map):
102      expected_value = 0
103      n_possibilities = 6
104      for i in range(_N_CARDS):
105          for j in range(_N_CARDS):
106              if i != j:
107                  expected_value += cfr(i_map, "rr", i, j,
108                                        1, 1, 1/n_possibilities)
109      return expected_value/n_possibilities
110
111
112  def is_terminal(history):
113      """
114      Returns true if the history is a terminal history.
115      """
116      possibilities = {"rrcc": True, "rrcbc": True,
117                       "rrcbb": True, "rrbc": True, "rrbb": True}
118      return history in possibilities
119
120
121  def terminal_util(history, card_1, card_2):
122      """
123      Returns the utility of a terminal history.
124      """
125      n = len(history)
126      card_player = card_1 if n % 2 == 0 else card_2
127      card_opponent = card_2 if n % 2 == 0 else card_1
128
129      if history == "rrcbc" or history == "rrbc":
130          # Last player folded. The current player wins.
131          return 1
132      elif history == "rrcc":
133          # Showdown with no bets
134          return 1 if card_player > card_opponent else -1
135
136      # Showdown with 1 bet
137      assert(history == "rrcbb" or history == "rrbb")
138      return 2 if card_player > card_opponent else -2
139
140
141  def card_str(card):
142      if card == 0:
143          return "J"
144      elif card == 1:
145          return "Q"
146      return "K"
147
148
149  def get_info_set(i_map, card, history):
150      """
151      Retrieve information set from dictionary.
152      """
153      key = card_str(card) + " " + history
154      info_set = None
155
156      if key not in i_map:
157          info_set = InformationSet(key)
158          i_map[key] = info_set
159          return info_set
160
161      return i_map[key]
162
163
164  class InformationSet():
165      def __init__(self, key):
166          self.key = key
167          self.regret_sum = np.zeros(_N_ACTIONS)
168          self.strategy_sum = np.zeros(_N_ACTIONS)
169          self.strategy = np.repeat(1/_N_ACTIONS, _N_ACTIONS)
170          self.reach_pr = 0
171          self.reach_pr_sum = 0
172
173      def next_strategy(self):
174          self.strategy_sum += self.reach_pr * self.strategy
175          self.strategy = self.calc_strategy()
176          self.reach_pr_sum += self.reach_pr
177          self.reach_pr = 0
178
179      def calc_strategy(self):
180          """
181          Calculate current strategy from the sum of regret.
182          """
183          strategy = self.make_positive(self.regret_sum)
184          total = sum(strategy)
185          if total > 0:
186              strategy = strategy / total
187          else:
188              n = _N_ACTIONS
189              strategy = np.repeat(1/n, n)
190
191          return strategy
192
193      def get_average_strategy(self):
194          """
195          Calculate average strategy over all iterations. This is the
196          Nash equilibrium strategy.
197          """
198          strategy = self.strategy_sum / self.reach_pr_sum
199
200          # Purify to remove actions that are likely a mistake
201          strategy = np.where(strategy < 0.001, 0, strategy)
202
203          # Re-normalize
204          total = sum(strategy)
205          strategy /= total
206
207          return strategy
208
209      def make_positive(self, x):
210          return np.where(x > 0, x, 0)
211
212      def __str__(self):
213          strategies = ['{:03.2f}'.format(x)
214                        for x in self.get_average_strategy()]
215          return '{} {}'.format(self.key.ljust(6), strategies)
216
217
218  def display_results(ev, i_map):
219      print('player 1 expected value: {}'.format(ev))
220      print('player 2 expected value: {}'.format(-1 * ev))
221
222      print()
223      print('player 1 strategies:')
224      sorted_items = sorted(i_map.items(), key=lambda x: x[0])
225      for _, v in filter(lambda x: len(x[0]) % 2 == 0, sorted_items):
226          print(v)
227      print()
228      print('player 2 strategies:')
229      for _, v in filter(lambda x: len(x[0]) % 2 == 1, sorted_items):
230          print(v)
231
232
233  if __name__ == "__main__":
234      main()
```