Blackjack and Dynamic Programming

Author

Bas Machielsen

Published

January 11, 2026

Introduction

I’ve always been fascinated by casinos. Understanding games like roulette and blackjack was a real motivation for me when I was studying probability theory. Even though I now think it’s almost trivial, it was eye-opening to me when I first understood the “magic” of roulette by realizing that the odds of winning simply aren’t compensated by the multiplier you get when you win. From that moment on, I understood what “the house always wins” meant in a very concrete way.

Needless to say, after I discovered that, for a while I thought that people playing roulette must be among the dumbest people ever. Only later did I realize that (i) it really wasn’t that impressive to figure out a relatively simple piece of math and (ii) people play games like that for reasons other than picking the optimal bet.1

Anyway, I think I’ve been to a casino once or twice since understanding probability (I don’t think I ever went before that), but I don’t recall ever having played. I guess I just didn’t fancy facing an unfair (i.e. negative expected value) bet. One possible exception might be Blackjack, which, depending on your strategy, the payoffs and the “entry price”, might actually be a fair bet or even have positive expected value.

In this post, I want to analyze Blackjack mathematically using Dynamic Programming (DP). I’ll model the game as a Markov Decision Process (MDP), assuming that card probabilities remain constant (i.e. sampling with replacement).2 I’ll also assume, to keep things simple, that the dealer follows a fixed strategy, described below.

1. Game Setup and Notation

Let the set of card values be \(C = \{2, 3, \dots, 9, 10, 11\}\), where 10 represents \(\{10, J, Q, K\}\) and 11 represents an Ace.

The probability of drawing card \(c\), denoted \(p(c)\), is: \[ p(c) = \begin{cases} 1/13 & \text{if } c \in \{2, \dots, 9, 11\} \\ 4/13 & \text{if } c = 10 \end{cases} \]
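
As a quick sanity check of this infinite-deck model (a throwaway snippet, separate from the implementation below):

p = {c: (4/13 if c == 10 else 1/13) for c in range(2, 12)}  # card values 2..11
assert abs(sum(p.values()) - 1.0) < 1e-12                   # probabilities sum to 1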

The State Space

A state \(S\) is defined by the tuple \((x, u, d)\), where:

  1. \(x\): The player’s current sum (\(4 \leq x \leq 21\)); the minimum is 4 because you are dealt two cards.
  2. \(u\): Dummy variable for a usable ace (1 if the player has an Ace counted as 11, 0 otherwise).
  3. \(d\): The dealer’s visible up-card (\(2 \leq d \leq 11\)).

Definitions: Soft and Hard Hands

A hand is Soft if it contains an Ace that is currently counted as 11 without the total exceeding 21.

A hand is Hard if either:

  1. It contains no Aces.

  2. It contains Aces, but they are all forced to count as 1 to prevent the total from exceeding 21.
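
To make the bookkeeping concrete, here is a minimal hand classifier (a sketch for this section only; hand_value and the list-of-cards input are conventions of this example, while the implementation below works directly with (sum, ace) states):

def hand_value(cards):
    """Return (total, is_soft) for a list of card values, aces given as 11."""
    total = sum(1 if c == 11 else c for c in cards)   # hard total: aces as 1
    if 11 in cards and total + 10 <= 21:
        return total + 10, True    # one ace counted as 11: a soft hand
    return total, False            # a hard hand (possibly bust)

print(hand_value([11, 6]))        # (17, True):  soft 17
print(hand_value([11, 6, 10]))    # (17, False): hard 17
print(hand_value([11, 11, 2]))    # (14, True):  soft 14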

2. The Dealer’s Terminal Distribution

I said before that the dealer plays a fixed strategy: “Hit until 17.” We calculate the dealer’s outcome independently before solving for the player’s optimal policy.

Let \(D(h \mid s, u)\) be the probability that the dealer ends up with a final total \(h\), given that they currently hold a sum \(s\) with usable-ace status \(u\). However, the player only observes the up-card \(d\), so what we need is \(P(d_{final} = k \mid d)\), where \(k \in \{17, 18, 19, 20, 21, \text{Bust}\}\).

This is a sub-problem solvable via recursion or a system of linear equations. Let \(\phi(s, u)\) be the probability distribution vector of the dealer’s final result starting from state \((s, u)\) (i.e. sum, ace status).

The Dealer’s transition rules are as follows:

  • Stand Condition: If \(s \geq 17\) (assuming Dealer stands on Soft 17), the dealer stops. \[\phi(s, u) = \mathbf{e}_s \quad (\text{vector with 1 at index } s)\]

  • Hit Condition: If \(s < 17\), the dealer draws a card \(c\). \[\phi(s, u) = \sum_{c \in C} p(c) \cdot \phi(s', u')\]

Where the transition logic \((s, u) \xrightarrow{c} (s', u')\) is easiest to state via the hard total \(h = s - 10u\) (every Ace counted as 1):

  • \(h' = h + c\), except that a drawn Ace enters the count as 1: \(h' = h + 1\) if \(c = 11\).
  • If the hand contains an Ace (\(u = 1\) or \(c = 11\)) and \(h' + 10 \leq 21\): \(s' = h' + 10\), \(u' = 1\) (one Ace is counted as 11).
  • Otherwise: \(s' = h'\), \(u' = 0\), and the hand busts if \(h' > 21\).

Note that this also covers drawing an Ace into a soft hand: the new Ace counts as 1 and the hand stays soft as long as the total fits, so soft 13 plus an Ace is soft 14, not hard 14.

We calculate \(\phi(d, \text{has\_ace}(d))\) for all up-cards \(d\), where \(\text{has\_ace}(d) = 1\) only for \(d = 11\), to get the probability that the dealer ends with total \(k\), denoted \(\mathbb{P}(D_k \mid d)\).

Implementation in Python

import numpy as np
from functools import lru_cache

# Outcome Indices: [17, 18, 19, 20, 21, Bust]
# 0 -> 17
# 1 -> 18
# ...
# 4 -> 21
# 5 -> Bust

# Card probabilities (Infinite Deck)
# 2-9: 1/13, 10: 4/13, Ace: 1/13
CARD_PROBS = {c: 1/13 for c in range(2, 12)} # 2 to 11
CARD_PROBS[10] = 4/13

@lru_cache(maxsize=None)
def get_dealer_dist(s, u):
    """
    Returns probability vector [P(17), ..., P(21), P(Bust)]
    s: current sum
    u: boolean (1 if usable ace, 0 otherwise)
    """
    
    # 1. Base Case: Bust
    if s > 21:
        # Vector: [0, 0, 0, 0, 0, 1]
        res = np.zeros(6)
        res[5] = 1.0
        return res

    # 2. Base Case: Stand (Dealer stands on >= 17)
    if s >= 17:
        # Map 17->0, 18->1, ... 21->4
        res = np.zeros(6)
        res[s - 17] = 1.0
        return res

    # 3. Recursive Step (Hit)
    # The dealer must hit. We sum the probabilities of the next states.
    final_dist = np.zeros(6)
    
    for card, prob in CARD_PROBS.items():
        # Work with the hard total (every ace counted as 1): it strictly
        # increases with each draw, so the recursion always terminates.
        hard = s - 10 * u + (1 if card == 11 else card)

        # An ace (newly drawn or already held) is counted as 11 whenever
        # the total allows it. This also gets the multi-ace case right:
        # soft 13 + ace is soft 14, not hard 14.
        if (u == 1 or card == 11) and hard + 10 <= 21:
            s_next, u_next = hard + 10, 1
        else:
            s_next, u_next = hard, 0

        final_dist += prob * get_dealer_dist(s_next, u_next)
        
    return final_dist

# Execution and Formatting
# 1. Save object for later use

# Possible Dealer Up-Cards: 2 through 11 (Ace)
cards = list(range(2, 12))
prob_vectors = []   

for d in cards:
    # Initial state: 
    # Sum is the card value. 
    # Usable ace is True only if the up-card is 11 (Ace).
    has_ace = 1 if d == 11 else 0
    probs = get_dealer_dist(d, has_ace)
    prob_vectors.append(probs)

probabilities = np.array(prob_vectors)

# 2. Print table for display
def print_dealer_table():
    print(f"{'Up-Card':<8} | {'17':<7} {'18':<7} {'19':<7} {'20':<7} {'21':<7} {'Bust':<7}")
    print("-" * 60)
    
    # Possible Dealer Up-Cards: 2 through 11 (Ace)
    cards = list(range(2, 12))
    
    for d in cards:
        # Initial state: 
        # Sum is the card value. 
        # Usable ace is True only if the up-card is 11 (Ace).
        has_ace = 1 if d == 11 else 0
        
        probs = get_dealer_dist(d, has_ace)
        
        # Formatting for display
        card_name = "Ace" if d == 11 else str(d)
        row_str = f"{card_name:<8} | " + " ".join([f"{p:.4f} " for p in probs])
        print(row_str)

print_dealer_table()
Up-Card  | 17      18      19      20      21      Bust   
------------------------------------------------------------
2        | 0.1395  0.1346  0.1293  0.1237  0.1176  0.3553 
3        | 0.1347  0.1302  0.1253  0.1200  0.1144  0.3754 
4        | 0.1302  0.1256  0.1211  0.1162  0.1109  0.3960 
5        | 0.1223  0.1223  0.1177  0.1131  0.1082  0.4164 
6        | 0.1654  0.1063  0.1063  0.1017  0.0972  0.4232 
7        | 0.3686  0.1378  0.0786  0.0786  0.0741  0.2623 
8        | 0.1286  0.3593  0.1286  0.0694  0.0694  0.2447 
9        | 0.1200  0.1200  0.3508  0.1200  0.0608  0.2284 
10       | 0.1114  0.1114  0.1114  0.3422  0.1114  0.2121 
Ace      | 0.1262  0.1262  0.1262  0.1262  0.3570  0.1383 

3. The Player’s Bellman Equation

The objective is to maximize the expected reward \(V(x, u, d)\). The rewards are: Win (+1), Loss (-1), or Push (0).

Action Space

The player chooses \(a \in \{\text{Hit}, \text{Stand}\}\); to keep the model small, Doubling and Splitting are not modeled. \[ V(x, u, d) = \max \left( V_{\text{stand}}(x, d), \ V_{\text{hit}}(x, u, d) \right) \]

1. Value of Standing (\(V_{\text{stand}}\))

If the player stands on total \(x\), the value is determined by comparing \(x\) against the dealer’s final distribution calculated in Section 2.

\[ V_{\text{stand}}(x, d) = \sum_{k \in \{17, \dots, 21, \text{Bust}\}} \mathbb{P}(D_k | d) \cdot R(x, k) \]

Where the reward function \(R(x, k)\) is: \[ R(x, k) = \begin{cases} 1 & \text{if } k = \text{Bust} \\ 1 & \text{if } x > k \\ -1 & \text{if } x < k \\ 0 & \text{if } x = k \end{cases} \]
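
For example, standing on 16 against a dealer 10: every dealer total from 17 through 21 beats us, so reading the up-card-10 row of the table above, \(V_{\text{stand}}(16, 10) = 0.2121 \cdot (+1) + 0.7879 \cdot (-1) \approx -0.58\).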

2. Value of Hitting (\(V_{\text{hit}}\))

If the player hits, they draw a card \(c\). We average the value of the next state over all possible cards.

\[ V_{\text{hit}}(x, u, d) = \sum_{c \in C} p(c) \cdot V_{\text{next}}(x', u', d) \]

Where the next state logic handles busting: \[ V_{\text{next}}(x', u', d) = \begin{cases} -1 & \text{if } x' > 21 \text{ (Bust)} \\ V(x', u', d) & \text{otherwise} \end{cases} \]

The update logic for \(x'\) and \(u'\) is the same as the dealer’s logic defined in Section 2.
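
For instance, hitting hard 20 only survives on an Ace: \(V_{\text{hit}}(20, 0, d) = \frac{1}{13} V(21, 0, d) - \frac{12}{13}\), which is at most \(-11/13 \approx -0.85\) and therefore far worse than standing for every up-card \(d\).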

4. State-Transition Matrix Formulation

Since the game is acyclic (the hard total, i.e. the sum with every Ace counted as 1, strictly increases with each card drawn, even when a soft hand becomes hard), we can represent the hitting phase as a linear system or solve it via backward induction.

Let \(\mathbf{V}\) be a vector representing the value of every possible player state \((x, u)\) for a fixed dealer card \(d\), ordered by the hard total \(x - 10u\) (so, for instance, soft 15 appears before hard 12):

\[ \mathbf{V} = [V(12,1), V(13,1), V(4,0), V(14,1), V(5,0), \dots, V(19,0), V(20,0)]^T \]

The Bellman equation in vector form is: \[ \mathbf{V} = \max \left( \mathbf{R}_{\text{stand}}, \mathbf{T} \mathbf{V} + \mathbf{B} \right) \]

Where:

  • \(\mathbf{R}_{\text{stand}}\) is the vector of expected returns if we stand immediately in each state.
  • \(\mathbf{T}\) is the State-Transition Matrix for the “Hit” action. \(T_{ij}\) is the probability of moving from state \(i\) to state \(j\) upon drawing a card.
  • \(\mathbf{B}\) is the “Bust” vector, containing \(-1 \times (\text{Probability of busting from state } i)\).

Matrix \(\mathbf{T}\) is derived as follows.

For a state \(i = (x, u)\) and target state \(j = (y, v)\):

\[ T_{ij} = \sum_{c \in C} p(c) \cdot \mathbb{I}(\text{next}(x,u, c) = (y,v)) \]

Because the hard total strictly increases with every draw, \(\mathbf{T}\) is strictly upper triangular (assuming we order states by their hard total \(x - 10u\), as above).

For example, from Hard 12 (\(x=12, u=0\)):

  • Draw 2 \(\to\) Hard 14 (\(p=1/13\))
  • Draw 9 \(\to\) Hard 21 (\(p=1/13\))
  • Draw 10 \(\to\) Bust (\(p=4/13\)) \(\to\) contributes to vector \(\mathbf{B}\), not \(\mathbf{T}\).
  • Draw A \(\to\) Hard 13 (\(p=1/13\)) 3
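
This throwaway snippet reproduces exactly that row (next_state is a helper written for this example, implementing the transition rules from Section 2):

from collections import defaultdict

def next_state(x, u, c):
    """Player transition on a hit: returns (x', u') or 'Bust'."""
    hard = x - 10 * u + (1 if c == 11 else c)   # aces enter the count as 1
    if (u == 1 or c == 11) and hard + 10 <= 21:
        return (hard + 10, 1)                   # one ace is counted as 11
    return (hard, 0) if hard <= 21 else "Bust"

row = defaultdict(float)
for c in range(2, 12):                          # card values 2..11
    row[next_state(12, 0, c)] += 4/13 if c == 10 else 1/13

for state, p in sorted(row.items(), key=str):
    print(state, round(p, 4))                   # (13,0)..(21,0) at 0.0769, Bust at 0.3077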

5. Summary of the Algorithm

To solve the game (i.e., find Basic Strategy):

  1. Compute Dealer Distribution: Calculate \(\mathbb{P}(D_k | d)\) for all up-cards \(d\) using the recursive relations for the dealer’s fixed rules.
  2. Compute Stand Values: For every player state \((x, u)\) and dealer card \(d\), compute \(V_{\text{stand}}(x, d)\).
  3. Backward Induction: Iterate over states in decreasing order of the hard total \(x - 10u\); every hit strictly increases the hard total, so each state’s successors are already solved. (Equivalently, as in the code below, sweep the Bellman update repeatedly until the values stop changing.)
    • For \(x=21\), \(V(21, \dots) = V_{\text{stand}}(21, \dots)\) (hitting is a guaranteed bust or suboptimal).
    • For \(x < 21\): \[ V(x, u, d) = \max \left( V_{\text{stand}}, \sum p(c)V(\text{next state}) \right) \]
  4. Policy Extraction: The optimal policy \(\pi^*(x, u, d)\) is simply the argument of the maximum operator (Hit or Stand).

This DP formulation provides the exact expected value of every hand.

Python Implementation

import numpy as np
import pandas as pd

# 1. Constants and Dealer Distribution
# Rows: Dealer Up-card 2-9, 10, A. Cols: 17, 18, 19, 20, 21, Bust
P_DEALER = probabilities

# Card probabilities: 2-9=1/13, 10=4/13, A=1/13
# We treat index 0 as card value 2, index 9 as card value 11 (Ace)
CARD_PROBS = np.array([1/13]*8 + [4/13] + [1/13])
CARD_VALS  = np.arange(2, 12) 

# State Space Config
# Sums 0 to 31 (to handle bust calculation easily).
# 0: Hard, 1: Soft (Usable Ace)
# Shape: (Sum, Soft/Hard, DealerCard)
V = np.zeros((32, 2, 10)) 

# 2. Compute R_Stand (Vector with V_Stands)
# R_stand[player_sum, dealer_card_idx]
# We calculate this for all player sums 4-21.
# Logic: +1 if P_sum > D_sum, -1 if P_sum < D_sum, 0 if Tie, +1 if D busts.

R_stand = np.zeros((32, 10)) # Independent of Soft/Hard, only Sum matters

outcomes = np.array([17, 18, 19, 20, 21]) # Dealer non-bust totals

for s in range(4, 22):
    # Vectorized calculation against all dealer cards at once
    # 1. Win by Dealer Bust
    ev = 1.0 * P_DEALER[:, 5] 
    
    # 2. Compare sums against dealer non-bust outcomes
    # Broadcasting: scalar s is compared elementwise to the (5,) outcomes vector
    win  = (s > outcomes).astype(float)
    lose = -1.0 * (s < outcomes).astype(float)
    # tie is 0, implicit
    
    # Dot product of outcomes prob and win/loss vectors
    # P_DEALER[:, :5] is (10, 5). (win+lose) is (5,)
    ev += P_DEALER[:, :5] @ (win + lose)
    
    R_stand[s] = ev

# Initialize Value function with Stand values (and -1 for bust states > 21)
V[:22, :, :] = R_stand[:22, np.newaxis, :] # Broadcast to Hard/Soft
V[22:, :, :] = -1.0 # BUST states always return -1

# 3. Value Iteration (Solving the Bellman Equation)
# Iterate until the Value matrix stops changing. 
# The game graph is acyclic (the hard total strictly increases per hit),
# so a handful of sweeps suffices; 10 passes is more than enough.

for _ in range(10):
    V_prev = V.copy()
    
    # Calculate Expected Value of Hitting for all states 4..21
    # We do this for Hard (u=0) and Soft (u=1)
    
    for s in range(4, 22):
        for u in [0, 1]: # 0=Hard, 1=Soft
            
            # Expected value of next state (averaged over cards)
            ev_hit_vector = np.zeros(10) # For all dealer cards
            
            for c_idx, card in enumerate(CARD_VALS):
                prob = CARD_PROBS[c_idx]
                
                # Next-State Logic via the hard total (every ace counted
                # as 1). A drawn ace enters the count as 1; an ace (new or
                # already held) is promoted to 11 whenever the total allows.
                # This also handles multi-ace hands: soft 13 + ace is
                # soft 14, not hard 14.
                hard = s - 10 * u + (1 if card == 11 else card)
                if (u == 1 or card == 11) and hard + 10 <= 21:
                    s_next, u_next = hard + 10, 1
                else:
                    s_next, u_next = hard, 0
                
                # Accumulate value (any total above 21 was initialized to
                # V = -1). Cap the index at 31 to stay inside the array.
                s_idx = min(s_next, 31)
                ev_hit_vector += prob * V_prev[s_idx, u_next, :]
            
            # Bellman Equation: V = max(R_stand, E[V_next])
            V[s, u, :] = np.maximum(R_stand[s], ev_hit_vector)

# 4. Extract Optimal Policy
# Compare V_stand vs V_hit one last time to define action
# Action 1 = Hit, 0 = Stand

policy_hard = np.zeros((18, 10), dtype=int) # 4 to 21
policy_soft = np.zeros((10, 10), dtype=int) # 12 to 21

# Hard Hands (4 to 21)
for s in range(4, 22):
    # Recalculate hit value (cleanest way to compare)
    ev_hit = np.zeros(10)
    for c_idx, card in enumerate(CARD_VALS):
        # Same hard-total transition logic as above, from a hard hand (u=0)
        hard = s + (1 if card == 11 else card)
        if card == 11 and hard + 10 <= 21:
            s_n, u_n = hard + 10, 1
        else:
            s_n, u_n = hard, 0
        ev_hit += CARD_PROBS[c_idx] * V[min(s_n, 31), u_n, :]
    
    # If Hit > Stand, Policy = 1 (Hit)
    policy_hard[s-4] = (ev_hit > R_stand[s] + 1e-9).astype(int)

# Soft Hands (12 to 21)
for s in range(12, 22):
    ev_hit = np.zeros(10)
    for c_idx, card in enumerate(CARD_VALS):
        # Transition from a soft hand (u=1): the ace stays usable whenever
        # the total still fits, so soft + ace can stay soft
        hard = (s - 10) + (1 if card == 11 else card)
        if hard + 10 <= 21:
            s_n, u_n = hard + 10, 1
        else:
            s_n, u_n = hard, 0
        ev_hit += CARD_PROBS[c_idx] * V[min(s_n, 31), u_n, :]
        
    policy_soft[s-12] = (ev_hit > R_stand[s] + 1e-9).astype(int)

# 5. Display results
def print_policy(name, data, start_idx):
    print(f"\n--- {name} Strategy (H=Hit, .=Stand) ---")
    print("      | D: 2 3 4 5 6 7 8 9 T A")
    print("------+-----------------------")
    for i, row in enumerate(data):
        player_sum = start_idx + i
        row_str = " ".join(["H" if x else "." for x in row])
        print(f"{player_sum:>5} |    {row_str}")

print_policy("Hard Hand", policy_hard, 4)
print_policy("Soft Hand", policy_soft, 12)

--- Hard Hand Strategy (H=Hit, .=Stand) ---
      | D: 2 3 4 5 6 7 8 9 T A
------+-----------------------
    4 |    H H H H H H H H H H
    5 |    H H H H H H H H H H
    6 |    H H H H H H H H H H
    7 |    H H H H H H H H H H
    8 |    H H H H H H H H H H
    9 |    H H H H H H H H H H
   10 |    H H H H H H H H H H
   11 |    H H H H H H H H H H
   12 |    H H . . . H H H H H
   13 |    . . . . . H H H H H
   14 |    . . . . . H H H H H
   15 |    . . . . . H H H H H
   16 |    . . . . . H H H H H
   17 |    . . . . . . . . . .
   18 |    . . . . . . . . . .
   19 |    . . . . . . . . . .
   20 |    . . . . . . . . . .
   21 |    . . . . . . . . . .

--- Soft Hand Strategy (H=Hit, .=Stand) ---
      | D: 2 3 4 5 6 7 8 9 T A
------+-----------------------
   12 |    H H H H H H H H H H
   13 |    H H H H H H H H H H
   14 |    H H H H H H H H H H
   15 |    H H H H H H H H H H
   16 |    H H H H H H H H H H
   17 |    H H H H H H H H H H
   18 |    . . . . . . . H H H
   19 |    . . . . . . . . . .
   20 |    . . . . . . . . . .
   21 |    . . . . . . . . . .
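
Finally, circling back to the intro’s question of whether Blackjack can be a fair bet: a short sketch (my own addition, reusing V, CARD_VALS and CARD_PROBS from the code above) that averages the optimal value over the initial deal. Keep in mind that the model plays hit/stand only and ignores the 3:2 blackjack bonus, doubling and splitting, so the printed number understates the expected value of the real game:

# Pre-deal expected value: average V over the player's first two cards
# and the dealer's up-card (all drawn with replacement).
ev = 0.0
for c1, p1 in zip(CARD_VALS, CARD_PROBS):
    for c2, p2 in zip(CARD_VALS, CARD_PROBS):
        total = c1 + c2
        u = 1 if 11 in (c1, c2) else 0
        if total == 22:            # two aces: one counts as 1 -> soft 12
            total, u = 12, 1
        for d_idx, pd in enumerate(CARD_PROBS):
            ev += p1 * p2 * pd * V[total, u, d_idx]

print(f"Expected value per unit bet: {ev:+.4f}")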

6. Rule of Thumb

Distilled from the tables above, the rule of thumb to use when playing Blackjack in practice is this:

  • If you don’t have an ace:
    • And the dealer’s card is < 7:
      • You should hit up until 11, and stand at 12.
    • And the dealer’s card is \(\geq\) 7:
      • You should hit up until 16, and stand from 17.
  • If you have an ace:
    • Hit up until 17.
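
For completeness, the rule of thumb as code (a direct transcription of the list above, not of the full tables: it deliberately glosses over the two exceptions visible there, hard 12 against a dealer 2 or 3 and soft 18 against 9, 10 or Ace):

def rule_of_thumb(player_sum, has_usable_ace, dealer_card):
    """Simplified strategy: True = Hit, False = Stand."""
    if has_usable_ace:
        return player_sum <= 17    # hit soft hands up to and including 17
    if dealer_card < 7:
        return player_sum <= 11    # weak dealer card: stand from 12
    return player_sum <= 16        # strong dealer card: stand from 17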

Footnotes

  1. Though, to be fair, I still think a lot of people honestly believe that they’re facing a fair bet or something of the like.↩︎

  2. This post is based somewhat on this paper, this lecture and this blog post.↩︎

  3. Hard 12 + Ace = 13. It cannot become Soft 13 because 12+11 > 21.↩︎