Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Solutions: 27 — Working set vs cache

Exercise 1 — Compute your working sets

For the motion system (pos_x, pos_y, vel_x, vel_y, energy at float32):

Nbytescache regime (typical 2026 desktop)
1,00020 KBfits L1 (32-48 KB)
10,000200 KBspills to L2 (1-2 MB)
100,0002 MBborderline L2/L3
1,000,00020 MBfits L3 (16-32 MB)
10,000,000200 MBspills L3 to RAM

The cliff is at the L3 → RAM transition. The exact size depends on your CPU’s L3 (run lscpu from §1 exercise 1 to confirm).

For the starvation system (reads energy only — 4 bytes per creature):

Nbytesregime
100,000400 KBL1 cap on this CPU
1,000,0004 MBL2/L3 boundary
10,000,00040 MBspills L3

The starvation system fits more creatures per cache level than motion, because it touches fewer bytes per row. Smaller working set, larger headroom.

Exercise 2 — Find your cliff

uv run code/measurement/cache_cliffs.py

From §1 — gather column (random access):

N           gather (ns/elem)
10,000          1.62
100,000         2.24
1,000,000       3.69
10,000,000      7.60
100,000,000     7.78

Transitions visible: 10K → 100K (L1 → L2, ~1.4×), 100K → 1M (L2 → L3, ~1.6×), 1M → 10M (L3 → RAM, ~2.1×). The cliff is shallowest at the L1/L2 boundary and steepest at L3/RAM on this machine.

Exercise 3 — Reduce the working set

Splitting the motion’s row from (pos_x, pos_y, vel_x, vel_y, energy, birth_t, id, gen) = 36 bytes to (pos_x, pos_y, vel_x, vel_y, energy) = 20 bytes:

  • Motion’s working set at 1M creatures: 36 MB → 20 MB. Still fits L3.
  • At 2M: 72 MB → 40 MB. The 72-MB version spilled to RAM; the 40-MB version fits L3.

So the cliff moves outward by ~1.8× — exactly the bytes-ratio. But in pure SoA-in-numpy, motion already reads only the hot columns because each column is its own buffer; reading pos_x does not touch birth_t’s memory. The split is organisational, not a working-set reduction. The chapter’s framing applies: timing does not change.

The split does reduce working set in structured array layout, where reading arr['pos_x'] strides past birth_t’s bytes. Confirmed by exercise 4 of §26: structured array is 8× slower than SoA columns at the same N.

Exercise 4 — A wider dtype

energy = np.zeros(n, dtype=np.float64)         # was float32 — doubles the bytes

Working set per creature: 20 → 24 bytes (one column doubled). Cliff moves inward by ~20%. At N=1M, working set 24 MB → still fits typical L3. At N=1.5M, 36 MB → starts to spill. The motion timing rises proportionally to the bytes read (sequential access is bandwidth-bound; bytes moved is the cost).

This is §2’s narrowest-dtype discipline re-applied at scale. Choosing float32 over float64 doubles your population headroom in cache. The choice is not aesthetic — it is “how many creatures can my simulator host at L3-resident speed?”

Exercise 5 — Random vs sequential, your machine

From your cache_cliffs.py output:

sizegather/seq
10K2-4×
100K~10×
1M~20×
10M~40-50×
100M~50-80×

The 100M figure is your machine’s L1-to-RAM cost gap on this run. On modern desktops 50-80×; on Pi 4 / 2012 Intel, closer to 30-40×; on Apple Silicon, somewhere in between.

Memorise the number. When a colleague says “the data structure I wrote does random lookups; I think it’s fast,” ask them for N. If N puts the working set past L3, multiply their best-case estimate by your machine’s gather/seq ratio. That’s the real cost.

Exercise 6 — The L1 sweet spot (stretch)

L1 is ~48 KB on this CPU; 75% = 36 KB. At 20 bytes per row, that’s ~1,800 creatures. Closest power-of-10-ish: 1,500-2,000.

import time, numpy as np

for n in (1_500, 1_800, 2_000, 10_000):
    pos_x  = np.zeros(n, dtype=np.float32)
    pos_y  = np.zeros(n, dtype=np.float32)
    vel_x  = np.ones(n,  dtype=np.float32)
    vel_y  = np.ones(n,  dtype=np.float32)
    energy = np.zeros(n, dtype=np.float32)
    dt = 1/30.0
    # warm up
    for _ in range(50):
        pos_x += vel_x * dt; pos_y += vel_y * dt
    t = time.perf_counter()
    for _ in range(1_000):
        pos_x += vel_x * dt; pos_y += vel_y * dt
    elapsed = (time.perf_counter() - t) / 1_000
    print(f"N={n:>5}: motion {elapsed*1e6:.2f} µs ({elapsed*1e9/n:.2f} ns/elem)")

Expected pattern: N=1500 and N=1800 stay around 0.2 ns/elem (L1-resident). N=10,000 jumps to 0.5-0.8 ns/elem (L2-resident — 3-5× slower).

The L1-resident regime is where you want hot inner loops to live. Any code path that runs every tick over a small data set should be sized so the data fits L1 — that’s the difference between “fast” and “very fast.” For the simulator, this matters most for per-creature derived columns (an urgency_score of length N_hot) that are computed and consumed within a single system.