31 — Disjoint write-sets parallelize freely

Concept node: see the DAG and glossary entry 31.
Two systems can run in parallel if and only if their write-sets do not overlap. That is the rule. It is small. It is what §25’s single-writer ownership buys you.
Concretely: in the simulator’s tick, motion writes pos_x, pos_y, energy; food_spawn writes food. Their write-sets are disjoint. They can run on two different processes with no coordination — no locks, no atomics, no message-passing. The data layout makes the parallelism free.
The same shape works at finer grain. The simulator’s three appliers (apply_eat, apply_reproduce, apply_starve) all read pending_event and write disjoint things — apply_eat writes food, to_remove; apply_reproduce writes to_insert; apply_starve writes to_remove. Two of the three append to the same buffer. To parallelise them, give each its own segment of to_remove (one per process), then merge at cleanup. The merge is np.concatenate — O(N) in the merged total, free relative to the work that produced it.
Not threading. Not asyncio.
This is the chapter where the GIL question finally lands. The Python reflex when a chapter says “parallel” is to reach for threading.Thread or asyncio. Both are wrong for CPU-bound parallel work in CPython.
threading does not give you parallel CPU. The Global Interpreter Lock serialises Python bytecode execution: one thread runs Python at a time, regardless of how many threads you started. Numpy bulk operations release the GIL during their C-level work, so a threading.Thread running arr.sum() can overlap with another thread doing the same — but only during the sum()’s C call, not during any Python around it. For workloads dominated by Python orchestration of numpy ops, threading delivers token speedup at best.
asyncio is a scheduler for I/O-bound work. CPU-bound systems give it nothing to overlap. The event loop adds dispatch overhead and removes nothing.
The disciplined alternative is multiprocessing plus shared_memory. __main__ allocates the world’s columns in a shared-memory region. Worker processes attach to that region, get a numpy view onto the same bytes, and write to their slice only. There is no copying across the process boundary; the bytes are shared. The GIL is no longer in the picture because each process has its own GIL, and each process is doing pure C-level numpy work on its own partition.
The shape (full version in code/measurement/parallel_motion.py):
# Worker globals — set once per worker by the Pool initializer.
_arr = None
_shm = None
def init_worker(shm_name: str) -> None:
global _arr, _shm
_shm = shared_memory.SharedMemory(name=shm_name)
_arr = np.ndarray(SHAPE, dtype=DTYPE, buffer=_shm.buf)
# _arr now views the same bytes as __main__'s array.
def worker(args: tuple[int, int]) -> None:
start, end = args
# Each worker writes only its slice; the writes go directly to
# the shared bytes via the numpy view — no copy.
_arr[0, start:end] += _arr[1, start:end] * DT
# In __main__:
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
arr = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
# ... fill arr with the world's data ...
boundaries = [(i * chunk, (i + 1) * chunk) for i in range(n_workers)]
with Pool(processes=n_workers, initializer=init_worker, initargs=(shm.name,)) as pool:
pool.map(worker, boundaries)
The shape: __main__ owns the memory; workers attach via init_worker and hold a numpy view onto the shared bytes; each worker writes only its slice; no shared writes, no locks, no message-passing.
What it costs and what it buys
From code/measurement/parallel_motion.py, two workloads applied 100 times to 10,000,000 float32 creatures on this machine (8 physical cores, 16 logical with SMT):
Workload A — memory-bound (pos += vel * dt): 12 bytes accessed per element, 2 arithmetic ops. Memory traffic dominates.
| workers | wall (s) | speedup |
|---|---|---|
| serial | 1.842 | 1.00 |
| 1 | 1.840 | 1.00 |
| 2 | 0.433 | 4.25 |
| 4 | 0.456 | 4.03 |
| 8 | 0.459 | 4.01 |
| 16 | 0.414 | 4.45 |
Workload B — compute-bound (out += sin(x)**2 + cos(x)**2): same byte accesses, much heavier per-element CPU work.
| workers | wall (s) | speedup |
|---|---|---|
| serial | 7.749 | 1.00 |
| 1 | 7.778 | 1.00 |
| 2 | 2.575 | 3.01 |
| 4 | 1.608 | 4.82 |
| 8 | 1.412 | 5.49 |
| 16 | 1.427 | 5.43 |
Three readings.
1 worker matches serial. The pool round-trip cost is amortised across the run because the rig dispatches once per measurement (each worker runs all 100 ticks on its partition before returning) — a per-tick dispatch would add IPC overhead on top, capping speedup further. See exercise 6.
Memory-bound caps at ~4×. This is the aggregate memory-bandwidth ceiling on this machine. The 76 MB working set spills L3; once two cores are reading and writing flat-out, the DRAM bus is busy. Adding a third or fourth physical core helps slightly (some bandwidth comes from the cores’ own L1/L2), but past that, more workers compete for the same bandwidth. The ceiling is set by the memory subsystem, not by core count. On a chip with more memory channels (server CPUs, modern desktops with quad-channel DDR5), the ceiling is higher; on a single-channel laptop or a Raspberry Pi, lower.
Compute-bound caps at ~5.5×, with the plateau between 8 and 16 workers. The plateau location matches the physical core count (8 here); the SMT-doubled logical count of 16 adds essentially nothing because both threads on the same core are now contending for the same arithmetic units. Compute-heavy work scales close to the physical core count; SMT helps work that has gaps the second thread can fill (mostly memory-stall waits), and pure compute has no gaps to fill.
The two ceilings are different shapes for different reasons. Measure your specific workload — neither is “wrong,” they are different bottlenecks.
Three things this rule does for you
No locks. A lock is a tax paid by every reader and writer of the locked thing. With single-writer ownership, locks are unnecessary; with disjoint write-sets across processes, they remain unnecessary at the parallel boundary. The simulator at this scale has zero Lock, zero RLock, zero Semaphore in its inner systems. The whole concurrency-primitive vocabulary you see in tutorials does not apply once the architecture is right.
Speedup is structural, not promised. N processes with disjoint work give close to N× speedup until the bottleneck shifts. Memory-bound work hits the bandwidth ceiling first; compute-bound work runs out of physical cores; per-tick dispatch hits IPC overhead. The ceilings are real and measurable; they are not reasons to avoid the architecture, only reasons to know which ceiling your workload hits.
Tools without ceremony. The Python ecosystem’s standard tools — multiprocessing.Pool, concurrent.futures.ProcessPoolExecutor, multiprocessing.shared_memory — are stdlib. No third-party crate, no external service, no orchestrator. The rig in parallel_motion.py is ~150 lines. Build it once for your simulator; reuse it everywhere.
The single-writer rule (§25) was the precondition. Disjoint write-sets is the rule applied across systems. Together, parallelism becomes a scheduling decision, not a design decision.
A calibration note
Python multiprocessing is non-trivial. The clean speedup table above hides real complexity: pickle overhead at process boundaries, fork-vs-spawn semantics that vary by platform, signal handling, queue contention, the difficulty of reasoning about a system across N process boundaries when something goes wrong. The chapter has not lied — the architecture does work and the speedup is real — but it has presented the architecture without the operational cost.
This chapter teaches the principles, not a production recipe. Single-writer ownership, disjoint write-sets, partition-don’t-lock, shared-memory-not-pickle: these are correct at every scale. Python multiprocessing is a fine implementation of those principles when your tick is comfortably above the IPC floor (≥ ~16 ms per tick, partitions of ≥ 100K elements). It stops being fine when every percent matters — a physics engine at 1 kHz, a real-time control loop, anything where the operational complexity above eats budget that a compiled language would not.
The escalation order is short: numpy → maturin → leave Python. Maturin (Rust + PyO3) gives you the same parallelism architecture without the Python orchestration tax — the inner loop, the dispatch, and the data are all in compiled Rust, exposed to Python through a thin binding. Past maturin, the answer is not to add another Python-side library; it is to leave Python entirely and write the application in Rust. The Rust standard library is enough for most parallel work; you do not need to reach for a parallel-iteration crate to do this well.
From-scratch-then-price-the-crate (§41, §42) applies here too: build it in Python first to feel the architecture; price what the next tier gives you when the budget binds. The book teaches the architecture; the language is a tooling decision.
Exercises
You will need a multi-core machine. Most desktops and laptops qualify.
- Run the rig.
uv run code/measurement/parallel_motion.py. Read your speedup column. Find the worker count where the curve flattens — that is your bandwidth ceiling. - Threading falls short. Rewrite
parallel_motionusingthreading.Threadinstead ofmultiprocessing.Pool. Keep the same partition pattern. Time it. The speedup is real but smaller (numpy releases the GIL during bulk ops, so threads can overlap during the*= dtstep, but not during anything else). Compare to the multiprocessing version. - A failing case. Try to run motion and an
apply_eatsystem in parallel where both writeenergy. Without single-writer discipline, two processes writing the same shared-memory region produce undefined behaviour. Construct the case; observe the corruption (it may be silent — that is the failure mode). - Per-process segments. Modify the rig so that, instead of motion, each worker runs
apply_starveand produces its ownto_removesegment as a separate shared-memory array. After all workers finish,np.concatenatethe segments in__main__. Verify the merged result equals a single-process run. - Find the bandwidth ceiling. Run the rig at N = 100,000 (fits L2), N = 1,000,000 (fits L3), N = 10,000,000 (spills to RAM), N = 100,000,000 (deeply RAM-resident). Plot the memory-bound speedup vs N. The bandwidth-ceiling worker count shifts with N — small N is bandwidth-rich (per-core caches), large N is bandwidth-limited.
- Per-tick dispatch costs IPC. Modify the rig so each worker runs one tick per
pool.mapcall instead of all 100 in one call. Re-run. The speedup curve will plateau lower (~3-4× for memory-bound, ~4-5× for compute-bound on this machine) because every tick now pays for one IPC round-trip. The lesson: batch when the access pattern allows. The cost is small per call, real in aggregate. - Find your physical core count.
lscpu | grep 'Core(s) per socket'(Linux). Compare toos.cpu_count(). The compute-bound ceiling lives near the physical count, not the logical count. - (stretch)
concurrent.futurescomparison. Rewrite the rig usingconcurrent.futures.ProcessPoolExecutor.map. Confirm equivalent performance. The two are largely interchangeable; pick the one whose API your team prefers. - (stretch) A pure-Python anti-comparison. Implement the same motion system as a per-creature Python loop (
for i in range(N): pos[i] += vel[i] * dt). Run it serially. Run it underthreading.Threadwith 8 threads. Run it undermultiprocessing.Poolwith 8 workers. Note: the threading version is no faster than serial (GIL), the multiprocessing version is faster but still slower than the bulk-numpy serial version, because the bulk numpy version was already faster than any pure-Python form. Multiprocessing scales work that is already fast; it does not rescue work that was wrong-shaped.
Reference notes in 31_disjoint_writes_parallelize_solutions.md.
What’s next
§32 — Partition, don’t lock takes the next step: when one system must write a single table from multiple processes, you split the table, not the access.