29 — The wall at 10K → 1M
Concept node: see the DAG and glossary entry 29.

A simulator that runs cleanly at 10,000 creatures often grinds to a halt at 1,000,000. Not because the algorithm changed — because constant factors that were invisible at the smaller scale now bind.
This chapter is about finding the wall. The fixes are techniques you already have: hot/cold splits (§26), working-set discipline (§27), sort for locality (§28), pre-sized buffers, batched cleanup. The chapter’s job is to teach the reader to measure — to find which constant factors blew up.
Walls Python hits, named
- Pre-allocation skipped. A
to_insert: list[CreatureRow]that grew lazily was fine at 100 appends per tick (10K creatures × 1% reproduction). At 10K appends per tick (1M × 1%), Python listappendis amortised O(1) but each capacity doubling is an N-byte copy; at this scale the doublings dominate. Fix: pre-size with[None] * estimated_maxplus ann_insertscounter, the same pattern §22 already uses. - Linear scans in pure Python. A list comprehension
[c for c in creatures if c.id == target_id]was 0.1 ms at 10K, but tens of milliseconds at 1M. Fix: theid_to_slotmap (§23) plus parallel presence flags. In Python the linear-scan cost is sharper than in Rust — you pay interpreter dispatch on every iteration, ~5 ns per step from §1. - Cache spillover. A
creatureworking set at 10K is 200 KB (L2-resident). At 1M it is 20 MB (L3-resident). Per-element time triples. Fix: hot/cold splits + narrower numpy dtypes. - The pandas wall. A
pandas.DataFrameof 10M rows × 20 columns at default dtypes occupies 1.6 GB+ before any operation. ADataFrame.mergeallocates intermediate copies; agroupby.applymaterialises Python objects per row; both can OOM long before the data itself would. Fix: drop pandas. Either move to numpy SoA (when the working set still fits in RAM with explicit columns) or to sqlite (when it doesn’t, or won’t long-term).code/measurement/sqlite_performance_test.pyshows sqlite delivers ~830K-900K random lookups per second on disk — fast enough to be the production answer for many workloads that pandas was struggling with. The migration is usually a one-day project that gives back days of OOM debugging per quarter. - Per-tick allocation. A system that calls
np.zeros(N)per tick was fine when N was 10,000 (40 KB). At N = 1,000,000 it is 4 MB allocated and zero-filled every tick — the malloc cost alone is significant. Fix: allocate the buffer once at startup, fill or reuse in place. - Logging. A
print(f"creature {i} ate")per event was tolerable at 10K. At 1M events it is the simulator’s bottleneck —printflushes, formats, dispatches the GIL. Fix: write to a numpy event log per §37, flush in bulk; or simply turn it off.
The pattern: any cost that was O(1) per creature, multiplied by 1M, is no longer free. Anything that was O(N) per tick at 10K is now O(N²)-equivalent in wall time. The fixes are local — each cost is a single-line change — but finding them requires measurement.
Measurement tools
The right tool is a profiler. In Python, three good options:
cProfile(stdlib).python -m cProfile -o profile.out my_sim.pyrecords every Python-level function call. Read withpython -m pstats profile.outorsnakeviz. Fine for finding hot Python functions; opaque to numpy internals (numpy ops show up as one C call).py-spy(third-party).py-spy record -o flame.svg -- python my_sim.pyproduces a flame graph similar toperf. Sees the C stack inside numpy ops, whichcProfiledoes not. The right tool when the bottleneck is inside numpy.perf(Linux). The same tool the Rust edition uses.perf record -- python my_sim.py; perf reportreads at the OS level; sees everything but interprets nothing — you read raw symbols.
The same simulator at 10K and 1M produces different flame graphs; the wall is the difference.
Calibration
A useful exercise: run your simulator at 10K for 1,000 ticks; time it. Run at 1M for 100 ticks (same total entity-ticks); time it. The 1M version should take roughly 10× longer, not 100×. If it takes 100×, something has crossed a constant-factor wall and the profiler will show you what.
The fix is structural. Apply the techniques: hot/cold, working set, sort for locality, pre-sized buffers, batched cleanup, deterministic structures. Each is a chapter you have already read. The wall is the moment they all become non-optional.
Exercises
- Calibration. Run your simulator at N = 10,000 for 1,000 ticks. Time it. Note the wall-clock total.
- Scale up. Run at N = 1,000,000 for 100 ticks (same total entity-ticks). Time it. Compute the ratio.
- Profile with cProfile.
python -m cProfile -s cumulative my_sim.py | head -30. Identify the top three hottest functions. - Profile with py-spy.
py-spy record -o flame.svg -- python my_sim.py. Open the flame graph in a browser. Identify hot regions inside numpy thatcProfiledid not surface. - Pre-size cleanup buffers. Replace
to_insert = []plusto_insert.append(...)with a pre-sized array plus ann_insertscounter (the §22 pattern). Re-run; re-profile. The list-resize calls should disappear from the hot list. - Hot/cold split. Apply the §26 split organisationally. Re-run; re-profile. In numpy SoA you may see no change in the profile (per §26’s framing); in numpy structured-array form you should see a visible improvement.
- Use index maps. Replace any linear
np.where(arr == target)[0]lookup with the §23id_to_slotform. Re-run; re-profile. - The pandas wall, hands-on. Build a pandas DataFrame of 5M rows × 10 float64 columns. Note its memory (
df.memory_usage(deep=True).sum() / 1e6MB). Now move the same data into 10 numpyfloat32columns; note the memory ratio. Now move it into a sqlite table; note the disk size and a sample lookup time usingsqlite_performance_test.pyas a template. Decide consciously which form fits your workload. - (stretch) Find one new wall. Pick any system in your simulator and find one constant factor that scales worse than expected. The fix is usually one of the techniques above; identifying which one is the lesson.
Reference notes in 29_wall_10k_to_1m_solutions.md.
What’s next
§30 — Moving beyond the wall takes the next step: when even your fastest, tightest, hot/cold-split, sorted-for-locality simulator no longer fits in RAM, the architecture itself shifts.