Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

30 — Moving beyond the wall

Concept node: see the DAG and glossary entry 30.

Read / process / update under tight resource budget

At 100 million creatures with 24 bytes of hot data each, the working set is 2.4 GB. At a billion, 24 GB. Most desktops have 16-64 GB of RAM. The simulator can no longer hold its world and its history and the OS and whatever else and operate at speed.

The fix is streaming: only the relevant slice of the world is in memory at any one time; the rest lives on disk and is read on demand.

The shape:

@dataclass
class StreamingWorld:
    in_memory: Window     # a small contiguous range of recent state
    archive: Archive      # the rest, append-only on disk

A window of recent state lives in memory, indexed for cheap query. Older state lives on disk in append-only chunks; it is read into the window when a query needs it.

This pattern shows up wherever this scale matters:

  • Time-series databases (Prometheus, InfluxDB): recent metrics in RAM; older series compressed and disk-resident.
  • Game replay systems: the last 30 seconds replayable from a memory ring; the full match streamed from a server.
  • Event-sourced systems: recent state cached; the full event log on disk; replay reconstructs.
  • Database write-ahead logs: append to log; flush to data files; the data files become disk-resident; recent log + memory hold the active set.

The Python toolkit for streaming

Python gives you a small set of well-suited tools for this regime. Naming the right ones (and the wrong ones) is the chapter’s Python-edition contribution.

np.savez and np.savez_compressed. Save a dict of named numpy columns to one .npz file. The format is uncompressed (or zip-compressed) typed bytes — the same bytes already in memory. Load via np.load(path)["column_name"]. This is the canonical Python answer for “snapshot the world” and “load a chunk.” It is fast, schema-visible, and language-portable.

sqlite. When the data is queried by id, range, or join — the access patterns relational databases were built for — sqlite is the right backend. From §29 and code/measurement/sqlite_performance_test.py: ~830K-900K random lookups per second on disk, indistinguishable from memory at the level of a tick budget. The simulator’s archive can be a sqlite database with one table per column-family; queries are SELECT * FROM events WHERE tick BETWEEN ? AND ?.

The simlog as reference implementation. The logger at .archive/simlog/logger.py is exactly this architecture: pre-allocated numpy Containers as the in-memory window, double-buffered, with a background thread that dumps full containers to disk while the simulation continues writing into the swapped-in container. 700 lines, fully tested, exists as a vendored reference. When this chapter clicks, read it; it is the production version of the streaming pattern.

Chunked operations on disk-resident data. Some numpy primitives accept arbitrarily-large input via chunked iteration. .archive/numpy_unique_args_permutations.py explored np.unique’s parameters; the same shape extends to np.histogram, np.argsort (when paired with np.lexsort and stable merging across chunks), and any reduce-style operation — read N rows at a time, update accumulators, drop the chunk before reading the next.

One Python option deliberately not recommended. np.memmap lets numpy treat a disk file as if it were RAM, with the OS paging in only the pages that get accessed. It looks like a free win — and in practice the throughput rarely beats explicit np.fromfile of the chunk you actually want, because the OS’s prefetch heuristics don’t match the simulator’s access patterns. If you have it working today and the numbers look right, fine; the book does not recommend reaching for it as the default move.

The architectural shifts streaming entails

The log is the canonical state. The world’s tables are derivable from the log. If the log is complete and durable, every other in-memory representation is reconstructible. This is the structural framing of §37 — The log is the world: the log is not a record of state, it is the state.

Persistence is serialisation of tables. A snapshot is the world’s current SoA, written as the bytes those columns already hold — np.savez(path, pos_x=pos_x, pos_y=pos_y, ...). Recovery is np.load(path). There is no separate domain model; serialisation is transposition, not translation. This is §36.

Storage is a cost like any other. Reading from disk costs bandwidth and IOPS, just as reading from RAM costs cache-line loads. Storage systems with bandwidth (bytes per second) and IOPS (operations per second) limits must be counted against the tick budget. SQLite, network sockets, distributed file systems — all are storage systems with their own cost profiles. This is §38.

Cleanup amortises the write cost. The cleanup system from §22 already batches in-memory mutations to avoid mid-tick races. At streaming scale the same pattern earns its keep again, for a second reason: it batches disk writes. Without batching, 10,000 individual mutations per tick would mean 10,000 disk writes — at 100 µs per write, a full second of I/O per tick, far over budget. With cleanup, those 10,000 mutations become one durable batch per tick: a handful of disk pages flushed sequentially to the log. One syscall, one trip through the block layer, one DMA transfer — versus 10,000 of each. The cost is amortised across the batch, not paid per row. The architecture you assembled in §22 was already the streaming architecture in miniature; this section just lets you spell it out at scale.

The simulator at streaming scale is no longer a process running in memory; it is a pipeline between a memory window and a durable log, with the systems running on whatever slice of the world is currently mounted. Every read might fault to disk; every write is buffered into the next cleanup’s batch.

The transition from in-memory to streaming is the largest architectural shift in the book. Below this wall, the simulator is a single-process program with its working state in RAM. Above it, the simulator is closer to a database with its working state on disk and a small in-memory hot path. The techniques are different; the discipline is the same — layout, working set, ownership, determinism — applied at a different scale.

This wall is where most projects either re-architect or quietly accept slower-than-target performance. The book points at the wall and names the techniques; it does not pretend the techniques are free.

Exercises

  1. Compute your streaming threshold. Estimate your simulator’s per-creature footprint at full SoA. Divide your machine’s RAM (the half you can spare for the simulator) by that footprint. The result is roughly the N at which the simulator hits the streaming wall.
  2. Predict the cost. A disk read is ~100 µs (NVMe SSD), ~200-500 µs (SATA SSD), or ~10 ms (spinning disk). At a 33 ms tick budget, how many disk reads can a tick afford? How many might a system want to make?
  3. Snapshot a small world. Write a function snapshot(world, path) that calls np.savez_compressed(path, pos_x=world.pos_x, pos_y=world.pos_y, ...). Read it back with np.load. Confirm the simulator continues running indistinguishably.
  4. A windowed log. Implement an append-only log where recent entries live in a numpy ring buffer of fixed size, and overflow gets dumped to a sqlite table or .npz file. Verify queries inside the window are fast; queries outside the window pay the disk cost.
  5. Log-as-world. With the windowed log from exercise 4, reconstruct creature state at an earlier tick by replaying the log over the most recent snapshot whose tick is ≤ the requested one. Compare query speed to the in-memory case.
  6. Read the simlog seriously. .archive/simlog/logger.py is the windowed-log architecture, end to end. Trace the path of one log(time, value, ...) call: which container does it land in, when does the swap happen, when does the disk write occur. The 700 lines you read are 700 lines you do not have to write.
  7. Chunked numpy. Build a 2 GB numpy array on disk via np.save. Compute its mean by reading 100 MB chunks in sequence; compare wall time to loading the whole thing first. Note: at the I/O-bound limit, the chunked version pays slightly more in syscall overhead but caps memory.
  8. (stretch) Document your bound. Write down, for your simulator, the largest N you can run while staying inside a 33 ms tick budget. Include footprint, cache regime, and any disk-bound cost. Above this N, the simulator needs the streaming architecture.

Reference notes in 30_streaming_wall_solutions.md.

What’s next

You have closed Scale. The next phase is Concurrency, starting with §31 — Disjoint write-sets parallelize freely. The simulator is about to start running on more than one process — and the GIL stops being a limit the moment you stop fighting it.