ten thousand small processes

D. J. Bernstein djb@cr.yp.to
Wed Jun 25 02:42:29 PDT 2003


In case my subject line isn't clear enough: I'm not talking about
browsers. I'm not talking about programs that ``use up stack space,'' or
that would feel even slightly ``restricted'' by a 1GB limit on the heap,
or that use threads, or that allocate many file descriptors per process.

I'm talking about _small_ processes. I'm talking about programs that
might have quite a bit of code, and might read and write quite a lot of
data, but that don't use much memory per process. The problem is that
I'm talking about ten thousand of these processes running at once.

In this situation, if malloc() fritters away 24K of RAM in one process,
it's actually frittering away more than 234 _megabytes_ of RAM. This
memory is _not_ going to be ``used later.'' The behavior of malloc()
here is not ``quite good.'' For the program that prompted this thread,
malloc() achieved 4.3% fill, 95.7% external fragmentation.
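
To make the per-process overhead visible, here is a minimal sketch
(mine, not from the original post), assuming a traditional sbrk()-based
malloc; allocators that serve small requests with mmap() won't show the
gap in the break, and the exact numbers are illustrative:

   /* Compare bytes requested from malloc() against how far the heap
      break actually moved.  The gap is per-process overhead; multiply
      it by 10000 processes to see the total. */
   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>

   int main(void)
   {
     char *before = sbrk(0);            /* current end of the heap */
     size_t requested = 0;
     int i;
     for (i = 0; i < 16; i++) {
       malloc(64);                      /* a few small blocks */
       requested += 64;
     }
     char *after = sbrk(0);
     printf("requested %lu bytes, break moved %ld bytes\n",
            (unsigned long)requested, (long)(after - before));
     return 0;
   }

At 24K of overhead per process, 24K times 10000 processes is 240000K,
which is the 234 megabytes above.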

In this situation, squeezing variables into a single page---rather than
spreading them among the first few bytes of several separate pages---
often produces huge improvements in CPU cache effectiveness. Yes, I know
that doesn't make a difference for your browser; I'm not talking about
your browser.
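
As an illustration (again mine, not from the post), compare one
allocation that keeps related variables together against separate
allocations that a wasteful allocator may scatter across pages:

   /* Packed: one malloc(), so a, b, c share a page (and usually a
      cache line).  Scattered: three malloc() calls, which a wasteful
      allocator may place at the start of three separate pages. */
   #include <stdlib.h>

   struct state { int a, b, c; };       /* packed layout */

   int main(void)
   {
     struct state *packed = malloc(sizeof *packed);

     int *a = malloc(sizeof *a);        /* scattered layout */
     int *b = malloc(sizeof *b);
     int *c = malloc(sizeof *c);

     /* Touching packed->a, packed->b, packed->c costs one cache
        line; touching *a, *b, *c can cost three lines from three
        different pages. */
     packed->a = *a = 0;
     packed->b = *b = 0;
     packed->c = *c = 0;

     free(a); free(b); free(c); free(packed);
     return 0;
   }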

In this situation, if the kernel is maintaining pages of memory that
look like

   process 1: 1 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
   process 2: 4 5 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
   process 3: 7 8 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...

then it's wasting an incredible amount of space. As I said before:
Instead of allocating space for a bunch of nearly identical page tables
(or anything else showing this type of waste), why not overlap page
tables, with the changes copied on a process switch?
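
Back-of-envelope figures for the picture above (my numbers, assuming
the i386 layout of the era: 4K pages, 4-byte page-table entries, so
1024 entries per page-table page):

   /* Each small process pins a full page-table page to hold only
      3 live entries; estimate the waste across 10000 processes. */
   #include <stdio.h>

   int main(void)
   {
     long processes = 10000;
     long entries   = 4096 / 4;               /* PTEs per page */
     long used      = 3;                      /* live PTEs */
     long waste     = (entries - used) * 4;   /* bytes per process */
     printf("per process: %ld bytes wasted\n", waste);
     printf("total: %ld megabytes wasted\n",
            waste * processes / (1024 * 1024));
     return 0;
   }

That comes to roughly 38 megabytes of page tables holding about 120K
of useful entries.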

---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago
