Syscall overhead and VMs

Recently, a friend of mine at ARM has been kicking around the idea that syscalls are inefficient and that a more than reasonable speedup could be achieved across the board in *nix operating systems by providing a mechanism for reducing the overhead of syscalls.

This friend of mine is a bit of a joker, but the story he tells to motivate his idea is a good one.

Say I have a yard full of logs, and I want them chopped. So I walk out, find a guy with an axe, tell him that I have somew work for him and we come to an agreement on the price. So he goes out back, chops a log, I pay him, and just as he’s about to step off the curb and walk away I grab him, and mention that I have another log for him to chop on the same terms. When he finally gets home to his wife, who asks him how his day went, the log chopper goes “I worked for this crazy guy who had a yard full of wood to chop but told me what to do one log at a time”.

Unfortunately, this does indeed represent the state of syscalls in most unix based operating systems. There isn’t a command for saying “stat everything in this directory” for instance, although this is precisely the implementation of the ls command. Instead, you stat the target directory, yielding an array of files, which you then sequentially stat to get individual file properties. This is all well and good, until you realize that in doing this you’ve incurred O(N) system call context switches. It has to be O(N) on the number of files, because that’s what ls does, but the system call context switches are pure overhead.

My response to this idea was simply that what you really want to do is treat the system calls of the OS as its own virtual machine. A system call as currently envisioned is simply a single op against this standardized interface, with the system interrupt/context switch overhead implicit in executing that single op. Consequently, what we really want to do is not make a context switch into the OS with a single op in mind, instead we want a bytecode VM, Turing complete or otherwise, that represents the syscall interface as an interpreter or a JIT. Making a syscall consequently becomes an exercise in generating a bytecode program representing all the work you can do in one go in the context of system call privilage. So take the ls example. Rather than making N+1 stat calls, instead our “fast ls” could simply generate a bytecode program that would do all N+1 syscall ops in a single go by representing the function

(fn [d :- Dir]
  (->> (list d)
       (map stat)))

This gets really fun when you start thinking about what this means in the context of an operating system where the bytecode machine exposed by the OS for syscalls is the same as the bytecode machine exposed for general programming. Suddenly, you get the OS as almost an exokernel that you can call out to with lambda functions just as if it were nonprivilaged code. Given a JIT this really gets fun because as a program must start running at the highest privilage level it will ever require, it must be safe to hoist the entire program to that level. Consequently a JIT could actually inline out the syscall privilage check, performing it once at program start and then simply running the entire program with OS privilages in OS context. Clearly this works better for a managed language VM than for a hardware language VM with pointer crafting :P.

Interestingly, it turns out that this is not a new idea. Apparently some old IBM research machines actually had a full interpreting stack machine baked into the OS kernel to do exactly this, but I haven’t been able to track down public information on such an operating system.