Syscall overhead and VMs
25 Nov 2014

Recently, a friend of mine at ARM has been kicking around the idea that syscalls are inefficient and that a more than reasonable speedup could be achieved across the board in *nix operating systems by providing a mechanism for reducing the overhead of syscalls.
This friend of mine is a bit of a joker, but the story he tells to motivate his idea is a good one.
Say I have a yard full of logs, and I want them chopped. So I walk out, find a guy with an axe, tell him that I have some work for him, and we come to an agreement on the price. He goes out back, chops a log, I pay him, and just as he’s about to step off the curb and walk away I grab him and mention that I have another log for him to chop on the same terms. When he finally gets home to his wife, who asks him how his day went, the log chopper goes “I worked for this crazy guy who had a yard full of wood to chop but told me what to do one log at a time”.
Unfortunately, this does indeed represent the state of syscalls in most Unix-based operating systems. There isn’t a command for saying “stat everything in this directory”, for instance, although this is precisely what the ls command does. Instead, you open and read the target directory, yielding a list of entries, which you then sequentially stat to get individual file properties. This is all well and good, until you realize that in doing this you’ve incurred O(N) system call context switches. The work has to be O(N) on the number of files, because that’s what ls does, but the system call context switches are pure overhead.
My response to this idea was simply that what you really want to do is treat the system calls of the OS as its own virtual machine. A system call as currently envisioned is simply a single op against this standardized interface, with the system interrupt/context switch overhead implicit in executing that single op. Consequently, what we really want is not to make a context switch into the OS with a single op in mind; instead we want a bytecode VM, Turing complete or otherwise, that represents the syscall interface as an interpreter or a JIT. Making a syscall consequently becomes an exercise in generating a bytecode program representing all the work you can do in one go in the context of system call privilege. So take the ls example. Rather than making N+1 stat calls, our “fast ls” could simply generate a bytecode program that performs all N+1 syscall ops in a single crossing.
This gets really fun when you start thinking about what this means in the context of an operating system where the bytecode machine exposed by the OS for syscalls is the same as the bytecode machine exposed for general programming. Suddenly, you get the OS as almost an exokernel that you can call out to with lambda functions just as if it were nonprivileged code. Given a JIT this gets even more fun: because a program must start running at the highest privilege level it will ever require, it must be safe to hoist the entire program to that level. Consequently a JIT could actually inline away the syscall privilege check, performing it once at program start and then simply running the entire program with OS privileges in OS context. Clearly this works better for a managed language VM than for a hardware language VM with pointer crafting :P
Interestingly, it turns out that this is not a new idea. Apparently some old IBM research machines actually had a full interpreting stack machine baked into the OS kernel to do exactly this, but I haven’t been able to track down public information on such an operating system.
^d