Oxcart going forwards

When I last wrote about Oxcart, work pretty much went on hiatus due to my return to school. As there has been some recent interest in the status of Lean Clojure overall, I thought I'd take the opportunity to review the state of Oxcart and the plan for Oxcart going forwards.

Oxcart and Clojure

The static linking approach and lack of run time dynamism found in Oxcart is explicitly at odds with the philosophy of core Clojure. Where Clojure was designed to enable live development and makes performance sacrifices to enable such development as discussed here, Oxcart attempts to offer the complementary set of trade-offs. Oxcart is intended as a pre-deployment static compiler designed to take a working application and, to the greatest extent possible, wring more performance out of unchanged (but restricted) Clojure, as PyPy does for Python. As Oxcart explicitly avoids the dynamic bindings which Clojure embraces, Alex Miller, the Clojure Community Manager, has repeatedly stated that he expects to see little cross pollination from Oxcart and related work to Clojure itself.

This would be all well and good, were it not for the existing behavior of Clojure's clojure.lang.RT class. As currently implemented in Clojure 1.6 and 1.7, RT uses its static initializer (<clinit>) to compile the following resources with clojure.lang.Compiler.

  • "clojure/core"
  • "clojure/core_proxy"
  • "clojure/core_print"
  • "clojure/genclass"
  • "clojure/core_deftype"
  • "clojure/core/protocols"
  • "clojure/gvec"
  • "clojure/instant"
  • "clojure/uuid"

These represent about 10799 lines of code, all of which could easily be statically compiled and, most importantly, tree shaken ahead of time by Oxcart or another tool rather than being loaded at boot time. This also means that the unavoidable booting of Clojure itself from source can easily dominate the loading of user programs, especially after static compilation to raw classes. A quick benchmark on my machine shows that booting a Clojure 1.6 instance and loading a ns containing only a -main that prints "hello world" takes ~2.8 seconds from source, compared to ~2.5 seconds booting the same program compiled with Oxcart, suggesting that the cost of booting Clojure is the shared ~2.5 second boot time. This is the test.hello benchmark in Oxcart's demos.

$ git clone git@github.com:oxlang/oxcart.git &&\
  cd oxcart &&\
  git checkout 0.1.2 &&\
  bash bench.sh test.hello
Running Clojure 1.6.0 compiled test.hello....
Hello, World!

real    0m1.369s
user    0m3.117s
sys     0m0.083s
Oxcart compiling test.hello....
Running Oxcart compiled test.hello....
Hello, World!

real    0m1.212s
user    0m2.487s
sys     0m0.073s

Then there's the test.load benchmark. This benchmark as-is strains credulity because it compiles 502 functions, of which only the -main, which uses none of the other 501, will be invoked. This reflects more on program loading time than on the loading time of "clojure/core", but I still think it is instructive about the costs of boot time compilation, showing a ~7s boot time for Clojure compared to a ~2.5s boot time for Oxcart. As arbitrary slowdowns from macroexpansions which Thread/sleep would be entirely possible, I consider this program within the bounds of "fairness".

A Fork in the Road

There are two solutions to this limitation, and both of them involve changing the behavior of Clojure itself. The first is my proposed lib-clojure refactor. Partitioning Clojure is a bit extreme, and in toying with the proposed RTUtil changes over here I've found that they work quite nicely even with a monolithic Clojure artifact. Unfortunately there seems to be little interest from Clojure's Core team (as judged via Alex's communications over the last few months) in these specific changes or in the static compilation approach to reducing the deployment overhead of Clojure programs. The second is to fork Clojure and then make lib-clojure changes which solves the problem of convincing Core that lib-clojure is a good idea but brings its own suite of problems.

Oxcart was intended to be my undergraduate thesis work. While the 16-25% speedup previously reported is impressive, Oxcart does nothing novel or even interesting under the hood. It only performs four real program transformations: lambda lifting, two kinds of static call site linking and tree shaking. While I suppose impressive for an undergrad, this project also leaves a lot on the table in terms of potential utility due to its inability to alter RT's unfortunate loading behavior. I also think there is low hanging fruit in doing unreachable form elimination and effect analysis, probably enough that Oxcart as-is would not be "complete" even were its emitter more stable.
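
Of the four transformations, tree shaking is the easiest to show in isolation. The following is a minimal sketch, not Oxcart's implementation: it assumes a hypothetical whole-program summary (a map from each def's name to the set of names it references) and keeps only the defs reachable from an entry point. The names program, reachable and tree-shake are made up for illustration.

```clojure
;; Hypothetical whole-program summary: def name -> names it references.
(def program
  {'app/-main            #{'app/greet}
   'app/greet            #{'clojure.core/println}
   'app/unused           #{'app/helper}
   'app/helper           #{}
   'clojure.core/println #{}})

(defn reachable
  "Returns the set of defs transitively reachable from `root`."
  [deps root]
  (loop [seen #{} todo [root]]
    (if-let [sym (peek todo)]
      (if (seen sym)
        (recur seen (pop todo))
        (recur (conj seen sym) (into (pop todo) (get deps sym #{}))))
      seen)))

(defn tree-shake
  "Drops every def that `root` cannot reach."
  [deps root]
  (select-keys deps (reachable deps root)))

;; app/unused and app/helper are never referenced from -main,
;; so they do not appear in the result.
(tree-shake program 'app/-main)
```

The same reachability walk is what makes the boot-time loading in RT so frustrating: anything RT loads unconditionally can never be shaken out, no matter how little of it the program reaches.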

I'm reluctant to simply fork Clojure, mainly because I don't think that the changes I've been kicking about for lib-clojure actually add anything to Clojure as a language. If I were to fork Clojure, it'd be for Oxlang, which actually seeks to make major changes to Clojure, not just tweak some plumbing. But writing a language so I can write a compiler is frankly silly, so that's not high on the options list. The worst part of this is that forking Clojure makes everything about using Oxcart harder. Now you have dependencies at build time (all of "stock" Clojure) that don't exist at deployment time (my "hacked" Clojure). Whatever hack that requires winds up either complicating everyone's project.clj or living in an otherwise uncalled for leiningen plugin, just like lein-skummet. Tooling needs to be able to get around this too when every library you'd want to use explicitly requires [org.clojure/clojure ...], which totally goes away once Oxcart emits the bits you need and throws the rest out. Most of all I don't want to maintain a fork for feature parity as time goes on. However, I also don't see any other way to get around RT's existing behavior, since the RTUtil refactor touches almost every java file in Clojure.

Flaws in the Stone

Oxcart itself also needs a bunch of work. While I think that Nicola has done an awesome job with tools.analyzer and tools.emitter.jvm I'm presently convinced that while it's fine for a naive emitter (what TEJVM is), it's a sub-optimal substrate for a whole program representation and for whole program transforms.

Consider renaming a local symbol. In the LLVM compiler infrastructure, "locals" and other program entities are represented as mutable nodes to which references are held by clients (say call sites or use sites). A rename is then simply an update in place on the node to be changed. All clients see the change with no change in state. This makes replacements, renames and so forth constant time updates. Unfortunately due to the program model used by tools.analyzer and tools.emitter.jvm, such efficient updates are not possible. Instead most rewrites degenerate into worst case traversals of the entire program AST when they could be much more limited in scope. Cutaway is one experiment in this direction, but it at best approximates what clojure.core.logic.pldb is capable of. I hope that over Christmas I'll have time to play with using pldb to store, search and rewrite a "flattened" form of tools.analyzer ASTs.
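
The difference between the two models can be sketched in a few lines. This is an illustration under loud assumptions: the AST below is a simplified stand-in, not the real tools.analyzer node format, and the rename ignores shadowing. The first half renames by walking the whole tree; the second half models the LLVM-style approach with one shared mutable node that every use site points at.

```clojure
(require '[clojure.walk :as walk])

;; Immutable, tree-shaped AST (simplified stand-in): every use of a
;; local repeats its name, so a rename has to find all of them.
(def ast
  {:op       :let
   :bindings [{:op :binding :name 'x}]
   :body     {:op   :invoke
              :args [{:op :local :name 'x}
                     {:op :local :name 'x}]}})

(defn rename-by-walk
  "Renames a local by visiting every node in the tree."
  [ast old new]
  (walk/postwalk
   (fn [node]
     (if (and (map? node) (= old (:name node)))
       (assoc node :name new)
       node))
   ast))

;; Shared mutable node: use sites hold a reference to one node, so a
;; rename is a single in-place update that every client observes.
(def x-node (atom {:name 'x}))
(def use-sites [x-node x-node])

(defn rename-in-place! [node new]
  (swap! node assoc :name new))

(rename-by-walk ast 'x 'y)   ; touches every node in the AST
(rename-in-place! x-node 'y) ; touches exactly one node
```

The walk is proportional to the size of the program being rewritten, while the in-place update is constant time, which is the whole argument for a flattened or database-like program representation.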

Oxcart is out of date with tools.emitter.jvm and tools.analyzer. This shouldn't be hard to fix, but I just haven't kept up with Nicola's ongoing work over the course of the last semester. This will probably get done over Christmas as well.

Oxcart doesn't support a bunch of stuff. As of right now, defmulti, defmethod, deftype, defprotocol, proxy, extend-type and extend-protocol aren't supported. I'm pretty sure all of these actually work, or could easily work; they just didn't get done in the GSoC time frame.

Finally, and I think this is the thing that's really blocking me from working on Oxcart, it can't compile clojure.core anyway. This is a huge failing on my part in terms of emitter completeness, but it's a moot point because even if I can compile clojure.core with Oxcart, RT is gonna load it anyway at boot time. I also suspect that this is an incompleteness in the project as a whole which probably makes it an unacceptable thesis submission, although I haven't spoken with my adviser about it yet.

The Endgame

As of right now I think it's fair to call Oxcart abandoned. I don't think it's a worthwhile investment of my time to build and maintain a language fork that doesn't have to be a fork. I talked with Alexander, one of the clojure-android developers and a fellow GSoC Lean Clojure student/researcher, about this stuff, and the agreement we reached was that until 1.7 is released there's no way that the lib-clojure changes will even get considered, and that the most productive thing we can do as community members is probably to wait for 1.8 planning and then try to sell lib-clojure and related cleanup work on the basis of enabling clojure-android and lean clojure/Oxcart. Practically speaking in terms of my time however, if it's going to be a few months until 1.7 and then a year until 1.8, that only leaves me my last semester of college to work on Oxcart against an official version of Clojure that can really support it. If that's what it takes to do Oxcart I'll likely just find a different thesis project or plan on graduating without a thesis.

  That said, serious interest in Oxcart as a deployment tool or another
  contributor would probably be enough to push me over the futility hump
  of dealing with a Clojure fork and get Oxcart rolling.


The Future of the LispM

This is also addressed to @lokikil towards whom I have threatened to write this post on multiple occasions. It is expected to piss off loper-os. Above all, it represents my opinions. You have been warned.

Background Of Historical LispMs

LispMs occupy a transitional period of computing industry history between time shared mainframes and the single user workstations that would become today's desktop computers. Prior to John McCarthy's invention of time sharing, all computers had been single program machines which multiple users shared via batch scheduling. The introduction of time sharing opened the way to exploring how to make computers useful interactively for many users sitting at terminals. Under batch processing, users would present fully formed programs, written by hand or with little tool assistance, to computer techs who would run the programs and return printed output.

The concept of time sharing and interactivity led to the development of more dynamic programming environments, including the Read Eval Print Loop. However at the same time, the falling costs of integrated circuits were beginning to put the first workstations, single user machines featuring large memories, within the financial reach of well funded AI laboratories which were previously dependent on time shared mainframes.

In order to improve interactivity and performance, Richard Greenblatt and Thomas Knight developed the cons machine, the first of the MIT Lisp machines. Priced at ~$50,000 in 1970s dollars, cons and the successor cadr machine were single user workstations built primarily, it seems, to address the restrictions on memory and memory performance, and the subsequent detrimental impact on general program performance, faced on time shared systems due to the large memory footprint of Lisp programs. Hardware/microcode support was added for maintaining the relatively complex Lisp environment structures, and for many of the primitive lookup/load/store operations on cons cell structures that characterize a traditional Lisp implementation's data access pattern. This combination of machine support and microcoding enabled Lisp "compilers" to be far simpler and offloaded the responsibility for maintaining complicated structures such as the environment from the compiler to the microcode system. Best of all, as the microcode corresponded relatively directly to Lisp, on the rare occasions that truly writing microcode (for, say, device drivers) was called for, doing so was easy because users could transparently invoke and define microcode from their native Lisp environment.

Seeing a business opportunity, in the early '80s a number of MIT AI lab researchers departed and founded Symbolics Inc. and Lisp Machines Inc. for the purpose of building and selling Knight machine derived Lisp workstations. Ultimately however, Lisp machines did fall out of favor during the AI winter(s) and lost to what we now call the personal computer. I have seen the argument made that they were too big-ticket and missed the mass market as a result. Thus our story really begins.

Of Modern Hardware

In the years since the demise of the dedicated LispMs, Intel and the other major chip manufacturers have progressed from single cycle to pipelined, cache accelerated, superscalar and out of order machines. Knowledge of these designs is assumed. I suggest my earlier introduction series to these topics. At a high level however, each step of this progression trades off chip area, transistors (complexity) and power budget to wring more instruction level parallelism out of existing programs.

The Symbolics 3600 technical report from '83 claims a best case cycle time of 180ns (~5MHz) and a worst case of 250ns (4MHz). Arbitrary reads from main memory are priced at 600ns (3x the cost of a single instruction). This low (by modern standards) slowdown between the processor and memory meant that one could write programs that did lots and lots of random memory access and still have reasonable performance expectations. However, as a stack machine architecture, there is no opportunity for the instruction level parallelism exploited by modern architectures as the exact state of the stack which is depended on by every instruction changes with every instruction. This forces all program execution to wait during slow operations like main memory reads, division and floating point math due to the sequential data dependency on the stack. This is a fundamental architectural failure common to all stack machines and skirted by modern software stack machines like the JVM by compiling literal stack instructions to a register machine in order to recover instruction level parallelism.

Modern processors claim cycle times of roughly 0.3ns for a 3GHz machine, often running multiple instructions per cycle. Unfortunately, while processor core clocks have gotten faster since the days of the 3600, memories have not improved at anything like the same rate. On modern hardware, 100ns for a random main memory read seems to be more or less normal. Against the 3600's figures above, this represents a roughly 540x speedup in processor cycle time (180ns to ~0.33ns) mismatched with a 6x speedup in main memory latency (600ns to 100ns), for an effective ~90x slowdown on memory reads relative to the processor. Now this is admittedly worst case memory latency. Thanks to the L1 and L2 cache layers, "memory" read times of 1ns to 3ns are not uncommon, meaning that for cache efficient programs it is quite reasonable to expect less than 2ns, or 6 cycles, for memory reads hitting in the L1 or L2 cache.

I mention this because techniques like cdr coding serve to bring the average case of list access and construction down from their worst case behavior of generating massive heap fragmentation (and thus defeating caching) towards the behavior of nicely caching sequential data layouts (classical arrays), typically via tree like structures or SRFI-101. While traditional linked list structures, due to potential heap fragmentation, provide potentially atrocious cache locality and thus cripple the performance of modern cache accelerated machines, more modern datastructures can simultaneously provide the traditional cons list pattern while improving cache locality and thus performance characteristics. Guy Steele himself has even come out arguing this point.

Without literal implementation as a stack machine traversing linked lists and providing special hardware support for binding and so forth, the LispM instantly loses much of the vaunted simplicity which makes it attractive to program directly and becomes simply another register machine in the mold of modern ARM or Intel machines, with strange long running microcoded instructions reminiscent more of the VAXen than of modern RISC inspired processors. In short, due to the many performance limitations of linked lists as data structures, we as an industry will never again build machines as the original LispMs were, for the sole purpose of traversing and manipulating linked list data structures. Modern hardware simply offers better performance with more general applicability.

Of Modern Languages

We've also come a long way in language design and implementation. Compilers, once slow, have gotten faster and smarter. Virtual machines like the JVM, JavaScript and the CLR are becoming widely used deployment targets. The ML and Haskell families of languages have introduced us to concepts of real abstract types and abstract effects which can be used to build programs coupled only by the abstract properties of the data being consumed, generated and effects produced. Type inference is even making such fancy behavior manageable by mere mortals, while providing language implementations with more and more information with which to perform both program level optimization and micro-optimization not possible in traditional naive lisps.

Of LispMs Future

While we'll never build a hardware LispM again, I suspect that we will see LispM like systems return one day in the not too distant future. Not as hardware, but as software or a virtual machine designed to run atop existing and entirely adequate modern hardware. Now in 10 years someone may make a commercial hardware LispM at which point I will be happy to eat my words, but as of right now I don't see it happening.

The JVM and the .net CLR have proven to be amazing platforms. While their allocation requirements perhaps prohibit their use for implementing operating systems and driver code (not that people haven't tried this sort of thing), they do offer excellent and standardized platforms. It is my personal belief that, by leveraging the flexibility that Lisp dialects have for both data driven DSLs and macro DSLs, it would be more or less trivial to implement an operating system using a DSL for generating platform specific assembly. As the "kernel" OS is implemented in the final "host" LispM language, editing the kernel is possible, albeit difficult due to the inherent difficulties of hot swapping operating systems.

Add a JIT (no this isn't easy), preferably implemented in the same assembler DSL as the host operating system, and the stage is set for building a LispM environment as a virtual machine running atop modern hardware and using modern JIT to achieve reasonable performance characteristics. From this vantage the way is clear to implementing the rest of the operating system and user land in a fully fledged lisp with good reloading and introspection support atop this kernel of a JIT and enough of an OS to provide memory and process management.

This is, I think, an entirely reasonable and practical project for a smallish (~3 or fewer) team and a single target architecture (Intel). There was a project, Dream, an r4rs x86 assembler dsl in which an r4rs interpreter and operating system were implemented. It booted and worked more or less fine, and while it lacked polish I think it serves as an excellent proof of concept that such a self hosting Lisp (assembler os runtime) triple is viable. I don't argue that such an implementation will be trivially portable to a wide variety of target hardware because as experience shows porting nominally portable Linux, FreeBSD or FreeRTOS is actually quite hard and porting a JIT can at best be as hard due to tight coupling with the specific behavior of the target machine but this is why we specify an abstract machine and then let implementations deal with the mess as the JVM does.

I also think that Lisp as a language family could stand to learn some new tricks if someone were to make the effort to build a full Lispy OS/VM thing.

Clojure is awesome. I've written a boatload of it. It's an amazingly simple and elegant tool, and its persistent/immutable datastructures are absolutely worth stealing, as is its concept of hygienic macros via symbol qualification and probably the concept of protocols/interfaces which Clojure kinda inherits from its primary host Java.

Racket is also awesome although I can't claim to have written a bunch of it. Its approach to static compilation, strong support for documentation and examples via Scribble, its pattern matching and typechecking facilities are totally worth stealing.

Shen is interesting if only due to its built-in Prolog engine, which enables search for arbitrary proofs with regards to arbitrary program properties. While criticized by the Haskell community for disregarding completeness, being able to write and search for proofs with regards to arbitrary properties of programs, not just types, is I think an amazingly powerful capability, albeit an active research area.

But is it a LispM?

There are some people who will rail against this vision of mine that the "OS" is as low as we need to try and push a language. To my mind, the advantages of pushing a language to the hardware level are scant enough, as argued above, that they simply do not justify the investment or the delay. Even the OS may be too low a level to push a language to, however appealing the adventure of building an OS with a language in order to host that language a la Oberon. While integrating the language with the OS does offer maximal conceptual unity across the combined platform and provides a huge leap in ease of integrating pipelines of different applications, it also forcibly discards programs not written in HOST_LANG for HOST_PLATFORM. This is a problem if only because the Unix model of computing as implemented on Linux with the help of the GNU user land is second only to the Mach hiding inside of Mac OS X in terms of apparent developer adoption. Developers booting the next Dream for the first time won't find familiar text editors or tools. It will be a brave new world.

Maybe that's a good thing. Maybe we can escape the legacy of Unix with untyped byte streams and instead revive some of Plan 9 with a basis in lispy goodness and with more types everywhere. Maybe there's even a place in this for algebraic effects. Maybe we can join urbit in looking towards a functional, fully networked future. Maybe. If only we can look past the obsolete machines and operating systems of yesteryear with which we seem content to circlejerk and move forwards with appropriate context and vision.

Or maybe we'll just be stuck with Linux, Unix, OpenSSL and the rest of our somewhat suspect tower of legacy code out of sheer inertia and continue getting caught up in language of the month rages out of collective ADD.

 We have not wings, we cannot soar;
       But we have feet to scale and climb
 By slow degrees, by more and more,
       The cloudy summits of our time.

 The mighty pyramids of stone
       That wedge-like cleave the desert airs,
 When nearer seen, and better known,
       Are but gigantic flights of stairs.

 The distant mountains, that uprear
       Their solid bastions to the skies,
 Are crossed by pathways, that appear
       As we to higher levels rise.

 The heights by great men reached and kept
       Were not attained by sudden flight,
 But they, while their companions slept,
       Were toiling upward in the night.

~ Longfellow


Syscall overhead and VMs

Recently, a friend of mine at ARM has been kicking around the idea that syscalls are inefficient and that a more than reasonable speedup could be achieved across the board in *nix operating systems by providing a mechanism for reducing the overhead of syscalls.

This friend of mine is a bit of a joker, but the story he tells to motivate his idea is a good one.

Say I have a yard full of logs, and I want them chopped. So I walk out, find a guy with an axe, tell him that I have some work for him and we come to an agreement on the price. So he goes out back, chops a log, I pay him, and just as he's about to step off the curb and walk away I grab him, and mention that I have another log for him to chop on the same terms. When he finally gets home to his wife, who asks him how his day went, the log chopper goes "I worked for this crazy guy who had a yard full of wood to chop but told me what to do one log at a time".

Unfortunately, this does indeed represent the state of syscalls in most unix based operating systems. There isn't a command for saying "stat everything in this directory" for instance, although this is precisely the implementation of the ls command. Instead, you stat the target directory, yielding an array of files, which you then sequentially stat to get individual file properties. This is all well and good, until you realize that in doing this you've incurred O(N) system call context switches. It has to be O(N) on the number of files, because that's what ls does, but the system call context switches are pure overhead.

My response to this idea was simply that what you really want to do is treat the system calls of the OS as its own virtual machine. A system call as currently envisioned is simply a single op against this standardized interface, with the system interrupt/context switch overhead implicit in executing that single op. Consequently, what we really want to do is not make a context switch into the OS with a single op in mind. Instead we want a bytecode VM, Turing complete or otherwise, that represents the syscall interface as an interpreter or a JIT. Making a syscall consequently becomes an exercise in generating a bytecode program representing all the work you can do in one go in the context of system call privilege. So take the ls example. Rather than making N+1 stat calls, our "fast ls" could simply generate a bytecode program that would do all N+1 syscall ops in a single go by representing the function

(fn [d :- Dir]
  (->> (list d)
       (map stat)))
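
To make the accounting concrete, here is a toy model of the idea. Nothing in it is a real kernel API: fake-fs, kernel-entries, syscall and syscall-batch are all invented for illustration, and the "bytecode program" is simply a Clojure function that gets run in one simulated kernel entry. The counter stands in for user-to-kernel context switches.

```clojure
;; Toy model only: a fake file system and a counter standing in for
;; user->kernel context switches. None of these names are real APIs.
(def fake-fs {"/tmp" {"a.txt" {:size 10} "b.txt" {:size 20}}})
(def kernel-entries (atom 0))

(defn- list-dir [d] (keys (get fake-fs d)))
(defn- stat [d f] (get-in fake-fs [d f]))

(defn syscall
  "One op per kernel entry, as in the conventional interface."
  [op & args]
  (swap! kernel-entries inc)
  (case op
    :list-dir (apply list-dir args)
    :stat     (apply stat args)))

(defn syscall-batch
  "One kernel entry that runs a whole program of ops.
  Here the 'program' is just a function run in kernel context."
  [program & args]
  (swap! kernel-entries inc)
  (apply program args))

(defn slow-ls
  "Lists, then stats each file with its own kernel entry: N+1 entries."
  [d]
  (mapv #(syscall :stat d %) (syscall :list-dir d)))

(defn fast-ls
  "Ships the whole list-and-stat loop across in a single kernel entry."
  [d]
  (syscall-batch (fn [d] (mapv #(stat d %) (list-dir d))) d))
```

With two files in the directory, slow-ls bumps the counter three times and fast-ls bumps it once, while both return the same stat results. The work done is still O(N); only the crossings are amortized.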

This gets really fun when you start thinking about what this means in the context of an operating system where the bytecode machine exposed by the OS for syscalls is the same as the bytecode machine exposed for general programming. Suddenly, you get the OS as almost an exokernel that you can call out to with lambda functions just as if it were nonprivileged code. Given a JIT this really gets fun, because as a program must start running at the highest privilege level it will ever require, it must be safe to hoist the entire program to that level. Consequently a JIT could actually inline out the syscall privilege check, performing it once at program start and then simply running the entire program with OS privileges in OS context. Clearly this works better for a managed language VM than for a hardware language VM with pointer crafting :P.

Interestingly, it turns out that this is not a new idea. Apparently some old IBM research machines actually had a full interpreting stack machine baked into the OS kernel to do exactly this, but I haven't been able to track down public information on such an operating system.

牛: the environment model

Disclaimer: Oxlang is vaporware. It may exist some day, there is some code, however it is just my thought experiment at polishing out some aspects of Clojure I consider warts by starting from a tabula rasa. The following represents a mostly baked scheme for implementing ns and require more nicely in static (CLJS, Oxcart) rather than dynamic (Clojure) contexts.

Unlike Clojure, in which the unit of compilation is a single form, Oxlang's unit of compilation is that of a "namespace", or a single file. Oxlang namespaces are roughly equivalent to Haskell modules in that they are composed of a "header", followed by a sequence of declarative body forms.

In Clojure, the ns form serves to imperatively create and initialize a namespace and binding scope. This is done by constructing a new anonymous function, using it as a class loader context to perform cached compilation of depended namespaces. Subsequent forms are compiled as they occur and the results are accumulated as globally visible defs.

Recompiling or reloading a file does exactly that. The ns form is re-executed, incurring more side-effects, and all forms in the file are re-evaluated, generating more defs. However this does not discard the old defs from the same file, nor purge the existing aliases and refers in the reloaded namespace. This can lead to interesting bugs where changes in imports and defs create name conflicts with the previous imports and cause reloading to fail. The failure to invalidate deleted defs also creates conditions where, for instance during refactorings, the old name for a function remains interned and accessible in the program run time, allowing evaluation of code which depends on the old name to succeed until the entire program is reloaded in a fresh run time, at which point the missing name will become evident as a dependency fault.

Furthermore, the Var mechanism serves to enable extremely cheap code reloading because all bindings are dynamically resolved anyway. This means that there is exactly zero recompilation cost to new code beyond compilation of the new code itself since the Var look up operation is performed at invoke time rather than at assemble time.
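
A small REPL-style example shows the late binding at work. It assumes stock Clojure without direct linking enabled; the function names are arbitrary.

```clojure
(defn greeting [] "hello")
(defn greet [] (str (greeting) ", world"))

(def before (greet))

;; Redefining `greeting` does not require recompiling `greet`:
;; the call site dereferences the Var #'greeting on every invocation.
(defn greeting [] "goodbye")

(def after (greet))

[before after]
```

Here greet was compiled exactly once, yet the second call picks up the new greeting, because the lookup through the Var happens at invoke time.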

Unfortunately in my Clojure development experience, the persistence of deleted symbols resulted in more broken builds than I care to admit. Building and maintaining a dependency graph between symbols is computationally inexpensive, is a key part of many language level analyses for program optimization and here critically provides better assurance that REPL development behavior is identical to program behavior in a cold program boot context.

In order to combat these issues, two changes must be made. First, re-evaluating a ns form must yield a "fresh" environment that cannot be tainted by previous imports and bindings. This resolves the import naming conflict issues by making them impossible. By modeling a "namespace" as a concrete "module" value having dependencies, public functions and private functions we can mirror the imperative semantics enabled by Clojure's defs and Vars simply by accumulating "definitions" into the "module" as they are compiled.

This model isn't a total gain however due to the second change, that reloading entirely (and deliberately) invalidates the previous definitions of every symbol in the reloaded namespace by swapping out the old namespace definition for the new one. This implies that other namespaces/modules which depend on a reloaded module must themselves be reloaded in topological sort order once the new dependencies are ready requiring dependency tracking and reloading infrastructure far beyond Clojure's (none). Naively this must take place on a file by file basis as in Scala, however by tracking file change time stamps of source files and the hash codes of individual def forms a reloading environment can prove at little cost that no semantic change has taken place and incur the minimum change cost. I note here the effectiveness of GHCI at enabling interactive development under equivalent per-file reloading conditions as evidence that this model is in fact viable for enabling the interactive work flow that we associate with Clojure development.
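
As a rough sketch of the bookkeeping this implies, assume a hypothetical module graph stored as a map from each module to the set of modules it requires, and assume the graph is acyclic. The functions below (all names invented here) compute which modules a change affects, the order in which to reload them, and which def forms actually changed according to their hashes.

```clojure
;; Hypothetical module graph: module -> modules it requires.
(def requires
  {'app.core #{'app.db 'app.util}
   'app.db   #{'app.util}
   'app.util #{}
   'app.cli  #{'app.core}
   'app.misc #{}})

(defn dependents
  "Modules that directly require `m`."
  [requires m]
  (set (for [[k deps] requires :when (contains? deps m)] k)))

(defn affected
  "`changed` plus everything that transitively depends on it."
  [requires changed]
  (loop [seen #{} todo [changed]]
    (if-let [m (peek todo)]
      (if (seen m)
        (recur seen (pop todo))
        (recur (conj seen m) (into (pop todo) (dependents requires m))))
      seen)))

(defn reload-order
  "Topologically sorts the affected modules so that each one is
  reloaded only after the affected modules it requires."
  [requires changed]
  (loop [done [] remaining (affected requires changed)]
    (if (empty? remaining)
      done
      (let [ready? (fn [m] (not-any? remaining (get requires m)))
            m      (first (sort (filter ready? remaining)))]
        (when-not m
          (throw (ex-info "Cyclic module dependency" {:remaining remaining})))
        (recur (conj done m) (disj remaining m))))))

(defn changed-forms
  "Names whose def form hash differs between two readings of a file.
  Both arguments map def name -> form."
  [old new]
  (set (for [k (set (concat (keys old) (keys new)))
             :when (not= (hash (get old k)) (hash (get new k)))]
         k)))

(reload-order requires 'app.util)
```

Changing app.util forces app.db, app.core and app.cli to reload in that order, while app.misc is left alone. If changed-forms comes back empty for a touched file, nothing downstream needs to reload at all.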

With "namespaces" represented as concrete immutable values, we can now define namespace manipulation operations such as require and def in terms of functions which update the "current" namespace as a first class value. A def when evaluated simply takes a namespace and returns a new namespace that "happens" to contain a new def. However the work performed is potentially arbitrary. refer, the linking part of require, can now be implemented as a function which takes some enumeration of the symbols in some other namespace and the "current" environment, then returns a "new" environment representing the "current" environment with the appropriate aliases installed.
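
A minimal sketch of what this could look like, written in Clojure since Oxlang does not exist yet. The representation (a map with :name, :defs and :aliases) and the names def* and refer* are assumptions made for illustration, not Oxlang's design.

```clojure
;; A "namespace" as a plain immutable map. All names here are
;; illustrative; this is not Oxlang's actual representation.
(def empty-ns {:name nil :defs {} :aliases {}})

(defn def*
  "Returns a new namespace value that also contains `sym`."
  [ns-val sym value]
  (assoc-in ns-val [:defs sym] value))

(defn refer*
  "Returns a new namespace value in which each symbol in `syms`
  is aliased to its fully qualified name in `source`."
  [ns-val source syms]
  (reduce (fn [n sym]
            (if (contains? (:defs source) sym)
              (assoc-in n [:aliases sym]
                        (symbol (name (:name source)) (name sym)))
              (throw (ex-info "No such def" {:ns (:name source) :sym sym}))))
          ns-val
          syms))

(def util (-> (assoc empty-ns :name 'app.util)
              (def* 'square (fn [x] (* x x)))))

(def core (-> (assoc empty-ns :name 'app.core)
              (refer* util ['square])))

(:aliases core)
```

Because every operation returns a new value, "reloading" app.util is nothing more than building a fresh util value and re-running the refer* step for its dependents; no stale alias can survive.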

This becomes interesting because it means that the return value of load need not be the eval result of the last form in the target file; it can instead be the namespace value representing the final state of the loaded module. Now, given a caching/memoized load (which is require), we can talk about an "egalitarian" loading system where user defined loading paths are possible because refer only needs the "current" namespace, a "source" namespace and a spec. Any function could generate a "namespace" value, including one which happens to perform loading of an arbitrary file as computed by the user. See technomancy's egalitarian ns. Enabling the hosting of multiple versions of a single lib simultaneously in a single Clojure instance is one possible application of this behavior.

It is my hope that by taking this approach the implementation of namespaces and code loading can be simplified greatly. However, one advantage of the Var structure is that it enables forwards and out of order declarations, which is immensely useful while bootstrapping a language run time ex nihilo, as done here in the Clojure core. ns itself must exist in the "empty" namespace. Otherwise, as the "empty" namespace is used to analyze the first form in a file (stateless (abstractly) compiler ftw), the ns form itself will not be resolved and no program can be constructed. Ox could follow Clojure's lead and cheat by not implementing ns in Ox but rather bootstrapping it from Java or Clojure or whatever the implementing language turns out to be, but I'd like to do better than that. This is a problem I haven't solved yet.


Compiler introduction of transients to pure functions

In Clojure and pure functional languages, the abstraction is provided that values cannot be updated; only new values may be produced. Naively, this means that every update to a value must produce a full copy of the original value featuring the desired change. More sophisticated implementations may opt for structural sharing, wherein updated versions of some structure share backing memory with the original or source value on the substructures where no update is performed. Substructures where there is an update must be duplicated and updated as in the naive case, but for tree based datastructures this can reduce the cost of a typical update from O(N) on the size of the updated structure to O(log(N)) because the rest of the structure may be shared and only the "updated" subtree needs to be duplicated.

This means that tree based structures which maximize the amount of sharable substructure perform better in a functional context because they minimize the fraction of a datastructure which must be duplicated during any given update.
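The internal node sharing is not directly observable from Clojure code, but its visible consequence is: after an update, the substructures that were not touched are the very same objects in both the old and the new value. A small illustration (v1 and v2 are just example names):

```clojure
(def v1 {:a [1 2 3] :b [4 5 6]})

;; "Update" :b. This produces a new map and leaves v1 alone.
(def v2 (assoc v1 :b [7 8 9]))

;; The untouched vector under :a is shared, not copied.
(identical? (:a v1) (:a v2))

;; The original value is unchanged.
(= v1 {:a [1 2 3] :b [4 5 6]})
```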

Unfortunately, however, such structural sharing still carries a concrete cost in terms of memory overhead, garbage collection, and cache performance when compared to a semantically equivalent update in place over a mutable datastructure. A mutable update is typically O(1), with specific exceptions for datastructures requiring amortized analysis to achieve near O(1) performance.

Ideally, we would be able to write programs such that we preserve the abstraction of immutable values, while enabling the compiler or other runtime to detect when intentional updates in place are occurring and take the opportunity to leverage the performance improvements consequent from mutable data in these cases while ensuring that no compiler introduced mutability can become exposed to a user through the intentional API as built by the programmer.

In such a "pure" language, there is only one class of functions: functions from Immutable values to Immutable values. However, if we wish to minimize the performance overhead of this model, four cases become obvious. λ Immutable → Immutable functions are clearly required as they represent the intentional API that a programmer may write. λ Mutable → Mutable functions could be introduced as implementation details within a λ Immutable → Immutable block, so long as the API contract that no mutable objects may leak is preserved. The remaining two cases, λ Immutable → Mutable and λ Mutable → Immutable, are the conversions at the boundary between the two worlds, which in Clojure correspond to transient and persistent!.

Consider the Clojure program

(reduce (partial apply assoc)
    {}
    (map vector
       (range 10000)
       (range 10000)))

This program will sequentially apply the non-transient association operation to a value (originally the empty map) until it represents the identity mapping over the interval [0,9999]. In the naive case, this would produce 10,000 full single update copies of the map. Clojure, thanks to structural sharing, will still produce 10,000 update objects, but Clojure's maps are implemented as log₃₂ hash array mapped tries, meaning that only the array containing the "last" n % 32 key/value pairs must be duplicated, plus the root node. This reduces the cost of the above operation from T(~10,000²) to T(10,000*64) ≊ T(640,000), which is huge for performance. However, a Sufficiently Smart Compiler could recognize that the cardinality of the produced map is at most min(count(range(10000)), count(range(10000))), clearly being 10000. Consequently, given ideal hashing, an array map of in the worst case 10000 elements is required. Assuming a load factor of 2/3, our brilliant compiler can preallocate a hashmap of 15000 entries (presumed T(1)) and then perform T(10000) hash insertions with a very low probability of having to perform a hash table resize, because it has accounted for hash distribution and sized the allocated table to achieve a set load factor.

Clearly, at least in this example, the mutable hash table would be an immense performance win. While we splurge a bit on consumed memory due to the hash table load factor (at least compared to my understanding of Clojure's hash array mapped trie structure), the brilliantly compiled program will perform no allocations which it will not use, will perform no copying, and will generate no garbage, compared to the naive structurally shared implementation which will produce at least 9,967 garbage pairs of 32 entry arrays.

The map cardinality hack is its own piece of work and may or may not be compatible with the JVM, because most structures are not parametric on initial size and instead perform the traditional 2*n resizing, at least abstractly. However, our brilliant compiler can deduce that the empty map which we are about to abuse can be used as a transient and made persistent again when it escapes the scope of the above expression.

Consider the single static assignment form for the above (assuming a reduce definition which macroexpands into a loop) (which Clojure doesn't do).

    [1 ] = functionRef(clojure.core/partial)
    [2 ] = functionRef(clojure.core/apply)
    [3 ] = functionRef(clojure.core/assoc)
    [4 ] = invoke(1, 2, 3)                   ;; (partial apply assoc)
    [5 ] = {}
    [6 ] = functionRef(clojure.core/map)
    [7 ] = functionRef(clojure.core/vector)
    [8 ] = functionRef(clojure.core/range)
    [9 ] = 10000
    [10] = invoke(8, 9)                      ;; (range 10000)
    [11] = invoke(6, 7, 10, 10)              ;; (map vector [10] [10])
    [12] = functionRef(clojure.core/first)
    [13] = functionRef(clojure.core/rest)
    [14] = phi(5,  18)
    [15] = phi(11, 19)
    [16] = if(15, cont, end)
    [17] = invoke(12, 15)
    [18] = invoke(4, 14, 17)
    [19] = invoke(13, 15)
    [20] = jmp(loop)
    [21] = return(18)

Where the phi function represents that the value of the phi node depends on the source of the flow of control. Here I use the first argument to the phi functions to mean that control "fell through" from the preceding block, and the second argument to mean that control was returned to this block via instruction 20.

This representation reveals the dataflow dependence between sequential values of our victim map. We also have the contract that the return, above labeled 21, must be of an Immutable value. Consequently we can use a trivial dataflow analysis to "push" the Immutable annotation back up the flow graph, giving us that 18, 14 and 5 must be immutable. 5 is trivially immutable, and 18 depends on 14, which depends on 18 and 5, implying that it must be immutable as well. So far so good.
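The "push" can be sketched as a simple backwards reachability walk. The flows-from table below is a hypothetical hand encoding of only the accumulator edges from the listing above (it is not derived automatically), and must-be-immutable is an illustrative name.

```clojure
;; Hypothetical encoding: value id -> ids its map value is derived from.
(def flows-from
  {21 #{18}     ; return(18)
   18 #{14}     ; the update consumes the previous map, 14
   14 #{5 18}   ; phi(5, 18)
   5  #{}})     ; the literal {}

(defn must-be-immutable
  "Walk backwards from root, collecting every value id which the
  Immutable contract on root forces to be Immutable as well."
  [flows-from root]
  (loop [marked #{}
         work   [root]]
    (if-let [id (peek work)]
      (if (marked id)
        (recur marked (pop work))
        (recur (conj marked id)
               (into (pop work) (flows-from id))))
      marked)))

;; The values other than the return itself that get annotated.
(disj (must-be-immutable flows-from 21) 21)
```

The cycle between 14 and 18 is handled by the marked set, which is what lets the walk terminate on a loop back edge.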

We can now recognize that we have a phi(Immutable, Immutable) on a loop back edge, meaning that we are performing an update of some sort within the loop body. This means that, so long as no Transient value is introduced into the Immutable result, we can safely rewrite the Immutable result to be a Transient, and add a persistent! invocation before the return operation. Now we have phi(Immutable, Transient) → Transient which makes no sense, so we add a loop header entry to make the initial empty map Transient, giving us phi(Transient, Transient) → Transient which is exactly what we want. Now we can rewrite the loop update body to use assoc! : Transient Map → Immutable Object → Immutable Object → Transient Map rather than assoc : Immutable Map → Immutable Object → Immutable Object → Immutable Map.

Note that I have simplified the signature of assoc to the single key/value case for this example, and that the key and value must both be immutable. This is required as the persistent! function will render only the target object itself and not its references persistent.

This gives us the final operation sequence

    [1 ] = functionRef(clojure.core/partial)
    [2 ] = functionRef(clojure.core/apply)
    [3 ] = functionRef(clojure.core/assoc!)
    [4 ] = invoke(1, 2, 3)                      ;; (partial apply assoc!)
    [5 ] = functionRef(clojure.core/transient)
    [6 ] = {}
    [7 ] = invoke(5, 6)
    [8 ] = functionRef(clojure.core/map)
    [9 ] = functionRef(clojure.core/vector)
    [10] = functionRef(clojure.core/range)
    [11] = 10000
    [12] = invoke(10, 11)                       ;; (range 10000)
    [13] = invoke(8, 9, 12, 12)                 ;; (map vector [12] [12])
    [14] = functionRef(clojure.core/first)
    [15] = functionRef(clojure.core/rest)
    [16] = phi(7,  20)
    [17] = phi(13, 21)
    [18] = if(17, cont, end)
    [19] = invoke(14, 17)
    [20] = invoke(4, 16, 19)
    [21] = invoke(15, 17)
    [22] = jmp(loop)
    [23] = functionRef(clojure.core/persistent!)
    [24] = invoke(23, 20)
    [25] = return(24)

Having performed this rewrite we're done. This transform allows an arbitrary loop using one or more persistent datastructures as accumulators to be rewritten in terms of transients if there exists (or can be inferred) a matching Transient t → Transient t updater equivalent to the updater used. Note that if a non-standard library updater (say a composite updater) is used, then the updater needs to be duplicated and if possible recursively rewritten from a Persistent t → Persistent t by replacing the standard library operations for which there are known doubles with their Transient t counterparts until either the rewrite fails to produce a matching Transient t → Transient t or succeeds. If any such rewrite fails then this entire transform must fail. Also note that this transformation can be applied to subterms, so as long as the Persistent t contract is not violated on the keys and values of the map, in a nontrivial example their computation as well could be rewritten to use compiler persisted transients.
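Written by hand in Clojure source, the result of the transform corresponds roughly to the second function below. This is only my sketch of what the rewritten loop looks like at the source level, not compiler output; the function names are illustrative.

```clojure
(defn identity-map-persistent
  "The original shape: a persistent map accumulator."
  [n]
  (reduce (partial apply assoc)
          {}
          (map vector (range n) (range n))))

(defn identity-map-transient
  "The rewritten shape: the accumulator is a transient for the whole
  loop and is made persistent exactly once, just before it escapes."
  [n]
  (loop [acc (transient {})
         kvs (map vector (range n) (range n))]
    (if (seq kvs)
      (let [[k v] (first kvs)]
        (recur (assoc! acc k v) (rest kvs)))
      (persistent! acc))))

(= (identity-map-persistent 10000)
   (identity-map-transient 10000))
```

Note that the result of assoc! is always threaded into the next iteration rather than discarded, which is the same discipline the phi(Transient, Transient) node encodes.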

Now yes you could have just written

(into {}
   (map vector
      (range 10000)
      (range 10000)))

which would have used transients implicitly, but that requires that the programmer manually perform an optimization requiring further knowledge of the language and its implementation details when clearly a relatively simple transformation would not only reveal the potential for this rewrite but would provide it in the general case of arbitrarily many mutable accumulators rather than into's special case of just one.
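For a sense of the multiple accumulator case, here is a hand-written loop with two transient accumulators, which into alone cannot express. The function split-parity is a made up example of the shape of code the transform would aim to produce.

```clojure
(defn split-parity
  "Partition xs into [evens odds] using two transient accumulators,
  each made persistent only at the point where it escapes the loop."
  [xs]
  (loop [evens (transient [])
         odds  (transient [])
         xs    (seq xs)]
    (if xs
      (let [x (first xs)]
        (if (even? x)
          (recur (conj! evens x) odds (next xs))
          (recur evens (conj! odds x) (next xs))))
      [(persistent! evens) (persistent! odds)])))

(split-parity (range 5))
```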