Compiler introduction of transients to pure functions

In Clojure and other pure functional languages, the abstraction is provided that values cannot be updated, only new values produced. Naively, this means that every update to a value must produce a full copy of the original featuring the desired change. More sophisticated implementations opt for structural sharing, wherein the updated version of a structure shares backing memory with the original or source value on the substructures where no update is performed. Substructures which are updated must still be duplicated as in the naive case, but for tree-based datastructures this can reduce the cost of a typical update from O(N) in the size of the updated structure to O(log(N)), because the rest of the structure may be shared and only the "updated" subtree and the path to it need be duplicated.
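
Concretely, in Clojure an "update" returns a new map while the original is untouched, and unchanged entries are shared rather than copied. A small sketch (`big-value` is just an arbitrary sub-structure chosen for illustration):

```clojure
;; An update produces a new value; the original map is unchanged.
(def big-value (vec (range 100)))  ;; an arbitrary sharable sub-structure
(def base {:a big-value :b 2})
(def updated (assoc base :c 3))

(get base :c)     ;; => nil, base is untouched
(get updated :c)  ;; => 3

;; The untouched entry is shared, not duplicated:
(identical? (get base :a) (get updated :a))  ;; => true
```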

This means that tree-based structures which maximize the amount of sharable substructure perform better in a functional context, because they minimize the fraction of a datastructure which must be duplicated during any given update.

Unfortunately, such structural sharing still carries a concrete cost in memory overhead, garbage collection pressure and cache behavior when compared to a semantically equivalent update in place on a mutable datastructure. A mutable update is typically O(1), with specific exceptions for datastructures requiring amortized analysis to achieve near-O(1) performance.

Ideally, we would write programs that preserve the abstraction of immutable values while enabling the compiler or runtime to detect when intentional updates in place are occurring, taking the opportunity to leverage the performance of mutable data in those cases while ensuring that no compiler-introduced mutability can ever be exposed to a user through the intentional API as built by the programmer.

In such a "pure" language, there is only one class of functions, functions from Immutable values to Immutable values. However if we wish to minimize the performance overhead of this model four cases become obvious. λ Immutable → Immutable functions are clearly required as they represent the intentional API that a programmer may write. λ Mutable → Mutable functions could be introduced as implementation details within an λ Immutable → Immutable block, so long as the API contract that no mutable objects may leak is preserved.

Consider the Clojure program

(reduce (partial apply assoc)
    {}
    (map vector
       (range 10000)
       (range 10000)))

This program will sequentially apply the non-transient association operation to a value (originally the empty map) until it represents the identity mapping over the interval [0,9999]. In the naive case, this would produce 10,000 full single-update copies of the map. Clojure, thanks to structural sharing, will still produce 10,000 update objects, but because Clojure's maps are implemented as 32-way hash array mapped tries, only the node containing the updated key/value pair and the path from it up to the root must be duplicated. This reduces the cost of the above operation from roughly T(10,000²) to T(10,000×64) ≊ T(640,000), a huge improvement. However, a Sufficiently Smart Compiler could recognize that the cardinality of the produced map is max(count(range(10000)), count(range(10000))), clearly 10,000. Consequently a hash table of at worst 10,000 entries is required given ideal hashing; assuming a load factor of 2/3, our brilliant compiler can preallocate a hash table of 15,000 entries (presumed T(1)) and then perform T(10,000) hash insertions with a very low probability of ever having to resize, since the table was sized up front for a set load factor.
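
A programmer can approximate what such a compiler would emit today via Java interop, preallocating a mutable `java.util.HashMap` sized for the known cardinality. A sketch of the idea only; the 15,000-bucket figure follows the load-factor arithmetic above:

```clojure
;; Preallocate 15,000 buckets so 10,000 insertions never trigger a resize.
(def m (java.util.HashMap. 15000))
(doseq [[k v] (map vector (range 10000) (range 10000))]
  (.put m k v))

(.size m)    ;; => 10000
(.get m 42)  ;; => 42
```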

Clearly, at least in this example, the mutable hash table would be an immense performance win. While we splurge a bit on memory due to the hash table's load factor (at least compared to my understanding of Clojure's hash array mapped trie structure), the brilliantly compiled program will perform no allocations it does not use, will perform no copying, and will generate no garbage, compared to the naive structurally shared implementation, which will produce at least 9,967 garbage pairs of 32-entry arrays.

The map cardinality hack is its own piece of work, and may or may not be compatible with the JVM, since most structures are not parametric in their initial size and instead perform the traditional 2n resizing, at least abstractly. However, our brilliant compiler can deduce that the empty map which we are about to abuse can be used as a transient and made persistent when it escapes the scope of the above expression.

Consider the single static assignment form for the above, assuming a definition of reduce which macroexpands into a loop (which Clojure's does not):

    [1 ] = functionRef(clojure.core/partial)
    [2 ] = functionRef(clojure.core/apply)
    [3 ] = functionRef(clojure.core/assoc)
    [4 ] = invoke(1, 2, 3)                   ;; (partial apply assoc)
    [5 ] = {}
    [6 ] = functionRef(clojure.core/map)
    [7 ] = functionRef(clojure.core/vector)
    [8 ] = functionRef(clojure.core/range)
    [9 ] = 10000
    [10] = invoke(8, 9)                      ;; (range 10000)
    [11] = invoke(6, 7, 10, 10)              ;; (map vector [10] [10])
    [12] = functionRef(clojure.core/first)
    [13] = functionRef(clojure.core/rest)
loop:
    [14] = phi(5,  18)
    [15] = phi(11, 19)
    [16] = if(15, cont, end)
cont:
    [17] = invoke(12, 15)                    ;; (first [15])
    [18] = invoke(4, 14, 17)                 ;; update the accumulator
    [19] = invoke(13, 15)                    ;; (rest [15])
    [20] = jmp(loop)
end:
    [21] = return(18)

Where the phi function represents that the value of the phi node depends on the source of the flow of control. Here I use the first argument to a phi function to mean that control "fell through" from the preceding block, and the second to mean that control was returned to this block via instruction 20.
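
Read back into Clojure source, that SSA corresponds to a hand-rolled loop/recur of roughly this shape (a sketch of the assumed expansion; Clojure's actual reduce is not implemented this way):

```clojure
;; The loop bindings play the role of the phi nodes:
;; acc ~ phi(5, 18), s ~ phi(11, 19)
(def naive-result
  (loop [acc {}
         s   (map vector (range 10000) (range 10000))]
    (if (seq s)
      (recur (apply assoc acc (first s))  ;; the loop-body update
             (rest s))                    ;; advance the sequence
      acc)))                              ;; the return

(count naive-result)     ;; => 10000
(get naive-result 9999)  ;; => 9999
```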

This representation reveals the dataflow dependence between sequential values of our victim map. We also have the contract that the return, labeled 21 above, must be of an Immutable value. Consequently we can use a trivial dataflow analysis to "push" the Immutable annotation back up the flow graph, giving us that 18, 14 and 5 must all be immutable: 5 is trivially immutable, and 18 depends on 14, which depends on 18 and 5, implying that it must be immutable as well. So far so good.

We can now recognize that we have a phi(Immutable, Immutable) on a loop back edge, meaning that we are performing an update of some sort within the loop body. This means that, so long as the introduced Transient value cannot escape into the Immutable result, we can safely rewrite the loop accumulator to be a Transient and add a persistent! invocation before the return operation. Now we have phi(Immutable, Transient) → Transient, which makes no sense, so we add a loop header entry making the initial empty map Transient, giving us phi(Transient, Transient) → Transient, which is exactly what we want. Now we can rewrite the loop update body to use assoc! : Transient Map → Immutable Object → Immutable Object → Transient Map rather than assoc : Immutable Map → Immutable Object → Immutable Object → Immutable Map.
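
In source terms, the rewritten loop is the familiar hand-written transient idiom: the accumulator is transient for the whole loop and fenced with persistent! at the exit (written out by hand here for illustration):

```clojure
;; phi(Transient, Transient) -> Transient; persistent! fences the escape.
(def rewritten-result
  (loop [acc (transient {})                 ;; loop-header transient
         s   (map vector (range 10000) (range 10000))]
    (if (seq s)
      (let [[k v] (first s)]
        (recur (assoc! acc k v)             ;; assoc rewritten to assoc!
               (rest s)))
      (persistent! acc))))                  ;; fence before the return

(count rewritten-result)  ;; => 10000
```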

Note that I have simplified the signature of assoc to the single key/value case for this example, and that the key and value must both be immutable. This is required as the persistent! function will render only the target object itself and not its references persistent.

This gives us the final operation sequence

    [1 ] = functionRef(clojure.core/partial)
    [2 ] = functionRef(clojure.core/apply)
    [3 ] = functionRef(clojure.core/assoc!)
    [4 ] = invoke(1, 2, 3)                      ;; (partial apply assoc!)
    [5 ] = functionRef(clojure.core/transient)
    [6 ] = {}
    [7 ] = invoke(5, 6)
    [8 ] = functionRef(clojure.core/map)
    [9 ] = functionRef(clojure.core/vector)
    [10] = functionRef(clojure.core/range)
    [11] = 10000
    [12] = invoke(10, 11)                       ;; (range 10000)
    [13] = invoke(8, 9, 12, 12)                 ;; (map vector [12] [12])
    [14] = functionRef(clojure.core/first)
    [15] = functionRef(clojure.core/rest)
loop:
    [16] = phi(7,  20)
    [17] = phi(13, 21)
    [18] = if(17, cont, end)
cont:
    [19] = invoke(14, 17)
    [20] = invoke(4, 16, 19)
    [21] = invoke(15, 17)
    [22] = jmp(loop)
end:
    [23] = functionRef(clojure.core/persistent!)
    [24] = invoke(23, 20)
    [25] = return(24)

Having performed this rewrite, we've won. This transform allows an arbitrary loop using one or more persistent datastructures as accumulators to be rewritten in terms of transients, provided there exists (or can be inferred) a matching Transient t → Transient t updater equivalent to the updater used. Note that if a non-standard-library updater (say a composite updater) is used, then the updater must be duplicated and, if possible, recursively rewritten from Persistent t → Persistent t by replacing the standard library operations for which there are known transient doubles with their Transient t counterparts, until the rewrite either fails to produce a matching Transient t → Transient t or succeeds. If any such rewrite fails, this entire transform must fail. Also note that this transformation can be applied to subterms: so long as the Persistent t contract is not violated on (here) the keys and values of the map, in a nontrivial example their computation could also be rewritten to use compiler-persisted transients.
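
As an example of updater duplication, consider a composite updater built only from standard library operations that have known transient doubles (the names `update-pair` and `update-pair!` are mine):

```clojure
;; Persistent composite updater: inserts both k->v and v->k.
(defn update-pair [m [k v]]
  (-> m (assoc k v) (assoc v k)))

;; Its transient double: each assoc replaced by its assoc! counterpart.
(defn update-pair! [m [k v]]
  (-> m (assoc! k v) (assoc! v k)))

;; The rewritten loop produces the same value as the original.
(= (reduce update-pair {} [[:a 1] [:b 2]])
   (persistent! (reduce update-pair! (transient {}) [[:a 1] [:b 2]])))
;; => true
```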

Now, yes, you could have just written

(into {}
   (map vector
      (range 10000)
      (range 10000)))

which would have used transients implicitly, but that requires the programmer to manually perform an optimization demanding further knowledge of the language and its implementation details, when clearly a relatively simple transformation would not only reveal the potential for this rewrite but provide it in the general case of arbitrarily many mutable accumulators, rather than into's special case of just one.
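
For instance, the general transform would equally apply to a loop threading two accumulators, e.g. building a map and its inverse in one pass, which into cannot express. Hand-transientized here for illustration; the name `map-and-inverse` is mine:

```clojure
;; Two transient accumulators in one loop -- beyond into's single case.
(defn map-and-inverse [pairs]
  (loop [m  (transient {})
         im (transient {})
         s  pairs]
    (if (seq s)
      (let [[k v] (first s)]
        (recur (assoc! m k v) (assoc! im v k) (rest s)))
      [(persistent! m) (persistent! im)])))

(map-and-inverse [[:a 1] [:b 2]])
;; => [{:a 1, :b 2} {1 :a, 2 :b}]
```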

^d
