Compiler introduction of transients to pure functions
17 Sep 2014

In Clojure and other pure functional languages, the abstraction is provided
that values cannot be updated, only new values may be
produced. Naively, this means that every update to a value must
produce a full copy of the original value featuring the desired
change. More sophisticated implementations may opt for structural
sharing, wherein updated versions of some structure share backing
memory with the original or source value on the substructures where no
update is performed. Substructures where there is an update must be
duplicated and updated as in the naive case, but for tree based
datastructures this can reduce the cost of a typical update from
O(N)
on the size of the updated structure to O(log(N))
because the
rest of the structure may be shared and only the “updated” subtree
needs to be duplicated.
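This value semantics can be seen directly at a REPL: "updating" a persistent map produces a new value and leaves the original untouched, with the unchanged substructure shared rather than copied.

```clojure
;; A small sketch of the semantics described above: assoc on a
;; persistent map yields a new value; the original is unchanged.
(def original {:a 1 :b 2 :c 3})
(def updated  (assoc original :c 42))

(:c original) ;; => 3, the original value is untouched
(:c updated)  ;; => 42
```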
This means that tree-based structures which maximize the amount of sharable substructure perform better in a functional context, because they minimize the fraction of a datastructure which must be duplicated during any given update.
Unfortunately, such structural sharing still carries a concrete cost
in terms of memory overhead, garbage collection and cache performance
when compared to a semantically equivalent update in place on a
mutable datastructure. A mutable update is typically O(1), with
specific exceptions for datastructures requiring amortized analysis to
achieve near-O(1) performance.
Ideally, we would be able to write programs such that we preserve the abstraction of immutable values, while enabling the compiler or runtime to detect when intentional updates in place are occurring and to leverage the performance of mutable data in those cases. The constraint is that no compiler-introduced mutability may become exposed to a user through the intentional API as built by the programmer.
In such a “pure” language, there is only one class of functions:
functions from Immutable values to Immutable values. However, if we
wish to minimize the performance overhead of this model, four cases
become apparent from the cross product of Immutable and Mutable
arguments and results. λ Immutable → Immutable functions are clearly
required, as they represent the intentional API that a programmer may
write. λ Mutable → Mutable functions could be introduced as
implementation details within a λ Immutable → Immutable block, so
long as the API contract that no mutable objects may leak is
preserved. The remaining two cases, λ Immutable → Mutable and
λ Mutable → Immutable, mark the boundary between these two worlds; in
Clojure they correspond to the transient and persistent! operations
respectively.
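As a concrete sketch of this pattern, here is a hypothetical zipmap-like function whose intentional API is Immutable → Immutable while its body runs Mutable → Mutable steps between a transient/persistent! boundary, so no mutable value escapes:

```clojure
;; zipmap* is a hypothetical illustration, not Clojure's zipmap.
(defn zipmap* [ks vs]
  (persistent!                             ;; Mutable → Immutable at exit
   (reduce (fn [m [k v]] (assoc! m k v))   ;; Mutable → Mutable updater
           (transient {})                  ;; Immutable → Mutable at entry
           (map vector ks vs))))

(zipmap* (range 3) [:a :b :c]) ;; => {0 :a, 1 :b, 2 :c}
```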
Consider the Clojure program
(reduce (partial apply assoc)
{}
(map vector
(range 10000)
(range 10000)))
This program will sequentially apply the non-transient association
operation to a value (originally the empty map) until it represents
the identity mapping over the interval [0, 9999]. In the naive case,
this would produce 10,000 full single-update copies of the
map. Clojure, thanks to structural sharing, will still produce 10,000
update objects, but as Clojure’s maps are implemented as log₃₂
hash array mapped tries, only the node array containing the updated
key/value pair (at most 32 entries) and the path back to the root node
must be duplicated. This reduces the cost of
the above operation from T(~10,000²)
to T(10,000×64) ≊ T(640,000)
, a huge performance win. However, a
Sufficiently Smart Compiler
could recognize that the cardinality of the produced map is at most
min(count(range(10000)), count(range(10000)))
, clearly being 10,000, since map vector stops at the shorter input
and the keys here are distinct. Consequently a hash table of at most
10,000 entries is required given ideal hashing; assuming a load
factor of 2/3, our brilliant compiler can preallocate a hash table of
15,000 entries (presumed T(1)) and then perform
T(10000)
hash insertions with a very low probability of ever having to
perform a hash table resize, because the table was sized against a
known cardinality at the target load factor.
Clearly, at least in this example, the mutable hash table would be an immense performance win. While we splurge a bit on consumed memory due to the hash table load factor (at least compared to my understanding of Clojure’s hash array mapped trie structure), the brilliantly compiled program will perform no allocations which it will not use, will perform no copying, and will generate no garbage, whereas the naive structurally shared implementation will produce at least 9,967 garbage pairs of 32-entry arrays.
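A sketch of what that hypothesized compiler output could look like, using java.util.HashMap as a stand-in mutable table. The 15,000 capacity is the illustrative figure from the text (10,000 entries at a 2/3 load factor), not anything Clojure's compiler actually emits:

```clojure
;; identity-table is a hypothetical name; the preallocation figure is
;; an assumption carried over from the surrounding discussion.
(defn identity-table [n]
  (let [m (java.util.HashMap. 15000)] ;; preallocated; no resizing expected
    (doseq [i (range n)]
      (.put m i i))                   ;; O(1) mutable insertion, no garbage
    m))

(.size (identity-table 10000)) ;; => 10000
```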
The map cardinality hack is its own piece of work, and may or may not be compatible with the JVM, since most JVM structures are not parametric on initial size and instead perform the traditional 2*n resizing, at least abstractly. However, our brilliant compiler can deduce that the empty map which we are about to abuse can be used as a transient and made persistent when it escapes the scope of the above expression.
Consider the single static assignment form for the above, assuming a reduce definition which macroexpands into a loop (which Clojure’s reduce does not actually do).
[1 ] = functionRef(clojure.core/partial)
[2 ] = functionRef(clojure.core/apply)
[3 ] = functionRef(clojure.core/assoc)
[4 ] = invoke(1, 2, 3) ;; (partial apply assoc)
[5 ] = {}
[6 ] = functionRef(clojure.core/map)
[7 ] = functionRef(clojure.core/vector)
[8 ] = functionRef(clojure.core/range)
[9 ] = 10000
[10] = invoke(8, 9) ;; (range 10000)
[11] = invoke(6, 7, 10, 10) ;; (map vector [10] [10])
[12] = functionRef(clojure.core/first)
[13] = functionRef(clojure.core/rest)
loop:
[14] = phi(5, 18)  ;; the accumulator map
[15] = phi(11, 19) ;; the remaining key/value pairs
[16] = if(15, cont, end)
cont:
[17] = invoke(12, 15) ;; (first pairs)
[18] = invoke(4, 14, 17) ;; (apply assoc map pair)
[19] = invoke(13, 15) ;; (rest pairs)
[20] = jmp(loop)
end:
[21] = return(14)
Where the phi function represents that the value of the phi node depends on the source of the flow of control. Here I use the first argument to the phi functions to mean that control “fell through” from the preceding block, and the second argument to mean that control was returned to this block via instruction 20.
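For reference, the loop this SSA models can be written out as an explicit loop/recur form, wrapped here in a hypothetical function so the point where the accumulator escapes is easy to see:

```clojure
;; identity-map is an illustrative name; this is the shape reduce
;; would take if it macroexpanded into a loop as assumed above.
(defn identity-map [n]
  (loop [m   {}                                ;; the accumulator map
         kvs (map vector (range n) (range n))] ;; remaining pairs
    (if (seq kvs)
      (recur (apply assoc m (first kvs))       ;; the updater
             (rest kvs))
      m)))                                     ;; the escape point

(count (identity-map 10000)) ;; => 10000
```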
This representation reveals the dataflow dependence between sequential
values of our victim map. We also have the contract that the return,
above labeled 21, must be of an Immutable
value. Consequently we can
use a trivial dataflow analysis to “push” the Immutable
annotation
back up the flow graph, giving us that 18, 14 and 5 must be immutable:
5 is trivially immutable, and 18 depends on 14, which depends on 5 and
18, implying that both must be immutable as well. So far so good.
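This backward analysis can be sketched in a few lines: take a dependence map from each map-valued SSA node to the nodes its value is computed from (the node numbers and edges below are transcribed by hand from the listing), and push the Immutable requirement upward from the return node with a simple worklist.

```clojure
;; Map-flow edges only: 21 returns 14; 14 is phi(5, 18); 18 updates 14.
(def map-deps {21 #{14}, 14 #{5 18}, 18 #{14}, 5 #{}})

;; Transitively collect every node the root's immutability constrains.
(defn must-be-immutable [deps root]
  (loop [seen #{} work [root]]
    (if-let [[n & more] (seq work)]
      (if (seen n)
        (recur seen (vec more))
        (recur (conj seen n) (into (vec more) (deps n #{}))))
      seen)))

(must-be-immutable map-deps 21) ;; => #{5 14 18 21}
```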
We can now recognize that we have a phi(Immutable, Immutable)
on a
loop back edge, meaning that we are performing an update of some sort
within the loop body. This means that, so long as the Transient
value is converted back into an Immutable
value before the result escapes, we can safely rewrite the
Immutable
result to be a Transient
, and add a persistent!
invocation before the return operation. Now we have phi(Immutable,
Transient) → Transient
, which makes no sense, so we add a loop header
entry to make the initial empty map Transient
, giving us
phi(Transient, Transient) → Transient
, which is exactly what we
want. Now we can rewrite the loop update body to use
assoc! : Transient Map → Immutable Object → Immutable Object → Transient Map
rather than
assoc : Immutable Map → Immutable Object → Immutable Object → Immutable Map
.
Note that I have simplified the signature of assoc to the single
key/value case for this example, and that the key and value must both
be immutable. This is required because the persistent!
function renders only the target object itself, and not the objects
it references, persistent.
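The shallowness of persistent! is easy to demonstrate: sealing a transient map does nothing to the values it references, so a mutable value smuggled in remains mutable afterwards (java.util.ArrayList standing in here for an arbitrary mutable object):

```clojure
;; persistent! seals only the map itself; its references are untouched.
(def cell (java.util.ArrayList. [1 2 3])) ;; a mutable value

(def sealed (persistent! (assoc! (transient {}) :xs cell)))

;; The map is now immutable, but the value it holds is still mutable:
(.add ^java.util.ArrayList (:xs sealed) 4)
(count (:xs sealed)) ;; => 4
```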
This gives us the final operation sequence
[1 ] = functionRef(clojure.core/partial)
[2 ] = functionRef(clojure.core/apply)
[3 ] = functionRef(clojure.core/assoc!)
[4 ] = invoke(1, 2, 3) ;; (partial apply assoc!)
[5 ] = functionRef(clojure.core/transient)
[6 ] = {}
[7 ] = invoke(5, 6) ;; (transient {})
[8 ] = functionRef(clojure.core/map)
[9 ] = functionRef(clojure.core/vector)
[10] = functionRef(clojure.core/range)
[11] = 10000
[12] = invoke(10, 11) ;; (range 10000)
[13] = invoke(8, 9, 12, 12) ;; (map vector [12] [12])
[14] = functionRef(clojure.core/first)
[15] = functionRef(clojure.core/rest)
loop:
[16] = phi(7, 20)  ;; the transient accumulator
[17] = phi(13, 21) ;; the remaining key/value pairs
[18] = if(17, cont, end)
cont:
[19] = invoke(14, 17) ;; (first pairs)
[20] = invoke(4, 16, 19) ;; (apply assoc! map pair)
[21] = invoke(15, 17) ;; (rest pairs)
[22] = jmp(loop)
end:
[23] = functionRef(clojure.core/persistent!)
[24] = invoke(23, 16)
[25] = return(24)
Having performed this rewrite, we’ve won. This transform allows an
arbitrary loop using one or more persistent datastructures as
accumulators to be rewritten in terms of transients if there exists
(or can be inferred) a matching Transient t → Transient t
updater
equivalent to the updater used. Note that if a non-standard-library
updater (say a composite updater) is used, then the updater must be
duplicated and, if possible, recursively rewritten from Persistent
t → Persistent t
by replacing the standard library operations for
which there are known transient doubles with their Transient t
counterparts,
until the rewrite either produces a matching Transient t →
Transient t
updater or fails. If any such rewrite fails, then this entire
transform must fail. Also note that this transformation can be applied
to subterms: so long as the Persistent t
contract is not
violated on the keys and values (here those of the map, trivially
immutable, but nontrivial in general), their computation as well
could be rewritten to use compiler-persisted transients.
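To make the updater-rewriting step concrete, here is a hypothetical composite updater and the transient double that could be derived from it by swapping assoc for its known double assoc!; the two loops produce equal values.

```clojure
;; A composite Persistent t → Persistent t updater ...
(defn add-pair [m [k v]]
  (assoc m k v))

;; ... and its mechanically derived Transient t → Transient t double.
(defn add-pair! [m [k v]]
  (assoc! m k v))

(= (reduce add-pair {} [[:a 1] [:b 2]])
   (persistent! (reduce add-pair! (transient {}) [[:a 1] [:b 2]])))
;; => true
```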
Now yes you could have just written
(into {}
(map vector
(range 10000)
(range 10000)))
which would have used transients implicitly. But that requires the programmer to manually perform an optimization demanding further knowledge of the language and its implementation details, when a relatively simple transformation would not only reveal the potential for this rewrite but would provide it in the general case of arbitrarily many mutable accumulators, rather than into’s special case of just one.
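The two spellings are of course equal as values; into simply performs the transient rewrite by hand for the single-accumulator case (shrunk to 100 entries here for brevity):

```clojure
;; The hand-written reduce and the into form produce the same map.
(= (reduce (partial apply assoc)
           {}
           (map vector (range 100) (range 100)))
   (into {}
         (map vector (range 100) (range 100))))
;; => true
```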
^d