Hacking like it's 2288

I'm a diehard Fallout franchise fan. While school currently precludes me from playing, I took some time to automate solving the hacking minigame which is used to unlock computer terminals and secured areas in-game.

When "hacking", a player is presented with a list of words one of which is the "password" and has five (or fewer) tries to try and find the password. Every time the user selects a candidate, a response of how many characters match between the answer and the candidate is given so that the user can refine their guess.

The strategy for playing this game is uninteresting at best: you choose an initial word either heuristically or randomly, and then, based on the number of characters that word shares with the answer, you refine the list by removing every word which does not share exactly that many characters with the word you tried. One need only iterate this simple guess-and-refine algorithm until a single answer remains.

While the algorithm is obvious, it's a little tedious to do this by hand because there's no good way to keep track of all the candidates which have been removed from consideration. Also, computing string overlaps by hand is boring. Given an algorithmic problem, write an algorithmic solution! What do we need to solve this problem in Clojure?

Well, first we need our word "intersection" scoring function. To do this, we'll take the character frequencies of the left word and the character frequencies of the right word. Since each character can only be matched with a single other character, given N occurrences of some character on the left and M occurrences of that same character on the right, we have (min N M) common characters. We can simply compute this common character count for every character in the left string (characters occurring only in the right string won't factor in anyway) and take the sum over those counts.

(defn letters-in-common [left right]
  (let [fl (frequencies left)
        fr (frequencies right)]
    (->> (for [[k vl] fl
               :let   [vr (get fr k 0)]]
           (min vl vr))
         (apply +))))

(letters-in-common "foo" "foooo")
;; => 3
(letters-in-common "cat" "dog")
;; => 0
(letters-in-common "cat" "rat")
;; => 2

So now we need a guess making function, which we can write in terms of our scoring function. What word do we want to choose as our guess? Well, we want to guess the word which will give us the most information about all the other words, that is to say the word which is most similar to as many other words as possible.

Once we've made a guess, we can use the word similarity score we get back to refine our dictionary and constrain our search space.

This must be the optimal choice by the usual greedy choice argument, since any other word we could guess would tell us less about the other words. In the case of a tie (two words with equal similarity to all other words) we can't know which one is the better choice without guessing, so we can choose arbitrarily.

(defn make-guess
  "Given a population of words, locates and returns a word with maximum
  \"matching potential\", that is the word which shares the most
  characters with every other word.

  If there is a tie, since the two words have equal potentials it
  doesn't matter which one we choose and so the choice is arbitrary.

  Makes use of scoped memoization for performance on lots of words,
  but realistically this should never be a factor."
  [words]
  (let [-score-fn (memoize
                   (fn [col]
                     (let [[l r] (seq col)]
                       (letters-in-common l r))))
        score-fn  (fn [word]
                    (let [s (sorted-set word)]
                      (->> (for [w words
                                 :when (not= w word)]
                             (-score-fn (conj s w)))
                           (apply +))))]
    (last (sort-by score-fn words))))
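
A quick sanity check on a toy word list of my own (these words aren't from the game): "cot" shares letters with both of the other words, so it carries the most information and gets picked.

(make-guess ["cat" "cot" "dog"])
;; => "cot"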

Now we need our dictionary pruning function. If a word does not have exactly n letters in common with the last guess, where n is the reported similarity of the last guess with the answer, then that word cannot be the solution.

We know that, over multiple guesses, no word which does not have the reported similarity with every previous guess can be the answer. Thus we don't accidentally exclude any possible answers by simply filtering the dictionary this way after each guess.

(defn trim-pop [words guess commonality]
  (for [w     words
        :when (= commonality (letters-in-common w guess))]
    w))

(trim-pop ["foo" "bar" "cat" "rat"] "cat" 2)
;; => ("rat")

So let's put these two to use! We'll wire up make-guess and trim-pop to letters-in-common, which is our scoring function anyway, and just let it run on a small dictionary and a goal word, just to make the point that this algorithm does converge.

;; An example hacking loop. The guess and refine algorithm with a
;; sample input. The word list is illustrative; any list of
;; same-length words containing the answer will do.
(let [words  ["LOWER" "WASTE" "TIRES" "TANKS" "THANE" "LASER"]
      answer "LOWER"]
  (loop [words words]
    (if (= (count words) 1)
      (first words)
      (let [guess  (make-guess words)
            score  (letters-in-common guess answer)
            words' (trim-pop words guess score)]
        (println "Guessed" guess "reduced population to" words')
        (recur words')))))

Yay! So that totally works.

And because I don't think anyone wants to manually run all that state through a REPL while trying to play Fallout 4, here's a quick REPL wrapper to help with hacking.

(defn autohack [words]
  (if-not (= 1 (count words))
    (let [guess (make-guess words)
          _     (println "Try>" guess)
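          ;; the player types back the match count the terminal reports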
          score (read)]
      (recur (trim-pop words guess score)))
    (println "It's" (first words))))

Happy wandering!


Shitposting Vol 1

Subtitle: That Time @Snowden Shot Back.

I win.

Formalism In Software 'Engineering'

This post was originally authored for the UTCS Facebook group, in the context of discussing the role of formalism in software "engineering" as regards this article from Business Insider arguing that "The title “engineer” is cheapened by the tech industry."

Another article in the same vein.

The argument at hand is that "Software Engineering" isn't "real engineering" because we play it fast and loose and don't use formalisms to the extent that other fields do. So let's consider formalisms in the context of computer science.

They're pretty great. We can prove that functions are, with respect to their proof criteria, done and correct for all time. Sorting algorithms, Boyer-Moore, data structures and so forth do well defined things correctly and fall into this category.

On the other hand, most of the work we've done in the direction of formalisms is, to be polite, ultimately useless. We can't do general proof, let alone automate general proof (by application of the halting problem). Even without the halting problem we have logical incompleteness. So really I think that saying we have formalisms in general software engineering is pretty silly. The classical "formal" tools we have suffer from immense limitations, and while some heroes may succeed with them anyway, they are far from the realm of mortals.

So what of the claim that we do "software engineering"? If you dig very far into the software engineering literature, you'll find that it's kind of a sham. I'll direct you here to Parnas, among other authors such as the previously mentioned Brooks, to support this point. The fundamental problem in software engineering isn't designing or writing software. That's easy; we teach undergrads and nontechnical people to program all the time. The real problem we face is coping with the long term consequences of change in a software system.

Users don't really understand what it is that they want most of the time. If you look at the development of large software systems, you'll find that a huge amount of learning occurs as you build something. As you write a program, you are forced to explore its problem domain. In doing so, you will unavoidably learn more about the problem and discover things you should have designed differently. Worse, users seeing intermediary results will get a better understanding of what it is that they want and start asking for it. So we are fated to lose from both sides. The only way to win is to reduce your iteration time, both in terms of cleaning up your mess and in terms of getting something in front of your users faster. This is not somehow a problem unique to software development; it is shared by all engineering. The real draw of modern automated manufacturing is just this, that design improvements can be rolled out faster.

What is unique to software is that when we "deliver" I think it's silly to say that whatever we've built is "done". When we say that we're "done" with a ship and the last rivet is fitted, the cost of changing that ship is enormous. Similarly when the last brick of a bridge is laid, the cost to tear it up and start over again is prohibitive. In contrast, we work with programmable machines which practically speaking never decay and whose behavior is anything but fixed at the time of "manufacture".

Because of the huge change costs, vast amounts of time in traditional engineering are spent gathering and evaluating design criteria, because they must be. The price of getting it wrong is simply too high. Unless you're a Defense contractor you can't afford to deliver the wrong thing late and over budget (ayyyy). Instead you have to figure out what the right thing is and how to build it right the first time, on time and under budget if possible. Case in point, during the development of the push-button telephone, Bell Labs spent a huge amount of money doing user studies. They examined different dialing keypad layouts and discovered the layout we use today. They tested human hearing to understand how much (read how little) information was required to support acceptable calls. And on and on, mostly done ahead of time due to the huge investments involved.

We have the opposite problem. It costs us next to nothing to change our design as our design criteria evolve. This means that we can afford to play fast and loose. We can always patch it later. Look at Tesla, who is famously OTA-updating their fleet of cars to, of all things, add features that weren't present when their customers purchased the cars. It's literally more time (read cost) efficient for us to bang out a release and get it in front of users to try and refine our conception of what they want than it is for us to gather requirements and do an in depth design.

An interesting case study is Chris Granger's programming language Eve, which was "designed" (maybe discovered is the better word) both by doing a ton of research on prior art and by running user study after user study to try and understand how users were learning and making use of Eve. Critically, changes were made not just to the UI but to the semantics of the language as exposed to users, to try and smooth out the pain points while preserving utility. This stands in stark contrast to languages like Pascal which came more or less fully formed into the world and have changed little in response to requirements or user studies.

Similarly modern "continuous deployment" and "continuous testing" infrastructure is laudable because it helps programmers get software out the door faster. This tighter feedback loop between users and programmers helps us refine the "right thing" and deliver that thing. Faster is better. Sound familiar?

Move fast and break things

This is where we need to come back to formalism. Because this really isn't working out for Facebook and it won't work out for you. While I will someday write an article on breaking changes and the necessity of really breaking software, the majority of breaking changes in the above sense are bugs, accidental rather than essential. We are, as the other engineers accuse us, really bad at doing software development. We need a way to cope with change and ensure stability at the same time. If we choose our abstractions well, through experience and art, we can get some of this. But there are very few people with the art to make design decisions which will mitigate real unknown future change. What we really need are tools which help us control and understand the impacts of our changes.

One of these tools is obviously testing and code coverage. We can understand the impact of our changes, or of our failure to change, in terms of what tests we've broken or failed to fix. We can also understand a change in terms of its impact on the coverage of our programs. If we add a bunch of new code that isn't tested, then our tests explore less of the behavior of our programs and are less valuable, and the coverage indicator can tell us this. If we delete a bunch of code that wasn't tested or wasn't used, then we've reduced the state space of our program (which is a great thing!) and our coverage should go up to reflect this.

These are all really powerful tools, but they are more the practices and methodologies which have gotten us this bad name of not being real engineering. What of proof as a tool?

Our medium being inherently different and the cost mechanics also being different, I think that any bemoaning of software correctness going out the window is plainly silly. Of course software won't be "correct". Much software is too complex for any meaningful "correctness" criterion to be established. It's far more efficient to build something that sorta works and iterate than to obsess over getting it right. Proof: small software shops that can iterate quickly are eating the big slow movers' lunches, if you care to look at the last few decades of our industry's history.

This suggests that formalism needs to take a different place in software engineering. It needs to exist not as an engine for absolute correctness, but as a tool to be used to help enable iteration. Type directed programming and error prediction in Haskell is, I think, a great and under explored use of this. If you want to make a change, make that change and run the typechecker (a proof engine with regards to whatever properties you encoded in the type system), and all the errors you get are the places where your existing program needs to be changed in order to accommodate the desired change. This is not general correctness, this is not halting, this is just a tool for letting you as an engineer track the criteria which make what you're building good enough and whether they are preserved.

This conception of lightweight formalism(s) is I think the future of the industry. Real theorem proving has its place in hardware design and in other corners of the industry where the cost benefit tradeoffs are again different, but it's simply too effort intensive to be practicable for the "average" programmer working on yet another Uber for X.

So. Parnas. Fake it. Ship it. Think about it. Fake it again and again. Use formalisms to help you fake it. But formalism for any classical value of formalism is likely nonsense unless you have good reason otherwise.

What are we doing here by all this pushing around of constraints?

I think this is engineering.


Transactional Versioning

Software versioning and managing software dependencies is hard. Unfortunately in this day and age it is also an unavoidable problem. Rarely do we build software in isolation from the rest of the world. Instead we build software which depends upon other software. We make use of libraries, we make assumptions about platforms and runtimes. In order to develop, build and deploy software in a consistent and repeatable manner these dependencies are things which we must manage. Otherwise we fall into the hells of unrepeatable builds and inconsistent deployments where the behavior of software varies from machine to machine and debugging is hard.

Traditionally we have tracked dependencies by assigning names or version numbers to a particular program or library. The versioned piece of software is called an artifact, which term interchanges with the (name, version) pair that identifies it.

By writing out all the (name, version) tuples which we used when developing some software, we can tell users what it is that we expect them to have and imply how to provision their systems. And that works great. We provide software and a dependency list, the dependencies themselves provide dependency lists and so forth. A tool such as Maven or any other package manager can then pull down one dependency, resolve its dependencies and recursively pull down dependencies until all deps are satisfied and the system is ready to go.

Unfortunately this model has a clear failing: what happens when you have a conflict between dependencies? That is, what happens when dependency a requires (c, 3), that is artifact c at version 3, and dependency b requires (c, 4)? This is called a version conflict. Most languages are not able to accommodate having several versions of a single named resource loaded at once, and so a choice must be made. Who wins?

In a perfect world, we want to take the pair (c, 3) and the pair (c, 4) and discover that the 4 version is "fully featured" with respect to the 3 version, which lets us just use the 4 version. But how to do this? Version "numbers" come in a huge number of formats, so there isn't an obvious way to compare them, and many formats encode different information. Some artifacts use a "build number", being literally the count of builds done to date. Other artifacts use as a "build number" the git hash from which the artifact was built. These two just tell the developers what was built and how to rebuild it. The Linux Kernel and other projects use "numbers" with several decimal places, where each place is a counter, the leftmost being the most significant and each one somehow mapping to the feature set and/or patch number of the program.

No build tool knows about all of these schemes, or can, given two version numbers, generally decide whether one can be exchanged for the other. Although perhaps one could build an extensible build system where artifacts contain programs to compare themselves.

The Semantic Versioning scheme, better known as semver, defines a seemingly simple version scheme which clearly defines the relationship between artifact versions. Semver versions are triples (major, minor, patch). Patch changes aren't supposed to add anything, although they may, but they cannot remove any user visible part of the artifact. This means that a build (major, minor, patch) should interchange with any other build (major, minor, patch'), although a higher patch number is expected to imply improvements and bugfixes. Minor versions are intended to denote feature checkpoints and user visible additions. So the 1.2.0 release is supposed to superset the 1.1.0 release in terms of features visible to users, and this version sorts as greater than any patch number on an earlier minor version. Finally the major version is used to denote significant changes to the public API offered by an artifact. Deletions of user visible symbols can only occur here.

These properties ensure that, if respected by the developers, a user may safely and blindly upgrade through arbitrarily many patches and minor versions, since nothing on which they rely may be removed. Upgrading across major versions however may require arbitrarily much work if the API changed or some depended-on symbol was removed. This is perfect for resolving version conflicts. Given two versions in conflict, the dependency resolution system simply chooses the higher of the two so long as they are in the same major version family. If the two are not in the same major version family, then we cannot automatically resolve the dependencies and the user or developer must take action.
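
To show how mechanical that rule is, here's a sketch of my own in Clojure. The resolve-conflict helper is hypothetical, and versions are represented as plain [major minor patch] vectors rather than any particular build tool's format.

;; Illustrative only: resolve a conflict between two semver versions,
;; each represented as a [major minor patch] vector of integers.
(defn resolve-conflict [a b]
  (if (= (first a) (first b))
    ;; Same major family: the higher version is presumed to superset
    ;; the lower one, so it wins. Vectors compare lexicographically.
    (if (pos? (compare a b)) a b)
    ;; Different major families: no automatic resolution, a human
    ;; has to intervene.
    nil))

(resolve-conflict [1 2 0] [1 4 1])
;; => [1 4 1]
(resolve-conflict [1 2 0] [2 0 0])
;; => nil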

However we want to minimize the set of circumstances under which the users must call upon the developers and under which the developers must take action. Consider a major release which adds a whole new API but removes nothing. This is a legal version under the semver scheme. Unfortunately as a major version bump denotes that deletions may have occurred, our dependency resolution engine can do nothing even though the new version commutes with the old one.

We can deal with this particular case by modifying the semver scheme; consider the following "reidver" specification.

  • a+1.b.c denotes that a deletion or a rename has occurred and the API has changed
  • a.b+1.c denotes that an addition has occurred, the API has been supersetted
  • a.b.c+1 denotes that a patch has occurred, the API has not changed

Note that a "rename" encompasses both a change in the meaning of a symbol and a change in its literal name (a deletion plus an addition). But even this has its failures, since if a symbol which is not used by any dependent is deleted, the version scheme still reports a breaking change which we cannot automatically upgrade across.

Superior as it is, even something like this is ultimately futile. We as programmers are awful at tracking what constitutes a breaking change compared to what constitutes an invisible bugfix. "Releases" are often large, often incorporating multiple changes. Unless a strict discipline of single changes and small commits is enforced, reviewing the change history may be difficult. Forgetting that you refactored a single symbol or changed a signature is an easy mistake to make, and one which is absolutely fatal to the semantics of something like semver or reidver.

If we back up and consider this from a set theoretic standpoint, really we can freely upgrade across versions so long as the reached subset of the used artifact(s) hasn't changed. That is, for every artifact upon which we depend we can consider the subset of the public API defined in that artifact which is reached by any other artifact. So long as none of these sets experience negative changes between versions, we can freely upgrade. If a deletion occurs, then in the general case there is nothing we can do and the programmer(s) must take action. Really what we want is not even this "reidver" thing, but rather some set theoretic or transactional view of the contents of an artifact from which we can make exactly this decision about whether two artifacts do commute.
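
A toy sketch of that check in Clojure: safe-upgrade? is a hypothetical helper of my own, and the symbol sets are stand-ins for whatever the language and tooling would actually have to extract.

(require '[clojure.set :as set])

;; An upgrade is safe for us if every symbol we actually reach in the
;; old version of the artifact still exists in the new version's API.
(defn safe-upgrade? [reached-symbols new-api]
  (set/subset? reached-symbols new-api))

(safe-upgrade? #{'foo/bar 'foo/baz} #{'foo/bar 'foo/baz 'foo/quux})
;; => true
(safe-upgrade? #{'foo/bar 'foo/baz} #{'foo/quux 'foo/baz})
;; => false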

Language support for something like this is essential, as programmers simply cannot be trusted to maintain the requisite semantics. The version system can never be allowed to lie. Otherwise all hell breaks loose and we're back in our current tarpit with dependencies which cannot be updated or, worse, nonrepeatable builds.

Given that, this could actually work. It solves all the above issues with resolving replacements, and also makes forking easier since it exposes to the dependency resolution engine that one artifact is derived from another. The version comparison system, representing what is classically the whole change history of the artifact, clearly encodes that the fork is a superset of the base version. It just requires a very different view of versioning than that currently in practice, and language participation in the versioning scheme. Something like a Datomic code loading server & VM.

Disclaimer: This is not a novel idea. I rediscovered this with some hints from Chas Emerick who got there first. I just wrote the essay.


Assist threads

It's been a while since I've written about hardware proper rather than just lisp machine weirdness, but I'm doing research reading in computer architecture this semester so that's gotta change.

Today's subject is caches and assist threads. We are at a crossing point in the history of computer architecture design. Transistors are still getting smaller, but the end of the line is in sight and it has been clear for some time that the next big performance wins will come from parallelism. We can build chips with dozens, hundreds of cores even; the problem is how can we make use of all of those cores?

Two answers to this question which architects have messed with over the years are the "split architecture", and the modern derivative thereof "assist threads".

The Windup

As I previously wrote, caches have been a must in computer architectures for a long, long time. DRAM has always been slow and, compared to CPUs, has been slow to improve as well, leading to a huge latency mismatch between cores and memory. However analysis of program behavior shows that there is a huge amount of data locality in the way that we tend to write programs. That is, we rarely actually work with several gigabytes of data "at the same time" (in a few thousand clock cycles). Instead there are small hot regions of memory, such as program segments and data segments, which are accessed over and over again.

Intel and other chip manufacturers have long exploited this behavior with two and three level nested cache systems which succeed in reducing memory read times by 100x or more (1), (2).

However caches, much like memories, are not infinite no matter how harmful we may consider this, and in fact they derive much of their decreased access time from their comparatively small size (memory read times increase with the layout edge lengths, approximately twice the square root of the size). And this introduces the problem: what do we store in the cache so that we maximize cache utilization?

Replacement Policies

There is an entire field of cache replacement policies, which attempt to predict which entries in a cache will not be accessed again. An ideal cache with perfect knowledge of the future access pattern will never replace an item which will be used again so soon that its eviction and replacement will cost more time than simply leaving it in the cache. I'll say more about this another time, but this is something modern hardware has gotten very good at.
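
As a toy illustration of that ideal policy (often called Belady's MIN), here's a software sketch of my own, nothing like how hardware actually does it: with the future access trace in hand, evict whatever is reused farthest in the future, or never again. The victim helper is hypothetical.

;; Toy model of the ideal replacement policy: given the cached entries
;; and the known future access trace, pick the victim whose next use
;; is farthest away (never reused sorts as infinitely far away).
(defn victim [cache-entries future-accesses]
  (apply max-key
         (fn [entry]
           (let [idx (.indexOf future-accesses entry)]
             (if (neg? idx) Long/MAX_VALUE idx)))
         cache-entries))

(victim [:a :b :c] [:c :a :c :a])
;; => :b   ; :b is never touched again, so it's the best thing to evict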

Data Prefetching

As previously explained in other works, caches are great when you can read from them, but can't save you when the data you want isn't cached. Because missing in the cache can cost hundreds of cycles, it'd be great if we could use some hardware or software mechanism(s) to predict what data will be needed in the cache and dedicate some fraction of the cache to storing speculated values. This is called prefetching (3) and it turns out to work quite well.

The obvious algorithm, stream prefetching, tries to recognize sequential scans as generated by memcpy, strcmp etc. and prefetch cache lines before you read them. While scanning a region this can eliminate the dreaded latency of reading main memory and in the worst case fetches too much data creating some tiny amount of cache pollution at the end of the scan.

Stride prefetching tries to recognize a similar streaming pattern, akin to walking an array of records accessing some single field in each record. That is, memory reads where the nth read is at some linear function addr(n) = base + n*offset for an offset greater than the size of a cache line.
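
A toy software model of the idea (the real mechanism is a small hardware table, not code like this, and predict-next is a hypothetical helper of my own): watch the recent miss addresses, and if they advance by a constant stride, predict the next few addresses along it. Stream prefetching is roughly the special case where the stride is one cache line.

;; Toy stride detector: given recent miss addresses, if they advance
;; by a constant stride, predict the next n addresses along it.
(defn predict-next [addresses n]
  (let [strides (map - (rest addresses) addresses)]
    (when (and (seq strides) (apply = strides))
      (let [stride (first strides)
            base   (last addresses)]
        (map #(+ base (* stride %)) (range 1 (inc n)))))))

(predict-next [0x1000 0x1040 0x1080] 3)
;; => (4288 4352 4416)   ; i.e. 0x10C0, 0x1100, 0x1140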

The fundamental limitation of these and other memory access prediction technologies, however, is that they deal very poorly with indirection. The classical example of this is that in object oriented programs many layers of object indirection may need to be traversed before the CPU can actually begin performing computation. Furthermore, since objects may in the worst case occur anywhere in memory depending on the behavior of the memory allocation engine and/or garbage collector, in general there exists no stride or stream which you could predict. There exist techniques, (4) among them, for predicting such accesses, but they have not seen wide adoption.

Split Architectures

In the '90s, computer architects proposed what was called a "split architecture", a design where each CPU features an additional programmable cache control processor. The concept was that for any program P you could write a program Q such that Q "knows" the memory access pattern(s) of P and can provide perfect cache management for P.

This is in theory superior to any hardware based prediction mechanism, because the programmer or the compiler can predict arbitrary memory access patterns no matter how irregular or seemingly unstructured.

Lots of really smart researchers messed with this idea, and nobody managed to get it to really work, in part because predicting the memory behavior of a program is a lot like predicting its branching behavior. It can't really be done without a total understanding of the inputs which the program will receive. Even if it can be done, as with software branch prediction on general purpose programs, hardware mechanisms will quickly (in a few hundred cycles or less) learn all the behavior which software techniques are able to predict ahead of time, and from then on predict it themselves. So split architectures have been largely a failure.

The Pitch

Which brings us at long last to the idea of an assist thread. Writing the cache control program called for in a split architecture is really hard in the best case. I would argue that it's generally intractable, and certainly something most programmers shouldn't mess with. But we already have the program we want to run and we have more than one CPU core with nothing better to do. So what if we ran two or more instances of our program, with one actually doing the work and the others speculatively executing loads? We could see what the other instances are loading and decide how to populate the cache for the "real" instance based on what the others load.

This is way easier for both the compiler and the programmer than the split architecture, because we just punt on the whole cache management problem to the hardware cache management mechanisms we've developed (which are quite good). And it does a really good job of predicting memory accesses! What other computation than the program which we wish to run will exhibit or predict the same memory access pattern?

Now is this a win or not compared to having some general predictive hardware mechanism? I honestly don't know. Engineering is about taking constant factors and pushing them around to locally optimize for what we can do here and now. Here and now, from both a manufacturing and a power consumption standpoint, I'm hard pressed to believe that it makes sense to take a pair of beefy out of order processors and essentially throw one of them away to accelerate the other.

However if we imagine some machine with hundreds of inexpensive cores, I think that this technique begins to make a lot of sense. Given the choice between two middling CPUs which can team up to approximate a better CPU, and a single better CPU which invests a bunch of chip area in predictive hardware structures, the two CPUs are clearly the more generally useful machine configuration, because they can run both the set of tasks for which the better core would be useful and a potentially interesting set of much smaller tasks on which the large core would be wasted.

This theme of many, many cheap cores and their relative worth to the large cores which we have today is something which will come up over and over again because engineering is in the constants and the constants are making the future of big cores less and less viable.