Clojure Cup

30 Sep 2013

In case you somehow missed the news, Clojure Cup is an annual community-wide hackathon for Clojure developers, which took place for the first time this year. Sponsored by Deveo, AuthenticJobs, BrowserStack and others, the Clojure Cup is a 48-hour sprint to build a web app targeted at the “average internet user”.

I got involved with a Cup team through Austin’s Clojure Meetup, which I attended for the first time earlier this month. After discussing some ideas at the meeting, the other ATX clojurists and I had an extended IRC discussion which finally settled on two ideas: Cloutjure and What The Fn.

What The Fn

Norman Richards had the idea for a mathematical guessing game, not unlike 20 questions, where one user writes a function and the other players must deduce its source by examining its output. As I wasn’t on that team I really can’t expound on their architecture, but it’s worth checking out over at its Clojure Cup site.

Cloutjure

Cloutjure was originally my idea: a clone of the GitHub report card site with an emphasis on the Clojure community. The original plan was simply a vanity page, using IRC logs and GitHub profile data to rack and stack Clojurists, but the more James Gatannah, Gary Deer, Sam Griffith and I brainstormed, the more value we saw in being able to do other things with such a community ranking metric. Most prominent among these was the idea of adding a reputation-based ~votekick command to Clojurebot, the #Clojure bot, prompted by the presence of a troll going by the name chord (may he burn).

The Architecture

I spent the first evening of the Cup building an IRC log parser which used a combination of clj-tagsoup, clj-http and MongoDB to parse the IRC logs stored over at http://Clojure-log.n01se.net/. The end result was a monster MongoDB table, messages, of 1,690,290 message records constituting every IRC message sent to fn/#Clojure as logged by n01se. From this collection, I subsequently constructed three more tables: links, people and link-refs.

Messages were structured as records:

message: {:author  "a name"
          :date    {:years :months :days
                    :hours :minutes :seconds}
          :message "hello, world!"
          :hash    (sha256sum :message)
         }
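As a minimal sketch of how one such record could be assembled in plain Clojure: the function names sha256-hex and make-message are made up for illustration, and hashing the UTF-8 bytes of the message text is an assumption about what sha256sum meant above.

```clojure
(defn sha256-hex
  "Hex-encoded SHA-256 digest of the UTF-8 bytes of s."
  [^String s]
  (let [digest (.digest (java.security.MessageDigest/getInstance "SHA-256")
                        (.getBytes s "UTF-8"))]
    ;; Each byte becomes two lowercase hex characters.
    (apply str (map #(format "%02x" %) digest))))

(defn make-message
  "Builds a map shaped like the message record above.
   date is whatever {:years ... :seconds ...} map the parser produced."
  [author date text]
  {:author  author
   :date    date
   :message text
   :hash    (sha256-hex text)})

(make-message "a name"
              {:years 2013 :months 9 :days 28 :hours 12 :minutes 0 :seconds 0}
              "hello, world!")
```

The date values in the usage line are placeholders, not data from the real logs.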

This structure allowed for later analysis, including simple counting, text parsing for URL extraction (something I actually did get to play with), “thank you” counting and all kinds of other interesting hat tricks.

The only one of these which I did manage to do was URL extraction. I wrote a script which forced the creation of 512 worker threads, all splitting text by whitespace into “words”, and then running every word through the Apache Commons UrlValidator class to see what happened. The concept was that valid links would be stored in one table, creatively named links, which would be kept deduplicated by literal URL string, and that whenever someone mentioned a link, an entry would be created in the link-refs table. The idea was that by storing all the links separately from the “edges” from people to links, I would be able to do more fine-grained analysis of what links were “hot” and so forth. The ultimate Mongo structure was:

links:  {:url      "some url"
         :site     (hostname url)
         :path     (path url)
        }

link-ref:
       {:author   (:_id author)
        :date     (:date message)
        :url      (:_id link)
       }
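The split-then-validate step can be sketched roughly as follows. This is not the script from the Cup: it is single-threaded, it swaps the Apache Commons validator for the JDK's java.net.URI, and it uses the author name and URL string as stand-ins for the Mongo :_id references. The function names are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn words
  "Splits message text on whitespace, dropping empty fragments."
  [text]
  (remove str/blank? (str/split text #"\s+")))

(defn parse-link
  "Returns a links-style record for word, or nil when word is not an
   absolute http(s) URL with a host."
  [word]
  (try
    (let [uri (java.net.URI. word)]
      (when (and (#{"http" "https"} (.getScheme uri))
                 (.getHost uri))
        {:url word :site (.getHost uri) :path (.getPath uri)}))
    (catch java.net.URISyntaxException _ nil)))

(defn extract-links
  "For one message record, returns a seq of {:link ... :link-ref ...} pairs,
   one per word that parses as a link."
  [{:keys [author date message]}]
  (for [word  (words message)
        :let  [link (parse-link word)]
        :when link]
    {:link     link
     :link-ref {:author author :date date :url (:url link)}}))

(extract-links {:author  "a name"
                :date    {:years 2013}
                :message "see http://example.com/docs and core.clj"})
```

Requiring an explicit http or https scheme is a deliberately strict choice for the sketch; it will miss bare links like example.com, which is a different trade-off from either validator pass described in this post.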

Unfortunately, my first pass through the logs didn’t go so well. The UrlValidator was apparently quite hard to satisfy, and found only two legitimate links in the entire logfile set. This clearly being an erroneous result, I switched to a looser test: a word counts as a URL if some TLD appears in it as a substring. This pass worked better, finding 51,784 unique links in the entire dataset; however, there is a significant amount of pollution in this number. ‘.cl’ is a valid TLD, which means that any mention of a .clj or .cljs filename got picked up as a link. A whole lot of class names got sucked in as links too.
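To make the pollution concrete, here is a small sketch comparing the substring test with one possible stricter check that only looks at the last dot-separated label of the host part. The stricter check is an illustration, not something that was run against the logs, and the tlds set is a tiny made-up subset of the real TLD list.

```clojure
(require '[clojure.string :as str])

;; Illustrative subset only; the real TLD list is far longer.
(def tlds #{"com" "org" "net" "cl"})

(defn substring-match?
  "The loose test: true when \".<tld>\" appears anywhere in word."
  [^String word]
  (boolean (some #(.contains word (str "." %)) tlds)))

(defn host-tld-match?
  "A stricter test: strip any scheme, keep the text before the first
   / ? # or :, and require the final dot-separated label to be a TLD."
  [word]
  (let [host       (-> word
                       (str/replace #"^[a-zA-Z][a-zA-Z0-9+.-]*://" "")
                       (str/split #"[/?#:]")
                       first)
        last-label (last (str/split (or host "") #"\."))]
    (contains? tlds (str/lower-case (or last-label "")))))

(map (juxt identity substring-match? host-tld-match?)
     ["http://example.com/a" "core.clj" "java.net.URI"])
```

Under these definitions core.clj and java.net.URI pass the substring test but fail the last-label test, while the real URL passes both. The stricter test is still fooled by anything that genuinely ends in a TLD, such as a namespace named foo.cl, so it narrows the pollution rather than removing it.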

The Unfinished Work

The next thing on the todo list was to de-alias the message set I had sucked in by creating a mapping from IRC names to “people”. The concept is that, since GitHub accounts and IRC handles ultimately map to people, we want to be able to explicitly name those people and enumerate their community contributions.
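Since this part never got built, the following is only a guess at the simplest shape it could take: a hand-maintained map from IRC nick to a person id, with unknown nicks falling through as themselves. The nicks, person ids and function names are all invented for the example.

```clojure
;; Hypothetical alias table: IRC nick -> person id.
(def aliases
  {"alice"   :alice
   "alice_"  :alice
   "alice|w" :alice
   "bob"     :bob})

(defn person-for
  "Resolves a nick to a person id; unknown nicks stand for themselves."
  [nick]
  (get aliases nick nick))

(defn messages-by-person
  "Groups message records by the person behind each :author nick."
  [messages]
  (group-by (comp person-for :author) messages))

(messages-by-person [{:author "alice"  :message "hi"}
                     {:author "alice_" :message "back"}
                     {:author "carol"  :message "hello"}])
```

A people record could then carry the GitHub account alongside the list of nicks, which is where the IRC and GitHub data would meet.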

Also on tap was using the GitHub API to enumerate every Clojure project, and then parsing (okay, fine, blindly evaling) each project.clj file in order to create a weighted project interdependence graph.
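A somewhat safer alternative to evaling is to read each project.clj as data and pick out the :dependencies vector. The sketch below assumes the common layout of a single defproject form followed by key/value pairs, and it takes "weight" to mean how many projects depend on an artifact, which is my own simplification. Fetching the files from GitHub is left out, and the sample project texts are invented.

```clojure
(defn project-deps
  "Reads the text of a project.clj without evaluating it and returns
   {:project name :deps [artifact ...]}, or nil if it is not a defproject form."
  [text]
  (let [form (binding [*read-eval* false] (read-string text))]
    (when (and (seq? form) (= 'defproject (first form)))
      (let [[_ project-name _version & kvs] form
            opts (apply hash-map kvs)]
        {:project project-name
         ;; Each dependency looks like [artifact "version" ...]; keep the artifact.
         :deps    (mapv first (:dependencies opts))}))))

(defn dependency-weights
  "Counts, across all readable projects, how many depend on each artifact."
  [project-texts]
  (->> project-texts
       (keep project-deps)
       (mapcat :deps)
       frequencies))

;; Invented samples; the version strings are placeholders.
(def sample-a
  "(defproject a \"0.1.0\" :dependencies [[org.clojure/clojure \"1.0.0\"] [clj-http \"1.0.0\"]])")
(def sample-b
  "(defproject b \"0.1.0\" :description \"demo\" :dependencies [[org.clojure/clojure \"1.0.0\"]])")

(dependency-weights [sample-a sample-b])
```

Files that compute their dependencies with code, or that have an odd number of forms after the version string, will not fit this shape and would need handling before running it over real repositories.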

The most important unfinished things were the website and the actual scoring query. If you go look at the cloutjure page you’ll see an empty list of 100 entries. This list was supposed to be populated with a naive ordering of the top 100 people in the community based on some entirely unscientific scoring function, but we didn’t even manage to get that wired together, so despite having all the data our webpage sits sad and blank.
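For a sense of how little was missing, here is one entirely unscientific scoring function in sketch form: score each author by raw message count and take the top n. Message count as the score is an assumption for illustration, not the metric we had agreed on, and the sketch works on in-memory records rather than a Mongo query.

```clojure
(defn top-authors
  "Returns the n most prolific authors as [{:author ... :score ...} ...],
   highest count first, ties broken alphabetically by author."
  [n messages]
  (->> messages
       (map :author)
       frequencies
       (sort-by (fn [[author cnt]] [(- cnt) author]))
       (take n)
       (mapv (fn [[author cnt]] {:author author :score cnt}))))

(top-authors 2 [{:author "a"} {:author "b"} {:author "a"}
                {:author "c"} {:author "a"} {:author "b"}])
;; => [{:author "a", :score 3} {:author "b", :score 2}]
```

Calling it with 100 instead of 2 would produce the list the page was meant to show, at least for whatever subset of messages fits in memory.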

The Postmortem

A significant amount of the blame for this project’s failure rests on my shoulders. I had the only “ready to roll” dataset going into the Clojure Cup, and I essentially dictated the use of MongoDB when Neo4j’s REST API driver proved seemingly too slow to handle the volume of data I was trying to pump in and out for the IRC logs. Due to having tests and other school commitments, when it looked like we weren’t going to finish in a reasonable time I punched out and left the rest of the team hanging so I could deal with my academics. Such is life.

Were I to do the Cup again, the first thing I’d want to make sure of is that my teammates and I share not only a vision and an elevator pitch, but also, to the greatest extent possible, a common toolset and set of APIs, with each component understood by at least two people. It’s not that we lost a whole lot of time in UI prototyping, or that we wasted a bunch of time trying to figure out different database structures; it’s just that the same time could have been spent the weekend or two before, rather than during the first few hours of our Saturday morning efforts.

Sadly, Cloutjure doesn’t work yet, but seeing as I have SC13 and Hack Texas 13 coming up, I’m sure it’s only a matter of time until this data sees the light of day. That said, if someone else wants to play with the dataset I’m more than happy to make it available, within reason.