Clojure Cup

In case you somehow missed the news, the Clojure Cup is an annual community-wide hackathon for Clojure developers which occurred for the first time this year. Sponsored by Deveo, AuthenticJobs, BrowserStack and others, the Clojure Cup is a 48-hour sprint to the finish to build a web app targeted at the “average internet user”.

I got involved with a Cup team through Austin’s Clojure Meetup, which I attended for the first time earlier this month. After discussing some ideas at the meeting, the other ATX Clojurists and I continued in an extended IRC discussion which finally settled on two ideas: Cloutjure and What The Fn.

What The Fn

Norman Richards had the idea for a mathematical guessing game, not unlike Twenty Questions, in which one user writes a function and the other players, by examining its output, must attempt to deduce the function’s source. As I wasn’t on that team I really can’t expound on their architecture, but it’s worth checking out over at its Clojure Cup site.


Cloutjure

Cloutjure was originally my idea: a clone of the GitHub report card site, with an emphasis on the Clojure community. The original idea was simply to be a vanity page, using IRC logs and GitHub profile data to rack and stack Clojurists, but the more James Gatannah, Gary Deer, Sam Griffith and I brainstormed about it, the more value we saw in being able to do other things with such a community ranking metric. Most prominent among these was the idea of adding a reputation-based ~votekick command to the Clojurebot bot in #Clojure, due to the presence of a troll going by the name chord, may he burn.

The Architecture

I spent the first evening of the Cup building an IRC log parser which used a combination of clj-tagsoup, clj-http and MongoDB to parse the IRC logs archived by n01se. The ultimate result was a monster MongoDB collection, messages, of 1,690,290 message records constituting every IRC message sent to #Clojure as captured in those logs. From this collection, I subsequently constructed three more: links, people and link-refs.

Messages were structured as records:

message: {:author  "a name"
          :date    {:years :months :days
                    :hours :minutes :seconds}
          :message "hello, world!"
          :hash    (sha256sum :message)}

This structure allowed for later analysis, including simple counting, text parsing for URL extraction (something I actually did get to play with), “thank you” counting and all kinds of other interesting hat tricks.
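To make those analyses concrete, here’s a minimal sketch in Python (rather than the Clojure we actually used) of counting over records shaped like the above; the field names follow the schema, and sha256 of the message body stands in for (sha256sum :message):

```python
import hashlib
from collections import Counter

def message_record(author, date, text):
    # Build a record shaped like the messages documents above;
    # the hash is just the sha256 of the message body.
    return {"author": author,
            "date": date,
            "message": text,
            "hash": hashlib.sha256(text.encode("utf-8")).hexdigest()}

def per_author_counts(messages):
    # Simple counting: how many messages each author sent.
    return Counter(m["author"] for m in messages)

def thank_you_counts(messages):
    # "Thank you" counting: credit authors whose messages thank someone.
    return Counter(m["author"] for m in messages
                   if "thank" in m["message"].lower())
```

Both counters can then be folded into whatever scoring you like downstream.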

The only one of these which I did manage to do was URL extraction. I wrote a script which forced the creation of 512 worker threads, all splitting text by whitespace into “words” and then running every word through the Apache commons URLValidator class to see what happened. The concept was that valid links would be stored in one table, creatively named links, which would be kept deduplicated by literal URL string, and that whenever someone mentioned a link an entry would be created in the link-refs table. The idea was that by storing all the links separately from the “edges” from people to links, I would be able to do more fine-grained analysis of which links were “hot” and so forth. The ultimate Mongo structure was:

links:     {:url   "some url"
            :site  hostname(url)
            :path  path(url)}

link-refs: {:author (:_id author)
            :date   (:date message)
            :url    (:_id link)}
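The pipeline can be sketched roughly as follows. This is Python rather than the original Clojure, urllib.parse stands in for the Apache commons URLValidator, and the thread count is scaled way down from 512:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def looks_like_url(word):
    # Stand-in for the URLValidator check: the word must parse
    # with an http(s) scheme and a hostname.
    parsed = urlparse(word)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def extract_links(message):
    # Split the message text on whitespace into "words" and keep
    # the ones that validate as URLs.
    return [w for w in message["message"].split() if looks_like_url(w)]

def build_tables(messages, workers=8):
    # links is deduplicated by literal URL string; link-refs gets
    # one "edge" per mention, from a person to a link.
    links, link_refs = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for message, urls in zip(messages, pool.map(extract_links, messages)):
            for url in urls:
                parsed = urlparse(url)
                links.setdefault(url, {"url": url,
                                       "site": parsed.netloc,
                                       "path": parsed.path})
                link_refs.append({"author": message["author"],
                                  "date": message["date"],
                                  "url": url})
    return links, link_refs
```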

Unfortunately, the first pass through the logs didn’t go so well. The URLValidator was apparently quite hard to satisfy, and found only two legitimate links in the entire logfile set. This clearly being an erroneous result, I instead took the condition that some TLD must be a substring of a word as my test. This pass worked better, finding 51,784 unique links in the entire dataset; however, there is a significant amount of pollution in this number. ‘.cl’ is a valid TLD, which means that any mention of a .clj or .cljs filename got picked up as a link. A whole lot of class names got sucked in as links, too.
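The second-pass heuristic and its failure mode are easy to demonstrate. A hedged sketch, with the real (much longer) TLD list abbreviated to a handful of entries:

```python
# Abbreviated stand-in for the real TLD list, which is much longer;
# note that ".cl" and ".io" are on it, which is where the pollution
# comes from.
TLDS = (".com", ".org", ".net", ".io", ".cl")

def tld_substring_match(word):
    # The second-pass heuristic: call a word a link if any TLD
    # appears anywhere in it as a substring.
    return any(tld in word for tld in TLDS)
```

Real links pass, but so do filenames and class names: "core.clj" matches via ".cl", and "java.io.File" matches via ".io".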

The Unfinished Work

The next thing on the todo list was to de-alias the message set I had sucked in, by creating a mapping from IRC names to “people”. The concept: since GitHub accounts and IRC handles ultimately map to people, we want to be able to explicitly name those people and enumerate their community contributions.

Also on tap was using the GitHub API to enumerate every project in the Clojure language, and then parsing (okay, fine: blindly evaling) each project.clj file in order to create a weighted project interdependence graph.
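As a sketch of that step, here’s a safer stand-in for the blind eval: a Python regex that scrapes the :dependencies vector out of a project.clj, from which the edges of the graph would follow. This is illustrative only, not what we ran:

```python
import re

def project_dependencies(project_clj):
    # Rough stand-in for blindly evaling project.clj: pull the
    # dependency names out of the :dependencies vector with a regex.
    match = re.search(r":dependencies\s*\[(.*?)\]\]", project_clj, re.S)
    if not match:
        return []
    return re.findall(r"\[([^\s\]]+)", match.group(1))

def dependency_graph(projects):
    # One edge per (project, dependency) pair; weighting could then
    # count how many projects pull in each library.
    return {name: project_dependencies(text)
            for name, text in projects.items()}
```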

The most important unfinished pieces were the website and the actual scoring query. If you go look at the cloutjure page you’ll see an empty list of 100 entries. This list was supposed to be populated with a naive ordering of the top 100 people in the community based on some entirely unscientific scoring function, but we didn’t even manage to get that wired together, so despite having all the data our webpage sits sad and blank.
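In the spirit of that “entirely unscientific scoring function”, here’s one possible sketch; the link weighting is an invented assumption, not anything the team agreed on:

```python
def naive_scores(message_counts, link_counts, link_weight=5, top=100):
    # Entirely unscientific: one point per message sent, a few more
    # per link shared; the weight is an arbitrary illustration.
    authors = set(message_counts) | set(link_counts)
    scored = ((a, message_counts.get(a, 0)
                  + link_weight * link_counts.get(a, 0))
              for a in authors)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]
```

Feeding it per-author message and link counts yields exactly the kind of top-100 ordering the empty list was waiting for.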

The Postmortem

A significant amount of the blame for this project’s failure rests on my shoulders. I had the only “ready to roll” dataset going into the Clojure Cup, and I essentially dictated the use of MongoDB after Neo4j’s REST API driver proved seemingly too slow to handle the volume of data I was trying to pump in and out for the IRC logs. Due to having tests and other school commitments, when it looked like we weren’t going to finish reasonably I punched out and left the rest of the team hanging so I could deal with my academics. Of such is life.

Were I to do the Cup again, the first thing I’d want to make sure of is that my teammates and I share not only a vision and an elevator pitch, but also that, to the greatest extent possible, the APIs we plan on using and the toolset involved are common and understood by at least two people per component. It’s not that we lost a whole lot of time in UI prototyping, or that we wasted a bunch of time trying to figure out different database structures; it’s just that the same time could have been spent the weekend or two before, rather than during the first few hours of our Saturday morning efforts.

Sadly, Cloutjure doesn’t work yet, but seeing as I have SC13 and Hack Texas 13 coming up, I’m sure it’s only a matter of time until this data sees the light of day. That said, if someone else wants to play with the dataset, I’m more than happy to make it available within reason.