Shelving; Building a Datalog for fun! and profit?
17 May 2018

This post is my speaker notes from my May 3 SF Clojure talk (video) on building Shelving, a toy Datalog implementation.
Databases?
- 1960s; the first data stores. These early systems were very tightly coupled to the physical representation of data on tape. Difficult to use, develop, query and evolve.
- CODASYL, a set of COBOL patterns for building data stores; essentially using doubly linked lists
- IBM’s IMS which had a notion of hierarchical (nested) records and transactions
- 1969; E. F. Codd presents the “relational data model”
Relational data
Consider a bunch of data
[{:type ::developer
:name {:given "Reid"
:family "McKenzie"
:middle "douglas"
:signature "Reid Douglas McKenzie"}
:nicknames ["arrdem" "wayde"]}
{:type ::developer
:name {:given "Edsger"
:family "Djikstra"}
:nicknames ["ewd"]}]
In a traditional (pre-relational) data model, you could imagine laying out a C-style struct in memory, where the name structure is mashed into the developer structure at known byte offsets from the start of the record. Or perhaps the developer structure references a name by its tape offset and has a length-tagged array of nicknames trailing behind it.
The core insight of the relational data model is that we can define “joins” between data structures. But we need to take a couple steps here first.
Remember that maps are sequences of keys and values. So to take one of the examples above,
{:type ::developer
:name {:given "Edsger"
:family "Djikstra"}
:nicknames ["ewd"]}
;; <=> under maps are k/v sequences
[[:type ::developer]
[:name [[:given "Edsger"]
[:family "Djikstra"]]]
[:nicknames ["ewd"]]]
;; <=> under kv -> relational tuple decomp.
[[_0 :type ::developer]
[_0 :name _1]
[_0 :nickname "ewd"]
[_1 :given "Edsger"]
[_1 :family "Djikstra"]]
We can also project maps to tagged tuples and back if we have some agreement on the order of the fields.
{:type ::demo1
:foo 1
:bar 2}
;; <=>
[::demo1 1 2] ;; under {0 :type 1 :foo 2 :bar}
Finally, having projected maps (records) to tuples, we can display many tuples as a table where columns are tuple entries and rows are whole tuples. I mention this only for completeness, since rows and columns are the common terms of use.
foo | bar |
---|---|
1 | 2 |
3 | 4 |
Okay so we’ve got some data isomorphisms. What of it?
Well the relational algebra is defined in terms of ordered, untagged tuples.
Traditionally, data stores didn't include their field identifiers in the storage implementation - an obvious space optimization.
That’s it. That’s the relational data model - projecting flat structures to relatable tuple units.
Operating with Tuples
The relational algebra defines a couple of operations on tuples, or to be more precise, on sets of tuples. There are your obvious set-theoretic operators - union, intersection and difference - and there are three more.
cartesian product
let R, S be tuple sets. ∀r∈R, ∀s∈S: r+s ∈ R×S, where r+s denotes tuple concatenation.
Ex. {(1,), (2,)} × {(3,), (4,)} => {(1, 3), (1, 4), (2, 3), (2, 4)}
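For the Clojure-inclined, here's a quick sketch of the same operation - nothing Shelving-specific, just a `for` comprehension with tuple concatenation:

```clojure
(defn cartesian-product [R S]
  (set (for [r R, s S]
         (vec (concat r s)))))

(cartesian-product #{[1] [2]} #{[3] [4]})
;; => #{[1 3] [1 4] [2 3] [2 4]}
```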
projection (select keys)
Projection is better known as select-keys. It's an operator for selecting some subset of the fields of every tuple in a tuple space. For instance, if we have R defined as
A | B | C |
---|---|---|
a | b | c |
d | a | f |
c | b | d |
π₍A,B₎(R) would be the space of tuples from R excluding the C column -
A | B |
---|---|
a | b |
d | a |
c | b |
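If we model a tuple set as a Clojure set of maps, this operator already exists as clojure.set/project:

```clojure
(require '[clojure.set :as set])

;; R as a set of maps, one map per row
(def R #{{:A "a", :B "b", :C "c"}
         {:A "d", :B "a", :C "f"}
         {:A "c", :B "b", :C "d"}})

(set/project R [:A :B])
;; => #{{:A "a", :B "b"} {:A "d", :B "a"} {:A "c", :B "b"}}
```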
selection
Where projection selects elements from tuples, selection selects tuples from sets of tuples. I dislike the naming here, but I’m going with the original.
To recycle the example R from above,
A | B | C |
---|---|---|
a | b | c |
d | a | f |
c | b | d |
σ₍B=b₎(R) - select where B=b over R would be
A | B | C |
---|---|---|
a | b | c |
c | b | d |
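Again clojure.set has this one built in - reusing the R from the projection example above:

```clojure
(set/select #(= "b" (:B %)) R)
;; => #{{:A "a", :B "b", :C "c"} {:A "c", :B "b", :C "d"}}
```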
Joins
Finally given the above operators, we can define the most famous one(s), join and semijoin.
join (R⋈S)
The (natural) join of two tuple sets is the subset of the set RxS where any fields COMMON to both r∈R and s∈S are “equal”.
Consider some tables, R
A | B | C |
---|---|---|
a | b | c |
d | e | f |
and S,
A | D | E |
---|---|---|
a | 1 | 3 |
d | 2 | 3 |
We then have R⋈S to be
A | B | C | D | E |
---|---|---|---|---|
a | b | c | 1 | 3 |
d | e | f | 2 | 3 |
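And once more in Clojure - clojure.set/join implements the natural join, joining on whichever keys the two relations share (here just :A):

```clojure
(def S #{{:A "a", :D 1, :E 3}
         {:A "d", :D 2, :E 3}})

(set/join #{{:A "a", :B "b", :C "c"}
            {:A "d", :B "e", :C "f"}}
          S)
;; => #{{:A "a", :B "b", :C "c", :D 1, :E 3}
;;      {:A "d", :B "e", :C "f", :D 2, :E 3}}
```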
semijoin
This is a slightly restricted form of join - you can think of it as a join on some particular column. If R and S had several overlapping columns, the natural join would join on all of them, but we may want several distinct relations between two tables - and consequently want to leave open the possibility of several different joins.
For the rest of this presentation, when I talk about joins I'll mean natural joins over tables designed with only one overlapping field, so that the natural join and the semijoin collapse.
Enter Datalog
Codd’s relational calculus as we’ve gone through is a formulation of how to view data and data storage in terms of timeless, placeless algebraic operations. Like the Lambda Calculus or “Maxwell’s Laws of Software” as Kay has described the original Lisp formulation, it provides a convenient generic substrate for building up operational abstractions. It’s the basis for entire families of query systems which have come since.
Which is precisely what makes Datalog interesting! Datalog is almost a direct implementation of the relational calculus, along with some insights from logic programming. Unfortunately, this also means that it's difficult to give a precise definition of what Datalog is. Like Lisp, it's simple enough that there are decades worth of implementations, re-implementations, experimental features and papers.
Traditionally, Datalog and Prolog share a fair bit of notation so we’ll start there.
In traditional Datalog, as in Prolog, "facts" are declared with a notation like this. This particular code is in Souffle, a Datalog dialect which happened to have an Emacs mode. This is the example I'll be trying to focus on going forwards.
State("Alaska")
State("Arizona")
State("Arkansas")
City("Juneau", "Alaska")
City("Phoenix", "Arizona")
City("Little Rock", "Arkansas")
Population("Juneau", 2018, 32756)
Population("Pheonix", 2018, 1.615e6)
Population("Little Rock", 2018, 198541)
Capital("Juneau")
Capital("Phoenix")
Capital("Little Rock")
Each one of these lines defines a tuple in the datalog “database”. The notation is recognizable from Prolog, and is mostly agreed upon.
Datalog also has rules, also recognizable from logic programming. Rules describe sets of tuples in terms of either other rules or sets of tuples. For instance
CapitalOf(?city, ?state) :- State(?state), City(?city, ?state), Capital(?city).
This is a rule which defines the CapitalOf relation in terms of the State, City and Capital tuple sets.
The CapitalOf rule can itself be directly evaluated to produce a set of "solutions" as we'd expect - against the facts above it yields CapitalOf("Juneau", "Alaska"), CapitalOf("Phoenix", "Arizona") and CapitalOf("Little Rock", "Arkansas").
?city and ?state are logic variables, the ?- prefix convention being taken from Datomic.
That’s really all there is to “common” datalog. Rules with set intersection/join semantics.
Extensions
Because Datalog is so minimal (which makes it attractive to implement) it's not particularly useful. Like Scheme, it can be a bit of a hair shirt. Most Datalog implementations have several extensions to the fundamental tuple and rule system.
Recursive rules!
Support for recursive rules is one very interesting extension. Given recursive rules, we could use a recursive Datalog to model network connectivity graphs (1)
Reachable(?s, ?d) :- Link(?s, ?d).
Reachable(?s, ?d) :- Link(?s, ?z), Reachable(?z, ?d).
This rule defines reachability in terms of either there existing a direct link between two points in a graph, or there existing a link between the source point and some intermediate point ?z from which the destination point is recursively reachable.
The trouble is that implementing recursive rules efficiently is difficult although possible. Lots of fun research material here!
Negation!
You’ll notice that basic Datalog doesn’t support negation of any kind, unless “positively” stated in the form of some kind of “not” rule.
TwoHopLink(?s, ?d) :- Link(?s, ?z), Link(?z, ?d), ! Link(?s, ?d).
It's quite common for databases to make the closed world assumption - that is, that all relevant data exists within the database. Negation only really makes sense under that assumption: if you think of your tuple database as a subset of the tuples in the world, all it takes is one previously unseen tuple becoming visible to invalidate a negated query's results.
Incremental queries / differentiability!
Datalog is set-oriented! It doesn't have a concept of deletion, or any aggregation operators such as ordering which require realizing an entire result set. This means it's possible to "differentiate" a Datalog query and evaluate it over a stream of incoming tuples, because no possible new tuple (without negation, at least) will invalidate previous results.
This creates the possibility of using Datalog to do things like describe application views over incremental update streams.
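As a tiny illustrative sketch (not Shelving code), a derived "view" can be maintained by folding each incoming tuple through a transducer - prior results never need to be revisited:

```clojure
(def cities-view (atom #{}))

(def city-xf
  (comp (filter #(= :city (first %)))
        (map (fn [[_ name state]] {:name name, :state state}))))

(defn ingest!
  "Fold one newly-arrived tuple into the view."
  [tuple]
  (swap! cities-view into (eduction city-xf [tuple])))

(ingest! [:state "Alaska"])
(ingest! [:city "Juneau" "Alaska"])
@cities-view
;; => #{{:name "Juneau", :state "Alaska"}}
```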
Eventual consistency / distributed data storage!
Sets form a monoid under merge - no information can ever be lost. This creates the possibility of building distributed data storage and query answering systems which are naturally consistent and don’t have the locking / transaction ordering problems of traditional place oriented data stores.
The Yak
Okay. So I went and built a Datalog.
Why? Because I wanted to store documentation, and other data.
95 Theses
Who’s ready for my resident malcontent bit?
Grimoire
Grimoire has a custom backing data store - lib-grimoire - which provides a pretty good model for talking about Clojure and ClojureScript’s code structure and documentation.
https://github.com/clojure-grimoire/lib-grimoire#things
lib-grimoire was originally designed to abstract over concrete storage implementations, making it possible to build tools which generated or consumed Grimoire data stores. And that purpose it has served admirably for me. Unfortunately, looking at my experience onboarding contributors, it's clearly been a stumbling block, and the current Grimoire codebase doesn't respect the storage layer abstraction; there are lots of places where Grimoire makes assumptions about how the backing store is structured, because I've only ever had one.
Grenada
https://github.com/clj-grenada/grenada-spec
In 2015 I helped mentor Richard Moehn on his Grenada project. The idea was to take a broad view of the Clojure ecosystem and try to develop a "documentation as data" convention which could be used to pack documentation, examples and other content separately from source code - and particularly to enable 3rd-party documenters like myself to create packages for artifacts we don't control (core, contrib libraries). The data format Richard came up with never caught on, I think because the scope of the project was just the data format, not developing a suite of tools to consume it.
What was interesting about Grenada is that it tried to talk about schemas, and provide a general framework for talking about the annotations provided in a single piece of metadata rather than relying on a hard-coded schema the way Grimoire did.
cljdoc
https://github.com/martinklepsch/cljdoc
In talking to Martin about cljdoc and some other next-generation tools, the concept of docs as data has resurfaced again. Core's documentation remains utterly atrocious, and a consistent gripe in the community survey, year over year.
Documentation for core gets more hits than documentation for any other single library, so documenting core and some parts of contrib is a good way to get traction and add value for a new tool or suite thereof.
Prior art
- Datomic is closed source
- Lots of prior art with no persistence model
- Datahike is married to a particular storage model
You can bolt persistence à la carte onto most of the above with transit or just use edn, but then your serialization isn't incremental at all.
Building things is fun!
Design goals
- Must lend itself to some sort of “merge” of many stores
- Point reads
- Keyspace scans
- Must have a self-descriptive schema which is sane under merges / overlays
- Must be built atop a meaningful storage abstraction
- Design for embedding inside applications first, no server
Building a Datalog
Storage models!
Okay, let's settle on an example that we can get right and refine a bit.
Take a step back - Datalog is really all about sets, and relating a set of sets of tuples to itself. What's the simplest possible implementation of a set that can work? An append-only write log!
[[:state "Alaska"]
[:state "Arizona"]
[:state "Arkansas"]
[:city "Juneau" "Alaska"]
[:city "Pheonix" "Arizona"]
...]
Scans are easy - you just iterate the entire thing.
Writes are easy - you just append to one end of the entire thing.
Upserts don't exist: because we have set semantics, either you insert a straight duplicate (which doesn't violate set semantics) or you add a new element.
Reads are a bit of a mess, because you have to do a whole scan, but that’s tolerable. Correct is more important than performant for a first pass!
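A minimal sketch of that idea - this isn't Shelving's actual backend, just an atom wrapping a vector of tuples with set semantics enforced on insert:

```clojure
(defonce write-log (atom []))

(defn insert!
  "Append a tuple unless an equal one is already present - set semantics."
  [log tuple]
  (swap! log (fn [tuples]
               (if (some #{tuple} tuples)
                 tuples
                 (conj tuples tuple)))))

(defn scan
  "Full scan, optionally restricted to tuples tagged with `kind`."
  ([log] @log)
  ([log kind] (filter #(= kind (first %)) @log)))

(insert! write-log [:state "Alaska"])
(insert! write-log [:city "Juneau" "Alaska"])
(scan write-log :city)
;; => ([:city "Juneau" "Alaska"])
```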
Schemas!
So this sort of “sequence of tuples” thing is how core.logic.pldb works. It maintains a map of sequences of tuples, keyed by the tuple “name” so that scans can at least be restricted to single tuple “spaces”.
Anyone here think that truly unstructured data is a good thing?
Yeah I didn’t think so.
Years ago I did a project - spitfire - based on pldb. It was a sketch of a game engine which would load data files for the Warmachine tabletop game's pieces and provide a rules quick reference, and ultimately I hoped a full simulation to play against.
As with most tabletop war games, play proceeds by executing a clock and repeatedly consulting tables of properties describing each model - which we recognize as database queries.
Spitfire used pldb to try to solve the data query problem, and I found it quite awkward to write to, in large part because it was really easy to mess up the tuples you put into pldb. There was no schema system to save you if you messed up your column count somewhere. I built one, but its ergonomics weren't great.
Since then, we got clojure.spec(.alpha), which enables us to talk about the shape of and requirements on data structures. Spec is designed for talking about data in a forwards-compatible way in order to enable evolution, unlike traditional type systems which intentionally introduce brittleness.
While this may or may not be an appropriate trade-off for application development, it’s a pretty great trade-off for persisted data and schemas on persisted, iterated data!
https://github.com/arrdem/shelving#schemas
- Values (sha512 hash identity (hasch is excellent))
- Records (place identity)
(s/def :demo/name string?)
(s/def :demo.state/type #{:demo/state})
(s/def :demo/state
(s/keys :req-un [:demo/name
:demo.state/type]))
(defn ->state [name]
{:type :demo/state, :name name})
(s/def :demo.city/state string?) ;; the city's :state key holds the owning state's name
(s/def :demo.city/type #{:demo/city})
(s/def :demo/city
  (s/keys :req-un [:demo.city/type
                   :demo/name
                   :demo.city/state]))
(defn ->city [state name]
{:type :demo/city, :name name, :state state})
(s/def :demo/name string?)
(s/def :demo.capital/type #{:demo/capital})
(s/def :demo/capital
(s/keys :req-un [:demo.capital/type
:demo/name]))
(defn ->capital [name]
{:type :demo/capital, :name name})
(def *schema
(-> sh/empty-schema
(sh/value-spec :demo/state)
(sh/value-spec :demo/city)
(sh/value-spec :demo/capital)
(sh/automatic-rels true))) ;; lazy demo
Writing!
#'shelving.core/put!
- Recursively walk spec structure
- depth first
- spec s/conform equivalent
- Generate content hashes for every tuple
- Recursively insert every tuple (skipping dupes)
- Insert the topmost parent record with either a content hash ID or a generated ID, depending on record/value semantics (sketched below).
- Create schema entries in the db if automatic schemas are on and the target schema/spec doesn’t exist in the db.
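To make the value/record distinction concrete, here's an illustrative sketch - the real implementation content-hashes with sha512 via hasch, whereas this leans on pr-str (which isn't a canonical encoding) purely for demonstration:

```clojure
(import 'java.util.UUID)

(defn value-id
  "Content-addressed identity: equal values always get the same ID."
  [v]
  (UUID/nameUUIDFromBytes (.getBytes (pr-str v) "UTF-8")))

(defn record-id
  "Place identity: every call mints a fresh ID, even for equal maps."
  [_record]
  (UUID/randomUUID))

(= (value-id (->state "Alaska")) (value-id (->state "Alaska")))   ;; => true
(= (record-id (->state "Alaska")) (record-id (->state "Alaska"))) ;; => false
```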
Okay, so let's throw some data in -
(def *conn
(sh/open
(->MapShelf *schema "/tmp/demo.edn"
:load false
:flush-after-write false)))
;; => #'*conn
(let [% *conn]
(doseq [c [(->city "Alaska" "Juneau")
(->city "Arizona" "Pheonix")
(->city "Arkansas" "Little Rock")]]
(sh/put-spec % :demo/city c))
(doseq [c [(->capital "Juneau")
(->capital "Pheonix")
(->capital "Little Rock")]]
(sh/put-spec % :demo/capital c))
(doseq [s [(->state "Alaska")
(->state "Arizona")
(->state "Arkansas")]]
(sh/put-spec % :demo/state s))
nil)
;; => nil
Schema migrations!
Can be supported automatically, if we’re just adding more stuff!
- Let the user compute the proposed new schema
- Check compatibility
- Insert into the backing store if there are no problems
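In pseudocode-ish Clojure (current-schema and set-schema! are hypothetical helpers, and the schema shape - a map of spec to :value/:record entries - is invented purely for illustration), the check might look like:

```clojure
(defn compatible?
  "Growing the schema is fine; changing or dropping existing entries is not."
  [old-schema new-schema]
  (every? (fn [[spec entry]] (= entry (get new-schema spec)))
          old-schema))

(defn migrate!
  "Upgrade conn to new-schema iff it only adds to the current schema."
  [conn new-schema]
  (let [old-schema (current-schema conn)]      ;; hypothetical accessor
    (if (compatible? old-schema new-schema)
      (set-schema! conn new-schema)            ;; hypothetical writer
      (throw (ex-info "Incompatible schema migration"
                      {:old old-schema :new new-schema})))))
```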
Query parsing!
Shelving does the same thing as most of the other Clojure datalogs and rips off Datomic’s datalog DSL.
(sh/q *conn
'[:find ?state
:in ?city
:where [?_0 [:demo/city :demo/name] ?city]
[?_0 [:demo/city :demo.city/state] ?state]
[?_1 [:demo/capital :demo/name] ?city]])
This is defined to have the same “meaning” (query evaluation) as
(sh/q *conn
'{:find [?state]
:in [?city]
:where [[?_0 [:demo/city :demo/name] ?city]
[?_0 [:demo/city :demo.city/state] ?state]
[?_1 [:demo/capital :demo/name] ?city]]})
How can we achieve this? Let alone test it reasonably?
Spec to the rescue once again! src/test/clj/shelving/parsertest.clj does conform/unform "normal form" round-trip testing!
Spec’s normal form can also be used as the “parser” for the query compiler!
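A toy version of the trick (this is not Shelving's actual parser spec, just the shape of the idea): spec the seq form of the query and let s/conform hand back a map-shaped normal form, with s/unform giving the round-trip.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.string :as str])

(s/def ::lvar (s/and symbol? #(str/starts-with? (name %) "?")))
(s/def ::clause (s/coll-of any? :kind vector?))
(s/def ::seq-query
  (s/cat :find  (s/cat :kw #{:find}  :lvars (s/+ ::lvar))
         :in    (s/? (s/cat :kw #{:in} :lvars (s/+ ::lvar)))
         :where (s/cat :kw #{:where} :clauses (s/+ ::clause))))

(s/conform ::seq-query
           '[:find ?state
             :in ?city
             :where [?_0 [:demo/city :demo/name] ?city]
                    [?_0 [:demo/city :demo.city/state] ?state]])
;; => {:find  {:kw :find,  :lvars [?state]},
;;     :in    {:kw :in,    :lvars [?city]},
;;     :where {:kw :where, :clauses [...]}}
```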
Query planning!
Traditional SQL query planning is based around optimizing disk I/O, typically by trying to do windowed scans or range order scans which respect the I/O characteristics of spinning disks.
This is below the abstractive level of Shelving!
Keys are (abstractly) unsorted, and all we have to program against is a write log anyway! For a purely naive implementation we really can't do anything interesting; we're stuck doing a full scan for every logic variable.
Let's say we added indices - maps from IDs of values of a spec to IDs of values of other specs they relate to. Suddenly query planning becomes interesting. We still have to do scans of relations, but we can restrict ourselves to sub-scans based on relates-to information.
- Take all lvars
- Infer spec information from annotations & rels
- Topsort lvars
- Emit state-map -> [state-map] transducers & filters (see the sketch below)
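To make that last bullet concrete, here's a deliberately naive sketch of the evaluation shape - plain [e a v] tuples rather than Shelving's spec-qualified relations, and nothing resembling a real planner:

```clojure
(require '[clojure.string :as str])

(defn lvar? [x]
  (and (symbol? x) (str/starts-with? (name x) "?")))

(defn clause->step
  "Return a fn from one binding map to every binding map consistent with
  matching `pattern` against some tuple in `db`."
  [db pattern]
  (fn [bindings]
    (keep (fn [tuple]
            (when (= (count pattern) (count tuple))
              (reduce (fn [b [p v]]
                        (cond (not (lvar? p)) (if (= p v)     b (reduced nil))
                              (contains? b p) (if (= (b p) v) b (reduced nil))
                              :else           (assoc b p v)))
                      bindings
                      (map vector pattern tuple))))
          db)))

(defn naive-q [db {:keys [find where]}]
  (->> (reduce (fn [binding-maps step] (mapcat step binding-maps))
               [{}]
               (map #(clause->step db %) where))
       (map #(mapv % find))
       set))

(naive-q '[[s0 :type :state] [s0 :name "Alaska"]
           [c0 :type :city]  [c0 :name "Juneau"] [c0 :state "Alaska"]]
         '{:find  [?city]
           :where [[?c :state "Alaska"]
                   [?c :name ?city]]})
;; => #{["Juneau"]}
```

With indices, each step could consult a relation index instead of scanning the whole db, which is where planning starts to matter.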
TODO:
- Planning using spec cardinality information
- Simultaneous scans (rank-sort vs topsort)
- Blocked scans for cache locality
Goodies
API management!
- shelving.core is the intentional API for users
- shelving.impl is the implementer's API
- import-vars https://github.com/ztellman/potemkin/blob/master/src/potemkin/namespaces.clj#L77
Documentation generation!
Covered previously on the blog - I wrote a custom markdown generator and updater to help me keep my docstrings as the documentation source of truth, updating the markdown files in the repo by inserting appropriate content from the docstrings when they change.
More fun still to be had
What makes Datalog really interesting is the many extensions which have been proposed - support for recursive rules chief among them.
Negation!
- Really easy to bolt onto the parser, or enable as a query language flag
- Doesn’t invalidate any of the current stream/filter semantics
- Closed world assumption, which most databases happily make
Recursive rules!
More backends!
Transactions!
- Local write logs as views of the post-transaction state
- transact! writes an entire local write log all-or-nothing
- server? optimistic locking? consensus / consistency issues
Ergonomics!
The query DSL wound up super verbose unless you really leverage the inferencer :c
Actually replacing Grimoire…
- Should have just used a SQL ;) but this has been educational