NoDatalog: NoSQL Done Right (Part 2)

In Part 1 of this post I covered the NoSQL movement, and described what I think is not ideal about it.  A short recap of the main points:

  1. Lack of standardization – each database provides its own API and query language, which is often tightly coupled with its implementation.
  2. “No fancy queries” – users need to write their own de-normalization routines outside the database, which is prone to errors.
  3. Data migration – de-normalized data is more sensitive to software updates than normalized data.  As a result, more throw-away code needs to be written just to perform data migrations.

In this post I will present what I call NoDatalog, and explain how it can, at least in theory, overcome these three issues.

Oh No, Datalog!

OK, so before I can talk about NoDatalog I need to talk a bit about Datalog, just as I gave a short recap of SQL before talking about NoSQL…

So Datalog is a query language that was popular in certain circles around the 1980s and ’90s.  It was similar in a lot of ways to SQL, and in fact, some of its implementations were actually compilers that translated Datalog queries to SQL so that they could run on relational databases.  I don’t like Datalog, and I don’t think anyone in the 21st century should bother himself or herself with it, but like many things that were done decades ago, Datalog is worth knowing (at least to some level), so we can build bigger and better things using today’s technologies.

Specifically, Datalog is a query language for a class of databases known as deductive databases.  Deductive databases are based on the principles of logic programming, that is, they provide query results through logical deduction.  In a deductive database, tables are replaced with predicates, records are replaced with facts, and views are replaced with rules.

Consider the Twitter-like system we described in Part 1.  For this system, instead of having three tables, an implementation based on a deductive database would have three predicates: user/1, following/2 and tweet/2.  The syntax name/number gives the name of each predicate along with the number of arguments it takes.  Each predicate can have a large number of facts associated with it.  For example, the predicate following/2 can have the fact:

following(alice, bob).

associated with it, and the predicate tweet/2 can have the facts:

tweet(alice, 'Hello').
tweet(bob, 'Howdy').

associated with it (we do not need user/1 for this example).  Please note the syntax we used to express the facts.  This syntax looks like a function call in a C-like language, but it is nothing like a function call.  It is just a way to provide the name of a predicate, along with the arguments that make up a certain fact.  For example, the fact following(alice, bob) means that Alice follows Bob, and the fact tweet(bob, ‘Howdy’) means that Bob tweeted “Howdy”.  If we translate this to SQL, each fact is actually a record, where the arguments are the values given to each field.  But unlike relational databases, which give a name to each field, in Datalog each field is identified by its position, just like an argument to a function in C.
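
To make the positional-field point concrete, here is the same fact annotated with the meaning of each argument position (the second fact below is a hypothetical addition for illustration, not part of the running example):

% In following/2, meaning comes from argument position, not from field names:
% argument 1 is the follower, argument 2 is the user being followed.
following(alice, bob).   % Alice follows Bob
following(bob, alice).   % hypothetical: Bob follows Alice back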

So we can store our data.  But how do we query it?  How do we query a user’s timeline?

To do that, we need to define a rule, which, like an SQL view, looks like a predicate (table) from the outside, but is implemented using a query.

timeline(A, B, T) :-
    following(A, B),
    tweet(B, T).

So timeline/3 is a predicate, but it is defined using the above rule.  The rule can be read as:

A's timeline contains tweet T by B if:
    A follows B, and
    B tweeted T.

(the “:-” can be read as “if”, and the “,” can be read as “and”).  A, B and T are what we call logic variables.  Datalog knows these are variables (and not concrete values) because they start with a capital letter.

If we perform the query:

? timeline(alice, B, T).

Datalog assigns A=alice, and then performs the query:

? following(alice, B), tweet(B, T).

It starts by trying to find facts that match

following(alice, B)

and finds

following(alice, bob).

So it assigns B=bob, and goes on to match:

tweet(bob, T)

and finds:

tweet(bob, 'Howdy').

so it sets T=’Howdy’.  This assignment (B=bob, T=’Howdy’) is returned to the user as a result of the query.

Remember I wrote that predicates are nothing like functions?  Well, in this example the timeline/3 predicate succeeded and “returned” the values bob and ‘Howdy’ in a way that may look to a C programmer like “pass by reference”.  However, if the fact tweet(bob, ‘Hi Alice’) had also been in the database, our query would have had two results.  Similarly, if no fact matching tweet(bob, T) were in the database, the query would have had zero results.  A Datalog query can have any number of results, just like an SQL query may have any number of results.
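
As a sketch of that two-result case (using a Prolog-style listing of answers; an actual Datalog implementation may present results differently), the query and its results would look roughly like this:

% Facts in the database, including the hypothetical second tweet:
following(alice, bob).
tweet(bob, 'Howdy').
tweet(bob, 'Hi Alice').

% The same timeline query:
? timeline(alice, B, T).

% Two results:
% B = bob, T = 'Howdy'
% B = bob, T = 'Hi Alice'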

So What??

So Datalog is a lot like SQL, but with different syntax.  Like in SQL, predicates must conform to a schema that assigns types to their arguments, and as a result, to get useful information one needs to perform queries like the one above, and these take time… especially when the data is too big to fit on one machine.  Datalog is not SQL, but it is also not NoSQL.  As a result (like I said before), no one in the 21st century should bother him/herself with it.  So why did I spend so many words describing it?

The reason is that if we take the principles it is based on and change the rules of the game just a little bit, we get something that scales nicely on two levels:

  1. Like NoSQL, it can preserve low latency in the face of large data sets.
  2. Unlike NoSQL, it does not complicate the software feeding data into the database, keeping queries declarative (as in SQL and Datalog), focused on the “what” and not on the “how”.

In my latest paper, my co-author and I called this class of databases NoDatalog, because they are to Datalog what NoSQL is to SQL.

NoDatalog

OK, so imagine that instead of writing the timeline rule as above, we had written it like this:

following(A, B) -> tweet(B, T) -> timeline(A, B, T).

What changed?  timeline(A, B, T) moved from the beginning to the end, and both the “:-” and the “,” were replaced with arrows (“->”).  This is not much of a change, but it changes everything!

This rule is written in the Cloudlog language, a specific NoDatalog language that we present in the paper.  Unlike the Datalog timeline rule, which works at query time, this rule works at update time.  This means that when a user inserts the fact:

following(alice, bob).

to the database, the rule is activated, assigning A=alice, and B=bob.  As a result, whatever is on the right-hand-side of the arrow enters the database, after we apply this assignment:

tweet(bob, T) -> timeline(alice, bob, T).

This also is a rule, which now waits for tweets.  When a tweet like:

tweet(bob, 'Howdy').

enters the database, the new rule is activated, assigning T=’Howdy’ and adding the fact

timeline(alice, bob, 'Howdy').

to the database.  Then, when Alice queries for her timeline, the database does not have to work very hard.  All timeline entries are already waiting for her.  In fact, if our database uses the first argument of a predicate as its primary key, this query will be very efficient, regardless of how big the data is and how it is split across servers.
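
Assuming the same query syntax as in the Datalog example above (a sketch, not necessarily the exact Cloudlog syntax), Alice’s timeline query is now a plain lookup of facts that were materialized at update time:

% Fact already materialized by the rules when the data was inserted:
timeline(alice, bob, 'Howdy').

% Alice's query only needs to look up facts keyed on her user name:
? timeline(alice, B, T).
% Result: B = bob, T = 'Howdy'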

But what if Bob tweeted first, and only then Alice started following him, you may ask?  Or what happens if Alice stops following Bob (removes the fact following(alice, bob)) or Bob deletes a tweet?  Our database needs to be able to handle all these cases (4 overall):

bash> echo {adding,removing}-a-{fact,rule}

These four cases are handled once, in our database code-base.  Application developers do not have to burden themselves with these cases anymore.
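
For example, here is a sketch of the “removing a fact” case as I understand it (the exact retraction mechanism is an assumption on my part, not spelled out above): removing the original following fact should retract everything derived from it.

% Removing this fact...
following(alice, bob).
% ...should cause the database to retract the intermediate rule it produced:
tweet(bob, T) -> timeline(alice, bob, T).
% ...and, in turn, the timeline fact that rule produced:
timeline(alice, bob, 'Howdy').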

Another possible problem is a race condition between adding a fact (e.g., a tweet) and a matching rule (e.g., one created because some user started following the tweeting user at about the same time).  If we are not careful, these changes can miss each other, and the tweet will never appear in the follower’s timeline.  A NoDatalog database taking this approach should be carefully designed to handle such scenarios.  NoSQL leaves this responsibility to the application.

NoSQL Done Right

We have already covered item 2 on the list of “things we would like to fix” in NoSQL: NoDatalog compensates for “lack of fancy queries” with “fancy rules” applied when we do have the time… during (or right after) updates are made to the database.

This also addresses item 3 – data migrations.  When we upgrade our software, we replace rules in the database: we remove the old rule and add a new one.  When we remove the old rule, the database searches for all the matching facts and removes the products of applying the rule to them.  Then, when adding the new rule, all matching facts are processed again to add new products (facts or rules).  If we perform the removal and addition in a single operation, the database can avoid removing and re-adding the same product in cases where the change had no effect.  Where the rule change did affect the result, the result (fact or rule) is replaced in a single operation.  In any case, the application developers do not have to bother themselves with migration code – the database migrates the data on its own.
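
As a purely hypothetical illustration (this particular rule change is not from the paper), an upgrade that changes what a timeline entry carries would be expressed as removing the old rule and adding a new one in a single operation; the database then recomputes the affected products on its own:

% Old rule, removed during the upgrade:
following(A, B) -> tweet(B, T) -> timeline(A, B, T).

% New rule, added in the same operation; hypothetically the timeline
% entry now carries only the tweet text, not the author:
following(A, B) -> tweet(B, T) -> timeline(A, T).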

As for item 1, NoDatalog as a concept does not bring consolidation.  However, languages such as Cloudlog can bring this consolidation and standardization, if we are careful enough to decouple them from the underlying implementation.  What makes this possible here, and impossible in NoSQL, is the fact that we rely on a well-founded theory, First Order Logic, as well as a long history of logic programming languages.  We wrote our Cloudlog paper well before we had a working prototype of a database implementing it.  Since its conception, we have experimented with a few very different design ideas for its implementation.  This leads me to conclude that, as in SQL, there are several ways in which such a database can be implemented, and these ways are independent of the language, giving the language (be it Cloudlog or any other NoDatalog language that catches on) a life beyond any single database implementation.

Wrapping Up

I started this discussion in the previous post, talking about NoSQL as the solution chosen by the Internet industry to overcome the limitations of relational databases, and about what is missing in this solution.  In this post I described Datalog and deductive databases, and argued why a new kind of deductive database, which I call NoDatalog, is the answer.

NoDatalog is a vision.  Currently we have a working prototype of what is probably the world’s first NoDatalog database, a database code-named Cloudlog1.  As the name suggests, it is an implementation of the Cloudlog language we described in our paper.  The Cloudlog language and the design of the Cloudlog1 database are topics for future posts, but before I can get to them, I need to write some posts on the language in which Cloudlog1 was implemented: Cedalion.
