The Perfect Storm

In my previous post I talked about thoughts I’ve been having and a change in direction I’ve been considering. In the three weeks since writing that post I have been busy researching the Clojure ecosystem for things I could use, to validate or invalidate this direction. This felt like “the perfect storm”: ideas popping up and crashing down. I learned a lot of new things and came to respect the immense amount of work done by the Clojure community in recent years, and the deep thought that was put into this work. Today, while I don’t yet have all the pieces figured out, I know, thanks to this community, that what I want to achieve is feasible.

In this blog post I’ll capture my design ideas for what I currently call Cloudlog.clj.

Cloudlog.clj: A Servant of Two Masters

Cloudlog.clj is a servant of two masters.  Its first master is an obvious one — the application developer.

Master #1: The Developer

Cloudlog.clj is an application platform, and exists to help application developers develop applications. From that angle, Cloudlog.clj is intended to make it incredibly easy to create applications.

With Cloudlog.clj, developers will:

  • Need to learn only one programming language — Clojure, for both the server-side and the client-side.  Yes, you can’t find that many ready-made Clojure programmers out there, and it has all those weird parentheses, but it is worth the effort.  Besides, for both the client and the server you only use DSLs, so learning them is much easier than learning the entire language.
  • Define the business logic, not implement it. For reasons I describe below it is important that Cloudlog.clj limits user code to a pure subset of Clojure.  With this, the developer only defines transformations on data, in the form of logic rules applied to facts.

As a consequence of that, the developer will not:

  • Have to worry about scaling the application. OK, this is a strong statement, since developers always need to worry about scaling, but here we have an opportunity to separate the logic of the application (declarative rules) from its tuning (allocating computational resources, setting queue sizes, etc.).
  • Have to write one-time code for data migration.  Data migration will be handled automatically once you submit a new version (that changes the data).  It’s just a matter of time and money (for compute resources in the cloud).
  • Have to worry about access control. This is where the real revolution is. Traditionally, access control is handled by the application’s business logic. But Cloudlog.clj does not trust the application to do this well. This is related to the other master.

Master #2: The End User

The most prominent thing that makes Cloudlog.clj special is that it is not designed to be deployed and run by the developers themselves. It is intended to be provided as a service (PaaS) by a third party, serving both application developers and end users. I already described the main idea in this blog-post, and in this paper. This setting allows Cloudlog.clj to serve its second master: the end user. You can say that any application platform or framework serves the end user by helping the developer create better applications. But Cloudlog.clj is designed to go well beyond this. It is intended to also protect the user from the developer.

As I explain in the blog-post and the paper linked above, Secure Cloudlog assigns two attributes to each fact and rule: its reader-set and its writer-set. These are representations of sets (in the mathematical sense) of users: those who are allowed to read the fact or rule, and those on whose behalf it was written. Simple set operations are applied as rules are applied to facts, and simple logic is used to decide whether a certain user may perform a certain operation, such as writing a fact, deleting it or making a query. With these annotations, users have the final say regarding their own data. They specify who can read it, and when they want a piece of data removed or modified, they have the power to do so. One important consequence of all this is that the application itself does not have access to the data! It only publishes a set of rules that define what Cloudlog.clj should do with the data.

Cloudlog.clj will implement this.  It will authenticate users and developers, and allow them to write facts (attributed to themselves), make queries (and only show them what they are allowed to see), and push software versions that only specify the logic of the program, not who is allowed to see what.
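To make the reader-set and writer-set idea a bit more concrete, here is a minimal sketch in plain Clojure. It is not how Cloudlog.clj implements this. It assumes that the sets are small enough to list explicitly, and that a derived fact is readable only by users who may read every fact it was derived from (an intersection). The names annotate, derived-readers and may-read? are made up for the illustration.

```clojure
(require '[clojure.set :as set])

;; Hypothetical representation: a fact together with the set of users
;; allowed to read it and the set of users on whose behalf it was written.
(defn annotate [fact readers writers]
  {:fact fact :readers (set readers) :writers (set writers)})

;; Assumption: a derived fact is visible only to users who are allowed
;; to read every input fact that contributed to it.
(defn derived-readers [input & more]
  (apply set/intersection (map :readers (cons input more))))

;; A user may read an annotated fact if they are in its reader-set.
(defn may-read? [user annotated]
  (contains? (:readers annotated) user))
```

The real representation has to describe sets without listing their members, so treat this only as a picture of the kind of set operations involved.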

Making it Developer-Friendly

OK, so I mentioned you’ll have to know a bit of Clojure to use Cloudlog.clj. I fear this will scare off some potential users, but these are probably the wrong users anyway. If you’re not willing to spend a week learning a new language, you probably won’t be willing to adjust your entire way of thinking about application development to accommodate the deeper ideas behind Cloudlog.clj.

So having passed this barrier, we want to invent as little as we have to, and make Cloudlog.clj as much Clojure as it can be, and as little Cloudlog as it must be.  This way of thinking brought me to a few decisions:

  1. Cloudlog.clj will implement a DSL over Cloudlog, where rules will be interpreted by a macro.
  2. We will follow coding and naming conventions common in Clojure, such as naming the rule-defining macro defrule.
  3. Whenever arbitrary logic needs to be used (i.e., in what Cloudlog calls “guards”) we will use plain Clojure.

Although just a small part of it is currently implemented, I can already give a small example of what an application should look like. The example I’ll give is the micro-blogging application example I’ve been using time and time again. The idea is simple: users tweet and follow other users. Eventually we need to present users with their timelines, which are the aggregation of all tweets made by the users they follow. With Cloudlog, tweets and following relationships are represented as facts. In Cloudlog.clj, facts are represented as vectors, with the first element being a fully-qualified keyword. For example, the fact that the user @brosenan tweeted “Hello, World” at time 123456 can look like this:

[:twitter/twitted "brosenan" "Hello, World" 123456]

Similarly, the fact that some user @foo follows @brosenan can be represented using this fact:

[:twitter/follows "foo" "brosenan"]

The aggregation of tweets made by users we follow into our timeline can be done using this rule:

(defrule followee-tweets-in-timeline
  [:twitter/follows A B :by [:user A]]
  [:twitter/twitted B T TS :by [:user B]]
  [::timeline A [:twitted B T] TS])

This rule, named followee-tweets-in-timeline, will create an entry in user A’s timeline for each tweet T made by each user B whom A follows. Since a timeline can consist of different kinds of entries, we tag the event that user B tweeted tweet T using the :twitted keyword.

The :by keyword asserts that the user making the statement is the one we expect. If, for example, user @foo wrote a fact stating that user @bar tweeted something, this rule would ignore it.
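One way to read this rule is as a join over facts. The following plain Clojure sketch is not how Cloudlog.clj evaluates rules; it only mimics the outcome on an in-memory collection. It assumes each statement is a map with the hypothetical keys :fact and :by (the user who wrote the fact), and it uses :example/timeline in place of ::timeline so that the result does not depend on the current namespace.

```clojure
;; A plain-Clojure reading of the followee-tweets-in-timeline rule.
;; Each statement is {:fact [...] :by [:user id]} (a made-up shape).
(defn followee-tweets-in-timeline [statements]
  (for [{[kind-f a b] :fact, by-f :by} statements
        ;; A follows B, and the statement was made by A.
        :when (and (= kind-f :twitter/follows) (= by-f [:user a]))
        {[kind-t b2 text ts] :fact, by-t :by} statements
        ;; B tweeted, and the statement was made by B.
        :when (and (= kind-t :twitter/twitted) (= b2 b) (= by-t [:user b]))]
    [:example/timeline a [:twitted b text] ts]))

(followee-tweets-in-timeline
 [{:fact [:twitter/follows "foo" "brosenan"] :by [:user "foo"]}
  {:fact [:twitter/twitted "brosenan" "Hello, World" 123456] :by [:user "brosenan"]}
  ;; Written by @foo in @brosenan's name, so the :by check drops it.
  {:fact [:twitter/twitted "brosenan" "forged" 1] :by [:user "foo"]}])
;; => ([:example/timeline "foo" [:twitted "brosenan" "Hello, World" 123456]])
```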

Now imagine we also want to place in A’s timeline entries for all the tweets in which user A is mentioned. To do so we need to find all the user IDs mentioned in a tweet, and place the tweet in each of these users’ timelines. We can do this with Clojure’s support for regular expressions, using a guard:

(defrule user-mention-in-timeline
  [:twitter/twitted B T TS :by [:user B]]
  (for [A (map second (re-seq #"@(\w+)" T))])
  [::timeline A [:mentioned-you B T] TS])

We used a Clojure for form to iterate over all the instances of @userid in tweet T.  The function re-seq returns all these by applying the regular expression @(\w+) repeatedly on the text.  The parentheses in the regular expression allow us to extract the user ID (e.g., “foo” from “@foo”), so each result is a pair (e.g., [“@foo” “foo”]).  We call (map second) to extract the second element of each pair.  In a regular for form in Clojure there is also a body, which is returned for every possible binding.  For example, in plain Clojure the expression:

(for [x ["foo" "bar"]]
  ["Hello" x])

will return:

(["Hello" "foo"] ["Hello" "bar"])

In Cloudlog.clj rules we omit the body.  The body is actually replaced by the rest of the rule.  Similar to for forms, let can be used to perform calculations, and where and where-not can be used to filter results.
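Read this way, the user-mention-in-timeline rule behaves roughly like the following plain Clojure function, in which the rest of the rule has been moved back into the body of the for form. This is only a sketch of the idea, not what the defrule macro actually expands to. The function name mentions-in-timeline is made up, and :example/timeline stands in for ::timeline.

```clojure
;; Takes a single :twitter/twitted fact and returns one timeline entry
;; for every @userid mentioned in the tweet's text.
(defn mentions-in-timeline [[_ b text ts]]
  (for [a (map second (re-seq #"@(\w+)" text))]
    [:example/timeline a [:mentioned-you b text] ts]))

(mentions-in-timeline [:twitter/twitted "brosenan" "Hi @foo and @bar" 42])
;; => ([:example/timeline "foo" [:mentioned-you "brosenan" "Hi @foo and @bar" 42] 42]
;;     [:example/timeline "bar" [:mentioned-you "brosenan" "Hi @foo and @bar" 42] 42])
```

A tweet with no mentions yields an empty sequence, so no timeline entries are created for it.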

It’s OK to be Lazy

Growing up I was always told that working hard is the only way to succeed.  Work hard at school -> get good grades -> get accepted to a good university -> work hard -> get good grades -> get a good job -> work hard…  During my PhD I find myself working hard, but at least I’m doing it on things I think matter.  But sometimes being lazy is a better policy than working hard.  Especially when it comes to developing software.

I’m designing Cloudlog.clj as a massively distributed system that needs to scale well, be fault tolerant and provide low latency and high throughput. Dealing with all these requirements myself does not feel like the right thing to do with my time. So what would be a better approach? To use something that already exists! In this case, platforms and frameworks for big-data analytics, for both the streaming and the batch-processing use cases.

As analytics take a growing part in our lives (for example, many of us regularly get notifications on our smartphones about things that may interest us), there is a growing need for easy-to-use, highly scalable, high-performance platforms for big-data analytics. What started with Hadoop quickly became a rich ecosystem of platforms and frameworks, some based on Hadoop itself, some not. Most of these platforms and frameworks, like Hadoop, address the batch-processing use-case. They get a job with large but bounded input, they work for some time, and finally produce output. Only a few of these platforms and frameworks address the other use-case: stream processing. This use-case is for unbounded input, like feeds coming from users or from IoT devices, which needs to be processed online. This is not very different from what happens in a modern web application. Tweets are pouring in, and the application needs to place them in timelines.

Apache Storm is probably the most prominent stream-processing platform in the analytics world. It is written mostly in Clojure and supports Java (as well as Scala and Kotlin) in-process, along with a wealth of other languages such as Python through a script interface. Clojure is supported with a special DSL. From playing with it for a few hours I can say it has a low entry threshold, meaning that it is simple to do simple things with it. I still do not know how easy or how hard it would be to do more complicated stuff.

Using Storm to implement Cloudlog.clj can off-load the burden of taking care of distributed systems aspects from me.  All I need to do is to define a few Bolts and a few Spouts, and define how rules translate into topologies.  It’s not trivial, but much easier than implementing a distributed system.  Also, since Storm is production-proven, using it will bring me much closer to production-readiness.

Unfortunately, Storm by itself does not cover all the aspects of what Cloudlog.clj needs to provide. Data migration is an important aspect. Imagine we modify one of the rules we defined in our Twitter-like application. All the data it produced needs to be re-calculated. My plan for avoiding down-time during the migration is to place calculated data in tables which have the version of the rules producing them as part of their name. With this, different versions can co-exist. This can be great for things like A/B testing and for being able to roll back quickly in case of a serious bug in the new version, but it also means that when creating a new version (of a rule) someone needs to go through all the historic data and produce the calculated data for it. This is a batch job, which can be performed by Hadoop or any other batch-processing platform. As this ecosystem is much richer than the one for streaming, I am still on the lookout for the best fit there. But again, it is good to be lazy. I should not try to roll my own thing there.
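The versioned-table idea can be sketched in a couple of lines of Clojure. The naming scheme here (rule name, then "_v", then a version number) is purely an assumption for illustration; I have not settled on how versions will be identified, and table-name is a made-up helper.

```clojure
;; Hypothetical naming scheme: the table that holds a rule's calculated
;; data carries the version of the rule that produced it.
(defn table-name [rule-name version]
  (str (name rule-name) "_v" version))

(table-name 'followee-tweets-in-timeline 1)
;; => "followee-tweets-in-timeline_v1"
(table-name 'followee-tweets-in-timeline 2)
;; => "followee-tweets-in-timeline_v2"
```

Because the two versions map to different tables, the old version can keep serving queries while a batch job fills the table for the new one.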

Conclusion

In this post I “officially” announced the launch of a project I now call Cloudlog.clj (suggestions for better names are welcome).  While I know the general direction, much is still to be figured out.  As always, the biggest problem is allocating the time for it.

As I’m new to Clojure and to big-data platforms, I’d appreciate any feedback you may have.

Interesting times lie ahead…

 
