Transaction In Doubt

My notes on Software Engineering

Miss Manners and Waltzdb

While durable_rules is a framework for complex event coordination, its linguistic abstraction is also suitable for solving business rules problems such as constraint propagation. At its core, durable_rules is powered by a variation of the Rete algorithm. Measuring and driving its performance using well-established Business Rules industry standards has been extremely valuable. It has brought the quality of the implementation to a new level: improving performance by orders of magnitude and making the system more robust.

The most popular industry standards for production Business Rules systems are “Miss Manners” and Waltzdb. Designed more than 25 years ago, they have been the subject of endless debate: their direct applicability to customer problems, as well as features added by platform vendors specifically to improve the numbers, have been questioned. Nevertheless, such benchmarks are still used to understand the performance of Business Rules systems. Manners and Waltzdb are complementary benchmarks: while Manners tests the speed of combinatoric evaluation, Waltzdb tests the efficiency of action scheduling and evaluation.

Just as Thomas Edison once famously said about genius, performance improvements too are one percent inspiration and ninety nine percent perspiration. Because durable_rules relies on Redis to store inference state, all performance improvements derived from optimizing how Redis is used. For each experiment I reset the slowlog, executed the test, reviewed the slowlog results and implemented an improvement (or sometimes a regression). I repeated this over and over. In my first experiments I set the slowlog threshold to 1 second; as I implemented effective improvements I reduced the threshold, and by the end I was only reviewing the commands that took more than 1 millisecond.
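
A minimal sketch of that measurement loop, assuming a local Redis instance and the redis-py client (the threshold values are only an example):

    import redis

    r = redis.Redis(unix_socket_path='/tmp/redis.sock')

    # log any command slower than 1 millisecond (the value is in microseconds)
    r.config_set('slowlog-log-slower-than', 1000)
    r.slowlog_reset()

    # ... run the Manners or Waltzdb test here ...

    # review the slowest commands recorded during the run
    for entry in r.slowlog_get(25):
        print(entry['duration'], entry['command'])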

All the performance measurements were done using Redis 3.0 over unix sockets, with Redis disk saves disabled, on an iMac (4GHz i7, 32GB 1600MHz DDR3, 1.12TB Fusion Drive).


Consider the following tests plotted in the chart below:

  1. Blue: Execute a set of sequential hget operations from a node.js client in a tight loop. The throughput for retrieving a 200 byte value is about 25K operations per second; as the size increases, the throughput decreases to 11K operations per second for 20KB values.
  2. Green: A set of 100 hget operations batched in a multi-exec command. The throughput for a 200 byte value is around 160K operations per second, degrading to 18K when the value is 20KB.
  3. Yellow: A set of hget operations inside a Lua script. The throughput for 200 bytes jumps all the way up to 3M operations per second, degrading to 458K for 20KB values.
  4. Red: A set of cmsgpack unpack operations inside a Lua script. The throughput for 200 bytes is 9M operations per second, which degrades to 1.1M for 20KB values.
  5. Orange: A set of Lua table gets inside a Lua script. The throughput remains constant at 63M operations per second.
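
A rough sketch of how the sequential, batched and Lua-scripted variants above could be reproduced with the redis-py client; the key and field names are made up for illustration:

    import redis

    r = redis.Redis(unix_socket_path='/tmp/redis.sock')
    r.hset('bench', 'field', 'x' * 200)   # the 200 byte value under test

    # blue: sequential hget calls, one network round trip each
    for _ in range(1000):
        r.hget('bench', 'field')

    # green: 100 hget calls batched in a single MULTI/EXEC round trip
    pipe = r.pipeline(transaction=True)
    for _ in range(100):
        pipe.hget('bench', 'field')
    pipe.execute()

    # yellow: the hget loop runs inside a Lua script, right next to the data
    lua_hgets = r.register_script("""
    local value
    for i = 1, tonumber(ARGV[1]) do
        value = redis.call('hget', KEYS[1], 'field')
    end
    return value
    """)
    lua_hgets(keys=['bench'], args=[1000])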


What is important about the experiment above is the order of magnitude of each operation. For the Manners and Waltzdb benchmarks, where hundreds of thousands of combinations need to be evaluated in a few seconds, reducing sequential and batched access from Redis clients is critical but not enough. Access to Redis structures from Lua, as well as packing and unpacking with cmsgpack, can also become a bottleneck. Thus, all performance optimizations followed these principles:

  1. Push the combinatorics evaluation as close to the data as possible by leveraging Lua scripts.
  2. Design chunky Lua script function signatures, batching operations when possible.
  3. Avoid fetching the same information multiple times from Redis data structures by using Lua local variables.
  4. Be careful with the shape of the data stored; avoid excessive cmsgpack usage.

Miss Manners

Miss Manners has decided to throw a party. She wants her guests not to get bored, so she needs to seat them such that adjacent guests are always of the opposite sex and share at least one hobby. The problem needs to be solved for 8, 16, 32 and 128 guests.


The ruleset for the algorithm is relatively simple, consisting of only seven rules (Ruby, Python, JScript). Guests' hobbies are expressed using facts. Therefore the input datasets consist of 20, 40, 83 and 439 facts for 8, 16, 32 and 128 guests respectively. Conceptually (not practically), the declarative algorithm creates the combinatorics tree for all the guests and selects the first valid solution via depth first search.


The antecedent in the rule below (expressed in Python) is the heart of the algorithm, as it determines valid solutions. Consider the case of 128 seatings with 439 guest assertions: by the end of the test execution the first three terms of the expression will have generated hundreds of thousands of combinations, and the next two terms will have filtered out the invalid ones. Recalculating the full expression, for all terms and all combinations, every time a new seating is asserted slows down execution significantly. I addressed the problem with three improvements, listed after the rule:

    @when_all(c.seating << (m.t == 'seating') &
                           (m.path == True),
              c.right_guest << (m.t == 'guest') &
                               ( == c.seating.right_guest_name),
              c.left_guest << (m.t == 'guest') &
                              ( != &
                              (m.hobby == c.right_guest.hobby),
              none((m.t == 'path') &
                   (m.p_id == c.seating.s_id) &
                   (m.guest_name ==,
              none((m.t == 'chosen') &
                   (m.c_id == c.seating.s_id) &
                   (m.guest_name == &
                   (m.hobby == c.right_guest.hobby)),
              (m.t == 'context') & (m.l == 'assign'))
    def find_seating(c):

  1. When evaluating combinatorics and resolving action scheduling conflicts, the rules engine has to choose the most recent result to enable the depth first search for a valid solution. This resolution policy has been implemented and pushed as a de-facto standard since the early days of Business Rules.
  2. For a given rule, to avoid scanning and evaluating all expression lvalues (the inference list) for every assertion, the inference results are stored in a hash-set indexed by the combined result of all equality expressions (see the sketch after this list). The combinatorics calculation thus becomes linear in the number of assertions made against the rule.
  3. When evaluating a fact through a complex expression, the engine might compare it with the same target fact multiple times; caching target facts in Lua local variables reduces the number of Redis data-structure accesses.
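
A toy sketch of the indexing idea in improvement 2, using the redis-py client; the key and field names are invented for illustration and are not the ones durable_rules actually writes:

    import json
    import redis

    r = redis.Redis(unix_socket_path='/tmp/redis.sock')

    def index_guest(guest):
        # store the partial match under the value of its equality term
        # ( == seating.right_guest_name), so it can be probed directly
        r.hset('manners!right_guest', guest['name'], json.dumps(guest))

    def probe_right_guest(seating):
        # O(1) lookup of candidate guests for a new seating assertion,
        # instead of re-evaluating the expression against every guest fact
        match = r.hget('manners!right_guest', seating['right_guest_name'])
        return json.loads(match) if match else None

    index_guest({'t': 'guest', 'name': 'n1', 'sex': 'm', 'hobby': 'h1'})
    probe_right_guest({'t': 'seating', 'right_guest_name': 'n1'})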


JavaScript is the fastest: it solves the problem for 128 guests in about 1.7 seconds; on the flip side, its memory usage is the highest, reaching 46MB including Redis memory. Python can still solve the 128 guest problem in under 2 seconds, with remarkably efficient memory usage of 26MB including Redis memory. Ruby is not far behind, at a little over 2 seconds and a good 35MB including Redis.


As shown in the graph below, benchmark execution time is directly correlated with the number of Redis commands executed. In the case of 128 guests more than 400K commands were executed. In the case of JavaScript, this is a throughput of 225K Redis commands per second.



Waltzdb

Waltz is an image recognition algorithm invented by Dr. David Waltz in 1972. Given a set of lines in a 2D space, the computer needs to interpret the 3D depth of the image. The first part of the algorithm consists of identifying four types of junctions: Line, Arrow, Fork, Tee. Junctions have 16 different possible labelings following Huffman-Clowes notation, where + is convex, – is concave and an arrow is occluding. Valid junctions and labelings:


It is important to point out that pairs of adjacent junctions constrain each other's edge labeling. So, after choosing the labeling for an initial junction, the second part of the algorithm iterates through the graph, propagating the labeling constraints by removing inconsistent labels. Example: labeling a cube with 9 lines:


The ruleset for the algorithm consists of 34 rules (Ruby, Python, JScript). In the Waltzdb benchmark, the dataset provided is a set of cubes organized in regions. Each region consists of 4 visible and 2 hidden cubes composed of 72 lines. In addition, all tests are flanked by 4 visible and 1 hidden cube formed by 76 more lines. The tests were run with 4 regions (376 lines, 20 visible cubes, 9 hidden cubes), 8 (680 lines, 36 visible cubes, 17 hidden cubes), 12 (984 lines, 52 visible cubes, 23 hidden cubes), 16 (1288 lines, 68 visible cubes, 33 hidden cubes) and 50 (3872 lines, 204 visible cubes, 101 hidden cubes).



Three optimizations were important for this benchmark:

  1. Waltzdb requires a large number of action evaluations. Some actions (reversing edges or making junctions) don't imply a change in context (moving the state forward); such actions can be batched and marshaled to the scripting engine in a single shot (see the sketch after this list).
  2. A fact assertion can lead to multiple rule evaluations. Instead of invoking a Lua function for every rule evaluation, all rule evaluations for a single fact assertion can be batched and marshaled to Redis in a single shot.
  3. durable_rules doesn't support fact modification. To avoid excessive churn in the scheduled action queue, events can be used (instead of facts) to drive the state machine coordination.
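
A minimal sketch of the batching idea in items 1 and 2, assuming the redis-py client; the script body and key names are placeholders rather than the actual durable_rules scripts:

    import json
    import redis

    r = redis.Redis(unix_socket_path='/tmp/redis.sock')

    # one EVAL round trip carries every evaluation triggered by a fact assertion
    batch_eval = r.register_script("""
    local batch = cjson.decode(ARGV[1])
    for _, evaluation in ipairs(batch) do
        -- stand-in for the real join logic: record each queued evaluation
        redis.call('hset', KEYS[1], evaluation.key, cjson.encode(evaluation))
    end
    return #batch
    """)

    pending = [{'key': 'reverse_edge!e7'}, {'key': 'make_junction!j1'}]
    batch_eval(keys=['waltzdb!evaluations'], args=[json.dumps(pending)])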


JavaScript was the fastest, completing the 50 region problem in 21 seconds; memory usage reached 170MB including Redis. Both Python and Ruby finished the problem in about 23 seconds, not far behind. The memory usage for both Python and Ruby was below 100MB (including Redis).


Just as with the Manners benchmark, the benchmark time is directly correlated with the number of Redis commands executed. The 50 region case executed almost 3.5M Redis commands. The JavaScript test has a throughput of 159K Redis commands per second.

durable_rules' efficiency in combinatorics evaluation is competitive, and its memory usage is remarkable. The results compare well with well-established production Business Rules systems. There is a performance penalty for marshaling actions in and out of Redis, as shown in Waltzdb.

Redis is a great technology, but it has to be used appropriately. The fact that Manners pushes 225K Redis commands and Waltzdb 159K Redis commands every second, while I was able to get 3M hget commands per second in a simple experiment, means I might still be able to squeeze more performance out of this system.


From its beginning, the objective of the durable_rules project has been to provide a library that can be leveraged in the most popular languages for building internet applications. Even though durable_rules borrows a healthy number of concepts from the Production Business Rules and Expert Systems domains, I consider it critical to tap into the infrastructure and knowledge base built over the last few years with languages such as JavaScript, Python and Ruby.

The initial durable_rules prototypes were built using JavaScript and the node.js environment. As I considered targeting a larger audience, the implementation evolved into a core engine written in C, using JSON as the lingua franca for defining both rulesets and the data to be processed. A set of thin adapters would allow using the engine from JavaScript, Python and Ruby. I borrowed the MongoDB JSON query syntax as the linguistic abstraction for defining rulesets. For optimal performance, the data in Redis would be stored using the message pack format (not JSON).


Because scripting languages have great support for the JSON data model (hash sets, arrays and primitives), I was optimistic (perhaps naive) that this approach would be a one-size-fits-all solution. However, as the project progressed, I found a couple of problems with this idea.

First problem

In durable_rules, the precedence of logical operators (and, or) and sequence operators (when_all, when_any) matters. In MongoDB an expression like 'qty > 100 and price < 9.95' is represented as '{ qty: { $gt: 100 }, price: { $lt: 9.95 } }'. This is nice and simple, but the JSON data model doesn't imply hash-set ordering. What if I don't want the engine to evaluate price when qty < 100? And then came other features such as correlations, statecharts, flowcharts, parallel, salience, tumbling windows, count windows... so I departed from MongoDB queries. The example below shows how a typical rule is defined in the latest durable_rules implementation. This rule waits for three events: the first of type 'purchase'; the second with the same 'ip' address as the first but a different 'cc' number; and the third with the same 'ip' address as the first, but a 'cc' number different from that of both the first and the second.

    suspect: {
        all: [
            {first: {t: 'purchase'}},
            {second: {
                $and: [
                    {ip: {first: 'ip'}},
                    {$neq: {cc: {first: 'cc'}}}
                ]
            }},
            {third: {
                $and: [
                    {ip: {first: 'ip'}},
                    {$neq: {cc: {first: 'cc'}}},
                    {ip: {second: 'ip'}},
                    {$neq: {cc: {second: 'cc'}}}
                ]
            }}
        ]
    }

Second problem

The rule above expresses precisely what I want, but I just cannot imagine manually defining tens, hundreds or thousands of rules in the format above. It is not sustainable. Ruby, with its great DSL support, gave me the creative inspiration for making significant improvements to the user model. By leveraging language extensibility I was able to define and follow a set of principles for the three libraries I have implemented so far (JavaScript, Python, Ruby):

  1. Leverage the language
    • Use existing expression language
    • Use existing scoping constructs
    • Use blocks, lambdas and decorators
  2. Enable compact and maintainable code
    • Avoid nesting\scoping when possible
    • Encourage the user to define short phrases, avoid cascading
    • Facilitate troubleshooting

To the extent possible I tried to drive uniformity across library implementations; however, I did not sacrifice usability for uniformity. Below is a summary of the features I used for implementing the libraries.


Notes on Ruby

Ruby has the most comprehensive language extensibility: operator overloading for logical and arithmetic expressions; property accessor interception to leverage '.' notation; block support to seamlessly integrate actions as part of the rule definition; and contexts applied to block execution to avoid unnecessary variable qualification in expression rvalues and action implementations. All these features lead to compact and fairly well integrated ruleset definitions.

Durable.ruleset :suspect do
  when_all c.first = m.t == "purchase",
           c.second = (m.ip == first.ip) & ( !=,
           c.third = (m.ip == second.ip) & ( != & ( != do
    puts "detected " + + " " + + " " +
  end
end

‘durable_rules’ exposes a REST API to post events and assert facts. I used the Sinatra framework to implement such an API. It was very easy to integrate, except for the threading model: for each request a synchronous response is expected and multiple concurrent requests are dispatched in different threads in a single process.

One caveat really bothers me: in Ruby it is not possible to overload logical 'and' and 'or', so I had to use the bitwise operators instead. Bitwise operator precedence is different from that of the logical counterparts, so logical expressions must be enclosed in parens: '(m.ip == first.ip) & ( !=' as opposed to 'm.ip == first.ip && !='.

Notes on Python

Python has great language extensibility, similar to Ruby: operator overloading for logical and arithmetic expressions, and property accessor interception, allow using '.' notation for tuple names; function decorators integrate rule actions; contexts leverage scoping constructs when defining rulesets, statecharts and flowcharts. This results in compact and well integrated ruleset definitions.

with ruleset('suspect'):
    @when_all(c.first << m.t == 'purchase',
              c.second << (m.ip == first.ip) & ( !=,
              c.third << (m.ip == second.ip) & ( != & ( !=
    def detected(c):
        print ('detected {0} {1} {2}'.format(,,

I used the Werkzeug framework to implement the REST API. I considered using Flask, but I really didn’t need all its features. Similar to Sinatra and Ruby: for each request a synchronous response is expected and multiple concurrent requests are dispatched in different threads in a single process.

Python has the same caveat as Ruby regarding logical 'and' and 'or' overloading. Lambda support in Python is fairly limited, so I resorted to function decorators, still with a very reasonable outcome. In Python '=' is a statement, so I had to use the '<<' operator for naming events and facts in correlated sequences.

Notes on JavaScript

JavaScript does not have operator overloading, so a set of functions for describing logical and arithmetic expressions needs to be provided. Property accessor overloading is not readily available either; I played a nifty trick to overcome this obstacle (see the note below). Lambda support enables rule action integration. Contexts allow leveraging scoping constructs when defining rulesets, statecharts and flowcharts. While the result seems acceptable, the question remains: has this crossed the usability tipping point?

with (d.ruleset('suspect')) {
    whenAll(c.first =,
            c.second = m.ip.eq(c.first.ip).and(,
            c.third = m.ip.eq(c.second.ip).and(,,
        function(c) {
            console.log('detected ' + + ' ' + + ' ' +;
        }
    );
}

Node.js provides great support for implementing the REST API. In addition the single threaded continuation based model is a beautiful invention (in my opinion), it helps with JavaScript ‘global’ variable management, enables non-blocking IO calls and provides nice concurrency control using multiple processes.

'': it is easy to take such an expression for granted. In order to enable it without requiring the user to define object hierarchies, I leveraged V8 extensibility in C++ (given that the rulesets run in node.js). In V8 it is possible to provide a proxy object that intercepts property Set\Get and calls back a JavaScript lambda:

C++ class definition:

class Proxy {
 public:
  explicit Proxy(Handle<Function> gvalue, Handle<Function> svalue) {
    Isolate* isolate = Isolate::GetCurrent();
    _gfunc.Reset(isolate, gvalue);
    _sfunc.Reset(isolate, svalue);
  }

  Handle<Object> Wrap();

  static Proxy* Unwrap(Local<Object> obj);
  static void Get(Local<String> name, const PropertyCallbackInfo<Value>& info);
  static void Set(Local<String> name, Local<Value> value, const PropertyCallbackInfo<Value>& info);

 private:
  Persistent<Function> _gfunc;
  Persistent<Function> _sfunc;
};

From JavaScript

var s = r.createProxy(
    function(name) { /* get callback */ },
    function(name, value) { /* set callback */ });

Please find the full implementation in durable_rules code base.

A note on multi-platform support

Unfortunately, a polyglot framework does not imply multi-platform support. In the Unix ecosystem, the 'durable_rules' libraries (including the C engine implementation) have a high degree of portability; having implemented the framework on Mac OS X, very few changes were needed to make it work on Linux. Integrating the Windows platform has been a lot more difficult. Visual Studio, the premier toolset for Windows, doesn't fully support C99 (variable length arrays, asprintf, snprintf...); luckily all these limitations can be overcome without fundamental changes or a separate code fork. Hiredis is not officially supported by Redis on Windows, but the MS Open Tech team has done a nice job porting and maintaining the Redis code base for Windows. I extracted the Hiredis portion for the 'durable_rules' C engine, as it heavily depends on it.



The past few months have been a time of intense study and creative software implementation, which led to the full development of the durable_rules forward inference algorithm. Before I release what I consider the durable_rules beta version, I would like to present my thoughts on what I have done with the Rete algorithm. It is difficult to talk about an algorithm without naming it, so in this note I refer to the durable_rules forward inference implementation as Rete_D (D for distributed).

As I read blogs and white papers on rules engine implementations, I found interesting similarities, at least in principle, between Rete_D and Drools (ReteOO as well as PHREAK). In particular the following features are worth mentioning:

  • Alpha node indexing: Rete_D accomplishes this by using a trie written in C (see my previous blog entry).
  • Beta node indexing: in Rete_D, `join` and `not` nodes are indexed using a Lua table.
  • Set oriented propagation: inserts and deletes are queued. During rule evaluation, inserts and deletes are evaluated against a set of frames with facts and events stored in Redis.
  • Heap based agendas: the agenda for rule execution is managed with a Redis sorted set and lists ordered by salience (see the sketch after this list).
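
A minimal sketch of the sorted-set agenda idea with the redis-py client; the key layout and action ids are invented, and the real agenda handling in durable_rules is more involved:

    import redis

    r = redis.Redis(unix_socket_path='/tmp/redis.sock')

    def schedule(action_id, salience):
        # lower scores pop first, so negate salience to run high-salience rules first
        r.zadd('fraud!agenda', {action_id: -salience})

    def next_action():
        # peek and remove the highest priority entry (a Lua script would make this atomic)
        entries = r.zrange('fraud!agenda', 0, 0)
        if entries:
            r.zrem('fraud!agenda', entries[0])
            return entries[0]
        return None

    schedule('fraud!r1!frame42', salience=10)
    schedule('fraud!r2!frame43', salience=5)
    next_action()   # -> b'fraud!r1!frame42'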

Rete_D has a number of important features, which make it unique. To demonstrate them I use a simple fraud detection rule written in Ruby.

Durable.ruleset :fraud do
  when_all c.first = m.t == "purchase",
           c.second = (m.t == "purchase") & (m.location != first.location) do
    puts "fraud detected " + first.location + " " + second.location
  end
end

In the example above, you might have noticed there are no type or model definitions. Events and facts are JSON objects, and rules are written against the JSON type system (sets of name-value pairs). A good analogy is the MongoDB query language, which assumes the existence of JSON objects and does not require an explicit structured schema definition. In fact, in my earliest Rete_D implementation I used the MongoDB syntax for defining rules. I departed from MongoDB queries because expression precedence when using operators such as And, Or, All and Any is ambiguous. In today's durable_rules implementation our example translates to:

fraud: {
    all: [
        {first: {t: 'purchase'}},
        {second: {$and: [{t: 'purchase'}, {$neq: {location: {first: 'location'}}}]}}
    ]
}

Events and facts are fully propagated down the network; that is, there are no property extraction constructs in either the library or the Rete_D structure. In a rule definition, events and facts have to be named, and are referenced by that name when expressing correlations. Any property name can be used. The results:

  1. Simpler rule definitions. 
  2. Integration with scripting languages, which provide great support for the JSON type system.


With Rete_D, the tree evaluation can be distributed among different compute units (call them processes or machines). There are three important reasons for doing so:

  1. Availability: avoid single points of failure.   
  2. Scalability: distribute the work utilizing resources available.
  3. Performance: consolidate event ingestion with real-time rule evaluation.

To make this viable, pushing evaluation as close to the data as possible is critical. Alpha nodes are evaluated by the Rete_D C library and can be distributed symmetrically across compute units (such as Heroku web dynos). Beta nodes are evaluated by a Lua script (created during rule registration) in the Redis servers where the state is kept. Events and facts can be evaluated in different Redis servers depending on the context they belong to. Finally, actions are executed by the compute units responsible for alpha node evaluation.
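
A small sketch of the context based partitioning idea, assuming the redis-py client; the host names and key layout are invented:

    import zlib
    import redis

    shards = [redis.Redis(host='redis-0'), redis.Redis(host='redis-1')]

    def shard_for(sid):
        # a stable hash of the context id picks the Redis server holding its state
        return shards[zlib.crc32(str(sid).encode('utf-8')) % len(shards)]

    # all state and beta evaluation for context 1 lands on a single shard
    shard_for(1).hset('fraud!state!1', 'label', 'pending')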



Events are first class citizens in the Rete_D data model. The difference between events and facts is that an event can only be seen once by any action. This principle implies that events can be deleted as soon as a rule is resolved and before the corresponding action is scheduled for dispatch, which can dramatically reduce the combinatorics of join evaluation and the amount of unnecessary computation in some scenarios.

To illustrate this point, let’s add the following rule to the example above:

  when_start do
    assert :fraud1, {:id => 1, :sid => 1, :t => "purchase", :location => "US"}
    assert :fraud1, {:id => 2, :sid => 1, :t => "purchase", :location => "CA"}
    assert :fraud1, {:id => 3, :sid => 1, :t => "purchase", :location => "UK"}
    assert :fraud1, {:id => 4, :sid => 1, :t => "purchase", :location => "GE"}
    assert :fraud1, {:id => 5, :sid => 1, :t => "purchase", :location => "AU"}
    assert :fraud1, {:id => 6, :sid => 1, :t => "purchase", :location => "MX"}
    assert :fraud1, {:id => 7, :sid => 1, :t => "purchase", :location => "FR"}
    assert :fraud1, {:id => 8, :sid => 1, :t => "purchase", :location => "ES"}
    assert :fraud1, {:id => 9, :sid => 1, :t => "purchase", :location => "BR"}
    assert :fraud1, {:id => 10, :sid => 1, :t => "purchase", :location => "IT"}
  end

Because each fact matches both the first and the second rule expression, there are 10 x 10 comparisons, which produce 10 x (10 - 1) = 90 results. In contrast, let's replace 'when_start' with the following rule:

  when_start do
    post :fraud1, {:id => 1, :sid => 1, :t => "purchase", :location => "US"}
    post :fraud1, {:id => 2, :sid => 1, :t => "purchase", :location => "CA"}
    post :fraud1, {:id => 3, :sid => 1, :t => "purchase", :location => "UK"}
    post :fraud1, {:id => 4, :sid => 1, :t => "purchase", :location => "GE"}
    post :fraud1, {:id => 5, :sid => 1, :t => "purchase", :location => "AU"}
    post :fraud1, {:id => 6, :sid => 1, :t => "purchase", :location => "MX"}
    post :fraud1, {:id => 7, :sid => 1, :t => "purchase", :location => "FR"}
    post :fraud1, {:id => 8, :sid => 1, :t => "purchase", :location => "ES"}
    post :fraud1, {:id => 9, :sid => 1, :t => "purchase", :location => "BR"}
    post :fraud1, {:id => 10, :sid => 1, :t => "purchase", :location => "IT"}
  end

Remember each event can only be seen once; therefore there are 10 comparisons, which produce only 5 results. As you can see, events can dramatically improve performance in some scenarios.

Context References

It is common to provide rule configuration information by asserting facts. In some cases this can increase join combinatorics. Rete_D implements an eventually consistent state cache, which can be referenced during alpha node evaluation, thus reducing the beta evaluation load. Consider the following rule implemented in Ruby:

Durable.ruleset :a8 do
  when_all m.amount > do
    puts "a8 approved " + m.amount.to_s
  end
  when_start do
    patch_state :a8, {:sid => :global, :min_amount => 100}
    post :a8, {:id => 2, :sid => 1, :amount => 200}
  end
end

An event is compared against the 'min_amount' property of the context state object called 'global'. This evaluation is done by an alpha node using the state cached in the compute unit, which is refreshed every two minutes (by default). Again, this can reduce join combinatorics and deliver significant performance improvements in some scenarios.
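
A rough sketch of such an eventually consistent cache on the compute unit side; the key name, refresh handling and helper names are all hypothetical:

    import json
    import time
    import redis

    r = redis.Redis(unix_socket_path='/tmp/redis.sock')
    _cache = {'state': {}, 'fetched_at': 0.0}

    def global_state(refresh_after=120):
        # refresh the cached 'global' context state at most every two minutes
        if time.time() - _cache['fetched_at'] > refresh_after:
            raw = r.get('a8!state!global')
            _cache['state'] = json.loads(raw) if raw else {}
            _cache['fetched_at'] = time.time()
        return _cache['state']

    def alpha_test(event):
        # the alpha node reads the cached state instead of joining against a fact
        return event['amount'] > global_state().get('min_amount', 0)

    alpha_test({'id': 2, 'sid': 1, 'amount': 200})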


Rete_D provides a number of unique features, which I believe make it relevant for solving Streaming Analytics problems where rules are expected to handle a large amount of data in real time. The following features are not yet implemented in Rete_D. They represent exciting areas for future research.

  • Beta node sharing
  • Lazy rule evaluation
  • Rule based partitioning
  • Backward inference

Boosting Performance with C

Right after posting my last blog entry I began work on Durable Rules' third iteration. My main question was: how can I improve the framework's performance by an order of magnitude across all scenarios? As you can see, it has been quite a long time since my last blog post. Well... I have been very busy finding an answer to the question above.

Early on, I decided to re-implement the core rules engine in C. Compared with scripting and memory managed languages, C provides good control over memory allocation and a lot of flexibility in data structure definition. Obviously, control and flexibility come at a price: it took me a lot longer to develop the code.

So, in the end, this exercise was not simply porting code from one language to another. I had to rethink and redesign the data structures. I also had to design and write a JSON parser. And I must say: I also took some time to invent a new feature to enable batch processing for streaming. The detailed explanation is in the following sections.

Data Structure Definition

The most important part of the project was to define a suitable data structure to represent a ruleset. The central principle: the system has to guarantee O(n) (linear) performance as a function of the input (event) size. To illustrate how this is achieved, let’s consider the simple rule definition below:

var d = require('durable');{
    rating: {
        track: {
            whenAll: {
                a: { event: 'start' },
                b: { event: 'end' }
            },
            run: function(s) { }
        }
    }
});
The code above waits for two events, 'start' and 'end', to execute the 'track' action. Conceptually this rule is represented by a Rete tree whose root is an alpha node for message based events (type = $m). This node is followed by two alpha nodes with the rules event = 'start' and event = 'end' respectively. The alpha nodes are joined by the 'all' beta node, which is followed by the track action node.


In my C implementation the Rete tree is a trie. Every time an event is observed, each property of the event message is evaluated in the trie. The conceptual counterpart of a trie node is an alpha node in the Rete tree; thus each trie node contains the rule to be applied to a single named property. Each trie node also points to the subsequent properties to be evaluated (typically derived from an 'and' expression). The leaves of the trie point to a linked list of beta nodes terminated by an action node.
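
A toy Python sketch of the idea; the real structure is packed C arrays (described below) and the node layout here is invented:

    # each trie node tests one named property and its leaves reference the
    # beta/action nodes to notify; only equality tests are shown
    ALPHA_TRIE = {
        'event': {
            'start': {'terminals': ['track!all!left']},
            'end':   {'terminals': ['track!all!right']},
        }
    }

    def evaluate(message):
        matched = []
        for name, value in message.items():
            tests = ALPHA_TRIE.get(name)
            if tests and value in tests:
                matched.extend(tests[value]['terminals'])
        return matched

    evaluate({'id': 1, 'event': 'start'})   # -> ['track!all!left']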


Very important: each trie node keeps the map of subsequent properties in a hash table. This is a slight deviation from the typical trie nodes used in text search, which keep a map of subsequent valid characters in an array. I decided to handle collisions in the hash table using a simple linear probing algorithm. With linear probing I get better memory usage and acceptable, predictably near-constant-time performance as long as the load on the tables is kept low.
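
For illustration, a minimal linear probing table in Python; the C version naturally works on fixed-size arrays of property name hashes rather than Python objects:

    class ProbingTable:
        def __init__(self, size=64):
            self.keys = [None] * size
            self.values = [None] * size

        def put(self, key, value):
            i = hash(key) % len(self.keys)
            while self.keys[i] is not None and self.keys[i] != key:
                i = (i + 1) % len(self.keys)   # collision: probe the next bucket
            self.keys[i], self.values[i] = key, value

        def get(self, key):
            i = hash(key) % len(self.keys)
            while self.keys[i] is not None:
                if self.keys[i] == key:
                    return self.values[i]
                i = (i + 1) % len(self.keys)
            return None

    # lookups stay near constant time as long as the load is kept low (no resizing here)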

In order to increase memory locality and make better use of the processor cache, the trie structure is packed into two arrays. One array holds all the alpha, beta and action nodes, while a second array contains all the hash table buckets for the trie nodes. Nodes and hash tables never hold pointers to other nodes, only array offsets.


JSON parsing

To reduce the cost of data transformation I decided to use JSON as the event type system for this new rules engine. Thus a JSON object can be passed directly from a Web request input stream without any extra processing. C doesn't have built-in facilities for JSON parsing, so I wrote a lightweight JSON parser following a few strict principles:

  • Avoid Object Model (DOM) definition and buffering
  • Avoid using the heap, only use stack memory
  • Optimize for parsing key value pairs in a single pass
  • Calculate property name hashes while parsing

The JSON parser is really just a simple state machine tailored to its specific use in the rules engine. An important aside: I chose the DJB algorithm to hash property names because it is known to have good distribution and it adds minimal overhead while parsing. The property name hash codes, as explained in the section above, are critical for ultra fast evaluation of the Rete tree.
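
For reference, the djb2 hash is small enough to show in a few lines; this Python version only illustrates what the C parser computes as it scans each property name:

    def djb2(name):
        h = 5381
        for byte in name.encode('utf-8'):
            h = ((h * 33) + byte) & 0xFFFFFFFF   # hash = hash * 33 + c, kept to 32 bits
        return h

    djb2('event')   # names are hashed in the same pass that tokenizes them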


Compared with events that don't match any rule, evaluating a single positive event is expensive, because it needs to be recorded in the Redis cache. In some cases it triggers the evaluation of a beta join and in other cases the queueing of an action. Batching helps optimize the cost of all this activity and allows for processing large streams of data. Let's consider the following snippet of code:{
    approval: {
        rule: {
            whenSome: { $and: [{ subject: 'approve' }, { $lte: { amount: 1000 }}]},
            run: function(s) {}
        }
    }
}, '', null, function(host) {
    host.postBatch('approval', [{ id: '0', sid: 1, subject: 'approve', amount: 100 },
                                { id: '1', sid: 1, subject: 'approve', amount: 100 },
                                { id: '2', sid: 1, subject: 'approve', amount: 100 },
                                { id: '3', sid: 1, subject: 'approve', amount: 100 },
                                { id: '4', sid: 1, subject: 'approve', amount: 100 }]);
});

The ‘rule’ action will be able to process at least one event which matches the expression at the time of evaluation. The postBatch function allows clients to post an array of event messages for evaluation and dispatching.


In order not to lose track of my main objective, I constantly measured performance while developing the project. When talking about performance there is always a lot of confusion, so first I will explain the methodology I used for measuring, then I will present the results of three important tests.

In all benchmarks I used the same equipment: an iMac (OS X 10.9.4, 3.4GHz i7, 16GB 1333MHz DDR3 RAM). I drove the CPU utilization to 90% across all cores by running the same test concurrently in as many node.js processes as needed. In addition, I added as many Redis servers as required to prevent Redis from becoming a bottleneck. I defined the number of operations to be performed and measured the time it took to perform them. 'Throughput' is the number of operations divided by the elapsed time. 'Cost' is the elapsed time divided by the number of operations. In the notes below I use 'K' to denote thousands and 'M' to denote millions.

Reject Events Test

In this test I defined a ruleset with 30 rules and continuously asserted event messages which did not match any rule. I ran several experiments with different event message sizes, using two different modes: single and batched events. The objective was to measure the raw JSON parsing and tree expression evaluation speed. Just as a reference, not as a baseline, I ran the same experiments against the JavaScript JSON.parse function.


In batch mode durable rules is able to process a little less than 10M small messages (30B) every second; in single message mode it is 25% slower (7.5M messages every second). In both batch and single mode durable rules is able to process 1M larger messages (1KB) every second. Important note: the complexity of the algorithm is linear (O(n)) in the size of the event message. Interesting factoid: parsing the same messages with JavaScript's JSON.parse is also linear, but every byte takes more cycles to parse; it can parse 10M small messages every second but only 0.25M large messages every second.

Accept Events Test

Again I defined a ruleset with 30 rules and continuously asserted event messages. This time all event messages matched a rule in the set and led to queueing an action. I ran experiments with different message sizes and tried two modes: single and batch. The objective was to measure Redis caching and the beta join calculation done in Redis scripts.


In batch mode durable rules is able to process 250K small event messages (50B) every second, while it can process 60K single small event messages every second. It can process 120K and 40K large event messages (1KB) every second in batch and single mode respectively. Again, the algorithm is linear (O(n)) in the size of the event message.

Rete Cycle Test

Finally I tested the full Rete cycle. I defined a ruleset with 30 rules, but in this case not only were the event messages accepted, but the actions were dequeued, the messages removed and the new associated state re-asserted. All of this was done while keeping the system's consistency guarantees.


In batch mode durable rules can run 100K small event messages (50B) per second through the Rete cycle, while in single mode it can run 18K. In the case of larger event messages (1KB), the results are 65K and 15K per second in batch and single mode respectively. The algorithm is linear (O(n)) in the size of the event message.

In conclusion, the new C implementation provides a significant performance boost over my previous JavaScript based implementation. The new batch\streaming feature allows for flowing a lot more data through the system, making more efficient use of the Redis cache.

To learn more, please visit

Rete Meets Redis

In my previous post I briefly touched on the subject of rules and the work I published in NPM as 'durable', whose source is available at In this post I explore in detail how to use the Rete algorithm and the Redis server as a powerful combination for processing events in real time. A little bit of background first: the Rete algorithm was created a few decades ago and has been used in inference based Expert Systems for efficient execution of business rules. Redis is a data-structure server, currently used in a large variety of scalable applications.

Ruleset Definition

A ruleset is a set of conditions followed by actions. In ‘durable’ I use rulesets to express conditions about events, that is, things that happen. Let’s illustrate this with a very simple and downright simplistic example. Imagine we are rating the effectiveness of a web page, and we establish the following rules:

  1. As soon as a visitor hits our web page, we will start a timer of 5 minutes (assuming the visitor is now reading).
  2. When our visitor clicks on a link within the 5 minute timeout, we are going to increase the page score (yes the page worked!)
  3. When the timeout expires and our visitor did not click on any link, we are going to decrease the page score (the page was abandoned)

I now translate the rules above into the simple JSON abstraction used in 'durable' for expressing rules. Don't worry if you don't understand all the details; as long as you see how it relates to the description above, you should be able to follow. If you are curious about the rules definition abstraction, please visit

rating: {
    r1: {
        when: { event: 'start' },
        run: function(s) {
            s.state = 'reading';
            s.startTimer('timeout', 300);
        }
    },
    r2: {
        whenAll: {
            $m: { event: 'click' },
            $s: { state: 'reading' }
        },
        run: function(s) { increaseScore(); }
    },
    r3: {
        whenAll: {
            $m: { $t: 'timeout' },
            $s: { state: 'reading' }
        },
        run: function(s) { decreaseScore(); }
    }
}

The Rete Tree

When running the ruleset above, the first thing the ‘durable’ runtime is going to do is create a Rete tree as shown below:

Rating Ruleset Rete Tree

In a Rete tree there are three different kinds of nodes:

  1. The solid line circles in the diagram above are called alpha nodes. They have one input and can have several outputs. Their main task is routing data into beta or action nodes. Each is associated with an assertion about the input; when the assertion is met, the data is pushed into all child nodes.
  2. The dashed line circles are known as beta nodes. They have two or more inputs and one output. They are in charge of doing join operations, that is, pushing data into action nodes when all input criteria have been met.
  3. The squares represent actions. Actions do real work (such as sending email or triggering other events); they can change the state of the tree, negating previous state and asserting new state.

The two top level alpha nodes represent the type of assertions that can be made. In the case of durable there are always two top level nodes: one for event messages and another for user state changes.

Forward Chaining

Now it is time to consider how an event sequence works. Please pay special attention to step 2.iv. Remembering facts about a rule's antecedent state is the key to high performance rule evaluation. When the event in step 3 happens, the system doesn't need to reevaluate the entire ruleset, and it does not need to deserialize the user state.

1. Let’s start by posting the following message:

{
    id: 1,
    event: 'start'
}
i. The message comes through the REST handler and is posted to the ruleset’s Rete tree.
ii. The message is pushed to the alpha node `type = ‘$m’`. The condition is matched, so it is now pushed to the next alpha node.
iii. The message meets the condition `$m.event = ‘start’`. It is then pushed to the action node.
iv. An action item is added to the ruleset's action queue. The REST handler response is sent back to the user at this point.

2. The action is picked up by the background dispatch loop:

i. The action is executed, the state is changed to `s.state = ‘reading’`.
ii. The new state is pushed to the alpha node `type = ‘$s’`. The condition is matched. The state is pushed to the next alpha node.
iii. The state meets the condition `s.state = ‘reading’`, so it is pushed to the beta node.
iv. The ‘all’ join in the beta node is not complete yet. This input is remembered (this fact is stored in memory).

3. Now we assume the user clicked on a link in the page and as a result the following message is posted:

{
    id: 2,
    event: 'click'
}
i. The request comes through the REST handler and is posted to the ruleset’s Rete tree.
ii. The message is pushed to the alpha node `type = ‘$m’`, which pushes the message to the next alpha node.
iii. The message meets the condition `$m.event = ‘click’` and is pushed to the beta node.
iv. The ‘all’ join is now satisfied, so the message is pushed to the action node.
v. Now an action is added to the ruleset's action queue. The REST handler response is sent back to the user at this point.

4. The action is picked up by the background dispatch loop.

i. The all join state is deleted.
ii. The action is executed.

Tree State in Redis

In the previous section you might have noticed I glossed over the Rete tree memory management details (points 2.iv and 3.iv). In order to scale out, that is, to be able to use several cores to evaluate the ruleset, it is necessary to store the tree state in a cache external to the processes. Managing the beta node state needs careful consideration. Let's look at a first approach to handling steps 2.iv and 3.iv above:

Get the beta node state from the cache
If the 'all' join is not complete
    Store the fact that one beta node side is complete
Else
    Delete the beta node state from the cache
    Add an item to the action queue

The pseudocode above has two fatal flaws: 1. a race condition when checking and storing the beta node state can lead to losing event messages; 2. a failure after deleting the beta node state can lead to dropping the action. To fix both problems we need to use a lock, and the code needs to be idempotent so it can be retried in case of failure.

Acquire a lock with a timeout (in case of failure)
Get the beta node state from the cache
If the 'all' join is not complete
    Store the fact that one beta node side is complete
Else
    Add an item to the action queue if it has not been added
    Delete the beta node state from the cache
Release the lock

Now the code above requires up to 5 cache accesses. In addition, recovering from a failure is delayed by waiting for the lock to time out. In order to achieve real-time event processing performance, only one memory cache access can be afforded for each relevant event (I came to this conclusion after studying the performance profiles from my benchmarks). There has to be a better way of doing this.

Redis offers features such as sorted sets, single threaded multi command execution and Lua scripting, which can be used to address the problem above efficiently. The forward chaining algorithm has to be modified for efficient memory management and join evaluation. Specifically: 1. push the beta node state all the way down to the action nodes; 2. the action nodes are now in charge of storing the information in Redis, evaluating upstream joins and enqueuing actions in a single multi execute sequence. To illustrate this point, for the example above, the following sets are created to store the join information:

  • rating.r1.$m
  • rating.r2.$s.$s
  • rating.r2.$m.$m
  • rating.r3.$s.$s
  • rating.r3.$m.$m

The join evaluation works as follows:

Multi execute sequence for step 2.iv.
    Add an entry identifying the user state to rating.r2.$s.$s.
    Join by testing if rating.r2.$s.$s and rating.r2.$m.$m have an entry.
Multi execute sequence for step 3.iv.
    Add an entry identifying the event message to rating.r2.$m.$m.
    Join by testing if rating.r2.$s.$s and rating.r2.$m.$m have an entry.
    Add action entry to the action queue.
    Remove the entries in rating.r2.$m.$m and rating.r2.$s.$s.

Please note the multi execute sequence runs as a single threaded set of instructions in the Redis server. The use of sets ensures the idempotency of the operations (they can be retried in case of failure).
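
A minimal sketch of step 3.iv written as an atomic Redis script, here with redis-py and a Lua body; the key names mirror the sets above, but this is not the actual 'durable' script:

    import redis

    r = redis.Redis(unix_socket_path='/tmp/redis.sock')

    join_step = r.register_script("""
    redis.call('sadd', KEYS[1], ARGV[1])         -- record the event message entry
    if redis.call('scard', KEYS[1]) > 0 and
       redis.call('scard', KEYS[2]) > 0 then     -- both sides of the 'all' join present
        redis.call('rpush', KEYS[3], ARGV[1])    -- add the action entry to the queue
        redis.call('del', KEYS[1], KEYS[2])      -- remove the join entries
    end
    return redis.status_reply('OK')
    """)

    join_step(keys=['rating.r2.$m.$m', 'rating.r2.$s.$s', 'rating.actions'],
              args=['2'])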

Inference: From Expert Systems to Cloud Scale Event Processing

For a few years now I have been interested in event processing: tracking and analyzing information about things that happen. Not only because time and again I have stumbled upon problems that can easily be modeled with events, but because I have come to believe that proper event modeling, just like data modeling, is one of the fundamental pillars of distributed application design and development. E-commerce, supply-chain management, financial, social and health-care applications are some of the domains where event processing is interesting.

There are a number of high end commercial products specifically tailored for event processing. Despite very strong theoretical underpinnings, in my opinion, many of them suffer from fundamental problems:

  • Complex abstraction: Very powerful abstractions can cover a large number of cases. The learning curve, however, is steep. In addition the abstraction is not well integrated with most popular languages and tooling.
  • Type system impedance: Event rules and event data are defined in a custom type system, while user state is typically handled separately using relational models. As a result, applications spend a large number of cycles transforming data.
  • Events and State storage: Because of the type system impedance, events and state are often stored separately, which leads to increased network access and complex distributed coordination.
  • Difficult to manage: Monolithic servers cannot readily be deployed to the cloud, as they require special network and management arrangements.

First Round

About a year ago, given the rapid developments in cloud technologies and the trend towards enterprise software democratization, I decided to invest my personal time in researching this area and sharing the results with anyone interested. After a few months of intense creativity, learning and coding, I published the results to NPM under the name 'durable' version 0.0.x (git repository:

Important learnings to report:

  • The best place for writing code is the airplane, preferably on an overseas roundtrip to\from Shanghai. Indeed: no email, no messages, no phone, 12-14 hours of uninterrupted coding! An unsurpassable remedy for jet-lag when flying out of Seattle around noon. Getting back to Seattle at 8AM is a little rough, given you need to start the day after staring at the computer for more than 10 hours.
  • A new exciting project is a perfect way to justify to your partner the purchase of new equipment (in my case a beautiful MacBook pro). I must say: retina display has kept me motivated, as I look forward to working on my project at least a few minutes every day just because the editor, the browser and the terminal look so clear and beautiful.

On a more serious note: From the beginning of the project I established a few core principles worth enumerating and carrying along in future iterations.

  • JSON type system: Event information and user state are defined, stored and queried using the JSON type system.
  • Storage: Events and user state are stored in the same place. This eliminates unnecessary network access and distributed consistency coordination.
  • REST API: Events are raised through a REST API. This allows for easy integration with different event sources.
  • Fault tolerance: Reactions to events are guaranteed to be executed at least once. The tradeoff is that reactive functions need to be idempotent.
  • Cloud ready: Leverage technologies that can easily be hosted and managed in the cloud: Node.js, D3.js and MongoDB in this case.

I was very happy with the results of my work for about a week. Then disaster! The code got old, it started to rot and had to be rewritten (yes, I have worked as a manager for a long time, but I'm still a developer and I can't help this). A couple of fundamental problems made me feel uncomfortable with my recent work:

  • Meta-linguistic abstraction: It had many concepts, which made it complex. It relied heavily on the 'Promise' concept, which did not compose well with the Node.js ecosystem. It inherited the MongoDB expression language with its faults and limitations; for example, a logical 'and' was defined using a JSON object, which implies no ordering.
  • Message brokering: Events were stored as messages in a queue. A background daemon periodically fetched and correlated messages with user state. As a result I could not bring the performance of the system quite to where I wanted it.

Note: I will describe the benchmark in future posts; suffice it to say that on a quad core, 16GB iMac I got 700 event-action-event cycles per second.

Second Round

In September of last year, I had heard a number of people allude to the importance and relevance of business rules. I was familiar with this domain, but I had always considered it a solution for a very specific problem. With some skepticism, I spent a few days reading documents on inference, forward chaining and the Rete algorithm. It occurred to me that inference could improve the performance of 'durable' by at least an order of magnitude. So I decided to start a new code iteration by implementing a Rete algorithm that could scale out simply by adding commodity hardware. The published papers on forward chaining only consider an in-memory, single-process environment, without problems of consistency or the cost of memory access. So my main area of focus became the use of cache technology for storing inference state. In the end I decided to use Redis because it is fast, it offers powerful data structures (hashsets and ordered hashsets) and server side scripting and, believe it or not, its single threaded model is great for handling concurrency and consistency.

On top of the basic principles I had established in the first iteration (JSON, storage, REST API, Fault Tolerance and Cloud Ready), I adopted 4 new principles:

  • Meta-linguistic abstraction: A reduced set of concepts provides an intuitive, minimalistic abstraction to ease the learning curve.
  • Rules: The basic building block of the system is the ‘Rule’ which is composed of an antecedent (expression) and a consequent (action). By allowing expressing rules over incoming events as well as user state, more complex constructs such as statecharts and flowcharts can be supported, because they can be reduced to a set of rules.
  • Forward chaining: Allows for quick event evaluation without the need to recompute all expressions or read and deserialize user state. This is the key to a significant performance boost over my previous implementation based on message brokering.
  • Caching: Inference state is stored in an out-of-proc cache, which allows scaling rule evaluation out beyond a single process using commodity hardware.

I just published my work to NPM under the name 'durable' version 0.10.x (git repository: So far I'm still happy with my work, but a few questions remain unanswered:

  • Can I improve the performance of the system even more by reimplementing it in C and cutting the cost of event marshaling to Redis?
  • Or will the next performance boost come from moving rule evaluation closer to the memory location?
  • Can I leverage a C implementation to support meta-linguistic abstractions for Python and Ruby?
  • Now that I’m not traveling to Shanghai as often, how can I find 12 hours of uninterrupted coding time?

Note: After more than a month of perf tuning, using the same benchmark as in the first round, on a quad core, 16GB iMac I got 7500 event-action-event cycles per second.

