Transaction In Doubt

My notes on Software Engineering

Tag: transactions

Boosting Performance with C

Right after posting my last blog entry I began work on Durable Rules third iteration. My main question was: How can I improve the framework performance by an order of magnitude across all scenarios? As you can see it has been quite a long time since my last blog post. Well… I have been very busy finding an answer for the question above.

Early on, I decided to re-implement the core rules engine in C. Compared with scripting and memory managed languages, C provides good control over memory allocation and a lot flexibility on data structure definition. Obviously, control and flexibility come at a price: It took me a lot longer to develop the code.

So, in the end, this exercise was not just simply porting code from one language to another. I had to rethink and design the data structures. I also had to design and write a JSon parser. And I must say: I also took some time to invent a new feature to enable batch processing for streaming. The detailed explanation is in the following sections.

Data Structure Definition

The most important part of the project was to define a suitable data structure to represent a ruleset. The central principle: the system has to guarantee O(n) (linear) performance as a function of the input (event) size. To illustrate how this is achieved, let’s consider the simple rule definition below:

var d = require('durable');{
    rating: {
        track: {
            whenAll: {
                a: { event: ’start’ },
                b: { event: ‘end’ }
            run: function(s) { }

The code above waits for two events ‘start’ and ‘end’ to execute the ‘track’ action. Conceptually this rule is represented by a Rete tree, which root is an alpha node for message based events (type = $m). This node is followed by two alpha nodes with two rules event = ’start’ and event = ‘end’ respectively. The alpha nodes are joined by the ‘all’ beta node, which is followed by the track action node.


In my C implementation the Rete tree is a trie. Every time an event is observed, each event message property is evaluated in the trie. The conceptual counterpart of a trie node is an alpha node in the Rete tree. Thus each trie node contains the rule to be applied to a single named property. Also each trie node points to subsequent properties to be evaluated (typically derived from an ‘and’ expression). The leafs of the trie point to a linked list of beta nodes terminated by an action node.


Very important: the trie node keeps the map of subsequent properties in a hash table. This is a slight deviation over typical trie nodes used in text context search, which keep a map of subsequent valid characters using an array. I decided to handle collisions in the hash table using a simple linear probing algorithm. With linear probing I get better memory usage and acceptable predictable near constant time performance as long as the load in the tables is kept low.

In order to increase memory locality and make better use of the processor cache, the trie structure is packaged in two arrays. One array has all the alpha, beta and action nodes. While a second array contains all the hash table buckets for the trie nodes. Nodes and hash tables never hold pointers to other nodes, only array offsets.


JSon parsing

To reduce the cost of data transformation I decided to use JSon as the event type system for this new Rules engine. Thus a JSon object could be passed directly from a Web Request input stream without any extra processing. C doesn’t have built in facilities for JSon parsing. So I wrote a lightweight JSon parser following a few strict principles:

  • Avoid Object Model (DOM) definition and buffering
  • Avoid using the heap, only use stack memory
  • Optimize for parsing key value pairs in a single pass
  • Calculate property name hashes while parsing

The JSon parser is really just a simple state machine tailored for the specific use in the rules engine. An important aside: I chose the DJB algorithm to hash property names because it is known to have good distribution and it added minimal overhead when parsing. The property name hash codes, as explained in the section above, are critical for the Rete tree ultra fast evaluation.


Compared with events which don’t match any rule, single positive event evaluation is expensive because it needs to be recorded in the Redis cache. In some cases it triggers the evaluation of a beta join and in other cases the queueing of an action. Batching helps optimizing the cost of all this activity and allows for processing large streams of data. Let’s consider the following snippet of code:{
    approval: {
        rule: {
            whenSome: { $and: [ { subject: 'approve’ }, { $lte: { amount: 1000 }}]},
            run: function(s) {}
}, '', null, function(host) {
    host.postBatch('approval', [{ id: '0', sid: 1, subject: 'approve', amount: 100 }, 
                                { id: '1', sid: 1, subject: 'approve', amount: 100 },
                                { id: '2', sid: 1, subject: 'approve', amount: 100 },
                                { id: '3', sid: 1, subject: 'approve', amount: 100 },
                                { id: '4', sid: 1, subject: 'approve', amount: 100 }]);

The ‘rule’ action will be able to process at least one event which matches the expression at the time of evaluation. The postBatch function allows clients to post an array of event messages for evaluation and dispatching.


In order not to lose track of my main objective I constantly measured performance when developing the project. When talking about performance there is always a lot of confusion. So first I will explain the methodology I used for measuring, then I will present the results for three important tests.

In all benchmarks: I used the same equipment: IMac OS X 10.9.4, 3.4GHz i7, 16GB RAM 1333MGHz DDR3. I drove the CPU utilization to 90% across all cores by running the same test concurrently in as many node.js processes as needed. In addition I added as many Redis servers as required to prevent it from becoming a bottleneck. I defined the number of operations to be performed and I measured the time it took to perform them.’Throughput’ is: the number of operations divided by the time it took to perform them. ‘Cost’ is: elapsed time divided by the number of operations. In the notes below I use the unit ‘K’ to denote thousand and ‘M’ to denote million.

Reject Events Test

In this test I defined a ruleset with 30 rules and continuously asserted event messages, which did not match any rule. I ran several experiments with different event message sizes. I used two different modes single and batch events.  The objective was to measure the raw JSon parsing and the tree expression evaluation speed. Just as a reference, not as baseline, I ran the same experiments for the JScript JSON.Parse function.


In batch mode durable rules is able to process a little less than 10M small messages (30B) every second, in single message mode it is 25% slower (7.5M messages every second). In both batch and single mode durable rules is able to process 1M larger messages (1KB) every second. Important note: the complexity of the algorithm is linear (O(n)) to the size of event messages. Interesting factoid: parsing the same messages in JScript (JSON.Parse) is also linear, but every byte takes more cycles to parse, indeed it can parse 10M small messages every second but only 0.25M large messages every second.

Accept Events Test

Again I defined a ruleset with 30 rules and continuously asserted event messages. This time all event messages matched a rule in the set and lead to queueing an action. I ran experiments with different message sizes and I tried two modes: single and batch. The objective was to measure redis caching and beta join calculation done in Redis scripts.


In batch mode durable rules is able to process 250K small event messages (50B) every second, while it can process 60K single small event messages every second. It can process 120K and 40K large event messages (1KB) every second in batch and single mode respectively. Again, the algorithm is linear (O(n)) to the size of the event message.

Rete Cycle Test

Finally I tested the full Rete cycle. I defined a 30 rules ruleset, but in this case not only were the event messages accepted, but the actions dequeued, the messages removed and the new associated state re-asserted. All of this was done while keeping the system consistency guarantees.


In batch mode durable rules can run 100K small event messages (50B) through the Rete cycle, while in single mode it can run 18K. In the case of larger event messages (1KB), the results are 15K and 65K in single and batch mode respectively. The algorithm is linear (O(n)) to the size of the event message.

In conclusion the new C implementation provides a significant performance boost over my previous JScript based implementation. The new batch\streaming feature allows for flowing a lot more data through the system making a more efficient use of the Redis cache.

To learn more, please visit:

Inference: From Expert Systems to Cloud Scale Event Processing

For a few years now I have been interested in event processing, tracking and analyzing information about things that happen. Not only because time and again I have stumbled upon problems that can easily be modeled with events, but because I have come to believe proper event modeling, just as data modeling, is one of the fundamental pillars of distributed application design and development.  E-commerce, supply-chain-management, financial, social and health-care applications, are some of the domains where event processing is interesting.

There are a number of high end commercial products specifically tailored for event processing. Despite very strong theoretical underpinnings, in my opinion, many of them suffer from fundamental problems:

  • Complex abstraction: Very powerful abstractions can cover a large number of cases. The learning curve, however, is steep. In addition the abstraction is not well integrated with most popular languages and tooling.
  • Type system impedance: Event rules and event data are defined in a custom type system. User state typically handled separately using relational models. As a result applications spend a large amount of cycles transforming data.
  • Events and State storage: Because of the type system impedance, events and state are often stored separately, which leads to increased network access and complex distributed coordination.
  • Difficult to manage: Monolithic servers cannot readily be deployed to the cloud, as they require special network and management arrangements.

First Round

About a year ago, given the rapid developments in cloud technologies and the trend towards enterprise software democratization. I decided to invest my personal time in researching this area and sharing the results with anyone interested. After a few months of intense creativity, learning, and coding I published the results in NPM under the name ‘durable’ version 0.0.x, git repository

Important learnings to report:

  • The best place for writing code is the airplane, preferable overseas roundtrip to\from Shanghai. Indeed: no email, no messages, no phone, 12-14 hours of uninterrupted coding! Unsurpassable remedy for the jet-lag when flying out of Seattle around noon. Getting back to Seattle at 8AM is a little rough, given you need to start the day after staring at the computer for more than 10 hours.
  • A new exciting project is a perfect way to justify to your partner the purchase of new equipment (in my case a beautiful MacBook pro). I must say: retina display has kept me motivated, as I look forward to working on my project at least a few minutes every day just because the editor, the browser and the terminal look so clear and beautiful.

On a more serious note: From the beginning of the project I established a few core principles worth enumerating and carrying along in future iterations.

  • JSON type system: Event information and user state are defined stored and queried using the JSON type system.
  • Storage: Events and user state are stored in the same place. This eliminates unnecessary network access and distributed consistency coordination.
  • REST API: Events are raised through a REST API. This allows for easy integration with different event sources.
  • Fault tolerance: Reactions to events are guaranteed to be executed at least once. The tradeoff is reactive functions need to be idempotent.
  • Cloud ready: Leverage technologies that can easily be hosted and managed in the cloud. Node.js, D3.js and MongoDb in this case.

I was very happy with the results of my work for about a week. Then disaster! The code got old, it started to rot and had to be rewritten (yes I have worked as a manager for a long time, however I’m still a developer and I can’t help this). A couple of fundamental problems made me feel uncomfortable with my recent work:

  • Meta-linguistic abstraction: It had many concepts, which made it complex. It heavily relied on the ‘Promise’ concept, which did not compose well with Node.js ecosystem. It inherited the MongoDb expression language with its faults and limitations. For example, a logical ‘and’ was defined using a JSON object, which implies no ordering.
  • Message brokering:  Events were stored as messages in a queue. A background daemon periodically fetched and correlated messages with user state. As a result I could not bring the performance of the system quite where I wanted.

Note: I will describe the benchmark in future posts, suffice it to say: in a quad core, 16 GB IMac I got 700 event-action-event cycles/sec

Second Round

September last year, I had heard a number of people allude to the importance and relevance of business rules. I was familiar with this domain, but I had always considered it a solution for a very specific problem. With some skepticism, I spent a few days reading documents on inference, forward chaining and Rete algorithms. It occurred to me that inference could help improve the performance of ‘durable’ at least by an order of magnitude. So I decided to start a new code iteration by implementing a Rete algorithm, which could scale out simply by adding commodity hardware. The published papers on forward chaining only consider an in memory single process environment, without problems of consistency nor cost of memory access. So my main area of focus became the use of  cache technology for storing inference state. At the end I decided to use Redis because it is fast, it offers powerful data structures (hashsets and ordered hashsets), server side scripting and, believe it or not, its single threaded model is great for handling concurrency and consistency.

On top of the basic principles I had established in the first iteration (JSON, storage, REST API, Fault Tolerance and Cloud Ready), I adopted 4 new principles:

  • Meta-linguistic abstraction: A reduced set of concepts provides an intuitive, minimalistic abstraction to ease the learning curve.
  • Rules: The basic building block of the system is the ‘Rule’ which is composed of an antecedent (expression) and a consequent (action). By allowing expressing rules over incoming events as well as user state, more complex constructs such as statecharts and flowcharts can be supported, because they can be reduced to a set of rules.
  • Forward chaining: Allows for quick event evaluation without the need for recomputing all expressions nor reading and deserializing user state. This the key for a significant performance boost over my previous implementation based on message brokering.
  • Caching: Inference state is stored in a out of proc cache, which allows scaling out rule evaluation beyond a single process using commodity hardware.

I just published my work to NPM under the name ‘durable’ version 0.10.x, git repository So far I’m still happy with my work, but a few questions remain unanswered:

  • Can I improve the performance of the system even more by reimplementing it in C and cutting the cost of event marshaling to Redis?
  • Or will the next performance boost come from moving rule evaluation closer to the memory location?
  • Can I leverage a C implementation to support meta-linguistic abstractions for Python and Ruby?
  • Now that I’m not traveling to Shanghai as often, how can I find 12 hours of uninterrupted coding time?

Note: After more than a month of perf tuning, using the same benchmark as in First Round, with a quad core, 16 GB IMac I got 7500 event-action-event cycles/sec.