Transaction In Doubt

My notes on Software Engineering

Tag: Python

Miss Manners and Waltzdb

While durable_rules is a framework for complex event coordination, its linguistic abstraction is also suitable for solving business rules problems such as constraint propagation. At its core, durable_rules is powered by a variation of the Rete algorithm. Measuring and driving its performance using well established Business Rules industry benchmarks has been extremely valuable: it has brought the quality of the implementation to a new level, improving performance by orders of magnitude and making the system more robust.

The most popular industry benchmarks for production Business Rules systems are “Miss Manners” and Waltzdb. Designed more than 25 years ago, they have been the subject of endless debate. Both their direct applicability to customer problems and the features platform vendors have added specifically to improve their numbers have been questioned. Even so, we still use such benchmarks to understand the performance of Business Rules systems. Manners and Waltzdb are complementary: while Manners tests the speed of combinatorics evaluation, Waltzdb tests the efficiency of action scheduling and evaluation.

Just as Thomas Edison once famously said about genius, performance improvements too are one percent inspiration and ninety-nine percent perspiration. Because durable_rules relies on Redis to store inference state, all performance improvements derived from optimizing how Redis is used. For each experiment I reset the slowlog, executed the test, reviewed the slowlog results and implemented an improvement (sometimes a regression 😦 ). I repeated this over and over. In my first experiments I set the slowlog threshold to 1 second; as I implemented effective improvements I reduced the threshold, and by the end I was only reviewing the commands that took more than 1 millisecond.

All performance measurements were done using Redis 3.0 over unix sockets, with Redis disk save disabled, on an iMac (4GHz i7, 32GB 1600MHz DDR3, 1.12TB Fusion Drive).

Principles

Consider the following tests plotted in the chart below:

  1. Blue: A set of sequential hget operations executed from a node.js client in a tight loop. The throughput for retrieving a 200 byte value is about 25K operations per second; as the value size increases, the throughput decreases to 11K operations for 20KB values.
  2. Green: A set of 100 hget operations batched in a multi-exec command. The throughput for a 200 byte value is around 160K operations per second, degrading to 18K when the value is 20KB.
  3. Yellow: A set of hget operations inside a Lua script. The throughput for 200 bytes jumps all the way up to 3M operations per second, degrading to 458K for 20KB values.
  4. Red: A set of cmsg unpack operations inside a Lua script. The throughput for 200 bytes is 9M operations per second, degrading to 1.1M for 20KB values.
  5. Orange: A set of Lua table get operations inside a Lua script. The throughput remains constant at 63M operations per second.
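
The first three experiments can be sketched with redis-py. This is a minimal, illustrative harness, not the original test code: the key, field, operation count and socket path are all assumptions.

```python
# A sketch of experiments 1-3, assuming a local Redis reachable through
# the redis-py package; key, field and socket path are illustrative.
import time

# hget issued repeatedly from inside a Lua script: no client round trips
LUA_HGET_LOOP = """
local total = 0
for i = 1, tonumber(ARGV[1]) do
    local v = redis.call('hget', KEYS[1], ARGV[2])
    total = total + #v
end
return total
"""

def bench(label, fn, ops):
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    print('%s: %.0f ops/sec' % (label, ops / elapsed))

def run(r, ops=10000, value_size=200):
    r.hset('bench', 'f', 'x' * value_size)
    # 1. sequential hget, one round trip per operation
    bench('sequential', lambda: [r.hget('bench', 'f') for _ in range(ops)], ops)
    # 2. the same operations batched in a single MULTI/EXEC pipeline
    def pipelined():
        p = r.pipeline(transaction=True)
        for _ in range(ops):
            p.hget('bench', 'f')
        p.execute()
    bench('pipeline', pipelined, ops)
    # 3. the same workload pushed into the server as one Lua script call
    bench('lua', lambda: r.register_script(LUA_HGET_LOOP)(keys=['bench'],
                                                          args=[ops, 'f']), ops)

# usage (requires a running Redis and the redis-py package):
#   import redis
#   run(redis.Redis(unix_socket_path='/tmp/redis.sock'))
```

The point of the harness is only to make the orders of magnitude visible: each step removes round trips or marshaling from the hot path.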

redis

What matters in the experiment above is the order of magnitude of each operation. For the Manners and Waltzdb benchmarks, where hundreds of thousands of combinatorics need to be evaluated in a few seconds, reducing sequential and batched access from Redis clients is critical but not enough. Access to Redis structures from Lua, as well as packing and unpacking using cmsg, can also become a bottleneck. Thus, all performance optimizations followed these principles:

  1. Push the combinatorics evaluation as close to the data as possible leveraging Lua scripts.
  2. Design chunky Lua script function signatures batching operations when possible.
  3. Avoid fetching the same information multiple times from Redis data-structures by using Lua local variables.
  4. Be careful with the shape of the data stored, avoid excessive cmsg usage.

Miss Manners

Miss Manners has decided to throw a party. She wants her guests not to get bored, so she needs to sit them such that adjacent guests are always of opposite sex and share at least one hobby. The problem needs to be resolved for 8, 16, 32 and 128 guests. 

mannerstable

The ruleset for the algorithm is relatively simple, consisting of only seven rules (Ruby, Python, JScript). Guest hobbies are expressed using facts, so the input datasets consist of 20, 40, 83 and 439 facts for 8, 16, 32 and 128 guests respectively. The declarative algorithm conceptually (not practically) creates the combinatorics tree for all the guests and selects the first valid solution via depth-first search.
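
Stripped of the rules engine, the underlying search can be sketched in a few lines of plain Python. The guest data below is made up, and durable_rules of course derives the search from the rule antecedents rather than coding it directly:

```python
# A toy, non-rules illustration of the Manners search: depth-first
# search for the first circular seating where adjacent guests alternate
# sex and share at least one hobby. Guests are (name, sex, hobbies).
def find_seating(guests):
    def ok(a, b):
        # opposite sex and a non-empty hobby intersection
        return a[1] != b[1] and a[2] & b[2]

    def dfs(path, remaining):
        if not remaining:
            # close the circle: the last guest must also fit next to the first
            return path if ok(path[-1], path[0]) else None
        for i, g in enumerate(remaining):
            if ok(path[-1], g):
                found = dfs(path + [g], remaining[:i] + remaining[i + 1:])
                if found:
                    return found
        return None

    return dfs([guests[0]], guests[1:])

guests = [('a', 'm', {'chess'}),
          ('b', 'f', {'chess', 'golf'}),
          ('c', 'm', {'golf', 'chess'}),
          ('d', 'f', {'golf', 'chess'})]
print([name for name, _, _ in find_seating(guests)])  # ['a', 'b', 'c', 'd']
```
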

Learnings

The antecedent in the rule below (expressed in Python) is the heart of the algorithm, as it determines valid solutions. Consider the case of 128 seatings with 439 guest assertions: by the end of the test execution the first 3 terms of the expression will have generated hundreds of thousands of combinatorics, and the next two terms will have filtered out the invalid ones. Recalculating the full expression for all terms for all combinatorics every time a new seating is asserted slows down execution significantly. I addressed the problem with the following three improvements:


    @when_all(c.seating << (m.t == 'seating') & 
                           (m.path == True),
              c.right_guest << (m.t == 'guest') & 
                               (m.name == c.seating.right_guest_name),
              c.left_guest << (m.t == 'guest') & 
                              (m.sex != c.right_guest.sex) & 
                              (m.hobby == c.right_guest.hobby),
              none((m.t == 'path') & 
                   (m.p_id == c.seating.s_id) & 
                   (m.guest_name == c.left_guest.name)),
              none((m.t == 'chosen') & 
                   (m.c_id == c.seating.s_id) & 
                   (m.guest_name == c.left_guest.name) & 
                   (m.hobby == c.right_guest.hobby)),
              (m.t == 'context') & (m.l == 'assign'))
    def find_seating(c):
        # ... action body elided ...

  1. When evaluating combinatorics and resolving action scheduling conflicts, the rules engine has to choose the most recent result to enable the depth-first search for a valid solution. This recency-based resolution policy has been a de-facto standard since the early days of Business Rules.
  2. For a given rule, to avoid scanning and evaluating all expression lvalues (the inference list) for every assertion, the inference results are stored in a hash-set indexed by the combined result of all equality expressions. Thus the combinatorics calculation becomes linear in the number of assertions done on the rule.
  3. When evaluating a fact through a complex expression, the engine might compare it with the same target fact multiple times; caching target facts in Lua local variables reduces the number of Redis data-structure accesses.
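
As a rough illustration of the second improvement (plain Python with made-up guest data, not the engine's Lua implementation), indexing the inference results by the equality terms turns a full scan into a single hash-bucket probe:

```python
# Toy illustration: instead of scanning every guest for each new
# seating, index guests by the equality term of the join (hobby), so
# each assertion probes one hash bucket instead of the full list.
from collections import defaultdict

guests = [
    {'name': 'n1', 'sex': 'm', 'hobby': 'chess'},
    {'name': 'n2', 'sex': 'f', 'hobby': 'chess'},
    {'name': 'n3', 'sex': 'f', 'hobby': 'golf'},
]

# linear scan: O(len(guests)) work per seating assertion
def scan_matches(right_guest):
    return [g for g in guests
            if g['sex'] != right_guest['sex']
            and g['hobby'] == right_guest['hobby']]

# hash-set index keyed by the equality expression (hobby); the
# inequality term is checked inside the much smaller bucket
index = defaultdict(list)
for g in guests:
    index[g['hobby']].append(g)

def indexed_matches(right_guest):
    return [g for g in index[right_guest['hobby']]
            if g['sex'] != right_guest['sex']]

right = {'name': 'n1', 'sex': 'm', 'hobby': 'chess'}
print([g['name'] for g in scan_matches(right)])     # ['n2']
print([g['name'] for g in indexed_matches(right)])  # ['n2']
```

Both paths return the same matches; only the amount of work per assertion changes.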

Results

JavaScript is the fastest, solving the problem for 128 guests in about 1.7 seconds; on the flip side its memory usage is the highest, reaching 46MB including Redis memory usage. Python can still solve the 128 guest problem in under 2 seconds, with a remarkably efficient memory usage of 26MB including Redis memory. Ruby is not far behind, at a little over 2 seconds and a good memory usage of 35MB including Redis.

manners

As shown in the graph below, benchmark execution time is directly correlated with the number of Redis commands executed. In the case of 128 guests, more than 400K commands were executed. For JavaScript, this is a throughput of 225K Redis commands per second.

mannerscommands

Waltzdb

Waltz is an image recognition algorithm invented by Dr. David Waltz in 1972. Given a set of lines in a 2D space, the computer needs to interpret the 3D depth of the image. The first part of the algorithm consists of identifying four types of junctions: Line, Arrow, Fork, Tee. Junctions have 16 different possible labelings following Huffman-Clowes notation, where + is convex, – is concave and an arrow is occluding. Valid junctions and labelings:

waltzjunctions

It is important to point out that pairs of adjacent junctions constrain each other’s edge labeling. So, after choosing the labeling for an initial junction, the second part of the algorithm iterates through the graph, propagating the labeling constraints by removing inconsistent labels. Example: labeling a cube with 9 lines:

waltzcube

The ruleset for the algorithm consists of 34 rules (Ruby, Python, JScript). In the Waltzdb benchmark, the dataset provided is a set of cubes organized in regions. Each region consists of 4 visible and 2 hidden cubes composed of 72 lines. In addition, all tests are flanked by 4 visible and 1 hidden cube formed by 76 more lines. The tests were run with 4 regions (376 lines, 20 visible cubes, 9 hidden cubes), 8 (680 lines, 36 visible cubes, 17 hidden cubes), 12 (984 lines, 52 visible cubes, 23 hidden cubes), 16 (1288 lines, 68 visible cubes, 33 hidden cubes) and 50 (3872 lines, 204 visible cubes, 101 hidden cubes).

waltzdata

Learnings

  1. Waltzdb requires a large number of action evaluations. Some actions (reversing edges or making junctions) don’t imply a change in context (moving the state forward); such actions can be batched and marshaled to the scripting engine in a single shot.
  2. A fact assertion can lead to multiple rule evaluations. Instead of invoking a Lua function for every rule evaluation, all rule evaluations for a single fact assertion can be batched and marshaled to Redis in a single shot.
  3. durable_rules doesn’t support fact modification. To avoid excessive churn in the scheduled action queue, events can be used (instead of facts) to drive the state machine coordination.
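
The first two learnings boil down to the same batching pattern. The sketch below is a toy illustration in plain Python (durable_rules marshals msgpack payloads to a Lua script; the class and operation names here are made up):

```python
# Toy sketch of batched action marshaling: actions that don't move the
# state machine forward are queued and flushed as one payload, rather
# than paying one dispatch (one Redis round trip) per action.
import json

class ActionBatcher:
    def __init__(self, dispatch):
        self.dispatch = dispatch  # e.g. a single Lua script invocation
        self.pending = []

    def add(self, action, changes_context=False):
        self.pending.append(action)
        # an action that changes context forces a flush, preserving order
        if changes_context:
            self.flush()

    def flush(self):
        if self.pending:
            self.dispatch(json.dumps(self.pending))
            self.pending = []

calls = []
batcher = ActionBatcher(calls.append)
batcher.add({'op': 'reverse_edge', 'edge': [1, 2]})
batcher.add({'op': 'make_junction', 'j': 7})
batcher.add({'op': 'move_forward'}, changes_context=True)
print(len(calls))  # 1: three actions marshaled in a single dispatch
```
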

Results

JavaScript was the fastest, completing the 50 region problem in 21 seconds; memory usage reached 170MB including Redis memory usage. Both Python and Ruby finished the problem in about 23 seconds, not far behind. The memory usage for both Python and Ruby was below 100MB (including Redis).

waltzdbres

Just as with the Manners benchmark, execution time is directly correlated with the number of Redis commands executed. The 50 region case executed almost 3.5M Redis commands. The JavaScript test has a throughput of 159K Redis commands per second.

waltzdbcommands

Conclusion

durable_rules’ efficiency in combinatorics evaluation is competitive, and its memory usage is remarkable. The results compare well with well established production Business Rules systems. There is a performance penalty for marshaling actions in and out of Redis, as shown in Waltzdb.

Redis is a great technology, but it has to be used appropriately. The fact that Manners pushes 225K Redis commands and Waltzdb 159K Redis commands every second, while I was able to get 3M hget commands per second in a simple experiment, means I might still be able to squeeze more performance out of this system.

Polyglot

From its beginning, the objective of the durable_rules project has been to provide a library that can be leveraged in the most popular languages for building internet applications. Even though durable_rules borrows a healthy number of concepts from the Production Business Rules and Expert Systems domains, I consider it critical to tap into the infrastructure and knowledge base built over the last few years with languages such as JavaScript, Python and Ruby.

The initial durable_rules prototypes were built using JavaScript and the node.js environment. As I considered targeting a larger audience, the implementation evolved into a core engine written in C, using JSON as the lingua franca for defining both rulesets and the data to be processed. A set of thin adapters would allow using the engine from JavaScript, Python and Ruby. I borrowed the MongoDB JSON query syntax as the linguistic abstraction for defining rulesets. For optimal performance, the data in Redis would be stored using the message pack format (not JSON).

arch

Because scripting languages have great support for the JSON data model (hash-sets, arrays and primitives), I was optimistic (perhaps naive) that this approach would be a one-size-fits-all solution. However, as the project progressed, I found a couple of problems with this idea.

First problem

In durable_rules, the precedence of the logical (and, or) and sequence (when_all, when_any) operators matters. In MongoDB an expression like ‘qty > 100 and price < 9.95’ is represented as ‘{ qty: { $gt: 100 }, price: { $lt: 9.95 } }’. This is nice and simple, but the JSON data model doesn’t imply hash-set ordering. What if I don’t want the engine to evaluate price when qty < 100? …And then came other features such as correlations, statecharts, flowcharts, parallel, salience, tumbling windows, count windows… So I departed from MongoDB queries. The example below shows how a typical rule is defined in the latest durable_rules implementation. This rule waits for three events: the first of type ‘purchase’; the second with the same ‘ip’ address as the first but a different ‘cc’ number; and the third with the same ‘ip’ address as the first, but a ‘cc’ number different from both the first’s and the second’s.

JSON.stringify({
    suspect: {
        all: [
            {first: {t: 'purchase'}},
            {second: { 
                $and: [
                    {ip: {first: 'ip'}}, 
                    {$neq: {cc: {first: 'cc'}}} 
                ]
            }},
            {third: { 
                $and: [
                    {ip: {first: 'ip'}}, 
                    {$neq: {cc: {first: 'cc'}}}, 
                    {ip: {second: 'ip' }}, 
                    {$neq: {cc: {second: 'cc'}}}
                ]
            }}
        ],
    }
})

Second problem

The rule above expresses precisely what I want, but I just cannot imagine manually defining tens, hundreds or thousands of rules in the format above. It is not sustainable. Ruby, with its great DSL support, gave me the creative inspiration for making significant improvements to the user model. By leveraging language extensibility, I was able to define and follow a set of principles for the three libraries I have implemented so far (JavaScript, Python, Ruby):

  1. Leverage the language
    • Use existing expression language
    • Use existing scoping constructs
    • Use blocks, lambdas and decorators
  2. Enable compact and maintainable code
    • Avoid nesting/scoping when possible
    • Encourage the user to define short phrases, avoid cascading
    • Facilitate troubleshooting

To the extent possible I tried to drive uniformity across the library implementations; however, I did not sacrifice usability for uniformity. Below is a summary of the features I used to implement the libraries.

arch

Notes on Ruby

Ruby has the most comprehensive language extensibility: operator overloading for logical and arithmetic expressions, property accessor interception to leverage ‘.’ notation, block support to seamlessly integrate actions as part of the rule definition, and contexts applied to block execution to avoid unnecessary variable qualification in expression rvalues and action implementations. All these features lead to compact and fairly well integrated ruleset definitions.

Durable.ruleset :suspect do
  when_all c.first = m.t == "purchase",
           c.second = (m.ip == first.ip) & (m.cc != first.cc),
           c.third = (m.ip == second.ip) & (m.cc != first.cc) & (m.cc != second.cc) do
    puts "detected " + first.cc + " " + second.cc + " " + third.cc
  end
end

‘durable_rules’ exposes a REST API to post events and assert facts. I used the Sinatra framework to implement this API. It was very easy to integrate, except for the threading model: for each request a synchronous response is expected, and multiple concurrent requests are dispatched on different threads in a single process.

One caveat really bothers me: in Ruby it is not possible to overload logical ‘and’ and ‘or’, so I had to use the bitwise operators instead. Bitwise operator precedence is different from that of the logical counterparts, so logical expressions must be enclosed in parens: ‘(m.ip == first.ip) & (m.cc != first.cc)’ as opposed to ‘m.ip == first.ip && m.cc != first.cc’.

Notes on Python

Python has great language extensibility, similar to Ruby: operator overloading for logical and arithmetic expressions, and property accessor interception that allows using ‘.’ notation for tuple names; function decorators for rule action integration; contexts for leveraging scoping constructs when defining rulesets, statecharts and flowcharts. This results in compact and well integrated ruleset definitions.

with ruleset('suspect'):
    @when_all(c.first << m.t == 'purchase',
              c.second << (m.ip == c.first.ip) & (m.cc != c.first.cc),
              c.third << (m.ip == c.second.ip) & (m.cc != c.first.cc) & (m.cc != c.second.cc))
    def detected(c):
        print('detected {0} {1} {2}'.format(c.first.cc, c.second.cc, c.third.cc))
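
The mechanics behind the ‘.’ notation can be sketched in a few lines. This is an illustration of the technique, not the actual durable_rules internals: ‘__getattr__’ manufactures terms, and overloaded comparison operators return expression-tree nodes instead of booleans.

```python
# A minimal sketch (not the durable_rules implementation): attribute
# access plus operator overloading build an expression tree.
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __and__(self, other):          # overloads '&'
        return Expr('$and', self, other)

class Term:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):           # overloads '=='
        return Expr('$eq', self.name, other)

    def __ne__(self, other):           # overloads '!='
        return Expr('$neq', self.name, other)

class Model:
    def __getattr__(self, name):       # intercepts 'm.ip', 'm.cc', ...
        return Term(name)

m = Model()
expr = (m.ip == '1.2.3.4') & (m.cc != 'visa')
print(expr.op)                              # $and
print(expr.args[0].op, expr.args[0].args)   # $eq ('ip', '1.2.3.4')
```

The tree can then be serialized to the JSON format the C engine consumes.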

I used the Werkzeug framework to implement the REST API. I considered using Flask, but I really didn’t need all its features. As with Sinatra in Ruby, for each request a synchronous response is expected, and multiple concurrent requests are dispatched on different threads in a single process.

Python has the same caveat as Ruby regarding logical ‘and’ and ‘or’ overloading. Lambda support in Python is fairly limited, so I resorted to function decorators, still with a very reasonable outcome. In Python ‘=’ is a statement, so I had to use the ‘<<’ operator for naming events and facts in correlated sequences.
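
The precedence pitfall is easy to verify with the standard ast module (plain Python, no durable_rules involved): ‘&’ binds tighter than ‘==’ and ‘!=’, so the unparenthesized form groups the bitwise operator into the middle of a chained comparison instead of combining two separate tests.

```python
# Demonstrate why the parens are required: inspect how Python parses
# the two forms of the antecedent expression.
import ast

unparenthesized = ast.parse('m.ip == first.ip & m.cc != first.cc',
                            mode='eval').body
parenthesized = ast.parse('(m.ip == first.ip) & (m.cc != first.cc)',
                          mode='eval').body

# without parens: one chained comparison with the '&' buried inside it,
# i.e. m.ip == (first.ip & m.cc) != first.cc
print(type(unparenthesized).__name__)  # Compare
# with parens: a BinOp whose operator is BitAnd, i.e. two separate tests
print(type(parenthesized).__name__)    # BinOp
```
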

Notes on JavaScript

JavaScript does not have operator overloading, so a set of functions for describing logical and arithmetic expressions needs to be provided. Property accessor overloading is not readily available either; I played a nifty trick to overcome this obstacle (see the note below). Lambda support enables rule action integration. Contexts allow leveraging scoping constructs when defining rulesets, statecharts and flowcharts. While the result seems acceptable, the question remains: has this crossed the usability tipping point?

with (d.ruleset('suspect')) {
    whenAll(c.first = m.amount.gt(100),
            c.second = m.ip.eq(c.first.ip).and(m.cc.neq(c.first.cc)), 
            c.third = m.ip.eq(c.second.ip).and(m.cc.neq(c.first.cc), m.cc.neq(c.second.cc)),
        function(c) {
            console.log('detected ' + c.first.cc + ' ' + c.second.cc + ' ' + c.third.cc);
        }
    );
}

Node.js provides great support for implementing the REST API. In addition, the single-threaded, continuation-based model is (in my opinion) a beautiful invention: it helps with JavaScript ‘global’ variable management, enables non-blocking IO calls and provides nice concurrency control using multiple processes.

‘m.amount.gt(100)’: it is easy to take this for granted. To enable such an expression without requiring the user to define object hierarchies, I leveraged V8 extensibility in C++ (given that the rulesets run in node.js). In V8 it is possible to provide a proxy object that intercepts property Set/Get and calls back a JavaScript lambda:

C++ class definition:

#include <node.h>  // pulls in the V8 headers

using namespace v8;

class Proxy {
public:

  explicit Proxy(Handle<Function> gvalue, Handle<Function> svalue) { 
    Isolate* isolate = Isolate::GetCurrent();
    _gfunc.Reset(isolate, gvalue); 
    _sfunc.Reset(isolate, svalue); 
  }
  Handle<Object> Wrap();

private:

  static Proxy* Unwrap(Local<Object> obj);
  static void Get(Local<String> name, const PropertyCallbackInfo<Value>& info);
  static void Set(Local<String> name, Local<Value> value, const PropertyCallbackInfo<Value>& info);
 
  Persistent<Function> _gfunc;
  Persistent<Function> _sfunc;
};

From JavaScript

var s = r.createProxy(
    function(name) {
        // … get interceptor
    },
    function(name, value) {
        // … set interceptor
    }
);

Please find the full implementation in the durable_rules code base.

A note on multi-platform support

Unfortunately, a polyglot framework does not imply multi-platform support. In the Unix ecosystem, the ‘durable_rules’ libraries (including the C engine implementation) have a high degree of portability; that is to say, having implemented the framework on the Mac OS X platform, very few changes were needed to make it work on Linux. Supporting the Windows platform has been a lot more difficult. Visual Studio, the premier toolset for Windows, doesn’t fully support C99 (variable length arrays, asprintf, snprintf…); luckily, all these limitations can be overcome without fundamental changes or a separate code fork. Hiredis is not officially supported by Redis on Windows, but the MS Open Tech team has done a nice job porting and maintaining the Redis code base for Windows, and I extracted the ‘Hiredis’ portion for the ‘durable_rules’ C engine, as it heavily depends on it.