Transaction In Doubt

My notes on Software Engineering

Tag: JScript

Miss Manners and Waltzdb

While durable_rules is a framework for complex event coordination, its linguistic abstraction is also suitable for solving business rules problems such as constraint propagation. At its core durable_rules is powered by a variation of the Rete algorithm. Measuring and driving its performance using well established Business Rules industry standards has been extremely valuable. It has brought  the quality of the implementation to a new level: improving performance by orders of magnitude and making the system more robust.

The most popular industry standards for production Business Rules systems are “Miss Manners” and Waltzdb. Designed more than 25 years ago, they have been subject of endless debate. Their direct applicability to customer problems as well as features used by platform vendors specifically to improve the numbers have been questioned. After all, however, we still use such benchmarks to understand the performance of Business Rules systems. Manners and Waltzdb are complementary benchmarks: while Manners tests the speed of combinatoric evaluation, Waltzdb tests the efficiency of action scheduling and evaluation. 

Just as Thomas Edison once famously said about genius, performance improvements too are one percent inspiration and ninety nine percent perspiration. Because durable_rules relies on Redis to store inference state, all performance improvements derived from optimizing how Redis is used. For each experiment I reset the slowlog, executed the test, reviewed the slowlog results and implemented an improvement (sometimes a regression 😦 ). I repeated this over and over. In my first experiments I set the slowlog threshold to 1 second, as I implemented effective improvements I reduced the slowlog threshold and by the end I was only reviewing the commands which took more than 1 millisecond.

All the performance measurements were done using Redis 3.0, unix sockets, disabling Redis disk save and iMac, 4GHz i7, 32GB 1600MHz DDR3, 1.12 TB Fusion Drive.

Principles

Consider the following tests plotted in the chart below:

  1. Blue: Execute a set of sequential hget operations from a node.js client in a tight loop. The throughput for retrieving a 200 byte value is about 25K operations every second, as the size increases, the throughput decreases to 11K operations for 20KB values.
  2. Green: A set of 100 hget operations batched in a multi-exec command: The throughput in the case of 200 byte value is around 160k operations every second, degrading to 18k when the value is 20KB.
  3. Yellow: A set of hget operations inside a Lua script: The throughput for 200 bytes jumps all the way up to 3M operations/sec, the throughput degrades to 458K for 20KB value.
  4. Red: A set of cmsg unpack operations inside a Lua script: The throughput for 200 bytes is 9M operations/sec, which degrades to 1.1M for 20KB values.
  5. Orange: A set of Lua table get inside a Lua script: The throughput remains constant at 63M operations/sec.

redis

What is important about the experiment above is the order of magnitude of each operation. For the Manners and Waltzdb benchmarks, where hundreds of thousands of combinatorics need to be evaluated in a few seconds, reducing sequential and batched access from redis clients is critical but not enough. Access to Redis structures from Lua as well as packing and unpacking using cmsg can also become a bottleneck. Thus, all performance optimizations were done following these principles.

  1. Push the combinatorics evaluation as close to the data as possible leveraging Lua scripts.
  2. Design chunky Lua script function signatures batching operations when possible.
  3. Avoid fetching the same information multiple times from Redis data-structures by using Lua local variables.
  4. Be careful with the shape of the data stored, avoid excessive cmsg usage.

Miss Manners

Miss Manners has decided to throw a party. She wants her guests not to get bored, so she needs to sit them such that adjacent guests are always of opposite sex and share at least one hobby. The problem needs to be resolved for 8, 16, 32 and 128 guests. 

mannerstable

The ruleset for the algorithm is relatively simple consisting of only seven rules (RubyPython, JScript). Guests hobbies are expressed using facts. Therefore the input datasets consists of 20, 40, 83 and 439 facts for 8, 16, 32 and 128 guests respectively. The declarative algorithm conceptually (not practically) creates the combinatorics tree for all the guests and selects the first valid solution via depth first search.

Learnings

The antecedent in the rule below (expressed in Python) is the heart of the algorithm, as it determines valid solutions. Consider the case of 128 seatings with 439 guest assertions: by the end of the test execution the first 3 terms of the expression will have generated hundreds of thousands combinatorics, the next two terms will have filtered out the invalid ones. Recalculating the full expression for all terms for all combinatorics every time a new seating is asserted slows down execution significantly. I addressed the problem with the following three improvements:


    @when_all(c.seating << (m.t == 'seating') & 
                           (m.path == True),
              c.right_guest << (m.t == 'guest') & 
                               (m.name == c.seating.right_guest_name),
              c.left_guest << (m.t == 'guest') & 
                              (m.sex != c.right_guest.sex) & 
                              (m.hobby == c.right_guest.hobby),
              none((m.t == 'path') & 
                   (m.p_id == c.seating.s_id) & 
                   (m.guest_name == c.left_guest.name)),
              none((m.t == 'chosen') & 
                   (m.c_id == c.seating.s_id) & 
                   (m.guest_name == c.left_guest.name) & 
                   (m.hobby == c.right_guest.hobby)),
              (m.t == 'context') & (m.l == 'assign'))
    def find_seating(c):
  1. When evaluating combinatorics and resolving action scheduling conflicts the rules engine has to choose the most recent result to enable the depth first search of a valid solution. This resolution policy has been implemented and pushed as de-facto standard since the early days of Business Rules.
  2. For a given rule, to avoid scanning and evaluating all expression lvalues (inference list) for every assertion, the inference results are stored in a hash-set indexed by the combined result of all equality expressions. Thus, the combinatorics calculation complexity becomes linear and depends on the number of assertions done on the rule.
  3. When evaluating a fact through a complex expression the engine might compare it with the same target fact multiple times, caching target facts in lua local variables reduces the number of Redis data-structure access.

Results

JavaScript is the fastest, as it solves the problem for 128 guests in about 1.7 seconds, on the flip side its memory usage is the highest reaching 46MB including Redis memory usage. Python can still solve the 128 guest problem below 2 seconds with remarkably efficient memory usage of 26MB including Redis memory. Ruby is not far behind, a little bit above 2 seconds and good memory usage of 35MB including Redis. 

manners

As shown in the graph below, benchmark execution time is directly correlated with the number of Redis commands executed. In the case of 128 guests more than 400K commands were executed. In the case of JavaScript, this is a throughput of 225K Redis commands per second.

mannerscommands

Waltzdb

Waltz is an image recognition algorithm invented by Dr. David Waltz in 1972. Given a set of lines in a 2D space, the computer needs to interpret the 3D depth of the image. The first part of the algorithm consists of identifying four types of junctions: Line, Arrow, Fork, Tee. Junctions have 16 different possible labelings following Huffman-Clowes notation, where + is convex, – is concave and an arrow is occluding. Valid junctions and labelings:

waltzjunctions

It is important to point out: pairs of adjacent junctions constraint each other’s edge labeling. So, after choosing the labeling for an initial junction, the second part of the algorithm iterates through the graph, propagating the labeling constraints by removing inconsistent labels. Example labeling a cube with 9 lines:

waltzcube

The ruleset for the algorithm consists of 34 rules (RubyPythonJScript). In the Waltzdb benchmark, the dataset provided is a set of cubes organized in regions. Each region consists of 4 visible and 2 hidden cubes composed by 72 lines. In addition, all tests are flanked by four visible and 1 hidden cube formed by 76 more lines. The tests were run with 4 (376 lines, 20 visible cubes, 9 hidden cubes), 8 (680 lines, 36 visible cubes, 17 hidden cubes), 12 (984 lines, 52 visible cubes, 23 hidden cubes), 16 (1288 lines, 68 visible cubes, 33 hidden cubes) and 50 regions (3872 lines, 204 visible cubes, 101 hidden cubes).

waltzdata

Learnings

  1. waltzdb requires a large amount of action evaluations. Some actions (reversing edges or making junctions) don’t imply a change in context (moving forward state), such actions can be batched and marshaled to the scripting engine in a single shot.
  2. Fact assertions can lead to multiple rule evaluation, instead of invoking a Lua function for every rule evaluation, all rule evaluations for a single fact assertion can be batched and marshaled to Redis in a single shot. 
  3. durable_rules doesn’t support fact modification. To avoid excessive churn in the scheduled action queue, events can be used (instead of facts) to drive the state machine coordination.

Results

JavaScript was the fastest completing the 50 region problem in 21 seconds, memory usage reached 170MB including Redis memory usage. Both Python and Ruby finished the problem in about 23 seconds (not far behind). The memory usage for both Python and Ruby was below 100MB (including Redis).

waltzdbres

Just as the Manners benchmark, the benchmark time is directly correlated with the number of Redis commands executed. The case of 50 regions executed almost 3.5M Redis commands. The test in JavaScript has a throughput of 159K Redis commands per second.
waltzdbcommandsConclusion

durable_rules efficiency in combinatorics evaluation is competitive,  its efficiency in memory usage is remarkable. The results compare well with well established Production Business Rules systems. There is performance penalty for marshaling actions in and out of Redis as shown in Waltzdb.

Redis is a great technology, but it has to be used appropriately. The fact that Manners is able to push 225k Redis commands and Waltzdb 159K Redis commands every second, while I was able to get 3M hget commands/sec in a simple experiment, means I might be able to still squeeze more performance out of this system.

Boosting Performance with C

Right after posting my last blog entry I began work on Durable Rules third iteration. My main question was: How can I improve the framework performance by an order of magnitude across all scenarios? As you can see it has been quite a long time since my last blog post. Well… I have been very busy finding an answer for the question above.

Early on, I decided to re-implement the core rules engine in C. Compared with scripting and memory managed languages, C provides good control over memory allocation and a lot flexibility on data structure definition. Obviously, control and flexibility come at a price: It took me a lot longer to develop the code.

So, in the end, this exercise was not just simply porting code from one language to another. I had to rethink and design the data structures. I also had to design and write a JSon parser. And I must say: I also took some time to invent a new feature to enable batch processing for streaming. The detailed explanation is in the following sections.

Data Structure Definition

The most important part of the project was to define a suitable data structure to represent a ruleset. The central principle: the system has to guarantee O(n) (linear) performance as a function of the input (event) size. To illustrate how this is achieved, let’s consider the simple rule definition below:

var d = require('durable');
d.run({
    rating: {
        track: {
            whenAll: {
                a: { event: ’start’ },
                b: { event: ‘end’ }
            },
            run: function(s) { }
        }   
    }
});

The code above waits for two events ‘start’ and ‘end’ to execute the ‘track’ action. Conceptually this rule is represented by a Rete tree, which root is an alpha node for message based events (type = $m). This node is followed by two alpha nodes with two rules event = ’start’ and event = ‘end’ respectively. The alpha nodes are joined by the ‘all’ beta node, which is followed by the track action node.

rete4

In my C implementation the Rete tree is a trie. Every time an event is observed, each event message property is evaluated in the trie. The conceptual counterpart of a trie node is an alpha node in the Rete tree. Thus each trie node contains the rule to be applied to a single named property. Also each trie node points to subsequent properties to be evaluated (typically derived from an ‘and’ expression). The leafs of the trie point to a linked list of beta nodes terminated by an action node.

trie

Very important: the trie node keeps the map of subsequent properties in a hash table. This is a slight deviation over typical trie nodes used in text context search, which keep a map of subsequent valid characters using an array. I decided to handle collisions in the hash table using a simple linear probing algorithm. With linear probing I get better memory usage and acceptable predictable near constant time performance as long as the load in the tables is kept low.

In order to increase memory locality and make better use of the processor cache, the trie structure is packaged in two arrays. One array has all the alpha, beta and action nodes. While a second array contains all the hash table buckets for the trie nodes. Nodes and hash tables never hold pointers to other nodes, only array offsets.

array

JSon parsing

To reduce the cost of data transformation I decided to use JSon as the event type system for this new Rules engine. Thus a JSon object could be passed directly from a Web Request input stream without any extra processing. C doesn’t have built in facilities for JSon parsing. So I wrote a lightweight JSon parser following a few strict principles:

  • Avoid Object Model (DOM) definition and buffering
  • Avoid using the heap, only use stack memory
  • Optimize for parsing key value pairs in a single pass
  • Calculate property name hashes while parsing

The JSon parser is really just a simple state machine tailored for the specific use in the rules engine. An important aside: I chose the DJB algorithm to hash property names because it is known to have good distribution and it added minimal overhead when parsing. The property name hash codes, as explained in the section above, are critical for the Rete tree ultra fast evaluation.

Batching

Compared with events which don’t match any rule, single positive event evaluation is expensive because it needs to be recorded in the Redis cache. In some cases it triggers the evaluation of a beta join and in other cases the queueing of an action. Batching helps optimizing the cost of all this activity and allows for processing large streams of data. Let’s consider the following snippet of code:

d.run({
    approval: {
        rule: {
            whenSome: { $and: [ { subject: 'approve’ }, { $lte: { amount: 1000 }}]},
            run: function(s) {}
        },
    }
}, '', null, function(host) {
    host.postBatch('approval', [{ id: '0', sid: 1, subject: 'approve', amount: 100 }, 
                                { id: '1', sid: 1, subject: 'approve', amount: 100 },
                                { id: '2', sid: 1, subject: 'approve', amount: 100 },
                                { id: '3', sid: 1, subject: 'approve', amount: 100 },
                                { id: '4', sid: 1, subject: 'approve', amount: 100 }]);
});

The ‘rule’ action will be able to process at least one event which matches the expression at the time of evaluation. The postBatch function allows clients to post an array of event messages for evaluation and dispatching.

Benchmarks

In order not to lose track of my main objective I constantly measured performance when developing the project. When talking about performance there is always a lot of confusion. So first I will explain the methodology I used for measuring, then I will present the results for three important tests.

In all benchmarks: I used the same equipment: IMac OS X 10.9.4, 3.4GHz i7, 16GB RAM 1333MGHz DDR3. I drove the CPU utilization to 90% across all cores by running the same test concurrently in as many node.js processes as needed. In addition I added as many Redis servers as required to prevent it from becoming a bottleneck. I defined the number of operations to be performed and I measured the time it took to perform them.’Throughput’ is: the number of operations divided by the time it took to perform them. ‘Cost’ is: elapsed time divided by the number of operations. In the notes below I use the unit ‘K’ to denote thousand and ‘M’ to denote million.

Reject Events Test

In this test I defined a ruleset with 30 rules and continuously asserted event messages, which did not match any rule. I ran several experiments with different event message sizes. I used two different modes single and batch events.  The objective was to measure the raw JSon parsing and the tree expression evaluation speed. Just as a reference, not as baseline, I ran the same experiments for the JScript JSON.Parse function.

negative

In batch mode durable rules is able to process a little less than 10M small messages (30B) every second, in single message mode it is 25% slower (7.5M messages every second). In both batch and single mode durable rules is able to process 1M larger messages (1KB) every second. Important note: the complexity of the algorithm is linear (O(n)) to the size of event messages. Interesting factoid: parsing the same messages in JScript (JSON.Parse) is also linear, but every byte takes more cycles to parse, indeed it can parse 10M small messages every second but only 0.25M large messages every second.

Accept Events Test

Again I defined a ruleset with 30 rules and continuously asserted event messages. This time all event messages matched a rule in the set and lead to queueing an action. I ran experiments with different message sizes and I tried two modes: single and batch. The objective was to measure redis caching and beta join calculation done in Redis scripts.

positive

In batch mode durable rules is able to process 250K small event messages (50B) every second, while it can process 60K single small event messages every second. It can process 120K and 40K large event messages (1KB) every second in batch and single mode respectively. Again, the algorithm is linear (O(n)) to the size of the event message.

Rete Cycle Test

Finally I tested the full Rete cycle. I defined a 30 rules ruleset, but in this case not only were the event messages accepted, but the actions dequeued, the messages removed and the new associated state re-asserted. All of this was done while keeping the system consistency guarantees.

full

In batch mode durable rules can run 100K small event messages (50B) through the Rete cycle, while in single mode it can run 18K. In the case of larger event messages (1KB), the results are 15K and 65K in single and batch mode respectively. The algorithm is linear (O(n)) to the size of the event message.

In conclusion the new C implementation provides a significant performance boost over my previous JScript based implementation. The new batch\streaming feature allows for flowing a lot more data through the system making a more efficient use of the Redis cache.

To learn more, please visit: http://www.github.com/jruizgit/rules.