Building a highly scalable configuration rule engine for eCommerce

Dharmesh Panchmatia
10 min read · Jan 10, 2021

Abstract

Configuration is the process of selecting components for a product to ensure its validity at the time of order placement. Configuration rule engines drive the configuration experience of products on eCommerce platforms to ensure products can be introduced quickly without requiring any code changes.

Configuration rule engines need to be scalable to handle complex products such as network routers and switches, as there can be over a million records in their bill of materials (BOM). Most configuration rule engines on the market today struggle with the configuration of complex products.

We have built a scalable rule engine for Cisco’s eCommerce platform that can scale to millions of transactions per hour and can handle complex products with ease. This white paper explains configuration rule engine basics, highlights challenges with traditional rule engines, and details our novel approach to addressing these challenges.

Configuration Basics

Two key capabilities required to enable a configurable product on an eCommerce platform are:

  1. Application for Product setup: allows product teams to set up structure and configuration rules for products
  2. Rule Engine: applies rules at runtime and guides a user (buyer) with selection of product components

1. Application for Product Setup

Building an application for product setup is relatively straightforward, so I will not focus on that aspect. I will instead explain product structure and configuration rules with an example.

Example of product structure

The following is an example of the product structure for a rack server. A rack server consists of many components (SKUs), such as chassis, blades, processors, memory, storage, power supply, operating system, etc. I will use a rack server with only four components: blades, processors, memory and power supply, for simplicity.

BLADE, BLADENAME, Processor, Memory, Power and Power Supply are logical entities (groups) used for grouping similar physical components (SKUs). All other entries are physical components (SKUs). The “|” delimiter shows the depth of a component from the top level of the hierarchy.

Product: RACKSERVER
|- BLADE
| |- BLADENAME
| | |- B1
| | | |- Processor
| | | | |- P1
| | | | |- P2
| | | |- Memory
| | | | |- M1–256MB
| | | | |- M2–512MB
| | |- B2
| | | |- Processor
| | | | |- P3
| | | | |- P1
| | | |- Memory
| | | | |- M1–256MB
| | | | |- M2–512MB
|- Power
| |- Power Supply
| | |- AC-PWR-SKU
| | |- HIGH-AC-PWR-SKU

The product RACKSERVER consists of BLADE and Power. BLADE further contains BLADENAME, Processor and Memory units, and Power contains Power Supply. It is a hierarchical tree with the product RACKSERVER as the root of the tree. BLADE & Power are children of RACKSERVER and BLADENAME is a child of BLADE.

B1, B2, P1, P2, P3, M1–256MB, M2–512MB, AC-PWR-SKU & HIGH-AC-PWR-SKU are examples of physical SKUs. B1 & B2 are blade SKUs, P1, P2 & P3 are processor SKUs, AC-PWR-SKU & HIGH-AC-PWR-SKU are power supply SKUs, and M1–256MB & M2–512MB are memory SKUs with capacities of 256MB and 512MB respectively.
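As a rough illustration, the hierarchy above can be held in memory as nested mappings. This is only a sketch: the dictionary layout and the helper name `leaf_skus` are assumptions made for this example and are not taken from the engine described in this paper.

```python
# Hypothetical in-memory form of the RACKSERVER structure shown above.
# Dict keys are groups or SKUs; a list holds the leaf SKUs of a group.
RACKSERVER = {
    "BLADE": {
        "BLADENAME": {
            "B1": {"Processor": ["P1", "P2"], "Memory": ["M1–256MB", "M2–512MB"]},
            "B2": {"Processor": ["P3", "P1"], "Memory": ["M1–256MB", "M2–512MB"]},
        }
    },
    "Power": {"Power Supply": ["AC-PWR-SKU", "HIGH-AC-PWR-SKU"]},
}


def leaf_skus(node):
    """Yield every leaf SKU under a node, depth first (repeats included)."""
    if isinstance(node, list):
        yield from node
    else:
        for child in node.values():
            yield from leaf_skus(child)


# Distinct leaf SKUs that a buyer can pick for this product.
print(sorted(set(leaf_skus(RACKSERVER))))
```

Note that P1 and both memory SKUs appear under more than one blade, so even this tiny structure already repeats entries.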

Example of configuration rules

There are many types of configuration rules. We will focus on only three rule types (expansion, incompatibility and value constraint rules) to illustrate the challenges of dealing with complex products. Other rule types face similar issues.

Expansion Rule: expand a component (SKU) on selection of another component (SKU)

Rule1: Automatically add processor P1 if user selects blade B1.

If BLADENAME = B1 (Condition)
then Processor = P1 (Action)

Rule2: Automatically add processor P2 if user selects blade B2.

If BLADENAME = B2 (Condition)
then Processor = P2 (Action)

Rule3: Automatically add power supply AC-PWR-SKU if user selects processor P1.

If Processor = P1 (Condition)
then Power Supply = AC-PWR-SKU (Action)

Incompatibility Rule: one component is incompatible with another component. These rules can be defined on individual components (SKUs) or on a group of components.

Rule4: Hide power supply HIGH-AC-PWR-SKU on selection of blade B1 as it is not compatible
B1 incompatible HIGH-AC-PWR-SKU

Value Constraint Rule: constrain the number of components selected based on the total value of an attribute of those components (e.g., total memory for a rack server cannot exceed 1GB)

Rule5: Throw error message if total value of memory selected exceeds 1GB
If count(M1–256MB) > 4 or count(M2–512MB) > 2 or [count(M1–256MB) > 0 and count(M2–512MB) > 1] or [count(M1–256MB) > 2 and count(M2–512MB) > 0] (Condition)
then display error: Memory cannot exceed 1GB (Action)
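The four expressions in Rule5 are a hand enumeration of a single arithmetic condition: 256MB times the M1 count plus 512MB times the M2 count exceeds 1GB. The short check below compares the two forms over a range of counts. It assumes 1GB means 1024MB, and the function names are made up for this example.

```python
from itertools import product

M1_MB, M2_MB, LIMIT_MB = 256, 512, 1024  # assumption: 1GB is treated as 1024MB


def rule5_fires(m1, m2):
    """Rule5's condition exactly as enumerated above."""
    return m1 > 4 or m2 > 2 or (m1 > 0 and m2 > 1) or (m1 > 2 and m2 > 0)


def exceeds_limit(m1, m2):
    """The intent behind Rule5: total selected memory exceeds 1GB."""
    return m1 * M1_MB + m2 * M2_MB > LIMIT_MB


# Compare both forms over every combination of 0 to 8 units of each SKU.
mismatches = [
    (m1, m2)
    for m1, m2 in product(range(9), repeat=2)
    if rule5_fires(m1, m2) != exceeds_limit(m1, m2)
]
print(mismatches)  # prints []
```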

2. Configuration Rule Engine

On selection of a component (SKU) of a product by a user, the eCommerce application sends the user selection to a configuration rule engine. The rule engine executes all relevant rules, changes the data set based on the outcome of the rules and then returns the next set of selectable components to the eCommerce application. The end goal is to ensure a valid configuration of the product which can be successfully ordered.

We can build rule engines in many different ways, but the following are typical approaches used:

Using forward chaining:
Forward chaining starts with the data and uses inference rules to extract more data until it reaches a goal. An inference engine using forward chaining searches the inference rules until it finds one where the antecedent (If clause) is true. This is also called constraining rule execution, as it narrows the selection options available to users based on previous selections.

Using a variation of the Rete algorithm:

Charles Forgy’s Rete algorithm provides a generalized functionality responsible for matching data tuples (“facts”) against productions (“rules”) in a pattern-matching system.

A drawback of the Rete algorithm is that it is memory intensive, because the number of permutations of data and rule tuples it keeps can grow exponentially. It needs a considerable amount of memory for every user session.

Hence many configuration rule engines use variations of forward chaining. For the rest of the discussion, we will focus on configuration rule engines implemented using forward chaining.

Traditional Configuration Rule Engines versus Our Approach

This section highlights the challenges with traditional rule engines and explains how we have overcome them.

1. Rich content at component level

Traditional Engines: Traditional rule engines do not let product teams provide rich content as part of product setup. This feature is critical for complex products with a high number of SKUs under the same logical group. For example, high-end blade servers can have 50+ processor SKUs, and it is difficult for users to understand the differences between them without rich descriptive content.

Our Approach: We have built content management features into our configuration rule engine. Our rule engine offers card templates that allow product teams to provide rich content, such as images and verbiage, at the component level as part of product setup. Product teams can change the images or the verbiage for a product at any time without a code change. This feature is a competitive differentiator for Cisco.

2. High resource consumption

Assume that a rack server has 16 slots and there are 10 different blades that can be placed in these slots. In addition, assume there are 50 different processor SKUs and 40 different memory SKUs available for each blade. We can configure each blade with a combination of different processor and memory SKUs.

Traditional Engines: In a typical engine, the product structure repeats the 10 blades, with all their underlying processor and memory SKUs, for each of the slots. This results in a very large product structure (for each slot -> 10 blades, for each blade -> 50 processor SKUs & 40 memory SKUs). In addition, most traditional rule engines are stateful and maintain the product structure in session memory for every user session. This results in very high memory consumption.

Our Approach: We use dynamic instantiation to address the above problem, and we have built a stateless rule engine instead of a stateful one. Because the engine is stateless, the eCommerce application has to send all previous user selections to the rule engine with every new selection, which results in a much bigger payload. In return, the engine does not have to hold previous user selections in memory, which significantly reduces its memory footprint. The memory requirement for large, complex products drops by more than 70% compared to traditional rule engines.
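The paper does not spell out how dynamic instantiation works, so the sketch below is only one plausible reading: the option lists for each blade are defined once as a shared template, and a small per-slot record is created only when the user fills a slot. The placeholder SKU names (P1 to P50, M1 to M40), the `TEMPLATES` table and the `instantiate` helper are all assumptions made for this illustration.

```python
SLOTS, BLADES, PROCESSORS, MEMORIES = 16, 10, 50, 40  # numbers from the example above


def blade_template():
    """Option lists for one blade, defined once and shared by every slot."""
    return {
        "Processor": [f"P{i}" for i in range(1, PROCESSORS + 1)],
        "Memory": [f"M{i}" for i in range(1, MEMORIES + 1)],
    }


# Traditional layout: every slot repeats every blade with all its options.
static_nodes = SLOTS * BLADES * (1 + PROCESSORS + MEMORIES)

# Assumed dynamic layout: one shared template per blade, regardless of slot count.
TEMPLATES = {f"B{b}": blade_template() for b in range(1, BLADES + 1)}
template_nodes = BLADES * (1 + PROCESSORS + MEMORIES)


def instantiate(slot, blade, processor, memory):
    """Create the per-slot record for a selection, validated against the template."""
    options = TEMPLATES[blade]
    if processor not in options["Processor"] or memory not in options["Memory"]:
        raise ValueError("selection is not offered for this blade")
    return {"slot": slot, "blade": blade, "processor": processor, "memory": memory}


print(static_nodes, template_nodes)  # prints 14560 910
```

Under these assumptions the repeated structure holds 16 times as many nodes as the shared templates. The 70% figure quoted above comes from the paper, not from this sketch.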

3. Poor response times

Traditional Engines: Let's continue with the rack server product structure and the 5 rules described in the previous section. When a user selects blade B1, the engine checks the antecedent (if condition) of all 5 rules to see if any of them are satisfied. The conditions of Rule1 and Rule4 are satisfied, so the consequent of Rule1 (then Processor = P1) is added to the data, and the item HIGH-AC-PWR-SKU is removed from the domain as it is incompatible with blade B1.

Since the engine has added a component P1 to the data set, the engine needs to check all the rules again to see if any “if conditions” are satisfied after addition of P1. “If conditions” of all the rules are tested again, and this time, Rule3 satisfies the condition and hence Power Supply = AC-PWR-SKU is added.

The algorithm needs to test all the rules and their conditions and take action based on the “Then” part of each rule. These actions may add further components, which may satisfy further rules, so the rule base has to be searched again and again until no more components are added to the set.

A typical rack server will have 7000–8000 available components and 1000–2000 rules written on them. A rack server configuration requires selection of 50–100 components. Every component selection iterates through and tests the rules multiple times, which makes rule evaluation expensive from a processing perspective. Traditional rule engines compile rules once but perform pattern matching and rule execution every time they assert facts, resulting in very poor performance.
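The loop described above can be sketched as a naive forward-chaining pass over Rule1 to Rule4. The tuple encoding and the `forward_chain` function are assumptions made for this illustration, and Rule5 is left out to keep it short.

```python
# Rules 1-3 as (name, triggering SKU, SKU to add); Rule4 as (name, SKU, SKU to hide).
EXPANSION_RULES = [
    ("Rule1", "B1", "P1"),
    ("Rule2", "B2", "P2"),
    ("Rule3", "P1", "AC-PWR-SKU"),
]
INCOMPATIBLE = [("Rule4", "B1", "HIGH-AC-PWR-SKU")]


def forward_chain(selected, domain):
    """Re-test every rule until a full pass changes nothing.

    Returns the final selections, the remaining selectable domain,
    and the number of passes made over the whole rule base.
    """
    selected, domain = set(selected), set(domain)
    passes = 0
    while True:
        passes += 1
        snapshot = frozenset(selected)  # facts known at the start of this pass
        added = {
            action
            for _, condition, action in EXPANSION_RULES
            if condition in snapshot and action not in snapshot
        }
        removed = {
            hidden
            for _, sku, hidden in INCOMPATIBLE
            if sku in snapshot and hidden in domain
        }
        if not added and not removed:
            return selected, domain, passes
        selected |= added
        domain -= removed


result = forward_chain({"B1"}, {"AC-PWR-SKU", "HIGH-AC-PWR-SKU"})
print(result[2])  # prints 3
```

Selecting B1 takes three passes over the whole rule base: the first adds P1 and hides HIGH-AC-PWR-SKU, the second adds AC-PWR-SKU, and the third confirms that nothing else changes.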

Our Approach: Our rule engine works on the principle of “Rules compile and match once, run execution multiple times” versus the traditional approach of “Rules compile once, and match and execute multiple times”. By doing so, our algorithm avoids repeated evaluation of rules for every fact. In addition, we have optimized our code by using libraries such as Trove for hash maps to reduce execution time. Together, these changes improve performance by more than 90% compared to traditional rule engines. The 95th percentile response time for the rule engine is ~10 milliseconds.
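The paper does not describe the matching step in detail. One plausible reading of “match once” is that rules are indexed by the component that triggers them when the product is set up, so that a selection at runtime only follows the rules it can actually fire. The sketch below illustrates that idea with a plain dictionary standing in for a Trove hash map. The function names and the rule encoding are assumptions for this example, not the engine's actual design.

```python
from collections import defaultdict, deque

EXPANSION_RULES = [
    ("Rule1", "B1", "P1"),
    ("Rule2", "B2", "P2"),
    ("Rule3", "P1", "AC-PWR-SKU"),
]


def build_index(rules):
    """Done once at setup: map each triggering SKU to the rules it fires."""
    index = defaultdict(list)
    for name, condition, action in rules:
        index[condition].append((name, action))
    return index


def apply_selection(index, selected, new_sku):
    """Follow only the rules triggered by newly added SKUs.

    Returns the updated selections and the names of the rules that fired.
    """
    selected = set(selected) | {new_sku}
    queue = deque([new_sku])
    fired = []
    while queue:
        sku = queue.popleft()
        for name, action in index.get(sku, []):
            if action not in selected:
                selected.add(action)
                queue.append(action)
                fired.append(name)
    return selected, fired


INDEX = build_index(EXPANSION_RULES)
print(apply_selection(INDEX, set(), "B1")[1])  # prints ['Rule1', 'Rule3']
```

In this sketch Rule2 is never examined when the user selects B1, whereas a full rescan would test it on every pass.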

4. APIs for 3rd party Integration

Traditional Engines: Traditional rule engines do not expose APIs for integration with other eCommerce platforms. This becomes a limitation for companies that have a large channel partner ecosystem because it does not give their partners the flexibility to transact using B2B APIs.

Our Approach: Our rule engine exposes APIs that can work with any 3rd party platform as long as it has the ability to invoke external APIs. Approximately 25% of Cisco revenue is booked by our channel partners leveraging these APIs. We guarantee strong uptime (99.95% uptime) and response time SLAs (95th percentile response time of less than 200 milliseconds) to ensure smooth functioning of our large channel partner ecosystem.

5. Time to introduce new offers

Traditional Engines: Let us consider execution of the value constraint rule. The condition part of Rule5 has expressions for 2 memory SKUs: M1–256MB & M2–512MB. If we add a 3rd memory SKU, M3–128MB, then the existing rule has to be modified as follows. The added expressions are those that involve M3–128MB.

Rule5:

If count(M1–256MB) > 4 or count(M2–512MB) > 2 or [count(M1–256MB) > 0 and count(M2–512MB) > 1] or [count(M1–256MB) > 2 and count(M2–512MB) > 0] or count(M3–128MB) > 8 or [count(M1–256MB) > 0 and count(M3–128MB) > 6] or [count(M1–256MB) > 1 and count(M3–128MB) > 4] or [count(M1–256MB) > 2 and count(M3–128MB) > 2] or [count(M1–256MB) > 3 and count(M3–128MB) > 0] or [count(M2–512MB) > 1 and count(M3–128MB) > 0] or [count(M2–512MB) > 0 and count(M3–128MB) > 4] …

(Condition)
then display error: Memory cannot exceed 1GB (Action)

As one can see, the number of expressions grows from 4 to 12+ as the number of SKUs increases from 2 to 3. With 10 memory SKUs instead of 3, the number of expressions grows exponentially, and writing and testing these rules for accuracy becomes practically impossible.
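To get a feel for this growth, the expressions can be counted mechanically. Each expression in an enumerated rule corresponds to one minimal over-limit combination, meaning a set of counts that exceeds 1GB but falls back within the limit if any single unit is removed. The sketch below enumerates those combinations. It assumes 1GB means 1024MB, and `minimal_violations` is a name made up for this example.

```python
from itertools import product


def minimal_violations(capacities_mb, limit_mb=1024):
    """Return the minimal count combinations whose total capacity exceeds the limit.

    Each returned tuple corresponds to one expression in an enumerated rule.
    """
    # A minimal combination never holds more than limit // capacity + 1 units of one SKU.
    ranges = [range(limit_mb // c + 2) for c in capacities_mb]

    def total(counts):
        return sum(n * c for n, c in zip(counts, capacities_mb))

    result = []
    for counts in product(*ranges):
        if total(counts) <= limit_mb:
            continue
        # Minimal: removing any one selected unit brings the total back within the limit.
        if all(
            total(counts) - c <= limit_mb
            for n, c in zip(counts, capacities_mb)
            if n > 0
        ):
            result.append(counts)
    return result


print(len(minimal_violations([256, 512])))       # prints 4
print(len(minimal_violations([256, 512, 128])))  # prints 13
```

The count of 4 matches the original Rule5, and 13 is consistent with the 12+ expressions noted above once the three-SKU combinations are included.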

This is true not only for value constraint rules but for other rules as well. Consider a scenario where two more blades B3 and B4 are added as part of a new product introduction. All the blade rules have to be added again two times, once for B3, and once for B4.

In addition, if we have a line of products in which the product structures and the rules are very similar, with only minor differences, one still has to create a different product structure and set of rules for each product. This slows down the go-to-market of new products.

Our Approach: We have addressed manageability of rules by allowing rules to be defined on properties of components instead of on the components themselves. This reduces the work when new components are added (you define property values for the new components instead of adding new rules) and reduces the number of rules one needs to define. It also allows product management teams to create a single model for similar products instead of maintaining a different model for each product. We can model complex products in 1–2 days compared to 1–2 weeks for traditional configuration rule engines.
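As a minimal sketch of a property-based rule, the memory constraint can be written once against a capacity property instead of against individual SKUs. The property table `MEMORY_MB` and the function `memory_error` are illustrative names, not part of the engine described here, and 1GB is again assumed to mean 1024MB.

```python
from collections import Counter

# Property values are assigned per component at product setup (values from the example).
MEMORY_MB = {"M1–256MB": 256, "M2–512MB": 512}
LIMIT_MB = 1024  # assumption: 1GB is treated as 1024MB


def memory_error(selection):
    """Return the Rule5 error message if total selected memory exceeds the limit, else None."""
    counts = Counter(selection)
    total = sum(MEMORY_MB.get(sku, 0) * n for sku, n in counts.items())
    return "Memory cannot exceed 1GB" if total > LIMIT_MB else None


# Introducing M3–128MB only needs a new property value, not a new or modified rule.
MEMORY_MB["M3–128MB"] = 128

print(memory_error(["M2–512MB"] + ["M3–128MB"] * 4))  # prints None
print(memory_error(["M2–512MB"] + ["M3–128MB"] * 5))  # prints Memory cannot exceed 1GB
```

The rule body stays the same size no matter how many memory SKUs are introduced.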

Conclusion

Traditional configuration rule engines run into challenges with complex products, such as lack of content management, lack of APIs, high response times and high resource consumption. We have built a highly scalable rule engine that addresses these pain points and handles configuration of large, complex products with ease.

It handles 500,000 hits per hour with a 95th percentile response time of ~10 milliseconds with a single JVM with 1GB memory. It scales horizontally and can handle millions of transactions per hour as it is stateless. We sell over 10,000 hardware, software and hybrid products (software licenses as a component of hardware) leveraging our rule engine.
