Statistical Process Control Example

Use case: A data analysis pipeline uses a machine learning (ML) model to predict sales. The pipeline monitors predictions against actual results to ensure that the model performs within an acceptable error rate.
Example recipe: The recipe uses a container node to load and run the ML model against a dataset. A time balance test, written for statistical process control, ensures that the root mean square error (RMSE) of the model is less than a threshold deemed acceptable. After the recipe variation has run a few times, the test metrics can be analyzed for outliers and trends, and the process adjusted accordingly.
Node type: container node.

Step 1: Define container inputs

The first step in the statistical process control test is to define the inputs for the container.

Visual example in Automation

The container node loads two inputs: the ML model, stored as a Jupyter Notebook in a dictionary source, and the dataset from a Redshift source.

(Screenshot of the container inputs)

Summary of how-to

Open the node editor.
From the Inputs tab, configure the Source inputs.
Select Add notes for changelog, describe the changes, then click Update.

Step 2: Declare runtime variable

When configuring the container node, declare values exported from the Jupyter Notebook as variables. The predicted_total_sales_rmse variable will be used in a test to place controls on the ML process.
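As an illustration of where such a value might come from, the sketch below computes an RMSE in plain Python and stores it in a variable named predicted_total_sales_rmse. The sales figures and the rmse helper are hypothetical. The contents of the actual notebook and the mechanism that exports the variable from the container are not shown in this example.

```python
import math


def rmse(actual, predicted):
    """Return the root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical sales figures, for illustration only.
actual_total_sales = [120000.0, 135000.0, 128000.0, 142000.0]
predicted_total_sales = [118000.0, 138000.0, 125000.0, 145000.0]

# The value a notebook could expose under the name used by the test.
predicted_total_sales_rmse = rmse(actual_total_sales, predicted_total_sales)
print(predicted_total_sales_rmse)
```

Because RMSE is expressed in the same units as the predicted quantity, a control value such as a sales amount can be compared against it directly.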

Visual example in Automation

The container node's config.json file defines a script export as a runtime variable. The container config defines the keys for the ML script and its input parameters. The container export includes the variable storing the total sales RMSE value.

(Screenshot of the JSON config file)

Summary of how-to

In the same node, select the Configuration tab. For more information, see Create a Container Node.

Step 3: Configure test

Write a statistical process control test that confirms the RMSE of the machine learning model is less than the threshold.

Visual example in Automation

The test ensures that the RMSE is within acceptable limits by comparing the RMSE value against a control value. It stops the pipeline execution if the forecasting accuracy deviates from what is acceptable.

(Screenshot of the Tests tab)

Summary of how-to

In the same node, select the Tests tab.
Create a statistical process control test.
1. Click Add Test.
2. Fill out the fields Test Name, Failure Action, Test Logic, and Control Value.
   Set Test Logic to Compare variable against metric, and set Test Variable to the variable exported in Step 2.
Select Add notes for changelog, describe the changes, then click Update.
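The comparison that this test performs can be sketched in Python as follows. This is a minimal illustration of the logic, not the platform's implementation: the evaluate_test function, the COMPARATORS table, and the ControlLimitExceeded exception are hypothetical names. Only the less-than comparison, the stop-on-error action, and the control value of 4000 come from the configuration used in this example.

```python
import operator

# Hypothetical lookup from a test-compare name to a Python operator.
# Only "less-than" appears in this example's configuration.
COMPARATORS = {"less-than": operator.lt}


class ControlLimitExceeded(Exception):
    """Raised when a stop-on-error test does not pass."""


def evaluate_test(name, test, variables):
    """Compare a runtime variable against the control value of one test."""
    value = float(variables[test["test-variable"]])
    logic = test["test-logic"]
    compare = COMPARATORS[logic["test-compare"]]
    passed = compare(value, logic["test-metric"])
    if not passed and test["action"] == "stop-on-error":
        raise ControlLimitExceeded(
            f"{name}: {value} is not {logic['test-compare']} {logic['test-metric']}"
        )
    return passed


test_config = {
    "action": "stop-on-error",
    "test-variable": "predicted_total_sales_rmse",
    "type": "test-contents-as-float",
    "test-logic": {"test-compare": "less-than", "test-metric": 4000},
}

# Hypothetical RMSE value exported by one run of the container.
runtime_variables = {"predicted_total_sales_rmse": 2783.9}
print(evaluate_test("validate_predicted_total_sales_rmse", test_config, runtime_variables))
```

Raising an exception on failure mirrors the intent of a stop action: later steps do not run on the output of a model whose error exceeds the control value.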

Variation metrics view

After this variation has run a few times, enough metrics accumulate to measure the effectiveness of the machine learning model. If outlier data points develop into consistent trends, the process can be adjusted.

A user can view the RMSE test trends over time from the Variation Metrics page.

(Screenshot of the Variation Metrics page)
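One common way to separate isolated outliers from a developing trend is a control chart with limits set at three standard deviations from the mean of a stable baseline. The sketch below applies that idea to a list of RMSE values. The history is hypothetical, and the three-sigma limits and the run-length check are generic statistical process control conventions, not features documented for the Variation Metrics page.

```python
import statistics


def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) control limits from baseline values."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    return center - sigmas * spread, center, center + sigmas * spread


def outliers(values, lower, upper):
    """Return the indexes of values that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


def longest_run_above(values, center):
    """Return the length of the longest run of consecutive values above center."""
    longest = current = 0
    for v in values:
        current = current + 1 if v > center else 0
        longest = max(longest, current)
    return longest


# Hypothetical RMSE values: a stable baseline, then more recent runs.
baseline_rmse = [2700.0, 2850.0, 2780.0, 2810.0, 2760.0, 2830.0]
recent_rmse = [2790.0, 2900.0, 2920.0, 3400.0, 3010.0]

lower, center, upper = control_limits(baseline_rmse)
print("control limits:", round(lower, 1), round(upper, 1))
print("runs outside limits:", outliers(recent_rmse, lower, upper))
print("longest run above center:", longest_run_above(recent_rmse, center))
```

A single point outside the limits may be noise, while a long run of values on one side of the center line suggests that the model's error has shifted and the process needs attention.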

File contents

The system records the test configurations shown above in the config.json, data_sources/dict_datasource.json, and notebook.json files.

notebook.json (ABRIDGED)

{
    "image-repo": "{{gpcConfig.image_repo}}",
    "image-tag": "{{gpcConfig.image_tag}}",
    "dockerhub-namespace": "{{gpcConfig.namespace}}",
    "container-input-file-keys": [
        {
            "filename": "SARIMA_ML_Model.ipynb",
            "key": "dict_datasource.resources_basic"
        },
        {
            "filename": "{{db_sales_filename}}",
            "key": "redshift_datasource.Load_Sales_Data"
        }
    ],
    "container-output-file-keys": [
        {
            "filename": "SARIMA_{{quarter}}_*.png",
            "key": "s3_datasink.*"
        },
        {
            "filename": "{{db_sales_filename}}",
            "key": "s3_datasink.Upload_Db_Sales_File"
        }
    ],
    "tests": {
        "validate_predicted_total_sales_rmse": {
            "description": "RMSE for total data",
            "action": "stop-on-error",
            "test-variable": "predicted_total_sales_rmse",
            "type": "test-contents-as-float",
            "test-logic": {
                "test-compare": "less-than",
                "test-metric": 4000
            }
        },
        . . .

        }
    }
}