Parameterize Your Processing

Deploy code across multiple environments using variables to eliminate configuration changes.

Enterprise data analytics pipelines need to be flexible enough to accommodate different runtime conditions. Which version of the raw data should be used? Is the output directed to production or testing? Should records be filtered according to some criterion? Should a specific set of processing steps in the workflow be included or skipped? To increase development velocity, these choices need to be built into the pipeline as options.

A robust pipeline design will allow the engineer or analyst to invoke or specify these options using parameters. In software development, a parameter is some information (e.g. a name, a number, an option) that is passed to a program that affects the way that it operates. Parameters allow code to be generalized so that it can operate on a variety of inputs and respond to a range of conditions.
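As a minimal sketch of this idea, a pipeline entry point can expose its runtime choices as command-line parameters. The parameter names below (`--input-version`, `--target`, `--min-score`) are hypothetical examples, not part of any specific tool:

```python
import argparse

def parse_args(argv=None):
    # Each argument is a parameter the engineer or analyst can set at
    # invocation time instead of editing the code.
    parser = argparse.ArgumentParser(description="Run the data pipeline")
    parser.add_argument("--input-version", default="latest",
                        help="which version of the raw data to read")
    parser.add_argument("--target", choices=["production", "testing"],
                        default="testing",
                        help="where the pipeline output is directed")
    parser.add_argument("--min-score", type=float, default=0.0,
                        help="filter out records scoring below this value")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"Reading {args.input_version} data, writing to {args.target}, "
          f"filtering records below {args.min_score}")
```

The same script can then serve development, testing, and production simply by being invoked with different arguments.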

Parameters can also improve productivity. Consider a preprocessing job that performs an operation on some data. After running for several hours, it stops unexpectedly due to an error. An inflexible program might have to be restarted from the beginning, losing hours of processing. A parameterized program can be designed to restart at any specified point, in this case where the processing left off, so the run completes in far less time.
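The restart scenario above can be sketched with a resume parameter backed by a checkpoint file. The `process()` step and the checkpoint format here are hypothetical placeholders for the real work:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")

def process(record):
    # Placeholder for the real per-record preprocessing work.
    return record * 2

def run(records, start_at=None):
    """Process records, resuming from an explicit index or a saved checkpoint."""
    if start_at is None:
        # Fall back to the checkpoint written by a previous (failed) run.
        start_at = (json.loads(CHECKPOINT.read_text())["next"]
                    if CHECKPOINT.exists() else 0)
    results = []
    for i in range(start_at, len(records)):
        results.append(process(records[i]))
        # Record progress so an interrupted run can pick up here.
        CHECKPOINT.write_text(json.dumps({"next": i + 1}))
    return results
```

After a failure, invoking `run(records)` again skips the work already completed instead of repeating it.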

Parameters allow a data analytics pipeline to be designed to accommodate different runtime circumstances. This flexibility is critical for DataOps, which seeks to make analytics more responsive to the needs of the organization.

Why parameterize?

  • Parameters and named sets of parameters will increase your velocity.
  • With parameters you can vary:
    • Inputs (e.g., rerun against historical versions of the data, a "time machine").
    • Outputs.
    • Steps in the workflow.
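
The bullets above can be sketched as named parameter sets: one pipeline definition whose inputs, outputs, and workflow steps all vary by profile. The profile names and keys below are hypothetical illustrations:

```python
# Named parameter sets: each profile selects an input version, an output
# destination, and which workflow steps to include.
PROFILES = {
    "dev":  {"input_version": "2024-01-01", "output": "sandbox",
             "steps": ["clean", "transform"]},
    "prod": {"input_version": "latest", "output": "warehouse",
             "steps": ["clean", "transform", "publish"]},
}

def run_pipeline(profile_name):
    p = PROFILES[profile_name]
    log = [f"reading input version {p['input_version']}"]
    for step in p["steps"]:  # steps are included or skipped per profile
        log.append(f"running step: {step}")
    log.append(f"writing output to {p['output']}")
    return log
```

Switching environments then means changing one parameter, not editing the pipeline code.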