Reuse and Containerize¶
Developing components only once and consolidating complex analytics for use across many pipelines and use cases saves development time.
Reuse:
- Do not create one "monolith" of code.
- Reuse the code and results.
Containerize:
- Manage the environment for each component (e.g. Docker, AMI).
- Practice environment version control.
Reuse code as configuration¶
In DataOps, the data analytics team moves at lightning speed using highly optimized tools and processes. One of the most important productivity tools is the ability to reuse and containerize code. When we talk about reusing code, we mean reusing data analytics components. We think of all of the files that comprise the data analytics pipeline as code: scripts, source code, algorithms, HTML, configuration files, and parameter files. As in other software development, code reuse can significantly boost coding velocity.
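As a minimal sketch of treating pipeline files as code, the component below separates reusable logic from a versioned parameter file. The function name, field names, and thresholds are hypothetical, not from the original text:

```python
import json

def filter_by_price(rows, params):
    """Reusable pipeline component: keep rows within a configured price range.

    The logic lives in code; the thresholds live in a parameter file that is
    version-controlled alongside it, so any pipeline can reuse this component
    by supplying its own parameters.
    """
    lo, hi = params["min_price"], params["max_price"]
    return [row for row in rows if lo <= row["price"] <= hi]

# In practice the parameters would be read from a file checked into version
# control; a JSON string stands in for that file here.
params = json.loads('{"min_price": 0, "max_price": 100}')
rows = [{"sku": "a", "price": 5}, {"sku": "b", "price": 250}]
print(filter_by_price(rows, params))
```

Because the parameters are data rather than hard-coded constants, two pipelines can share the identical component while behaving differently.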
Code reuse saves time and resources by leveraging existing tools, libraries, or other code in the extension or development of new code. If a software component has taken several months to develop, it effectively saves the organization several months of development time when another project reuses that component. This practice can be used to decrease project budgets. In other cases, code reuse makes it possible to complete projects that would have been impossible if the team was forced to start from scratch.
Containers make code reuse much simpler. A container packages everything needed to run a piece of software — code, runtimes, tools, libraries, configuration files — into a stand-alone executable. Containers are somewhat like virtual machines, but use fewer resources because they do not include full operating systems. A given hardware server can run many more containers than virtual machines.
Containerize¶
A container eliminates the problem in which code runs on one machine, but not on another, because of slight differences in the setup and configuration of the two servers or software environments. A container enables code to run the same way on every machine by automating the task of setting up and configuring a machine environment. This is one DataOps technique that facilitates moving code from development to production — the run-time environment is the same for both. One popular open-source container technology is Docker.
The output of each step in the data analytics pipeline is the input to the next step. It is cumbersome to work with an entire data analytics pipeline as one monolith, so it is common to break it down into smaller components. On a practical level, smaller components are much easier for other team members to reuse.
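The chaining described above can be sketched as three small components, where each stage's output feeds the next stage. The stage names and toy data are illustrative assumptions, not part of the original text:

```python
def extract(raw):
    # Parse raw comma-separated lines into records.
    return [line.split(",") for line in raw.splitlines() if line]

def transform(records):
    # Convert the second field of each record to a number.
    return [(name, float(value)) for name, value in records]

def load(records):
    # Stand-in sink: return a summary instead of writing to a warehouse.
    return {"rows": len(records), "total": sum(v for _, v in records)}

# Each stage's output is the next stage's input; because the stages are
# separate components, any one of them can be swapped out or reused
# in another pipeline.
raw = "widgets,3.5\ngadgets,1.5"
result = load(transform(extract(raw)))
print(result)  # {'rows': 2, 'total': 5.0}
```

Breaking the monolith apart this way is what makes each piece independently testable and reusable.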
Some steps in the data analytics pipeline are messy and complicated. For example, one operation might call a custom tool, run a Python script, transfer files over FTP, and execute other specialized logic. This operation might be both hard to set up, because it requires a specific set of tools, and difficult to create, because it requires a specific skill set. This scenario is another common use case for creating a container. Once the code is placed in a container, it is much easier for other programmers to use: they don't need to be familiar with the custom tools inside the container, only with the container's external interfaces. All of the complexity is embedded inside the container. It is also easier to deploy that code to different environments. Containers make code reuse much more turnkey and give developers much greater flexibility in sharing their work with each other.
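The idea of hiding a messy step behind a simple external interface can be sketched in code, analogous to how a container exposes only its inputs and outputs. The class, method names, and placeholder internals below are hypothetical:

```python
class FeedIngestStep:
    """Hypothetical wrapper: a complicated operation (custom tools, FTP
    transfers, scripts) hidden behind one simple external interface,
    much as a container hides its internal tooling."""

    def run(self, source):
        # Callers only see run(); internally this step might fetch files
        # over FTP, invoke a custom binary, and run cleanup scripts.
        payload = self._fetch(source)
        return self._parse(payload)

    def _fetch(self, source):
        # Placeholder for the specialized retrieval logic.
        return f"data-from:{source}"

    def _parse(self, payload):
        # Placeholder for the specialized parsing logic.
        scheme, _, body = payload.partition(":")
        return {"source": body, "ok": scheme == "data-from"}

step = FeedIngestStep()
print(step.run("vendor-feed"))  # {'source': 'vendor-feed', 'ok': True}
```

Just as with a container, a teammate can use `run()` without knowing anything about the tools behind it.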