Get Started with Tests

One of the primary DataOps practices is to write and automate tests against both code and data to ensure all of the people, tools, and processes of production and development pipelines operate accurately and produce quality outputs—and catch problems before a customer does.

Use Automation to add tests and monitor different aspects of your recipes.

See Test Patterns and Examples for a summary of test types with sample cases and configurations.

Testing best practices

The lifecycle of a recipe should include increasing test coverage over time, often adding tests in response to encountered issues as insurance against their recurrence. For example, if a particular data vendor has a history of data quality or delivery timeliness issues, automated testing that is refined over time can help hold the vendor accountable to an SLA.

Test frequency

Frequent testing is important to identify and resolve issues as they occur. If each node in a recipe graph includes a test, a user can identify the node generating an error (as opposed to investigating the entire data analytic pipeline to diagnose an issue).
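
As a minimal sketch of this idea (the step names, data, and check functions below are hypothetical, not part of any specific product), each node in a pipeline can carry its own test so that a failure points directly at the offending step:

```python
# Hypothetical sketch: pair each pipeline node with its own test so a
# failure identifies the exact node instead of the whole pipeline.

def extract():
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.5}]

def transform(rows):
    return [{**r, "amount_cents": int(r["amount"] * 100)} for r in rows]

# Each entry: (node name, step function, per-node test on the step's output).
pipeline = [
    ("extract", extract, lambda out: len(out) > 0),
    ("transform", transform, lambda out: all(r["amount_cents"] >= 0 for r in out)),
]

def run(pipeline):
    data = None
    for name, step, test in pipeline:
        data = step() if data is None else step(data)
        if not test(data):
            # The error names the failing node, narrowing the investigation.
            raise RuntimeError(f"Test failed at node: {name}")
    return data

result = run(pipeline)
```

Because every node reports its own pass/fail status, a broken run raises an error naming the node that produced it.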

Consider three categories of tests

  • Verify that data inputs are free from issues.
  • Check that your business logic is correct.
  • Confirm that your outputs are consistent and correct.

Start with basic tests

  • Use a recency test to check that the data is up-to-date and has the expected date stamp, even before performing any transformations.
  • Conformity tests check that data matches an expected format (e.g., Social Security numbers, NPI numbers, ZIP codes, phone numbers).
  • Write a count verification test to make sure input data sets arrive at an expected size (bytes or row count) when they enter the pipeline.
    • Write another count verification test to make sure output data sets are the expected size when complete.
  • Use a consistency test for DateTime stamps.

    For example, make sure the beginning date or time of a transaction is before the end date or time.

  • As you write the ETL or ELT code, implement a data validation test for any assumptions, to make sure they remain true.

  • Check that the rows in the raw tables at the beginning of the pipeline match the target tables with a location balance test.

    If the number of rows shrinks or grows unexpectedly, there may be an issue with an underlying WHERE clause (or data that causes an issue with a WHERE clause).

  • A historical balance test can verify expected trends, such as whether a specific value should increase, decrease, or stay the same over time.
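
Several of these basic tests can be expressed as simple assertions. The field names, thresholds, and sample batch below are illustrative assumptions, not a prescribed schema:

```python
import re
from datetime import datetime, timedelta

# Hypothetical input batch; field names and thresholds are examples only.
rows = [
    {"txn_start": datetime(2024, 1, 5, 9, 0), "txn_end": datetime(2024, 1, 5, 9, 30)},
    {"txn_start": datetime(2024, 1, 6, 14, 0), "txn_end": datetime(2024, 1, 6, 15, 0)},
]
batch_date = datetime(2024, 1, 6)

def recency_test(batch_date, now, max_age_days=2):
    # Recency: the batch's date stamp is within the expected window.
    return now - batch_date <= timedelta(days=max_age_days)

def conformity_test(values, pattern=r"^\d{5}$"):
    # Conformity: every value matches the expected format (here, a ZIP code).
    return all(re.fullmatch(pattern, v) for v in values)

def count_verification_test(rows, min_rows=1, max_rows=1_000_000):
    # Count verification: the data set size is within expectations.
    return min_rows <= len(rows) <= max_rows

def consistency_test(rows):
    # Consistency: each transaction starts before it ends.
    return all(r["txn_start"] < r["txn_end"] for r in rows)

assert recency_test(batch_date, now=datetime(2024, 1, 7))
assert conformity_test(["02139", "94105"])
assert count_verification_test(rows)
assert consistency_test(rows)
```

Each check is a cheap boolean predicate, so it can run on every batch without noticeably slowing the pipeline.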

Use advanced tests for preventive measures

Advanced tests can also validate trends against internal knowledge.

In large organizations, knowledge of every step in a pipeline spans multiple teams. Proper testing often requires a collaborative effort to understand assumptions made about the data in every pipeline step. Advanced tests can help document these assumptions.

Note

To get started, advanced tests may require input from multiple team members.

  • Check that data are within known or expected ranges using conformity tests (for inputs) and range verification tests (for outputs).

    For example, are the stock prices of a particular company and monthly sales volumes within reasonable ranges?

  • Write field validation tests to ensure that information is complete and there are no missing fields or values.

  • Add input checks, such as historical balance tests, to check that your data providers don't have errors. Create a data supplier report card for reference.

  • Review data outputs for negative trends. Implement completeness tests, which are historical balance tests for analytic outputs.
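
As an illustration of the first two items (the record shape, required fields, and price range below are hypothetical), field validation and range verification reduce to assertions over records:

```python
# Hypothetical sketch of field validation and range verification tests.
records = [
    {"ticker": "ACME", "price": 102.5, "monthly_sales": 4800},
    {"ticker": "ACME", "price": 98.7, "monthly_sales": 5100},
]

REQUIRED_FIELDS = {"ticker", "price", "monthly_sales"}

def field_validation_test(records, required=REQUIRED_FIELDS):
    # Field validation: no missing fields and no null values.
    return all(
        required <= r.keys() and all(r[f] is not None for f in required)
        for r in records
    )

def range_verification_test(records, price_range=(1.0, 1000.0)):
    # Range verification: values fall within known, reasonable bounds.
    lo, hi = price_range
    return all(lo <= r["price"] <= hi for r in records)

assert field_validation_test(records)
assert range_verification_test(records)
```

The bounds themselves usually come from the team members who know the data, which is why these tests benefit from the collaborative effort described above.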

Advanced testing use cases

Write statistical process control tests against multiple facets of data and analytic processes to spot negative trends early. These time balance tests help your teams produce analytic insights about their internal processes, which can lead to process improvements and increased efficiency, productivity, and quality.
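
One common form of statistical process control test checks a new measurement against control limits derived from recent history. The daily row counts and the three-sigma limit below are illustrative assumptions:

```python
import statistics

# Hypothetical daily row counts from recent pipeline runs.
history = [1010, 990, 1005, 998, 1002, 995, 1008]

def spc_test(history, new_value, sigmas=3):
    # Flag values outside mean +/- sigmas * stdev of recent history.
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(new_value - mean) <= sigmas * stdev

assert spc_test(history, 1003)      # within control limits
assert not spc_test(history, 1200)  # out-of-control signal worth investigating
```

Run against row counts, latencies, or error rates over time, this kind of check surfaces negative trends before they become customer-visible problems.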

Respond to reports of data issues by adding tests for the specific conditions, so they never happen again. That may mean a consistency or conformity check to catch data quality issues, a data validation test against the business logic, or location balance and range verification tests to audit for data accuracy problems.