Testgen landing v4 - DataKitchen Documentation

The Stakes Just Got Higher

Bad data has always been expensive. AI makes it exponentially worse.

A bad row that once broke one dashboard now corrupts thousands of downstream reports, model predictions, and automated decisions. The blast radius of a single data quality failure keeps growing.

The teams that stay trusted are the ones who got test coverage before they needed it. Not after the incident report.

Download Free Guide: AI & Data Quality →

From the community: r/dataengineering

"We just eyeball row counts and pray." When there's no time to write tests, this is the actual quality strategy at most data teams. The monitoring is vibes-based.

From the community: dbt Community Forum

"The data changes faster than I can keep the tests up to date." Handwritten tests become technical debt overnight. A schema change upstream breaks them all.

From the community: Hacker News

"Nobody gives us the time to write tests. It's always the next feature, never quality." This is the #1 reason data engineers do not test, confirmed across 849 community comments.

Research · 849 Community Voices

We Asked The Community Why Data Engineers Don't Test.

We read 849 comments across 18 threads on Reddit, Hacker News, Stack Overflow, and the dbt Community Forum. The answers were honest, funny, and occasionally brutal. They describe exactly the problems TestGen was built to solve.

Read the full breakdown →

#1 Barrier

Nobody gives us time to write tests

#2 Barrier

Data changes faster than tests can keep up

#3 Barrier

No domain knowledge to write meaningful tests

#4 Barrier

Too many false positives: alerts nobody trusts

TestGen addresses all four. See how →

What TestGen Does

Everything You Need.
None of What You Don't.

Built for data engineers who need coverage fast. Not another platform to configure for six months before it delivers value.

Data Profiling

51 column-level characteristics captured in a single run: types, patterns, nulls, value distributions, percentiles, and PII signals. Every table. No SQL written.

Hygiene Detection

27 types of data problems flagged automatically after profiling, before you write a single test. Invalid formats, mixed types, blank value variants, stale tables, and more.

Auto Test Generation

One profiling run creates 32 test types applied across every column, generating thousands of individual test instances automatically. TestGen infers bounds, patterns, and expected distributions from your data with no configuration required.

Data Catalog

360-degree column-level view: semantic type, value distribution, hygiene flags, PII risk, test results, and Critical Data Element tagging. All derived from profiling with no manual entry.

Quality Scoring

Automated scorecards roll up profiling and test results per table, domain, or pipeline zone. Drill to the column pulling the score down. Share a 1-click issue report.

Table Monitors

ML-driven anomaly detection on freshness, volume, schema drift, and metric drift. TestGen learns your data's normal behavior and alerts when it deviates. No thresholds to configure manually.

Business Rule Tests

10 configurable test types for rules that cannot be inferred from data automatically: Data Match, Prior Match, Aggregate Match, and more. Configure them in the UI with no custom SQL required.

CLI & CI/CD Integration

Run tests from any orchestrator or CI/CD system: Airflow, dbt, Azure Data Factory, GitHub Actions. A failed test run returns a non-zero exit code, which stops the pipeline before bad data reaches production. Works at every Medallion layer: Bronze ingestion, Silver transformation, and Gold delivery.
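To make the exit-code behavior concrete, here is a minimal sketch of a pipeline step that runs a quality command and stops on a non-zero exit code. The function names are hypothetical, and the command shown is a placeholder that simply exits with 0. It is not a TestGen command; the real invocation is in the CLI reference docs.

```python
import subprocess
import sys
from typing import Sequence


def run_quality_gate(command: Sequence[str]) -> int:
    """Run a data quality command and return its exit code."""
    completed = subprocess.run(list(command), check=False)
    return completed.returncode


def gate_or_fail(command: Sequence[str]) -> None:
    """Raise when the command exits non-zero, which fails the calling task."""
    code = run_quality_gate(command)
    if code != 0:
        raise RuntimeError(f"Data quality gate failed with exit code {code}")


if __name__ == "__main__":
    # Placeholder command (an assumption): swap in the real test-run
    # command from the CLI reference docs.
    placeholder = [sys.executable, "-c", "import sys; sys.exit(0)"]
    gate_or_fail(placeholder)
```

Most orchestrators treat an uncaught exception or a non-zero exit as a failed task, so downstream steps do not run.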

Observability Integration

TestGen is the data quality layer. DataOps Observability is the pipeline layer. Together they cover every point where data can fail: from a bad source column to a broken pipeline step to a wrong number in a dashboard.

TestGen results export directly into the DataOps Observability timeline. One view. Every failure. Source to customer.

Learn about DataOps Observability →

Who Is TestGen For?

Three Teams. One Tool.

Data Engineers

Get test coverage across all your tables, not just the four important ones that got manual tests.

No YAML. No SQL. No weeks of test-writing. Profile your tables, generate tests automatically, and integrate into your pipeline with a single CLI command. Works with Airflow, dbt, ADF, and any CI/CD system.

Data Quality Teams

Build scorecards, track quality over time, and create evidence that moves upstream teams to fix their data.

Quality scores by table, domain, or pipeline zone. Drill down to the exact column pulling the score down. One-click shareable issue reports give you something concrete to bring to the source team conversation.

Data Governance Teams

Automated PII detection. CDE tagging per column. Audit-ready issue reports. No manual catalog curation required.

Catalog your data assets, flag PII risks, and tag Critical Data Elements with evidence of quality at every layer. Quality scoring by business domain and stakeholder group. Everything is derived from profiling runs, not hand-entered metadata.

Real Examples. Not Marketing Copy.

What TestGen Actually Catches

Before you install anything, here is exactly what TestGen finds.

27 Hygiene Issues Found After Profiling

Non-standard blank values: empty string, N/A, 0 used as null, and similar variants
Invalid zip code format in string column
Leading or trailing spaces in text fields
Mostly dates stored in string column
Multiple data types within same column name
No column values present at all
Mostly not-null but sporadic empty values
Recency issue: no records within the last year
Duplicate values in a column expected to be unique
PII risk: email, phone, or SSN pattern detected
Quoted values found in string column
Similar values match when standardized
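
Two of the checks above, non-standard blank values and leading or trailing spaces, can be sketched in a few lines. This is an illustration only, not TestGen's implementation. The token list and function names are assumptions, and "0 used as null" is left out because it depends on the column's meaning.

```python
from typing import Iterable, Optional

# Assumed set of blank-like tokens, for illustration only.
BLANK_TOKENS = {"", "n/a", "na", "null", "none", "-"}


def is_nonstandard_blank(value: Optional[str]) -> bool:
    """True when a string stands in for a missing value.

    A real None is a standard null, so it is not counted here.
    """
    if value is None:
        return False
    return value.strip().lower() in BLANK_TOKENS


def has_padding(value: Optional[str]) -> bool:
    """True when a string has leading or trailing whitespace."""
    if value is None:
        return False
    return value != value.strip()


def summarize(values: Iterable[Optional[str]]) -> dict:
    """Count both hygiene issues over one column of values."""
    rows = list(values)
    return {
        "rows": len(rows),
        "nonstandard_blanks": sum(is_nonstandard_blank(v) for v in rows),
        "padded": sum(has_padding(v) for v in rows),
    }
```

For example, `summarize(["Boston", " N/A", "", None, "Austin "])` counts two non-standard blanks and two padded values out of five rows.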

Auto-Generated Test Examples

Alpha truncation: values cut at consistent length
Average shift: mean deviates from historical baseline
Constant value present: column stopped changing
Daily record count: row count outside expected range
Distinct value change: new or missing categorical values
Future date: timestamp values beyond today
Incremental average shift: trend is breaking
Value present in list-of-values: referential integrity check
Minimum and maximum value bounds exceeded
Percent unique: uniqueness ratio outside tolerance
Pattern match: format regex violated
Required entry: nulls found in a non-null column
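
As a rough sketch of what three of these tests measure (future date, required entry, and percent unique), the helpers below compute the underlying counts for a single column. The function names are hypothetical and the logic is a simplified assumption, not the way TestGen generates or evaluates tests.

```python
from datetime import date
from typing import Hashable, Optional, Sequence


def count_future_dates(values: Sequence[Optional[date]], today: date) -> int:
    """Future date: number of non-null dates later than `today`."""
    return sum(1 for v in values if v is not None and v > today)


def count_missing(values: Sequence[Optional[Hashable]]) -> int:
    """Required entry: number of nulls in a column expected to be non-null."""
    return sum(1 for v in values if v is None)


def percent_unique(values: Sequence[Optional[Hashable]]) -> float:
    """Percent unique: distinct non-null values divided by non-null values."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return len(set(present)) / len(present)
```

A generated test would compare numbers like these against bounds inferred from profiling; the bounds themselves are not shown here.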

Custom Business Rule Tests

Data Match: value matches a reference table lookup
Prior Match: value unchanged from the previous run
Aggregate Match No Drops: sum consistent across joins
Row count match between source and target
Cross-column consistency rules
Date sequence validation
Referential integrity across tables
Business-defined range or ratio checks
Configurable in the UI with no SQL required
Shareable with business users for review
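
For readers who want a feel for the source-to-target checks in this list, here is a minimal sketch of a row count match and a simple aggregate comparison over in-memory rows. The function names, the row shape, and the tolerance are assumptions for illustration. This is not TestGen's definition of Aggregate Match, and in TestGen these rules are configured in the UI rather than coded.

```python
from math import fsum
from typing import Mapping, Sequence

Row = Mapping[str, float]


def row_count_matches(source: Sequence[Row], target: Sequence[Row]) -> bool:
    """Row count match: source and target hold the same number of rows."""
    return len(source) == len(target)


def aggregate_matches(
    source: Sequence[Row],
    target: Sequence[Row],
    column: str,
    tolerance: float = 1e-9,
) -> bool:
    """Compare the sum of one column in source and target within a tolerance."""
    source_total = fsum(row[column] for row in source)
    target_total = fsum(row[column] for row in target)
    return abs(source_total - target_total) <= tolerance
```

If a join silently drops a row, both checks fail: the counts differ and the column total in the target falls short of the source.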

Transparent Pricing

All the Checkmarks.
None of the Typical Cost Burden.

No usage-based surprises. No VC-driven price resets. No 6-month sales cycles. Just a number, published on the page. No per-table tax. Monitor every asset without costs that balloon as your data grows. Vendors like Monte Carlo and Bigeye charge per monitored asset. We do not.

	TestGen Open Source	TestGen Enterprise	Typical Observability Vendor (e.g., Monte Carlo, Bigeye, Anomalo)
Price	$0 (free forever)	$100 per user per connection	$50K–200K+ per year, negotiated
Data Profiling (51 characteristics)	✓	✓	Partial
Auto Test Generation	✓	✓	✗
Hygiene Issue Detection (27 types)	✓	✓	✗
Quality Scoring & Dashboards	✓	✓	Partial
Table Monitors (ML anomaly detection)	✓	✓	✓
SSO / Multi-project / RBAC	✗	✓	✓
Pricing Transparency	✓	✓	✗
VC-Backed Pricing Risk	None	None	High

Why our pricing is public: DataKitchen is profitable and investor-free. We will not suddenly pivot pricing models or sunset features after a funding round, because there are no funding rounds. The enterprise version is $100 per user per connection. That's the number. It doesn't change in your renewal conversation.

The Data Quality Tax: The anomaly detection inside most $100K/year observability platforms is built on openly documented statistics: Z-score calculations, time-series variance, and record count comparisons. You are paying an enterprise markup on commodity math. TestGen gives you the same detection capabilities, transparently priced, with auto-generated tests that those platforms don't offer at all. Read the full analysis →
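
To show how small the Z-score calculation mentioned above really is, here is a sketch that scores the latest record count against a history of counts. It uses only the Python standard library. The function names and the threshold of 3.0 are assumptions, and this is not the code that TestGen or any vendor ships.

```python
from statistics import mean, pstdev
from typing import Sequence


def z_score(history: Sequence[float], latest: float) -> float:
    """Number of standard deviations between `latest` and the mean of `history`."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A flat history: any change at all is infinitely unusual.
        if latest == mu:
            return 0.0
        return float("inf") if latest > mu else float("-inf")
    return (latest - mu) / sigma


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag the latest value when its absolute Z-score exceeds the threshold."""
    return abs(z_score(history, latest)) > threshold
```

With daily counts of 100, 102, 98, 101, and 99, a new count of 100 is not flagged, while a count of 150 is.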

Who Builds TestGen

Built by Practitioners.
Not by a Sales Team.

DataKitchen has spent a decade building DataOps tooling for data engineering teams. We've written three books on DataOps. We built the open source DataOps Observability platform. We've spoken at data conferences worldwide.

TestGen is the data quality testing layer we always wished existed. It is purpose-built for the data engineer who needs coverage across hundreds of tables, not just the four most important ones that got manual tests.

We are profitable. We are independent. We are not going to raise a Series B and tell you your annual contract is going up 3x. We charge $100 per user per connection for the enterprise version. That is the number. It's on the page. It doesn't change.

Free Certification

Not ready to install yet? Get certified in Data Observability for free. Learn the concepts, prove the skills, and take it at your own pace.

Get certified free →

3

Books Published on DataOps
Including the DataOps Cookbook and The DataOps Way to Data Quality

10+

Years Building DataOps Tools
Across orchestration, observability, and data quality

$0

Venture Capital Raised
Profitable since year one. No investors, no pivot risk, no pricing surprises.

2

Open Source Products
TestGen for data quality and DataOps Observability for pipeline monitoring

Start In Under 5 Minutes

Your First Profiling Run
is One Command Away.

Runs in Docker. No cloud account required. No credit card. No sales call.

Two commands. That is it.

        $ pip install testgen-tooling
        $ tg launch

↓ Install on Mac / Linux ↓ Install on Windows ↓ Install via pip

A browser-based UI opens at localhost. Connect your database, run your first profile, and see hygiene issues and auto-generated tests within minutes.

▸ Works with: Snowflake · Databricks · PostgreSQL · AWS Redshift · Azure Synapse · Azure SQL

Already running pipelines? TestGen's CLI integrates directly into Airflow, dbt, and Azure Data Factory as well as any CI/CD system.
Run tests on every pipeline execution. Fail the pipeline job when data quality fails.

→ CLI reference docs · → Full documentation · → Product tour (3 min)

You Have 400 Tables. Tests for 4 of Them. TestGen Fixes That.