# Uniqueness Tests
Uniqueness tests verify that data values are not inappropriately duplicated — whether at the column level (every value is distinct) or at the table level (no repeated rows). Shifts in uniqueness patterns can indicate duplication from faulty joins, ingestion errors, or upstream processing changes.
## Unique Values
Tests that no values for the column are repeated in multiple records.
Auto-generated when profiling finds that every value in the column is distinct (in tables with more than 500 rows).
| Attribute | Value |
| --- | --- |
| Scope | Column |
| Measures | Count of duplicate values |
| Threshold | Maximum acceptable duplicate count (default: 0) |
| On failure | Column values are not unique per row. |
| Default severity | Fail |
How it works: Computes `COUNT(*) - COUNT(DISTINCT column)` and fails when the result exceeds the threshold. A result of 0 means every value is unique.
When to use: This test is ideal when the database does not enforce a primary key or unique constraint on the column. It serves as an independent check on uniqueness. If a small number of exceptions to uniqueness are acceptable, increase the threshold to reflect the expected duplicate count.
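The measure described above can be sketched with a small in-memory database. This is an illustrative example only: the table and column names (`customers`, `customer_id`) are hypothetical, not part of the product.

```python
import sqlite3

# Minimal sketch of the Unique Values check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [(1,), (2,), (3,), (3,)],  # customer_id 3 is duplicated once
)

# COUNT(*) - COUNT(DISTINCT column) yields the number of duplicate values.
duplicate_count = conn.execute(
    "SELECT COUNT(*) - COUNT(DISTINCT customer_id) FROM customers"
).fetchone()[0]

threshold = 0  # default: no duplicates tolerated
print(duplicate_count)  # 1
print(duplicate_count > threshold)  # True, so the test fails
```

Raising `threshold` to 1 in this example would let the single duplicate pass, which mirrors how you would relax the test when a known number of exceptions is acceptable.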
## Percent Unique
Tests for a statistically significant shift in the percentage of unique values compared to baseline data.
Auto-generated when profiling finds more than 10 distinct values in a non-measurement column.
| Attribute | Value |
| --- | --- |
| Scope | Column |
| Measures | Cohen's H Difference (0.20 = small, 0.50 = moderate, 0.80 = large, 1.20 = very large) |
| Threshold | Maximum acceptable Cohen's H Difference (default: 0.5) |
| On failure | Significant shift in percent of unique values compared to baseline. |
| Default severity | Warning |
How it works: Compares the current ratio of distinct values to total values against the baseline ratio using the Cohen's H statistic — a standardized measure of the difference between two proportions. The test fails when the difference exceeds the threshold.
When to use: Think of this as a similarity test that measures whether the percentage of unique values is consistent with the baseline percentage. A significant change can indicate duplication or a meaningful shift in the column's cardinality relative to its row count. You can refine the threshold as you observe legitimate variation over time.
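The comparison above can be sketched using the standard arcsine formulation of Cohen's h for two proportions, h = |2·arcsin(√p₁) − 2·arcsin(√p₂)|. The proportions below (80% unique at baseline, 50% now) are made-up illustration values, and this sketch assumes the test uses the standard formula rather than any product-specific variant.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: standardized difference between two proportions."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

# Baseline: 800 distinct values in 1,000 rows -> 80% unique
# Current:  500 distinct values in 1,000 rows -> 50% unique
baseline_pct = 800 / 1000
current_pct = 500 / 1000

h = cohens_h(baseline_pct, current_pct)
threshold = 0.5  # default maximum acceptable difference

print(round(h, 2))      # 0.64
print(h > threshold)    # True: a moderate-to-large shift, so the test fails
```

Because the arcsine transform stabilizes variance, the same absolute change in percentage produces a larger h near 0% or 100% than near 50%, which is why the test uses h rather than a raw percentage difference.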
## Duplicate Rows
Tests for the absence of duplicate rows based on a unique combination of column values.
Auto-generated based on profiling results. The system selects columns that together should form a unique key.
| Attribute | Value |
| --- | --- |
| Scope | Table |
| Measures | Count of duplicate records |
| Threshold | Maximum acceptable duplicate count (default: 0) |
| On failure | Column value combinations are duplicated in the table. |
| Default severity | Fail |
How it works: Groups rows by the specified columns and counts groups with more than one row. The result measure is the total number of duplicate rows found. The test fails when this count exceeds the threshold.
When to use: When auto-generated, the test considers all columns by default to protect against duplication of entire rows. If you know the minimum set of columns that should constitute a unique record (such as a set of IDs), edit the test to use only those columns; this makes the test more sensitive. Alternatively, if certain columns can always be excluded (such as `file_date` or `refresh_snapshot_id`), removing them also tightens the test.
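The grouping logic above can be sketched as follows. The table and key columns (`orders`, `order_id`, `line_no`) are hypothetical, and this sketch assumes "duplicate rows" means rows beyond the first in each group, one reasonable reading of the measure.

```python
import sqlite3

# Minimal sketch of the Duplicate Rows check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, line_no INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 1), (2, 1)],  # (2, 1) appears three times
)

# Group by the candidate key columns and sum the rows beyond the first
# in each group: the total number of duplicate rows.
duplicate_rows = conn.execute(
    """
    SELECT COALESCE(SUM(cnt - 1), 0)
    FROM (SELECT COUNT(*) AS cnt
          FROM orders
          GROUP BY order_id, line_no)
    WHERE cnt > 1
    """
).fetchone()[0]

threshold = 0  # default: no duplicate rows tolerated
print(duplicate_rows)  # 2 (two extra copies of the (2, 1) row)
print(duplicate_rows > threshold)  # True, so the test fails
```

Note how the choice of `GROUP BY` columns controls sensitivity: grouping by `order_id` alone would also flag the two `(1, 1)` / `(1, 2)` rows as duplicates, while adding a volatile column such as a load timestamp could hide genuine duplicates entirely.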