Data Profiling¶
Use data profiling to discover the structure, content, and quality characteristics of tables in your database. Profiling results populate the Data Catalog, surface hygiene issues, and serve as the foundation for automated test generation.
What profiling discovers¶
During a profiling run, TestGen scans the tables and columns in a table group and captures 55 column-level characteristics for each column. These characteristics fall into several categories:
- Data types and structure — The declared column type versus the actual content. TestGen suggests more appropriate data types when the data doesn't match the declared type (for example, a VARCHAR column that contains only dates).
- Value distributions — Record counts, null counts, distinct value counts, and the most frequent values. These reveal whether columns are populated as expected and whether the data has the expected cardinality.
- Statistical measures — For numeric columns: min, max, average, standard deviation, and percentiles. For text columns: min and max lengths, pattern analysis, and character content.
- Date analysis — Date ranges, recency distribution, and coverage across days, weeks, and months.
- Content detection — Pattern matching to identify data that may contain PII (emails, phone numbers, SSNs), embedded delimiters, quoted values, or mixed data types within a single column.
Hygiene issues¶
After collecting column statistics, the system analyzes the results to detect 32 types of data hygiene issues — structural inconsistencies and content anomalies that may indicate data quality problems. Examples include columns with non-standard blank values, inconsistent patterns across tables, data type mismatches, and potential PII in unexpected locations.
Each hygiene issue is assigned a likelihood rating (Possible, Likely, or Definite) and comes with a suggested action to guide your investigation.
How profiling supports your data quality workflow¶
Profiling is the starting point for several downstream features in TestGen:
- Test generation — TestGen uses profiling results to auto-generate data quality tests tailored to the actual content and patterns in your columns.
- Data Catalog — Profiling results populate the Data Catalog, giving you a hierarchical view of your table groups, tables, and columns with their statistics and quality indicators.
- Freshness monitors — Freshness monitors are auto-generated from profiling data. The profiling process identifies appropriate columns to use for fingerprint comparison, which is how Freshness monitors detect whether a table has been updated.
- Quality scores — Hygiene issue results contribute to the data quality scores visible on the Quality Dashboard.
When to run profiling¶
Profiling queries each column of every table in the table group individually, so the duration depends on the number of tables and columns, record counts, and database compute resources. Profiling is not designed to be part of your daily workflow. Best practice is to run profiling on a biweekly or monthly schedule during off-hours.
Running profiling periodically serves as an audit trail for data changes over time, keeps test definitions current with the realities of your database, and ensures that hygiene issues are detected as your data evolves.
Work with large datasets¶
When a table group points at a very large table — hundreds of millions to billions of rows — profiling takes longer and consumes more database compute. It also lowers the quality of the generated tests. For both reasons, define table groups around homogeneous subsets of your data rather than entire large tables.
The core issue is data homogeneity. When a table mixes records from different domains, the profiled statistics reflect that mix rather than the behavior of any meaningful segment. Consider a table that contains both pencil sales and car sales: the profiled sale price spans from two cents to $100,000, and the expected value lists combine pencil attributes such as color and form factor with car attributes such as make, model, and trim. Neither the ranges nor the value lists catch meaningful anomalies for either product type. Profiling a pencils subset and a cars subset separately produces ranges and value lists that are specific, actionable, and far more likely to detect problems.
The same principle applies to time. When your goal is to monitor for drift and maintain data quality going forward, profiling billions of rows of historical data adds noise. Profile the window that reflects current behavior — such as the past year of transactions — so the generated tests reflect the data you intend to monitor.
There are two practical ways to define a homogeneous subset:
- By business category — such as product type, region, or customer segment.
- By recency — a rolling time window, such as the trailing 12 months.
If you can create views on the target database, a view is the cleanest way to define a subset. Create a view for each category or time window, then point a table group at those views. For some test types, TestGen also supports a subset condition that applies a row-level filter to an individual test. Use a subset condition to scope a single test without a view, but note that you apply it manually per test. See Test Definitions.
Getting started¶
To run profiling, you need a table group that defines which tables and columns to scan. See Run Profiling for setup steps, and Investigate Profiling Results for reviewing your results and taking action on hygiene issues.