Skip to content

Data Profiling

Use data profiling to discover the structure, content, and quality characteristics of tables in your database. Profiling results populate the Data Catalog, surface hygiene issues, and serve as the foundation for automated test generation.

What profiling discovers

During a profiling run, TestGen scans the tables and columns in a table group and captures 55 column-level characteristics for each column. These characteristics fall into several categories:

  • Data types and structure — The declared column type versus the actual content. TestGen suggests more appropriate data types when the data doesn't match the declared type (for example, a VARCHAR column that contains only dates).
  • Value distributions — Record counts, null counts, distinct value counts, and the most frequent values. These reveal whether columns are populated as expected and whether the data has the expected cardinality.
  • Statistical measures — For numeric columns: min, max, average, standard deviation, and percentiles. For text columns: min and max lengths, pattern analysis, and character content.
  • Date analysis — Date ranges, recency distribution, and coverage across days, weeks, and months.
  • Content detection — Pattern matching to identify data that may contain PII (emails, phone numbers, SSNs), embedded delimiters, quoted values, or mixed data types within a single column.

Hygiene issues

After collecting column statistics, the system analyzes the results to detect 32 types of data hygiene issues — structural inconsistencies and content anomalies that may indicate data quality problems. Examples include columns with non-standard blank values, inconsistent patterns across tables, data type mismatches, and potential PII in unexpected locations.

Each hygiene issue is assigned a likelihood rating (Possible, Likely, or Definite) and comes with a suggested action to guide your investigation.

How profiling supports your data quality workflow

Profiling is the starting point for several downstream features in TestGen:

  • Test generation — TestGen uses profiling results to auto-generate data quality tests tailored to the actual content and patterns in your columns.
  • Data Catalog — Profiling results populate the Data Catalog, giving you a hierarchical view of your table groups, tables, and columns with their statistics and quality indicators.
  • Freshness monitorsFreshness monitors are auto-generated from profiling data. The profiling process identifies appropriate columns to use for fingerprint comparison, which is how Freshness monitors detect whether a table has been updated.
  • Quality scores — Hygiene issue results contribute to the data quality scores visible on the Quality Dashboard.

When to run profiling

Profiling queries each column of every table in the table group individually, so the duration depends on the number of tables and columns, record counts, and database compute resources. Profiling is not designed to be part of your daily workflow. Best practice is to run profiling on a biweekly or monthly schedule during off-hours.

Running profiling periodically serves as an audit trail for data changes over time, keeps test definitions current with the realities of your database, and ensures that hygiene issues are detected as your data evolves.

Getting started

To run profiling, you need a table group that defines which tables and columns to scan. See Run Profiling for setup steps, and Investigate Profiling Results for reviewing your results and taking action on hygiene issues.