Profiling Characteristics¶
This reference lists the 55 column-level characteristics that TestGen captures during a profiling run. Use it to understand what a specific statistic measures and how it can help identify data quality concerns.
| Column characteristic | Description | Use case example |
|---|---|---|
| Average embedded spaces | The average number of embedded spaces in string values. | Identify format changes and potential column parse errors in string values. |
| Average length | The average length of data in the alpha column. | Change in data representation. |
| Average value | The average numeric value in the column. | Shift in the expected average. |
| Before 1yr date count | The count of records with date values earlier than one year before the current date. | Identify transactional versus historical dates (for example, ship dates versus birthdays). |
| Before 5yr date count | The count of records with date values earlier than five years before the current date. | Similar to the 1yr but with a wider timeframe. |
| Before 20yr date count | The count of records with date values earlier than twenty years before the current date. | Similar to the 1yr but with a wider timeframe. |
| Boolean True count | Count of records defined as boolean that equates to the True value. | The expected ratio of boolean True. |
| Column name | The name of the column being profiled. | |
| Column type | The data type as defined in the table. | |
| Data type suggestion | Automatically derived data type based on the data present. | Refine data types for better ingestion constraints. |
| Date count | Count of records with valid date values in the alpha column. | Confirm date type and expected ratio. |
| Date days present | Count of distinct days (YYYY-MM-DD) present in date values in a column. | Identify changes in date coverage by day. |
| Date months present | Count of distinct months (YYYY-MM) present in date values in a column. | Identify changes in date coverage by month. |
| Date weeks present | Count of distinct weeks (YYYY-WW) present in date values in a column. | Identify changes in date coverage by week. |
| Distinct pattern count | Count of distinct patterns in string data, with alphas shown as A and numeric characters as N. | Identify columns with predefined formats. |
| Distinct standard value count | Count of distinct values in string data transformed to upper case with spaces dropped. | Identify potential duplicates formatted differently. |
| Distinct value count | Investigate distinct values present in the column. | Identify columns with potential domain or coded values, standard categories. |
| Distinct value hash | The hash of distinct values present in a column. | Identify changes in distinct values present. |
| Embedded space count | Count of records with embedded spaces in string values. | Identify format changes and potential column parse errors in string values. |
| Filled value count | Count of records with miscellaneous potential representations of blank value in the alpha column, exclusive of other potential blank counts. | Consistent blank value representation. |
| Fractional sum | The sum of fractional values in a column | Assess decimal truncation. |
| Functional data type | A more detailed data type category. | Reporting and helps in identifying quality control testing. |
| Functional table type | A more detailed table type that goes beyond SQL types. | Reporting and helps in identifying quality control testing. |
| Future date count | Count of records with date values beyond the current date. | Identify the ratio of scheduled future date values. |
| General type | Broad data type: (A)lpha, (N)umeric, (D)ate, (B)oolean. | A broad category for creating test types. |
| Includes digit count | Count of records with values that contain at least one digit (0-9) in an alpha column. | Classify columns with mixed alphanumeric content, such as IDs and codes. |
| Lead space count | Count of records with leading space in alpha column. | Potential data parsing errors. |
| Maximum date | The maximum date value in a column. | Expected range of values. |
| Maximum length | The maximum length of data in an alpha column. | Change in data reporting. |
| Maximum text | The maximum text value in a column. | Expected range of values. |
| Maximum value | The maximum numeric value in a column. | Expected range of values. |
| Minimum date | The minimum date value in a column. | Expected range of values. |
| Minimum length | The minimum length of data in an alpha column. | Change in data representation. |
| Minimum text | The minimum text value in a column. | Expected range of values. |
| Minimum value | The minimum numeric value in a column | Expected range of values. |
| Minimum value Gr0 | The minimum numeric value is greater than zero in a column. | Expected range of values. |
| Null value count | Count of null values in a column, exclusive of other potential blank counts. | Consistent blank value representation and the ratio of null values. |
| Numeric count | Count of records with valid numeric values in the alpha column. | Confirm data type and count. |
| Percentile 25 | For numeric values. The boundary of the 25th percentile. | Confirm consistent distribution and outlier detection. |
| Percentile 50 | For numeric values. The boundary of the 50th percentile. | Confirm consistent distribution and outlier detection. |
| Percentile 75 | For numeric values. The boundary of the 75th percentile. | Confirm consistent distribution and outlier detection. |
| PII flag | A classification flag identifying Personally Identifiable Information by pattern, column name, and data characteristics. Categorized as high risk (passwords, SSNs, tax IDs), moderate risk (names, addresses, phone numbers, emails), or low risk (user IDs, routing numbers). | Identify sensitive data that may require security controls and privacy compliance. |
| Position | The ordinal position of a column in a table. | Data consistency. |
| Quoted value count | Count of records with values enclosed in quotes (single or double) in an alpha column. | Potential ingestion or processing errors where values were improperly escaped. |
| Record count | Total count of records in a table. | Record keeping and denominator for different ratios. |
| Standard deviation value | The standard deviation of numeric values in a column. | For calculating outlying values per standard deviation from the average. |
| Standard pattern match | Tag column contents based on pre-defined REGEX pattern matches for 90+% of values. | Detailed info about column contents. |
| Table name | Name of the table being profiled. | Data consistency. |
| Top frequent values | The top five most frequent values in an alpha column. | Change in most frequent values. |
| Top patterns | The top 10 most frequent patterns in alpha columns. | Consistent formatting (for example, phone numbers). |
| Value count | Count of records with values present in a column. | Shift in the ratio of missing value. |
| Within 1mo date count | Count of records with date values within one month before the current date. | Expected recency of date values. |
| Within 1yr date count | Count of records with date values within one year of the current date. | Expected recency of date values. |
| Zero length count | Count of records with zero length (blank) in an alpha column, exclusive of other potential blank counts. | Consistent blank value representation. |
| Zero value count | Count of records with zero values in alpha or numeric column, exclusive of other potential blank counts. | Consistent blank value representation and the ratio of zero values. |