Data Hygiene Issues¶

This reference covers the 32 data hygiene issues that TestGen detects during a profiling run. Hygiene issues are structural inconsistencies and content anomalies — such as data type mismatches, non-standard blank values, and potential PII — that may indicate data quality problems. Use this page to understand what a specific issue means and what action to consider.

To review and act on hygiene issues detected in your data, see Investigate Profiling Results.

Likelihood scale¶

Each hygiene issue is assigned a likelihood rating that indicates how strongly the detected pattern suggests an actual data problem:

Possible — A speculative finding that often indicates problems but may be benign.
Likely — Typically indicates a data problem.
Definite — Indicates a data problem.

Hygiene issue details¶

Hygiene issue name	Description	Suggested action
Suggested data type	The actual column contents suggest a more appropriate data type than the one defined in the table.	Consider changing the column data type to tighten controls over data ingested in the column and to make downstream analysis easier.
Non-standard blank values	Values representing missing data may be unexpected or inconsistent. Non-standard values may include empty strings as opposed to nulls; dummy entries such as MISSING or repeated characters that may have been used to bypass entry requirements; processing artifacts such as NULL; or spreadsheet artifacts such as NA or ERROR.	Consider cleansing the column upon ingestion to replace all variants of missing data with a standard designation, like NULL.
Invalid USA zip code format	Some values present do not conform with the expected format of USA Zip Codes.	Consider correcting invalid column values or changing them to indicate a missing value if corrections cannot be made.
Multiple data types per column name (minor)	Columns with the same name have the same general type across tables, but the types do not exactly match. Truncation issues may result if columns are commingled and assumed to be the same format.	Consider changing the column data types to be fully consistent. This will tighten your standards at ingestion and ensure that data is consistent between tables.
Multiple data types per column name (major)	Columns with the same name have different types across tables. Differences could be significant enough to cause errors in downstream analysis, extra steps resulting in divergent business logic, and inconsistencies in results.	Ideally, you should change the column data types to be fully consistent. If the data is meant to be different, you should change column names so downstream users aren't confused.
No column values present	This column is present in the table, but no values have been ingested or assigned in any records. This could indicate missing data or a processing error. Note that this considers dummy values and zero-length values as missing data.	Review your source data, ingestion process, and any processing steps that update this column.
Pattern inconsistency within column	Alpha-numeric string data within this column conforms to two to four different patterns, with 95% matching the first pattern. This could indicate data errors in the remaining values.	Review the values for any data that doesn't conform to the most common pattern and correct any data errors.
Pattern inconsistency across tables	Alpha-numeric string data within this column matches a single pattern, but other columns with the same name have data that matches a different single pattern. Inconsistent formatting may contradict user assumptions and cause downstream errors, extra steps, and inconsistent business logic.	Review the profiled patterns for the same column in other tables. You may want to add a step in your processing to make patterns consistent.
Leading spaces found in column values	Spaces were found before data at the front of column string values. This likely contradicts user expectations and could be a sign of broader ingestion or processing errors.	Review your source data, ingestion process, and any processing steps that update this column.
Quoted values found in column values	Column values were found within quotes. This likely contradicts user expectations and could be a sign of broader ingestion or processing errors.	Review your source data, ingestion process, and any processing steps that update this column.
Character column with mostly numeric values	This column is defined as alpha, but more than 95% of its values are numeric. Numbers in alpha columns won't sort correctly and might contradict user expectations downstream. It's also possible that more than one type of information is stored in the column, making it harder to retrieve.	Review your source data and ingestion process. Consider whether it might be better to store the numeric data in a numeric column. If the alpha data is significant, you could store it in a different column.
Character column with mostly date values	This column is defined as alpha, but more than 95% of its values are dates. Dates in alpha columns might not sort correctly and might contradict user expectations downstream. It's also possible that more than one type of information is stored in the column, making it harder to retrieve.	Review your source data and ingestion process. Consider whether it might be better to store the date values as a date or datetime column. If the alpha data is also significant, you could store it in a different column.
Small percentage of missing values found	Under 3% of values in this column were found to be null, zero-length, or dummy values, but values are not universally present. This could indicate unexpected missing values in a required column.	Review your source data and follow up with data owners to determine whether this data needs to be corrected, supplemented, or excluded.
Small percentage of divergent values found	Under 3% of values in this column were found to be different from the most common value. This could indicate a data error.	Review your source data and follow up with data owners to determine whether this data needs to be corrected.
Unexpected Boolean values found	This column appears to contain boolean (True/False) data, but unexpected values were found. This could indicate inconsistent coding for the same intended values, potentially leading to downstream errors or inconsistent business logic.	Review your source data and follow up with data owners to determine whether this data needs to be corrected.
Potential duplicate values found	This column is largely unique, but some duplicate values are present. This pattern is uncommon and could indicate inadvertent duplication.	Review your source data and follow up with data owners to determine whether this data needs to be corrected.
Similar values match when standardized	When column values are standardized (removing spaces, single quotes, periods, and dashes), matching values are found in other records. This may indicate that formats should be further standardized to allow consistent comparisons for merges, joins, and roll-ups. It could also indicate the presence of unintended duplicates.	Review standardized versus raw data values for all matches. Correct data if values should be consistent.
Unlikely dates out of typical range	Some date values in this column are earlier than 1900-01-01 or later than 30 years after the profile date.	Review your source data and follow up with data owners to determine whether this data needs to be corrected or removed.
No table dates within one year	Among all date columns present in the table, none fall inside one year from the profile date.	Review your source data and follow up with data owners to determine whether dates in the table should be more recent.
No table dates within six months	Among all date columns present in the table, the most recent date falls six months to one year back from the profile date.	Review your source data and follow up with data owners to determine whether dates in the table should be more recent.
Unexpected column contains US states	This column is not labeled as a state but contains mostly US State abbreviations. This could indicate shifted or switched source data columns.	Review your source data and follow up with data owners to determine whether the column should be populated with US states.
Unexpected column contains emails	This column is not labeled as email but contains mostly email addresses. This could indicate shifted or switched source data columns.	Review your source data and follow up with data owners to determine whether the column should be populated with email addresses.
Unexpected numeric values found	Under 3% of values in this column were found to be numeric. This could indicate a data error.	Review your source data and follow up with data owners to determine whether numeric values are invalid entries here.
Invalid USA ZIP-3 format	The majority of values in this column are 3-digit United States ZIP, but divergent patterns were found. This could indicate an incorrect roll-up category or a PII concern.	Review your source data, ingestion process, and any processing steps that update this column.
Delimited data embedded in column	Delimited data, separated by a common delimiter (comma, tab, pipe, or caret) is present in over 80% of column values. This could indicate data that was incorrectly ingested or data that would be better represented in parsed form.	Review your source data and follow up with data consumers to determine the most useful representation of this data.
Character column with numbers and units	This column is defined as alpha, but values include numbers with percents or common units. Embedded measures in alpha columns are harder to access, won't sort correctly, and might contradict user expectations downstream.	Review your source data and ingestion process. Consider whether it might be better to parse the numeric and unit data and store in separate columns.
Variant codings for same values	This column contains more than one common variant that represents a single value or state. This can occur when data is integrated from multiple sources with different standards, or when free entry is permitted without validation. The variations can cause confusion and error for downstream data users.	Review your source data and ingestion process. Consider cleansing this data to standardize on a single set of definitive codes.
Personally identifiable information	This column contains data that could be Personally Identifiable Information (PII).	PII may require steps to ensure data security and compliance with relevant privacy regulations and legal requirements. You may have to classify and inventory PII, implement appropriate access controls, encrypt data, and monitor for unauthorized access. Note that PII that is lower-risk in isolation might be high-risk in conjunction with other data.