Data Type and Format Issues

These issues arise when column values don't match the expected type, format, or structural standard. This category also covers the detection of sensitive data patterns.

For an overview of all hygiene issue categories, see Data Hygiene Issues.

Suggested data type

The actual column contents suggest a more appropriate data type than the one defined in the table.


Likelihood	Likely
How it's detected	Flagged when a text column contains values that all meet criteria for a more specific type (such as numeric, date, or boolean).

Storing data in an overly general type (such as text instead of numeric) means values won't sort correctly, type-specific operations may fail, and downstream users may make incorrect assumptions about the data.

Suggested action: Consider changing the column data type to tighten controls over ingested data and to make values more efficient, consistent, and suitable for downstream analysis.
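To illustrate the sorting problem, here is a minimal Python sketch. The sample values are made up for illustration; the point is only that text comparison and numeric comparison order the same values differently.

```python
# Hypothetical sample: numeric values that arrived as text.
values = ["10", "9", "100", "2"]

# Text comparison works character by character, so "10" sorts before "2".
sorted_as_text = sorted(values)

# Comparing by the converted number restores the expected order.
sorted_as_numbers = sorted(values, key=int)

print(sorted_as_text)     # ['10', '100', '2', '9']
print(sorted_as_numbers)  # ['2', '9', '10', '100']
```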

Character column with mostly numeric values

This column is defined as text, but more than 95% of its values are numeric.


Likelihood	Likely
Quality dimension	Validity
How it's detected	Flagged when a text column (excluding zip codes and IDs) has over 95% numeric values.

Numbers in text columns won't sort correctly and might contradict user expectations downstream. It's also possible that more than one type of information is stored in the column, making it harder to retrieve.

Suggested action: Review your source data and ingestion process. Consider whether it might be better to store the numeric data in a numeric column. If the non-numeric data is significant, you could store it in a different column.
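When reviewing a flagged column, it can help to measure the numeric share yourself and list the exceptions. The sketch below is an approximation under stated assumptions: `numeric_fraction` and `NUMERIC_PATTERN` are hypothetical names, the definition of "numeric" is a simple sign-digits-decimal pattern, and the zip code and ID exclusion described above is not modeled.

```python
import re

# Assumed definition of "numeric": optional sign, digits, optional decimal part.
NUMERIC_PATTERN = re.compile(r"[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)")

def numeric_fraction(values):
    """Return the share of non-blank values that look numeric."""
    non_blank = [v.strip() for v in values if v is not None and v.strip()]
    if not non_blank:
        return 0.0
    matches = sum(1 for v in non_blank if NUMERIC_PATTERN.fullmatch(v))
    return matches / len(non_blank)

# Hypothetical column: 98 numeric values and 2 text placeholders.
sample = ["12"] * 97 + ["7.5", "N/A", "unknown"]

share = numeric_fraction(sample)
is_flagged = share > 0.95  # threshold taken from the description above

# The exceptions are the values to inspect before changing the column type.
non_numeric = [v for v in sample if not NUMERIC_PATTERN.fullmatch(v.strip())]
```

Listing the exceptions first shows whether they are placeholders for missing data or a second kind of information that deserves its own column.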

Character column with mostly date values

This column is defined as text, but more than 95% of its values are dates.


Likelihood	Likely
Quality dimension	Validity
How it's detected	Flagged when a text column has over 95% date values.

Dates in text columns might not sort correctly and might contradict user expectations downstream. It's also possible that more than one type of information is stored in the column, making it harder to retrieve.

Suggested action: Review your source data and ingestion process. Consider whether it might be better to store the date values as a date or datetime column. If the non-date data is also significant, you could store it in a different column.
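One way to check how many text values would survive a conversion to a date type is to try parsing them. This is a sketch only: the formats in `ASSUMED_FORMATS` are assumptions about the source data, and `parse_date` and `date_fraction` are hypothetical helper names, not part of any product API.

```python
from datetime import datetime

# Assumed input formats; adjust to whatever the source system actually emits.
ASSUMED_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S")

def parse_date(text):
    """Return a datetime if the text matches an assumed format, else None."""
    for fmt in ASSUMED_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

def date_fraction(values):
    """Return the share of non-blank values that parse as dates."""
    non_blank = [v for v in values if v and v.strip()]
    if not non_blank:
        return 0.0
    return sum(parse_date(v) is not None for v in non_blank) / len(non_blank)

# Hypothetical sample: two valid dates, one impossible date, one placeholder.
sample = ["2024-01-15", "03/15/2024", "2023-02-30", "TBD"]
share = date_fraction(sample)
```

Because `strptime` rejects impossible dates such as February 30, the same pass also surfaces values that look like dates but would fail conversion.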

Character column with numbers and units

This column is defined as text, but values include numbers with percents or common units (such as lb, kg, km, ft).


Likelihood	Possible
Quality dimension	Consistency
How it's detected	Flagged when over 50% of values contain digits and the most frequent value matches a number-with-unit pattern.

Embedded measures in text columns are harder to access, won't sort correctly, and might contradict user expectations downstream.

Suggested action: Review your source data and ingestion process. Consider whether it might be better to parse the numeric and unit data and store them in separate columns.
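Splitting a measure into a number and a unit could look like the following sketch. The unit list mirrors the examples given above and is not exhaustive; `UNIT_PATTERN` and `split_measure` are hypothetical names.

```python
import re

# Number followed by a percent sign or one of the example units (assumed list).
UNIT_PATTERN = re.compile(
    r"\s*([+-]?[0-9]+(?:\.[0-9]+)?)\s*(%|lb|kg|km|ft)\s*",
    re.IGNORECASE,
)

def split_measure(text):
    """Return (number, unit) for values like '12.5 kg', or None if no match."""
    match = UNIT_PATTERN.fullmatch(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()

print(split_measure("12.5 kg"))  # (12.5, 'kg')
print(split_measure("80%"))      # (80.0, '%')
print(split_measure("heavy"))    # None
```

Values that return `None` are worth reviewing separately, since they may use a unit outside the assumed list rather than being invalid.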

Delimited data embedded in column

Delimited data, separated by a common delimiter (comma, tab, pipe, or caret), is present in over 80% of column values.


Likelihood	Likely
Quality dimension	Validity
How it's detected	Flagged when profiling identifies a standard delimited-data pattern in the column.

This could indicate data that was incorrectly ingested or data that would be better represented in parsed form. Embedded delimited data is difficult to query and analyze directly.

Suggested action: Review your source data and follow up with data consumers to determine the most useful representation of this data.
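To get a rough sense of which delimiter dominates a flagged column, you could count how many values contain each candidate. This is a simplified stand-in, not the profiler's pattern, which is not specified here; `DELIMITERS` and `delimiter_shares` are hypothetical names.

```python
# Candidate delimiters named in the description above.
DELIMITERS = {"comma": ",", "tab": "\t", "pipe": "|", "caret": "^"}

def delimiter_shares(values):
    """Return, per delimiter, the share of non-empty values containing it."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return {}
    return {
        name: sum(1 for v in non_empty if char in v) / len(non_empty)
        for name, char in DELIMITERS.items()
    }

# Hypothetical column: nine pipe-delimited values and one plain value.
sample = ["a|b|c"] * 9 + ["plain"]
shares = delimiter_shares(sample)

# Delimiters present in more than 80% of values, per the threshold above.
dominant = [name for name, share in shares.items() if share > 0.8]

# Once the delimiter is known, each value can be split into its parts.
parts = sample[0].split(DELIMITERS["pipe"])
```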

Leading spaces

Leading spaces were found at the start of string values in this column.


Likelihood	Likely
Quality dimension	Validity
How it's detected	Profiling counts values with leading whitespace. Flagged when any are found.

This likely contradicts user expectations and could be a sign of broader ingestion or processing errors. Leading spaces cause silent failures in joins, lookups, and string comparisons — two values that look identical may not match if one has a hidden leading space.

Suggested action: Review your source data, ingestion process, and any processing steps that update this column. Consider trimming whitespace during ingestion.
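The silent lookup failure is easy to reproduce. In the sketch below, the lookup table and the raw codes are made-up sample data.

```python
# Hypothetical lookup table and raw column values.
state_names = {"NY": "New York", "CA": "California"}
raw_codes = [" NY", "CA", "  CA"]

# Count values that change when leading whitespace is removed.
leading_count = sum(1 for code in raw_codes if code != code.lstrip())

# Untrimmed keys miss the lookup without raising any error.
lookups_raw = [state_names.get(code) for code in raw_codes]

# Trimming before the lookup resolves every value.
lookups_trimmed = [state_names.get(code.lstrip()) for code in raw_codes]

print(leading_count)    # 2
print(lookups_raw)      # [None, 'California', None]
print(lookups_trimmed)  # ['New York', 'California', 'California']
```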

Quoted values

Column values were found within quotes.


Likelihood	Likely
Quality dimension	Validity
How it's detected	Profiling counts values enclosed in quotation marks. Flagged when any are found.

Embedded quotes typically contradict user expectations and could be a sign of broader ingestion or processing errors. For example, a CSV import might not have stripped the quotation marks that enclose fields.

Suggested action: Review your source data, ingestion process, and any processing steps that update this column.
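A common way for quotation marks to end up in stored values is splitting CSV lines on commas instead of using a CSV parser. The sketch below contrasts the two using Python's standard `csv` module; the input line is made-up sample data.

```python
import csv
import io

# Hypothetical CSV line with a quoted field that contains a comma.
raw_line = '"Acme, Inc.",42\n'

# A naive split keeps the quotation marks and breaks on the embedded comma.
naive_fields = raw_line.strip().split(",")

# csv.reader treats the quotation marks as field enclosures and removes them.
parsed_fields = next(csv.reader(io.StringIO(raw_line)))

print(naive_fields)   # ['"Acme', ' Inc."', '42']
print(parsed_fields)  # ['Acme, Inc.', '42']
```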

Non-printing characters

Non-printing characters were found embedded in a text column.


Likelihood	Definite
Quality dimension	Validity
How it's detected	Profiling scans for non-printing ASCII characters in text values. Flagged when any are found.

Embedded non-printing characters affect filters and aggregations and may cause problems for downstream users who don't recognize their presence. These characters are typically invisible in query results and spreadsheet views.

Suggested action: Review your source data and follow up with data owners to determine whether the data can be corrected upstream. If not, strip non-printing characters from processed data during ingestion.
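Stripping these characters during ingestion could be sketched as follows. The character class is an assumption: it covers the ASCII control range (0x00 to 0x1F) and DEL (0x7F), which also removes tabs and newlines, so narrow it if those are legitimate in your data. `NON_PRINTING`, `has_non_printing`, and `strip_non_printing` are hypothetical names.

```python
import re

# Assumed definition: ASCII control characters plus DEL.
NON_PRINTING = re.compile(r"[\x00-\x1f\x7f]")

def has_non_printing(text):
    """Return True if the text contains any assumed non-printing character."""
    return NON_PRINTING.search(text) is not None

def strip_non_printing(text):
    """Remove all assumed non-printing characters from the text."""
    return NON_PRINTING.sub("", text)

dirty = "Acme\x00 Corp\x07"
clean = strip_non_printing(dirty)

# The two strings look alike in most viewers but do not compare equal.
print(dirty == "Acme Corp")  # False
print(clean == "Acme Corp")  # True
```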

Non-alpha name or address

Entirely non-alphabetic values were found in a column representing an entity name or address element.


Likelihood	Definite
Quality dimension	Validity
How it's detected	Flagged when a column identified as a name or address (e.g., person name, city, entity name) contains values with no alphabetic characters.

Non-alphabetic values are highly likely to be invalid for this kind of column. This may indicate a file format change, an error in an ingestion process, or incorrect source data.

Suggested action: Review your pipeline process and source data to determine the root cause. If upstream corrections are not possible, consider setting the processed value to null to reflect that the data is missing.
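If nulling out invalid values in processed data is the chosen remedy, a minimal sketch might look like this. `has_no_letters` and `null_out_non_alpha` are hypothetical names, and `None` stands in for a database null.

```python
def has_no_letters(text):
    """Return True if the text contains no alphabetic characters at all."""
    return not any(ch.isalpha() for ch in text)

def null_out_non_alpha(values):
    """Replace values without any letters with None; keep everything else."""
    return [
        None if value is not None and has_no_letters(value) else value
        for value in values
    ]

# Hypothetical city column with two invalid entries.
cities = ["Springfield", "12345", "---", "São Paulo", None]
cleaned = null_out_non_alpha(cities)

print(cleaned)  # ['Springfield', None, None, 'São Paulo', None]
```

`str.isalpha` recognizes non-ASCII letters, so names such as "São Paulo" are kept.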

Non-alpha prefixed name

Non-alphabetic characters were found at the start of a column representing an entity name.


Likelihood	Definite
Quality dimension	Validity
How it's detected	Flagged when a column identified as a person or city name has values starting with a non-alphabetic character (excluding quotes and spaces, which are caught by other checks).

Values starting with a non-alphabetic character are highly likely to be invalid for this kind of column. This could also indicate flagging or coding that could be broken out into a separate column in processed data.

Suggested action: Review your pipeline process and source data to determine the root cause. If upstream corrections are not possible, consider applying corrections directly to processed data where possible.
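To find the affected values and see which prefix characters occur, a sketch along these lines may help. `starts_with_non_alpha` and `EXCLUDED_FIRST_CHARS` are hypothetical names; the exclusion of quotes and spaces follows the detection description above.

```python
from collections import Counter

# Quotes and spaces are left to the checks dedicated to them.
EXCLUDED_FIRST_CHARS = {'"', "'", " "}

def starts_with_non_alpha(text):
    """Return True if the first character is neither a letter nor excluded."""
    if not text:
        return False
    first = text[0]
    if first in EXCLUDED_FIRST_CHARS:
        return False
    return not first.isalpha()

# Hypothetical name column.
names = ["*Smith", "Jones", " Lee", "#Garcia", "*Chen"]
flagged = [name for name in names if starts_with_non_alpha(name)]

# Counting prefixes can reveal a coding scheme that belongs in its own column.
prefix_counts = Counter(name[0] for name in flagged)
```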

Inconsistent casing

Mixed-case and all-upper-case values were found in the same column representing an entity name or address element.


Likelihood	Definite
Quality dimension	Validity
How it's detected	Flagged when a column identified as a name or address contains both mixed-case and upper-case values.

Inconsistent casing can cause failures in case-sensitive joins and lookups, and may confuse downstream users expecting a consistent format.

Suggested action: Review your source data and follow up with data owners to determine whether consistent casing should be applied at the source. If source data corrections are not possible, consider standardizing the column upon ingestion.
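The sketch below shows one way to classify casing and why mixed casing breaks exact matching. `casing_kind` is a hypothetical helper, and the names are made-up sample data. Comparing with `str.casefold` is one option for case-insensitive matching; whether to store upper case, title case, or the original form is a separate decision for data owners.

```python
def casing_kind(text):
    """Classify text as 'upper', 'lower', or 'mixed'; None if it has no letters."""
    if not any(ch.isalpha() for ch in text):
        return None
    if text.isupper():
        return "upper"
    if text.islower():
        return "lower"
    return "mixed"

# Hypothetical entity name column.
names = ["ACME CORP", "Acme Corp", "Globex"]
kinds = {casing_kind(name) for name in names} - {None}

# The column mixes all-upper-case and mixed-case values.
is_inconsistent = {"upper", "mixed"} <= kinds

# Exact comparison fails; comparing case-folded values succeeds.
exact_match = "ACME CORP" == "Acme Corp"
folded_match = "ACME CORP".casefold() == "Acme Corp".casefold()
```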

Invalid USA zip code format

Some values do not conform to the expected 5-digit or 9-digit USA zip code format.


Likelihood	Definite
Quality dimension	Validity
How it's detected	Flagged when a column identified as a zip code contains values that do not match the standard `NNNNN` or `NNNNN-NNNN` patterns, or when the column is stored as a numeric type instead of text.

Invalid zip code formats can cause errors in geographic analysis, address validation, and shipping calculations. A numeric column type is also flagged because leading zeros are lost in numeric storage (e.g., zip code 01234 becomes 1234).

Suggested action: Consider correcting invalid column values or changing them to indicate a missing value if corrections cannot be made. If the column is stored as a numeric type, consider converting it to text.
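Both parts of this check can be illustrated briefly: matching the `NNNNN` and `NNNNN-NNNN` patterns, and the loss of leading zeros in numeric storage. `ZIP_PATTERN` and `is_valid_zip` are hypothetical names. Padding with `zfill(5)` is only a reasonable repair when the column is known to hold 5-digit codes.

```python
import re

# Five digits, optionally followed by a hyphen and four more digits.
ZIP_PATTERN = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")

def is_valid_zip(text):
    """Return True for values in NNNNN or NNNNN-NNNN form."""
    return ZIP_PATTERN.fullmatch(text) is not None

print(is_valid_zip("01234"))       # True
print(is_valid_zip("12345-6789"))  # True
print(is_valid_zip("1234"))        # False

# Numeric storage drops the leading zero.
as_number = int("01234")
print(as_number)                   # 1234

# Assuming a 5-digit code, left-padding with zeros restores the text form.
restored = str(as_number).zfill(5)
print(restored)                    # 01234
```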

Invalid USA ZIP-3 format

The majority of values in this column are 3-digit US regional zip prefixes, but divergent patterns were found.


Likelihood	Definite
Quality dimension	Validity
How it's detected	Flagged when a column whose name suggests a zip or postal code has a majority of 3-digit numeric values but also contains values in other patterns.

This could indicate an incorrect roll-up category or mixed data in the column. Depending on your needs and regulatory requirements, longer zip codes in a ZIP-3 column could also represent a PII concern.

Suggested action: Review your source data, ingestion process, and any processing steps that update this column.
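When reviewing a flagged ZIP-3 column, summarizing values by their character pattern makes the divergent entries easy to spot. This is a sketch under assumptions: `value_shape` is a hypothetical helper that maps digits to `N` and letters to `A`, and the sample values are made up.

```python
from collections import Counter

def value_shape(text):
    """Map each digit to 'N' and each letter to 'A'; keep other characters."""
    return "".join(
        "N" if ch.isdigit() else "A" if ch.isalpha() else ch
        for ch in text
    )

# Hypothetical ZIP-3 column with a full zip code and a placeholder mixed in.
values = ["100", "945", "021", "94110", "N/A"]

shape_counts = Counter(value_shape(v) for v in values)

# Anything that is not exactly three digits diverges from the ZIP-3 pattern.
divergent = [v for v in values if value_shape(v) != "NNN"]

print(shape_counts.most_common(1))  # [('NNN', 3)]
print(divergent)                    # ['94110', 'N/A']
```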

Personally identifiable information

This column contains data that could be personally identifiable information (PII).


Likelihood	Varies by PII type (high, moderate, or low risk)
Quality dimension	Validity
How it's detected	Flagged when profiling detects patterns matching known PII categories — such as SSNs, credit card numbers, email addresses, phone numbers, national IDs, and other sensitive data — through column name and data pattern analysis.

PII may require steps to ensure data security and compliance with relevant privacy regulations and legal requirements. Note that PII that is lower-risk in isolation might be high-risk in conjunction with other data.

Suggested action: Classify and inventory PII, implement appropriate access controls, encrypt data, and monitor for unauthorized access. Your organization might be required to update privacy policies and train staff on data protection practices. See Flag PII columns for how TestGen manages PII flags and masking.
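The general idea of combining column-name hints with data patterns can be sketched as below. This is illustrative only and does not reproduce TestGen's detection rules: the two regular expressions are deliberately simplified, and `PII_PATTERNS`, `NAME_HINTS`, and `pii_signals` are hypothetical names.

```python
import re

# Simplified, assumed patterns; real detection needs far more care.
PII_PATTERNS = {
    "ssn": re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}

# Assumed substrings that suggest PII when they appear in a column name.
NAME_HINTS = ("ssn", "email", "phone")

def pii_signals(column_name, values):
    """Return the set of name-based and pattern-based PII signals found."""
    signals = set()
    lowered = column_name.lower()
    for hint in NAME_HINTS:
        if hint in lowered:
            signals.add(f"name:{hint}")
    for label, pattern in PII_PATTERNS.items():
        if any(pattern.fullmatch(value) for value in values):
            signals.add(f"pattern:{label}")
    return signals

# Hypothetical columns.
email_signals = pii_signals("contact_email", ["a@example.com", "n/a"])
notes_signals = pii_signals("notes", ["123-45-6789"])
city_signals = pii_signals("city", ["Boston"])
```

Keeping the name-based and pattern-based signals separate makes it possible to treat a column as higher risk when both agree, and to review it manually when only one is present.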