Hello ,
Here’s a scenario that happens more often than anyone admits:
You build a report. Your director uses it in a board presentation. A week later, someone notices the customer count includes 58 records with obviously fake email addresses like `evan.clarke743@noemail`.
The report wasn’t wrong. The data underneath it was dirty. And now your credibility takes a hit — not because of your SQL skills, but because you didn’t check.
Data quality auditing is one of the most practical skills you can develop. It takes 10 minutes, it saves you from embarrassing situations, and the same patterns work in every database you’ll ever touch.
The Audit: Four Checks You Can Run Today
I ran these against the Summit Adventures database (the fake adventure tourism company I created to help people learn business analytics). This database has intentional data quality issues built in — about 15% of records have some kind of anomaly — specifically so you can practice finding them.
Check 1: Invalid Email Addresses
-- Find emails that don't follow standard format
SELECT
email,
first_name || ' ' || last_name AS customer_name
FROM customers
WHERE email LIKE '%@noemail'
OR email NOT LIKE '%@%.%'
OR email LIKE '%@%@%'
OR email LIKE '% %'
LIMIT 10;
Results:
| Email | Customer |
|——-|———-|
| evan.clarke743@noemail | Evan Clarke |
| steven.grenier862@noemail | Steven Grenier |
| stefani.oneill982@noemail | Stefani O’Neill |
| hannah.toussaint391@noemail | Hannah Toussaint |
| angela.lopez603@noemail | Angela Lopez |
| blake.gonzalez671@noemail | Blake Gonzalez |
| christelle.vallée790@noemail | Christelle Vallée |
| alice.langern125@noemail | Alice Langern |
| mercedes.rodriguez650@noemail | Mercedes Rodriguez |
| jeremy.paffrath225@noemail | Jeremy Paffrath |
These are records where no valid email was provided. In a real company, these customers are unreachable by email — which means any marketing campaign that targets them is wasting resources.
Check 2: The Full Audit Summary
Here’s the powerful part. One query that summarizes all major data quality issues:
-- Complete data quality audit summary
SELECT
'Invalid emails' AS issue_type,
COUNT(*) AS issue_count,
ROUND(
COUNT(*)::numeric / (SELECT COUNT(*) FROM customers) * 100, 1
) AS pct_affected
FROM customers
WHERE email LIKE '%@noemail'
UNION ALL
SELECT
'Placeholder phones' AS issue_type,
COUNT(*) AS issue_count,
ROUND(
COUNT(*)::numeric / (SELECT COUNT(*) FROM customers) * 100, 1
) AS pct_affected
FROM customers
WHERE phone LIKE '000-%' OR phone LIKE '555-%'
UNION ALL
SELECT
'Unrealistic ages' AS issue_type,
COUNT(*) AS issue_count,
ROUND(
COUNT(*)::numeric / (SELECT COUNT(*) FROM customers) * 100, 1
) AS pct_affected
FROM customers
WHERE EXTRACT(YEAR FROM AGE(date_of_birth)) < 18
OR EXTRACT(YEAR FROM AGE(date_of_birth)) > 100
UNION ALL
SELECT
'Missing dietary info' AS issue_type,
COUNT(*) AS issue_count,
ROUND(
COUNT(*)::numeric / (SELECT COUNT(*) FROM customers) * 100, 1
) AS pct_affected
FROM customers
WHERE dietary_restrictions IS NULL
ORDER BY pct_affected DESC;
Results:
| Issue Type | Count | % Affected |
|———–|——-|————|
| Missing dietary info | 782 | 78.2% |
| Invalid emails | 58 | 5.8% |
| Placeholder phones | 35 | 3.5% |
| Unrealistic ages | 16 | 1.6% |
This is your data quality scorecard. In one glance, you know:
78.2% missing dietary info — This is an optional field, not necessarily a quality issue. But operations should know that dietary data is only available for 218 out of 1,000 customers.
5.8% invalid emails — 58 customers are unreachable. Marketing should exclude these from campaigns.
3.5% placeholder phones — 35 customers have fake phone numbers (000- and 555- prefixes). Not useful for SMS campaigns.
1.6% unrealistic ages — 16 records with ages under 18 or over 100. Likely data entry errors.
How to Build Your Own Audit Query
The pattern is the same for any database:
-- Template: Data quality audit
SELECT 'Description of issue' AS issue_type,
COUNT(*) AS issue_count,
ROUND(
COUNT(*)::numeric / (SELECT COUNT(*) FROM your_table) * 100, 1
) AS pct_affected
FROM your_table
WHERE [condition that identifies bad data]
UNION ALL
SELECT 'Next issue' AS issue_type,
COUNT(*),
ROUND(COUNT(*)::numeric / (SELECT COUNT(*) FROM your_table) * 100, 1)
FROM your_table
WHERE [next condition]
ORDER BY pct_affected DESC;
Each `UNION ALL` block checks for one type of issue. The `ORDER BY` at the end sorts by severity so you see the biggest problems first.
What to Check in Any Database
Here’s a checklist you can adapt:
Contact information:
Emails without @ and domain → `WHERE email NOT LIKE ‘%@%.%’`
Phone numbers with placeholder prefixes → `WHERE phone LIKE ‘000-%’`
Missing required fields → `WHERE field IS NULL`
Dates and ages:
Future dates → `WHERE signup_date > CURRENT_DATE`
Unrealistic ages → `WHERE age < 0 OR age > 120`
Dates before the company existed → `WHERE created_at < '2020-01-01'`
Financial data:
Negative amounts → `WHERE amount < 0`
Zero-dollar transactions → `WHERE amount = 0`
Suspiciously large amounts → `WHERE amount > 100000`
Duplicates:
Same email, different names → `GROUP BY email HAVING COUNT(*) > 1`
Same phone, different customers → `GROUP BY phone HAVING COUNT(*) > 1`
When to Run Audits
Before any analysis. Five minutes of checking saves you from presenting conclusions based on dirty data.
Before sending reports to executives. A 3-row audit summary at the bottom of your report (“Note: 58 records excluded due to invalid email addresses”) shows professionalism and builds trust.
After any data import or migration. New data loads are the most common source of quality problems.
Common Mistakes to Avoid
Mistake 1: Fixing data silently
Don’t just exclude bad records without mentioning it. Document what you found and what you excluded. Transparency builds more trust than a “clean” report that hides problems.
Mistake 2: Treating all NULLs as errors
Missing dietary restrictions isn’t a data quality issue — it’s an optional field. Missing email addresses on records that need email outreach IS a quality issue. Context matters.
Mistake 3: Auditing once and never again
Data quality degrades over time as new records come in. Build your audit as a saved query you can re-run monthly.
Try This Monday Morning
Before your next analysis:
1. Pick your most-used table
2. Write three `WHERE` conditions that identify suspicious records
3. Use the `UNION ALL` pattern to create a one-query audit summary
4. Add a note to your report documenting what you found
It takes less than 15 minutes. And the first time it catches something before your boss does, you’ll never skip it again.
Until next time,
Brian ([say hi on twitter!](https://twitter.com/briangraves))
—
P.S. Data quality thinking is woven throughout SQL for Business Impact. Every module uses real-world data with intentional messiness — because that’s what you’ll face at work. The course teaches you to handle it confidently. Check it out at [sqlforbusinessimpact.com](https://sqlforbusinessimpact.com).
P.P.S. What’s the worst data quality surprise you’ve found at work? Hit reply — I collect these stories because they’re genuinely fascinating. I read every response.
Want More SQL Insights Like This?
Join thousands of analysts getting weekly tips on turning data into business impact.
