Everyone loves talking about model architecture, token limits, GPUs, context windows, and agents. But almost no one talks about the most important thing: your AI is only as good as the data you feed it, and most organizations feed it garbage.
I’ve now seen enough enterprise AI projects to say this confidently: 80% of LLM failures are caused not by the model but by the data beneath it. And the irony? Most of these failures are invisible until it’s too late. Let’s unpack why.
The Story: Two Companies, Same AI Model, Very Different
Outcomes
Two real-world scenarios:
Company A
- They invested in a strong LLM.
- Added RAG for enterprise search.
- Connected internal knowledge bases.
But the results were terrible: inaccurate answers, contradictory outputs, outdated facts.
Company B
- They used the exact same model.
- No special fine-tuning.
- No complex GPU stack.
Their system delivered accurate, consistent answers, and employees trusted it.
The difference wasn’t the model. It was the data: Company B had consolidated, curated, and structured its knowledge before the model ever touched it.
What “Bad Data” Actually Means
When we say “bad data,” most people think:
- duplicates
- missing values
- outdated documents
- inconsistent formatting
Sure, those matter. But LLMs break for a deeper reason: they operate on patterns and meaning. Not accuracy. Not truth.
That means:
- contradictory sources → contradictory outputs
- redundant content → hallucinated “averages”
- multiple versions of the same document → uncertainty
- messy file structures → incomplete reasoning
- ambiguous terminology → unpredictable results
LLMs “do their best” with what they have, which is sometimes
the worst thing you can ask them to do.
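To make the “redundant content” and “multiple versions” failure modes concrete, here is a minimal sketch, using only the Python standard library, of how an ingestion pipeline could flag exact and near-duplicate documents before they ever reach a retrieval index. The file names, shingle size, and similarity threshold are illustrative, not a prescription.

```python
import hashlib
from itertools import combinations

def shingles(text, n=3):
    """Word n-grams used as a crude fingerprint of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Overlap between two shingle sets (1.0 means identical)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def find_duplicates(docs, threshold=0.5):
    """Flag exact copies (same hash) and near-copies (high shingle overlap)."""
    exact = {}
    for name, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        exact.setdefault(digest, []).append(name)
    prints = {name: shingles(text) for name, text in docs.items()}
    near = [
        (a, b, round(jaccard(prints[a], prints[b]), 2))
        for a, b in combinations(docs, 2)
        if jaccard(prints[a], prints[b]) >= threshold
    ]
    exact_groups = [names for names in exact.values() if len(names) > 1]
    return exact_groups, near

# Two slightly different versions of the same policy: exactly the kind of
# pair that makes a retriever average two "truths" into one wrong answer.
docs = {
    "leave_policy_v1.txt":    "Employees receive 25 days of annual leave, approved by their manager.",
    "leave_policy_final.txt": "Employees receive 30 days of annual leave, approved by their manager.",
    "expense_guide.txt":      "Submit expenses within 30 days and attach the original receipt.",
}
print(find_duplicates(docs))
# -> ([], [('leave_policy_v1.txt', 'leave_policy_final.txt', 0.5)])
```

A real pipeline would use sturdier similarity measures, but the point stands: catch the competing versions before indexing, because the model will not catch them for you.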
The Hidden Sources of LLM Failure (No One Tells You About)
a. Fragmented Knowledge Spread Across 20 Tools: Companies store knowledge in Confluence, SharePoint, Slack, email, file servers, PDFs, old intranets, team drives, and legacy tools no one maintains.
When your knowledge is scattered, your model becomes
scattered too.
b. Conflicting Documentation: Every company has multiple versions of the truth: the policy on the wiki, the policy in the PDF attachment, and the process people actually follow. An LLM doesn’t know which one is correct, so it picks whichever looks statistically common. This is how AI produces confident lies.
c. Documents Written for Humans, Not Machines: LLMs
struggle with:
- vague policy documents
- meeting notes with no structure
- old process manuals written in prose
- tribal knowledge wrapped in jargon
Humans can infer meaning. LLMs can only infer patterns.
d. Outdated Content That Should’ve Been Deleted Years Ago: If you don’t archive aggressively, your LLM will hallucinate aggressively. Old data doesn’t disappear; it competes with new data.
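As a first, crude line of defense, file age is a useful signal. Below is a small sketch that assumes a local folder of documents and a two-year cutoff (both placeholders) and separates content worth indexing from content that should be reviewed or archived before it competes with current material. In a real system you would also consult owners and review dates, not just timestamps.

```python
from datetime import datetime, timedelta
from pathlib import Path

# Placeholder policy: anything untouched for two years is a candidate for archiving.
MAX_AGE = timedelta(days=2 * 365)

def partition_by_age(root, max_age=MAX_AGE):
    """Split files under `root` into keep/archive buckets by last-modified time."""
    now = datetime.now()
    keep, archive = [], []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        age = now - datetime.fromtimestamp(path.stat().st_mtime)
        (keep if age <= max_age else archive).append((str(path), age.days))
    return keep, archive

keep, archive = partition_by_age("./knowledge_base")   # hypothetical folder
print(f"{len(keep)} files can go into the corpus")
print(f"{len(archive)} files should be reviewed or archived before they compete with current docs")
```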
Want Good AI? Fix Your Data First
The most advanced AI teams I’ve worked with do something counterintuitive: they spend the first 60–70% of the project not on AI but on knowledge engineering.
What does that look like?
1. Canonical truth documents: Define one
authoritative source for each critical topic.
2. Aggressive archiving: If the data shouldn’t be
used today, it shouldn’t exist today.
3. Semantic cleanup: Make documents machine-readable, with clear titles, consistent terminology, and structured layouts.
4. Metadata-first thinking: A file without metadata is a liability (a minimal sketch of this follows the list).
5. Knowledge architecture: Organize information as if a machine will read it, not a human.
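Here is what “metadata-first” can look like in practice. The field names, the one-year freshness cutoff, and the ready_for_indexing gate below are all illustrative assumptions, not a prescribed schema; the idea is that every document carries an owner, a source system, a review date, and a canonical flag, and nothing gets embedded without passing those checks.

```python
from dataclasses import dataclass, asdict
from datetime import date, timedelta

@dataclass
class KnowledgeRecord:
    """One curated document plus the metadata a retriever can filter and rank on."""
    doc_id: str
    title: str
    body: str
    owner: str            # accountable team, so stale content has someone to chase
    source_system: str    # where the canonical copy lives
    last_reviewed: date
    is_canonical: bool    # exactly one canonical document per topic

def ready_for_indexing(record, max_age_days=365):
    """Gate: only owned, recently reviewed, canonical content gets embedded."""
    fresh = (date.today() - record.last_reviewed).days <= max_age_days
    return record.is_canonical and fresh and bool(record.owner)

record = KnowledgeRecord(
    doc_id="hr-leave-001",
    title="Annual Leave Policy",
    body="Employees receive 25 days of annual leave...",
    owner="people-ops",
    source_system="Confluence / HR space",
    last_reviewed=date.today() - timedelta(days=90),
    is_canonical=True,
)

if ready_for_indexing(record):
    print("index:", asdict(record)["doc_id"])   # hand off to your embedding pipeline here
else:
    print("reject:", record.doc_id, "- fix ownership, review date, or canonical status first")
```

The point is not this particular schema; it is that the gate runs before embedding, so anything the retriever can see has already passed a freshness and ownership check.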
The Irony: Data Work Is Unsexy but It’s the Real “AI
Work”
People want to talk about:
- Phi vs LLaMA
- RAG vs fine-tuning
- Agents and tools
- Vector databases
- Memory architectures
But none of that matters if the foundation is broken. The most underrated skill in modern AI is data curation. Not prompt engineering. Not model training. Not agents. Just… the basics.
A Simple Rule That Predicts AI Success
Whenever someone asks me: “Will AI work for us?”
My response is always: “Show me your data. All of it.
I’ll tell you in 10 minutes.”
Because I have never, not once, seen a company with clean, structured, canonicalized data fail at AI. But I have seen dozens with massive
model budgets fail because they never cleaned their data.
Final thought: AI doesn’t start with models. It starts with meaning. If you want good AI:
- Fix the data.
- Fix the knowledge.
- Fix the structure.
- Then bring the model.
In that order. Because the truth is simple: LLMs don’t hallucinate randomly; they hallucinate the mess we feed them.
#AI #LLM #DataQuality #EnterpriseAI #KnowledgeEngineering
#AIGovernance #AITransformation #AIReadiness