A colleague of mine once described her organization’s data estate as “a filing
cabinet that caught fire, got rained on and then had someone reorganize it
while wearing oven gloves.”
I laughed. Then I realized she was describing something I’d
seen in a dozen other organizations, and probably something most CTOs would
privately recognize too.
Here’s the uncomfortable truth that the AI vendor pitch
decks skip over: AI doesn’t fix bad data. It amplifies it. Feed a
model incomplete, siloed or biased information and you don’t get a neutral
result. You get confident-sounding nonsense produced at speed, at scale and
often delivered to someone who trusts it.
The scale of the problem in enterprise environments is
significant. According to a 2024 article in CDO Magazine, nearly 70% of
IT leaders describe their data as at least somewhat siloed and almost 40%
call it “completely” or “very siloed.” Separate systems. Separate ownership.
Separate formats. Nobody connected them because, until now, nobody had a
pressing reason to.
AI is that reason. But untangling years of data fragmentation isn’t a sprint task or a tool purchase. It’s a genuine architecture and governance challenge, and most organizations are discovering this mid-project, after the AI budget has already been committed.
When I talk to CTOs who’ve been through a GenAI
implementation that struggled, the data issues tend to fall into one of three
categories:
Quality
Data that was “good enough” for quarterly reporting is
almost never good enough to train or tune an AI model. Duplicates, missing
fields, inconsistent formats, records that haven’t been updated since 2019: AI
surfaces all of it. The model can’t distinguish between clean data and dirty
data; it just learns from whatever it’s given.
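To make this concrete, here is a minimal sketch of the kind of profiling pass worth running before any data goes near a model. It counts three of the problems above: duplicate keys, missing required fields and stale records. The field names (`customer_id`, `email`, `updated`), the sample rows and the 2020 cutoff are assumptions for illustration, not taken from any real system.

```python
from collections import Counter
from datetime import date


def profile_records(records, required_fields, key_field, updated_field, stale_before):
    """Count basic quality problems in a list of dict records."""
    # Duplicates: every occurrence of a key beyond the first one.
    key_counts = Counter(
        r[key_field] for r in records if r.get(key_field) is not None
    )
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)

    # Missing: records where any required field is absent or empty.
    missing = sum(
        1 for r in records
        if any(r.get(f) in (None, "") for f in required_fields)
    )

    # Stale: records last updated before the cutoff date.
    stale = sum(
        1 for r in records
        if r.get(updated_field) is not None and r[updated_field] < stale_before
    )

    return {
        "total": len(records),
        "duplicate_keys": duplicates,
        "missing_required": missing,
        "stale": stale,
    }


# Hypothetical sample data, for illustration only.
sample = [
    {"customer_id": "C1", "email": "a@example.com", "updated": date(2024, 3, 1)},
    {"customer_id": "C1", "email": "a@example.com", "updated": date(2024, 3, 1)},
    {"customer_id": "C2", "email": "", "updated": date(2019, 6, 30)},
    {"customer_id": "C3", "email": "c@example.com", "updated": date(2018, 1, 15)},
]

report = profile_records(
    sample,
    required_fields=["customer_id", "email"],
    key_field="customer_id",
    updated_field="updated",
    stale_before=date(2020, 1, 1),
)
print(report)
```

A report like this won’t fix anything on its own, but it turns “our data is probably fine” into numbers someone has to own.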
Accessibility
Even where good data exists, getting to it is often a legal,
contractual, or technical challenge. Data residency rules, privacy regulations,
legacy system limitations: the data you actually need is frequently the data
you can least easily use. I’ve seen organizations spend more time on data
access negotiations than on model development.
Bias
Historical data reflects historical decisions. And
historical decisions, in almost every industry, contain biases we wouldn’t
consciously make today. An AI model trained on this data learns and perpetuates
those biases, sometimes subtly, sometimes in ways that create real regulatory
and reputational risk.
The “garbage in, garbage out” problem isn’t new. But with
AI, the consequences are different in character.
A badly coded application fails in obvious, traceable ways.
You can debug it. A poorly trained AI model fails in subtle, opaque ways: it
produces authoritative-sounding outputs that are just wrong. Users trust it
precisely because it sounds confident. Decisions are made. Things go wrong. And
tracing it back to the data quality issue that started the chain is much harder
than it sounds.
This is not a theoretical risk. It’s happening in production
systems right now.
Here’s the counter-intuitive point that most data strategy
articles miss: data quality is rarely a technology problem. It’s a
politics problem. Data silos exist because they’re protected. A business
unit’s data is often also that unit’s leverage: it reflects their performance,
their customers, their decisions. Sharing it means exposing it. I’ve seen
organizations spend a year selecting a data platform and then another two years
failing to populate it, because the people who own the data have no incentive
to open it up. The technology was ready. The politics wasn’t. Any serious data
strategy has to address both.
The organizations making real headway on AI don’t
necessarily have better AI tools. They have better data foundations. And they
built them before the AI program started, not alongside it.
A few things they tend to do differently:
- They treat data quality as a product, not a project. It has owners, standards,
SLAs, and ongoing improvement cycles, not a one-time clean-up effort that
gets deprioritized the moment something more urgent appears.
- They invest in a data integration layer early. Whether that’s a modern lake-house
architecture, a semantic layer, or a data mesh approach, the ability to
bring data together in a governed, consistent, accessible way is
non-negotiable for anything beyond a basic proof-of-concept.
- They build a data catalogue so teams actually know what they have. Knowing
what data exists, where it lives, who owns it and what its quality looks
like sounds basic. It’s surprisingly rare, and it’s foundational.
And they embed governance from the start, not as
a compliance box to tick, but as a strategic capability that defines what data
can be used, for what purpose, and under what conditions. Counter-intuitively,
good governance tends to accelerate AI development, because it removes the
uncertainty that slows decisions down.
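As a rough sketch of how a catalogue and governance can work together, the example below records what a dataset is, where it lives, who owns it and how good it is, then answers the question “can we use this for that purpose?” with a reason attached. The entry fields, the dataset names, the 0-to-1 quality score and the 0.8 threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogueEntry:
    """One dataset in a (hypothetical) data catalogue."""
    name: str
    location: str
    owner: str
    quality_score: float  # assumed scale: 0.0 (unusable) to 1.0 (clean)
    allowed_purposes: frozenset = field(default_factory=frozenset)


def can_use(entry, purpose, min_quality=0.8):
    """Return (allowed, reason) for using a dataset for a given purpose."""
    if purpose not in entry.allowed_purposes:
        return False, f"{entry.owner} has not approved '{purpose}' for {entry.name}"
    if entry.quality_score < min_quality:
        return False, (
            f"{entry.name} quality {entry.quality_score:.2f} "
            f"is below {min_quality:.2f}"
        )
    return True, "ok"


# Hypothetical entries, for illustration only.
orders = CatalogueEntry(
    name="orders",
    location="warehouse.sales.orders",
    owner="sales-ops",
    quality_score=0.92,
    allowed_purposes=frozenset({"reporting", "model_training"}),
)
crm_notes = CatalogueEntry(
    name="crm_notes",
    location="crm.export.notes",
    owner="account-management",
    quality_score=0.55,
    allowed_purposes=frozenset({"reporting"}),
)

print(can_use(orders, "model_training"))
print(can_use(crm_notes, "model_training"))
```

The point of returning a reason is the one made above: a clear “no, because the owner hasn’t approved it” is faster to act on than weeks of uncertainty about whether anyone is allowed to touch the data.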
If you’re a CTO being pushed for AI results on a short
timeline, here’s the conversation worth having with your leadership team: “Our
data needs work before our AI can deliver. Rushing past that doesn’t make us
faster; it makes us faster at building the wrong thing.”
It’s not a comfortable message. But it’s the honest one.
The organizations winning at AI right now didn’t get lucky with their data. They made unglamorous investments in data foundations that their competitors skipped, and they’re now reaping the compound interest on that decision. The window to make that investment before it becomes a crisis is closing. For many, it’s already closed.
Next in the series: Part 3 − The leadership fluency gap nobody talks about.