Sanity Bytes: The CTO’s AI Playbook – Part 2: The Problem Isn’t Your AI. It’s Your Data

A colleague of mine once described her organization’s data estate as “a filing cabinet that caught fire, got rained on and then had someone reorganize it while wearing oven gloves.”

I laughed. Then I realized she was describing something I’d seen in a dozen other organizations, and probably something most CTOs would privately recognize too.

Here’s the uncomfortable truth that the AI vendor pitch decks skip over: AI doesn’t fix bad data. It amplifies it. Feed a model incomplete, siloed or biased information and you don’t get a neutral result. You get confident-sounding nonsense produced at speed, at scale and often delivered to someone who trusts it.

The scale of the problem in enterprise environments is significant. According to a 2024 article in CDO Magazine, nearly 70% of IT leaders describe their data as at least somewhat siloed and almost 40% call it “completely” or “very siloed.” Separate systems. Separate ownership. Separate formats. Nobody connected them because, until now, nobody had a pressing reason to.

AI is that reason. But untangling years of data fragmentation isn’t a sprint task or a tool purchase. It’s a genuine architecture and governance challenge, and most organizations are discovering this mid-project, after the AI budget has already been committed.

When I talk to CTOs who’ve been through a GenAI implementation that struggled, the data issues tend to fall into one of three categories:

Quality

Data that was “good enough” for quarterly reporting is almost never good enough to train or tune an AI model. Duplicates, missing fields, inconsistent formats, records that haven’t been updated since 2019: AI surfaces all of it. The model can’t distinguish between clean data and dirty data; it just learns from whatever it’s given.
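To make those four failure modes concrete, here is a minimal sketch of a quality profile in plain Python. The records, field names and the 2020 staleness cutoff are all invented for illustration; a real check would run against your own schema and thresholds.

```python
from collections import Counter
from datetime import date

# Hypothetical customer records; field names and values are assumptions.
RECORDS = [
    {"id": 1, "email": "ana@example.com", "country": "US", "updated": date(2024, 3, 1)},
    {"id": 2, "email": "ANA@example.com ", "country": "usa", "updated": date(2019, 6, 12)},
    {"id": 3, "email": None, "country": "US", "updated": date(2023, 11, 5)},
]


def profile(records, required=("email", "country"), stale_before=date(2020, 1, 1)):
    """Count basic quality issues in a list of record dicts."""
    # A record is "missing" if any required field is absent or empty.
    missing = sum(1 for r in records if any(not r.get(f) for f in required))
    # Normalize emails first so case and whitespace variants count as duplicates.
    emails = Counter(r["email"].strip().lower() for r in records if r.get("email"))
    duplicates = sum(n - 1 for n in emails.values() if n > 1)
    # Records last touched before the cutoff are treated as stale.
    stale = sum(1 for r in records if r["updated"] < stale_before)
    # Distinct raw values in the country column; here "US" and "usa" are
    # one country written two ways, which hints at inconsistent formats.
    country_variants = len({r["country"] for r in records if r.get("country")})
    return {
        "missing": missing,
        "duplicates": duplicates,
        "stale": stale,
        "country_variants": country_variants,
    }


report = profile(RECORDS)
print(report)
```

Three records, and every one of the four checks finds something. That ratio is not unusual in data that has only ever had to survive a quarterly report.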

Accessibility

Even where good data exists, getting to it is often a legal, contractual, or technical challenge. Data residency rules, privacy regulations, legacy system limitations: the data you actually need is frequently the data you can least easily use. I’ve seen organizations spend more time on data access negotiations than on model development.

Bias

Historical data reflects historical decisions. And historical decisions, in almost every industry, contain biases we wouldn’t consciously make today. An AI model trained on this data learns and perpetuates those biases, sometimes subtly, sometimes in ways that create real regulatory and reputational risk.
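One rough way to see this before any model is trained is to compare historical outcome rates across groups. The sketch below uses invented decisions and hypothetical group labels "A" and "B"; a low ratio is a prompt to investigate, not proof of bias on its own.

```python
from collections import defaultdict

# Hypothetical historical decisions as (group label, approved?) pairs.
# The data is invented purely for illustration.
HISTORY = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]


def approval_rates(history):
    """Return the share of approved decisions for each group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in history:
        totals[group] += 1
        approved[group] += int(ok)
    return {g: approved[g] / totals[g] for g in totals}


rates = approval_rates(HISTORY)
# Ratio of the lowest approval rate to the highest; 1.0 would mean parity.
ratio = min(rates.values()) / max(rates.values())
print(rates, round(ratio, 2))
```

A model trained on this history has every reason to reproduce the gap, because the gap is the pattern in the data.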

The “garbage in, garbage out” problem isn’t new. But with AI, the consequences are different in character.

A badly coded application fails in obvious, traceable ways. You can debug it. A poorly trained AI model fails in subtle, opaque ways: it produces authoritative-sounding outputs that are just wrong. Users trust it precisely because it sounds confident. Decisions are made. Things go wrong. And tracing it back to the data quality issue that started the chain is much harder than it sounds.

This is not a theoretical risk. It’s happening in production systems right now.

Here’s the counter-intuitive point that most data strategy articles miss: data quality is rarely a technology problem. It’s a politics problem. Data silos exist because they’re protected. A business unit’s data is often also that unit’s leverage: it reflects their performance, their customers, their decisions. Sharing it means exposing it. I’ve seen organizations spend a year selecting a data platform and then another two years failing to populate it, because the people who own the data have no incentive to open it up. The technology was ready. The politics wasn’t. Any serious data strategy has to address both.

The organizations making real headway on AI don’t necessarily have better AI tools. They have better data foundations. And they built them before the AI program started, not alongside it.

A few things they tend to do differently:

They treat data quality as a product, not a project. It has owners, standards, SLAs, and ongoing improvement cycles, not a one-time clean-up effort that gets deprioritized the moment something more urgent appears.
They invest in a data integration layer early. Whether that’s a modern lakehouse architecture, a semantic layer, or a data mesh approach, the ability to bring data together in a governed, consistent, accessible way is non-negotiable for anything beyond a basic proof-of-concept.
They build a data catalogue so teams actually know what they have. Knowing what data exists, where it lives, who owns it and what its quality looks like sounds basic. It’s surprisingly rare, and it’s foundational.

And they embed governance from the start, not as a compliance box to tick, but as a strategic capability that defines what data can be used, for what purpose, and under what conditions. Counter-intuitively, good governance tends to accelerate AI development, because it removes the uncertainty that slows decisions down.
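The catalogue and governance points fit together naturally in code. What follows is a minimal sketch, not a reference design: the `CatalogEntry` fields, the dataset names, the quality scores and the 0.7 threshold are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogEntry:
    """One dataset in a hypothetical catalogue."""
    name: str
    location: str            # where it lives
    owner: str                # who is accountable for it
    quality_score: float      # 0.0 to 1.0, measured however you choose
    allowed_purposes: frozenset = field(default_factory=frozenset)


# Invented entries for illustration only.
CATALOG = {
    "crm.customers": CatalogEntry(
        "crm.customers", "warehouse/crm", "sales-ops", 0.82,
        frozenset({"reporting", "model_training"}),
    ),
    "hr.reviews": CatalogEntry(
        "hr.reviews", "warehouse/hr", "people-team", 0.64,
        frozenset({"reporting"}),
    ),
}


def can_use(dataset, purpose, min_quality=0.7):
    """Answer 'may this dataset be used for this purpose?' with a reason."""
    entry = CATALOG.get(dataset)
    if entry is None:
        return False, "not in catalogue"
    if purpose not in entry.allowed_purposes:
        return False, f"purpose '{purpose}' not approved; ask {entry.owner}"
    if entry.quality_score < min_quality:
        return False, f"quality {entry.quality_score} below {min_quality}"
    return True, "ok"


print(can_use("crm.customers", "model_training"))
print(can_use("hr.reviews", "model_training"))
```

The useful part is that the answer comes back in milliseconds with a named owner attached, instead of after three weeks of email. That is what I mean by governance removing uncertainty.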

If you’re a CTO being pushed for AI results on a short timeline, here’s the conversation worth having with your leadership team: “Our data needs work before our AI can deliver. Rushing past that doesn’t make us faster; it makes us faster at building the wrong thing.”

It’s not a comfortable message. But it’s the honest one.

The organizations winning at AI right now didn’t get lucky with their data. They made unglamorous investments in data foundations that their competitors skipped, and they’re now reaping the compound interest on that decision. The window to make that investment before it becomes a crisis is closing. For many, it’s already closed.

Next in the series: Part 3 – The leadership fluency gap nobody talks about.

Thursday, June 4, 2026
