The AI Compute Dilemma: Efficiency Should Serve People
As costs shift from training to usage, one-size-fits-all fixes won't cut it. It's time to rethink AI infrastructure.
The Deloitte piece nails the vibe but undersells the stakes.
The headline frames it well: the AI infrastructure reckoning is about inference economics, not just big heroic training runs. Cost and value are shifting from model creation to model usage. That much is right. The problem is what comes next: the argument flattens into a one-size-fits-all optimization story, as if every company were playing the same game on the same board.
Sure, inference cost matters. No CFO has ever said, “I wish our margins were lower.” But inference isn’t just about pennies per query; it’s about ROI per user touch, latency that makes or breaks product-market fit, and the operational complexity of running models at scale without turning your SRE team into a permanent incident response unit. Treating “optimize inference” as a technical checkbox misses the point. Firms don’t just chase cheap FLOPs; they choose where value is captured, and that choice bleeds into product design, user experience, and who ends up owning the customer relationship.
Look at the cloud pitch. You get GPUs and managed services from AWS, Azure, and Google Cloud with elastic billing and global reach. No forklifts, no chilled aisles, no scrambling for capacity when your marketing team actually lands a campaign. But on-prem and edge deployments buy you control: predictable latency, local data processing for nervous regulators, and in some cases a better long-run cost structure if your utilization is high and your workload is steady.
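To make the "high utilization, steady workload" condition concrete, here is a rough break-even sketch. Every figure in it (the hourly cloud rate, the hardware price, the amortization window, the monthly operating cost) is a placeholder chosen for round arithmetic, not a real quote, and the model deliberately ignores egress, staffing, and discounts.

```python
HOURS_PER_MONTH = 730.0  # average hours in a month


def onprem_cost_per_gpu_month(capex_per_gpu: float,
                              amortization_months: int,
                              opex_per_gpu_month: float) -> float:
    """Monthly cost of owning one GPU: amortized hardware plus power, space, upkeep."""
    return capex_per_gpu / amortization_months + opex_per_gpu_month


def break_even_utilization(cloud_rate_per_gpu_hour: float,
                           capex_per_gpu: float,
                           amortization_months: int,
                           opex_per_gpu_month: float) -> float:
    """Fraction of the month a GPU must be busy before owning beats renting.

    Cloud cost scales with hours used; on-prem cost is fixed whether the
    GPU is busy or idle. Break-even is where the two are equal.
    """
    owned = onprem_cost_per_gpu_month(capex_per_gpu, amortization_months,
                                      opex_per_gpu_month)
    return owned / (cloud_rate_per_gpu_hour * HOURS_PER_MONTH)


if __name__ == "__main__":
    # Placeholder inputs, for illustration only.
    u = break_even_utilization(
        cloud_rate_per_gpu_hour=4.0,
        capex_per_gpu=36_000.0,
        amortization_months=36,
        opex_per_gpu_month=460.0,
    )
    print(f"Owning wins above {u:.0%} sustained utilization")
```

With these made-up inputs the crossover lands at 50% utilization. The number itself means little; the useful part is the shape of the curve. Spiky or uncertain demand sits below the line and favors elastic billing, while a steady workload sits above it and favors owned capacity.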
The Deloitte argument gestures toward optimization but doesn’t really put that decision in context. Startups living and dying by unit economics might prefer edge or hybrid strategies just to avoid egress surprises and to guarantee latency for real-time workflows like collaborative editing or live support. Enterprises that live under the gaze of auditors will index more heavily on governance and data provenance than on shaving fractions of a cent off per-inference spend.
Here’s the thing: these architecture choices are strategic, not housekeeping. If inference sits near the user (in an on-prem GPU cluster, a telco edge pod, or even on the device for smaller models), you preserve performance, trust, and a tighter feedback loop. If you centralize inference in someone else’s cloud, you move faster early on but risk two things: vendor lock-in and the gradual commoditization of the interface between your app and your customers. In AI, you want to be the interface, not a quiet line item on another company’s billing dashboard.
We’ve already seen a version of this movie. Netflix built its own content delivery network, Open Connect, rather than depend indefinitely on third-party CDNs, because controlling delivery close to the viewer shaped both streaming quality and bargaining power. AI inference is shaping up the same way: companies that treat it as a pure utility may save money on paper while giving up strategic control over a critical layer.
But the bigger blind spots sit outside the usual “optimize your stack” narrative.
First, governance. Who audits models, tracks drift, and certifies outputs when the system is live? Cheaper inference doesn’t look so smart if a model quietly shifts behavior and starts generating risky advice. That’s not a logging issue; it’s a board-level risk question.
Second, energy and sustainability. Price per unit of compute hides carbon intensity and public perception. As AI workloads scale, the reputational hit from ignoring energy mix and efficiency won’t be a side note — it will influence customer procurement and regulatory scrutiny. A cloud region with slightly higher compute costs but a better sustainability profile might be the rational long-term choice.
Third, vendor dependency. Optimize too tightly around one provider’s inference primitives or proprietary runtimes and you’ll absolutely cut costs in the short term. You’ll also make future portability miserable. Once your tooling, monitoring, and security posture are all wrapped around one stack, even negotiating a better contract becomes harder because everyone knows you can’t really leave.
These aren’t ethical niceties; they show up as real constraints on the balance sheet and in deal negotiations. Tight coupling can accelerate your time-to-market, then quietly hem in every strategic move you want to make afterward.
A common pushback is that relentless optimization kills ambition: if teams stare too hard at inference cost, they’ll be scared off from richer models, new modalities, or features that are expensive to serve. That tension is real. Innovation needs room to be wasteful for a while.
But you can be experimental without treating compute like a blank check. Tier your inference: smaller, cheaper models for routine interactions; heavier models reserved for high-value or high-risk moments. Use routing, distillation, and caching. Instrument everything so you know exactly which user journeys warrant triggering the “big gun” model. That’s not financial handcuffing; that’s product design. A coding assistant like GitHub Copilot, for example, doesn’t need its most powerful model churning on every keystroke; it could route based on context complexity and user behavior.
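As a minimal sketch of what tiering, caching, and instrumentation might look like together, consider the router below. Everything in it is hypothetical: the `TieredRouter` name, the word-count complexity heuristic, the 0.6 threshold, and the two model callables are stand-ins for illustration, not any vendor's API or any product's actual routing logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TieredRouter:
    """Send routine prompts to a cheap model and reserve the big one."""
    small_model: Callable[[str], str]
    large_model: Callable[[str], str]
    complexity_threshold: float = 0.6  # assumed cutoff; tune from real data
    cache: Dict[str, str] = field(default_factory=dict)
    calls: Dict[str, int] = field(
        default_factory=lambda: {"cache": 0, "small": 0, "large": 0})

    def score(self, prompt: str) -> float:
        # Crude stand-in for a complexity signal: longer prompts score higher.
        return min(len(prompt.split()) / 50.0, 1.0)

    def answer(self, prompt: str, high_risk: bool = False) -> str:
        key = prompt.strip().lower()
        # High-risk requests always get a fresh answer from the large model.
        if not high_risk and key in self.cache:
            self.calls["cache"] += 1
            return self.cache[key]
        use_large = high_risk or self.score(prompt) >= self.complexity_threshold
        tier = "large" if use_large else "small"
        result = (self.large_model if use_large else self.small_model)(prompt)
        self.calls[tier] += 1  # instrumentation: which tier served the request
        if not high_risk:
            self.cache[key] = result
        return result


if __name__ == "__main__":
    router = TieredRouter(
        small_model=lambda p: f"[small] {p}",
        large_model=lambda p: f"[large] {p}",
    )
    router.answer("reset my password")
    router.answer("Reset my password")          # served from cache
    router.answer("dispute a charge", high_risk=True)
    print(router.calls)
```

The `calls` counter is the part that matters for product decisions: once you can see how often each tier fires per user journey, the question of where the expensive model earns its keep becomes something you can measure rather than argue about.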
There’s another angle the Deloitte piece doesn’t quite chase: organizational structure. Once inference spend becomes a line item that rivals SaaS or even headcount, it stops being “just infra” and turns into a cross-functional decision. Do you centralize AI infra under a platform team, or do you let each product group choose its own stack and accept duplication? How you answer dictates whether you get compounding efficiency — or a patchwork of micro-optimized, incompatible systems that are impossible to govern coherently.
A quick sci-fi detour: Ursula Le Guin imagined societies where power lay in who controlled the rules of communication, not just who had the biggest weapons. Inference economics rhymes with that — the rules about where, when, and under whose control compute fires will quietly separate AI landlords from AI tenants.
Deloitte is right that AI infrastructure is facing a reckoning; the shift to inference economics is real. The companies that take that message seriously will push one step further and treat “optimize compute” as code for “decide who controls latency, trust, and lock-in” — because that’s where the profits, and the pressure, are going to concentrate.