Designing Data Centers for AI: A Practical Checklist for DevOps Teams
A practical checklist for choosing AI colocation or building private clusters—covering power, liquid cooling, networking, and procurement.
AI infrastructure has moved from “nice to have” to “must ship now,” and that shift is forcing DevOps, infrastructure, and procurement teams to make decisions that used to belong only to specialized facilities engineers. The challenge is not just finding an AI data center; it is validating whether the site can actually support the workload profile you plan to run, whether that means a colocation buildout or a private cluster with high-density racks, liquid cooling, and serious network engineering. As we covered in our piece on redefining AI infrastructure, the bottlenecks are now power, cooling, and proximity—not just server count. If your team is evaluating vendors, it helps to think like a postmortem writer and a capacity planner at the same time, drawing on the resilient platform patterns in our guide on resilient cloud architecture under geopolitical risk.
This guide turns the big ideas into an actionable colocation checklist and procurement framework. We will walk through site selection, GPU power planning, RDHx versus direct-to-chip cooling, network backbone requirements, operational readiness, and the contract language DevOps teams should insist on before anyone signs a purchase order. Where many articles stop at “multi-megawatt is important,” this one shows how to verify that claim in real life, what evidence to request, and which tradeoffs matter most when you are trying to ship AI capacity on a deadline. For teams already evaluating vendors, our practical vendor due-diligence pattern in how to vet training vendors is a useful mindset shift: ask for proof, references, and failure modes, not just marketing claims.
1. Start With the Workload, Not the Building
Define the actual AI system you are deploying
The first mistake teams make is treating “AI infrastructure” as a single category. A model-training cluster for LLM fine-tuning has very different requirements from an inference farm, a vector-search service, or a multimodal pipeline that blends batch and streaming workloads. Before you talk to any colocation provider, document the accelerator type, rack count, target utilization, memory footprint, and expected growth over 12 to 24 months. If you cannot state whether your initial deployment is 50 kW per rack or 120 kW per rack, you are not ready to compare sites intelligently.
Workload definition also informs whether your team can tolerate standard air cooling, needs liquid cooling from day one, or should design a hybrid model. Many vendors will say they “support AI,” but that phrase means little until you know your thermal envelope, maintenance windows, and redundancy requirements. You should also model what happens when GPU utilization rises faster than expected, because AI clusters often fail capacity planning through network or cooling constraints long before they fail on compute availability. This is where disciplined planning resembles our approach in designing a capital plan that survives tariffs and high rates: if you do not bake in volatility, your budget will break under stress.
Translate model ambition into power and cooling math
GPU power planning is not a paper exercise. A single AI rack can consume more than 100 kW, and the difference between a “probably works” estimate and a validated utility plan can determine whether your deployment starts in weeks or slips by a year. Break down total facility load into rack-level power, cooling load, networking overhead, storage overhead, and headroom for inefficiencies. Then verify whether the site can deliver not only the desired megawatts, but also the distribution path, breaker sizing, UPS topology, and generator capacity to sustain it.
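As a minimal sketch of that roll-up, the snippet below estimates facility load from rack-level figures; the rack count, density, PUE, and headroom values are illustrative assumptions, not vendor data.

```python
def facility_load_kw(rack_count: int, kw_per_rack: float,
                     pue: float = 1.3, headroom: float = 0.15) -> dict:
    """Roll rack-level IT load up to an estimated facility load.

    pue:      assumed power usage effectiveness (cooling + distribution losses)
    headroom: assumed fractional margin for growth and measurement error
    """
    it_load = rack_count * kw_per_rack        # critical IT load
    total = it_load * pue                     # add cooling and distribution overhead
    planning_target = total * (1 + headroom)  # planning margin
    return {
        "it_load_kw": it_load,
        "facility_load_kw": total,
        "planning_target_kw": planning_target,
    }

# Example: 24 racks at 100 kW each lands near a 3.6 MW planning target under these assumptions.
print(facility_load_kw(rack_count=24, kw_per_rack=100))
```

The point of the exercise is that the number you take to the utility conversation is the planning target, not the nameplate IT load.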
Teams should ask for both the current usable megawatts and the upgrade path, because many sites advertise future capacity that is not deliverable soon enough for an AI launch schedule. It is useful to distinguish between theoretical capacity and ready-now multi-megawatt availability. If the provider’s roadmap is the main selling point, you may be buying a promise instead of an operational asset. In the same spirit, our guide on fuel, capacity, and route cuts is a reminder that operational promises are only as good as the infrastructure behind them.
Set a “go/no-go” threshold before procurement begins
One of the most useful things DevOps teams can do is create a pre-approved technical threshold. For example, your organization might decide that no site is eligible unless it can support 60 kW racks with liquid cooling, provide two diverse network paths, and confirm utility-backed expansion to at least 5 MW within a defined window. That threshold keeps the buying process from drifting under pressure from sales teams or internal stakeholders who want to move fast. It also gives finance and leadership a transparent standard for comparing sites and negotiating concessions.
Think of this threshold as an incident-runbook decision tree for infrastructure procurement. If a site misses a critical requirement, the answer is not “maybe later.” It is either “redesign the architecture” or “walk away.” This clarity is especially important when your team is choosing between colocation and a private build, because the wrong choice can create long-term stranded assets. For a related example of evaluating service promises against real-world conditions, see our guide to spotting a real record-low deal.
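To make the threshold concrete, here is a hedged sketch of that go/no-go gate expressed as code; the minimums mirror the example above (60 kW racks, liquid cooling, two diverse paths, utility-backed 5 MW expansion) and the 18-month window is an assumption to swap for your own numbers.

```python
from dataclasses import dataclass

@dataclass
class SiteSpec:
    max_kw_per_rack: float
    liquid_cooling: bool
    diverse_network_paths: int
    committed_expansion_mw: float
    expansion_months: int  # months until the committed expansion is energized

def go_no_go(site: SiteSpec) -> list[str]:
    """Return the list of failed requirements; an empty list means 'go'."""
    failures = []
    if site.max_kw_per_rack < 60:
        failures.append("rack density below 60 kW")
    if not site.liquid_cooling:
        failures.append("no liquid cooling support")
    if site.diverse_network_paths < 2:
        failures.append("fewer than two diverse network paths")
    if site.committed_expansion_mw < 5 or site.expansion_months > 18:  # assumed 18-month window
        failures.append("no utility-backed 5 MW expansion inside the window")
    return failures

candidate = SiteSpec(max_kw_per_rack=80, liquid_cooling=True,
                     diverse_network_paths=2, committed_expansion_mw=6,
                     expansion_months=12)
print(go_no_go(candidate) or "go")
```

Returning the list of failed requirements, rather than a single boolean, keeps the review focused on what specifically disqualified a site.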
2. Site Selection: The AI Data Center Is a Geographic Decision
Power availability and utility timeline
Site selection starts with electrical reality. An AI data center is fundamentally a power project, and a power project is constrained by utility lead times, substation availability, transmission capacity, and permitting. Ask for the utility letter, interconnect status, transformer lead time, and any dependency on off-site construction. You should also validate whether the provider is quoting firm delivery dates or optimistic targets that depend on third parties with no contractual obligation to you.
For DevOps and infrastructure teams, this means requesting a load study, electrical one-line diagrams, and a phased energization plan. If the provider cannot show how the facility moves from initial energization to final full-build capacity, you should treat that as a risk flag. The best sites are not merely “big”; they are engineered for power availability with enough margin to absorb startup spikes, failure modes, and future cluster expansion. Teams that think in terms of resiliency can borrow from our article on nearshoring, sanctions, and resilient cloud architecture, where location choice is treated as a resilience control, not just a cost variable.
Latency, connectivity, and ecosystem proximity
AI workloads do not only need compute; they need data movement. If your team is training on distributed datasets, serving inference to global users, or syncing to cloud storage, network latency and carrier diversity become operational priorities. Evaluate whether the site has access to dense carrier ecosystems, direct cloud on-ramps, cross-connect options, and enough bandwidth to support checkpointing and replication without saturating links. If your training data sits in another region, transport costs and performance can dominate the economics of the deployment.
Connectivity also affects developer velocity. A site with weak interconnect options can slow down experimentation, create awkward backup flows, and force teams to engineer around limitations instead of building on top of the infrastructure. This is one reason multi-cloud and hybrid environments often benefit from strategically selected colocation. For a related operational lens on network planning, our piece on mesh Wi-Fi upgrade decisions offers a smaller-scale analogy: bandwidth is only valuable when the path is designed for the real workload, not the brochure.
Risk, climate, and continuity considerations
Every site is exposed to physical risk, even if the utility numbers are perfect. Floodplain location, wildfire smoke, hurricane exposure, seismic activity, and water constraints all matter because AI clusters are expensive to move and expensive to idle. The right site selection process should include climate resilience, insurance feasibility, and supply-chain access for replacement parts. You do not want to discover after procurement that a promising facility is easy to energize but difficult to insure or maintain during regional disruption.
It is also worth considering who else is in the ecosystem. A site near other cloud and telecom assets may have better staffing, faster spare-parts delivery, and more mature maintenance support. Conversely, highly isolated sites can become operational islands. That tradeoff mirrors the risk balancing we discuss in mitigating geopolitical and payment risk and in geospatial risk intelligence for resilient meetups, where location choice shapes operational stability.
3. Power Planning for GPU Clusters
From rack density to facility load
Traditional data center planning often assumes modest rack densities, but AI changes the math. A modern cluster can turn a row of racks into a concentrated thermal and electrical load that stresses every part of the chain, from switchgear to cable routing. Your power model should account for sustained load, burst behavior, redundancy, and the fact that GPUs do not always consume power uniformly across workloads. That means planning for both peak training jobs and the lower, but still substantial, load of inference or fine-tuning operations.
At procurement time, require the vendor to show rack-by-rack power budgets and the cooling assumptions behind them. Ask whether the facility supports redundant A/B feeds at the density you need, how power is metered, and whether monitoring is available at the circuit level. If the answer is a vague “yes” without documentation, assume you will have to build your own validation layer. In related operational procurement contexts, our guide on securely connecting smart office devices is a reminder that integration details matter just as much as headline capability.
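A lightweight sketch of that validation layer might look like the following; the 80 percent continuous-load derating and the per-feed circuit sizes are assumptions to adapt to your actual electrical design.

```python
def feed_check(rack_budget_kw: float, circuit_kw_per_feed: float,
               feeds: int = 2, derate: float = 0.8) -> dict:
    """Check whether a rack budget fits its feeds, including single-feed failure.

    derate: assumed fraction of circuit capacity usable for continuous load.
    """
    usable_per_feed = circuit_kw_per_feed * derate
    normal_ok = rack_budget_kw <= usable_per_feed * feeds
    # In a 2N A/B design, the surviving feed must carry the full rack load alone.
    failover_ok = rack_budget_kw <= usable_per_feed if feeds == 2 else normal_ok
    return {"normal_ok": normal_ok, "single_feed_failover_ok": failover_ok}

# Example: a 100 kW rack on two 70 kW feeds passes under normal A+B operation
# but cannot survive the loss of one feed under these assumptions.
print(feed_check(rack_budget_kw=100, circuit_kw_per_feed=70))
```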
Battery, generator, and ride-through expectations
AI training jobs do not tolerate sloppy outage handling. If a site loses power, your cluster may lose hours of compute or corrupt long-running distributed jobs. You should therefore understand not only whether the facility has UPS and generators, but also the ride-through time, maintenance testing regimen, fuel contracts, and restoration procedures. Ask how the provider handles simultaneous utility and generator maintenance, because that is when “redundant” systems can become less redundant than advertised.
DevOps teams should coordinate with infrastructure teams to determine whether workload orchestration can survive failover cleanly. Some workloads can checkpoint often; others need application-level protection or queue draining before power transitions. That means the procurement decision is inseparable from software design. If your platform reliability program already uses strong operational discipline, the mindset will feel familiar—similar to the rigor we recommend in redefining B2B metrics for AI-influenced funnels, where the measurement model must match the actual system, not just vanity metrics.
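As a rough illustration of why checkpoint cadence and ride-through belong in the same conversation, this sketch estimates the compute lost to a hard power transition; the checkpoint interval, restart time, and GPU count are assumptions, not measurements.

```python
def lost_gpu_hours(checkpoint_interval_min: float, restart_min: float,
                   gpu_count: int) -> float:
    """Expected GPU-hours lost per hard power loss.

    On average, half a checkpoint interval of work is discarded, plus the time
    to restore state and resume the job across the whole cluster.
    """
    lost_minutes = checkpoint_interval_min / 2 + restart_min
    return lost_minutes / 60 * gpu_count

# Example: 30-minute checkpoints, a 20-minute restart, and 512 GPUs
# cost roughly 300 GPU-hours per event under these assumptions.
print(round(lost_gpu_hours(checkpoint_interval_min=30, restart_min=20, gpu_count=512)))
```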
Billing, metering, and FinOps alignment
Power planning is also a FinOps issue. AI infrastructure budgets are often undermined by hidden charges: demand-based power premiums, cooling surcharges, cross-connect fees, metered burst consumption, and penalties for reserved capacity that goes unused. Ask for the full rate card, not a summary. Then model the all-in cost per usable GPU-hour, not just per kilowatt or per rack.
That all-in view lets finance and engineering compare colocation against cloud GPU rentals, hybrid bursting, or a private cluster strategy. It is common for teams to overestimate savings from moving to bare metal because they ignore service overheads and underutilization. To avoid that trap, use the same discipline you would apply in tax planning for volatile years: account for variability, not just the happy path.
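One hedged way to run that comparison is to compute an all-in cost per usable GPU-hour, as in the sketch below; every input (rent, power, fees, utilization) is a placeholder you would replace with figures from the rate card.

```python
def cost_per_usable_gpu_hour(monthly_rent_usd: float, monthly_power_kwh: float,
                             power_rate_usd_per_kwh: float, other_fees_usd: float,
                             gpu_count: int, utilization: float) -> float:
    """All-in monthly cost divided by GPU-hours actually used."""
    hours_per_month = 730  # average hours in a month
    total_cost = (monthly_rent_usd
                  + monthly_power_kwh * power_rate_usd_per_kwh
                  + other_fees_usd)
    usable_gpu_hours = gpu_count * hours_per_month * utilization
    return total_cost / usable_gpu_hours

# Example with assumed inputs: at 60% utilization the effective rate is roughly
# 1.7x what a naive 100%-utilization estimate would suggest.
print(round(cost_per_usable_gpu_hour(
    monthly_rent_usd=45_000, monthly_power_kwh=250_000,
    power_rate_usd_per_kwh=0.09, other_fees_usd=6_000,
    gpu_count=256, utilization=0.6), 2))
```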
4. Cooling Architecture: Liquid Cooling Is Not Optional for Many AI Builds
When air cooling stops being enough
The phrase “liquid cooling” used to sound exotic. For AI infrastructure, it is rapidly becoming baseline engineering. Once rack densities climb high enough, traditional air-cooling approaches can struggle to remove heat efficiently, especially when hot aisles, cable congestion, and uneven airflow combine to create localized hot spots. If your hardware roadmap includes the latest GPU platforms, you need to assume liquid cooling will be part of the architecture sooner rather than later.
The practical question is which form fits your environment. Air-cooled rows may still work for lower-density support systems, while GPU-heavy rows may need direct liquid support. A sensible procurement plan should define which workloads go into which pod and how the cooling system scales over time. As our article on growth in liquid cooling markets shows, cooling technology shifts the economics of what can be deployed where.
RDHx versus direct-to-chip cooling
RDHx, or rear-door heat exchangers, and direct-to-chip cooling solve different problems. RDHx can be a strong fit when you want to augment existing infrastructure, capture exhaust heat at the rack edge, and avoid a full internal redesign. Direct-to-chip systems, by contrast, move coolant closer to the thermal source and are often better suited for very high-density AI racks where air handling alone cannot keep up. The right choice depends on your hardware generation, deployment timeline, serviceability requirements, and the provider’s operational maturity.
When evaluating a colocation provider, ask whether the cooling loop is designed for current rack densities or just future marketing slides. Find out whether maintenance can be performed without shutdown, how leaks are detected, what isolation procedures exist, and whether the site has experience supporting the exact class of GPU hardware you plan to deploy. You want a provider that can discuss commissioning, pressure testing, water quality, and fault isolation in practical terms, not buzzwords. In adjacent hardware-reliability discussions, our guide on maintaining PCs efficiently is a smaller example of the same principle: the right maintenance approach can materially improve system health.
Water quality, plumbing, and serviceability
Liquid cooling introduces new dependencies: water chemistry, filtration, leak detection, hose routing, quick-disconnect reliability, and access for technicians. These are not minor details. A cooling loop that is technically capable but operationally fragile can create more outage risk than the thermal benefit it delivers. Ask for the planned maintenance intervals, spare-part strategy, and escalation process for coolant-related incidents.
It is also smart to request commissioning records from prior deployments. If the provider has never run dense AI pods at production scale, your team may become their learning exercise. That is not necessarily disqualifying, but it must be reflected in the risk register and the contract. For a broader perspective on proving performance before rollout, our guide to combining reviews with real-world testing is a useful mental model: lab claims are useful, but field validation is what matters.
5. Connectivity and Network Design for AI Clusters
Backbone, east-west traffic, and data gravity
AI clusters generate enormous east-west traffic between nodes, storage, and checkpoint systems. That means your network design should not be optimized only for internet ingress and egress. It needs to handle internal replication, distributed training chatter, backup movement, and retrieval from object storage or feature stores. If you underbuild the network, your GPUs will sit idle waiting for data, and your expensive cluster will underperform despite looking healthy on paper.
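To put numbers on “underbuilding the network,” the following sketch converts a checkpoint size and cadence into the sustained bandwidth it demands; the 4 TB checkpoint and 30-minute interval are assumptions chosen only for illustration.

```python
def checkpoint_bandwidth_gbps(checkpoint_tb: float, interval_min: float,
                              write_window_fraction: float = 0.5) -> float:
    """Sustained bandwidth needed so checkpoint writes finish within a fraction
    of the checkpoint interval (assumed half, to leave headroom for other traffic)."""
    bits = checkpoint_tb * 8e12
    window_s = interval_min * 60 * write_window_fraction
    return bits / window_s / 1e9

# Example: a 4 TB checkpoint every 30 minutes needs roughly 36 Gbps sustained to storage.
print(round(checkpoint_bandwidth_gbps(checkpoint_tb=4, interval_min=30)))
```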
When comparing sites, ask how much bandwidth is available per suite, per cage, and per cross-connect. Confirm whether the provider has experience with low-latency, high-throughput workloads and whether network upgrades are incremental or require long lead times. A provider with dense carrier access and clean optical pathways can dramatically reduce implementation friction. Similar principles apply in our article on API-first automation, where speed depends on infrastructure designed around the real transaction flow.
Cloud on-ramps and hybrid design
Even teams building private AI clusters often need hybrid connectivity. You may keep training data in one cloud, checkpoints in another, and inference in a colocation site. In that case, the ability to establish direct cloud on-ramps, private links, and consistent routing policies becomes a major procurement criterion. A site with weak cloud adjacency can force you into expensive and fragile internet-based transit paths.
Ask whether the provider supports private peering, multiple carriers, route diversity, and DDoS protection options. Also confirm who owns the demarcation points and what happens during carrier maintenance windows. These details affect uptime, security, and incident response. For teams dealing with sensitive data or regulated environments, our guide to secure file transfer features is a reminder that connectivity choices can shape the security boundary as much as any firewall rule.
Operational monitoring and failure visibility
Networking failures in AI environments often look like application slowness, not hard outages. That means your observability stack should include packet-loss monitoring, interface saturation, congestion tracking, and telemetry from switches and transceivers. Ask the provider what metrics are exposed, at what granularity, and whether you can integrate them into your own monitoring tools. If the answer is “we have a portal,” push further.
DevOps teams should plan to monitor the entire path from accelerator to storage to cloud egress. This is where infrastructure procurement intersects with observability discipline. Just as our guide on building research-grade datasets emphasizes pipeline integrity, the AI network must be instrumented as a pipeline with measurable failure points, not just a black box.
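A minimal example of that path-level instrumentation, assuming you can pull raw interface byte counters from the provider's telemetry or your own switches: the snippet computes utilization between two samples and flags sustained saturation. The 85 percent threshold is an assumed congestion signal, not a standard.

```python
def interface_utilization(bytes_t0: int, bytes_t1: int, seconds: float,
                          link_gbps: float) -> float:
    """Return link utilization (0..1) from two octet-counter samples."""
    bits_per_sec = (bytes_t1 - bytes_t0) * 8 / seconds
    return bits_per_sec / (link_gbps * 1e9)

def is_saturated(utilization: float, threshold: float = 0.85) -> bool:
    # Sustained utilization above the assumed threshold is treated as congestion.
    return utilization >= threshold

# Example: ~1.3 TB moved in 60 seconds on a 200 Gbps link is already near saturation.
u = interface_utilization(bytes_t0=0, bytes_t1=1_300_000_000_000,
                          seconds=60, link_gbps=200)
print(round(u, 2), is_saturated(u))
```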
6. Procurement Checklist: What to Ask Before You Sign
Technical due diligence questions
A good infrastructure procurement conversation should feel like an engineering review, not a sales demo. Ask for specific evidence: one-line electrical diagrams, cooling schematics, rack density limits, utility commitments, commissioning reports, and carrier maps. Ask whether the site supports your target mix of compute, storage, and network equipment, and whether the facility can handle phased deployment without forcing a redesign. If the provider cannot answer deeply, you should assume hidden work will be pushed onto your team.
Here is the core checklist many DevOps teams can adopt immediately: power availability now and later; rack density support; liquid cooling readiness; serviceability access; network redundancy; cloud on-ramp options; spare-parts SLAs; monitoring visibility; maintenance windows; and escalation paths. You should also ask whether the provider has run workloads similar to yours, because AI infrastructure is not just “more servers.” The details matter in a way that mirrors our analysis of prompt tooling for multimedia workflows: the stack only works well when each component is tuned to the workflow.
Commercial terms that matter more than the headline price
Pricing is never just monthly rack rent. Your cost model should include power escalation clauses, installation costs, cross-connect fees, migration support, liquid-cooling service premiums, and penalties for using reserved space differently than planned. Negotiate for clarity on expansion rights, right of first refusal on adjacent space, and the ability to pre-book power blocks as your cluster grows. These terms can matter more than a small difference in initial price.
You should also be careful about path dependency. A cheap initial deal can become expensive if it traps you in a facility with poor expansion options or limited cooling headroom. That is why experienced teams compare total lifecycle cost, not only launch cost. It is the same reason we recommend disciplined deal evaluation in our record-low deal guide and why procurement should include real usage scenarios instead of brochure math.
Security, compliance, and physical access
AI data centers are high-value targets. Your procurement checklist should cover badge controls, visitor logging, camera coverage, chain-of-custody procedures, and hardware access policies. If you operate regulated workloads, ask about compliance attestations, evidence retention, and whether the provider can support your audit obligations. Physical security is part of the platform, not a separate issue.
DevOps teams should also map who can touch the environment during maintenance, how remote hands operate, and how emergency access is granted. The goal is to preserve safety without creating operational bottlenecks. This mindset aligns with the careful risk handling in our risk management guide, where trust, process, and controls must all be explicit.
7. Build vs Buy: How to Decide Between Colocation and a Private AI Cluster
When colocation is the better move
Colocation makes sense when you need speed, flexibility, and access to a mature power and carrier ecosystem without waiting for a ground-up build. If your team must deploy within a quarter, wants to test a new GPU fleet, or needs to reduce the complexity of campus construction, colocation can be the pragmatic choice. It can also be a good fit when your internal team wants to focus on platform engineering instead of facilities management.
But colocation only works if the provider can genuinely meet your thermal and electrical requirements. If you need liquid cooling, multi-megawatt growth, and customized serviceability, the site must already be engineered for that class of demand. Otherwise, you are buying time at the cost of future limitations. For comparable decision discipline in other domains, see our guide on building a premium library on a budget, where the best option is the one that matches your actual use, not the cheapest headline.
When a private build may be justified
A private AI cluster can be the right answer if your scale is large enough, your deployment horizon is long, and your workload profile is stable enough to justify capital investment. Private builds can provide more control over layout, security, and long-term economics, especially when you are planning repeat expansions. They can also be the better choice when your cooling design is highly customized or when you need to optimize for unique governance constraints.
However, a private build carries more execution risk. Utility timelines, construction delays, cooling commissioning, and staffing can all push the project off schedule. That means the organization needs a mature capital plan and a strong risk committee. The strategic framing is similar to capital planning under uncertainty: you must model delay, not just success.
The decision matrix DevOps should bring to leadership
Leadership often hears “build vs buy” as a financial decision, but for AI infrastructure it is really a delivery decision. The right framework includes time-to-deploy, power certainty, cooling maturity, network ecosystem, operational burden, and exit flexibility. If colocation gets you to production faster with acceptable cost and lower risk, it usually wins. If long-term scale and customized engineering are paramount, private infrastructure may justify its complexity.
Bring an explicit tradeoff matrix to the decision review. Include assumptions, constraints, and the cost of missing launch dates. That clarity prevents procurement from becoming a debate over slogans. Our approach to practical comparisons in testing before trusting applies here too: compare real conditions, not idealized vendor claims.
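One way to bring that matrix is as a weighted scorecard leadership can interrogate; in this sketch both the weights and the 1-to-5 scores are placeholders for your team's debate, not recommendations.

```python
WEIGHTS = {
    "time_to_deploy": 0.25,
    "power_certainty": 0.20,
    "cooling_maturity": 0.20,
    "network_ecosystem": 0.15,
    "operational_burden": 0.10,
    "exit_flexibility": 0.10,
}

def weighted_score(scores: dict) -> float:
    """Scores run from 1 (poor) to 5 (strong) per criterion."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Illustrative scores only; real values come out of the decision review.
colocation = {"time_to_deploy": 5, "power_certainty": 4, "cooling_maturity": 4,
              "network_ecosystem": 4, "operational_burden": 4, "exit_flexibility": 3}
private_build = {"time_to_deploy": 2, "power_certainty": 3, "cooling_maturity": 3,
                 "network_ecosystem": 3, "operational_burden": 2, "exit_flexibility": 4}

print("colocation:", weighted_score(colocation))
print("private build:", weighted_score(private_build))
```

The value is less in the final number than in forcing assumptions and weights onto the table where they can be challenged.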
8. Operational Readiness: Day-2 Is Where AI Infrastructure Succeeds or Fails
Commissioning, runbooks, and acceptance testing
Before you move production workloads into an AI site, require commissioning evidence and acceptance testing for power, cooling, and network paths. This should include failover tests, thermal validation, and documented recovery procedures. If the provider cannot show that the environment was tested under realistic load, your team should not assume the site is production-ready.
Acceptance testing should also involve your own runbooks. Can you drain a rack safely? Can you isolate a failing node without disrupting the pod? Do you know what alerts fire when coolant temperature rises or a circuit overloads? These are not theoretical questions once you are operating at high density. The discipline is similar to the operational rigor behind research-grade data pipelines, where every stage must be observable and repeatable.
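If acceptance testing is going to be repeatable rather than a one-time walkthrough, it helps to encode the checks; in this sketch the helper functions and every threshold are hypothetical stand-ins for your own telemetry, orchestration, and runbook limits.

```python
# Hypothetical helpers: replace with real queries into your telemetry/BMS and orchestrator.
def get_ups_ride_through_s() -> float:
    return 420.0  # placeholder reading

def get_inlet_temp_c(rack: str) -> float:
    return 24.5   # placeholder reading

def drain_rack(rack: str) -> bool:
    return True   # placeholder: did the rack drain without pod-level impact?

def acceptance_checks(racks: list[str]) -> dict:
    """Run a handful of day-0 acceptance checks and report pass/fail per item."""
    results = {"ups_ride_through": get_ups_ride_through_s() >= 300}   # assumed 5-minute minimum
    for rack in racks:
        results[f"thermal_{rack}"] = get_inlet_temp_c(rack) <= 27.0   # assumed inlet limit
        results[f"drain_{rack}"] = drain_rack(rack)
    return results

print(acceptance_checks(["rack-01", "rack-02"]))
```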
Spare parts, staffing, and escalation
AI infrastructure is hard on hardware, and hard on people. Your readiness plan should define spare-part inventory, RMA timelines, hands-on support availability, and escalation paths for cooling or power incidents. If a provider offers remote hands, confirm the actual scope: what can they replace, under what supervision, and within what SLA? The difference between “available” and “effective” can be the difference between a brief interruption and a multi-hour outage.
You should also determine whether your own team has the skills to manage the environment. If not, budget for training, partner support, or a managed-service layer. A common failure mode is underestimating the human workload of a sophisticated site. This mirrors the caution in vendor evaluation: capability is as important as the contract.
Incident response and postmortems
When an AI cluster fails, the postmortem should connect facility events to application symptoms. That means correlating power, cooling, and network telemetry with workload behavior and user impact. If your team cannot trace the blast radius from a circuit issue to a stalled training job, your observability is incomplete. The best AI operations teams treat data center failures with the same rigor they use for software incidents.
Build an incident template that includes root cause, contributing factors, detection gaps, workload impact, and prevention actions. This creates a feedback loop that improves both operations and procurement. It also makes leadership more willing to invest in the next phase because the team can demonstrate learning, not just reaction. For a related perspective on structuring operational knowledge, see how top workplaces use rituals.
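A lightweight way to enforce that template is to make its fields explicit in code; this sketch simply mirrors the items above, and the example values are invented.

```python
from dataclasses import dataclass, field

@dataclass
class FacilityIncident:
    summary: str
    root_cause: str
    contributing_factors: list[str] = field(default_factory=list)
    detection_gaps: list[str] = field(default_factory=list)
    workload_impact: str = ""
    prevention_actions: list[str] = field(default_factory=list)

# Illustrative record only.
incident = FacilityIncident(
    summary="CDU pump trip in pod B",
    root_cause="failed pump controller during a maintenance window",
    contributing_factors=["single CDU serving two rows"],
    detection_gaps=["no alert on coolant delta-T trend"],
    workload_impact="two training jobs throttled for 40 minutes",
    prevention_actions=["add delta-T alerting", "review CDU redundancy"],
)
print(incident)
```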
9. A Practical Colocation Checklist for AI Teams
Use the following checklist as a starting point when evaluating any AI colocation provider or private-site option. It is intentionally written for practitioners who need to compare sites quickly without missing critical details. You should score each item with evidence, not opinion, and assign a clear owner from DevOps, infrastructure, and procurement.
| Domain | What to Verify | Why It Matters | Evidence to Request |
|---|---|---|---|
| Power capacity | Ready-now MW, expansion MW, utility timeline | Determines launch feasibility and growth path | Utility letters, load study, energization schedule |
| Rack density | Supported kW per rack, phased density support | Prevents thermal and electrical bottlenecks | Rack specs, commissioning reports |
| Cooling | Liquid cooling, RDHx, direct-to-chip readiness | Essential for high-density AI workloads | Cooling schematics, leak detection plan |
| Network | Carrier diversity, cloud on-ramps, cross-connects | Impacts throughput, latency, and resilience | Carrier maps, routing design, SLA terms |
| Operations | Monitoring, remote hands, escalation paths | Defines day-2 reliability | Runbooks, support matrix, incident SLAs |
| Security | Physical access controls, audit support | Protects high-value compute and data | Security policy, compliance attestations |
| Commercials | Rate card, expansion rights, penalties | Shapes total cost of ownership | Contract draft, pricing model, addenda |
Pro Tip: If a provider cannot show you how power, cooling, and network capacity will scale together, treat the site as incomplete. In AI environments, the weakest subsystem becomes the performance ceiling.
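If you want the table above to drive a comparison rather than a conversation, it can be captured as a scorecard; the domains mirror the table, the evidence and owner fields are the point of the exercise, and the 0-to-2 scoring scale is an assumption.

```python
CHECKLIST_DOMAINS = ["power", "rack_density", "cooling", "network",
                     "operations", "security", "commercials"]

def score_site(entries: dict) -> float:
    """Each entry: {'score': 0-2, 'evidence': str, 'owner': str}.

    A domain with no attached evidence scores zero regardless of the claimed score,
    which enforces the 'evidence, not opinion' rule.
    """
    total = 0
    for domain in CHECKLIST_DOMAINS:
        entry = entries.get(domain, {})
        total += entry.get("score", 0) if entry.get("evidence") else 0
    return total / (2 * len(CHECKLIST_DOMAINS))

# Illustrative, partially documented site: missing domains drag the score down.
site_a = {
    "power":   {"score": 2, "evidence": "utility letter + load study",  "owner": "infra"},
    "cooling": {"score": 1, "evidence": "RDHx schematics only",         "owner": "devops"},
    "network": {"score": 2, "evidence": "carrier map + routing design", "owner": "netops"},
}
print(round(score_site(site_a), 2))
```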
10. FAQ: AI Data Center Planning for DevOps Teams
How much power does an AI data center need?
It depends on the workload, but AI clusters frequently require much higher rack densities than traditional enterprise workloads. Teams should plan at the rack level first, then roll up to facility load. For cutting-edge GPU deployments, a single rack may exceed 100 kW, which means the facility must support both the electrical and thermal demands of that density.
Is liquid cooling mandatory for GPU clusters?
Not always, but it is increasingly common for high-density deployments. If your racks are approaching the upper density range, liquid cooling becomes a practical necessity rather than a luxury. The key is to match the cooling system to the hardware roadmap and to confirm serviceability before deployment.
What is the difference between RDHx and direct-to-chip cooling?
RDHx captures heat at the rear of the rack and is often used as an incremental cooling enhancement. Direct-to-chip cooling moves liquid closer to the heat source and is generally better for the highest-density AI hardware. The best option depends on your density, existing facility design, and operational maturity.
How do we choose between colocation and building a private AI cluster?
Use time-to-deploy, power certainty, cooling maturity, connectivity, and operational burden as your main decision criteria. Colocation is usually faster and less operationally complex, while a private build can make sense at very large scale or when you need deep customization. The right answer is the one that supports your launch timeline and risk tolerance.
What should be in an AI colocation checklist?
At minimum: ready-now power, expansion power, rack density support, liquid cooling capability, network diversity, cloud on-ramps, security controls, monitoring visibility, support SLAs, and clear commercial terms. You should also validate utility dependencies, maintenance windows, and the provider’s actual experience with AI hardware.
Related Reading
- Redefining AI Infrastructure for the Next Wave of Innovation - A strategic overview of why power and cooling are now the central AI bottlenecks.
- What Growth in Liquid Cooling Markets Means for Outdoor Tech - A practical look at how liquid cooling is reshaping adjacent hardware categories.
- Nearshoring, Sanctions, and Resilient Cloud Architecture - A risk-first framework for choosing infrastructure locations.
- Designing a Capital Plan That Survives Tariffs and High Rates - Useful for planning large infrastructure purchases under uncertainty.
- Competitive Intelligence Pipelines: Building Research-Grade Datasets from Public Business Databases - Helpful if you want a model for building observable, repeatable infrastructure processes.
Jordan Ellis
Senior DevOps and Cloud Infrastructure Editor
Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.