WP103: Improving PUE with Airflow Management

February 1, 2010

Introduction

As energy prices rise and concerns about carbon emissions and global warming grow, there is increasing motivation to lower the power usage effectiveness (PUE) of data centers worldwide. The PUE of a data center is defined as:

\[PUE=\frac{\text{Total Facility Power}}{\text{Total IT Power}}\]

The Total Facility Power comprises all of the power delivered to the entire data center, while the Total IT Power is only that which is delivered to the IT equipment. A careful look at this ratio (Figure 1) reveals that the Total Facility Power is dominated by the power to drive the data center cooling system (45%) and the power consumed by the IT equipment (30%). Put another way, roughly two-thirds of the non-IT power is consumed by the cooling system. By focusing on the power to drive the cooling system and IT equipment as the dominant parameters, an alternative ratio can be defined, often referred to as the Cooling Load Factor (CLF):

\[CLF=\frac{\text{Total Cooling Power}}{\text{Total IT Power}}\]

The Cooling Load Factor is the total power consumed by the chillers, CRACs, cooling towers, pumps, and other cooling-related equipment, divided by the total IT equipment power. To accurately determine the total annual cost of power to drive the cooling system for a given data center, one must take into account the kind of cooling unit (gas or liquid), the efficiency of the motors that drive the fans and compressors, and the specific geographic location of the data center.
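To make the two ratios concrete, the short Python sketch below applies the Figure 1 percentages to a hypothetical 1,000 kW facility; the wattage values are illustrative assumptions, not measurements from the data center modeled later in this paper.

```python
# Illustrative PUE and CLF calculation. The power values below simply apply
# the Figure 1 percentages (30% IT, 45% cooling, 25% other) to an assumed
# 1,000 kW facility; they are not measurements from the modeled data center.

def pue(total_facility_kw: float, total_it_kw: float) -> float:
    """Power usage effectiveness: total facility power / total IT power."""
    return total_facility_kw / total_it_kw

def clf(total_cooling_kw: float, total_it_kw: float) -> float:
    """Cooling Load Factor: total cooling power / total IT power."""
    return total_cooling_kw / total_it_kw

total_it_kw = 300.0       # IT equipment (30%)
total_cooling_kw = 450.0  # chillers, CRACs, towers, pumps (45%)
other_kw = 250.0          # UPS losses, lighting, distribution, etc. (25%)
total_facility_kw = total_it_kw + total_cooling_kw + other_kw

print(f"PUE = {pue(total_facility_kw, total_it_kw):.2f}")  # 3.33
print(f"CLF = {clf(total_cooling_kw, total_it_kw):.2f}")   # 1.50
```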

If power measurements of the equipment are not feasible, estimates must be made that often require detailed knowledge from the cooling manufacturer. Rather than focus on the power required by the cooling equipment, one can instead use the cooling capacity of the equipment. In this sense, another modified ratio can be defined: Total Cooling Capacity/IT Power Consumed. This ratio can be referred to as the “Cooling Capacity to IT Load Ratio”; by focusing on these two parameters and driving this ratio down as close as possible to 1.0, the PUE will decline as well.

The cooling for a given data center consists of two primary components: the total capacity of the cooling system, typically measured in tons or kilowatts, and its related airflow, typically measured in cubic feet per minute (CFM). It is important to consider both of these parameters, since the reason for hot spots in many data centers is not the total cooling capacity (this is typically more than adequate) but rather the inability to get the cold air to where it is needed.

Baseline Case

To illustrate this point, we will use computational fluid dynamics (CFD) to consider a hypothetical 2,500 square foot data center, as illustrated in Figure 2.

For this data center, eight Liebert FH600C cooling units are deployed in a slightly staggered (asymmetric) pattern around the perimeter of the room, providing a total cooling capacity of 1,724 kW. The thermal load consists of six rows of equipment racks, each row containing 20 racks, and each rack with a thermal load of 7 kW, for a total of 840 kW. This results in a Cooling Capacity to IT Load Ratio of 2.0, a full 100% higher than should be required to cool the equipment. Notice, however, that the airflow supplied by each of the eight FH600C units is only 17,100 CFM, for a total airflow capacity of 136,800 CFM. Each 7 kW rack requires 1,091 CFM to keep the temperature rise across the rack to a 20°F maximum, so with 120 racks in the room, the total rack demand is 130,920 CFM, only about 4% less than the total supplied by the Liebert cooling units. This thin airflow margin becomes a significant consideration when attempting to reduce the overall power consumption.
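The capacity and airflow bookkeeping above is easy to script. A minimal Python sketch using the figures quoted in the text (the 1,091 CFM per-rack requirement is taken as given rather than re-derived):

```python
# Baseline-room bookkeeping: cooling capacity vs. IT load, and supplied vs.
# required airflow, using the figures quoted in the text.

NUM_CRACS = 8
TOTAL_CAPACITY_KW = 1_724.0      # eight Liebert FH600C units
AIRFLOW_PER_CRAC_CFM = 17_100.0

NUM_RACKS = 120                  # six rows of 20 racks
LOAD_PER_RACK_KW = 7.0
CFM_PER_RACK = 1_091.0           # holds the rise across a 7 kW rack to ~20°F

it_load_kw = NUM_RACKS * LOAD_PER_RACK_KW                # 840 kW
capacity_ratio = TOTAL_CAPACITY_KW / it_load_kw          # ~2.05

supplied_cfm = NUM_CRACS * AIRFLOW_PER_CRAC_CFM          # 136,800 CFM
required_cfm = NUM_RACKS * CFM_PER_RACK                  # 130,920 CFM
airflow_margin = (supplied_cfm - required_cfm) / required_cfm

print(f"Cooling Capacity to IT Load Ratio: {capacity_ratio:.2f}")
print(f"Airflow: {supplied_cfm:,.0f} CFM supplied vs. {required_cfm:,.0f} CFM required")
print(f"Airflow margin: {airflow_margin:.1%}")           # ~4.5%, vs. a ~100% capacity margin
```

The contrast between the roughly 100% margin in cooling capacity and the roughly 4% margin in airflow is the key point of the baseline case.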

One way of improving the PUE for this data center is to reduce the Cooling Capacity to IT Load Ratio. The Liebert FH600C uses an 11 kW centrifugal blower to supply air to the data center. If we assume that the cost of electricity is $0.10/kWh, the annual cost of operating just the blower for this unit would approach $10,000, and would be nearly twice that amount when including the work done by the compressor. Shutting down one of these units would reduce the PUE and save money. The question, however, is whether or not this can be done without causing excessive temperatures at any of the server inlets.
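The operating-cost figure can be reproduced with a few lines of arithmetic; the continuous-duty assumption (8,760 hours per year) and the doubling factor for compressor work follow the statements above rather than measured data.

```python
# Rough annual cost of running one CRAC blower continuously, assuming an
# 11 kW blower, 8,760 hours of operation per year, and $0.10/kWh electricity.

BLOWER_KW = 11.0
HOURS_PER_YEAR = 8_760
RATE_PER_KWH = 0.10

blower_cost = BLOWER_KW * HOURS_PER_YEAR * RATE_PER_KWH
print(f"Blower only:             ${blower_cost:,.0f}/yr")  # ~$9,600

# The text notes the total is nearly twice the blower cost once compressor
# work is included; that factor of two is an assumption stated in the text.
unit_cost = 2.0 * blower_cost
print(f"Blower plus compressor: ~${unit_cost:,.0f}/yr")    # ~$19,300
```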

While shutting down a CRAC unit looks like a viable option, only a CFD model can identify which CRAC is the best one to shut down and whether doing so will result in troublesome hot spots on any of the equipment. Figure 3 illustrates the rack inlet temperatures in the data center with all CRACs operating normally.

As can be seen, there are already hot spots located at the ends of the rack rows. In some cases, the rack inlet temperatures exceed the ASHRAE recommended maximum of 80.6°F. The maximum rack inlet temperature for this case is 82°F and the maximum temperature in the room is 91°F. Turning off both the fan and coil on any of the 8 CRAC units would create a scenario where the total cooling capacity would still be sufficient, but, due to the lack of proper airflow to some servers, excessive temperatures may result. Using CFD, it is a straightforward matter to test this possibility and find out the consequences when each one of the CRACs is disabled.
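A quick check using the figures from the baseline case shows why airflow, rather than capacity, is the concern. With any one unit off, the remaining seven supply

\[7 \times 17{,}100\ \text{CFM} = 119{,}700\ \text{CFM} < 130{,}920\ \text{CFM},\]

a shortfall of roughly 9% relative to the total rack demand, even though the remaining cooling capacity (about 1,509 kW) still comfortably exceeds the 840 kW IT load.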

To compare scenarios, a CFD model was created using CoolSim that allowed a series of eight simulations to be run concurrently, each with a different CRAC unit shut off in round-robin fashion. A summary of the simulation results is presented in Table 1.

The best case, highlighted in green, corresponds to the elimination of CRAC F (lower left corner in Figure 3). It has the smallest impact on the maximum rack inlet temperature, and drives up the maximum temperature in the room by only 3°F, from 91°F to 94°F, according to the detailed CFD output reports. The resulting Cooling Capacity to IT Load Ratio decreases by 1/8, or 12.5%, when this CRAC is disabled, reducing the annual operating cost by thousands of dollars. But even in this best case, the rack inlet temperature still reaches a peak of 85°F at one of the racks, exceeding the ASHRAE recommended maximum for inlet temperature. Therefore, the approach of simply turning off one or more CRAC units will not work for this data center without first making some adjustments to the room configuration to improve its thermal efficiency.

Improving Thermal Efficiency

There are two common methods for improving the thermal efficiency of data centers: hot aisle containment and cold aisle containment. To help determine which is more effective for this specific data center, the initial model can be quickly modified to consider each scenario so that the outcomes can be compared. There are several things to consider when deciding which approach is best for a given data center. For example, cold aisle containment is typically less expensive to implement because perforated tiles are often located near the rack inlets and therefore less ductwork is required. But by fully containing the cold supply air, the rack exhaust drives the ambient room temperature up. Depending on the resulting room temperature, this approach may not be comfortable for service technicians or administration personnel working in the room.

The opposite problem occurs with hot aisle containment, as the entire room becomes part of the cold supply, driving the ambient room temperature downward. In this scenario, however, there is additional heat contributed by other objects in the room such as walls, UPSs, lights, and other equipment. The additional heat tends to increase the ambient temperature in the room, but if the supply air is well directed towards the rack inlets, the additional heat will have less impact on the equipment. Cost is also a primary decision factor as containment strategies of any kind require modifying the data center while in operation.

Building virtual models of these two approaches can help determine which one is optimal for a given data center layout. While complete cold aisle containment is possible in a data center with a room return, complete hot aisle containment is not, since it requires a ceiling return. Thus, two partial containment strategies were considered, in which impermeable walls are positioned at the ends of either the hot or cold aisles. The modified scenarios are shown in Figure 4.

Table 2 shows a comparison of the two containment approaches with all CRACs on using the maximum rack inlet temperature and maximum room temperature as common metrics. In both cases, no other heat sources in the room were included. Both methods drop the maximum rack inlet temperature compared to the original case with no containment, but the partial cold aisle containment strategy is preferable. The difference between the strategies may be due to the fact that there are three containment regions for the cold aisle containment case compared to two for the hot aisle containment case. More contained space may lead to reduced mixing between the hot and cold air in the room. For the cold aisle containment strategy, the maximum inlet temperature drops by 4 degrees to 78°F, compared to a drop of only 1 degree for the hot aisle containment case. Partial cold aisle containment leads to a 7 degree drop in the maximum ambient room temperature as well.

Using partial cold aisle containment, the issue of reducing power consumption by the cooling system can be considered once again. In Table 3, the results of a CRAC failure analysis indicate that if the data center now operates with CRAC C turned off, the maximum rack inlet temperature is the same as it was in the baseline case with all CRACs on. The maximum rack inlet temperature is still above the ASHRAE recommended maximum value (80.6°F), but it is well below the ASHRAE allowed maximum value (90°F). This exercise is evidence of the importance of using flow simulation to assess modifications to a data center and determine which, if any, cooling units can be disabled to improve data center efficiency.

Another benefit of using a containment strategy is the improvement in overall reliability of the facility. Without any containment, the CRAC failure analysis predicted worst-case rack inlet temperatures as high as 91°F, above the ASHRAE allowed maximum value. By adding even a simplified partial cold aisle containment solution, the reliability of the data center is increased: while the data center can be run with all 8 CRACs on, the results show that if any unit except CRAC D fails or must be taken down for servicing, the maximum rack inlet temperatures will not exceed 90°F.

In summary, this particular data center illustrates how CFD can be used to compare some of the many techniques available to improve PUE. When striving to improve PUE, data center managers should focus on the Cooling Load Factor as a primary target, along with the purchase of ENERGY STAR rated equipment. If the cooling power consumption values are not readily accessible, then focusing on the Cooling Capacity to IT Load Ratio is a reasonable alternative. To test whether reductions in cooling are feasible, CFD can be used to compare and contrast alternative approaches. Of course, modeling is not meant to be a substitute for good engineering. CFD models are based on assumptions, so the results should be validated with measurements to ensure that the model represents real-world behavior. Yet modeling will always produce a relative comparison between one design approach and another, and it is a helpful mechanism for supporting the decision-making process.

The PUE metric is most heavily influenced by the power to drive the IT load and the cooling necessary to sustain the resulting thermal load. By focusing on how the cold air is delivered to the servers and how the hot air is returned to the CRACs, the thermal efficiency of a data center can be improved significantly. Understanding the airflow patterns presents opportunities to reduce the existing cooling capacity and its related costs, improve the reliability of the data center, or add more IT equipment without the need for more cooling capacity. Any of these outcomes will also reduce the overall data center PUE. By focusing on improving airflow, managers can get more output from existing cooling capacity without the expensive capital expenditures associated with adding or upgrading cooling units. With today’s high-density servers and increased rack thermal loads, traditional “back of the envelope” calculations are no longer sufficient; a CFD modeling tool is needed.