Service continuity for HPC systems
Cooling issues with HPC systems
Service continuity for HPC systems depends directly on the level of redundancy provided for each technical component, namely power supply and cooling. It is important to distinguish between the availability of the computing systems and the security of the data processed by the HPC system (no data loss).
Because availability and data security are separate concerns, an HPC system typically incorporates different levels of redundancy for each of its components: HPC service nodes, storage for HPC data, and HPC compute nodes. The first two generally receive the most attention in terms of redundancy (high-quality, secure power supply and cooling).
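To make the link between redundancy and availability concrete, here is a minimal sketch in Python. It assumes that failures are independent, that one working unit is enough to carry the load, and that power and cooling are both required for the service to run. The function names and the availability figures are hypothetical, chosen only for illustration.

```python
def parallel_availability(unit_availability: float, units: int) -> float:
    """Availability of `units` redundant components when any single one suffices.

    Assumes independent failures (an idealization).
    """
    return 1.0 - (1.0 - unit_availability) ** units


def series_availability(*availabilities: float) -> float:
    """Availability of a chain in which every element is required."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# Hypothetical per-unit availabilities, for illustration only.
power = parallel_availability(0.999, units=2)    # two redundant power feeds
cooling = parallel_availability(0.995, units=2)  # two redundant cooling units

# The service needs both power and cooling, so they combine in series.
service = series_availability(power, cooling)

print(f"power:   {power:.6f}")
print(f"cooling: {cooling:.6f}")
print(f"service: {service:.6f}")
```

Under these assumptions, the combined figure is always lower than that of the weakest subsystem, which is why the redundancy level of each component matters.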
The main challenge to achieving very high availability in HPC systems is cooling—and specifically, heat dissipation at the point of use (within the rack).
HPC systems are becoming increasingly powerful, and electricity costs will continue to rise. It is therefore essential to achieve the highest possible efficiency. This requires minimizing the time the cooling plant spends in compressor (mechanical refrigeration) mode.
In a traditional system, this means raising the temperature of the water used to carry heat away from the racks (network equipment, servers, storage systems), so that the heat can be rejected without compressors for more of the year.
This drive for higher energy efficiency must keep pace with the sharp increase in power density per rack unit (U) in HPC systems, which is driven by GPU use, CPU type, and the number of CPUs and GPUs per U.
The "all-air" cooling method becomes difficult to sustain at densities above 25 kW per rack.
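To give a sense of why air struggles at these densities, the sketch below estimates the heat load of a rack and the airflow needed to carry it away, using the sensible-heat relation (airflow = power / (air density x specific heat x temperature rise)). The 25 kW threshold comes from the text; the function names, the air properties (about 1.2 kg/m3 and 1005 J/(kg.K)), the 1 kW per U figure, and the 12 K temperature rise are assumptions chosen for illustration.

```python
AIR_DENSITY_KG_M3 = 1.2            # assumed, air at roughly 20 °C near sea level
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0  # assumed, dry air
ALL_AIR_LIMIT_KW = 25.0            # threshold quoted in the text


def rack_power_kw(watts_per_u: float, populated_u: int) -> float:
    """Total rack heat load in kW, from average power per U and populated units."""
    return watts_per_u * populated_u / 1000.0


def required_airflow_m3_h(heat_kw: float, delta_t_k: float) -> float:
    """Airflow (m3/h) needed to remove `heat_kw` with an air temperature rise of `delta_t_k`."""
    mass_flow_kg_s = heat_kw * 1000.0 / (AIR_SPECIFIC_HEAT_J_KG_K * delta_t_k)
    return mass_flow_kg_s / AIR_DENSITY_KG_M3 * 3600.0


def exceeds_all_air_limit(heat_kw: float) -> bool:
    """True when the rack load is above the all-air threshold from the text."""
    return heat_kw > ALL_AIR_LIMIT_KW


# Hypothetical GPU rack: 40 populated U at an average of 1 kW per U.
load_kw = rack_power_kw(watts_per_u=1000.0, populated_u=40)
airflow = required_airflow_m3_h(load_kw, delta_t_k=12.0)

print(f"rack load: {load_kw:.1f} kW")
print(f"airflow needed at a 12 K rise: {airflow:.0f} m3/h")
print(f"above all-air limit: {exceeds_all_air_limit(load_kw)}")
```

With these assumed values, a single rack needs on the order of ten thousand cubic meters of air per hour, which illustrates why denser racks push toward liquid-based heat removal.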
What solution can make up for this shortfall in cooling capacity?
To address these issues, alternatives to underfloor air distribution and inter-rack cooling modules are being implemented, such as cold aisles, direct liquid cooling (DLC), and immersion cooling. None of these technologies is new, but the high cooling capacity they provide now outweighs the operational, security, and availability challenges they present.
Service continuity for HPC systems therefore depends on finding the optimal balance between capacity, data security, and performance. The cooling technologies selected are central to achieving this balance.
