In the decades since Seymour Cray developed what is widely regarded as the world’s first supercomputer, the CDC 6600, an arms race has been underway in the High Performance Computing (HPC) community. The goal: to increase performance, by any means and at any cost.
Driven by advances in computing, storage, networking, and software, the performance of leading systems has grown a trillion times since the CDC 6600 was launched in 1964, from millions of floating point operations per second (megaFLOPS) to quintillions (exaFLOPS).
The current holder of the crown, a colossal US supercomputer named Frontier, is capable of 1.102 exaFLOPS in the High Performance Linpack (HPL) benchmark. But it is suspected that even more powerful machines are at work elsewhere, behind closed doors.
The arrival of so-called exascale supercomputers is expected to benefit virtually all sectors – from science to cybersecurity, healthcare to finance – and pave the way for powerful new AI models that would otherwise take years to train.
However, a speed increase of this magnitude comes at a price: energy consumption. Running at full throttle, Frontier draws up to 40 MW of power, roughly the same as 40 million desktop computers.
Supercomputers have always been about pushing the limits of what’s possible. But as the need to minimize emissions becomes more and more obvious and energy prices continue to rise, the HPC industry will need to reassess whether it is still worth sticking to its original guiding principle.
Performance versus efficiency
One of the organizations at the forefront of this issue is the University of Cambridge, which, in collaboration with Dell Technologies, has developed a number of supercomputers with energy efficiency at the forefront of the design.
The Wilkes3, for example, occupies only the hundredth position in the overall performance charts, but ranks third in the Green500, a ranking of HPC systems based on performance per watt of energy consumed.
In an interview with TechRadar Pro, Dr. Paul Calleja, director of Research Computing Services at the University of Cambridge, explained that the institution is far more interested in building highly efficient, productive machines than in extremely powerful ones.
“We are not interested in very large systems, because they are highly specific point solutions. But the technologies deployed inside them have much broader application and will enable systems an order of magnitude slower to run in a much more cost-effective and energy-efficient way,” says Dr. Calleja.
“That way, you democratize access to computing for many more people. We are interested in using the technologies designed for those big, cutting-edge systems to create much more sustainable supercomputers for a wider audience.”
Dr. Calleja also predicts an increasingly strong focus on energy efficiency in the coming years, both in the HPC sector and in the wider data center community, where energy consumption accounts for upwards of 90% of costs, we were told.
The recent fluctuations in energy prices tied to the war in Ukraine have also made running supercomputers dramatically more expensive, especially in the context of exascale computing, further underlining the importance of performance per watt.
In the case of Wilkes3, the university found there were a number of optimizations that improved energy efficiency. For example, by lowering the clock frequency at which certain components ran, depending on the workload, the team was able to cut power consumption by 20-30%.
“Within a particular architectural family, clock speed has a linear relationship with performance, but a squared relationship with power consumption. That’s the killer,” explained Dr. Calleja.
“Lowering the clock frequency reduces power draw much faster than it reduces performance, but it also lengthens the time it takes to complete a task. So what we should be looking at is not the power consumed during operation, but the energy actually used to complete the task. There is a sweet spot.”
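To make that trade-off concrete, here is a minimal sketch of the reasoning, not Wilkes3 data: it assumes dynamic power scales with the square of the clock ratio (as Dr. Calleja describes), a fixed static power is always drawn, and runtime for a fixed job scales inversely with the clock. All constants are illustrative.

```cpp
#include <cstdio>

// Illustrative model only (not Wilkes3 measurements): dynamic power grows
// with the square of the clock ratio, a fixed static power is always drawn,
// and runtime for a fixed amount of work scales inversely with the clock.
constexpr double P_STATIC_W  = 80.0;    // hypothetical static/idle power (W)
constexpr double P_DYNAMIC_W = 200.0;   // hypothetical dynamic power at full clock (W)
constexpr double RUNTIME_S   = 3600.0;  // hypothetical job runtime at full clock (s)

// Energy (joules) needed to finish the whole job at a given clock ratio.
double energy_to_solution(double clock_ratio) {
    double power_w   = P_STATIC_W + P_DYNAMIC_W * clock_ratio * clock_ratio;
    double runtime_s = RUNTIME_S / clock_ratio;
    return power_w * runtime_s;
}

int main() {
    double best_ratio  = 1.0;
    double best_energy = energy_to_solution(best_ratio);
    // Sweep clock ratios from 50% to 120% of the reference clock.
    for (double r = 0.50; r <= 1.20; r += 0.01) {
        double e = energy_to_solution(r);
        if (e < best_energy) {
            best_energy = e;
            best_ratio  = r;
        }
    }
    std::printf("Energy at full clock: %.0f kJ\n", energy_to_solution(1.0) / 1000.0);
    std::printf("Sweet spot: ~%.2f of full clock, using %.0f kJ\n",
                best_ratio, best_energy / 1000.0);
    return 0;
}
```

Under these assumptions, the energy-to-solution minimum sits well below the maximum clock even though each job takes longer to finish, which is exactly the sweet spot described above.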
Software is king
In addition to tuning hardware configurations for specific workloads, there are also a number of optimizations that can be made elsewhere in the context of storage and networking, and in related disciplines such as cooling and rack design.
However, when asked where exactly he would like to see the resources devoted to improving energy efficiency, Dr. Calleja explained that the focus should be primarily on software.
“Hardware is not the problem, it’s about application performance. This will be the main bottleneck moving forward,” he said. “Exascale systems today are based on GPU architectures, and the number of applications that can run efficiently at large scale on GPU systems is small.”
“To really take advantage of today’s technology, we need to put a lot of emphasis on application development. The development lifecycle stretches over decades; the software in use today was developed 20-30 years ago, and it’s difficult when you have such long-lived code that needs to be redesigned.”
The problem, however, is that the HPC industry hasn’t made a habit of thinking about software first. Historically, much more attention has been paid to hardware because, says Dr. Calleja, “it’s easy; you just buy a faster chip. You don’t have to think smart.
“Although we had Moore’s Law, with CPU performance doubling every eighteen months, nothing needed to be done [on a software level] to increase performance. But those days are gone. Now, if we want progress, we have to go back and redesign the software.”
Dr. Calleja reserved some praise for Intel in this regard. As the server hardware space grows more diverse from a vendor perspective (in most respects a positive development), application compatibility can become a problem, but Intel is working on a solution.
“One of the differentiators I see for Intel is that it is investing an awful lot [of both funds and time] in the oneAPI ecosystem, to develop code portability across different types of silicon. We need those kinds of toolchains to enable tomorrow’s applications to take advantage of emerging silicon,” he notes.
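To illustrate the kind of portability being described, here is a minimal SYCL 2020 sketch of the sort the oneAPI toolchain is built around (an illustrative example, not code discussed in the interview): a single kernel source that the runtime can dispatch to whichever accelerator, or CPU fallback, it finds.

```cpp
// Minimal SYCL 2020 sketch: one kernel source, multiple possible back-ends.
// Build with a SYCL-capable compiler (for example, the oneAPI DPC++ compiler).
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    // A default-constructed queue binds to whatever device the runtime finds:
    // a GPU if one is present, otherwise the host CPU.
    sycl::queue q;
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";

    constexpr size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    {   // Buffers manage host<->device data movement automatically.
        sycl::buffer<float> A(a.data(), sycl::range<1>(n));
        sycl::buffer<float> B(b.data(), sycl::range<1>(n));
        sycl::buffer<float> C(c.data(), sycl::range<1>(n));

        q.submit([&](sycl::handler& h) {
            sycl::accessor ra(A, h, sycl::read_only);
            sycl::accessor rb(B, h, sycl::read_only);
            sycl::accessor wc(C, h, sycl::write_only, sycl::no_init);
            // The same vector-add kernel runs unchanged on any supported device.
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                wc[i] = ra[i] + rb[i];
            });
        });
    }   // Leaving this scope waits for the kernel and copies results back into c.

    std::cout << "c[0] = " << c[0] << " (expected 3)\n";
    return 0;
}
```

The appeal of such toolchains, in the terms set out above, is that long-lived application code does not have to be rewritten from scratch every time the underlying silicon changes.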
Separately, Dr. Calleja called for a greater focus on “scientific need”. Too often, things get lost in translation, creating a mismatch between the hardware and software architectures and the actual needs of the end user.
A more vigorous approach to cross-industry collaboration, he says, would create a “circle of success” encompassing users, service providers, and vendors, which would translate into benefits from both a performance and an efficiency perspective.
A zettascale future
Predictably, with the symbolic exascale milestone now toppled, attention will turn to the next one: zettascale.
“Zettascale is just the next flag in the ground,” said Dr. Calleja, “a totem that highlights the technologies needed to reach the next milestone in computing advances that are unobtainable today.”
“The world’s fastest systems are extremely expensive for the scientific output they produce. But they are important because they demonstrate the art of the possible and push the industry forward.”
Whether systems capable of one zettaFLOPS, a thousand times more performant than the current crop, can be developed in a sustainable manner will depend on the industry’s capacity for invention.
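For a sense of scale, here is a back-of-envelope sketch using only the Frontier figures quoted earlier in this article (1.102 exaFLOPS at up to 40 MW); real future systems would not, of course, scale linearly like this.

```cpp
#include <cstdio>

int main() {
    // Figures quoted earlier in the article (not new measurements).
    const double frontier_exaflops = 1.102;  // HPL result
    const double frontier_mw       = 40.0;   // quoted peak power draw

    // Efficiency in GFLOPS per watt: 1 exaFLOPS = 1e9 GFLOPS, 1 MW = 1e6 W.
    const double gflops_per_watt = (frontier_exaflops * 1e9) / (frontier_mw * 1e6);

    // Power needed for 1 zettaFLOPS (1,000 exaFLOPS) at the same efficiency.
    const double zetta_gw = (1000.0 / frontier_exaflops) * frontier_mw / 1000.0;

    std::printf("Frontier efficiency: ~%.1f GFLOPS/W\n", gflops_per_watt);
    std::printf("1 zettaFLOPS at that efficiency: ~%.1f GW\n", zetta_gw);
    return 0;
}
```

Tens of gigawatts is clearly not a workable power envelope, which is where that capacity for invention comes in.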
The relationship between performance and energy efficiency is not a binary one; a healthy dose of ingenuity will be required in every sub-discipline to deliver the necessary performance gains within an appropriate power envelope.
In theory, there is a golden ratio of performance to energy consumption, at which the benefits HPC brings to society can be said to justify the carbon emissions it produces.
The exact figure will, of course, remain elusive in practice, but the pursuit of the idea is itself, by definition, a step in the right direction.