Wednesday, June 01, 2022

Frontier is the Fastest Supercomputer on the Planet - 1 June 2022

Announced on Monday, 30 May 2022, the Frontier computer is the fastest computer on the planet, driven by AMD EPYC CPUs and AMD Instinct GPUs.  I was a manager on the research projects that led to this success.  The research ideas were turned into reality by a large team of dedicated engineers across a variety of disciplines.  The AMD part of the story began in about 2014.

The AMD Research "FastForward" project on supercomputing started in about 2014, when the fastest computer in the world ran at 33.86 Petaflops/sec (0.034 Exaflops/sec).  That machine was the Tianhe-2, a supercomputer developed by China’s National University of Defense Technology.  The Top500 list tracks the fastest computers, and in 2014 Tianhe-2 sat at the top of it.  In the research project, our challenge was to invent a 1 Exaflop/sec (1000 Petaflops/sec) machine that used 20 Megawatts of power; at the time, it was predicted that such a machine would require nearly 100 MW if built using then-current technology.  Further, it required so many chips that reliability calculations gave dismal predictions: the machine might not stay up long enough between failures to deliver useful results.  It took years of research from 2014, plus the blood and sweat of hundreds of engineers across many disciplines, but the Cray branch of HPE delivered Frontier at 1.1 Exaflops/sec using 21 MW of power.  The Next Platform covered the announcement in two news reports.
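To put those numbers side by side, here is a minimal sketch of the arithmetic in Python.  It uses only the rounded figures quoted above; the function and constant names are my own, and the "nearly 100 MW" prediction is treated as exactly 100 MW for illustration.

```python
# Rounded figures quoted in the post; units are Petaflops/sec and Megawatts.
TIANHE2_PFLOPS = 33.86     # fastest machine in 2014
TARGET_PFLOPS = 1000.0     # 1 Exaflop/sec research goal
TARGET_MW = 20.0           # power budget for the goal
PREDICTED_MW = 100.0       # assumption: "nearly 100 MW" rounded to 100
FRONTIER_PFLOPS = 1100.0   # 1.1 Exaflops/sec as delivered
FRONTIER_MW = 21.0         # power as delivered


def gflops_per_watt(pflops: float, megawatts: float) -> float:
    """Convert Petaflops/sec and Megawatts to Gigaflops/sec per Watt."""
    gigaflops = pflops * 1e6   # 1 Petaflop = 1e6 Gigaflops
    watts = megawatts * 1e6    # 1 Megawatt = 1e6 Watts
    return gigaflops / watts


def speedup(new_pflops: float, old_pflops: float) -> float:
    """How many times faster the new machine is than the old one."""
    return new_pflops / old_pflops


if __name__ == "__main__":
    print("Goal efficiency (GF/W):     ", gflops_per_watt(TARGET_PFLOPS, TARGET_MW))
    print("Predicted efficiency (GF/W):", gflops_per_watt(TARGET_PFLOPS, PREDICTED_MW))
    print("Frontier efficiency (GF/W): ", round(gflops_per_watt(FRONTIER_PFLOPS, FRONTIER_MW), 1))
    print("Goal vs. Tianhe-2 speedup:  ", round(speedup(TARGET_PFLOPS, TIANHE2_PFLOPS), 1))
```

With these inputs, the goal works out to about 50 Gigaflops per Watt, roughly five times the efficiency implied by the 100 MW prediction, and the target machine is close to 30 times faster than Tianhe-2.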

AMD EPYC CPUs and AMD Instinct GPUs (with AI/ML extensions) form the heart of the Oak Ridge Frontier machine.  Once the system is tuned, it is expected to deliver 1.5 Exaflops/sec on the High-Performance Linpack (HPL) benchmark.

When the FastForward (FF) project started, AMD had a declining number of systems on the Top500 list, and I doubt any of them had AMD GPUs.  The list was dominated by Intel (CPUs), IBM (CPUs), and Nvidia (GPUs).  The Tianhe-2 (China) and Fugaku (Japan) were unusual.  There were a lot of people, many within AMD, who thought AMD Research was wasting its time on supercomputers and high-performance computing (HPC).  The AMD FF project received some outside funding from the US Government's Exascale Computing Project (ECP), which gave us (AMD Research) a bit of independence to pursue the HPC research.  The external funding even helped AMD Research survive and grow during times when the rest of AMD was shrinking and suffering.  FastForward was followed by DesignForward, FF2, DF2, and finally PathForward (PF) funding from the ECP.  This money did not cover the costs of the research, but it provided a reliable core of funding around which we could build a stable series of projects.

Innovation was rampant in the FF work and continued into successive projects.  The AMD Research group varied in size (it grew), but at one point, it represented about 1% of the AMD engineering population.  At the peak, we produced about 40% of the AMD patents. Corporate-wide.  Measured by patents, the average Research member was roughly 40x as productive as the average development engineer.  With an increased emphasis on patents, the productivity of the rest of the corporation has since risen, and the Research group now produces about 25% of the AMD patents.
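The patent comparison is simple ratio arithmetic, and a short sketch makes the two possible readings explicit.  It assumes the 1% headcount share and the 40% patent share applied at the same moment, which the post does not guarantee ("at one point" versus "at the peak"); the function name is hypothetical.

```python
def per_capita_patent_ratios(headcount_share: float, patent_share: float):
    """Return (ratio vs. company-wide average, ratio vs. everyone else).

    headcount_share: fraction of engineers in the group (e.g. 0.01)
    patent_share:    fraction of patents the group produced (e.g. 0.40)
    """
    # Patents per person in the group, in units of the company-wide average.
    group_rate = patent_share / headcount_share
    # Patents per person among all remaining engineers, same units.
    rest_rate = (1.0 - patent_share) / (1.0 - headcount_share)
    return group_rate, group_rate / rest_rate


if __name__ == "__main__":
    vs_average, vs_rest = per_capita_patent_ratios(0.01, 0.40)
    print("Versus the company-wide average:", round(vs_average, 1))
    print("Versus non-Research engineers:  ", round(vs_rest, 1))
```

Under these round numbers the 40x figure is the ratio against the company-wide average; against only the engineers outside Research, the same inputs give about 66x.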

There was a strong body of publications coming from AMD Research.  The 50-odd researchers produced more peer-reviewed papers for major conferences than companies like Intel, Apple, Microsoft, and Google, whose research staffs were literally 10x the size.  To be fair, this was not because AMD Research was smarter, but because AMD had a liberal publication policy.  This publication policy changed dramatically in about 2018-2019.  A publication went out (not from Research!) that revealed AMD confidential information, and so the valves were closed for a while.  After internal debate, the valves were slightly reopened, but never to the same level of disclosure that had previously been allowed.  Certain topics were not eligible for publication because the topics themselves were sensitive, and this caused significant internal friction.  But such are the requirements of corporate research as distinguished from academic research.  AMD Research's publication output remains strong but is not as voluminous as it once was.

The next machine is to be El Capitan at Lawrence Livermore National Laboratory, and it is projected to deliver over 2 Exaflops/sec with the next generation of AMD CPU and GPU silicon.