The study led by Sajal Dash, Feiyi Wang, and Prasanna Balaprakash used Frontier, the world's first exascale supercomputer, for the initial stages of training a large language model. They tested how models with 22 billion, 175 billion, and 1 trillion parameters could run across 128 and, later, 384 of Frontier's more than 9,400 nodes. The team did not complete the training of a full model.
Large language models aim to mimic human brain patterns in learning and recognizing words and numbers, improving over time with more training. The goal is to create a model that can apply learned knowledge to new, unfamiliar tasks.
Traditionally, the computing resources needed for such training have been held by private companies, limiting opportunities for outside research and verification. Frontier's supercomputing power, however, offers new possibilities for training AI models more efficiently.
"Traditionally, this process has relied on expert knowledge or on trial and error," said Prasanna Balaprakash, ORNL's director of AI programs. "One of the highlights of our work in this study is the automation of identifying high-performing strategies among a vast array of options. We leveraged DeepHyper, an open-source scalable tuning software, to automatically determine the optimal settings. We plan to extend this automated approach to fine-tune system-level performance and enhance efficiency at an...
Training a large language model with a trillion parameters from start to finish without optimizations would take months, even at Frontier's speeds. The ORNL study examined data parallelism, which breaks a large problem into smaller pieces so they can be solved concurrently, as a way to speed up training and to carry that training across different GPU platforms.
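As a concrete illustration of the data-parallel pattern, here is a minimal sketch assuming PyTorch's DistributedDataParallel; the article doesn't name the training framework, so the model and hyperparameters are placeholders. Each process holds a full replica of the model, trains on its own slice of every batch, and averages gradients with its peers.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Each process owns one GPU and one shard of every batch; gradients
    # are averaged across processes so all replicas stay in sync.
    dist.init_process_group(backend="nccl")  # RCCL serves this role on AMD GPUs
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 1024, device="cuda")  # this rank's shard of the batch
        loss = model(x).square().mean()
        loss.backward()            # gradient all-reduce happens here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launched via: torchrun --nproc_per_node=<gpus> script.py
```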
"It's about finding the best combination of training strategies while getting the best throughput," Dash said. "Most deep-learning frameworks target the GPUs made by NVIDIA rather than the GPUs made by AMD that power Frontier. We wanted to see if existing models could run on Frontier, how to make the best use of Frontier's computing power and how to make that level of performance possible across GPU platforms.
"We can't train a model this size on a single GPU or a single node, for example, and every time we cross the barrier between nodes that requires more communication that consumes more time. How do we slice up the model across GPUs so that we can fit and train the model without losing too much time and energy communicating between nodes?"
The researchers found that a blend of parallelism strategies worked best when tailored to the computing platform, but said their work is far from finished.
"The efficiency we achieved on Frontier with this model was decent but not decent enough," Wang said. "At extreme scale, we achieved 30% efficiency - which means we left about 70% of Frontier's computing power on the floor. We need much more optimization to make the machine more efficient at this scale."
Next steps include training a model further with peer-reviewed scientific data across more nodes.
"This study and our findings aren't so much a manual as a potential set of guidelines for users training a large model," Dash said. "They can draw from our experience to decide how to use Frontier's resources to train their particular model and make the most effective use of their allotted computing time."
The study was presented at the ISC High Performance 2024 conference in Hamburg, Germany. Collaborators included Isaac Lyngaas, Junqi Yin, Xiao Wang, and Guojing Cong of ORNL and Romain Egele of Paris-Saclay University.
The study focused on optimizing the use of GPUs for training AI, with each of Frontier's nodes relying on four AMD MI250X GPUs.
The training ran for a few hours on about 100 million tokens of test data, a small fraction of the data needed for a trillion-parameter model.
"This study was largely an exercise to show we can train this particular size of model on Frontier at this particular scale with this particular level of efficiency," Wang said. "We didn't get anywhere near the finish line of a complete large language model."
Research Report: Optimizing Distributed Training on Frontier for Large Language Models
Related Links
Oak Ridge National Laboratory
Innovative and Novel Computational Impact on Theory and Experiment Program