AI Training Strategies Tested on World's Fastest Supercomputer
by Clarence Oxford
Los Angeles CA (SPX) May 16, 2024

Researchers at Oak Ridge National Laboratory (ORNL) have investigated techniques for training very large AI models on the Frontier supercomputer.

The study, led by Sajal Dash, Feiyi Wang, and Prasanna Balaprakash, used Frontier, the world's first exascale supercomputer, for the initial stages of training a large language model. The team tested how models with 22 billion, 175 billion, and 1 trillion parameters could run across 128, and later 384, of Frontier's more than 9,400 nodes; it did not train a full model to completion.

Large language models aim to mimic human brain patterns in learning and recognizing words and numbers, improving over time with more training. The goal is to create a model that can apply learned knowledge to new, unfamiliar tasks.

Traditionally, the resources needed for such training are held by private companies, limiting research opportunities and verification. Frontier's supercomputing power, however, offers new possibilities for training AI models more efficiently.

"Traditionally, this process has relied on expert knowledge or on trial and error," said Prasanna Balaprakash, ORNL's director of AI programs. "One of the highlights of our work in this study is the automation of identifying high-performing strategies among a vast array of options. We leveraged DeepHyper, an open-source scalable tuning software, to automatically determine the optimal settings. We plan to extend this automated approach to fine-tune system-level performance and enhance efficiency at an...

Training a trillion-parameter large language model from start to finish without optimizations would take months, even at Frontier's speeds. The ORNL study examined data parallelism, which breaks a large problem into smaller parts that can be solved simultaneously, as a way both to speed up training and to carry training across different GPU platforms.
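
As a deliberately tiny illustration of the data-parallel idea (written for this article, not drawn from the ORNL code), the sketch below gives each simulated worker a full copy of a single model weight and its own shard of the batch, then averages the workers' gradients before the update; on real hardware, that averaging is the all-reduce communication step.

```python
def local_gradient(weight, shard):
    """Gradient of mean squared error for the toy model y = weight * x,
    computed on one worker's shard of the data."""
    return sum(2.0 * (weight * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(weight, batch, num_workers=4, lr=0.01):
    """One training step: shard the batch, compute gradients independently
    (in parallel on real hardware), then average them (the all-reduce)."""
    shards = [batch[i::num_workers] for i in range(num_workers)]
    grads = [local_gradient(weight, shard) for shard in shards]
    avg_grad = sum(grads) / num_workers
    return weight - lr * avg_grad

if __name__ == "__main__":
    batch = [(x, 3.0 * x) for x in range(1, 9)]  # data from y = 3x
    w = 0.0
    for _ in range(100):
        w = data_parallel_step(w, batch)
    print(f"learned weight: {w:.3f}")  # converges toward 3.0
```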

"It's about finding the best combination of training strategies while getting the best throughput," Dash said. "Most deep-learning frameworks target the GPUs made by NVIDIA rather than the GPUs made by AMD that power Frontier. We wanted to see if existing models could run on Frontier, how to make the best use of Frontier's computing power and how to make that level of performance possible across GPU platforms.

"We can't train a model this size on a single GPU or a single node, for example, and every time we cross the barrier between nodes that requires more communication that consumes more time. How do we slice up the model across GPUs so that we can fit and train the model without losing too much time and energy communicating between nodes?"

The researchers found that a blend of parallelism strategies worked best when tailored to the computing platform, but they said their work is far from finished.

"The efficiency we achieved on Frontier with this model was decent but not decent enough," Wang said. "At extreme scale, we achieved 30% efficiency - which means we left about 70% of Frontier's computing power on the floor. We need much more optimization to make the machine more efficient at this scale."

Next steps include training a model further with peer-reviewed scientific data across more nodes.

"This study and our findings aren't so much a manual as a potential set of guidelines for users training a large model," Dash said. "They can draw from our experience to decide how to use Frontier's resources to train their particular model and make the most effective use of their allotted computing time."

The study was presented at ISC High Performance 2024 in Hamburg, Germany. Collaborators included Isaac Lyngaas, Junqi Yin, Xiao Wang, and Guojing Cong of ORNL and Romain Egele of Paris-Saclay University.

The study focused on optimizing the use of GPUs for training AI, with each of Frontier's nodes relying on four AMD MI250X GPUs.

The training ran for a few hours on about 100 million tokens of test data, a small fraction of the data needed for a trillion-parameter model.
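
To put 100 million tokens in perspective, a common rule of thumb from the Chinchilla scaling work, an assumption here rather than anything stated by ORNL, is that a compute-optimal model wants roughly 20 training tokens per parameter:

```python
params = 1e12                 # trillion-parameter model
tokens_used = 100e6           # tokens in the ORNL test run
tokens_needed = 20 * params   # Chinchilla-style rule of thumb (assumption)
print(f"share of a full training run: {tokens_used / tokens_needed:.6%}")
```

By that yardstick, the test run touched about five millionths of the data a full training run would consume.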

"This study was largely an exercise to show we can train this particular size of model on Frontier at this particular scale with this particular level of efficiency," Wang said. "We didn't get anywhere near the finish line of a complete large language model."

Research Report: Optimizing Distributed Training on Frontier for Large Language Models

Related Links
Oak Ridge National Laboratory
Innovative and Novel Computational Impact on Theory and Experiment Program
Space Technology News - Applications and Research
