24/7 Space News
ROBO SPACE
NASA-IBM team up for large language models for advanced research
by Derek Koehl for NASA News
Washington DC (SPX) Jun 27, 2024

Collaborations with private, non-federal partners through Space Act Agreements are a key component in the work done by NASA's Interagency Implementation and Advanced Concepts Team (IMPACT). A collaboration with International Business Machines (IBM) has produced INDUS, a comprehensive suite of large language models (LLMs) tailored for the domains of Earth science, biological and physical sciences, heliophysics, planetary sciences, and astrophysics and trained using curated scientific corpora drawn from diverse data sources.

INDUS contains two types of models: encoders and sentence transformers. Encoders convert natural language text into a numeric encoding that can be processed by the LLM. The INDUS encoders were trained on a corpus of 60 billion tokens encompassing astrophysics, planetary science, Earth science, heliophysics, and biological and physical sciences data. A custom tokenizer developed by the IMPACT-IBM collaborative team improves on generic tokenizers by recognizing scientific terms like "biomarkers" and "phosphorylated." Over half of the 50,000-word vocabulary contained in INDUS is unique to the specific scientific domains used for its training. The INDUS encoder models were used to fine-tune the sentence transformer models on approximately 268 million text pairs, including titles/abstracts and questions/answers.
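To illustrate why a domain-specific vocabulary matters, the sketch below splits a scientific term with a toy greedy longest-match tokenizer. It is not the INDUS tokenizer: the `tokenize` function and both vocabularies are hypothetical, chosen only to show that a term present in the vocabulary stays whole while a generic vocabulary breaks it into fragments.

```python
def tokenize(word, vocab):
    """Split `word` into the longest pieces found in `vocab`, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabularies, for illustration only.
GENERIC_VOCAB = {"phos", "ph", "or", "yl", "ated", "bio", "mark", "ers"}
DOMAIN_VOCAB = GENERIC_VOCAB | {"phosphorylated", "biomarkers"}

generic_pieces = tokenize("phosphorylated", GENERIC_VOCAB)
domain_pieces = tokenize("phosphorylated", DOMAIN_VOCAB)
print(generic_pieces)  # several short fragments
print(domain_pieces)   # a single token
```

Fewer, more meaningful tokens per term generally leave the encoder with less fragmentation to untangle, which is one plausible reason a custom vocabulary could help on scientific text.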

By providing INDUS with domain-specific vocabulary, the IMPACT-IBM team achieved superior performance over open, non-domain-specific LLMs on a benchmark for biomedical tasks, a scientific question-answering benchmark, and Earth science entity recognition tests. Because it was designed for diverse linguistic tasks and retrieval-augmented generation (RAG), INDUS is able to process researcher questions, retrieve relevant documents, and generate answers to the questions. For latency-sensitive applications, the team developed smaller, faster versions of both the encoder and sentence transformer models.
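The question, retrieval, and answer flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the INDUS pipeline: `embed` is a bag-of-words stand-in for a sentence transformer, the passages are invented, and the final generation step is represented only by the prompt that would be handed to a generator model.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts (an assumption, NOT an INDUS model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Assemble the retrieved context and the question for a generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Invented passages, for illustration only.
PASSAGES = [
    "solar flares are bursts of radiation from the sun",
    "sea surface temperature is measured by satellites and buoys",
    "phosphorylated proteins act as biomarkers in cell signaling",
]

question = "how is sea surface temperature measured"
top = retrieve(question, PASSAGES, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real deployment the stand-in `embed` would be replaced by a trained retriever model, and the prompt would be passed to a generative model to produce the answer.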

Validation tests demonstrate that INDUS excels in retrieving relevant passages from the science corpora in response to a NASA-curated test set of about 400 questions. IBM researcher Bishwaranjan Bhattacharjee commented on the overall approach: "We achieved superior performance by not only having a custom vocabulary but also a large specialized corpus for training the encoder model and a good training strategy. For the smaller, faster versions, we used neural architecture search to obtain a model architecture and knowledge distillation to train it with supervision of the larger model."
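Knowledge distillation, mentioned in the quote above, trains a small "student" model to imitate the output distribution of a larger "teacher." The sketch below shows one common textbook form of the objective, a KL divergence between temperature-softened distributions. It is an assumption that this resembles what the team used; the function names, logits, and temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # Skip terms where the teacher probability underflowed to zero.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits, for illustration only.
teacher = [4.0, 1.0, 0.5]
matching_student = [4.0, 1.0, 0.5]
poor_student = [0.5, 1.0, 4.0]

print(distillation_loss(teacher, matching_student))  # zero when outputs agree
print(distillation_loss(teacher, poor_student))      # positive when they differ
```

Minimizing this loss pushes the student toward the teacher's full distribution over outputs rather than only its top prediction.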

INDUS was also evaluated using data from NASA's Biological and Physical Sciences (BPS) Division. Dr. Sylvain Costes, the NASA BPS project manager for Open Science, discussed the benefits of incorporating INDUS: "Integrating INDUS with the Open Science Data Repository (OSDR) Application Programming Interface (API) enabled us to develop and trial a chatbot that offers more intuitive search capabilities for navigating individual datasets. We are currently exploring ways to improve OSDR's internal curation data system by leveraging INDUS to enhance our curation team's productivity and reduce the manual effort required daily."

At the NASA Goddard Earth Sciences Data and Information Services Center (GES-DISC), the INDUS model was fine-tuned using labeled data from domain experts to categorize publications specifically citing GES-DISC data into applied research areas. According to NASA principal data scientist Dr. Armin Mehrabian, this fine-tuning "significantly improves the identification and retrieval of publications that reference GES-DISC datasets, which aims to improve the user journey in finding their required datasets." Furthermore, the INDUS encoder models are integrated into the GES-DISC knowledge graph, supporting a variety of other projects, including the dataset recommendation system and GES-DISC GraphRAG.

Kaylin Bugbee, team lead of NASA's Science Discovery Engine (SDE), spoke to the benefit INDUS offers to existing applications: "Large language models are rapidly changing the search experience. The Science Discovery Engine, a unified, insightful search interface for all of NASA's open science data and information, has prototyped integrating INDUS into its search engine. Initial results have shown that INDUS improved the accuracy and relevancy of the returned results."

INDUS enhances scientific research by providing researchers with improved access to vast amounts of specialized knowledge. INDUS can understand complex scientific concepts and reveal new research directions based on existing data. It also enables researchers to extract relevant information from a wide array of sources, improving efficiency. Aligned with NASA and IBM's commitment to open and transparent artificial intelligence, the INDUS models are openly available on Hugging Face. For the benefit of the scientific community, the team has released the developed models and will release the benchmark datasets that span named entity recognition for climate change, extractive QA for Earth science, and information retrieval for multiple domains. The INDUS encoder models are adaptable for science domain applications, and the INDUS retriever models support information retrieval in RAG applications.

Research Report: INDUS: Effective and Efficient Language Models for Scientific Applications

Learn more about the Science Discovery Engine here.

