Property-Structure-Process Relationship Trees and Contextual Understanding of Scientific Manuscripts using Large Language Models of Artificial Intelligence

When and Where

Nov 29, 2023
8:15am - 8:30am

Sheraton, Second Floor, Back Bay A

Presenter

Maciej Tomczak

Daniel Cieslinski

Payden Brown

Yang Park

Ju Li

Stefanos Papanikolaou

Co-Author(s)

Maciej Tomczak¹, Daniel Cieslinski¹, Payden Brown², Yang Park², Ju Li², Stefanos Papanikolaou¹

National Centre for Nuclear Research¹, Massachusetts Institute of Technology²

Abstract

The ever-expanding corpus of scientific manuscripts contains a vast amount of knowledge and descriptions of diverse experimental settings. As new articles are published daily, it is impractical for any single individual to absorb all of that information. For each manuscript, however, a scientist's mind builds and refines a tree of connections between Properties (e.g., hardness, strength, conductivity of materials), Structures (e.g., crystalline type, defect content, composition) and Processes (e.g., annealing, cold work). This tree of Property-Structure-Process relationships (PSPs) represents the human strategy for dimensional reduction in processing scientific manuscripts, and it is needed to find required information in published works quickly and efficiently and to compare it with similar texts or other sources. Recent advances in natural language processing (NLP) have given rise to high-performing foundation models, in particular large language models (LLMs) such as OpenAI’s GPT-4. These powerful models are capable of complex tasks such as text summarization and creative content generation. However, using them comes at a significant computational cost, and their contextual understanding is commonly limited. In this work, we develop small and efficient fine-tunable models for capturing PSPs in scientific manuscripts, using Elsevier's database. We also propose incorporating the Retrieval Augmented Generation (RAG) approach alongside our fine-tuning methods to obtain a more robust and reliable model for capturing PSPs. We investigate the statistics of various ways of fine-tuning LLMs, and also extract PSPs in pre-defined sets. Using texts focused only on nuclear materials research from the Journal of Nuclear Materials, we evaluate LLMs of different sizes and different strategies to determine their suitability as knowledge bases for scientists.
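
To make the idea of a PSP tree concrete, here is a minimal Python sketch of one possible representation. It is an illustration under stated assumptions, not the authors' implementation: the `PSPNode` class, the `paths` helper, and the example links between annealing, defect content, hardness and conductivity are all hypothetical and chosen only because those terms appear in the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PSPNode:
    """One node of a hypothetical PSP tree (not the authors' data model)."""
    kind: str  # "process", "structure", or "property"
    name: str
    children: list[PSPNode] = field(default_factory=list)

    def add(self, kind: str, name: str) -> PSPNode:
        """Attach a child node and return it so chains can be built step by step."""
        child = PSPNode(kind, name)
        self.children.append(child)
        return child


def paths(node: PSPNode, prefix: tuple = ()):
    """Yield every root-to-leaf chain as a tuple of (kind, name) pairs."""
    here = prefix + ((node.kind, node.name),)
    if not node.children:
        yield here
    for child in node.children:
        yield from paths(child, here)


# Illustrative relations only; a real tree would be extracted from a manuscript.
root = PSPNode("process", "annealing")
structure = root.add("structure", "defect content")
structure.add("property", "hardness")
structure.add("property", "conductivity")

psp_paths = list(paths(root))
for chain in psp_paths:
    print(" -> ".join(f"{kind}: {name}" for kind, name in chain))
```

Each root-to-leaf chain is one Process-Structure-Property relationship, which is the kind of compact record that an extraction model could be asked to emit and that could then be compared across manuscripts.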