Maciej Tomczak1, Daniel Cieslinski1, Payden Brown2, Yang Park2, Ju Li2, Stefanos Papanikolaou1
National Centre for Nuclear Research1, Massachusetts Institute of Technology2
<br/>The ever-expanding corpus of scientific manuscripts contains a vast amount of knowledge and descriptions of various experimental settings. As new articles are published daily, it is impractical for any single individual to grasp all of that information. For each manuscript, however, a scientist builds a tree of connections between Properties (e.g., hardness, strength, conductivity of materials), Structures (e.g., crystalline type, defect content, composition) and Processes (e.g., annealing, cold work). This tree of PSPs represents the human strategy for dimensional reduction in processing scientific manuscripts. Building it requires finding the relevant information in published works quickly and efficiently and comparing it with similar texts or other sources.<br/><br/>Recent advancements in natural language processing (NLP) have given rise to high-performing foundation models, in particular large language models (LLMs) such as OpenAI’s GPT-4. These powerful models are capable of complex tasks such as text summarization and creative content generation. However, using them comes at a significant computational cost, and their contextual understanding is often limited. In this work, we develop small, efficient, fine-tunable models for capturing PSPs in scientific manuscripts, using Elsevier's database. We also propose to incorporate the Retrieval Augmented Generation (RAG) approach alongside our fine-tuning methods to make the model for capturing PSPs more robust and reliable.<br/><br/>We investigate the statistics of various ways of fine-tuning LLMs, and also extract PSPs in pre-defined sets. Using texts focused only on nuclear materials research from the Journal of Nuclear Materials, we evaluate LLMs of different sizes and different strategies to determine their suitability as knowledge bases for scientists.
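
To make the idea of extracting PSPs in pre-defined sets and retrieving relevant passages more concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes a tiny hand-written vocabulary (`PSP_TERMS`, seeded only with the examples named in the abstract) and replaces both the fine-tuned model and the RAG retriever with simple keyword matching; the names `tag_psp` and `retrieve` are hypothetical.

```python
import re

# Hypothetical pre-defined PSP sets, seeded with the examples from the abstract.
PSP_TERMS = {
    "property": ["hardness", "strength", "conductivity"],
    "structure": ["crystalline", "defect", "composition"],
    "process": ["annealing", "cold work"],
}


def tag_psp(text):
    """Return the PSP terms found in `text`, grouped by category."""
    lowered = text.lower()
    return {
        category: [t for t in terms if re.search(r"\b" + re.escape(t), lowered)]
        for category, terms in PSP_TERMS.items()
    }


def retrieve(query, passages, k=1):
    """Toy retrieval step: rank passages by PSP terms shared with the query.

    A real RAG pipeline would typically use a learned retriever; keyword
    overlap is an assumption made here to keep the sketch self-contained.
    """
    def flat(text):
        return {t for terms in tag_psp(text).values() for t in terms}

    query_terms = flat(query)
    ranked = sorted(passages, key=lambda p: len(query_terms & flat(p)), reverse=True)
    return ranked[:k]
```

In a retrieval-augmented setup, the passages returned by a step like `retrieve` would be passed to the language model as context alongside the question, so that the answer is grounded in the retrieved text.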