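Kevin Cruse<sup>1,2</sup>, Viktoriia Baibakova<sup>1,2</sup>, Maged Abdelsamie<sup>2</sup>, Kootak Hong<sup>2,3</sup>, Christopher Bartel<sup>2,4</sup>, Amalie Trewartha<sup>2,5</sup>, Anubhav Jain<sup>2</sup>, Carolin Sutter-Fella<sup>2</sup>, Gerbrand Ceder<sup>1,2</sup>
<sup>1</sup>University of California, Berkeley; <sup>2</sup>Lawrence Berkeley National Laboratory; <sup>3</sup>Chonnam National University; <sup>4</sup>University of Minnesota; <sup>5</sup>Toyota Research Institute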
Understanding the reaction pathways in complex oxide synthesis is an ongoing goal in materials science. Clues to these pathways can be gathered by monitoring which intermediate phases persist as impurity phases in the final sample. To better understand the formation of impurity phases in BiFeO<sub>3</sub> thin films synthesized by the sol-gel technique, we constructed a high-quality dataset of 340 synthesis procedures and outcomes, extracted manually from 178 scientific articles. From this dataset, we built a decision tree model that reinforces important experimental heuristics for avoiding phase impurities but ultimately shows limited predictive capability. Assuming that this limited performance is due to hidden variables and the need for more data, we used the text-mined dataset to inform several experiments aimed at reproducing results from the literature and at proposing new syntheses that explore under-represented regions of the synthesis condition space. Our investigation identifies features important for controlling phase purity that are corroborated by known heuristics in the field, such as annealing temperature and Bi:Fe metal ratio, as well as less frequently studied indicators, such as solution stirring temperature and precursor solution concentration. We also highlight the limitations of building predictive models for complex synthesis tasks from text-mined data alone, limitations that may often stem from incomplete synthesis descriptions in the literature. Nevertheless, we show how such a dataset can be made useful by informing new controlled experiments and building a better understanding of impurity phase formation in this complex oxide system.
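
As a rough illustration of how a decision tree picks out features such as annealing temperature, the sketch below finds the single best threshold split (a depth-one tree) by minimizing weighted Gini impurity. It is not the model built in this work: the function names, the feature names `anneal_temp_C` and `bi_fe_ratio`, and every record in `ROWS` are made-up placeholders chosen for illustration, not entries from the text-mined dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, feature_names):
    """Return (feature, threshold, weighted_gini) for the best single split."""
    best = None
    n = len(rows)
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best


# Hypothetical placeholder records: (annealing temperature in C, Bi:Fe ratio).
# These values and labels are invented for illustration only.
FEATURES = ["anneal_temp_C", "bi_fe_ratio"]
ROWS = [
    (450, 1.00), (475, 1.05), (500, 1.00), (525, 1.10),
    (550, 1.05), (575, 1.00), (600, 1.05), (650, 1.00),
]
LABELS = ["impure", "impure", "impure", "pure", "pure", "pure", "pure", "pure"]

if __name__ == "__main__":
    feature, threshold, score = best_split(ROWS, LABELS, FEATURES)
    print(f"split on {feature} at {threshold} (weighted Gini {score:.3f})")
```

A full decision tree applies this split search recursively to each resulting subset. On real literature data the splits are rarely this clean, which is consistent with the limited predictive capability noted above.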