Luis Rangel DaCosta1, Katherine Sytwu2, Catherine Groschner1, Mary Scott1,2
University of California, Berkeley1, Lawrence Berkeley National Laboratory2
Computer vision and other machine learning frameworks provide an exciting opportunity to accelerate and automate traditional analysis tools in materials characterization. In high-resolution transmission electron microscopy (HRTEM), a technique that provides precise atomic structural information about nanoscale structures, computer vision and other machine learning tools could enable the study of population-wide structural statistics on the atomic scale for a materials system. Information at this scale could greatly accelerate the development and design of next-generation advanced nanomaterials and functional devices, which will require precise knowledge and control of atomistic material characteristics. However, popular methods like supervised learning with neural networks require a huge amount of high-quality data with accompanying labels. Acquiring and labeling data that covers the vast distribution of structures and experimental imaging conditions is extremely labor-intensive, especially since it must be done for each new experiment. Importantly, there is also no access to ground truth with experimental data; even labels provided by an expert can be noisy or wrong.<br/>These challenges in acquiring sufficient, high-quality, high-diversity experimental data for machine learning training in HRTEM can be overcome by using simulated databases, which provide large corpora of data with accurate ground-truth labels for arbitrary experimental conditions and materials. In this work, we develop a high-throughput computational workflow that generates readily usable simulated databases for machine learning tasks.
In particular, we consider the task of segmenting gold nanoparticles in HRTEM images and characterize how the quantity, quality, and diversity of the simulated data affect the training and performance of machine learning models.<br/>Our computational workflow begins with our newly developed software package, Construction Zone, an open-source Python package that abstracts away the lower-level tasks involved in creating complex nanoscale structures and provides a clear and extensible interface for algorithmic creation of structures for simulation tasks. Our package is built upon open-source tools like Pymatgen and the Atomic Simulation Environment and can be used to generate structures that suitably sample the complete structural space of a materials system. Generated structures can then be passed into the simulation task of choice, whether that is molecular dynamics, density functional theory (DFT), or, in our case, simulation of TEM imaging. After generating a library of gold nanoparticles of varying size, number and type of planar defects, and orientation on amorphous carbon substrates, we simulate a comparably sized library of TEM images with the high-performance image-simulation code Prismatic. We then apply arbitrary sets of image defoci, aberrations, and noise conditions to sample a diverse set of imaging conditions and create data that closely mimics experimental images. With access to the generated structures, we calculate noise-free ground-truth labels directly from simulation for each generated image.<br/>Finally, we characterize the performance of neural network models trained purely on simulated HRTEM data as well as models refined with a small body of experimental data (transfer learning). In particular, we examine the impact of simulated dataset size (number of images), quality (fidelity of simulation), and diversity (as measured by distributions of structures and imaging conditions) on the relative performance of segmentation of experimental micrographs of gold nanoparticles.
We also determine baseline strategies and benchmarks for constructing simulated datasets sufficient to develop performant machine learning models, and discuss how these computational workflows and techniques can be extended to more complex analysis tasks, such as defect and grain analysis or multimodal experiments.
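
As an illustration of the step in which defoci, aberrations, and noise are applied to simulated images, the following is a minimal NumPy sketch and not the workflow's actual code or the Prismatic API. It assumes a complex exit wave is already available as a 2D array, models only defocus and third-order spherical aberration with one common sign convention for the aberration function, and treats noise as pure Poisson shot noise. All function and parameter names are hypothetical.

```python
import numpy as np


def electron_wavelength(voltage_kv):
    """Relativistic electron wavelength in angstroms for a voltage in kV."""
    # lambda = hc / sqrt(eV * (2 m0 c^2 + eV)), with hc = 12.3984 keV*angstrom
    # and m0 c^2 = 510.999 keV.
    return 12.3984 / np.sqrt(voltage_kv * (2.0 * 510.999 + voltage_kv))


def aberrated_intensity(exit_wave, pixel_size, voltage_kv, defocus, cs):
    """Noise-free image intensity after defocus and spherical aberration.

    All lengths are in angstroms. Assumed convention:
    chi(k) = pi * lam * k^2 * (0.5 * cs * lam^2 * k^2 - defocus).
    Other sign conventions are in use elsewhere.
    """
    ny, nx = exit_wave.shape
    kx = np.fft.fftfreq(nx, d=pixel_size)  # spatial frequencies, 1/angstrom
    ky = np.fft.fftfreq(ny, d=pixel_size)
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2
    lam = electron_wavelength(voltage_kv)
    chi = np.pi * lam * k2 * (0.5 * cs * lam**2 * k2 - defocus)
    # Coherent transfer: multiply the wave's spectrum by a pure phase factor.
    image_wave = np.fft.ifft2(np.fft.fft2(exit_wave) * np.exp(-1j * chi))
    return np.abs(image_wave) ** 2


def add_shot_noise(intensity, dose, pixel_size, rng):
    """Poisson counts for a dose given in electrons per square angstrom."""
    expected_counts = intensity * dose * pixel_size**2
    return rng.poisson(expected_counts)
```

Sweeping `defocus`, `cs`, and `dose` over chosen ranges for each exit wave would give one simple way to sample a range of imaging conditions. Because the noise-free intensity and the underlying structure are both known, labels can be derived without reference to the noisy image.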