Yves Gaetan Nana Teukam (1,2), Matteo Manica (1), Francesca Grisoni (2), Teodoro Laino (1)
(1) IBM Research Europe - Zurich, (2) Technische Universiteit Eindhoven
Enzymes are molecular machines optimized by nature to allow otherwise impossible chemical processes to occur. Besides increasing reaction rates, they offer characteristics that enable more sustainable reactions: mild conditions, less toxic solvents, and reduced waste. Billions of years of evolution have made enzymes extremely efficient. However, wide adoption in industrial processes requires faster design using in-silico methodologies, a daunting task that is far from solved. Most methods introduce mutations into an existing amino acid (AA) sequence, relying on a variety of assumptions and strategies to generate variants. More recently, machine learning and deep generative networks have gained popularity in protein engineering by leveraging prior knowledge of protein binders, their physicochemical properties, or their 3D structure. Here, we cast enzyme optimization as an evolutionary algorithm in which mutations are modeled by a generalized autoregressive language model trained on fragments of AA sequences from UniProtKB. Relying on a pre-trained language model, we apply transfer learning and train a Random Forest as the scoring model on a dataset of biocatalysed chemical reactions to drive the optimization process. With our approach, which makes minimal assumptions, we can adapt active sites to perform new reactions. Our methodology allows designing enzymes with higher predicted biocatalytic activity, emulating the evolutionary process occurring in nature by sampling optimized sequences from a model of the underlying proteomic language.
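To make the overall loop concrete, the following is a minimal sketch of an evolutionary optimization of this kind. It is illustrative only: `propose_mutants` uses random point substitutions as a stand-in for sampling from the language model, and `toy_score` is a hypothetical placeholder for the trained Random Forest scoring model. All function names, parameters, and the example sequence are assumptions and do not come from the original work.

```python
import random

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def propose_mutants(sequence, n_mutants, rng):
    """Stand-in for the language model (assumption): return variants of
    `sequence`, each with one randomly chosen position substituted."""
    mutants = []
    for _ in range(n_mutants):
        pos = rng.randrange(len(sequence))
        new_aa = rng.choice(AMINO_ACIDS)
        mutants.append(sequence[:pos] + new_aa + sequence[pos + 1:])
    return mutants


def toy_score(sequence):
    """Hypothetical placeholder for the scoring model (assumption):
    fraction of hydrophobic residues. A real setup would call a trained
    predictor of biocatalytic activity here."""
    return sum(aa in "AILMFVW" for aa in sequence) / len(sequence)


def evolve(seed, score_fn, generations=20, population=8,
           mutants_per_parent=4, rng_seed=0):
    """Elitist evolutionary loop: mutate, score, keep the top candidates."""
    rng = random.Random(rng_seed)
    pool = [seed]
    for _ in range(generations):
        # Parents stay in the candidate set, so the best score never drops.
        candidates = list(pool)
        for parent in pool:
            candidates.extend(propose_mutants(parent, mutants_per_parent, rng))
        # Deduplicate, then rank by score (ties broken alphabetically
        # so the result is deterministic for a given rng_seed).
        ranked = sorted(set(candidates), key=lambda s: (-score_fn(s), s))
        pool = ranked[:population]
    best = pool[0]
    return best, score_fn(best)


if __name__ == "__main__":
    start = "MKTAYIAKQR"  # arbitrary example sequence, not from the paper
    best, score = evolve(start, toy_score)
    print(start, toy_score(start))
    print(best, score)
```

Swapping `propose_mutants` for a sampler backed by a language model, and `toy_score` for a fitted predictor, would leave the loop itself unchanged. That separation between proposal and scoring is the design point this sketch is meant to show.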