Researchers Develop Active Learning Workflow to Optimize Drug Design

TACC’s Frontera supercomputer aids machine learning workflow for COVID-19 drug screening


The Carnegie Mellon researchers developed an efficient automated workflow for identifying compounds with high binding affinity to the target protein among thousands of congeneric ligands. Automated machine learning combined with molecular dynamics-based free energy calculations, orchestrated by active learning, allows an unbiased and efficient search for a small set of best-performing molecules. Starting from the structure of a known SARS-CoV-2 PLpro inhibitor (shown on the left), the researchers screened a library of 1.6 billion commercially available compounds with a similar substructure (highlighted in green) and identified several compounds with more than 100-fold improvement in predicted binding affinity (the best-performing molecule is shown on the right).

Adapted from a press release by Heidi Opdyke, Carnegie Mellon University.

At the height of the COVID-19 pandemic, researchers from Carnegie Mellon University's Department of Chemistry built computer simulations of COVID-related protein inhibitors in an attempt to identify drug candidates that could treat the virus.

In doing so, they developed an efficient automated workflow for identifying new compounds that could be used to develop future pharmaceutical treatments for a wide range of diseases or conditions.

The work by Evgeny "Eugene" Gutkin, a doctoral candidate in Professor of Chemistry Maria Kurnikova's research group, and Filipp Gusev, a graduate student in the joint CMU-Pitt Ph.D. program in Computational Biology and in Assistant Professor of Chemistry Olexandr Isayev's group, is described in a paper published in the American Chemical Society's Journal of Chemical Information and Modeling and highlighted on the journal's cover.

When looking for new drug candidates, experts often consider repurposing or modifying known molecules. Computational techniques add a layer of computer-aided design that helps find those enhancements faster.

"This approach requires expertise in the field to narrow down options but historically has been biased because it doesn't consider underexplored areas in chemical space," Gusev said. "The computational approach we developed and applied is agnostic to those biases because it's purely data-driven."

Gusev and Gutkin were looking for potent inhibitors of the SARS-CoV-2 papain-like protease, compounds that disrupt the replication of the coronavirus. In this instance, they sought the compounds with the lowest protein-ligand binding free energy, a crucial indicator of drug potency, among thousands of molecules that share a common substructure.

Through an automated workflow that started with 1.6 billion commercially available molecules and narrowed them to some 8,000 candidates, they were able to find 133 compounds that performed better than the known inhibitor. Of these, 16 showed more than 100-fold improvement in binding affinity, which in theory leads to significantly better inhibitory activity.
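To put the 100-fold figure in energy terms, the standard thermodynamic relation between the binding free energy and the dissociation constant, ΔG = RT ln K_d, converts a fold change in affinity into a free energy difference. The short Python sketch below is illustrative and is not taken from the paper; the temperature of 298.15 K and the function name are assumptions made for the example.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)

def fold_improvement_to_ddg(fold: float, temperature_k: float = 298.15) -> float:
    """Free energy change (kcal/mol) when K_d is lowered by a factor of `fold`.

    Hypothetical helper: it applies ddG = -RT ln(fold), which follows from
    dG = RT ln(K_d). The default temperature is an assumption.
    """
    if fold <= 0:
        raise ValueError("fold must be positive")
    return -R_KCAL_PER_MOL_K * temperature_k * math.log(fold)

ddg_100 = fold_improvement_to_ddg(100.0)
print(f"100-fold improvement: {ddg_100:.2f} kcal/mol")
```

Under these assumptions, a 100-fold improvement in affinity corresponds to a binding free energy that is lower by roughly 2.7 kcal/mol.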

"Our hit rate outperformed that expected of traditional expert medicinal chemist-guided campaigns," Gutkin said.

Using active learning and automated machine learning, the researchers' method obtained its results 20 times faster than a brute-force approach, in which calculations would be performed for every molecule in the focused set of 8,000.
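The general idea of an active learning loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the researchers' workflow: the `oracle` function is a synthetic stand-in for an expensive free energy calculation, the nearest-neighbor surrogate replaces their automated machine learning models, the two-number descriptors are random, and the greedy batch selection is only one possible acquisition strategy. If the 20-fold figure is read as a reduction in the number of calculations, it would imply on the order of 400 calculations for 8,000 molecules; the toy pool below is smaller so that it runs quickly.

```python
import math
import random

def oracle(x):
    """Synthetic stand-in for an expensive free energy calculation (lower is better)."""
    return (x[0] - 0.7) ** 2 + (x[1] - 0.2) ** 2

def knn_predict(x, train, k=3):
    """Surrogate model: mean score of the k nearest labeled points."""
    nearest = sorted(train, key=lambda item: math.dist(x, item[0]))[:k]
    return sum(score for _, score in nearest) / len(nearest)

def active_learning(pool, budget, batch_size, seed=0):
    """Label at most `budget` pool items, choosing each batch with the surrogate."""
    rng = random.Random(seed)
    unlabeled = list(range(len(pool)))
    rng.shuffle(unlabeled)
    # Seed round: label a random batch so the surrogate has training data.
    labeled = {i: oracle(pool[i]) for i in unlabeled[:batch_size]}
    unlabeled = unlabeled[batch_size:]
    while len(labeled) < budget:
        train = [(pool[i], score) for i, score in labeled.items()]
        # Greedy acquisition: rank unlabeled items by predicted score.
        unlabeled.sort(key=lambda i: knn_predict(pool[i], train))
        n = min(batch_size, budget - len(labeled))
        for i in unlabeled[:n]:
            labeled[i] = oracle(pool[i])  # the only "expensive" calls
        unlabeled = unlabeled[n:]
    return labeled

implied_budget = 8000 // 20  # calculations implied by a 20-fold saving on 8,000 molecules

rng = random.Random(42)
pool = [(rng.random(), rng.random()) for _ in range(2000)]  # toy descriptors
budget = len(pool) // 20  # evaluate only 1 in 20 candidates
labeled = active_learning(pool, budget=budget, batch_size=25)
best_idx = min(labeled, key=labeled.get)
print(f"evaluated {len(labeled)} of {len(pool)}; best score {labeled[best_idx]:.4f}")
```

The point of the design is that the expensive function is called only for molecules the surrogate considers promising, so the total number of evaluations is fixed by the budget rather than by the size of the pool.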

Identifying compounds and designing drug candidates from a known starting point, a process known as lead optimization, is a looming challenge for modern computational chemistry. Computationally intensive campaigns are limited by the availability of computational resources for molecular dynamics simulations as well as the difficulty of performing computations in a high-throughput manner.

The computational challenges were multifaceted and required orchestrating multiple Python packages, along with custom code, to achieve the desired functionality. On top of that, the screening involved large datasets containing billions of molecules.

For this work, the researchers used the Frontera supercomputer at the Texas Advanced Computing Center and the Bridges-2 system at the Pittsburgh Supercomputing Center.

“We utilize both CPU and GPU nodes on Frontera, depending on the specific workflow. This specific run is not a Texascale-size run that uses half or the entire Frontera system, but it still required substantial compute. We started using Frontera in the early user program. And such computing capabilities helped us to solve multiple problems in chemistry and drug discovery, resulting in several high-profile publications,” Isayev said.

“I think there is overall very low public awareness about computing and computational chemistry being used for the benefit of society. However, in reality many drugs that are being approved started from some kind of computer simulation of protein-ligand interactions,” Isayev added.

"A key outcome of this project is the development of the workflow that combines machine learning and molecular dynamics-based free energy calculations," Gutkin said. "We are planning to refine and optimize the workflow further and apply it to design potent inhibitors for other molecular targets."

“The need is really high, as AI/ML methods can really make a difference. The interest in these methods in the chemistry community is high too,” Isayev added.

This research was supported by the DSF Charitable Foundation, the COVID-19 HPC Consortium and the National Science Foundation.

"The beauty of the method is that it is transferable," Isayev said. "We applied it to COVID-19, and also we're testing it in a couple of other projects."