Healthcare and Life Sciences

Developing Antibody Language Models With NVIDIA BioNeMo

Objective

Deliver new drugs to patients as quickly as possible by developing antibody language models with NVIDIA BioNeMo™ and streamlining the drug discovery workflow. Build a human-in-the-loop drug discovery platform that integrates humans, AI, and robotics. Let researchers focus on drug discovery without having to think about algorithms and parameter optimization.

Customer

Astellas Pharma Inc.

Use Case

Generative AI / LLMs

Products

BioNeMo
DGX A100
DGX H100

Constructing a Human-in-the-Loop Drug Discovery Platform That Integrates Humans, AI, and Robotics

Astellas Pharma Inc., one of Japan’s leading pharmaceutical companies, has developed its own antibody language model, astABpLM, using BioNeMo, NVIDIA’s generative AI framework for drug discovery, to efficiently predict the properties of new antibodies in antibody drug discovery. The company also uses generative AI to generate diverse 3D structures for compound-based drug discovery, running more than 50 times faster than the conventional method. Its compute environment is a DGX™ H100 at Tokyo-1, the drug discovery innovation hub provided by Xeureka, a subsidiary of Mitsui & Co.

Kenichi Mori, Deputy General Manager, Modality Informatics, Astellas Pharma Inc.


Focusing on Drug Discovery Research Without Thinking About Algorithms and Parameter Optimization

Challenge

To streamline the drug discovery process, which can take as long as 10 to 20 years, Astellas is working to digitalize the entire drug discovery value chain. Particularly in the research phase, the company is working to build a human-in-the-loop drug discovery platform (research environment) that integrates humans, AI, and robots. “The goal of digitization is to deliver innovative new drugs to patients as quickly as possible. That’s what it’s all about,” explains Kenichi Mori, the company’s Deputy General Manager of Modality Informatics, who is promoting the digital transformation of research.

Among the various drug discovery modalities, antibody drug discovery utilizes the mechanism of antibodies. Antibodies, also called immunoglobulins, are proteins that bind to specific antigens such as cancer cells, bacteria, and viruses to stop their function.

To develop antibody drugs, it’s necessary to measure how well candidate antibodies bind to their antigens, along with their physical properties, and to evaluate whether they are viable as drugs. Physical properties here refer to properties such as structural stability, solubility, viscosity, and cohesion. Some physical properties take time to measure, so if they can be predicted before measurement, the process can be shortened.

Natnael Hamda, Manager of Modality Informatics and lead engineer at Astellas Pharma, has focused on protein language models (pLMs) as a means of predicting the physical properties of antibodies. A pLM treats a protein as a sentence written in an alphabet of 20 letters, one for each of the 20 standard amino acids, and the representations it learns are useful for structural analysis and functional prediction.
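
As a minimal sketch of this "protein as text" idea, assuming the 20 standard one-letter amino acid codes and a hypothetical integer vocabulary, a sequence can be turned into token IDs as follows. Real pLM tokenizers, including the one used by ESM-1nv, add special tokens and differ in detail; the function name and ID scheme here are illustrative only.

```python
# The 20 standard amino acids, one letter each (IUPAC one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary for illustration: ID 0 is reserved for unknown residues.
UNKNOWN_ID = 0
TOKEN_TO_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def tokenize(sequence: str) -> list[int]:
    """Map each residue letter to an integer ID, one token per residue."""
    return [TOKEN_TO_ID.get(residue, UNKNOWN_ID) for residue in sequence.upper()]


# Illustrative fragment only, not a sequence taken from the source.
example = "EVQLVESGG"
print(tokenize(example))
```

Because every residue becomes exactly one token, the token list has the same length as the sequence, which is what lets a language model reason about individual positions.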

“We thought that since antibodies are also composed of proteins, the standard pLM could be applied. Although pLM-based features demonstrated better accuracy than traditional bioinformatics features in predicting general protein properties such as thermal stability, the model significantly underperformed in predicting antibody-specific properties, both in terms of accuracy and generalization,” Hamda said.

The reasons for this, he speculates, are as follows: “The difference is that proteins have evolved over time into complex structures, while antibodies have adapted to their target antigens. We also know that the basic protein principle¹ that ‘structure determines function’ may not hold true in some cases. For this reason, we believe that the normal pLM did not work for the antibodies.”

¹This is called Anfinsen’s dogma, after Dr. C. Anfinsen, the biochemist who proposed it.

Solution

To address this problem, Hamda decided to develop his own language model specific to antibodies. He named the model “astABpLM,” short for “Astellas Antibody Pre-trained Language Model.”

The training data came from the Observed Antibody Space (OAS) database, which is collected and provided by the University of Oxford, UK.² The 2.4 billion sequences were preprocessed with NVIDIA’s RAPIDS™ suite for data science to prepare the training dataset.

The model used was ESM-1nv, which was optimized by NVIDIA based on the ESM-1 language model for proteins, developed by Meta AI Labs. ESM-1nv is provided as part of NVIDIA BioNeMo, a generative AI platform for drug discovery. “The timing was just right to start accessing BioNeMo, so I immediately decided to use ESM-1nv. It is optimized for NVIDIA GPUs, plus it has support from NVIDIA, which made it very easy to use,” Hamda said.

For the training, he employed a unique method whereby the heavy chains (H-chains) and light chains (L-chains) that make up the antibody are trained separately (see the illustration). “Since heavy chains and light chains are biologically distinct, we thought we could maximize the richness of the OAS data by training them separately,” Hamda said.
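
A minimal sketch of the split, assuming a hypothetical record layout in which each sequence carries a `chain` label (`"H"` or `"L"`). The actual OAS schema and the Astellas pipeline are not described here, so the field names, function name, and toy sequences are all illustrative.

```python
from collections import defaultdict


def split_by_chain(records):
    """Group sequences by chain label so each group can train its own model."""
    groups = defaultdict(list)
    for record in records:
        groups[record["chain"]].append(record["sequence"])
    return dict(groups)


# Toy records; the sequences are placeholders, not OAS data.
records = [
    {"chain": "H", "sequence": "EVQLVESGG"},
    {"chain": "L", "sequence": "DIQMTQSPS"},
    {"chain": "H", "sequence": "QVQLQESGP"},
]

datasets = split_by_chain(records)
vh_corpus = datasets.get("H", [])  # would feed the heavy chain model
vl_corpus = datasets.get("L", [])  # would feed the light chain model
print(len(vh_corpus), len(vl_corpus))
```

Keeping the two corpora separate means each model sees only one chain type during training, which matches the separate astABpLM_VH and astABpLM_VL models described in the text.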

One NVIDIA DGX A100 was used as the training hardware. The heavy chain model astABpLM_VH and light chain model astABpLM_VL completed training in approximately 65 and 37 hours, respectively.

In addition to developing the antibody language model astABpLM described above, the company is using generative AI to generate a variety of 3D structures for low- and mid-molecular weight compounds, including PROTACs (proteolysis-targeting chimeras), as part of its research workflow.

It developed a unique workflow to rapidly generate 3D structures of compounds using a torsional diffusion model,³ which learns the dihedral angles of atomic groups from the GEOM dataset.⁴ GEOM contains about 37 million conformations for more than 450,000 molecules.
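
Torsional diffusion works on torsion (dihedral) angles rather than on raw atomic coordinates. As a small illustration of the quantity involved, and not of the diffusion model itself, the sketch below computes the dihedral angle defined by four atom positions with the standard atan2 formula. The function name and toy coordinates are assumptions made for this example.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees) about the p1-p2 bond for atoms p0-p1-p2-p3."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)  # the central, rotatable bond
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    b2_unit = tuple(x / b2_len for x in b2)
    x = _dot(n1, n2)
    y = _dot(b2_unit, _cross(n1, n2))
    return math.degrees(math.atan2(y, x))


# Toy geometry: the last atom is rotated a quarter turn about the central bond.
print(dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)))
```

Changing only this angle rotates one part of the molecule around a bond while leaving bond lengths and bond angles untouched, which is why a conformer can be described compactly by its set of torsion angles.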

² OAS: https://opig.stats.ox.ac.uk/webapps/oas/

³ Torsional diffusion: Jing et al. 2022, https://arxiv.org/pdf/2206.01729.pdf

⁴ GEOM: https://github.com/learningmatter-mit/geom

Results

The antibody-specific language model, astABpLM, has been incorporated into existing antibody property prediction workflows and is being used to discover new antibodies that may be candidates for new drugs. “Using astABpLM has certainly improved the accuracy of our predictions of physical properties,” says Mori. Hamda also pointed out the advantage of the company having its own model, which allows it to handle not only embedding, but also the probability of each amino acid residue, as necessary.
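
To make the distinction between an embedding and per-residue probabilities concrete, here is a minimal sketch with made-up numbers. It assumes a model that returns one hidden vector and one vector of logits per residue, and it uses mean pooling as one common way to obtain a sequence embedding. The function names and values are hypothetical and do not reflect the BioNeMo or astABpLM interfaces.

```python
import math


def mean_pool(hidden_states):
    """Average per-residue vectors into one fixed-length sequence embedding."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in hidden_states) / n for d in range(dim)]


def softmax(logits):
    """Convert one residue's logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy outputs for a 3-residue sequence: 2-dimensional hidden states and
# logits over a 4-letter toy alphabet (a real model would use 20+ tokens).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
logits = [[2.0, 0.5, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0], [0.1, 3.0, 0.2, 0.1]]

embedding = mean_pool(hidden_states)  # one vector for the whole sequence
residue_probs = [softmax(row) for row in logits]  # one distribution per residue
print(embedding)
print(residue_probs[1])
```

The embedding summarizes the whole sequence for a downstream property predictor, while the per-residue distributions indicate which amino acids the model considers plausible at each position.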

Separately, the company’s proprietary workflow for molecular conformational screening of compounds has delivered a 50- to 60-fold increase in speed compared to conventional methods. Explaining the results, Hamda notes, “We now get results in as little as 15 seconds, compared to the previous environment, which took several hours to a day.”

Both individuals point to the further use of NVIDIA BioNeMo as the way forward. Hamda explains, “In addition to the ESM-1nv used for astABpLM, we are making use of the various models and capabilities offered by NVIDIA BioNeMo, including MegaMolBART for small molecules.” Mori added, “I think one of the advantages of NVIDIA BioNeMo is that we can focus on our research without having to think about optimizing algorithms and parameters when we are conducting drug discovery. We look forward to continuing to add a variety of models and features to support the diversity of modalities.”

Finally, Mori sums up the situation as follows: “A paradigm shift in drug discovery research is about to occur as a result of the convergence of high-performance computing environments and generative AI. Through NVIDIA BioNeMo and Tokyo-1, we will continue to shorten the overall drug discovery pipeline and ultimately bring innovative new drugs to patients as quickly as possible.”

Astellas is one of the participating members of Tokyo-1,⁵ an innovation hub for drug discovery launched by Xeureka, a subsidiary of Mitsui & Co. The concept is to enhance the efficiency of drug discovery research while utilizing the new high-performance NVIDIA DGX H100.

⁵ Tokyo-1: https://tokyo-1.ai/

“A paradigm shift in drug discovery research is about to occur as a result of the convergence of high-performance computing environments and generative AI. Through NVIDIA BioNeMo and Tokyo-1, we are committed to shortening our drug discovery pipeline and bringing innovative new medicines to patients as quickly as possible.”

Kenichi Mori
Astellas Pharma Inc.

Natnael Hamda, Manager of Modality Informatics and Lead Engineer, Astellas Pharma Inc.

Overview of the Development of the Proprietary Antibody Language Model astABpLM

The VH and VL chains are trained independently, with the optimized ESM-1nv as the backbone.
The model can be trained using DGX Cloud (one node, eight A100 GPUs).

Up to 65 hrs for astABpLM_VH
Up to 37 hrs for astABpLM_VL
Initially, the model was trained on only 10% of the data using the existing infrastructure.

The data sets of heavy chains (the red part of the Y) and light chains (the blue part of the Y) that form antibodies were given separately to ESM-1nv for training.

