Researchers at The Ohio State University are digitizing living organisms for broad science and conservation efforts through BioCLIP 2—a Tree of Life foundation model for half of the world's named species. This model is trained on the largest and most diverse dataset to date, which the team also assembled, TreeOfLife-200M.
Facing global species data gaps, researchers used NVIDIA A100 and H100 GPUs to build BioCLIP 2, consistently achieving the best or top-two performance for species identification and zero-shot recognition across almost one million taxa. The model unlocks emergent ecological and evolutionary insights. BioCLIP 2 was recently featured at the annual NeurIPS AI research conference.
Biodiversity provides essential ecosystem services, including clean water, breathable air, and productive soil. To properly study, understand, and manage ecosystems to monitor and protect the world's biodiversity, researchers require good data.
However, the scientific community faces a significant biodiversity data problem: missing and skewed data. The quality and depth of biological data are often tied to the presence and wealth of local human populations, leaving data heavily concentrated in urban areas and national parks while important locales, such as the tropics, remain underrepresented.
From a taxonomic perspective, vertebrates—especially mammals and birds—have abundant data, but invertebrates and fungi lack comprehensive information. For instance, half of the 160,000 species monitored by the International Union for Conservation of Nature (IUCN) Red List are data deficient, including well-known species such as orcas.
This lack of comprehensive, balanced data makes it difficult to understand long-term environmental impacts, evolutionary relatedness, and ecological interactions.
Scatter plots show plant species becoming better separated as the model trains; intra-species variations also form clusters, making them easier to distinguish.
Tanya Berger-Wolf, the director of the Translational Data Analytics Institute and a professor at The Ohio State University, led a team at the US National Science Foundation-funded Imageomics Institute to address the data gap. The team built the largest and most comprehensive biology dataset available and trained a foundation model using this data.
The team's initial efforts, which leveraged hierarchical taxonomic labels and CLIP-style contrastive training, achieved strong species classification accuracy on about a quarter of the world's named species. However, it was unclear what properties would emerge if these methods were scaled up. Existing large-scale biological datasets, such as TreeOfLife-10M and BioTrove, were also limited in scale and diversity, which constrained model generalizability.
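CLIP-style contrastive training pairs each image with a text rendering of its label (here, a taxonomic hierarchy) and pulls matched pairs together in embedding space while pushing mismatched pairs apart. The following is a minimal sketch of the symmetric contrastive loss with synthetic embeddings; the dimensions and setup are illustrative, not the team's actual training code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    Row i of img_emb corresponds to row i of txt_emb (an image paired with
    a text rendering of its taxonomy, e.g. "Animalia ... Orcinus orca").
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    targets = np.arange(len(logits))         # matching pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # cross-entropy in both directions: image-to-text and text-to-image
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 512))
# Perfectly aligned image/text pairs yield a much lower loss than random pairs.
aligned = clip_contrastive_loss(batch, batch)
random_pair = clip_contrastive_loss(batch, rng.normal(size=(8, 512)))
print(aligned, random_pair)
```

Hierarchical labels help because the text for related species shares most of its tokens (kingdom through genus), so the text encoder naturally places taxonomic neighbors near one another.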
Previously developed AI models were often restricted to narrow objectives like species classification and could not tackle broader biological visual tasks, such as habitat classification or trait prediction, without explicit training. Additionally, the team was concerned that extensive training could cause intra-species variations, such as life stages or sexes, to collapse in the embedding space, making fine-grained distinctions difficult.
The researchers needed to massively increase the size of their dataset and train a new model in a stabilized environment that would still be able to make granular distinctions in the biological data.
NVIDIA's high-performance GPUs provided the parallel processing power needed for every step of the TreeOfLife-200M dataset curation and BioCLIP 2 model training.
Before training could even begin, the TreeOfLife-200M dataset itself had to be built. Curating it involved combining vast, disparate sources, including the Global Biodiversity Information Facility (GBIF), Encyclopedia of Life (EOL), BIOSCAN-5M, and FathomNet, and developing a custom taxonomic alignment package, TaxonoPy, to standardize 1.36 million noisy taxonomic hierarchies down to 952,000 unique taxa. The dataset was retrieved and organized with distributed-downloader tools and stored on high-performance computing (HPC) file systems, forming a robust data pipeline designed to feed the GPUs efficiently.
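Conceptually, the alignment step maps many noisy hierarchy strings onto one canonical taxon. The toy sketch below illustrates that idea with hypothetical helper functions; it is not TaxonoPy's actual API, and the real package handles far messier cases (synonyms, rank conflicts, authority strings).

```python
# Toy normalization of noisy taxonomic hierarchies into canonical taxa.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def canonicalize(hierarchy):
    """Normalize case/whitespace and drop placeholder values."""
    cleaned = []
    for rank, name in zip(RANKS, hierarchy):
        name = (name or "").strip()
        if not name or name.lower() in {"unknown", "n/a", "incertae sedis"}:
            cleaned.append(None)   # treat placeholders as missing
        else:
            cleaned.append(name.capitalize())
    return tuple(cleaned)

def unique_taxa(records):
    """Collapse noisy records to the set of distinct canonical hierarchies."""
    return {canonicalize(r) for r in records}

records = [
    ["animalia", "chordata", "mammalia", "cetacea", "delphinidae", "orcinus", "orca"],
    ["Animalia", "Chordata ", "Mammalia", "Cetacea", "Delphinidae", "Orcinus", "Orca"],
    ["Animalia", "Chordata", "Mammalia", "unknown", "Delphinidae", "Orcinus", "orca"],
]
taxa = unique_taxa(records)
print(len(taxa))  # the first two records collapse; the third differs at "order"
```

At the scale of 1.36 million source hierarchies, this kind of deterministic normalization is what makes it possible to reduce the label space to a consistent set of unique taxa.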
Once the dataset was complete, the team trained BioCLIP 2 using a mix of NVIDIA A100 and H100 GPUs.
"NVIDIA GPUs, known for their capabilities in accelerating AI and scientific research, were integral to making these discoveries possible by providing the necessary computational backbone."
Tanya Berger-Wolf
Director, Translational Data Analytics Institute at The Ohio State University
The researchers discovered profound scientific potential through the training and development of BioCLIP 2. Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives.
Contrary to the expectation that fine-grained differences might collapse after extensive training, BioCLIP 2 preserved intra-species variations. This allows the model to differentiate and align these variants. For example, it can cluster different plant diseases on leaves within the same plant species and distinguish between juveniles and adults or males and females within a single animal species.
The embedding distribution of different species spontaneously aligned with their ecological relationships without explicit training. For example, the model captured variations in beak sizes among Darwin's finches and distinguished between freshwater and non-freshwater fish species.
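Analyses like these typically operate directly on frozen embeddings, for example by comparing cosine similarities between per-species mean embeddings. The sketch below uses synthetic vectors as stand-ins (no real BioCLIP 2 weights are involved) to show the basic mechanics.

```python
import numpy as np

def mean_embedding(vectors):
    """Average the image embeddings for one species, then re-normalize."""
    m = np.mean(vectors, axis=0)
    return m / np.linalg.norm(m)

def cosine(a, b):
    return float(a @ b)

rng = np.random.default_rng(1)
dim = 128
base = rng.normal(size=dim)
# Synthetic stand-ins: two "freshwater" species share an embedding direction,
# while a "marine" species points elsewhere.
freshwater_a = np.stack([base + 0.1 * rng.normal(size=dim) for _ in range(20)])
freshwater_b = np.stack([base + 0.1 * rng.normal(size=dim) for _ in range(20)])
marine = rng.normal(size=(20, dim))

ea, eb, em = map(mean_embedding, (freshwater_a, freshwater_b, marine))
print(cosine(ea, eb), cosine(ea, em))  # ecologically similar species sit closer
```

The emergent result reported for BioCLIP 2 is that such structure appears without any ecological supervision: the contrastive objective alone organizes species by habitat and trait similarity.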
Additionally, BioCLIP 2, trained using NVIDIA GPUs, outperformed its predecessors across numerous evaluation categories, including species identification and zero-shot recognition.
BioCLIP 2's capabilities present significant opportunities for further scientific discovery, biodiversity research, and conservation decision-making using AI. The model enables more efficient and automated scientific workflows, opening new avenues for understanding functional and ecological relationships among species.
Learn more about how NVIDIA accelerated computing can power scientific breakthroughs at unprecedented speeds.