Researchers at The Ohio State University are digitizing living organisms for broad science and conservation efforts through BioCLIP 2—a Tree of Life foundation model for half of the world's named species. This model is trained on the largest and most diverse dataset to date, which the team also assembled, TreeOfLife-200M.
Facing global species data gaps, researchers used NVIDIA A100 and H100 GPUs to build BioCLIP 2, consistently achieving the best or top-two performance for species identification and zero-shot recognition across almost one million taxa. The model unlocks emergent ecological and evolutionary insights. BioCLIP 2 was recently featured at the annual NeurIPS AI research conference.
Biodiversity provides essential ecosystem services, including clean water, breathable air, and productive soil. To properly study, understand, and manage ecosystems to monitor and protect the world's biodiversity, researchers require good data.
However, the scientific community faces a significant biodiversity data problem: missing and skewed data. The quality and depth of biological data are often tied to the presence and wealth of local human populations, leaving data heavily concentrated in urban areas and national parks while important locales, such as the tropics, remain underrepresented.
From a taxonomic perspective, vertebrates—especially mammals and birds—have abundant data, but invertebrates and fungi lack comprehensive information. For instance, half of the 160,000 species monitored by the International Union for Conservation of Nature (IUCN) Red List are data deficient, including well-known species such as orcas.
This lack of comprehensive, balanced data makes it difficult to understand long-term environmental impacts, evolutionary relatedness, and ecological interactions.
Scatter plots show plant species becoming better separated as the model trains; intra-species variations also form clusters, making them easier to distinguish.
Tanya Berger-Wolf, the director of the Translational Data Analytics Institute and a professor at The Ohio State University, led a team at the US National Science Foundation-funded Imageomics Institute to address the data gap. The team built the largest and most comprehensive biology dataset available and trained a foundation model using this data.
The team's initial efforts, which leveraged hierarchical taxonomic labels and CLIP-style contrastive training, achieved strong species classification accuracy on about a quarter of the world's named species. However, it was unclear what properties would emerge if these methods were scaled up. Existing large-scale biological datasets, such as TreeOfLife-10M and BioTrove, were also limited in scale and diversity, which constrained model generalizability.
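CLIP-style contrastive training pairs each image with a text rendering of its label (here, a taxonomic hierarchy) and pulls matched pairs together in embedding space while pushing mismatched pairs apart. The following is a minimal sketch of the symmetric contrastive loss with synthetic embeddings; the dimensions and setup are illustrative, not the team's actual training code.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    Row i of img_emb corresponds to row i of txt_emb (an image paired with
    a text rendering of its taxonomy, e.g. "Animalia ... Orcinus orca").
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    targets = np.arange(len(logits))         # matching pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # cross-entropy in both directions: image-to-text and text-to-image
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 512))
# Perfectly aligned image/text pairs yield a much lower loss than random pairs.
aligned = clip_contrastive_loss(batch, batch)
random_pair = clip_contrastive_loss(batch, rng.normal(size=(8, 512)))
print(aligned, random_pair)
```

Hierarchical labels help because the text for related species shares most of its tokens (kingdom through genus), so the text encoder naturally places taxonomic neighbors near one another.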
Previously developed AI models were often restricted to narrow objectives like species classification and could not tackle broader biological visual tasks, such as habitat classification or trait prediction, without explicit training. Additionally, the team was concerned that extensive training could cause intra-species variations, such as life stages or sexes, to collapse in the embedding space, making fine-grained distinctions difficult.
The researchers needed to massively increase the size of their dataset and train a new model in a stabilized environment that would still be able to make granular distinctions in the biological data.
NVIDIA's high-performance GPUs provided the parallel processing power needed for every step of the TreeOfLife-200M dataset curation and BioCLIP 2 model training.
Before training could even begin, the TreeOfLife-200M dataset itself had to be built. Curating it involved combining vast, disparate sources, including the Global Biodiversity Information Facility (GBIF), Encyclopedia of Life (EOL), BIOSCAN-5M, and FathomNet, and developing a custom taxonomic alignment package, TaxonoPy, to standardize 1.36 million noisy taxonomic hierarchies down to 952,000 unique taxa. The dataset was retrieved and organized with distributed-downloader tools and stored on high-performance computing (HPC) file systems, forming a robust data pipeline designed to feed the GPUs efficiently.
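Conceptually, the alignment step maps many noisy hierarchy strings onto one canonical taxon. The toy sketch below illustrates that idea with hypothetical helper functions; it is not TaxonoPy's actual API, and the real package handles far messier cases (synonyms, rank conflicts, authority strings).

```python
# Toy normalization of noisy taxonomic hierarchies into canonical taxa.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def canonicalize(hierarchy):
    """Normalize case/whitespace and drop placeholder values."""
    cleaned = []
    for rank, name in zip(RANKS, hierarchy):
        name = (name or "").strip()
        if not name or name.lower() in {"unknown", "n/a", "incertae sedis"}:
            cleaned.append(None)   # treat placeholders as missing
        else:
            cleaned.append(name.capitalize())
    return tuple(cleaned)

def unique_taxa(records):
    """Collapse noisy records to the set of distinct canonical hierarchies."""
    return {canonicalize(r) for r in records}

records = [
    ["animalia", "chordata", "mammalia", "cetacea", "delphinidae", "orcinus", "orca"],
    ["Animalia", "Chordata ", "Mammalia", "Cetacea", "Delphinidae", "Orcinus", "Orca"],
    ["Animalia", "Chordata", "Mammalia", "unknown", "Delphinidae", "Orcinus", "orca"],
]
taxa = unique_taxa(records)
print(len(taxa))  # the first two records collapse; the third differs at "order"
```

At the scale of 1.36 million source hierarchies, this kind of deterministic normalization is what makes it possible to reduce the label space to a consistent set of unique taxa.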
Once the dataset was complete, the team trained BioCLIP 2 using a mix of NVIDIA A100 and H100 GPUs.
"NVIDIA GPUs, known for their capabilities in accelerating AI and scientific research, were integral to making these discoveries possible by providing the necessary computational backbone."
Tanya Berger-Wolf
Director, Translational Data Analytics Institute at The Ohio State University
The researchers discovered profound scientific potential through the training and development of BioCLIP 2. Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives.
Contrary to the expectation that fine-grained differences might collapse after extensive training, BioCLIP 2 preserved intra-species variations. This allows the model to differentiate and align these variants. For example, it can cluster different plant diseases on leaves within the same plant species and distinguish between juveniles and adults or males and females within a single animal species.
The embedding distribution of different species spontaneously aligned with their ecological relationships without explicit training. For example, the model captured variations in beak sizes among Darwin's finches and distinguished between freshwater and non-freshwater fish species.
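Analyses like these typically operate directly on frozen embeddings, for example by comparing cosine similarities between per-species mean embeddings. The sketch below uses synthetic vectors as stand-ins (no real BioCLIP 2 weights are involved) to show the basic mechanics.

```python
import numpy as np

def mean_embedding(vectors):
    """Average the image embeddings for one species, then re-normalize."""
    m = np.mean(vectors, axis=0)
    return m / np.linalg.norm(m)

def cosine(a, b):
    return float(a @ b)

rng = np.random.default_rng(1)
dim = 128
base = rng.normal(size=dim)
# Synthetic stand-ins: two "freshwater" species share an embedding direction,
# while a "marine" species points elsewhere.
freshwater_a = np.stack([base + 0.1 * rng.normal(size=dim) for _ in range(20)])
freshwater_b = np.stack([base + 0.1 * rng.normal(size=dim) for _ in range(20)])
marine = rng.normal(size=(20, dim))

ea, eb, em = map(mean_embedding, (freshwater_a, freshwater_b, marine))
print(cosine(ea, eb), cosine(ea, em))  # ecologically similar species sit closer
```

The emergent result reported for BioCLIP 2 is that such structure appears without any ecological supervision: the contrastive objective alone organizes species by habitat and trait similarity.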
Additionally, BioCLIP 2, trained using NVIDIA GPUs, outperformed its predecessors across numerous evaluation categories, including species identification and zero-shot recognition.
BioCLIP 2's capabilities present significant opportunities for further scientific discovery, biodiversity research, and conservation decision-making using AI. The model enables more efficient and automated scientific workflows, opening new avenues for understanding functional and ecological relationships among species.
Learn more about how NVIDIA accelerated computing can power scientific breakthroughs at unprecedented speeds.