Korean AI company Trillion Labs is advancing sovereign AI development to overcome the challenge of building highly accurate LLMs for Korea.
Building large language models (LLMs) demands huge datasets, often exceeding two trillion tokens, and at that scale data processing tasks such as deduplication and shuffling are time-consuming and computationally expensive.
For Trillion Labs, a Korean AI startup building scalable multilingual models, this challenge is heightened by the lack of high-quality Korean language data. While English-speaking countries enjoy a variety of foundation models and datasets, languages like Korean have scarce public data and are considered “low-resource.” Popular models tend to struggle in these low-resource languages.
Investment in local infrastructure, data curation, and local workforce skills is imperative to achieving sovereign AI capabilities and enables countries to build capable models that benefit the public. Trillion Labs’ effort invites broader participation from the local ecosystem around a model that understands Korean context.
Trillion Labs requires investment in high-quality datasets to improve Korean language accuracy. Computational delays in data curation and training limit the number of experiments the team can run, preventing quick iteration and validation of improvements in model performance and slowing innovation.
To address these challenges, Trillion Labs used NVIDIA NeMo Curator, which uses GPU processing to accelerate the creation of high-quality datasets and reduce data engineering time; with GPUs, these computational tasks take mere hours. The data curation pipeline begins with Korean language identification, followed by custom filters built on top of NeMo Curator’s heuristic filters. Deduplication then removes sentences with similar meanings, maintaining a high-quality, accurate, and unbiased dataset. With this pipeline, NeMo Curator extracts high-quality Korean language data from the available resources.
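For illustration, a minimal sketch of a pipeline along these lines is shown below, using classes from NeMo Curator's documented examples (DocumentDataset, ScoreFilter, Sequential, FastTextLangId, WordCountFilter, ExactDuplicates). The file paths, field names, and thresholds are assumptions, and exact signatures may differ between library versions; this is not Trillion Labs' production pipeline.

from nemo_curator import ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import FastTextLangId, WordCountFilter
from nemo_curator.modules import ExactDuplicates

# Read raw JSONL documents; each record is assumed to carry "id" and "text" fields.
dataset = DocumentDataset.read_json("raw_corpus/")

# Step 1: language identification. FastTextLangId scores each document with the
# fastText language-ID model ("lid.176.bin" is an assumed local path) and records
# the predicted language so that only confidently identified Korean documents are kept.
lang_id = ScoreFilter(
    FastTextLangId(model_path="lid.176.bin", min_langid_score=0.8),
    score_field="language",
    score_type="object",
)

# Step 2: heuristic quality filtering (a word-count filter is shown here;
# Korean-specific custom filters would be appended to the same Sequential pipeline).
quality = ScoreFilter(WordCountFilter(min_words=50, max_words=100_000))

filtered = Sequential([lang_id, quality])(dataset)

# Step 3: exact deduplication. Each document is hashed, duplicates are flagged,
# and all but one representative per duplicate group are dropped.
dedup = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")
duplicates = dedup(filtered)
to_drop = duplicates.df.map_partitions(
    lambda part: part[part._hashes.duplicated(keep="first")]
)
curated = DocumentDataset(filtered.df[~filtered.df["id"].isin(to_drop["id"].compute())])

curated.to_json("curated_corpus/")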
“Deduplication is one of the most time-consuming processes when handling a very large dataset,” said Jason Park, Co-Founder at Trillion Labs. “The time saved from NeMo Curator GPU acceleration was the most significant benefit we experienced from using the library.”
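To illustrate why deduplication dominates curation time on CPUs, the sketch below implements MinHash, a standard technique for detecting near-duplicate text; NeMo Curator provides GPU-accelerated fuzzy and semantic deduplication modules for the same purpose at corpus scale. The documents, shingle size, and signature length here are illustrative assumptions, not Trillion Labs' data.

# CPU-only MinHash sketch: every document requires NUM_HASHES hash passes over
# all of its character n-grams, which is why this step is costly at trillion-token scale.
import hashlib
import re

NUM_HASHES = 128

def shingles(text, n=5):
    """Character n-grams that form the document's feature set."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_signature(text):
    """For each of NUM_HASHES seeded hash functions, keep the minimum hash value."""
    feats = shingles(text)
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in feats
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

docs = [
    "Large language models need trillions of tokens of training data.",
    "Large language models need trillions of tokens of training data!",  # near-duplicate
    "Korean is a low-resource language with little public training data.",
]
sigs = [minhash_signature(d) for d in docs]
print(estimated_jaccard(sigs[0], sigs[1]))  # high similarity -> drop one copy
print(estimated_jaccard(sigs[0], sigs[2]))  # low similarity  -> keep both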
Trillion Labs has taken a new approach to data processing with the GPU-accelerated NeMo Curator pipeline.
“This newfound agility elevates our entire analytics lifecycle, turning our data platform into a launchpad for continuous innovation and competitive advantage,” said Jay Shin, CEO and Co-Founder at Trillion Labs.
Additionally, NVIDIA products help Trillion Labs accelerate the innovation process. A tighter feedback loop carries prototypes to production more smoothly, and faster iteration leads to better models, which open up new use cases and deeper industry collaboration.
At a global scale, the benefits are tangible. AI applications become more diverse and of higher quality. With faster productization, the market and the public gain more opportunities for innovation at both the corporate and personal level, meaning broader access to AI applications. Datasets are smaller but of higher quality, since repeated and low-quality content is removed during curation. Training models on this refined dataset results in shorter overall training times and significant savings in cost and energy.
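As a rough illustration of the cost effect: pretraining compute is commonly approximated as about 6 x parameters x training tokens, so trimming duplicated and low-quality tokens reduces compute roughly in proportion. The figures in the sketch below are assumptions for illustration only, not Trillion Labs' numbers.

# Back-of-envelope estimate of training-compute savings from dataset curation.
N = 7e9             # model parameters (assumed)
D_raw = 2.0e12      # tokens before curation (assumed)
D_curated = 1.4e12  # tokens after removing ~30% duplicated/low-quality text (assumed)

flops = lambda n, d: 6 * n * d  # common approximation of pretraining FLOPs
saving = 1 - flops(N, D_curated) / flops(N, D_raw)
print(f"Training FLOPs reduced by about {saving:.0%}")  # ~30%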
Developing and deploying language models within countries enables cultural preservation and local economic development. Local investment in AI infrastructure will aid in retaining talent and fostering innovation.
NVIDIA NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization.