
Trillion Labs Cuts Data Prep From Days to Hours to Train Highly Accurate LLMs

Objective

Korean AI company Trillion Labs is advancing sovereign AI development to overcome the challenge of building highly accurate LLMs for Korea.

Customer

Trillion Labs

Use Case

Generative AI / LLMs

Overcoming Computational Barriers in Large-Scale Korean LLM Development

Building large language models (LLMs) demands huge datasets—often exceeding two trillion tokens—where data processing tasks such as deduplication and shuffling are time-consuming and computationally expensive. 

For Trillion Labs, a Korean AI startup building scalable multilingual models, this challenge is heightened by the lack of high-quality Korean language data. While English-speaking countries enjoy a variety of foundation models and datasets, languages like Korean have scarce public data and are considered “low-resource.” Popular models tend to struggle in these low-resource languages.   

Investment in local infrastructure, data curation, and workforce skills is imperative to achieving sovereign AI capabilities and to building capable models that benefit the public. Trillion Labs’ effort enables broader participation from the local ecosystem through a model that understands Korean context.

Trillion Labs needs to invest in high-quality datasets to improve Korean language accuracy. Computational delays in data curation and training limit the number of experiments the team can run, preventing quick iteration and validation of improvements in model performance.

Key Takeaways

  • Data processing time for 1.7 trillion tokens of Korean language data was reduced from 24 hours on CPUs to 3.4 hours on 8x NVIDIA H100 GPUs.
  • 5% accuracy improvement for the Korean language using 100B tokens of curated data.
  • By switching to GPU-accelerated data curation, Trillion Labs achieved up to 10x savings in both cost and energy consumption compared to traditional CPU-based approaches.
  • Up to 7x accelerated data processing, enabling Trillion Labs to spend its time building, testing, and innovating instead of waiting 10+ hours, or even days, for data processing to finish.

Time, Cost, and Energy Savings for Large-Scale Data Processing and Model Training

To address these challenges, Trillion Labs used NVIDIA NeMo Curator, which uses GPU processing to accelerate the creation of high-quality datasets and reduce data engineering time. On GPUs, these computational tasks take mere hours. The data curation pipeline begins with Korean language identification and filtering, using custom filters built on top of NeMo Curator’s heuristic filters. Deduplication then removes near-duplicate content with similar meaning, maintaining a high-quality, accurate, and unbiased dataset. In this way, NeMo Curator extracts high-quality Korean language data from raw sources.
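
A pipeline of this kind can be expressed in a few lines of Python. The following is a minimal sketch of the steps described above, not Trillion Labs’ actual code; the class names and parameters (get_client, DocumentDataset, ScoreFilter, Sequential, FastTextLangId, WordCountFilter, ExactDuplicates) come from public NeMo Curator releases and may differ between versions.

    from nemo_curator import ScoreFilter, Sequential
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.filters import FastTextLangId, WordCountFilter
    from nemo_curator.modules import ExactDuplicates
    from nemo_curator.utils.distributed_utils import get_client

    # Start a GPU-backed Dask cluster so each curation step runs on GPUs.
    client = get_client(cluster_type="gpu")

    # Load raw JSONL shards as a distributed dataset backed by cuDF.
    dataset = DocumentDataset.read_json("raw_docs/", backend="cudf")

    # Step 1: language identification plus heuristic filtering.
    # lid.176.bin is the public fastText language-ID model; the resulting
    # "language" label is then used to keep only Korean documents.
    filters = Sequential([
        ScoreFilter(FastTextLangId("lid.176.bin"), score_field="language", score_type="object"),
        ScoreFilter(WordCountFilter(min_words=50)),
    ])
    filtered = filters(dataset)

    # Step 2: hash-based exact deduplication. The module identifies duplicate
    # documents by MD5 hash; in a full pipeline those IDs would then be
    # filtered out before the curated shards are written.
    duplicates = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")(filtered)

    # Write the filtered shards back out (duplicate removal omitted for brevity).
    filtered.to_json("curated_docs/")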

“Deduplication is one of the most time-consuming processes when handling a very large dataset,” said Jason Park, Co-Founder at Trillion Labs. “The time saved from NeMo Curator GPU acceleration was the most significant benefit we experienced from using the library.”

NeMo Curator is built on Dask, a Python library for parallel and distributed data processing. Much of its interface works like the data analysis library pandas, so data scientists and developers can get started easily.
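
As a point of reference for that similarity, plain Dask code reads much like pandas. The sketch below is generic Dask, not NeMo Curator’s own API, and the "language" and "text" columns are illustrative assumptions:

    import dask.dataframe as dd

    # Lazily read a directory of JSONL shards; the call mirrors pandas.read_json.
    df = dd.read_json("docs/*.jsonl", lines=True)

    # pandas-style filtering and deduplication, executed in parallel per partition.
    korean = df[df["language"] == "ko"]
    deduped = korean.drop_duplicates(subset="text")

    # Trigger the distributed computation and report how many documents remain.
    print(len(deduped))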

Rapid Data Processing Fuels AI Innovation and Market Impact

Trillion Labs has a new approach to data processing through the GPU-accelerated NeMo Curator pipeline, which:

  • Enables dozens more “what-if” experiments that explore new processing approaches or cleaning strategies, which would previously have been too costly.
  • Delivers production-ready datasets sooner, accelerating downstream activities such as training, validation, and deployment.
  • Elevates analytics by turning Trillion Labs’ data platform into a launchpad for competitive advantage and continuous innovation.

“This newfound agility elevates our entire analytics lifecycle, turning our data platform into a launchpad for continuous innovation and competitive advantage,” said Jay Shin, CEO and Co-Founder at Trillion Labs.

Additionally, NVIDIA products help Trillion Labs accelerate the innovation process. A tighter feedback loop moves prototypes to production more smoothly, and faster iterations lead to better models, which in turn open new use cases and deeper industry collaboration.

At a global scale, the benefits are tangible. AI applications become more diverse and higher quality. More rapid productization gives the market and the public more opportunities for innovation at both the corporate and personal level, meaning broader access to AI applications. Curated datasets are smaller but higher quality, because repeated and low-quality content is removed during curation. Training models on this refined data results in shorter overall training times and substantial savings in cost and energy.

Developing and deploying language models within countries enables cultural preservation and local economic development. Local investment in AI infrastructure will aid in retaining talent and fostering innovation.

NVIDIA NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization.
