NVIDIA’s Nemotron-CC pipeline is now part of NeMo Curator, marking a significant advancement in the creation of high-quality datasets for large language models (LLMs). The resulting dataset comprises 6.3 trillion tokens derived from Common Crawl and, as NVIDIA highlights, can greatly improve LLM accuracy.
Enhancements in Data Curation
This new pipeline addresses a key drawback of traditional data curation, which typically discards potentially valuable data through aggressive filtering. By combining an ensemble of classifiers with synthetic data rephrasing, the Nemotron-CC pipeline generates 2 trillion tokens of high-quality synthetic data, reclaiming up to 90% of what would otherwise be lost.
Key Features of the Pipeline
The data curation process starts by extracting text from HTML with tools such as jusText, with FastText handling language detection. It then deduplicates the corpus, using NVIDIA RAPIDS libraries to accelerate the process. The approach applies 28 heuristic filters for quality assurance and a PerplexityFilter module for further refinement.
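The deduplication and heuristic-filtering stages described above can be sketched in miniature. This is an illustrative simplification, not the NeMo Curator API: the function names, the two toy heuristics (minimum word count and symbol ratio), and the thresholds are all hypothetical stand-ins for the pipeline's 28 production filters.

```python
import hashlib

def dedup(docs):
    """Drop exact duplicates by hashing normalised text (toy version of
    the RAPIDS-accelerated deduplication stage)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.md5(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def passes_heuristics(doc, min_words=5, max_symbol_ratio=0.3):
    """Two toy heuristic filters: minimum length and symbol-to-character
    ratio. Real pipelines chain many more such checks."""
    words = doc.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(doc), 1) <= max_symbol_ratio

def curate(docs):
    """Deduplicate, then keep only documents passing all heuristics."""
    return [d for d in dedup(docs) if passes_heuristics(d)]
```

In the actual pipeline, a perplexity filter (scoring text against a language model) would run after these cheap heuristics, so the expensive model is only applied to documents that survive the fast checks.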
Quality is ensured through an ensemble of classifiers that evaluate and categorise documents by quality, enabling targeted synthetic data generation. This supports generating varied QA pairs, distilled content, and structured knowledge lists from the text.
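One way to picture the ensemble step: each classifier scores a document, the scores are averaged, and the document is bucketed by quality so that synthetic generation can target the right tier. The sketch below is a hypothetical simplification; the scoring functions, bucket names, and thresholds are assumptions, not NVIDIA's implementation.

```python
def ensemble_score(doc, classifiers):
    """Average the quality scores (each in [0, 1]) from an ensemble of
    classifier callables."""
    scores = [clf(doc) for clf in classifiers]
    return sum(scores) / len(scores)

def bucket(score, low=0.3, high=0.7):
    """Map an averaged score to a quality tier that downstream synthetic
    generation can target differently."""
    if score >= high:
        return "high"    # keep largely as-is
    if score >= low:
        return "medium"  # candidate for synthetic rephrasing
    return "low"         # heavier rewriting or distillation
```

The point of bucketing rather than hard filtering is exactly the reclamation the article describes: low-scoring documents become inputs to rephrasing and distillation instead of being thrown away.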
Advantages for LLM Training
Utilising the Nemotron-CC dataset for LLM training yields marked improvements. For example, a Llama 3.1 model trained on a one-trillion-token portion of Nemotron-CC achieved a 5.6-point gain in MMLU score over models trained on standard datasets. Moreover, models incorporating long-horizon tokens from Nemotron-CC saw a 5-point increase in benchmark evaluations.
How to Begin with Nemotron-CC
The Nemotron-CC pipeline is accessible to developers looking to pretrain foundation models or adapt training for specific domains. NVIDIA offers comprehensive tutorials and APIs for tailoring the pipeline to meet particular requirements. Its integration with NeMo Curator facilitates the smooth development of pretraining and fine-tuning datasets.
To learn more, check out the NVIDIA blog.
Image source: Shutterstock