In large language models (LLMs), pretraining data is a blend of diverse sources. It ranges from widely used English to less common languages, covers casual conversations and scholarly texts, and even extends to modalities such as images and speech. Within this mix, data from different domains interact in complex ways: sometimes they align well, sometimes they diverge, and occasionally they conflict. The challenge lies in tuning the proportions of the mix so that the model benefits from the strengths of each domain while conflicts between domains are minimized, which in turn yields more capable models.
Although an ideal training data mixture is hard to pin down, most existing practices tune the mixture through heuristics, upsampling a proportion of high-quality or underrepresented data without disclosing the concrete criteria in detail. It is hard to predict whether these data strategies are effective before the training run finishes. Advances in scaling laws, however, show that model losses on a given set of evaluation data are quantitatively predictable across a wide range of variables. If the same principle applies to mixture proportions, practitioners could estimate the performance of the resulting model before training even starts.
Researchers from Fudan University and Shanghai AI Laboratory introduced data mixing laws and a prediction pipeline, which together address the problem of accurately predicting the validation loss for a mixture of training domains under a fixed model size and amount of training data. To study how model losses depend on the data mixture, the researchers first carried out a pilot study on domain losses under two-domain mixtures. They trained 70M and 160M language models on mixtures of the Github and Pile-CC subsets of the Pile dataset, using five different mixture proportions for Github. All models were trained with a batch size of 1M tokens for 30k steps, which amounts to 30B tokens.
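To make the idea of fitting a law over mixture proportions concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the loss on one validation domain follows an exponential form `c + k * exp(t * r)` in the Github proportion `r`; the article does not state the functional form, so consult the paper for the exact law. The five proportions, the parameter values, and the function names are all hypothetical, and the "observed" losses are synthetic, not results from the study.

```python
import math

def fit_exponential_law(xs, losses, grid=2000):
    """Fit loss = c + k * exp(t * x) by scanning candidate values of c.

    For a fixed c below the smallest observed loss,
    log(loss - c) = log(k) + t * x is linear in x, so k and t follow
    from ordinary least squares. The candidate c with the smallest
    squared error in loss space is kept.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    lowest = min(losses)
    best = None
    for i in range(grid + 1):
        c = lowest * i / (grid + 1)  # always strictly below `lowest`
        ys = [math.log(loss - c) for loss in losses]
        mean_y = sum(ys) / n
        t = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
        k = math.exp(mean_y - t * mean_x)
        err = sum((c + k * math.exp(t * x) - loss) ** 2
                  for x, loss in zip(xs, losses))
        if best is None or err < best[0]:
            best = (err, c, k, t)
    return best[1:]

def predict(params, x):
    """Evaluate the assumed law at mixture proportion x."""
    c, k, t = params
    return c + k * math.exp(t * x)

# Synthetic stand-in data (assumption): five hypothetical Github proportions
# and losses generated from made-up parameters, not measured values.
TRUE_PARAMS = (1.0, 0.5, -2.0)
github_share = [0.1, 0.3, 0.5, 0.7, 0.9]
observed = [predict(TRUE_PARAMS, r) for r in github_share]

fitted = fit_exponential_law(github_share, observed)
held_out = 0.6  # a proportion that was never "trained"
print("predicted:", predict(fitted, held_out))
print("reference:", predict(TRUE_PARAMS, held_out))
```

The scan over `c` keeps the example free of third-party dependencies; in practice a nonlinear least-squares routine would likely be a better fit for noisy measurements.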
The paper makes three main contributions to optimizing data mixtures: (a) it discovers that model performance is quantitatively predictable with respect to the data mixture and summarizes this in a functional relationship, namely the data mixing laws; (b) it proposes a pipeline that predicts the performance of large-scale training on different mixture proportions using only experiments on small models with little training data, through the nested use of scaling laws of training steps, model sizes, and data mixing laws; (c) it experimentally verifies the reliability of the data mixing laws and the prediction pipeline, showing their effectiveness in optimizing model performance and balancing model capabilities, as well as their promise for guiding the design of data schedules.
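One layer of the nested approach in (b), extrapolating the loss along training steps from a short run, could look roughly like the following. This is a sketch under stated assumptions: the power-law form `c + k * steps**(-a)`, the checkpoint steps, the parameter values, and the helper names are all hypothetical, and the losses are synthetic. In the full pipeline, predictions like this one (and analogous ones over model size) would feed the mixing-law fit for each candidate mixture.

```python
import math

def fit_step_law(steps, losses, grid=2000):
    """Fit loss = c + k * steps**(-a) by scanning candidate values of c.

    For a fixed c, log(loss - c) = log(k) - a * log(steps) is linear in
    log(steps), so k and a follow from ordinary least squares.
    """
    xs = [math.log(s) for s in steps]
    n = len(xs)
    mean_x = sum(xs) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    lowest = min(losses)
    best = None
    for i in range(grid + 1):
        c = lowest * i / (grid + 1)  # always strictly below `lowest`
        ys = [math.log(loss - c) for loss in losses]
        mean_y = sum(ys) / n
        slope = sum((x - mean_x) * (y - mean_y)
                    for x, y in zip(xs, ys)) / var_x
        k = math.exp(mean_y - slope * mean_x)
        a = -slope
        err = sum((c + k * s ** (-a) - loss) ** 2
                  for s, loss in zip(steps, losses))
        if best is None or err < best[0]:
            best = (err, c, k, a)
    return best[1:]

def predict_loss(params, step):
    """Evaluate the assumed step law at a given training step."""
    c, k, a = params
    return c + k * step ** (-a)

# Synthetic checkpoints from a short run (assumption, not paper data).
TRUE_STEP_PARAMS = (2.0, 10.0, 0.5)
early_steps = [1000, 2000, 4000, 8000, 16000]
early_losses = [predict_loss(TRUE_STEP_PARAMS, s) for s in early_steps]

step_params = fit_step_law(early_steps, early_losses)
target_step = 30000  # extrapolate to the end of a 30k-step run
print("extrapolated loss:", predict_loss(step_params, target_step))
```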
To develop the loss-prediction pipeline, the researchers trained models on a mixture of RedPajama and validated them against the validation set of the Pile. A series of 70M, 160M, 305M, and 410M models was trained for 30B tokens to fit the scaling laws of training steps and model sizes. Remarkably, the model trained on the optimized mixture matches the performance of one trained on the default mixture using just 73% of the steps. By the end of training, it reaches a performance that the default mixture would need 48% more steps to achieve, underscoring the pipeline's effectiveness in mixture optimization.
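Once losses can be predicted for any mixture, choosing a mixture reduces to a search over proportions. The sketch below illustrates this for two training domains and two validation domains. The exponential per-domain form, every coefficient, the domain names, and the 50/50 "default" mixture are assumptions made up for illustration; none of them come from the paper.

```python
import math

# Hypothetical fitted laws, one per validation domain (assumption):
# loss_i(r1, r2) = c + k * exp(t1 * r1 + t2 * r2), with r2 = 1 - r1.
DOMAIN_LAWS = {
    "code": {"c": 0.8, "k": 1.0, "t": (-2.0, 0.0)},
    "web": {"c": 2.0, "k": 1.0, "t": (0.0, -1.0)},
}

def predicted_mean_loss(r1):
    """Average predicted validation loss when domain 1 gets share r1."""
    r2 = 1.0 - r1
    total = 0.0
    for law in DOMAIN_LAWS.values():
        t1, t2 = law["t"]
        total += law["c"] + law["k"] * math.exp(t1 * r1 + t2 * r2)
    return total / len(DOMAIN_LAWS)

def best_mixture(num_points=101):
    """Grid-search the share of domain 1 that minimizes predicted loss."""
    candidates = [i / (num_points - 1) for i in range(num_points)]
    return min(candidates, key=predicted_mean_loss)

default_share = 0.5  # hypothetical default mixture
optimized_share = best_mixture()
print("optimized share of domain 1:", optimized_share)
print("predicted loss (default):  ", predicted_mean_loss(default_share))
print("predicted loss (optimized):", predicted_mean_loss(optimized_share))
```

A grid search is enough for two domains; with many domains the simplex grows quickly, and sampling candidate mixtures or using a constrained optimizer would likely be more practical.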
In conclusion, the paper introduces data mixing laws and a prediction pipeline that address the problem of accurately predicting the validation loss for a mixture of training domains under a fixed model size and amount of training data. The nested use of scaling laws of training steps, model sizes, and data mixtures allows predictions to be made from small-scale experiments alone, which enables the reuse of existing experiments and reduces computation costs. The study should facilitate further quantitative studies and theoretical analysis as attention to data engineering grows.
Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.