DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

2023-06-03

In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to find domain weights for training an 8B-parameter model (30x larger) more efficiently. DoReMi improves perplexity across all domains, even when it downweights a domain. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
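To make the proxy-model stage concrete, below is a minimal sketch of the Group DRO domain-weight update that DoReMi runs while training the small proxy model. The function name, step size, and smoothing constant are illustrative choices, not the paper's exact implementation or tuned hyperparameters; the final DoReMi weights are obtained by averaging these per-step weights over proxy training and are then used to resample data for the large model.

```python
import numpy as np

def update_domain_weights(domain_weights, excess_losses, eta=1.0, smoothing=1e-3):
    """One Group DRO update of the domain weights (mixture proportions).

    domain_weights: current weights over the k domains, shape (k,), sums to 1
    excess_losses:  per-domain excess loss of the proxy model over a fixed
                    reference model on the current batch, clipped at >= 0
    eta, smoothing: step size and uniform-smoothing amount (illustrative values)
    """
    # Exponentiated-gradient ascent: upweight domains where the proxy model
    # still has the most room to improve relative to the reference model.
    log_w = np.log(domain_weights) + eta * np.asarray(excess_losses)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    # Mix with the uniform distribution so no domain's weight collapses to zero.
    k = len(w)
    return (1.0 - smoothing) * w + smoothing / k

# Example: three domains, the second currently has the largest excess loss.
weights = np.ones(3) / 3
weights = update_domain_weights(weights, excess_losses=[0.1, 0.5, 0.0])
print(weights)  # the second domain is upweighted
```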

Link [ https://arxiv.org/abs/2305.10429v1 ]
