

Deduplicating training data does not hurt perplexity: models trained on deduplicated datasets have no worse perplexity compared to baseline models trained on the original datasets. In some cases deduplication reduces perplexity by up to 10%. Further, because recent LMs are typically limited to training for just a few epochs (Radford et al., 2020), by training on higher quality data the models can reach higher accuracy faster. To summarize, data deduplication offers significant advantages and no observed disadvantages.

In the remainder of this paper we present our text deduplication framework in § 4, and study the extent of duplicate content in common NLP datasets (e.g., C4, Wiki-40B, and LM1B) in § 5. We then examine the impact of deduplication on test perplexity (§ 6.1) and on the frequency of emitting memorized content (§ 6.2). Finally, we analyze to what extent perplexity on existing, released models is skewed as a result of overlap between the train and test/validation splits (§ 6.3).
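As a rough illustration of what train-test overlap detection involves, and not the framework of § 4 or the analysis of § 6.3, the sketch below flags test documents whose whitespace- and case-normalized text also occurs verbatim in a training split. The function names and the normalization choices are hypothetical, and near-duplicate matching would require approximate methods beyond this exact-match check.

```python
import hashlib

def normalize(text: str) -> str:
    # Illustrative normalization only: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def fingerprint(text: str) -> str:
    # Hash the normalized text so large splits can be compared by digest.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def overlapping_examples(train_docs, test_docs):
    # Return test documents whose normalized text also appears in train.
    # Only exact (post-normalization) duplicates are caught here.
    train_hashes = {fingerprint(doc) for doc in train_docs}
    return [doc for doc in test_docs if fingerprint(doc) in train_hashes]

# Toy usage: one test sentence duplicates a training sentence up to casing.
train = ["The cat sat on the mat.", "Language models are trained on web text."]
test = ["language models are trained on web text.", "An entirely new sentence."]
print(overlapping_examples(train, test))
# Prints: ['language models are trained on web text.']
```

Hashing normalized documents keeps memory proportional to the number of unique training documents rather than their total length, which is what makes exact-match overlap checks of this kind feasible at corpus scale.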

2 Related Work

Large language model datasets. While we believe our results are independent of model architecture, we perform our analysis on Transformer-based decoder-only language models (Vaswani et al., 2017) trained for open-ended text generation. These current state-of-the-art models are trained on internet text. For example, the GPT-2 family of models (Radford et al., 2019) is trained on WebText, a dataset of web documents highly ranked on Reddit; however, this dataset was not made available publicly. A common dataset starting point is CommonCrawl, an index of public webpages. Models trained on CommonCrawl include GPT-3 (Brown et al., 2020), with the addition of book datasets; GROVER (Zellers et al., 2019), on a restricted subset filtered to news domains called RealNews; and T5 (Raffel et al., 2020), on a cleaned version of CommonCrawl called C4. Other models are trained on more curated Internet sources; for example, Guo et al. (2020) used high quality processed Wikipedia text from 40 different languages to train monolingual 141.4M parameter language models. Non-English models necessarily use different datasets: Zeng et al. (2021), for instance, introduced PANGU-α, a family of models with up to 200B parameters that were trained on a non-public corpus of cleaned and filtered Chinese-language documents from CommonCrawl and other sources. Since many of these datasets are not public, we deduplicate three that are: Wiki-40B, C4, and RealNews, as well as the One Billion Word Language Model Benchmark (Chelba et al., 2013), a dataset commonly used for evaluation.
