Dirty Secrets of BookCorpus, a Key Dataset in Machine Learning

BookCorpus has helped train at least thirty influential language models (including Google’s BERT, OpenAI’s GPT, and Amazon’s Bort), according to HuggingFace.

But what exactly is inside BookCorpus?

This is the research question that Nicholas Vincent and I ask in a new working paper, which attempts to address some of the “documentation debt” in machine learning research — a concept discussed by Emily M. Bender, Timnit Gebru, and their co-authors in the Stochastic Parrots paper.

