Dirty Secrets of BookCorpus, a Key Dataset in Machine Learning

BookCorpus has helped train at least thirty influential language models (including Google's [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html), OpenAI's [GPT](https://openai.com/blog/language-unsupervised/), and Amazon's [Bort](https://huggingface.co/amazon/bort)), according to [HuggingFace](https://huggingface.co/datasets/bookcorpus).

*But what exactly is inside BookCorpus?*

This is the research question that [Nicholas Vincent](https://www.nickmvincent.com/) and I ask in a [new working paper](https://arxiv.org/abs/2105.05241) that attempts to address some of the "documentation debt" in machine learning research, a concept discussed by [Dr. Emily M. Bender](http://faculty.washington.edu/ebender/), [Dr. Timnit Gebru](https://ai.stanford.edu/~tgebru/), et al. in their [Stochastic Parrots paper](https://dl.acm.org/doi/abs/10.1145/3442188.3445922).

[Read the full article on Towards Data Science](https://towardsdatascience.com/dirty-secrets-of-bookcorpus-a-key-dataset-in-machine-learning-6ee2927e8650)
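For readers who want a first look at the data themselves, the copy of BookCorpus hosted on the Hugging Face Hub can be inspected in a few lines. This is a minimal sketch, not code from the paper: it assumes the `datasets` library is installed, and the hosted replica is a later re-creation of the corpus, so it is not necessarily identical to the version used to train the models above.

```python
# Minimal sketch: peek inside the Hugging Face copy of BookCorpus.
# Assumes `pip install datasets`; depending on your `datasets` version you
# may also need to pass trust_remote_code=True to load_dataset.
from datasets import load_dataset

# The "bookcorpus" dataset on the Hub exposes a single "train" split,
# with each row holding one short snippet of book text.
bookcorpus = load_dataset("bookcorpus", split="train")

print(bookcorpus)              # row count and column names
print(bookcorpus.features)     # schema: a single "text" string column
print(bookcorpus[0]["text"])   # the first record
```

Even a quick look like this surfaces the kinds of questions the paper pursues: how many rows there are, what a record actually contains, and how this copy relates to the original Smashwords crawl.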
Tags: dirty secrets