Dirty Secrets of BookCorpus, a Key Dataset in Machine Learning

BookCorpus has helped train at least thirty influential language models (including Google’s BERT, OpenAI’s GPT, and Amazon’s Bort), according to HuggingFace.

But what exactly is inside BookCorpus?

This is the research question that Nicholas Vincent and I ask in a new working paper, which attempts to address some of the “documentation debt” in machine learning research — a concept discussed by Emily M. Bender, Timnit Gebru, and their co-authors in the Stochastic Parrots paper.

