T5: Text-to-Text Transformers (Part One)

<p>The transfer learning paradigm consists of two main stages. First, we pre-train a deep neural network over a large corpus of data. Then, we fine-tune this model (i.e., train it some more) over a more specific, downstream dataset. The exact implementation of these stages may take many different forms. In computer vision, for example, we often pre-train models on the ImageNet dataset using a supervised learning objective, then fine-tune these models in a supervised manner on the downstream dataset (i.e., the task that we are actually trying to solve). Alternatively, in natural language processing (NLP), we often perform <a href="https://cameronrwolfe.substack.com/i/76273144/self-supervised-learning">self-supervised</a> pre-training over an unlabeled textual corpus.</p>
<p>Combining large, deep neural networks with massive (pre-)training datasets often leads to impressive results. This holds especially true for NLP. Given that raw textual data is freely available on the internet, we can simply download a massive textual corpus, pre-train a large neural net on this data, then fine-tune the model on a variety of downstream tasks (or just use zero/few-shot learning techniques). This large-scale transfer learning approach was initially explored by BERT [2], which pre-trained a <a href="https://cameronrwolfe.substack.com/i/76273144/transformer-encoders">transformer encoder</a> over unlabeled data using a <a href="https://cameronrwolfe.substack.com/i/76273144/training-bert">masking objective</a>, then fine-tuned it on downstream language tasks.</p>
<p><a href="https://towardsdatascience.com/t5-text-to-text-transformers-part-one-6b655f27c79a"><strong>Visit Now</strong></a></p>
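<p>To make the two stages concrete, here is a minimal sketch (not from the original article) that downloads a BERT checkpoint already pre-trained with a masked-language-modeling objective and fine-tunes it on a tiny, hypothetical sentiment-classification dataset. It assumes PyTorch and the Hugging Face <code>transformers</code> library; the checkpoint name, example texts, and hyperparameters are illustrative placeholders.</p>
<pre><code># Minimal sketch of the pre-train / fine-tune recipe described above.
# Assumes PyTorch and Hugging Face `transformers`; the dataset and
# hyperparameters below are toy placeholders, not the article's setup.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stage 1 (pre-training) is inherited: download a BERT encoder that was
# already pre-trained on unlabeled text with a masking objective.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Hypothetical downstream dataset: two labeled sentences for sentiment.
texts = ["this movie was great", "this movie was terrible"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Stage 2: supervised fine-tuning on the downstream task.
optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the toy batch
    outputs = model(**batch, labels=labels)  # classification loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
</code></pre>
<p>The same two-stage recipe applies regardless of how the downstream task is framed: here it is a classification head on top of the pre-trained encoder, whereas T5 instead casts every downstream task as text-to-text generation.</p>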
Tags: Transformers