Introduction to Natural Language Processing with PyTorch (2/5)
<p>In the previous unit, we learned how to represent text as numbers.<br />
In this unit, we’ll explore some approaches to feeding variable-length text into a neural network by collapsing the input sequence into a fixed-length vector, which can then be used in a classifier.</p>
<p>To begin with, let’s load the <strong>AG_News</strong> dataset and build the vocabulary.<br />
To keep things short, all of those operations are combined into the <em>load_dataset</em> function of the accompanying Python module:</p>
<pre>
!pip install -r https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/nlp-pytorch/requirements.txt
!wget -q https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/nlp-pytorch/torchnlp.py</pre>
<pre>
import torch
import torchtext
import os
import collections
from torchnlp import *
train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_size = len(vocab)
print("Vocab size = ",vocab_size)</pre>
<pre>
Loading dataset...
Building vocab...
Vocab size = 95812</pre>
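<p>As a quick sanity check, we can look at one training sample. The snippet below is a minimal sketch that assumes each dataset item is a <em>(label, text)</em> pair, as in torchtext’s AG_NEWS:</p>
<pre>
# Each sample is assumed to be a (label, text) pair
label, text = train_dataset[0]
print(label)
print(text)</pre>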
<h2>Bag-of-Words (BoW) Representation</h2>
<p>Sometimes we can figure out the meaning of a text just by looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like <em>weather</em> and <em>snow</em> are likely to indicate a weather forecast, while words like <em>stocks</em> and <em>dollar</em> would count towards financial news.</p>
<blockquote>
<p>Bag-of-Words (BoW) is the most commonly used traditional vector representation. Each word is linked to a vector index, and the corresponding vector element contains the number of occurrences of that word in a given document.</p>
</blockquote>
<p>Note: You can also think of BoW as a sum of all one-hot-encoded vectors for the individual words in the text. Below is an example of how to generate a BoW representation for a text using the vectorization defined previously.</p>
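<p>Here is a minimal sketch of such a conversion. It assumes the <em>basic_english</em> tokenizer from torchtext and the <em>vocab</em> built above; the helper name <em>to_bow</em> is illustrative and not necessarily part of the accompanying module:</p>
<pre>
import torch
import torchtext

tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

def to_bow(text, bow_vocab_size=vocab_size):
    # Zero vector of vocabulary size; each position counts one word's occurrences
    res = torch.zeros(bow_vocab_size, dtype=torch.float32)
    for token in tokenizer(text):
        i = vocab[token]              # look up the word's index in the vocabulary
        if i in range(bow_vocab_size):
            res[i] += 1
    return res

# The second element of an AG_NEWS sample is assumed to be the news text
print(to_bow(train_dataset[0][1]))</pre>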