Text Preprocessing Part — 2

<p>Text preprocessing is an integral part of Natural Language Processing, because no machine learning model can process textual data directly. In this part we will look in detail at the different text vectorization mechanisms the AI developer community has to offer&hellip; I would suggest you read my previous blog for a general, high-level overview of text preprocessing methods before tackling vectorization, but it is not mandatory: you do not need it to understand this part of my text preprocessing series.</p> <p><a href="https://medium.com/@sanjithkumar986/text-preprocessing-for-nlp-part-1-99b1a99b17bc" rel="noopener"><strong>Text Preprocessing for NLP part &mdash; 1</strong></a></p> <p>Before starting with the actual vectorization techniques, let us understand why we need to vectorize textual data at all, especially for readers with no background in computer science&hellip; Computers simply cannot process text; they can only process numbers. We therefore have to take this into consideration and employ efficient ways to convert our text into numbers (generally into a vector).</p> <ul> <li>Vectorization is the classic approach of converting input data from its raw format (i.e. text) into vectors of real numbers, which is the format that ML models expect. The approach is well established across many domains and is now standard practice in NLP.</li> <li>In Machine Learning, vectorization is a step in feature extraction: the idea is to extract distinct features from the text for the model to train on, by converting the text into numerical vectors.</li> </ul> <p>There are many vectorization techniques; these are some important and widely used ones:</p> <ol> <li>Bag of Words / Count Vectorization</li> <li>Word Embeddings (Word2Vec, GloVe, FastText)</li> <li>TF-IDF (Term Frequency&ndash;Inverse Document Frequency)</li> <li>Trained Embeddings (Transformer-based models, BERT embeddings)</li> </ol> <p>In this second part we will cover Bag of Words in depth, and I will talk about the other three topics in upcoming articles.</p> <p><a href="https://medium.com/@sanjithkumar986/text-preprocessing-part-2-3156bc5edef1">Website</a></p>
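<p>To make the idea of count vectorization concrete before we go deeper, here is a minimal pure-Python sketch. The function name <code>bag_of_words</code> and the toy corpus are illustrative choices of mine, not part of any particular library; real projects would typically reach for something like scikit-learn&rsquo;s <code>CountVectorizer</code> instead.</p>

```python
from collections import Counter

def bag_of_words(corpus):
    """Minimal Bag of Words sketch: turn documents into count vectors.

    This is an illustrative toy, not a library API: tokenization here is
    just lowercasing and whitespace splitting.
    """
    # Tokenize each document
    tokenized = [doc.lower().split() for doc in corpus]
    # Build a sorted vocabulary of every unique token in the corpus
    vocab = sorted({tok for doc in tokenized for tok in doc})
    # Each document becomes a vector of word counts aligned to the vocabulary
    vectors = [[Counter(doc)[word] for word in vocab] for doc in tokenized]
    return vocab, vectors

corpus = ["the cat sat", "the cat sat on the mat"]
vocab, vectors = bag_of_words(corpus)
print(vocab)    # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

<p>Notice that word order is thrown away: each document is reduced to how often each vocabulary word occurs, which is exactly the &ldquo;bag&rdquo; in Bag of Words.</p>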