Artificial Intelligence has many applications. One of them is Natural Language Processing, or NLP for short. NLP is the study and processing of natural text to find interesting patterns or details in it for purposes such as tagging a post automatically. Contrary to the general belief that it is a very recent branch of computer science, NLP started in the early 1950s.
There are various applications of NLP, such as machine translation (e.g., Google Translate), Part-of-Speech tagging (e.g., tagging whether a word is a noun, a verb, etc.), automatic summarization (e.g., Text Compactor), natural language generation (e.g., Alice, a program that converses with humans), etc. You can read more about them here.
In this post, I explain the terminology used in NLP.
- Natural Text
Natural text is any text that is, in general, human-readable. Any language that humans write is a natural language and can be used for language processing.
Eg: English, Hindi, French
- Labeled Text
Basically, marked text that can be used by various machine learning or NLP algorithms to predict labels (annotations) for unannotated, or simply unmarked, text.
Eg: A movie review tagged with positive or negative sentiment.
- Data Set
The data on which we intend to perform an NLP task is the data set.
Eg: Data set of human genome.
- Attribute
The data set in general has certain attributes that we use while processing it. A column in a database table can be considered an attribute.
Eg: “Wind speed” in weather data.
- Classification
When there is a list of pre-defined labels, the problem of assigning those labels to the data set is called classification.
Eg: Classifying news articles into their categories, classifying a tumor as benign or malignant.
- Regression
When we have a data set and must predict the value of one attribute from the other attributes, it is a regression problem. Regression is not generally used in NLP.
Eg: Predicting the weather, predicting house rent based on house area.
- Clustering
When there are no pre-defined labels and all we want is to group similar objects, the problem we are dealing with is clustering.
Eg: Clustering similar kinds of reviews or posts.
- Supervised Learning
A learning algorithm that learns, or actually trains a model (which we will talk about in the next post), from previously labeled data. Learning algorithms can be of various types: statistics-based, probability-based, simple if-else rules, or something else. Some of the algorithms are Naive Bayes, Decision Trees, K-Nearest Neighbors, Support Vector Machines, etc. Supervised learning is generally used when we have a huge or at least considerable amount of labeled data.
Eg: Part of Speech Tagging, Severity Prediction for the Bug Data.
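To make the idea of supervised learning concrete, here is a minimal Naive Bayes classifier sketched in plain Python. The reviews and labels are made-up toy data, and the `predict` helper is my own illustrative function, not part of any library; real projects would use a library implementation instead.

```python
from collections import Counter
import math

# Toy labeled data set (hypothetical movie reviews).
train = [
    ("a great and enjoyable movie", "positive"),
    ("truly wonderful acting", "positive"),
    ("a boring and terrible plot", "negative"),
    ("awful waste of time", "negative"),
]

# Count how often each word appears under each label.
word_counts = {"positive": Counter(), "negative": Counter()}
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("wonderful movie"))   # -> positive
print(predict("boring and awful"))  # -> negative
```

The algorithm learns entirely from the labeled examples, which is exactly what makes it supervised.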
- Unsupervised Learning
Algorithms which train a model when no labels are available. These algorithms are generally built using an "intuition" or heuristic, which is nothing but an observation about the data. Some of the algorithms are Artificial Neural Networks, K-Means, Single Linkage, etc. The general field where they are used is clustering.
Eg: Clustering of news articles by their categories, Clustering of users by their data.
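Since the text names K-Means, here is a small sketch of it on one-dimensional toy data (hypothetical review lengths). The `kmeans` function is my own minimal illustration, assuming a fixed number of iterations rather than a convergence test.

```python
import random

# Toy one-dimensional data: hypothetical review lengths with two natural groups.
points = [2.0, 3.0, 2.5, 30.0, 31.0, 29.5]

def kmeans(points, k, iterations=10, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

for cluster in kmeans(points, k=2):
    print(sorted(cluster))
```

No labels are given anywhere; the grouping emerges purely from the distances between points.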
- Semi-Supervised Learning
This combines supervised and unsupervised learning. It is generally used on data that is partly labeled and largely unlabeled. The model learns both from the labeled data and from data labeled during an unsupervised step.
- Pre-processing
The task of processing the data set before using it to learn language models. What we basically do in this part of our program is modify the data set to convert it into a corpus: data that we can feed to the algorithms, which are generic, that is, they work for various kinds of data sets. Hence the need to convert our data so that we can use those algorithms. Below are three methods generally used in NLP pre-processing.
- N-Gram
In NLP, a gram is a basic unit of text, and it can be associated with both words and characters.
A character n-gram means considering n consecutive characters at once, and the same holds for words. The most common cases are the uni-gram (a single unit) and the bi-gram (two units).
Eg: I live in New York.
Uni-Gram => [‘I’, ‘live’, ‘in’, ‘New’, ‘York’]
Bi-Gram => [‘I live’, ‘live in’, ‘in New’, ‘New York’]
- Tokenization
We break the text into tokens, which are generally words or phrases of a certain length. These extracted words or phrases are called tokens. In simple words, we extract all the words, or groups of words of a certain length, from the sentence or text.
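The tokenization and n-gram steps above can be sketched together in a few lines of Python, reproducing the "I live in New York." example. The `word_ngrams` helper is my own illustrative function, using a deliberately naive tokenizer (strip the period, split on spaces); real tokenizers handle punctuation far more carefully.

```python
def word_ngrams(text, n):
    """Split text into word tokens, then slide a window of n tokens."""
    tokens = text.replace(".", "").split()  # naive tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I live in New York."
print(word_ngrams(sentence, 1))  # ['I', 'live', 'in', 'New', 'York']
print(word_ngrams(sentence, 2))  # ['I live', 'live in', 'in New', 'New York']
```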
- Stop Words Removal
Not all words in a language carry real meaning; in the absence of some of them, we do not lose the true meaning, or at least the gist, of the text.
Eg: Conjunctions and prepositions: 'they', 'then', 'under', 'over', 'is', 'an', 'and', etc. You can find a list on the MySQL website.
Note: List of stop words depends on the context of the problem.
- Stemming
Reducing a word to its root is called stemming. The root here means the same thing we study when learning about word formation.
Eg: 'education', 'educate', and 'educating' have the root word 'educate'.
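A toy suffix-stripping stemmer illustrates the idea. The suffix list and the `stem` helper are my own simplification; real systems use algorithms such as the Porter stemmer, and note that stemmers often produce truncated stems like 'educ' rather than a dictionary word like 'educate'.

```python
# Toy suffix list; a real stemmer (e.g. Porter) has many more rules.
SUFFIXES = ["ation", "ating", "ated", "ing", "ed", "es", "s"]

def stem(word):
    """Strip the longest matching suffix, keeping at least 3 characters of the word."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["education", "educating", "educated"]:
    print(w, "->", stem(w))  # all three reduce to 'educ'
```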
- Features
The tokens which, after all required pre-processing, are ready to be fed to our algorithm are called features. Features are the units we have extracted from the data set and supply to our algorithm so that it can learn. The difference between tokens and features is simply that once tokens are ready for the algorithm, they are called features.
Eg: Set of words extracted from data set which we can pass to classifier.
- Features Extraction
Extraction of features from the tokens. There are various methods to do so, which we will read about later.
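One common extraction method is the bag-of-words count vector, sketched below. The documents are made-up toy data, and `to_feature_vector` is my own illustrative helper; libraries usually provide an equivalent vectorizer.

```python
from collections import Counter

# Turn each document's tokens into a count vector over a shared vocabulary.
documents = [
    ["great", "movie", "great", "acting"],
    ["boring", "movie"],
]

vocabulary = sorted({word for doc in documents for word in doc})

def to_feature_vector(tokens):
    """One count per vocabulary word, in a fixed order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

print(vocabulary)                        # ['acting', 'boring', 'great', 'movie']
print(to_feature_vector(documents[0]))   # [1, 0, 2, 1]
print(to_feature_vector(documents[1]))   # [0, 1, 0, 1]
```

Each vector has the same length and ordering, which is what lets a generic algorithm consume documents of different sizes.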
- Features Selection
Selection of features from the extracted tokens. Not all tokens are useful, and not all data is useful; we have to decide which ones are good to work with.
- Training and Testing
There are two parts. The first is training, where we feed the features to the algorithm, which then trains, or let's say creates, a model from those features. The second is testing, where we use the model the algorithm has created to see how well it is doing.
There are various methods to evaluate a model. Some common evaluation metrics are Accuracy, Precision, Recall, and F1-Score.
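These metrics reduce to simple counts over a held-out test set. The `actual` and `predicted` lists below are made-up toy values; precision, recall, and F1 are computed here for the "pos" class.

```python
# Hypothetical gold labels and model predictions on a test set.
actual    = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "neg", "pos", "pos"]

# True positives, false positives, false negatives for the "pos" class.
tp = sum(1 for a, p in zip(actual, predicted) if a == "pos" and p == "pos")
fp = sum(1 for a, p in zip(actual, predicted) if a == "neg" and p == "pos")
fn = sum(1 for a, p in zip(actual, predicted) if a == "pos" and p == "neg")

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```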
Note: Not all of the above may actually be required in every NLP problem, but these are some common terms that every NLP enthusiast should be familiar with. There is also a chance of inaccuracy in the definitions, as I tried to explain things in the easiest way possible.