NLP Overview

Natural language processing (NLP) is one of the most significant and fastest-developing areas of modern data science. The main idea behind NLP is to let people communicate with computers in their own natural language. Today NLP allows a computer to understand human speech, interpret and analyse it, generate text, and produce summaries of it.

NLP in everyday applications

1.1. It is worth looking at the areas of everyday life in which natural language processing is already deeply integrated. The most fitting and widely used example is the combination of search engines with NLP algorithms, since modern search engines are the main source of information for most people.

NLP makes it possible to enter a search request by voice; this is speech recognition, or Speech-to-Text. If you prefer to type the request as plain text, then after you enter the first characters in the search box, a text generation algorithm may propose a few of the most probable options for the remaining part of your request.
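The idea of proposing the most probable continuation can be illustrated with a very small sketch. It is not how any real search engine works; it only assumes a hypothetical list of past queries (`QUERIES`) and counts which word most often follows another (a bigram model). The function names are made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical mini "search log"; a real system would learn from far more data.
QUERIES = [
    "weather in london",
    "weather in paris",
    "weather in london today",
    "weather forecast",
]

def build_bigram_counts(queries):
    """Count how often each word directly follows another across all queries."""
    counts = defaultdict(Counter)
    for query in queries:
        words = query.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def suggest_next(counts, word, k=2):
    """Return up to k most frequent continuations of `word` (empty if unseen)."""
    return [w for w, _ in counts.get(word, Counter()).most_common(k)]

counts = build_bigram_counts(QUERIES)
print(suggest_next(counts, "in"))       # ['london', 'paris']
print(suggest_next(counts, "weather"))  # ['in', 'forecast']
```

Production systems typically use neural language models and personalisation signals instead of raw bigram counts, but the principle of ranking continuations by probability is the same.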

However, today you do not even need to open a browser. With a voice assistant installed, a spoken shortcut phrase followed by a request is enough for the assistant to perform a set of actions that depends on the request. For example, a voice assistant can be integrated into a smart home system: the user issues requests to manage the system's IoT devices, either by speaking inside the building or through a personal device with the voice assistant installed (possibly via a dedicated application), and the assistant carries those tasks out. For those who want to know more about how NLP works in voice assistant applications, here are interesting links in Russian (https://habr.com/ru/company/yandex/blog/339638/) and in English (https://quoracreative.com/article/voice-search-evolution).

NLP in more specialised areas

1.2. Besides improving everyday tasks, natural language processing is used in other, more advanced areas, for example machine translation, information retrieval, text categorisation, text summarisation, and sentiment analysis & opinion mining. Each of these topics is reviewed in the following part of the article, together with additional resources on each technology.

Machine Translation

1.2.1. The purpose of machine translation is self-explanatory: the model receives text in one language and returns a sequence with the same meaning in another language. Machine translation is a relatively old task, so three main approaches have emerged over time: Rule-based Machine Translation (RBMT), Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). In RBMT and SMT, a sequence of separate processing steps is applied to the input sequence to transform it. In NMT, a single artificial neural network is enough to solve the translation problem.

Here is a code example of a Seq2Seq model to try (https://towardsdatascience.com/how-to-implement-seq2seq-lstm-model-in-keras-shortcutnlp-6f355f3e5639).

An additional article makes the differences between these technologies clearer (https://towardsdatascience.com/machine-translation-a-short-overview-91343ff39c9f).
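To make the "sequence of separate steps" idea behind rule-based translation more concrete, here is a toy sketch. It is only an illustration under strong assumptions: the five-word English-to-Spanish lexicon, the part-of-speech tags and the single reordering rule are all invented for this example, and real RBMT systems use far richer grammars and dictionaries.

```python
# Toy English -> Spanish lexicon with part-of-speech tags (illustrative only).
LEXICON = {
    "the": ("el", "DET"),
    "red": ("rojo", "ADJ"),
    "car": ("coche", "NOUN"),
    "is": ("es", "VERB"),
    "big": ("grande", "ADJ"),
}

def lookup(tokens):
    """Step 1: translate word by word; unknown words pass through unchanged."""
    return [LEXICON.get(token, (token, "UNK")) for token in tokens]

def reorder(tagged):
    """Step 2: apply one grammar rule, moving an adjective after the noun it precedes."""
    result = list(tagged)
    i = 0
    while i < len(result) - 1:
        if result[i][1] == "ADJ" and result[i + 1][1] == "NOUN":
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return result

def translate(sentence):
    """Run the two steps in order and join the translated words."""
    tokens = sentence.lower().split()
    return " ".join(word for word, _ in reorder(lookup(tokens)))

print(translate("The red car is big"))  # el coche rojo es grande
```

An NMT model replaces both hand-written steps with one network that learns the mapping from example sentence pairs.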

Information Retrieval

1.2.2. Compared with machine translation, information retrieval (IR) solves a completely different set of problems, although both are parts of NLP.

IR lets the user write a query that describes what they are looking for; the system then finds the files in a dataset that are likely to match those requirements. IR may be confused with Data Mining, but Data Mining is the process of discovering patterns in data and making predictions from the patterns found. IR, in contrast, works properly only if the dataset actually contains material related to the query. More specific information about Information Retrieval is available here (https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_information_retrieval.htm), and a more detailed description of the differences between Information Retrieval and Data Mining is available here (https://www.quora.com/What-is-the-difference-between-information-retrieval-and-data-mining-How-is-big-data-related-to-these-two-different-techniques).
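As a rough illustration of matching a query against a dataset, the sketch below ranks three made-up documents by a simple TF-IDF score. TF-IDF is one common weighting scheme and is chosen here as an assumption; the document names, texts and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical document collection.
DOCUMENTS = {
    "doc1": "neural machine translation with attention",
    "doc2": "statistical machine translation of news text",
    "doc3": "sentiment analysis of movie reviews",
}

def tokenize(text):
    return text.lower().split()

def build_index(documents):
    """Compute per-document term frequencies and inverse document frequencies."""
    term_freqs = {name: Counter(tokenize(text)) for name, text in documents.items()}
    doc_freq = Counter()
    for tf in term_freqs.values():
        doc_freq.update(tf.keys())
    n_docs = len(documents)
    # Rare terms get a higher weight than terms found in many documents.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return term_freqs, idf

def search(query, term_freqs, idf):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    scores = {}
    for name, tf in term_freqs.items():
        score = sum(tf[term] * idf.get(term, 0.0) for term in tokenize(query))
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

term_freqs, idf = build_index(DOCUMENTS)
print(search("neural translation", term_freqs, idf))  # ['doc1', 'doc2']
```

A query about a topic that is absent from the collection returns an empty list, which matches the point above that IR depends on the dataset being related to the query.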

Text Categorisation

1.2.3. Text categorisation, also called text classification, is self-explanatory: it assigns a text to a category using a supervised model trained on texts labelled with the required categories. The best way to understand the concept is to examine a code example; here is a good article with a detailed one (https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a).
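For a first impression without any external libraries, here is a minimal sketch of a supervised classifier. It assumes a multinomial Naive Bayes model with Laplace smoothing, which is one common choice and not necessarily the one used in the linked article; the four training sentences and their labels are invented.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training texts.
TRAIN = [
    ("the team won the match", "sports"),
    ("the player scored a goal", "sports"),
    ("the election results were announced", "politics"),
    ("the minister gave a speech", "politics"),
]

def train(examples):
    """Collect word counts per label, document counts per label and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict(text, word_counts, label_counts, vocab):
    """Return the label with the highest log-probability for the text."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace (add-one) smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict("the player won", *model))       # sports
print(predict("the minister speech", *model))  # politics
```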

Text Summarisation

1.2.4. Text summarisation produces a shorter version of a text that contains only its relevant parts. In general, there are two approaches to producing a summary: extraction and abstraction. In broad terms, an extractive summarisation algorithm scores the relevance of each sentence, then extracts the most relevant sentences and merges them into the summary, while an abstractive algorithm generates new sentences instead of copying existing ones.

A relevant additional article (https://medium.com/luisfredgs/automatic-text-summarization-with-machine-learning-an-overview-68ded5717a25) and a practical code example (https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70) are available.
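The extractive approach described above (score each sentence, keep the most relevant ones, merge them) can be sketched in a few lines. The scoring rule here, average word frequency after removing a small hand-picked stopword list, is an assumption chosen for simplicity and differs from the method in the linked articles.

```python
import re
from collections import Counter

# Small illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "that", "this"}

def content_words(text):
    """Lowercase the text and keep only words that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def summarize(text, n_sentences=1):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)

TEXT = "Cats sleep a lot. Cats and dogs are popular pets. The weather was nice yesterday."
print(summarize(TEXT, 1))  # Cats sleep a lot.
```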

Sentiment Analysis & Opinion Mining

1.2.5. Sentiment Analysis & Opinion Mining is one of the most interesting and most intensively researched areas of Natural Language Processing.

In general, it extracts subjective information from a text, namely opinions, sentiments, evaluations, attitudes, and emotions.

It is a fairly complex field, and the best way to get familiar with it is to understand the theoretical part first (https://towardsdatascience.com/using-nlp-to-figure-out-what-people-really-think-e1d10d98e491) and then start exploring a practical example (https://github.com/SenticNet/personality-detection/).

Additionally, here is a really good explanation of Sentiment Analysis & Opinion Mining that uses Twitter as a source of data (https://lexitron.nectec.or.th/public/LREC-2010_Malta/pdf/385_Paper.pdf).
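Before moving to the linked resources, it may help to see the simplest possible form of sentiment analysis. The sketch below is a lexicon-based scorer; the word lists and the negation rule are invented for illustration, and this is not the approach used in the linked paper or repository, which rely on trained models.

```python
import re

# Tiny hand-made lexicons (assumptions for this example only).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}
NEGATIONS = {"not", "never", "no"}

def sentiment(text):
    """Label a text as positive, negative or neutral by counting lexicon words."""
    score = 0
    negate = False
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in NEGATIONS:
            negate = True  # flips the polarity of the next word only
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Great camera and excellent battery"))  # positive
print(sentiment("I do not love this phone"))            # negative
print(sentiment("The box arrived today"))               # neutral
```

A scorer like this fails on sarcasm, longer negation scopes ("not very good") and domain-specific vocabulary, which is part of why the field is considered hard.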

Roadmap for beginners

2.1. It is always better to start from the theory, and there are many articles on natural language processing with detailed explanations on (https://towardsdatascience.com/search?q=NLP) and (https://medium.com/search?q=NLP).

2.2. Then, if you like the topic, you should move on to practical tasks. The best frameworks to use would be TensorFlow (https://www.tensorflow.org/tutorials/text/) and PyTorch (https://pytorch.org/tutorials/beginner/transformer_tutorial.html). If you want to explore more advanced code examples, the best source is definitely Kaggle (https://www.kaggle.com/search?q=NLP).

2.3. If you want to go through courses, I can recommend (https://www.coursera.org/specializations/natural-language-processing), or you can simply pick the one you like the most (https://www.coursera.org/search?query=NLP&).

Summary

As you have probably noticed, NLP is developing quickly. It is already integrated into many aspects of our lives, will become even more so in the near future, and helps to solve many fundamental problems.

Author of the article: Daniil Kozlov