1 NLP Core

Reference: https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/

  • Tokenization – the process of converting text into tokens
  • Tokens – words or entities present in the text
  • Text object – a sentence, phrase, word, or article

Setup:

  1. sudo pip install -U nltk
  2. import nltk; nltk.download()
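
With NLTK installed, tokenization is a one-liner. A minimal sketch (the punkt tokenizer model is assumed; the exact data-package name may vary by NLTK version):

```python
import nltk

# One-time download of the sentence/word tokenizer model
nltk.download('punkt')

text = "Natural Language Processing turns raw text into tokens."

# Split the text object into word tokens
tokens = nltk.word_tokenize(text)
print(tokens)
# ['Natural', 'Language', 'Processing', 'turns', 'raw', 'text', 'into', 'tokens', '.']
```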

2 Text Preprocessing

The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text preprocessing.

It predominantly comprises three steps: [Create graph of the process]

  • Noise Removal
  • Lexicon Normalization
  • Object Standardization

2.1 Noise Removal

This step deals with the removal of all types of noisy entities present in the text – for example, language stopwords (commonly used words of a language – is, am, the, of, in, etc.), URLs or links, social media entities (mentions, hashtags), punctuation, and industry-specific words.

A general approach to noise removal is to prepare a dictionary of noisy entities and iterate over the text object token by token (or word by word), eliminating the tokens that are present in the noise dictionary, as in the sketch below.
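
A minimal sketch of this lookup approach (the contents of noise_list are illustrative assumptions, not a standard stopword list):

```python
noise_list = {"is", "a", "this", "the", "of", "in"}

def remove_noise(text):
    """Drop every token that appears in the noise dictionary."""
    words = text.split()
    cleaned = [word for word in words if word.lower() not in noise_list]
    return " ".join(cleaned)

print(remove_noise("this is a sample text"))  # 'sample text'
```

In practice the hand-rolled set can be swapped for NLTK's stopword corpus, nltk.corpus.stopwords.words('english').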

TODO: Write a Medium article and create regexes to remove common noise -> create a GitHub .md file or Medium page listing regexes for emails, hyperlinks, etc.

2.2 Lexicon Normalization

Another type of textual noise arises from the multiple representations exhibited by a single word.

For example – “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. Though they mean different things, contextually they are all similar.

This step converts all such disparities of a word into its normalized form (also known as the lemma).

Normalization is a pivotal step for feature engineering with text, as it converts high-dimensional features (N different features) into a low-dimensional space (1 feature), which is ideal for any ML model.

The most common lexicon normalization practices are:

2.2.1 Stemming

Stemming is a rudimentary, rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
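
A minimal sketch with NLTK's PorterStemmer (the example words are illustrative):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Suffix stripping is purely rule-based, so not every variant collapses
for word in ["playing", "played", "plays", "player"]:
    print(word, "->", stemmer.stem(word))
# playing -> play
# played -> play
# plays -> play
# player -> player
```

Note how “player” survives untouched – the rules alone cannot tell it belongs to the same family, which is exactly the limitation lemmatization addresses.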

2.2.2 Lemmatization

Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).
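
A minimal sketch with NLTK's WordNetLemmatizer (the wordnet corpus is assumed downloaded; pos="v" tells the lemmatizer to treat the word as a verb):

```python
import nltk
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')  # lemma dictionary used for the lookup

lem = WordNetLemmatizer()

# Unlike stemming, lemmatization consults a vocabulary, so it returns real words
print(lem.lemmatize("multiplying", pos="v"))  # 'multiply'
print(lem.lemmatize("played", pos="v"))       # 'play'
```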

2.3 Object Standardization

Text data often contains words or phrases that are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed; the sketch below uses a dictionary lookup method to replace social media slang in a text.
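
A minimal sketch of that lookup (the slang entries in lookup_dict are illustrative assumptions, not an exhaustive list):

```python
# Hand-built slang -> standard-word dictionary (illustrative entries)
lookup_dict = {"rt": "retweet", "dm": "direct message", "awsm": "awesome", "luv": "love"}

def standardize_words(text):
    """Replace each slang token with its standard form; leave the rest untouched."""
    new_words = [lookup_dict.get(word.lower(), word) for word in text.split()]
    return " ".join(new_words)

print(standardize_words("RT this tweet is awsm"))
# 'retweet this tweet is awesome'
```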

More on the cleaning text: https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/

3 Text to Features (Feature Engineering on text data)

To analyse preprocessed data, it needs to be converted into features. Depending upon the usage, text features can be constructed using assorted techniques – syntactical parsing, entities / n-grams / word-based features, statistical features, and word embeddings.

3.1 Syntactic Parsing

Syntactical parsing involves analysing the words in a sentence for grammar and arranging them in a manner that shows the relationships among the words. Dependency grammar and part-of-speech tags are important attributes of text syntax.
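
As a starting point, part-of-speech tags can be obtained directly from NLTK. A minimal sketch (the averaged_perceptron_tagger model is assumed; its data-package name may vary by NLTK version):

```python
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')  # model behind nltk.pos_tag

text = "I am learning Natural Language Processing"
tokens = nltk.word_tokenize(text)

# Each token is paired with its part-of-speech tag (Penn Treebank tag set)
print(nltk.pos_tag(tokens))
# [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'),
#  ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP')]
```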

3.1.1