Natural language processing (NLP)



* In the world of machine learning we work with numeric data; if we have textual data, we convert it into numeric data (just like dummy variables)

* Whereas in the world of text mining (or NLP) we take textual data and build models on it; we are finding hidden patterns in textual data

* Retrieving data from somewhere (Google, social media) is not text mining

* Getting insight from the words, and then trying to use (or sell) that insight, is text mining

* NLP is based on the hidden features of text

* There is a difference between text mining and NLP; let us understand it with an example:

Suppose Rishabh writes a status on Facebook that he is going to New York. From this text we get some insight. Text mining is when we recommend Rishabh flights and hotels in New York. NLP goes one level deeper: here we try to find the hidden features of the text, e.g. we recommend Rishabh a course on English, since English is spoken there.

* In the world of NLP your words are going to become variables

*Text mining steps

# Given data (comments on social media, tweets, sales reports, emails, blogs, Word documents; documents can be structured, semi-structured, or unstructured)

# Text preprocessing (extraction of words: tokenization, POS tagging)

# Feature extraction: extract a good subset of words (stop-word removal, stemming (same word in different forms), lemmatization (different words with the same meaning))

#Feature Selection

#Text mining methods 

# Result evaluation

* Major NLP libraries are:

NLTK, scikit-learn, TextBlob, spaCy


NLTK (Natural Language Toolkit)

* A corpus is a collection of text

* gutenberg is a group of documents (a sample corpus bundled with NLTK)

PlaintextCorpusReader - to read your own corpus files

Tokenizing refers to splitting a larger body of text into smaller pieces (sentences or words)

*word_tokenize gives the list of words

* sent_tokenize gives the list of sentences
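These two functions can be roughly sketched with plain regular expressions (a crude approximation of what NLTK's tokenizers do, on a made-up sentence):

```python
import re

text = "NLP is fun. It turns text into data!"

# Crude sentence tokenizer: split on ., ! or ? followed by whitespace.
sentences = re.split(r'(?<=[.!?])\s+', text)

# Crude word tokenizer: pull out runs of word characters and apostrophes.
words = re.findall(r"[\w']+", text)

print(sentences)  # ['NLP is fun.', 'It turns text into data!']
print(words)      # ['NLP', 'is', 'fun', 'It', 'turns', 'text', 'into', 'data']
```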

-----------------------------------------------------------------------------

Regular expressions (re, regex)

* The search function is important in re

* ^a (caret) means starts with 'a'; this is also case-sensitive

* ed$ (dollar) represents ends with; here it means find the words that end with 'ed'

* a* means 'a' occurs 0 or more times

* a+ means 'a' occurs 1 or more times

* r'<a><.*><man>' tells it: start with 'a', the <.*> in the middle means anything can occur there, and end with 'man' (the angle brackets mark whole tokens, as in NLTK's findall on a Text object)

* findall allows you to find all matches of a pattern

* The split function in regular expressions separates the text, e.g. splitting out each word

* \s - split on whitespace

\w (word character) matches any single letter, number or underscore (same as [a-zA-Z0-9_] )

* \W (non-word character) matches any single character that is not matched by \w (same as [^a-zA-Z0-9_])

"\w+[-']+\w+" - here we are saying: find all the words which contain a hyphen or an apostrophe
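The patterns above can be tried out on a few made-up words and sentences (using a single [-'] for the hyphen/apostrophe case):

```python
import re

words = ["apple", "banana", "walked", "ant", "man", "a man"]

# ^a : starts with 'a' (case-sensitive)
starts_with_a = [w for w in words if re.search(r'^a', w)]

# ed$ : ends with 'ed'
ends_with_ed = [w for w in words if re.search(r'ed$', w)]

# ^a.*man$ : starts with 'a', anything in the middle, ends with 'man'
a_to_man = re.search(r'^a.*man$', 'a man') is not None

# \s+ : split on whitespace
parts = re.split(r'\s+', "split me on spaces")

# \w+[-']\w+ : words containing a hyphen or an apostrophe
special = re.findall(r"\w+[-']\w+", "it's a well-known fact")

print(starts_with_a)  # ['apple', 'ant', 'a man']
print(ends_with_ed)   # ['walked']
print(a_to_man)       # True
print(parts)          # ['split', 'me', 'on', 'spaces']
print(special)        # ["it's", 'well-known']
```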

--------------------------------------------------------------------------------

* The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University

* PorterStemmer - it collapses different forms of the same word to a common stem (e.g. it reduces singular and plural forms to one word)

* LancasterStemmer - it also collapses word forms to a stem (more aggressively than Porter)

* Lemmatization - it groups different words with the same meaning (maps them to a dictionary lemma)

* Stemming does not do a dictionary lookup but lemmatization does

* The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary
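To see the contrast, here is a toy sketch; the suffix rules and dictionary entries are made up, not the real Porter or WordNet code:

```python
# Stemming: rule-based suffix stripping, no dictionary lookup.
def toy_stem(word):
    # Blindly strip common suffixes, like a stemmer would.
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# Lemmatization: dictionary lookup (hypothetical entries).
LEMMA_DICT = {'better': 'good', 'ran': 'run', 'geese': 'goose'}

def toy_lemmatize(word):
    # Only map a word if it is in the dictionary; otherwise keep it.
    return LEMMA_DICT.get(word, word)

print(toy_stem('playing'))      # 'play'
print(toy_stem('horses'))       # 'hors'  (stemmers can produce non-words)
print(toy_lemmatize('better'))  # 'good'  (the lemmatizer knows the meaning)
```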

* pos_tag gives the part-of-speech tag of each word

* Brown is a corpus built into NLTK

* print(text.similar('woman'))

# the output is the words that appear in similar contexts when 'woman' is talked about

# internally it looks at the parts of speech; based on the parts of speech it identifies these words

* So what natural language processing does: it not only focuses on words, it also focuses on parts of speech

* The similar function is useful in NLP, as NLP analyses the parts of speech to group the words related to that word

* A unigram tagger, as I understand it, is like an algorithm: we give it some data (the data consists of words, with the part of speech written next to each word), then we give it a new word and ask it to predict which part of speech it is

* How is the unigram tagger tagging? It goes back and, based on that one word alone, tags the new word; that is why it is called unigram

* With a unigram tagger we can check the accuracy and other metrics

* An N-gram tagger takes n words and, in their context, tells us the part of speech of the new word

*# Unigram is not very efficient and not very accurate

# N-gram gives higher accuracy

# I don't know what to say

# 1-gram is --> I, don't, know, what, to, say

# 2-gram is --> I don't, don't know, know what, what to, to say

# 3-gram is --> I don't know, don't know what, know what to, what to say

# ...

# Concept of N-gram tagging

# Assign a tag to a word based on the context of the word's occurrence with previous words

# e.g. in n-gram tagging, say 'know' appeared 1 time as a VERB and 2 times as an ADJ in its previous contexts; then it assigns it ADJ
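As a rough sketch of the unigram case (pure Python, with a made-up tagged training set instead of NLTK's UnigramTagger), the tagger just remembers each word's most frequent tag:

```python
from collections import Counter, defaultdict

# Hypothetical training data: (word, part-of-speech) pairs.
train = [('the', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB'),
         ('the', 'DET'), ('cat', 'NOUN'), ('runs', 'VERB'),
         ('run', 'NOUN'), ('run', 'VERB'), ('run', 'VERB')]

# Count how often each word carries each tag.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

# A unigram tagger keeps, for each word, its most frequent tag.
model = {w: c.most_common(1)[0][0] for w, c in counts.items()}

print(model['run'])  # 'VERB' (seen twice as VERB, once as NOUN)
print([model.get(w) for w in ['the', 'dog', 'runs', 'fly']])
# ['DET', 'NOUN', 'VERB', None]  -> unseen words get no tag
```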

* The choice of model completely depends upon the type of dataset

* The naive Bayes classifier is an algorithm used for sentiment analysis, email spam detection, categorization of documents, and language detection

* In the probability world, naive Bayes is used to find conditional probabilities

* In the naive Bayes classifier: train = fit, classify = predict

* Behind the scenes, naive Bayes builds its own table of probabilities

* The naive Bayes classifier is used when the dependent variable is categorical with two classes (positive or negative); if there is also a neutral option, then we use multinomial naive Bayes
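That probability table can be sketched from scratch; the tiny sentiment training set below is made up, and Laplace smoothing is added so unseen words don't zero out a class:

```python
import math
from collections import Counter

# Hypothetical training set for sentiment.
train = [("good great movie", "pos"), ("great fun", "pos"),
         ("bad boring movie", "neg"), ("boring plot", "neg")]

# Build the "table of probabilities": word counts per class, class counts.
word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(w for c in word_counts.values() for w in c)

def classify(text):
    # Score = log P(class) + sum of log P(word | class), with Laplace smoothing.
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("great movie"))   # 'pos'
print(classify("boring movie"))  # 'neg'
```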

* BeautifulSoup is a package for web scraping; it somewhat beautifies the HTML (makes the HTML neat and clean)

* urllib is a package for reading (fetching) the HTML

* Stop words are unimportant words that carry very little meaning; stop words are like in, an, the, they, are, etc.

* Here we use a weighted model, or term-frequency model, to build the summary of a website

* We convert the textual data into numerical data; how you convert the textual data into numerical data is what the weighted model is

* Term-frequency model - term means word, frequency is the number of times; simply counting how many times something is occurring. Now the textual data is converted into numerical data

* Basically, what we have done in text summarisation is: first we clean the data, then we calculate the weight of each word, then we take sentences of length up to 25 words, add up the weight of each word in the sentence, and show the top sentences with the highest weight; that is our summary
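Those steps can be sketched on a tiny made-up document (real use would fetch text with urllib and clean it with BeautifulSoup; the stopword list here is a tiny assumed one):

```python
import re
from collections import Counter

text = ("Text mining finds patterns in text. "
        "Patterns in text help us build models. "
        "Weather was nice today.")

stopwords = {'in', 'us', 'was', 'the', 'a'}  # tiny assumed stopword list

# 1. Clean + weight: count each non-stopword (term frequency).
words = [w for w in re.findall(r'\w+', text.lower()) if w not in stopwords]
weights = Counter(words)

# 2. Score each sentence as the sum of its word weights.
sentences = re.split(r'(?<=[.!?])\s+', text)
scores = {s: sum(weights.get(w, 0) for w in re.findall(r'\w+', s.lower()))
          for s in sentences}

# 3. The summary is the top-scoring sentence(s).
summary = max(scores, key=scores.get)
print(summary)
```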

* Word2Vec

* In tf-idf each word is reduced to a single weight; in Word2Vec we convert each word into a dense vector (here, 32 dimensions)

* Word2Vec extracts the inner meaning of sentences; relationships within the data are exactly what Word2Vec finds

sachin tendulkar is roger federer of cricket

sachin tendulkar - cricket = roger federer-tennis

* and this is what semantic meaning is; tf-idf does not find semantic meaning

* We have to scrape a huge dataset for this

* dimension is nothing but a group of feature

*PCA stand for principal component analysis

* PCA is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the original set. That sounds cool: you can reduce the features of your data while still retaining most of the needed information.

*Word2Vec looks at the data globally and learns meaningful semantic relations which it encodes into fixed-sized vectors.

* There are many use cases of Word2Vec:

* Compared with count-based methods, it performs better on different language tasks such as semantic relatedness, synonym detection, concept categorization, selectional preferences and analogy

* Discovering new chemical compounds with specific properties

* Uncovering novel relationships between diseases and disease-gene associations

* Recommendations based on a user's search history, purchase history, places-visited history, click sessions and so on

* In Word2Vec pipelines, the training data is often fed through a Python generator (using yield) that preprocesses it automatically: removing stop words, numbers, commas, apostrophes, etc.

* It has a feature for finding the most similar words, and the similarity between two different words

* You can even use Word2Vec to find the odd item in a given list of items
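The analogy idea can be shown with hand-made 3-dimensional vectors (real Word2Vec learns 32+ dimensions from a huge corpus; these numbers are invented purely for illustration):

```python
import math

# Hypothetical word vectors: dim 1 = "famous athlete", dim 2 = "cricket", dim 3 = "tennis".
vec = {
    'sachin':  [0.9, 0.8, 0.1],
    'federer': [0.9, 0.1, 0.8],
    'cricket': [0.1, 0.9, 0.0],
    'tennis':  [0.1, 0.0, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Analogy: sachin - cricket + tennis should land near federer.
query = [s - c + t for s, c, t in zip(vec['sachin'], vec['cricket'], vec['tennis'])]
best = max((w for w in vec if w != 'sachin'), key=lambda w: cosine(query, vec[w]))
print(best)  # 'federer'
```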

* Multinomial naive Bayes

* Instead of using plain naive Bayes, here we use multinomial, because plain naive Bayes is binary (0 or 1) and here we have three classes: 0, 1, 2

* Bag of words

* Bag of words is used to convert the textual data into numerical feature vectors with a fixed size.

* A vector space model is a group of vectors; when you represent them in Cartesian coordinates, that is called a vector space model

* Bag of Words (BoW) is an algorithm that counts how many times a word appears in a document.


* Disadvantages: Bag of words leads to a high-dimensional feature vector due to the large size of the vocabulary V. Bag of words doesn't leverage co-occurrence statistics between words; in other words, it assumes all words are independent of each other.
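A minimal bag-of-words encoder, written from scratch on two made-up documents, shows the fixed-size count vectors:

```python
docs = ["the dog runs", "the dog bites the dog"]

# Build a fixed-size vocabulary from all documents.
vocab = sorted(set(w for d in docs for w in d.split()))

def bow(doc):
    # One count per vocabulary term -> every document gets the same vector size.
    words = doc.split()
    return [words.count(term) for term in vocab]

print(vocab)         # ['bites', 'dog', 'runs', 'the']
print(bow(docs[0]))  # [0, 1, 1, 1]
print(bow(docs[1]))  # [1, 2, 0, 2]
```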

* Count vectorizer 

Scikit-learn's CountVectorizer is used to convert a collection of text documents to a vector of term/token counts

* The only difference is that TfidfVectorizer() returns floats while CountVectorizer() returns ints. And that's to be expected: as explained in the scikit-learn documentation, TfidfVectorizer() assigns a score while CountVectorizer() counts.


* Tf-idf

* Google search uses a tf-idf model

* tf-idf stand for term frequency - inverse document frequency

* Term-frequency model = term means words, frequency is the number of times; simply counting how many times something is occurring. Now the textual data is converted into numerical data

* So in tf-idf:

first we find the tf of the documents, then we find the idf of the documents, then we calculate tf*idf for the documents;

similarly, we find the tf of the query, then we find the idf of the query, and get its tf*idf;

then through the vector space model we can check the similarity score between all the documents and the query.

This is tf-idf.

* How do we calculate normalized tf? We count the number of times a word occurs in the document; but in a large document the raw frequency of a word is much higher, so we normalize it by dividing the term frequency by the total number of terms

* IDF is inverse document frequency. We calculate it because terms that occur too frequently (in many documents) have little discriminating power, whereas terms that occur in fewer documents have more power; idf weighs down the effect of terms that occur too frequently
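The tf and idf definitions above can be computed by hand on three made-up documents; the rarer word ends up with the higher tf-idf weight:

```python
import math

docs = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran", "fast"]]

def tf(term, doc):
    # Normalized term frequency: count / total terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: down-weights terms that appear in many docs.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'cat' appears in 2 of 3 docs, 'mat' in only 1 -> 'mat' gets more weight.
print(round(tfidf('cat', docs[0], docs), 3))  # 0.135
print(round(tfidf('mat', docs[0], docs), 3))  # 0.366
```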

**Bag of Words just creates a set of vectors containing the count of word occurrences in the document (reviews), while the TF-IDF model contains information on the more important words and the less important ones as well. Bag of Words vectors are easy to interpret.

* Problems with bag of words and tf-idf

* The semantic information of words is not stored. Even though in tf-idf we give less importance to the unimportant words, the meaning is still lost

* Overfitting is also one of the problems. What does that mean? It is like giving 70% of the data to Rishabh: he predicts well on that 70% of the data, but he will not work well on the other 30% of the data


*Feature engineering

* In feature engineering we do different things; we can also add columns. In the example of ham messages (useful messages) and spam messages (fraud messages), we add an extra column called length, as through length we can tell which message is spam and which is ham: spam messages are usually very lengthy, whereas ham messages are not too lengthy
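Adding that length column can be sketched without pandas, on two made-up messages:

```python
# Hypothetical ham/spam messages.
messages = [
    ("ok see you at 5", "ham"),
    ("WINNER!! You have been selected for a free prize, call now to claim "
     "your exclusive reward before the offer expires", "spam"),
]

# Feature engineering: each row gains a new column, the character length.
rows = [{"text": t, "label": y, "length": len(t)} for t, y in messages]

for r in rows:
    print(r["label"], r["length"])
# The spam message is far longer than the ham one, so length is a useful feature.
```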


*RECOMMENDATION SYSTEM

* There are many types of recommendation: product recommendation, movie recommendation, music recommendation, social connection recommendation

# Popularity based Recommendation System

* It is based on simple count statistics: whatever more people go for gets more popularity

* Basically it relies on purchase history data

* Cannot produce personalized results

# Correlation based Recommendation system

* Here we use a Pearson correlation to recommend an item which is most similar to the item already chosen

* Pearson's r:

r = 1   highly positively correlated

r = 0   no correlation

r = -1  highly negatively correlated

* In our project, what we have done is: we first considered the restaurant that got the most ratings, but a high rating count alone does not give us reliable data, because if a restaurant opened only yesterday, it will naturally have few ratings. So we take the mean rating and consider the restaurant with the highest mean; then we correlate that restaurant's ratings with the mean ratings of the other restaurants, and take the top 5 restaurants whose correlation is quite high. Then we look at the cuisine of the restaurant we originally selected and match that cuisine against the other highly correlated restaurants; if the cuisine matches, we recommend that restaurant and its place.
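The Pearson correlation used above can be computed by hand; the ratings below (two restaurants rated by the same five users) are made up:

```python
import math

# Hypothetical ratings for two restaurants by the same five users.
a = [5, 4, 4, 3, 5]
b = [4, 4, 5, 3, 5]

def pearson(x, y):
    # r = covariance(x, y) / (std(x) * std(y))
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

r = pearson(a, b)
print(round(r, 3))  # positive, so the two restaurants are rated similarly
```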


#Classification-based Collaborative Filtering Systems

* This is done with a simple machine learning algorithm (more precisely with logistic regression or the naive Bayes algorithm); from this you can predict whether someone is going to purchase or not, based on the given independent variables

# Content Based Recommender

* Recommends an item based on its features and how similar they are to the features of other items in the dataset

* Uses a nearest-neighbour algorithm

- It is also known as a memory-based system

- It is good for recommending an item similar to the given features

* Precision = (number of items that I liked that were also recommended to me) / (number of items that were recommended)

- This tells us how relevant the recommendations were

- If precision is 0.87, that means 87% of the recommended items are items that the user actually liked

* Recall = (number of items that I liked that were also recommended to me) / (number of items that I liked)

- This tells us how completely the recommender system predicted the items that I liked

- If recall is 0.89, that means 89% of the user's preferred items were recommended

* F1 score - the F1 score is the harmonic mean of precision and recall
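All three metrics can be computed directly from two made-up sets of items:

```python
# Hypothetical sets: items the user liked vs. items the system recommended.
liked = {"A", "B", "C", "D"}
recommended = {"B", "C", "D", "E", "F"}

hits = liked & recommended                # items both liked and recommended

precision = len(hits) / len(recommended)  # how relevant the recommendations were
recall = len(hits) / len(liked)           # how completely liked items were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision)      # 0.6
print(recall)         # 0.75
print(round(f1, 3))   # 0.667
```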

Model-based Collaborative Filtering Systems - Movie Recommendation System

* Utility matrix is a large matrix

* Factorization is breaking a large number into two numbers, e.g.

24 = 6 * 4

* SVD (singular value decomposition)

- SVD is similar to PCA; it is used for dimensionality reduction

- If the dimension is smaller, processing will be easier; that is why we are using SVD

- At a high level, SVD is reducing the big matrix into small matrices

- SVD is used to decompose the utility matrix

- In our movie recommendation we take all the action-movie lovers and put them in one category; similarly we take all the romantic-movie lovers and put them in another category. You may have hundreds of individual movies, and when you categorise them they reduce to 10 or 20; that is what SVD is doing, it is reducing the dimension

- Latent variable: latent means hidden. We use latent variables to capture the most important features; n_components = 12 means there are 12 latent variables, and these variables are going to capture the most important features

* Then we use a Pearson correlation to find the similarity between the movies.
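The factorization idea behind SVD can be sketched with a tiny rank-1 utility matrix (all numbers made up): a 3x4 matrix of ratings is rebuilt from just one user vector and one movie vector, the same way SVD compresses the utility matrix into small latent factors.

```python
# Hidden "action-lover strength" per user (a latent variable).
user = [1, 2, 3]
# Hidden "action content" per movie (another latent variable).
movie = [4, 2, 0, 1]

# The big 3x4 utility matrix is reconstructed from just 3 + 4 numbers.
matrix = [[u * m for m in movie] for u in user]

for row in matrix:
    print(row)
# [4, 2, 0, 1]
# [8, 4, 0, 2]
# [12, 6, 0, 3]
```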

