ML-2

*With textual data we cannot perform mathematical operations, which is why we convert the textual data into dummy variables
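As a quick sketch of this conversion with pandas (the column names and values here are invented for illustration):

```python
import pandas as pd

# A hypothetical toy dataset with one textual (categorical) column.
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
                   "sales": [100, 150, 120, 90]})

# Convert the textual column into 0/1 dummy variables.
# drop_first=True keeps k-1 dummies for k categories to avoid redundancy.
dummies = pd.get_dummies(df, columns=["city"], drop_first=True)
print(dummies.columns.tolist())
```

With 3 city categories and `drop_first=True`, only 2 dummy columns are created alongside `sales`.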

*Visualisation gives you correlation, but remember: correlation does not mean causation.
Causation is an impact or effect

* We are learning machine learning because machine learning gives us causation; a coefficient is actually a causation

* Causation is actually a deeper level of correlation

* The difference between the actual and the predicted value is called the error

* In statsmodels, the OLS (ordinary least squares) class gives us a method called summary; with summary we can do statistical analysis of the dataset

* The significance level is usually taken as 0.05

* If the p-value for a certain column is greater than 0.05, we remove that column, because that column is garbage, i.e. an unimportant variable

*R² tells us how close the points are to the line. But as the number of variables increases, R² also increases, even though (as we know) profit does not depend on a phone number (an unimportant variable). R² cannot tell which variables are important and which are not, which is why adjusted R² comes into the picture: we use adjusted R² to judge which variables matter. The value of adjusted R² increases only when an added variable is actually important


2. Logistic Regression

*Don’t get confused by its name! It is a classification algorithm, not a regression one. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s).

*In logistic regression we apply the sigmoid function to the best fit line, and the best fit line is converted into a sigmoid curve

*The sigmoid curve is the technical mantra of logistic regression: you give data to the algorithm and the algorithm converts it into a sigmoid curve
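A tiny sketch of the sigmoid function itself:

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# The straight line w*x + b becomes an S-shaped probability curve.
print(sigmoid(0))    # 0.5 -> the decision boundary
print(sigmoid(4))    # close to 1
print(sigmoid(-4))   # close to 0
```

Anything the line outputs, however large or small, ends up as a probability between 0 and 1.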

*Independent variable can be numeric or categorical

*You can convert a Pandas DataFrame to a NumPy array to perform some high-level mathematical functions supported by the NumPy package.
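For example (the values are invented):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
arr = df.to_numpy()        # DataFrame -> NumPy array
print(arr.mean())          # NumPy math over the whole array
```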

*Friendly mantra of logistic regression is yes or no

------------------------------------------------------------------
STEPS FOR LOGISTIC REGRESSION

* Step 1

Partitioning the data
  > take random samples here
> not strictly necessary, in my view

* Step 2

Univariate analysis

>Univariate analysis is analysing one variable at a time
> How can we analyse one variable at a time? By finding the descriptive statistics of the variable: minimum, maximum, mean, median, mode, standard deviation, and percentile values
>Purpose of doing univariate analysis

... Identifying outliers (an outlier is an observation point that is distant from the other observations) and then capping the outliers to normalise them

...Finding missing values: for a categorical variable a missing value is replaced with the category with the highest frequency, and for a numerical variable it can be replaced with the median or mean
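A small pandas sketch of these univariate steps (the numbers are made up; 100 plays the outlier and the cap at the 95th percentile is an arbitrary choice):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, 3, 4, 5, 100, np.nan])  # 100 is an outlier, NaN is missing

print(s.describe())                  # min, max, mean, std, percentiles

# Cap outliers at the 95th percentile.
cap = s.quantile(0.95)
s_capped = s.clip(upper=cap)

# Replace missing numeric values with the median.
s_clean = s_capped.fillna(s_capped.median())
print(s_clean.isna().sum())          # no missing values remain
```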

*Step 3

Bivariate analysis

> Bivariate analysis is analysing two variables at a time (that is, an independent variable and the dependent variable)

>It indicates the impact of the independent variable on the dependent variable

>Purpose of bivariate analysis

... Missing value treatment, like we saw in univariate analysis

....Variable reduction - eliminate the independent variables that have no relationship with the dependent variable
The null hypothesis means there is no difference.
So in the bank loan data, whether you have one dependant or two dependants, the percentage of default is the same, which means there is no difference. No difference means you are accepting the null hypothesis (in the real world they say "don't reject the null hypothesis"), and this is indirectly the scenario of a p-value greater than 0.05
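The dependants-vs-default idea can be sketched with a chi-square test of independence from SciPy (the counts below are invented so that the default rate is identical in both groups):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical cross-tab: rows = number of dependants (1 vs 2),
# columns = loan status (no default, default).  The default rate
# is 10% in both rows, so there is "no difference".
table = np.array([[180, 20],
                  [ 90, 10]])

chi2, p, dof, expected = chi2_contingency(table)
print(round(p, 3))
# p > 0.05 -> don't reject the null hypothesis: the variable adds
# nothing, so it is a candidate for removal.
```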

.... Category grouping for dummy variable creation

    #if we look at the purposes for which credit is taken, there are 10 values, so we would have to make 9 dummy variables, which becomes very complex
#suppose there were a thousand dummy variables - then what would we do?
#so basically what we do is group these values by their percentage and divide them into, say, 4 buckets, so that only 3 dummy variables are needed

#Numerical variables can also be converted into dummy variables to improve performance

#Here we are not converting into dummy variables with the help of LabelEncoder and OneHotEncoder; here we are converting them into dummy variables in a bolder, manual way
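A sketch of such category grouping (the purpose values and the group assignments are invented):

```python
import pandas as pd

# Hypothetical: 'purpose' has many categories; we bucket them into a
# few groups so that far fewer dummies are needed.
purpose_to_group = {"car": "G1", "tv": "G1", "education": "G2",
                    "business": "G2", "repairs": "G3", "vacation": "G3",
                    "furniture": "G4", "other": "G4"}

df = pd.DataFrame({"purpose": ["car", "education", "vacation", "other"]})
df["purpose_group"] = df["purpose"].map(purpose_to_group)

# 4 groups -> only 3 dummy columns instead of (number of categories - 1).
dummies = pd.get_dummies(df["purpose_group"], drop_first=True)
print(dummies.shape[1])
```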

* Step 4

> Check for multicollinearity
#when two independent variables are highly correlated, that is multicollinearity, and multicollinearity is a problem: it does not give you a good R2, so the prediction will not be good
#multicollinearity occurs when a variable is derived from other variables in the dataset
#so we should remove the multicollinearity: all variables with high multicollinearity (VIF > 1.5 here; cutoffs of 5 or 10 are also commonly used) need to be removed from the model

#we mostly check for multicollinearity when there are hundreds of variables

*Step 5

Model Building

 #logistic regression works on the maximum-likelihood estimation (MLE) algorithm; MLE is based on probability, meaning that internally logistic regression gives us a probability

#tips
print(model_data['money'].value_counts(), '\n')
#to check the value counts of any column of model_data, here the column named money

--------------------------------------------------------------------------------
* Confusion Matrix

A confusion matrix is a table which gives us the true values vs the predicted values

True       0.0  1.0
Predicted          
0.0        136   32
1.0         32   54


What does it mean?

Suppose we have a diabetes dataset (0 means no diabetes, 1 means diabetes), so here we are comparing true values vs predicted values. Our model predicted that 136 patients have no diabetes, and it is actually true that those 136 patients have no diabetes (true negatives). But our model also predicted that 32 patients have no diabetes when those 32 patients actually have diabetes (false negatives). Similarly, our model predicted that 32 patients have diabetes when those 32 patients actually have no diabetes (false positives). Finally, our model predicted that 54 patients have diabetes and they actually do have diabetes (true positives).


confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
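The same table can be reproduced with scikit-learn (the labels below are constructed to match the diabetes counts above):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs predicted labels (1 = has diabetes).
y_true = [0] * 136 + [1] * 32 + [0] * 32 + [1] * 54
y_pred = [0] * 136 + [0] * 32 + [1] * 32 + [1] * 54

cm = confusion_matrix(y_true, y_pred)
print(cm)
# rows = true class, columns = predicted class:
# [[136  32]    TN=136, FP=32
#  [ 32  54]]   FN=32,  TP=54
```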

* We do standard scaling to bring the features onto the same scale
from sklearn.preprocessing import StandardScaler
sc = StandardScaler() 
#we give scaled data to our model
X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)
#after scaling, most values fall roughly in the range -2 to +2, so all the columns are on the same scale
---------------------------------------------------------
3. Decision Tree

* A decision tree is used for both classification and regression problems
* It can extract value even from little data
* It has some disadvantages too: sometimes the information or results are biased
* Based on a group of questions we make a group of decisions, and these questions form a tree; the only thing is that it is an inverted tree, with the root at the top
* So a group of questions and a group of decisions is the friendly mantra of the decision tree
* Here we have to convert the data into a decision tree

* We need to put the best question at the top, so our decision tree does not go hundreds of levels deep
* High entropy means messy data; low entropy means clean data
* The very top of the tree is called the root node, or just the root
* The blue ones are called internal nodes
* The green nodes are leaf nodes
* The root node is decided by the Gini index
* As none of the leaf nodes are 100% yes or 100% no, they are all considered impure. To determine which separation is best, we need a way to measure and compare impurity; there are a bunch of ways to measure impurity, but I am going to focus on a very popular one called Gini
* The feature with the lower Gini impurity score is the best and is used as the root node
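A minimal sketch of the Gini impurity calculation for a single node:

```python
def gini_impurity(labels):
    """Gini impurity of one node: 1 - sum of squared class proportions."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity(["yes"] * 10))              # 0.0 -> a pure leaf
print(gini_impurity(["yes"] * 5 + ["no"] * 5))  # 0.5 -> maximally impure
```

Candidate splits are compared by the weighted Gini impurity of their child nodes, and the lowest-impurity split wins.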
* The two popular algorithms for decision trees are CART (Classification and Regression Trees) and CHAID (Chi-square Automatic Interaction Detection)
* CART uses the Gini index whereas CHAID uses the chi-square statistic; CART always splits the data into only two nodes, whereas CHAID can split the data into more than two nodes
* CART splits the data as much as possible and then the tree is pruned using validation data to minimise classification error, whereas CHAID stops growing the tree when no further gain can be made in differentiating the segments
* CART example - classify iris flower species into three categories: setosa, virginica and versicolor (I think this is perhaps meant for the multi-class case)
* CHAID example - classify the customers who will respond to a marketing campaign (and this is perhaps the binary type)
* Minimum information and maximum knowledge is the friendly mantra of the decision tree
* Entropy and the Gini index are the technical mantras of the decision tree
* Sometimes what happens is that a decision tree goes hundreds of levels deep, in which case processing increases. A better decision tree, instead of going a hundred levels deep, goes only two or three levels deep and can still come back and say whether someone is going to purchase or not
* A better decision tree is one which has minimum information and maximum knowledge.
* The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and without the disease.
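A quick scikit-learn sketch (the labels and predicted probabilities are toy values):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_prob)
print(auc)
# 0.75: of the 4 positive/negative pairs, the positive example gets
# the higher probability in 3 of them.
```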
----------------------------------------------------------------------------------
4. RANDOM FOREST

* Random Forest is the king of machine learning
* Suppose there is a match between India and Australia and I ask Vidhya who is going to win. Vidhya, being Indian, will say that India is going to win even though Australia is a tough team, so her prediction is biased. But if we ask a group of people, the bias reduces: the wisdom of the crowd is better than the wisdom of an individual. So if you ask random people, bias reduces, and that is the friendly mantra of Random Forest
*A forest is a group of trees, and here it is a group of decision trees
* In Random Forest, we have a collection of decision trees (hence "Forest"). To classify a new object based on attributes, each tree gives a classification and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

* So how does it work? When you give data to it, it takes different samples of the data, and from each sample it builds a decision tree - for example, 500 decision trees
* Not only does it take separate rows, it also takes separate columns in each sample
* As the trees are based on a random selection of data as well as variables, they are random trees
* Each decision tree gives a yes or no, and whichever answer is in the majority is our answer: if yes is the majority the answer is yes, and if no is the majority the answer is no
* Some features of random forest: it runs efficiently on large databases, it can handle thousands of input variables, and it maintains accuracy even when some data is missing. It has so many features that many people prefer it, and for the same reason it is called the king of machine learning
* Random Forest can be used for both regression and classification, and when the number of variables is very high it works very well.
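A sketch with scikit-learn on the built-in iris data, using 500 trees as in the note above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 500 randomly-built trees vote; the majority class wins.
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)
acc = rf.score(X_test, y_test)
print(acc)
```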
---------------------------------------------------------------------------------
5. K-means Clustering

* It is an unsupervised machine learning algorithm
* Clusters are groups
* Clustering uses an algorithm for organising data into clusters
* The friendly mantra of clustering is that it separates the cats and the dogs
* Clustering finds the hidden patterns in data
* Within a cluster, similarity is high
* Between clusters, similarity is low
* Basically it is mathematics that finds the groups
* At the mathematical level it is based on the principle of vectors
* How do we find that a cat and a dog are dissimilar? We find out through correlation and distance
* In k-means clustering, random points are selected; these random points are called centroids. From each centroid the distance to every point is calculated, and all the points that are closest to a particular centroid form a group.
* For each new group we calculate the mean of the group, which gives new x and y values, i.e. a new point, so we say the centroid has moved from its old position to the new point (x, y)
* Since there are many groups we get many centroids, so we repeat the process: calculate the distances to all the centroids, the points that are closest form groups again, and so new groups are formed
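The centroid-moving loop above is exactly what scikit-learn's KMeans runs; a sketch on two invented blobs of points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of points.
pts = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
                [8, 8], [8.1, 7.9], [7.8, 8.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
print(km.labels_)           # the two blobs get two different labels
print(km.cluster_centers_)  # centroids end up near (1, 1) and (8, 8)
```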
* Grouping, similarity and dissimilarity are the second, third and fourth mantras of clustering respectively
* Characteristic is the fifth mantra of clustering
-------------------------------------------------------------
6. Association Rule Mining

* Association rule mining is a type of unsupervised learning
* In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases.
* The rules found in the sales data of a supermarket might indicate that if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat.
* Such information can be used as the basis for decisions about marketing activities, such as promotional pricing or product placements.
* Confidence is interpreted as: how often the items in B appear in transactions that contain A. The third measure, called the lift or lift ratio, is the ratio of confidence to expected confidence, where expected confidence is simply the frequency (support) of B.
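The support/confidence/lift arithmetic can be sketched by hand on a few invented transactions:

```python
# Hypothetical supermarket transactions.
transactions = [
    {"onion", "potato", "burger"},
    {"onion", "potato", "burger"},
    {"onion", "potato"},
    {"milk", "bread"},
    {"milk", "burger"},
]

def support(items):
    """Fraction of transactions containing all the given items."""
    return sum(items <= t for t in transactions) / len(transactions)

# Rule: {onion, potato} -> {burger}
conf = support({"onion", "potato", "burger"}) / support({"onion", "potato"})
lift = conf / support({"burger"})   # confidence / expected confidence
print(conf, lift)
# conf = 2/3: two of the three onion+potato baskets also contain burger.
# lift > 1: buying onion+potato makes burger more likely than its base rate.
```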
----------------------------------------------------------
7. Support Vector Machine

* SVM is very good but it needs a lot of computing power; it was not used 10 years back precisely because it needs a lot of computing power
* It is the latest and greatest
* It gives us high accuracy, just like random forest
* In linear regression our mantra is that the points should be close to the line, but here it is the opposite: the points should be as far away as possible
* We need to find a line which is as far away from the points as possible, and that distance is called the maximum margin
* The closest points on either side of the line are called support vectors
* The line which is far away is the first mantra
* Apples which look like oranges and oranges which look like apples is the second mantra
* We cannot separate non-linearly-separable data - we cannot draw a line through it - so we bring in a more sophisticated concept called projecting into a higher dimension, and that is the third mantra
* If you have data in one dimension and project it into another dimension, that is called projecting into a higher dimension; by doing this you can achieve the separation
* The mathematics of SVM is that it first applies a function, then applies a function again, to achieve the separating line
* For example, we convert 2D data into 3D to achieve separation, and after achieving the separation we project back into 2D space
* Projecting from 2D to 3D or 3D to 4D is fine, but projecting into 20 dimensions involves a lot of computation; that is why people were not using it.
* That is why another function comes into action, called the Gaussian RBF kernel
* What the kernel function does is this: when you have high-dimensional data, it is able to achieve the separation without really projecting the data into multiple dimensions
* The result is that the computing power required is reduced, so it runs fast and does not take too much time
* What the kernel trick is doing is achieving the separation using a Gaussian curve without really projecting into a higher dimension, and what we do is dictate the gamma
* SVM is mostly used for classification problems
*# We can do hyperparameter tuning in any algorithm
# in a decision tree we adjust the depth; that is hyperparameter tuning in a decision tree
# in random forest we adjust the number of trees; that is hyperparameter tuning in random forest
# in SVM you adjust the cost of error (C) and gamma; that is hyperparameter tuning in SVM
# you don't need to project into a higher dimension; you just need to adjust C and gamma and check where accuracy is highest, and that is hyperparameter tuning
# in the real world they will say: can you tune this model?
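A sketch of tuning C and gamma with scikit-learn's GridSearchCV on the built-in iris data (the grid values are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try several values of C (cost of error) and gamma and keep the
# combination with the best cross-validated accuracy.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```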

* We can use different functions and check the accuracy, and the function which gives the highest accuracy is the one we take (call it a function or a kernel, it is the same thing)
* How do you define a true positive in a confusion matrix? Positive means the case is actually positive, which is 1, and true means we also predicted it as positive, which means our prediction is true - that is a true positive
*Binary classification means 2 classes; if you have more than two, it is multinomial classification
* In binary classification we can measure the performance of the model in three ways: accuracy, sensitivity and specificity. Sensitivity is the true positive rate, whereas specificity is the true negative rate
sensitivity = TP/(TP+FN)
specificity = TN/(TN+FP)
*We should not only try to get high accuracy; we should also try to get high sensitivity and specificity
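A quick sketch of all three measures (the counts are borrowed from the earlier hypothetical diabetes confusion matrix):

```python
# Counts from a hypothetical confusion matrix.
TP, TN, FP, FN = 54, 136, 32, 32

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # true positive rate
specificity = TN / (TN + FP)   # true negative rate

print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
```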
* You answered the 70 questions you practised well but could not answer the remaining 30 new questions well; that is called overfitting
* You do not perform well on the 70 questions, and only by luck do you perform well on the 30; that is underfitting
* Cross-validation is repeated partitioning: you do multiple partitionings. If you do a single partitioning the data may be biased (that is called bad training data), so if you do cross-validation your bias comes down
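A sketch with scikit-learn's cross_val_score (5-fold, built-in iris data):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5 different train/test partitions instead of a single one.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```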
* Earlier we measured the performance of a model with three parameters - accuracy, sensitivity and specificity; the fourth is the confusion matrix and the fifth is ROC-AUC

* When a model has high accuracy but still cannot be used (for example because the classes are imbalanced), that is the accuracy paradox
