All About Machine Learning

Building a Performing Machine Learning Model


* data preparation > feature engineering > data modeling > performance measure > performance improvement

* This is a highly iterative process, to be repeated until your model reaches a satisfying performance

* Let's go through the steps

1. Data preparation 

* Query your data: basically you can query your data using pandas, which gives you a DataFrame containing your raw data
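As a minimal sketch (using a tiny in-memory DataFrame instead of a real dataset), querying with pandas might look like this:

```python
import pandas as pd

# Toy apartment data; in practice you would load a real file,
# e.g. pd.read_csv("apartments.csv")
df = pd.DataFrame({
    "size": [40, 55, 70, 90],
    "floor": [1, 3, 2, 5],
    "price": [100, 150, 200, 280],
})

# Query rows much like SQL's WHERE clause
large = df.query("size > 50")
print(large.shape)  # (3, 3)
```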

* Clean your data: the first step is to deal with missing values, and if a column contains too many missing values, remove that column. The second step is to remove the outliers; you can spot them by eye, or you can use a robust statistical method to detect and remove them
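A rough sketch of both cleaning steps, on hypothetical data (the 50% threshold and the quantile cut-off are illustrative choices, not fixed rules):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "size": [40, 55, np.nan, 90, 1000],   # 1000 is an obvious outlier
    "age":  [5, np.nan, np.nan, np.nan, 2],
})

# Step 1a: drop columns with too many missing values (here: more than 50%)
df = df.loc[:, df.isna().mean() <= 0.5]

# Step 1b: fill the remaining missing values, e.g. with the median
df["size"] = df["size"].fillna(df["size"].median())

# Step 2: remove outliers with a simple quantile cut-off
upper = df["size"].quantile(0.95)
df = df[df["size"] <= upper]
```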

* Format data: this is basically encoding of categorical variables; you can use label encoding or one-hot encoding

2. Feature Engineering

* A feature is an individual measurable property of a phenomenon being observed

* For example, to predict the price of an apartment, the features could be size, location, floor, elevator, and number of rooms. The number of features you use is called the dimension

* Feature engineering is the process of transforming raw data into relevant features

* Your features should be informative, discriminative, and non-redundant

* Feature engineering usually includes feature construction, feature transformation, and dimension reduction

* Feature construction means turning raw data into informative features that best represent the underlying problem and that the algorithm can understand

* In simple terms, this is where you define and name the columns (the features), choosing names that every row can relate to

* Feature construction is where you will need all your domain expertise, and it is key to the performance of your model

* Feature Transformation: basically it is adding an extra feature derived from existing ones, or transforming a feature (e.g. scaling it or taking its logarithm)

* Dimension Reduction: dimension reduction is the process of reducing the number of features used to build the model, with the goal of keeping only informative, discriminative, and non-redundant features

* There are many benefits of doing this: faster computation, less space required, and improved model performance

* There are two ways to do dimension reduction: the first is feature selection and the second is feature extraction

* Feature selection is the process of selecting the most relevant features among your existing features. To keep only relevant features, we remove those that are non-informative, non-discriminative, or redundant

* We remove features that are highly correlated with other features, and we detect them using the Pearson product-moment correlation coefficient matrix
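A sketch of correlation-based feature selection, on synthetic data where one feature (size in ft²) is a redundant copy of another (size in m²); the 0.95 threshold is an illustrative choice:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
size = rng.uniform(30, 120, 100)
df = pd.DataFrame({
    "size_m2":  size,
    "size_ft2": size * 10.764,            # redundant: perfectly correlated
    "floor":    rng.integers(0, 10, 100),
})

# Pearson correlation matrix
corr = df.corr()

# Look only at the upper triangle so each pair is checked once,
# then drop one feature of any pair correlated above 0.95
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
print(to_drop)  # ['size_ft2']
```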

* Feature extraction: the most common algorithm for feature extraction is Principal Component Analysis (PCA)

* PCA makes an orthogonal projection onto a linear space to determine new features called principal components

* Basically, at a high level, what is happening is that two correlated features are reduced into a single one
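The two-features-into-one idea can be sketched with scikit-learn's PCA on synthetic data (two highly correlated size features, as in the example above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Two highly correlated features (e.g. size in m² and in ft²) plus noise
rng = np.random.default_rng(0)
size = rng.uniform(30, 120, (200, 1))
X = np.hstack([size, size * 10.764 + rng.normal(0, 1, (200, 1))])

# Project the two features onto a single principal component
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (200, 2) -> (200, 1)
print(pca.explained_variance_ratio_)    # close to 1.0: almost no info lost
```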

3. Data Modeling

* Here we are going to train a model on your data using a learning algorithm:

data + algorithm = model

* There are basically two types of machine learning: supervised learning and unsupervised learning. When the training set contains labels (the outputs), it is called supervised learning; when the training set contains no labels, only features, it is called unsupervised learning

* Supervised learning algorithms are used to build two different kinds of models: the first is regression (to predict a continuous value),

and the second is classification, to predict a discrete value (example: predict how many stars I am going to rate a movie on Netflix: 0, 1, 2, 3, 4 or 5)
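As a tiny sketch of the two kinds of supervised models (the apartment numbers below are invented for illustration):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (price from size)
X = [[40], [55], [70], [90]]
y_price = [100, 150, 200, 280]
reg = LinearRegression().fit(X, y_price)

# Classification: predict a discrete label (expensive = 1, cheap = 0)
y_class = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[85]]))  # a large apartment is classified as expensive
```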

* There is also a distinction between parametric and non-parametric algorithms

* Parametric algorithms are those with a set of parameters of fixed size; linear regression is said to be a parametric algorithm

* Parametric algorithms have pros and cons. The pros: they are simpler, faster, and require less data to yield good performance. The cons: they are only suited for simple problems

* Non-parametric algorithms are those that do not make strong assumptions about the form of the mapping function. By not making assumptions, they are free to learn any functional form (with an unknown number of parameters) from the training data

* A decision tree is a non-parametric algorithm

* The pros of non-parametric algorithms are that their performance will likely be higher than that of parametric algorithms; the cons are that they are slower, require large amounts of data, and are prone to overfitting

* There is no such thing as the best algorithm, which is why choosing the right algorithm is one of the tricky parts of machine learning

* Hyperparameters are parameters of the algorithm; they are not to be confused with the parameters of the model. For example:

linear regression hyperparameters: fit_intercept

random forest hyperparameters: n_estimators, criterion

K-means hyperparameters: init
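In scikit-learn these hyperparameters are passed when the algorithm object is constructed, before any training; a minimal sketch:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Hyperparameters are set on the algorithm, *before* training
lin = LinearRegression(fit_intercept=True)
forest = RandomForestClassifier(n_estimators=200, criterion="gini")
km = KMeans(n_clusters=3, init="k-means++", n_init=10)

# Model parameters (e.g. lin.coef_) only exist *after* .fit() is called
```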

* Building a performing ML model is all about making the right assumptions about your data and choosing the right learning algorithm for those assumptions

Ensemble Learning

* Using multiple learning algorithms together for the same task gives us better predictions than an individual model

* What we do is take different samples of the data and give each sample to a different algorithm

* We decide the result by voting: suppose we use three models; if two models say yes and one model says no, then our result is yes

* Bagging means building different models from different subsamples of the training dataset

* Boosting means building multiple models, each of which learns to fix the prediction errors of the prior model in the chain
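The three ensemble ideas above can be sketched with scikit-learn on a synthetic classification task (the estimator choices here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (VotingClassifier, BaggingClassifier,
                              AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Voting: three different models, majority vote decides the prediction
vote = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(random_state=0)),
], voting="hard").fit(X, y)

# Bagging: the same model trained on different subsamples of the data
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X, y)

# Boosting: each model learns to fix the errors of the previous one
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
```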

4. Performance Measure

* Use your model to predict labels for your test data

like

y_pred = model.predict(X_test)

* Use some indicator to compare the predicted values with the real values

comparison = some_indicator(y_test, y_pred)

* K-fold cross-validation consists of repeating the training/validation random splitting process k times to come up with an average performance measure

* At a high level, in k-fold cross-validation we take k splits of the training dataset, check the performance on each, and at the end take the average as an unbiased performance measure
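A minimal sketch using scikit-learn's `cross_val_score`, which handles the k splits and returns one score per fold (the iris dataset and decision tree are just convenient examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train/validate 5 times, one score per fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# The average of the 5 scores is the performance measure
print(len(scores), scores.mean())
```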

* Always remember: create a dirty but complete model as quickly as possible, then iterate on it afterwards

This is the right way to go!

5. Performance Improvement 

* The reasons for underperformance are underfitting and overfitting

* Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit machine learning model is not a suitable model, and it is obvious because it has poor performance on the training data

* if your model is underfitting , it might be because you did not give it enough informative features

* Boosting is a solution to underfitting

* Boosting then combines all the weak learners into a single strong learner.

* Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data

* At a high level, underfitting means you studied too little and overfitting means you studied too much; both are harmful

* The solution to overfitting is regularization

* Regularization aims to reduce overfitting by adding a complexity term to the cost function
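As a sketch, ridge regression adds an L2 penalty term (alpha times the squared size of the weights) to plain linear regression's cost function; on noisy data with many features this shrinks the coefficients. The synthetic data here is invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Noisy data: only the first of 10 features actually matters
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] + rng.normal(0, 0.5, 30)

# Ridge's L2 complexity term shrinks the coefficients toward zero,
# making the model less able to fit the noise
plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print(np.abs(plain.coef_).sum(), ">", np.abs(ridge.coef_).sum())
```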
