All About Machine Learning
Building a Performant Machine Learning Model
* data preparation > feature engineering > data modeling > performance measure > performance improvement
* This is a highly iterative process: repeat it until your model reaches a satisfying performance.
* Let's walk through the steps.
1. Data preparation
* Query your data - typically with pandas, which gives you a DataFrame containing your raw data.
* Clean your data - the first step is to deal with missing values (if a column contains too many missing values, drop it); the second step is to remove outliers, either by manual inspection or with a robust statistical method.
* Format your data - this is basically encoding categorical variables; you can use label encoding or one-hot encoding.
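The cleaning and formatting steps above can be sketched in pandas. The data, column names, and the 1.5×IQR outlier rule here are illustrative assumptions, not something prescribed by these notes:

```python
import pandas as pd

# Hypothetical raw data with a missing value and an obvious outlier (900 m²)
df = pd.DataFrame({
    "size_m2": [45, 60, None, 80, 900],
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon"],
})

# 1. Deal with missing values: fill with the column median
df["size_m2"] = df["size_m2"].fillna(df["size_m2"].median())

# 2. Remove outliers with a robust rule (here: outside 1.5 * IQR)
q1, q3 = df["size_m2"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["size_m2"] >= q1 - 1.5 * iqr) & (df["size_m2"] <= q3 + 1.5 * iqr)]

# 3. Format categorical variables with one-hot encoding
df = pd.get_dummies(df, columns=["city"])
```

After these steps the outlier row is gone and `city` has been replaced by one binary column per city.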
2. Feature Engineering
* A feature is an individual measurable property of a phenomenon being observed
* For example, to predict the price of an apartment, the features could be size, location, floor, elevator, and number of rooms. The number of features you use is called the dimension.
* Feature engineering is the process of transforming raw data into relevant features
* Your features should be informative, discriminative, and non-redundant.
* Feature engineering usually includes feature construction, feature transformation, and dimension reduction.
* Feature construction means turning raw data into informative features that best represent the underlying problem and that the algorithm can understand.
* In simple terms, this is where you define and name the columns (the features), choosing names that every row can be described by.
* Feature construction is where you need all of your domain expertise, and it is key to the performance of your model.
* Feature transformation: deriving an extra feature from existing ones, or transforming an existing feature.
* Dimension reduction: the process of reducing the number of features used to build the model, with the goal of keeping only informative, discriminative, and non-redundant features.
* There are many benefits to doing this: faster computation, less storage required, and increased model performance.
* There are two ways to reduce dimensions: feature selection and feature extraction.
* Feature selection is the process of selecting the most relevant features among your existing ones. To keep only relevant features, we remove those that are non-informative, non-discriminative, or redundant.
* We remove features that are highly correlated with other features, and we detect them using a Pearson product-moment correlation coefficient matrix.
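A sketch of this correlation-based feature selection with pandas and NumPy. The features (`size_m2`, its perfectly redundant copy `size_ft2`, and `floor`) and the 0.95 threshold are made-up assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
size = rng.uniform(30, 120, 100)
df = pd.DataFrame({
    "size_m2": size,
    "size_ft2": size * 10.764,          # redundant: perfectly correlated with size_m2
    "floor": rng.integers(0, 10, 100),
})

# Pearson correlation matrix (absolute values)
corr = df.corr(method="pearson").abs()

# Look only at the upper triangle to consider each pair once,
# then drop one feature from every highly correlated pair
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```

Here `size_ft2` is dropped because its correlation with `size_m2` is 1.0, while `floor` survives.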
* Feature extraction: the most common algorithm for feature extraction is Principal Component Analysis (PCA).
* PCA makes an orthogonal projection onto a linear subspace to determine new features called principal components. At a high level, what happens is that two correlated features are reduced to a single one.
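That two-features-into-one reduction can be shown with scikit-learn's PCA on synthetic data (the data here is invented: the second feature is just twice the first plus a little noise):

```python
import numpy as np
from sklearn.decomposition import PCA

# Two strongly correlated features
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])

# Project onto a single principal component
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (100, 2) -> (100, 1)
print(pca.explained_variance_ratio_)    # close to 1.0: almost no information lost
```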
3. Data Modeling
* We train a model on the data using a learning algorithm:
data + algorithm = model
* There are two basic types of machine learning: supervised learning and unsupervised learning. When the training set contains labels (i.e. outputs), it is supervised learning; when the training set contains no labels, only features, it is unsupervised learning.
* Supervised learning algorithms are used to build two different kinds of models: regression, to predict a continuous value, and classification, to predict a discrete value (example: predict how many stars I am going to rate a movie on Netflix: 0, 1, 2, 3, 4, or 5).
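A minimal sketch of both kinds of supervised models in scikit-learn. The apartment sizes, prices, and star ratings are toy numbers invented for illustration:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Regression: predict a continuous value (e.g. apartment price from size)
X = [[30], [50], [70], [90]]
y_price = [150, 250, 350, 450]
reg = LinearRegression().fit(X, y_price)
print(reg.predict([[60]]))   # ~300: price grows linearly with size here

# Classification: predict a discrete label (e.g. a star rating)
y_stars = [1, 2, 4, 5]
clf = DecisionTreeClassifier(random_state=0).fit(X, y_stars)
print(clf.predict([[60]]))   # one of the discrete ratings seen in training
```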
* There is also a distinction between parametric and non-parametric algorithms.
* Parametric algorithms are those with a set of parameters of fixed size; linear regression is said to be a parametric algorithm.
* Parametric algorithms have pros and cons. Pros: they are simpler, faster, and require less data to yield good performance. Cons: they are suited only for simple problems.
* Non-parametric algorithms are those that do not make strong assumptions about the form of the mapping function. By not making assumptions, they are free to learn any functional form (with an unknown number of parameters) from the training data.
* A decision tree is a non-parametric algorithm.
* The pros of non-parametric algorithms are that their performance will likely be higher than that of parametric algorithms; the cons are that they are slower, require large amounts of data, and are prone to overfitting.
* There is no such thing as the best algorithm, which is why choosing the right algorithm is one of the tricky parts of machine learning.
* Hyperparameters are parameters of the algorithm; they are not to be confused with the parameters of the model. For example:
linear regression hyperparameters: fit_intercept
random forest hyperparameters: n_estimators, criterion
k-means hyperparameters: init
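In scikit-learn these hyperparameters are exactly the arguments you pass when creating the estimator, before any training happens. The specific values below are just illustrative defaults:

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

# Hyperparameters are chosen up front, at construction time
lin = LinearRegression(fit_intercept=True)
rf = RandomForestClassifier(n_estimators=100, criterion="gini")
km = KMeans(n_clusters=3, init="k-means++")

# They can be inspected (and later tuned) via get_params()
print(rf.get_params()["n_estimators"])   # 100
```

Model parameters, by contrast (e.g. `lin.coef_` for linear regression), only exist after calling `fit`.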
* Building a performant ML model is all about making the right assumptions about your data and choosing the right learning algorithm for those assumptions.
Ensemble Learning
* Using multiple learning algorithms together for the same task gives us better predictions than any individual model.
* What we do is take different samples of the data and give each one to a different algorithm.
* We decide the result by voting: suppose we use three models; if two models say yes and one model says no, then our result is yes.
* Bagging means building different models from different subsamples of the training dataset.
* Boosting means building multiple models, each of which learns to fix the prediction errors of the prior model in the chain.
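All three ensemble ideas above (voting, bagging, boosting) have ready-made scikit-learn estimators. The dataset here is synthetic and the choice of base models is an arbitrary illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Majority ("hard") voting over three different algorithms
vote = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
], voting="hard")
vote.fit(X, y)

# Bagging: the same algorithm trained on different subsamples of the data
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                        random_state=0).fit(X, y)

# Boosting: each new model focuses on the errors of the previous ones
boost = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)
```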
4. Performance Measure
* Use your model to predict the labels of your test set,
like
y_pred = model.predict(X_test)
* Use some indicator to compare the predicted values with the real values:
comparison = some_indicator(y_test, y_pred)
* K-fold cross-validation consists in repeating the training/validation random splitting process k times to come up with an average performance measure.
* At a high level, in k-fold cross-validation we take k samples of the training dataset, check the performance on each, and finally take the average as an unbiased performance measure.
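The split-k-times-and-average loop is wrapped in scikit-learn's `cross_val_score`. The iris dataset and the decision tree are stand-ins; any estimator and dataset would do:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: train/validate on 5 different splits, one score per fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# The average of the fold scores is the performance estimate
print(scores, "mean:", scores.mean())
```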
* Always remember: create a dirty but complete model as quickly as possible, then iterate on it afterwards.
This is the right way to go!
5. Performance Improvement
* The reasons for underperformance are underfitting and overfitting.
* Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit machine learning model is not a suitable model, and this will be obvious from its poor performance on the training data.
* If your model is underfitting, it might be because you did not give it enough informative features.
* Boosting is a solution to underfitting.
* Boosting combines many weak learners into a single strong learner.
* Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new data.
* At a high level, underfitting means you studied far too little and overfitting means you studied far too much (memorized everything); both are harmful.
* The solution to overfitting is regularization.
* Regularization aims to reduce overfitting by adding a complexity term to the cost function.
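A concrete instance of this is ridge regression, which adds an L2 complexity term (alpha times the squared norm of the weights) to the least-squares cost. The synthetic data and the alpha value are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Small, noisy dataset: 20 samples, 10 features, only the first one matters
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
y = X[:, 0] + rng.normal(scale=0.5, size=20)

# Plain least squares vs. L2-regularized (ridge) regression
plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# The penalty shrinks the coefficients toward zero, taming overfitting
print("plain:", np.abs(plain.coef_).sum())
print("ridge:", np.abs(ridge.coef_).sum())
```

Increasing `alpha` strengthens the penalty: larger values shrink the weights more, trading a little training error for better generalization.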