Competitive Aspect of Machine Learning
* Splitting the data is essential; it makes our model robust.
* There are two ways to split data: the traditional way is train_test_split, and the other is cross validation, which is the more sophisticated one.
* Remember, when we use k-fold cross validation we don't manually partition, fit or predict ourselves; the cross-validation helper does all of that internally.
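The two splitting approaches can be sketched like this (a minimal example on the iris dataset, assuming scikit-learn; the model and fold count are just illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Traditional way: one fixed train/test split that we fit and score ourselves.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# Cross validation: no manual partition/fit/predict --
# cross_val_score does the 5-fold splitting, fitting and scoring internally.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold mean accuracy:", scores.mean())
```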
* There are some feature selection techniques:
first is SelectKBest, second is RFE (Recursive Feature Elimination)
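A minimal sketch of both techniques (assuming scikit-learn; the dataset, scoring function and number of features to keep are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, RFE, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# SelectKBest: score each feature individually and keep the top k.
skb = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("SelectKBest scores:", skb.scores_)
print("kept features:", skb.get_support())

# RFE: repeatedly fit a model and drop the weakest feature
# until only n_features_to_select remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("RFE ranking (1 = selected):", rfe.ranking_)
```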
* Then we have something for dimensionality reduction, and that is PCA (Principal Component Analysis)
*# What happens is: suppose the data lives in 3 dimensions,
# and in 3D we struggle to tell classes a, b and c apart;
# we project it down to 2 dimensions, and then a, b and c become clear.
# So by reducing the dimensions you are achieving the separation; that is PCA.
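The idea can be shown with a small sketch (assumed example on the iris dataset, which has 4 features rather than 3, reduced to 2 principal components):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)          # 4 features -> 2 components
print(X.shape, "->", X_2d.shape)
# how much of the original variance the 2 components still capture
print("variance explained:", pca.explained_variance_ratio_.sum())
```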
* See, SelectKBest scores individual features, RFE handles multiple features at a time (eliminating them recursively), and PCA is, in a way, telling you that with fewer features you can still achieve maximum accuracy. Here comes the fourth one: ExtraTreesClassifier.
* ExtraTreesClassifier is like a random forest; here we are asking our model which features are most important.
The one with the highest importance value is the most important feature.
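A short sketch of reading feature importances from an ExtraTreesClassifier (assumed example on iris, assuming scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

data = load_iris()
X, y = data.data, data.target

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance score per feature; the highest value marks
# the most important feature. The scores sum to 1.
for name, score in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {score:.3f}")
```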
* A ninja technique to create dummy variables:
pd.get_dummies(dataset)
* You can also use Keras's to_categorical method, which is similar to one-hot encoding
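A minimal sketch of pd.get_dummies (the toy DataFrame is an assumed example, assuming pandas):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi"],
                   "temp": [30, 28, 31]})

# One 0/1 dummy column per category; numeric columns pass through untouched.
dummies = pd.get_dummies(df)
print(dummies)
```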
* If a row in your dataset has many 0s, you replace them with NaN and then drop every row that has a NaN; this is the first strategy.
* We also have another strategy, and that is to replace all the NaNs with the mean.
* Another strategy is to use SimpleImputer.
* Remember, earlier we removed every row that had a NaN, but that is not good practice because we lose data; that is why we are studying all these strategies.
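The SimpleImputer strategy can be sketched on a tiny assumed array (assuming scikit-learn):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, 6.0]])

# Replace each NaN with the mean of its column instead of dropping the row.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)   # NaN in column 0 becomes (1 + 7) / 2 = 4
```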
* We also try different algorithms to check which one gives the maximum accuracy.
* Now let's talk about scaling
* Machine learning models put higher weightage on features that have a larger scale.
* We have different scaling techniques
* MinMaxScaler (what MinMaxScaler does is take the minimum and maximum value of each feature and scale the values into that range, typically [0, 1])
* StandardScaler(# produces values with a mean of 0 and standard deviation of 1
# takes each feature value
# calculates the mean of each feature
# subtracts the mean
# then divides by the standard deviation
# then we get these values
# this is the 2nd pre-processing technique)
* Normalization (normalization, when we have a lot of 0 values or missing values)
# So what you do is try all these scaling techniques and see which one gives you the maximum accuracy.
* Binarizer (this we use rarely)
It outputs 0 if the value is at or below the threshold (0 by default) and 1 otherwise.
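All four techniques side by side on a tiny assumed array (assuming scikit-learn; note that Normalizer works row-wise, unlike the column-wise scalers):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, Binarizer

X = np.array([[1.0, -1.0],
              [2.0,  0.0],
              [3.0,  1.0]])

Xm = MinMaxScaler().fit_transform(X)        # each column squeezed into [0, 1]
Xs = StandardScaler().fit_transform(X)      # each column: mean 0, std 1
Xn = Normalizer().fit_transform(X)          # each ROW scaled to unit length
Xb = Binarizer(threshold=0.0).fit_transform(X)  # <= 0 -> 0, > 0 -> 1

print(Xm)
print(Xs.mean(axis=0), Xs.std(axis=0))
print(Xn)
print(Xb)
```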
--------------------------------------
*# The difference between a decision tree and a random forest is that in a decision tree we use only one tree, whereas in a random forest we use a group of trees, and this is called ensemble learning.
* # Here we used BaggingClassifier; a random forest is itself a bagging ensemble of decision trees.
# Bagging is like 500 judges making a decision in parallel, while boosting is like one judge making a decision, then the next judge improving on it, and so on.
# Bagging is a parallel process and boosting is a sequential process.
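The contrast can be sketched with scikit-learn's BaggingClassifier and AdaBoostClassifier (an assumed example on iris; AdaBoost stands in here as one concrete boosting method):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: many trees trained in parallel on bootstrap samples, then they vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: trees trained one after another, each focusing on the
# previous one's mistakes.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

bag_score = cross_val_score(bag, X, y, cv=5).mean()
boost_score = cross_val_score(boost, X, y, cv=5).mean()
print("bagging :", bag_score)
print("boosting:", boost_score)
```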
* Saving and loading a trained model with pickle:
import pickle
# Fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# save the model to disk
filename = 'finalized_model.sav'
pickle.dump(model, open(filename, 'wb'))
# this code saves the model into the finalized_model.sav file, which we can then send via email
# some time later...
# with this code we can open that file again
# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_test, y_test)
print(result)
* The solution for underfitting is boosting.
* The solution for overfitting is hyperparameter tuning.
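Hyperparameter tuning is usually done with a grid search; a minimal sketch (assumed example: limiting tree depth on iris to rein in overfitting, assuming scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try several depth / leaf-size combinations; shallower trees overfit less.
params = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=5)
grid.fit(X, y)

print("best params:", grid.best_params_)
print("best CV accuracy:", grid.best_score_)
```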