Here I m going to run Support Vector machine with different kernels(linear,gaussian,polynomial) and also tune the various parameters such as C ,gamma and degree to find out the best performing model .
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
# Any results you write to the current directory are saved as output.
Importing all the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Reading the comma separated values file into the dataframe
df = pd.read_csv('../input/voice.csv')
df.head()
Checking the correlation between each feature
df.corr()
Checking whether there is any null values
```python _cell_guid=”af637554-1c8d-cdf4-c76e-6429511dbcf7” df.isnull().sum()
```python
df.shape
print("Total number of labels: {}".format(df.shape[0]))
print("Number of male: {}".format(df[df.label == 'male'].shape[0]))
print("Number of female: {}".format(df[df.label == 'female'].shape[0]))
Thus we can see there are equal number of male and female labels
df.shape
There are 21 features and 3168 instances.
Separating features and labels
X=df.iloc[:, :-1]
X.head()
Converting string value to int type for labels
from sklearn.preprocessing import LabelEncoder
y=df.iloc[:,-1]
# Encode label category
# male -> 1
# female -> 0
gender_encoder = LabelEncoder()
y = gender_encoder.fit_transform(y)
y
Data Standardisation
Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance). It is useful to standardize attributes for a model. Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data.
# Scale the data to be between -1 and 1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
Splitting dataset into training set and testing set for better generalisation
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
Running SVM with default hyperparameter.
from sklearn.svm import SVC
from sklearn import metrics
svc=SVC() #Default hyperparameters
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
Default Linear kernel
svc=SVC(kernel='linear')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
Default RBF kernel
svc=SVC(kernel='rbf')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
We can conclude from above that svm by default uses rbf kernel as a parameter for kernel
Default Polynomial kernel
svc=SVC(kernel='poly')
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test,y_pred))
Polynomial kernel is performing poorly.The reason behind this maybe it is overfitting the training dataset
Performing K-fold cross validation with different kernels
CV on Linear kernel
from sklearn.cross_validation import cross_val_score
svc=SVC(kernel='linear')
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy') #cv is cross validation
print(scores)
We can see above how the accuracy score is different everytime.This shows that accuracy score depends upon how the datasets got split.
print(scores.mean())
In K-fold cross validation we generally take the mean of all the scores.
CV on rbf kernel
from sklearn.cross_validation import cross_val_score
svc=SVC(kernel='rbf')
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy') #cv is cross validation
print(scores)
print(scores.mean())
CV on Polynomial kernel
from sklearn.cross_validation import cross_val_score
svc=SVC(kernel='poly')
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy') #cv is cross validation
print(scores)
print(scores.mean())
When K-fold cross validation is done we can see different score in each iteration.This happens because when we use train_test_split method,the dataset get split in random manner into testing and training dataset.Thus it depends on how the dataset got split and which samples are training set and which samples are in testing set.
With K-fold cross validation we can see that the dataset got split into 10 equal parts thus covering all the data into training as well into testing set.This is the reason we got 10 different accuracy score.
Taking all the values of C and checking out the accuracy score with kernel as linear.
The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
Thus for a very large values we can cause overfitting of the model and for a very small value of C we can cause underfitting.Thus the value of C must be chosen in such a manner that it generalised the unseen data well
C_range=list(range(1,26))
acc_score=[]
for c in C_range:
svc = SVC(kernel='linear', C=c)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
acc_score.append(scores.mean())
print(acc_score)
import matplotlib.pyplot as plt
%matplotlib inline
C_values=list(range(1,26))
# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(C_values,acc_score)
plt.xticks(np.arange(0,27,2))
plt.xlabel('Value of C for SVC')
plt.ylabel('Cross-Validated Accuracy')
From the above plot we can see that accuracy has been close to 97% for C=1 and C=6 and then it drops around 96.8% and remains constant.
Let us look into more detail of what is the exact value of C which is giving us a good accuracy score
C_range=list(np.arange(0.1,6,0.1))
acc_score=[]
for c in C_range:
svc = SVC(kernel='linear', C=c)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
acc_score.append(scores.mean())
print(acc_score)
import matplotlib.pyplot as plt
%matplotlib inline
C_values=list(np.arange(0.1,6,0.1))
# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(C_values,acc_score)
plt.xticks(np.arange(0.0,6,0.3))
plt.xlabel('Value of C for SVC ')
plt.ylabel('Cross-Validated Accuracy')
Accuracy score is highest for C=0.1.
Taking kernel as rbf and taking different values gamma
Technically, the gamma parameter is the inverse of the standard deviation of the RBF kernel (Gaussian function), which is used as similarity measure between two points. Intuitively, a small gamma value define a Gaussian function with a large variance. In this case, two points can be considered similar even if are far from each other. In the other hand, a large gamma value means define a Gaussian function with a small variance and in this case, two points are considered similar just if they are close to each other
gamma_range=[0.0001,0.001,0.01,0.1,1,10,100]
acc_score=[]
for g in gamma_range:
svc = SVC(kernel='rbf', gamma=g)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
acc_score.append(scores.mean())
print(acc_score)
import matplotlib.pyplot as plt
%matplotlib inline
gamma_range=[0.0001,0.001,0.01,0.1,1,10,100]
# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
plt.xticks(np.arange(0.0001,100,5))
plt.ylabel('Cross-Validated Accuracy')
We can see that for gamma=10 and 100 the kernel is performing poorly.We can also see a slight dip in accuracy score when gamma is 1.Let us look into more details for the range 0.0001 to 0.1.
gamma_range=[0.0001,0.001,0.01,0.1]
acc_score=[]
for g in gamma_range:
svc = SVC(kernel='rbf', gamma=g)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
acc_score.append(scores.mean())
print(acc_score)
import matplotlib.pyplot as plt
%matplotlib inline
gamma_range=[0.0001,0.001,0.01,0.1]
# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
plt.ylabel('Cross-Validated Accuracy')
The score increases steadily and raches its peak at 0.01 and then decreases till gamma=1.Thus Gamma should be around 0.01.
Let us look into more detail for gamma values
gamma_range=[0.01,0.02,0.03,0.04,0.05]
acc_score=[]
for g in gamma_range:
svc = SVC(kernel='rbf', gamma=g)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
acc_score.append(scores.mean())
print(acc_score)
import matplotlib.pyplot as plt
%matplotlib inline
gamma_range=[0.01,0.02,0.03,0.04,0.05]
# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
plt.ylabel('Cross-Validated Accuracy')
We can see there is constant decrease in the accuracy score as gamma value increase.Thus gamma=0.01 is the best parameter.
Taking polynomial kernel with different degree
degree=[2,3,4,5,6]
acc_score=[]
for d in degree:
svc = SVC(kernel='poly', degree=d)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
acc_score.append(scores.mean())
print(acc_score)
import matplotlib.pyplot as plt
%matplotlib inline
degree=[2,3,4,5,6]
# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(degree,acc_score,color='r')
plt.xlabel('degrees for SVC ')
plt.ylabel('Cross-Validated Accuracy')
Score is high for third degree polynomial and then there is drop in the accuracy score as degree of polynomial increases.Thus increase in polynomial degree results in high complexity of the model and thus causes overfitting.
Now performing SVM by taking hyperparameter C=0.1 and kernel as linear
from sklearn.svm import SVC
svc= SVC(kernel='linear',C=0.1)
svc.fit(X_train,y_train)
y_predict=svc.predict(X_test)
accuracy_score= metrics.accuracy_score(y_test,y_predict)
print(accuracy_score)
With K-fold cross validation(where K=10)
from sklearn.cross_validation import cross_val_score
svc=SVC(kernel='linear',C=0.1)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(scores)
Taking the mean of all the scores
print(scores.mean())
The accuracy is slightly good without K-fold cross validation but it may fail to generalise the unseen data.Hence it is advisable to perform K-fold cross validation where all the data is covered so it may predict unseen data well.
Now performing SVM by taking hyperparameter gamma=0.01 and kernel as rbf
from sklearn.svm import SVC
svc= SVC(kernel='rbf',gamma=0.01)
svc.fit(X_train,y_train)
y_predict=svc.predict(X_test)
metrics.accuracy_score(y_test,y_predict)
With K-fold cross validation(where K=10)
svc=SVC(kernel='linear',gamma=0.01)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(scores)
print(scores.mean())
Now performing SVM by taking hyperparameter degree=3 and kernel as poly
from sklearn.svm import SVC
svc= SVC(kernel='poly',degree=3)
svc.fit(X_train,y_train)
y_predict=svc.predict(X_test)
accuracy_score= metrics.accuracy_score(y_test,y_predict)
print(accuracy_score)
With K-fold cross validation(where K=10)
svc=SVC(kernel='poly',degree=3)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')
print(scores)
print(scores.mean())
#Let us perform Grid search technique to find the best parameter
from sklearn.svm import SVC
svm_model= SVC()
tuned_parameters = {
'C': (np.arange(0.1,1,0.1)) , 'kernel': ['linear'],
'C': (np.arange(0.1,1,0.1)) , 'gamma': [0.01,0.02,0.03,0.04,0.05], 'kernel': ['rbf'],
'degree': [2,3,4] ,'gamma':[0.01,0.02,0.03,0.04,0.05], 'C':(np.arange(0.1,1,0.1)) , 'kernel':['poly']
}
from sklearn.grid_search import GridSearchCV
model_svm = GridSearchCV(svm_model, tuned_parameters,cv=10,scoring='accuracy')
model_svm.fit(X_train, y_train)
print(model_svm.best_score_)
#print(model_svm.grid_scores_)
print(model_svm.best_params_)
y_pred= model_svm.predict(X_test)
print(metrics.accuracy_score(y_pred,y_test))
You can find my notebook on Github: (“https://github.com/nirajvermafcb/Data-Science-with-python”)