Data Science: Null Hypothesis, Confusion Matrix, Sklearn, Accuracy, Predictions, Logistic Regression in Python

Our aim in this scenario is to predict outcomes using classification in machine learning and to generate statistically significant results, i.e. evidence.

Possible classifiers include Support Vector Machines (SVM), Decision Trees, Random Forests, Logistic Regression, and so on. Below, an example with Logistic Regression is worked through.

Representation of the Confusion Matrix

The horizontal axis corresponds to the predicted values (y-predicted) and the vertical axis corresponds to the actual values (y-actual).

  • Orange, True Negatives, TN: represents the values which are predicted to be false and are actually false.
  • Yellow, False Negatives, FN: represents the values which are predicted to be false but are actually true.
  • Pink, False Positives, FP: represents the values which are predicted to be true but are actually false.
  • Green, True Positives, TP: represents the values which are predicted to be true and are actually true.
Figure: A1

What are the observations? Every classifier produces both right and wrong predictions.

Wrong predictions can be further categorized into:

  • False Positive
  • False Negative

False Positive (type I error)

When we predict that something will happen and it does not (rejection of a true null hypothesis). Example: we predict that an earthquake will occur, but it does not happen.

False Negative (type II error)

When we predict that something will not happen but it does (non-rejection of a false null hypothesis). Example: we predict that there will be no tech stock crash, but there is one.

Usually, type I errors are considered less critical than type II errors.

But in fields like medicine or agriculture, either kind of error can be critical.

A 2×2 matrix denoting the right and wrong predictions helps analyse the rate of success. This matrix is termed the Confusion Matrix.
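As a minimal sketch of how these four cells can be read off in Python with sklearn's confusion_matrix (the labels below are illustrative placeholders, not data from this post):

from sklearn.metrics import confusion_matrix

# Illustrative actual and predicted labels (placeholder values)
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# For binary labels 0/1, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)
print('Accuracy:', (tp + tn) / (tp + tn + fp + fn))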

Null-Hypothesis Context

Here is some context on the null hypothesis, rejecting it, and the p-value. The p-value is a probability, so it always lies between 0 and 1: it gives the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis is true. A value close to 1 means the result is entirely consistent with chance, while values close to 0 mean that the probability of the relationship being due to chance is very low. Where the value falls between 0.01 and 0.05, i.e. 1% to 5%, the result ranges from highly significant to significant. When analysing data we look for statistical significance in our statistical model, and testing with tools such as sklearn for Python helps data scientists present evidence of significance.
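As a minimal sketch of how such a p-value can be computed in Python (here using SciPy's chi-square test of independence, which is my choice of tool since this post otherwise uses sklearn; the counts below are illustrative placeholders):

from scipy.stats import chi2_contingency

# Illustrative 2x2 table of observed counts (placeholder values)
observed = [[30, 10],
            [12, 28]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print('p-value:', p_value)
if p_value < 0.05:
    print('Statistically significant: reject the null hypothesis.')
else:
    print('Not significant: the null hypothesis is not rejected.')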

Figure A3: Interconnection – Theory to Practice

Code Example for Logistic Regression using Python

Step 1: Importing Libraries

import pandas as pd                                    # data handling
from sklearn.model_selection import train_test_split  # splitting into train and test sets
from sklearn.linear_model import LogisticRegression   # the classifier
from sklearn import metrics                           # accuracy and other evaluation metrics
import seaborn as sn                                  # heatmap of the confusion matrix
import matplotlib.pyplot as plt                       # displaying the plot

Step 2: Importing the Data into Pandas Dataframe

candidates = {'gmat': [780,750,690,710,680,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,660,640,620,660,660,680,650,670,580,590,690],
              'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
              'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
              'admitted': [1,1,0,1,0,1,0,1,1,0,0,1,1,0,1,0,0,1,0,0,1,0,0,0,0,1,1,0,1,1,0,0,1,1,1,0,0,0,0,1]
              }

df = pd.DataFrame(candidates, columns=['gmat', 'gpa', 'work_experience', 'admitted'])
print(df)

Step 3: The Heart of the Logistic Regression with sklearn

X = df[['gmat', 'gpa', 'work_experience']]  # features
y = df['admitted']                          # target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Recall that our original dataset (from step 2) had 40 observations.
# With the test size set to 0.25, the confusion matrix will display the results for 10 records (= 40 * 0.25).

logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)

confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)

print('Accuracy:', metrics.accuracy_score(y_test, y_pred))
plt.show()
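
Beyond plain accuracy, sklearn can also summarise precision, recall and F1-score per class from the same test predictions; as an optional addition reusing y_test and y_pred from above:

print(metrics.classification_report(y_test, y_pred))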

Step 4: Printing test and predicted data

print(X_test)
print(y_pred)
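
If the class probabilities behind these predictions are of interest, LogisticRegression also exposes predict_proba; as a small optional sketch reusing the fitted model from step 3:

# Each row holds the probability of class 0 (not admitted) and class 1 (admitted)
print(logistic_regression.predict_proba(X_test))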

Step 5: Adding New Data to Be Classified Based on the Trained Model

new_candidates = {'gmat': [590,740,680,610,710],
                  'gpa': [2,3.7,3.3,2.3,3],
                  'work_experience': [3,4,6,1,5]
                  }

df2 = pd.DataFrame(new_candidates, columns=['gmat', 'gpa', 'work_experience'])
y_pred = logistic_regression.predict(df2)

print(df2)
print(y_pred)
Figure: A2
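
For easier reading, the predictions can also be attached to the new candidates as an extra column; a small optional sketch (the column name admitted_prediction is my own choice, not from the original post):

df2['admitted_prediction'] = y_pred  # illustrative column name
print(df2)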

Questions or Feedback?

Feel free to email me at contact@zuberbuehler-associates.ch or write a comment below with your thoughts or insights.
