Our aim in this scenario is to predict outcomes using classification in machine learning and to generate statistically significant results, i.e. evidence.
Possible classifiers include Support Vector Machines (SVM), Decision Trees, Random Forests, Logistic Regression, and so on. Below, Logistic Regression is used as the worked example.
Representation of Confusion Matrix
The horizontal axis corresponds to the predicted values (y-predicted) and the vertical axis corresponds to the actual values (y-actual).
- Orange, True Negatives, TN: values which are predicted to be false and are actually false.
- Yellow, False Negatives, FN: values which are predicted to be false but are actually true.
- Pink, False Positives, FP: values which are predicted to be true but are actually false.
- Green, True Positives, TP: values which are predicted to be true and are actually true.
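As a minimal sketch of these four cells, the counts can be tallied directly in Python. The labels y_actual and y_pred and their values here are illustrative, not taken from the admissions example further below:

# Toy binary labels: 1 = positive, 0 = negative (illustrative values)
y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred   = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(y_actual, y_pred) if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in zip(y_actual, y_pred) if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in zip(y_actual, y_pred) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(y_actual, y_pred) if a == 1 and p == 0)  # false negatives
print(tp, tn, fp, fn)  # 3 3 1 1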

What are the observations? Every classifier produces both right and wrong predictions.
Wrong predictions fall into two categories:
- False Positive
- False Negative
False Positive (type I error)
We predict that something happens when in fact it does not (rejection of a true null hypothesis). Example: we predict that an earthquake will occur, and it does not.
False Negative (type II error)
We predict that something will not happen, but it does (non-rejection of a false null hypothesis). Example: we predict that there will be no tech stock crash, but there is one.
Type I errors are usually considered less critical than type II errors, but in fields such as medicine or agriculture either kind of error can have serious consequences.
A 2×2 matrix denoting the right and wrong predictions helps analyse the rate of success. This matrix is termed the Confusion Matrix.
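Such a 2×2 matrix can be built in a couple of ways. Here is a small sketch with illustrative labels, using pd.crosstab (the same approach as the logistic regression example below) and sklearn.metrics.confusion_matrix as an equivalent alternative:

import pandas as pd
from sklearn.metrics import confusion_matrix

# Illustrative labels, not the admissions data used below
y_actual = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name='Actual')
y_pred = pd.Series([1, 0, 0, 1, 1, 0, 1, 0], name='Predicted')

# 2x2 table of actual vs. predicted classes
print(pd.crosstab(y_actual, y_pred))

# sklearn returns the same counts as an array laid out as
# [[TN, FP], [FN, TP]] for binary labels 0/1
tn, fp, fn, tp = confusion_matrix(y_actual, y_pred).ravel()
print(tn, fp, fn, tp)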
Null-Hypothesis Context
Here is some context on the null hypothesis, rejecting it, and the p-value. The p-value always lies between 0 and 1. A value close to 1 means the observed relationship is easily explained by chance, whereas values close to 0 mean the probability of the relationship being due to chance alone is very low. Results with p-values between 0.01 and 0.05 (1% to 5%) range from highly significant to significant. When analysing data we are looking for statistical significance from our statistical model, and testing with Python tools such as sklearn helps data scientists present evidence of significance.
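Note that sklearn itself does not report p-values for logistic regression coefficients. As a hedged sketch of how to obtain them in Python, the statsmodels library exposes them via its Logit model; the data here is randomly generated purely for illustration:

import numpy as np
import statsmodels.api as sm

# Illustrative data: one noisy feature driving a binary outcome
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x + rng.normal(scale=1.0, size=100) > 0).astype(int)

# add_constant appends an intercept column; Logit fits by maximum likelihood
X = sm.add_constant(x)
result = sm.Logit(y, X).fit(disp=0)

# One p-value per coefficient; values below 0.05 are conventionally significant
print(result.pvalues)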


Code Example for Logistic Regression using Python
Step 1: Importing Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
Step 2: Importing the Data into Pandas Dataframe
import pandas as pd

candidates = {'gmat': [780,750,690,710,680,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,660,640,620,660,660,680,650,670,580,590,690],
              'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
              'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
              'admitted': [1,1,0,1,0,1,0,1,1,0,0,1,1,0,1,0,0,1,0,0,1,0,0,0,0,1,1,0,1,1,0,0,1,1,1,0,0,0,0,1]}

df = pd.DataFrame(candidates, columns=['gmat', 'gpa', 'work_experience', 'admitted'])
print(df)
Step 3: The Heart of the Logistic Regression with sklearn
X = df[['gmat', 'gpa', 'work_experience']]
y = df['admitted']

# Split the 40 observations from step 2 into training and test sets.
# With test_size=0.25, the confusion matrix below covers 10 records (40 * 0.25).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)

# Build the confusion matrix and plot it as a heatmap
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
plt.show()
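As an optional extension of step 3 (not part of the original example, and reusing the y_test and y_pred variables from it), sklearn's metrics.classification_report summarises the matrix further:

# Per-class precision, recall and F1-score from the same test predictions
print(metrics.classification_report(y_test, y_pred))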

Step 4: Printing test and predicted data
print(X_test)
print(y_pred)
Step 5: Adding New Data to Be Classified by the Trained Model
new_candidates = {'gmat': [590,740,680,610,710],
                  'gpa': [2,3.7,3.3,2.3,3],
                  'work_experience': [3,4,6,1,5]}

df2 = pd.DataFrame(new_candidates, columns=['gmat', 'gpa', 'work_experience'])
y_pred = logistic_regression.predict(df2)
print(df2)
print(y_pred)
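To inspect the probabilities behind these class labels (an optional extension, reusing the fitted model and df2 from step 5), LogisticRegression also exposes predict_proba:

# Each row holds P(admitted=0) and P(admitted=1) for one new candidate
print(logistic_regression.predict_proba(df2))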

Questions or Feedback?
Feel free to email me at contact@zuberbuehler-associates.ch or write a comment below with your thoughts or insights.