In this blog post I will describe a few of the terminologies that we will be using throughout our learning. I divide these terminologies into three parts:
General Terminologies - used in almost all machine learning algorithms
Types of Classification [short overview]
Performance Measures - mathematical formulas used to measure the performance of a machine learning model
1) General Terminologies
A) Features & Class
Features are attributes that characterize a particular class. More precisely, "a feature is an individual measurable property of a phenomenon being observed".
Examples
Let's say we want to identify cats and dogs based on features:
Here the features are Average-size, Eye-color, Tail-size, Voice and Skin-color, while "cat" and "dog" are the classes.
Sometimes the class is also referred to as the target.
[Insert figure here]
Here we have taken two observations, one for each class. Many such observations are required to train a machine to identify cats and dogs.
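As a rough illustration (the feature values below are hypothetical, since they come from the figure and not from any real data), each observation is simply a set of feature values paired with a class label:

# Hypothetical observations: one feature dictionary per animal, plus its class (target)
observations = [
    {"Average-size": 25, "Eye-color": "green", "Tail-size": 25, "Voice": "meow", "Skin-color": "grey"},
    {"Average-size": 60, "Eye-color": "brown", "Tail-size": 30, "Voice": "bark", "Skin-color": "brown"},
]
classes = ["cat", "dog"]  # class/target label for each observation above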
B) Model
A model is a compact mathematical representation of the conclusions drawn from training.
Examples
1) A trend line is the simplest example, where the relationship between two variables x and y is represented by an intercept (β) and a slope (m), i.e. y = m·x + β. These two values are chosen so that, for a particular data-set, the equation predicts points as close as possible to the actual ones.
2) For a Gaussian Bayes approximation, the model stores the standard deviation and mean of every feature for every class in the data-set.
In the example below, 0.0 and 1.0 are the classes, and the values are the compact representation of the mean and standard deviation of all features of the data-set.
{0.0: {'stddev': [5.95045670637252, 7.381656962769089, 6.375327172693769, 10.368169435393417, 6.718337695635912, 9.712648896960653, 4.850595587842532, 10.829255915816487, 6.950296458522511], 'mean': [7.396907216494846, 6.298969072164948, 6.396907216494846, 5.304123711340206, 5.402061855670103, 7.675257731958763, 5.649484536082475, 5.84020618556701, 2.716494845360825]}, 1.0: {'stddev': [2.9417041392828223, 1.0992736077481833, 1.2235673930589215, 1.0448518390406987, 1.0773665398362717, 1.8841692609247165, 1.3593450939697855, 1.4419923901764191, 0.21692609247088446], 'mean': [2.833898305084746, 1.4067796610169492, 1.5084745762711864, 1.4067796610169492, 2.1864406779661016, 1.3864406779661016, 2.2813559322033896, 1.3864406779661016, 1.064406779661017]}}
The above example will become clearer once we start the actual implementation.
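As a minimal sketch (not the exact code that produced the dictionary above), such a Gaussian Bayes model can be built by grouping the training data by class and storing the per-feature mean and standard deviation:

import numpy as np

def build_gaussian_model(X, y):
    """Store per-class mean and standard deviation of every feature."""
    model = {}
    for cls in np.unique(y):
        rows = X[y == cls]                           # all training rows of this class
        model[cls] = {
            "mean": rows.mean(axis=0).tolist(),      # one mean per feature
            "stddev": rows.std(axis=0).tolist(),     # one standard deviation per feature
        }
    return model

# Hypothetical tiny data-set: 4 samples, 3 features, classes 0.0 and 1.0
X = np.array([[7.4, 6.3, 6.4], [5.3, 5.4, 7.7], [2.8, 1.4, 1.5], [1.4, 2.2, 1.4]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(build_gaussian_model(X, y))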
C) Training-set, Test-set & Validation-set
The training-set and test-set are two subsets derived from the data-set. The training-set usually contains 70% of the data-set, and the remaining 30% is considered the test-set.
[Insert figure here]
As the names imply, the training-set is used to train the machine and obtain a model, while the test-set is used to check whether the prepared model is good or bad. Since the test-set is data the model has never seen before, if the model performs well on the test-set, the model is said to be good.
Good practices for training-set and test-set creation:
1) The training-set and test-set should be created such that all classes remain equally distributed in both sets, without bias.
2) For unbiased results, no part of the test-set should be exposed during training.
There is also another set, called the validation-set.
Here the data-set is divided into three parts: training-set (70%), test-set (20%) and validation-set (10%).
Training in machine learning can last for several days. In such a scenario one cannot wait for the final model to be prepared and only then test it; if the model turns out to be faulty, that is a total waste of time.
So a validation-set is created: during training, after each intermediate run (epoch), the intermediate model is tested on the validation-set and training is continued. The validation-set is also kept unexposed to training, so it reflects the model's performance on unknown data. A minimal splitting sketch is given below.
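As a minimal sketch, assuming scikit-learn is available, a stratified 70/20/10 split (which also satisfies the good practices above) could look like this:

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data-set: 100 samples, 5 features, binary class labels
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# First keep 70% for training; stratify keeps class proportions equal in all sets
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Split the remaining 30% into test-set (20% of total) and validation-set (10% of total)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=1/3, stratify=y_rest, random_state=42)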
D) Algorithm
Simply put, an algorithm is a set of mathematical equations that learns the inherent patterns in the data for the given classes and helps us predict the classes of unknown data.
More than 200 different algorithms exist in the machine learning domain, with more or less functional similarity. Regression, Bayesian Estimation, Self-Organizing Maps, Deep Learning and Convolutional Networks are a few to name.
The choice of algorithm for solving a particular problem is crucial. Such choices are made considering the following points (a rough comparison sketch follows the list):
1) Performance of algorithm on given problem
2) Compute cost
3) Time cost
4) Available infrastructure (GPU or Distributed)
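As a rough, hypothetical sketch of how one might compare candidate algorithms on the first three points (performance and time), assuming scikit-learn is available:

import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Hypothetical data-set: 200 samples, 4 features, binary class labels
X = np.random.rand(200, 4)
y = np.random.randint(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for algo in (LogisticRegression(max_iter=1000), GaussianNB()):
    start = time.time()
    algo.fit(X_train, y_train)            # training (compute/time cost)
    acc = algo.score(X_test, y_test)      # performance on the given problem
    print(type(algo).__name__, "accuracy:", acc, "time:", time.time() - start)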
That is all about General Terminologies. The figure below illustrates the overall flow of events to be performed while dealing with machine learning.
[Insert figure here]
2) Types of Classification
This topic will be covered in depth in another section of the blog. Here I am just providing superficial information, as it will be required for understanding the upcoming topic "Performance Measures". Based on the number of classes to be predicted, there are two types of classification: 1) Binomial and 2) Multinomial.
1) Binomial - we have two classes to deal with, just like in the cat and dog example above.
2) Multinomial - we have more than two classes to deal with, for example classifying between cat, dog and horse. This information is enough for the time being to proceed with the next topic.
3) Performance Measures
There are many metrics that can be used to measure the performance of an algorithm; different fields prefer specific measures due to their different goals.
For a binary classifier such measures are Accuracy, Precision, Sensitivity, Specificity, F1 score and the Matthews correlation coefficient.
1) Confusion matrix - the most widely used way to measure the performance of binary classification.
To understand this, let's assume we have two classes: 0 - people without cancer; 1 - people with cancer.
In the test data-set, for 20 test samples the actual classes are given as:
Actual = [1,1,0,0,1,0,1,1,0,0,0,1,1,1,0,1,0,1,1,1]
And the predicted classes are as below:
Predicted = [1,0,1,1,1,0,1,1,1,0,0,1,1,0,0,1,0,1,0,1]
Let's understand four terms, considering 1 as positive and 0 as negative:
1) True Positive (TP) - actually 1, classified as 1 in predicted [shown in green]
2) True Negative (TN) - actually 0, classified as 0 in predicted [shown in red]
3) False Positive (FP) - actually 0, classified as 1 in predicted [shown in violet]
4) False Negative (FN) - actually 1, classified as 0 in predicted [shown in orange]
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN/(FP + TN)
Precision = TP / (TP + FP)
F1 Score = 2*TP / (2*TP + FP + FN)
More model evaluation measures can be found at - https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
def createConfusionMatrix(actual, predicted, threshold):
    """
    Creates a confusion matrix for the given arrays of actual and predicted values.
    :param actual: array of actual class labels (0 or 1)
    :param predicted: array of predicted scores or labels
    :param threshold: any number between 0 and 1; scores above it are treated as class 1
    :return: None (prints the confusion-matrix counts, accuracy and F1 score)
    """
    fp = 0
    fn = 0
    tp = 0
    tn = 0
    # Convert predicted scores into hard 0/1 labels using the threshold
    for i in range(len(predicted)):
        if predicted[i] > threshold:
            predicted[i] = 1
        else:
            predicted[i] = 0
    # Count the four confusion-matrix cells
    for no in range(len(predicted)):
        if predicted[no] == 1 and actual[no] == 1:
            tp += 1
        elif predicted[no] == 0 and actual[no] == 0:
            tn += 1
        elif predicted[no] == 1 and actual[no] == 0:
            fp += 1
        elif predicted[no] == 0 and actual[no] == 1:
            fn += 1
    ACC = float(tp + tn) / float(fp + tp + tn + fn)
    F1 = float(2 * tp) / float(2 * tp + fp + fn)
    print("False Positive :", fp, ", False Negative :", fn,
          ", True Positive :", tp, ", True Negative :", tn,
          ", Accuracy :", ACC, ", F1 Score :", F1)
I have written sample code for performance measures in Python, which you may try. The same code is available in my GitHub repository.
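For example, calling the function on the Actual and Predicted arrays given above (with a 0.5 threshold, since the predictions are already hard 0/1 labels) should report TP = 9, TN = 5, FP = 3, FN = 3, Accuracy = 0.7 and F1 Score = 0.75:

actual = [1,1,0,0,1,0,1,1,0,0,0,1,1,1,0,1,0,1,1,1]
predicted = [1,0,1,1,1,0,1,1,1,0,0,1,1,0,0,1,0,1,0,1]
createConfusionMatrix(actual, predicted, 0.5)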
Just as we saw performance measures for binary classification, there are performance measures for regression and clustering algorithms too; I'll discuss those in upcoming posts.
I am summing up this post for now, but there is a lot more to this that I will keep adding as we move forward.