
SMOTE - Synthetic Minority Over-sampling Technique

The code described throughout is available in this GitHub repository.

For theoretical background and pseudocode, refer to: https://www.jair.org/media/953/live-953-2037-jair.pdf

 

SMOTE mitigates imbalance in a data-set. Suppose one class makes up nearly 85% of a data-set and the other nearly 15%. Such an imbalance is bad for machine learning: most algorithms learn and predict best when all classes have roughly the same number of samples. I have worked with a data-set where one class was up to 97% and the other only about 3%, so I know the headache of dealing with such data. Techniques like SMOTE balance a biased data-set by synthesizing new samples similar to those of the minor class. For this tutorial we will use the iris data-set, which has 3 classes of flower (Iris-setosa, Iris-virginica and Iris-versicolor) with 50 samples each. Our goal is to turn an unbalanced data-set into a balanced one, so I have deleted 45 samples of Iris-virginica. The new data-set has only 5 samples of Iris-virginica (the minor class).
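As a minimal sketch of how such an imbalanced data-set can be produced (assuming the standard UCI "iris.data" CSV with rows like 5.1,3.5,1.4,0.2,Iris-setosa and no header; the file names here are illustrative, not from the repository):

import csv

kept, virginica_kept = [], 0
with open('iris.data', 'r') as f:
    for row in csv.reader(f):
        if not row:
            continue  # skip blank lines
        if row[4] == 'Iris-virginica':
            if virginica_kept >= 5:   # drop 45 of the 50 Iris-virginica samples
                continue
            virginica_kept += 1
        kept.append(row)

with open('iris_imbalanced.csv', 'w', newline='') as f:
    csv.writer(f).writerows(kept)  # 105 samples: 50 + 50 + 5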

Our goals for this tutorial are:

1) To multiply the samples of Iris-virginica.

2) To visualize all samples of Iris-virginica and ascertain that the newly synthesized samples are similar to the original Iris-virginica samples.

The core idea behind the entire algorithm is very simple:

  1. Let's say we have iris data with N = 105 samples.

  2. Out of N, the class Iris-virginica has n = 5 samples, and the other two classes have 50 samples each.

  3. We need to increase the number of samples for the class Iris-virginica (the minor class).

  4. Each sample in the iris data-set has 4 attributes; one sample can be represented as the array [6.3, 3.3, 6.0, 2.5, 'Iris-virginica'].

  5. First we find the samples (nearest neighbors) that are most similar to a given minor sample. Let's say we got the following nearest neighbors: [[6.5, 3.0, 5.8, 2.2, 'Iris-virginica'], [6.3, 2.9, 5.6, 1.8, 'Iris-virginica'], [6.4, 2.8, 5.4, 1.9, 'Iris-virginica']]

  6. Out of these nearest neighbors, one sample is chosen at random; let's say we choose [6.5, 3.0, 5.8, 2.2, 'Iris-virginica'].

  7. Calculation:

     a) One minor sample (m): [6.3, 3.3, 6.0, 2.5, 'Iris-virginica']

     b) Random nearest neighbor (r): [6.5, 3.0, 5.8, 2.2, 'Iris-virginica']

     c) Difference between the two (d = r - m): [0.2, -0.3, -0.2, -0.3]

     d) Random numbers between 0 and 1 (i): [0.2, 0.8, 0.3, 0.5]

     e) Synthetic minor sample = m + d * i = [6.34, 3.06, 5.94, 2.35, 'Iris-virginica'] --> a new synthetic sample, lying between m and r
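To make the calculation concrete, here is a minimal sketch of that single interpolation step. The variable names m, r, diff and gap mirror the example above; they are illustrative, not part of the final implementation:

import random

m = [6.3, 3.3, 6.0, 2.5]  # minor sample (features only)
r = [6.5, 3.0, 5.8, 2.2]  # randomly chosen nearest neighbor

# one synthetic sample: move a random fraction of the way from m towards r
synthetic = []
for attr in range(len(m)):
    diff = r[attr] - m[attr]    # signed difference (d)
    gap = random.uniform(0, 1)  # random number between 0 and 1 (i)
    synthetic.append(m[attr] + gap * diff)

print(synthetic)  # e.g. [6.34, 3.06, 5.94, 2.35] for gaps [0.2, 0.8, 0.3, 0.5]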

Function 1. The function below performs a single task: loading data from a file.

import csv

def loadDataset(filename, numattrs):
    """
    Loads data from a file.
    :param filename: path to the CSV file
    :param numattrs: number of columns in the file, excluding the class column
    :return: data-set as a list of samples
    """
    with open(filename, 'r') as csvfile:
        dataset = [row for row in csv.reader(csvfile) if row]  # skip blank lines
    for x in range(len(dataset)):
        for y in range(numattrs):
            dataset[x][y] = float(dataset[x][y])  # convert feature columns to float
    return dataset
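For example, loading the imbalanced file created above (the file name is my assumption):

dataset = loadDataset('iris_imbalanced.csv', 4)
print(len(dataset))  # 105 samples
print(dataset[0])    # e.g. [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']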

Function 2. Another function we require calculates the Euclidean distance between the attributes of two samples:

import math

def euclideanDistance(instance1, instance2, length):
    """
    Calculates the Euclidean distance between two samples.
    :param instance1: e.g. [6.5, 3.0, 5.8, 2.2]
    :param instance2: e.g. [6.3, 2.9, 5.6, 1.8]
    :param length: number of attributes to compare, e.g. 4
    :return: Euclidean distance
    """
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

If we execute the function above with the arrays [6.5, 3.0, 5.8, 2.2] and [6.3, 2.9, 5.6, 1.8], it returns 0.5. Euclidean distance ranges from 0.0 (exactly similar) upwards, indicating how far apart two objects are.
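A quick check of that value (the expected output follows directly from the formula above):

print(euclideanDistance([6.5, 3.0, 5.8, 2.2], [6.3, 2.9, 5.6, 1.8], 4))
# sqrt(0.2**2 + 0.1**2 + 0.2**2 + 0.4**2) = sqrt(0.25) = 0.5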

Function 3. The next function is getNeighbors. It takes three inputs: 1) the entire data-set, 2) a minor sample, and 3) the number of neighbors to find. The mechanism behind this function is as follows:

  1. For a given minor sample, compute the Euclidean distance to every sample in the data-set.

  2. Sort all samples in ascending order of distance, so the most similar come first.

  3. From the beginning, take the required number of neighbors, skipping the first entry (the first entry is the minor sample itself, found in the data-set as an exact match).

import operator

def getNeighbors(trainingSet, eachMinorsample, k):
    """
    Returns the top k neighbors for a given minor sample (eachMinorsample)
    in the data-set (trainingSet).
    :param trainingSet: here the entire data-set is the training set
    :param eachMinorsample: one minor sample
    :param k: number of nearest neighbors to search for, using Euclidean distance
    :return: top k neighbors
    Example output:
    # Minor sample: [6.3, 3.3, 6.0, 2.5, 'Iris-virginica']
    # Neighbors: [[6.5, 3.0, 5.8, 2.2, 'Iris-virginica'],
    #             [6.3, 2.9, 5.6, 1.8, 'Iris-virginica'], ...]
    """
    distances = []
    length = len(eachMinorsample) - 1  # number of attributes, excluding the class column
    for x in range(len(trainingSet)):
        # Euclidean distance to every sample in the data-set
        dist = euclideanDistance(eachMinorsample, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))  # sort by distance, most similar first
    neighbors = []
    for x in range(k):
        # x + 1 skips the first entry, which is the minor sample itself
        neighbors.append(distances[x + 1][0])
    return neighbors
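A quick usage sketch, assuming dataset was loaded as above (the variable names are illustrative):

minor = [6.3, 3.3, 6.0, 2.5, 'Iris-virginica']
for neighbor in getNeighbors(dataset, minor, 3):
    print(neighbor)  # the three samples closest to the minor sample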

Function 4. seperateMinority separates all samples of the specified minor class from the data-set and returns them as an array.

def seperateMinority(dataSet, MinorClassName, classColumnNumber):
    """
    Separates the given minor class from the entire data-set.
    :param dataSet: entire data-set
    :param MinorClassName: name of the minor class, e.g. "Iris-virginica"
    :param classColumnNumber: column number where the class is present (zero-indexed)
    :return: list of all minor samples
    """
    minorSamples = []
    for eachSample in dataSet:
        if eachSample[classColumnNumber] == MinorClassName:
            minorSamples.append(eachSample)
    return minorSamples
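For our reduced iris data-set this gives back the 5 minor samples:

minorSamples = seperateMinority(dataset, 'Iris-virginica', 4)
print(len(minorSamples))  # 5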

Function 5. populate is the main function, which takes the following parameters:

  1. N: the number of synthetic samples to generate from this minor sample (the SMOTE percentage divided by 100, e.g. 2 means twice)

  2. minorSample: the minor sample being over-sampled

  3. nnarray: nearest-neighbor array, e.g. [[2.4, 2.5, 'a'], [2.3, 2.2, 'a'], [2.5, 2.5, 'a']]

  4. numattrs: the number of features; in zero-based indexing the attribute loop iterates from 0 to numattrs - 1

This function performs the actual SMOTE algorithm on the data-set. It performs the following steps:

  1. Take the minor sample and its nearest neighbors.

  2. Find the difference between the attributes of the minor sample and a randomly chosen nearest neighbor, called diff.

  3. Generate a random float between 0 and 1, called gap.

  4. New attribute of the synthetic sample = (attribute of minor sample) + gap * diff.

  5. As described above, all attributes for a particular sample are generated.

  6. Such controlled randomness is repeated N times to generate N synthetic samples.

import random
from random import randint

def populate(N, minorSample, nnarray, numattrs):
    """
    Performs the actual algorithm:
    1) take the minor sample and its nearest neighbours
    2) find the difference between the attributes of the minor sample and a
       randomly chosen nearest neighbour, as diff
    3) generate a positive random float number between 0 and 1, known as gap
    4) new attribute of the synthetic sample = (attribute of minor sample) + gap * diff
    5) as described above, all attributes for a particular sample are generated
    6) such controlled randomness is repeated N times to generate N synthetic samples
    :param N: number of synthetic samples to generate from this minor sample
    :param minorSample: the minor sample being over-sampled,
                        e.g. [6.3, 3.3, 6.0, 2.5, 'Iris-virginica']
    :param nnarray: nearest-neighbour array, e.g. [[2.4, 2.5, 'a'], [2.3, 2.2, 'a'], [2.5, 2.5, 'a']]
    :param numattrs: number of features; zero-based indexing iterates over 0 .. numattrs - 1
    """
    while N > 0:
        nn = randint(0, len(nnarray) - 1)  # choose one neighbour at random
        eachUnit = []
        for attr in range(numattrs):  # iterate over each attribute (feature)
            # difference between the nearest neighbour and the actual minor sample
            diff = float(nnarray[nn][attr]) - float(minorSample[attr])
            gap = random.uniform(0, 1)  # generate a random number between 0 and 1
            # move a random fraction of the way towards the neighbour
            eachUnit.append(float(minorSample[attr]) + gap * diff)
        for each in eachUnit:
            # syntheticData is a file handle opened in the calling script
            syntheticData.write(str(each) + ",")
        syntheticData.write("\n")
        N = N - 1

Function 6. The last function is the SMOTE function, which calls all the functions above and serves as the entry point for the work-flow. k is the number of nearest neighbors taken into consideration when generating synthetic samples.

def SMOTE(T, N, minorSamples, numattrs, dataSet, k):
    """
    Entry point for the work-flow.
    :param T: number of minority class samples (here 5)
    :param N: amount of SMOTE required, as a percentage; should be more than 100%
    :param minorSamples: all minor samples
    :param numattrs: number of features
    :param dataSet: entire data-set
    :param k: number of nearest neighbours to consider per minor sample
    """
    if N <= 100:
        raise ValueError("Number of samples to be generated should be more than 100%")
    N = int(N / 100)  # each of the T minor samples produces N synthetic samples
    for eachMinor in minorSamples:
        # nearest neighbours of this minor sample,
        # e.g. [[2.4, 2.5, 'a'], [2.3, 2.2, 'a'], [2.5, 2.5, 'a']]
        nnarray = getNeighbors(dataSet, eachMinor, k)
        populate(N, eachMinor, nnarray, numattrs)
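Putting the pieces together, a minimal driver sketch could look like the following. The file names, the global syntheticData handle, and the parameter values (N=900, k=3) are my assumptions for illustration; the repository's own script may differ. With N=900, each of the 5 minor samples yields 9 synthetic ones, bringing Iris-virginica back up to 50 samples.

import csv
import math
import operator
import random
from random import randint

# (loadDataset, euclideanDistance, getNeighbors, seperateMinority,
#  populate and SMOTE as defined above)

dataSet = loadDataset('iris_imbalanced.csv', 4)
minorSamples = seperateMinority(dataSet, 'Iris-virginica', 4)

syntheticData = open('synthetic.csv', 'w')  # global handle written to by populate
SMOTE(T=len(minorSamples), N=900, minorSamples=minorSamples,
      numattrs=4, dataSet=dataSet, k=3)
syntheticData.close()  # synthetic.csv now holds 5 * 9 = 45 new samples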

I am not done yet: I have generated synthetic data, but who knows whether it is right? If this were 2-D data I could plot it directly on a graph and visualize it, but this is 4-D data. A technique called t-SNE (t-Distributed Stochastic Neighbor Embedding) is used to visualize multi-dimensional data. If you are unfamiliar with t-SNE, go through my previous tutorial.
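The repository's visualize.py does this; as a rough sketch of the idea (using scikit-learn's TSNE and matplotlib, with the same illustrative file names as above), one could do:

import csv
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# load original and synthetic samples (feature columns only)
points, labels = [], []
for fname, label in [('iris_imbalanced.csv', 'original'), ('synthetic.csv', 'synthetic')]:
    with open(fname) as f:
        for row in csv.reader(f):
            if row and row[0]:
                points.append([float(v) for v in row[:4]])
                labels.append(label)

embedded = TSNE(n_components=2).fit_transform(points)  # 4-D -> 2-D

for label, color in [('original', 'lightgray'), ('synthetic', 'darkblue')]:
    xs = [p[0] for p, l in zip(embedded, labels) if l == label]
    ys = [p[1] for p, l in zip(embedded, labels) if l == label]
    plt.scatter(xs, ys, c=color, label=label)
plt.legend()
plt.show()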

After running the file visualize.py, I got the image below:

[Image: t-SNE visualization after applying the SMOTE algorithm on the iris data-set]

Figure 1. Samples of Iris-virginica before and after application of the SMOTE algorithm

The increase in dark blue samples clearly shows that we have successfully synthesized samples for Iris-virginica. I am currently working on making this implementation multi-threaded and will update this tutorial soon.




If you like this tutorial, please share it with your colleagues. Discuss doubts and ask for changes on GitHub. It's free; no charges for anything. Your responses inspire me to deliver even better content.
