You may refer this literature for mathematical explanation of below implemented algorithm 1) http://www.lans.ece.utexas.edu/courses/ee380l_ese/2013/mars.pdf
2) https://www.assct.com.au/media/pdfs/Ag%2022%20Everingham%20and%20Sexton.pdf
All codes discussed here can be found at my Github repository
For effective learning I suggest, you to calmly go through the explanation given below, run the same code from Github and then read mathematical explanation from above given links. Code compatibility : Python 2.7 Only Often it is found that due to changing requirement with change in package versions code fails to run. I have added file namedrequirement.txt file to GitHub, which shows all packages with version present in my system when I executed this code. This will help in replicating working environment and debugging. To get this code running execute swidish.py file as given in GitHub repository.
Figure 1. Showing year-wise Swedish fertility.
Here we will start with Part - 2 of Multivariate Adaptive Regression Splines (MARS). In this section we will apply previous code to a real data. we will see shortcomings of previous code and will improve it further.
Run this file swidish.py. You will get output, after plotting the same we will gate below given graph.
Figure 2. All Separated fragments in Swedish fertility Data
In swidish.py I have written 5 functions with following functionality
loadFromcsv - load data from csv file , Here "SwidishFertility" file to be loaded in string form
convertDataToFloat - converting all string value to float
findTrendline - will find trendline - simillar to as discussed in MARS PART - 1
findBreakPoints - will find break points - simillar to as discussed in MARS PART - 1
Now we will enhance functionality of find breakpoint function so that instead of giving individual point as a fragment it will give group of related points
Now we will implement forward run part of MARS, basically to combine two related points we can have two approaches
1) Join individual points - Forward Run'
2) Controlled breaking of larger series in to smaller once - Backward Run
Now we are going to implement forward run algorithm - Refer forwardRun.py Now onwards
I have modified findBreakPoints function a little and named it as findBreakPointsAdvance.
I have changed it a little in "if condition". Earlier we used to mark any point as a break point where there is change in direction of difference. Now in "If" we have attached one more condition, if precentage difference between a point and surrounding two points is greater than 150% or lesser than 50% , then and then breakpoint will be considered.
def findBreakPointsAdvance(difference): """ find breakpoints for allocation of different slop and b0 to each fragment :param difference: is the array which shows what is the difference between current point and the same point on trend-line :return: subFragments, also referred as breakpoints """ subFragments = []
lastPoint = 0 interMediatePoints = [] for i in range(1, len(difference) - 1): interMediatePoints.append(i)
# percentageDifferenceBetweenThreePoints - it measures, what is percentage difference between i and i-1 and i+1 point percentageDifferenceBetweenThreePoints = difference[i] * 10 / (difference[i - 1] + difference[i + 1]) / 2
# application of some more conditions in below given function will avoid formation of segments containing only one point if (difference[i - 1] > difference[i] < difference[i + 1]) and percentageDifferenceBetweenThreePoints < 50: # if i lower than i-1 and i+1 it will be defined as breakpoint # with additional condition the difference must be lesser than 50% # this will be a crest in local region \/ lastPoint = i subFragments.append(interMediatePoints) interMediatePoints = [] if (difference[i - 1] < difference[i] > difference[i + 1]) and percentageDifferenceBetweenThreePoints > 150: # if i higher than i-1 and i+1 it will be defined as breakpoint # with additional condition the difference must be lesser than 150% # this will be a trough in local region /\ lastPoint = i subFragments.append(interMediatePoints) interMediatePoints = [] # processing last fragment subFragments.append([j for j in range(lastPoint + 1, len(difference) + 1)]) return subFragments
Weather or not a particular nearby fragments mearges or not also depends on the slop. I will explain to you how. there can be two type of slops 1) Positive slop; 2) Negative slop
Figure 3. Type of slop and their significance
Coefficient represents point on y axis where trend-line is meeting y axis (shown by Intercept in Fig. 3 and slop represents direction of trend-line. To combine two nearby fragment we need to consider these two factors. If two fragment's slop and coefficient lies in definite range then only we can combine them.
findBreakPointsAdvance = takes care of boundaries related to slop
fragmentCombiner - will take care of boundaries related to coefficient.
def fragmentCombiner(breakPoints, slopAndCoefficientArray): """ combine smaller fragments to bigger one based on regression coefficient of nearby fragments
:param breakPoints: all breakpoints [[1,2,3],[4,5,6,7],...[..][..]] :param slopAndCoefficientArray: slop and regression coefficient for all fragments e.g.[[.06,6.4],[.76,3.4]] where .06 is the slop & 6.4 is the regression coefficient :return: """ newForwardArray = [] # to store fragments after merger of one or many fragments for i in range(0, len(breakPoints) - 2, 2):
slopAndCoefficientCurrent = slopAndCoefficientArray[i] # coefficient of current fragment slopAndCoefficientNext = slopAndCoefficientArray[i + 1] # coefficient of Next fragment
# percentage difference between coefficient of current and next fragment percentageDiff = slopAndCoefficientNext[1] * 100 / slopAndCoefficientCurrent[1]
""" if percentage difference between coefficient is between 20 then combine. """ if (percentageDiff < 20 or percentageDiff > -20) : temp1 = [] # if this condition is satisfied then two fragments will be merged in to one temp1.extend(breakPoints[i]) temp1.extend(breakPoints[i + 1]) # new array with less number of fragments newForwardArray.append(temp1)
return newForwardArray
You may happily run the below given function in file forwardRun.py
def withForwardRun(): """ This function implements Multivariate adaptive regression splines with forward run In short all fragment wgich can be combined with out significant loss in accuracy can will be combined as bigger fragmnets :return: """ # loading data-set # https://datamarket.com/data/set/22s2/annual-swedish-fertility-rates-1000s-1750-1849-thomas-1940#!ds=22s2&display=line # Annual Swedish fertility rates (1000's) 1750-1849 Thomas (1940)
# loading dataset array = loadFromcsv("SwidishFertility") # loded dataset in previous step is in string form, converting to float array = convertDataToFloat(array) xArray = [] yArray = []
# Seperating data in to x and y for eachXYPAir in array: x = eachXYPAir[0] y = eachXYPAir[1] xArray.append(x) yArray.append(y)
# getting trend line for the entire data m, b = findTrendline(xArray, yArray)
difference = [] # will store difference between actual value of y and Ytrend (y derieved from trendline) yArray = []
for eachXYPAir in array: x = eachXYPAir[0] y = eachXYPAir[1] Ytrend = m * x + b # finding derived value of Y called as Ytrend (y derieved from trendline) from so found m and b in previous code block yArray.append(y) difference.append(abs(y - Ytrend)) """ Breakpoints here are the point which separates group of data having similar slop, locally called fragments """ breakPoints = findBreakPointsAdvance(difference) # print breakPoints
# below given code fragment will give slop and coefficient, for each fragment. slopAndCoefficientArray = [] # m = slop and b = regression coefficient for eachFragment in breakPoints: # print eachFragment, xFragment = xArray[eachFragment[0] - 1:eachFragment[-1]] yFragment = yArray[eachFragment[0] - 1:eachFragment[-1]] m, b = findTrendline(xFragment, yFragment) slopAndCoefficientArray.append([m, b])
# print slopAndCoefficientArray # having m = slop and b = regression coefficient for each fragment
# differenceArray = fragmentCombiner(breakPoints, slopAndCoefficientArray) # print differenceArray
# Fragment combiner is a function that combines relavant fragment on preset criteria # Will explore the function in detail soon # intially the fragment number was 35, now decreased to 17, after application of fragmentCombiner function. breakPoints = fragmentCombiner(breakPoints, slopAndCoefficientArray) # print breakPoints
for eachFragment in breakPoints: # print eachFragment,
# Getting x and y for the given Fragment range xFragment = xArray[eachFragment[0] - 1:eachFragment[-1]] yFragment = yArray[eachFragment[0] - 1:eachFragment[-1]]
# finding trendline for the given Fragment range m, b = findTrendline(xFragment, yFragment) # print m,b
# Getting x and y for the given Fragment range xValueForFragment = [xArray[j - 1] for j in range(eachFragment[0], eachFragment[-1] + 1)] yValueForFragment = [yArray[j - 1] for j in range(eachFragment[0], eachFragment[-1] + 1)] yDerived = [xArray[j - 1] * m + b for j in range(eachFragment[0], eachFragment[-1] + 1)]
# printing value of y and yTrend for printing for i in range(0, len(xValueForFragment)): print yValueForFragment[i], yDerived[i] print ""
After application of all these function to our data new trend-lines looks something like this:
Figure 4. Finally, all relevant fragment got merged; better result in comparison to Figure 2.
We can see that all related fragments got merged . you may also implement backward run similarly.
For n fragments we can have n number of slopes and n number of regression coefficient overall equation predict with MARS can be represented by γ =βn + MnΧ where n is any number from 1 to number of total fragments (n)
Now if I want to predict for any point in graph, local coefficient and slop has to be taken in equation.
Hope you understood the concept well. If not ask your queries. get your hands dirty run the code and debug.