Multivariate Adaptive Regression Splines (MARS) algorithm using py-earth

Sunil Patel
Apr 8, 2017

The code and data discussed in this blog post are available at my GitHub repository.

Py-earth is a Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines (MARS) algorithm. In the previous parts [1,2] of this tutorial we saw how MARS actually works by implementing it ourselves.

Step 1. Installation (for Windows, Linux and Mac)

git clone git://github.com/scikit-learn-contrib/py-earth.git
cd py-earth
python setup.py install

After installing this package, I had to restart my IDE (PyCharm). Then it worked seamlessly.

Running these commands also installs some additional packages. If you run into a package conflict, refer to requirements.txt, which lists every package (with its version) that was on my system when I ran the code for this tutorial.

To install a specific version of any Python package, you can use any one of the following commands:

pip install package==version
python -m pip install package==version
pip install package==version --user

Note:

Refrain from installing Python packages with sudo, as in

sudo pip install package==version

Instead, install into your user space only:

pip install package==version --user
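If you suspect a version conflict, a small script can compare what is installed against the pins in requirements.txt. The following is only a sketch: it assumes every relevant line has the plain form name==version, and the helper names parse_pins and find_mismatches are made up for this illustration (they are not part of pip or py-earth).

```python
from importlib import metadata


def parse_pins(lines):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (wanted, installed or None)} for pins that differ."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Usage (assumes requirements.txt is in the current directory):
# with open("requirements.txt") as handle:
#     print(find_mismatches(parse_pins(handle)))
```

Any package reported here can then be reinstalled at the pinned version with one of the pip commands above.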

Step 2. Usage

We will apply this package to the same data (swidishFertility.txt) that we used in MARS PART - 2.

Below is the basic code to run py-earth on the Swedish fertility data and visualize the result using pyplot.

In addition to the script below, we use two functions: 1) loadFromcsv, to load data from a file; 2) convertDataToFloat, to convert the loaded data from strings to floats.

import numpy
import matplotlib.pyplot as plt
from pyearth import Earth

dataset = loadFromcsv("SwidishFertility.txt")  # loading dataset
dataset = convertDataToFloat(dataset)  # converting string to float

xArray = []
yArray = []

# separating X and Y from the dataset
for eachXYPAir in dataset:
    x = eachXYPAir[0]
    y = eachXYPAir[1]
    xArray.append(x)
    yArray.append(y)

# print(len(xArray))
xArray = numpy.asarray(xArray, "float32")  # converting to numpy array
# print(len(yArray))
yArray = numpy.asarray(yArray, "float32")  # converting to numpy array

# Fit an Earth model
model = Earth(max_degree=1, verbose=True)  # initializing py-earth package

# making model for the data
model.fit(xArray, yArray)

# Print the model
print(model.trace())
print(model.summary())

# Plot the model
y_hat = model.predict(xArray)
plt.figure()
plt.plot(xArray, yArray, 'r.')
plt.plot(xArray, y_hat, 'b.')
plt.show()
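The two helper functions, loadFromcsv and convertDataToFloat, are not listed in this post. Here is a minimal sketch of what they might look like. It assumes the data file holds one x,y pair per row separated by commas; the delimiter argument is my own addition so other separators can be tried, and the versions in the repository may differ.

```python
import csv


def loadFromcsv(fileName, delimiter=","):
    """Read a delimited text file and return its rows as lists of strings."""
    with open(fileName, newline="") as handle:
        # Skip completely empty lines so they do not end up as empty rows.
        return [row for row in csv.reader(handle, delimiter=delimiter) if row]


def convertDataToFloat(dataset):
    """Convert every field of every row from string to float."""
    return [[float(value) for value in row] for row in dataset]
```

With these two functions defined above it, the script needs no other helpers.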

After running the code, we get the following plot:

Figure 1. A hyper-plane made up of 3 fragments, for effective regression on non-linear data.

This is the same data we worked with in MARS PART - 2. Looking at the plot, you may wonder why there are only 3 fragments. It all depends on the parameters applied while building the model. If you widen the threshold, as we discussed in MARS PART - 2, you get fewer fragments and more generalization. A good model is the product of a careful trade-off between generalization and error.
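To see where the fragments come from, recall that a MARS model with max_degree=1 is a weighted sum of hinge functions such as max(0, x - knot), and every knot that survives pruning adds one bend to the fitted line. The sketch below uses made-up knots (2.0 and 5.0) and made-up coefficients, not the values fitted on the Swedish fertility data, to show how two knots give three fragments with different slopes.

```python
def hinge(x, knot):
    """Hinge basis function max(0, x - knot)."""
    return max(0.0, x - knot)


def piecewise_model(x):
    """Hypothetical MARS-style model with two knots, at 2.0 and 5.0."""
    return 1.0 + 0.5 * x - 1.5 * hinge(x, 2.0) + 2.0 * hinge(x, 5.0)


def slope(x, eps=1e-6):
    """Numerical slope of the model just to the right of x."""
    return (piecewise_model(x + eps) - piecewise_model(x)) / eps


# One sample point inside each of the three fragments.
for point in (1.0, 3.0, 6.0):
    print(point, round(slope(point), 3))
```

The slope is 0.5 before the first knot, -1.0 between the knots and 1.0 after the second knot. Fewer surviving knots mean fewer fragments.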

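You may provide more settings to fine-tune your model. The following parameters are accepted by Earth(...).

The usage of each parameter is described below (taken from the py-earth documentation; the bracketed citation numbers inside these descriptions follow that documentation's reference list):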

max_terms : int, optional (default=min(2 * n + m // 10, 400), where n is the number of features and m is the number of rows)

The maximum number of terms generated by the forward pass. All memory is allocated at the beginning of the forward pass, so setting max_terms to a very high number on a system with insufficient memory may cause a MemoryError at the start of the forward pass.
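The default for max_terms can be worked out by hand from the formula above. A small sketch follows; the function name default_max_terms and the example sizes are mine, chosen only for illustration.

```python
def default_max_terms(n_features, n_rows):
    """Default max_terms as documented: min(2 * n + m // 10, 400)."""
    return min(2 * n_features + n_rows // 10, 400)


# A tiny single-feature dataset allows only a handful of terms ...
print(default_max_terms(n_features=1, n_rows=35))
# ... while a large dataset is capped at 400.
print(default_max_terms(n_features=10, n_rows=10000))
```

For one feature and 35 rows this gives 2 + 3 = 5 terms; for 10 features and 10000 rows the cap of 400 applies.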

max_degree : int, optional (default=1)

The maximum degree of terms generated by the forward pass.

allow_missing : boolean, optional (default=False)

If True, use the missing data method described in [3]. Use the missing argument to determine missingness or, if X is a pandas DataFrame, infer missingness from X.

penalty : float, optional (default=3.0)

A smoothing parameter used to calculate GCV and GRSQ. Used during the pruning pass and to determine whether to add a hinge or linear basis function during the forward pass. See the d parameter in equation 32, Friedman, 1991.
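To get a feel for what penalty does, here is a sketch of one commonly used form of the GCV score, in which each knot is charged penalty extra parameters. Treat the exact formula as an assumption on my part: the effective parameter count n_terms + penalty * (n_terms - 1) / 2 is the form I know from MARS implementations, and I have not checked that py-earth computes it in precisely this way.

```python
def gcv(rss, n_samples, n_terms, penalty=3.0):
    """Generalized cross-validation score (assumed form, see the text above).

    rss       : residual sum of squares of the model
    n_samples : number of training rows
    n_terms   : number of basis functions, including the intercept
    penalty   : cost charged per knot (the penalty parameter)
    """
    effective_params = n_terms + penalty * (n_terms - 1) / 2.0
    return (rss / n_samples) / (1.0 - effective_params / n_samples) ** 2


# Same fit quality, but a higher penalty makes the model look worse,
# so the pruning pass is pushed towards dropping terms.
print(gcv(rss=100.0, n_samples=100, n_terms=5, penalty=3.0))
print(gcv(rss=100.0, n_samples=100, n_terms=5, penalty=4.0))
```

Under this form, raising penalty inflates the score of models with many terms, which favours simpler models.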

endspan_alpha : float, optional, probability between 0 and 1 (default=0.05)

A parameter controlling the calculation of the endspan parameter (below). The endspan parameter is calculated as round(3 - log2(endspan_alpha/n)), where n is the number of features. The endspan_alpha parameter represents the probability of a run of positive or negative error values on either end of the data vector of any feature in the data set. See equation 45, Friedman, 1991.
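The endspan formula is easy to evaluate directly. This sketch only reproduces the expression quoted above; the function name default_endspan is mine.

```python
import math


def default_endspan(endspan_alpha, n_features):
    """endspan as documented: round(3 - log2(endspan_alpha / n))."""
    return round(3 - math.log2(endspan_alpha / n_features))


print(default_endspan(0.05, 1))   # single feature, default alpha
print(default_endspan(0.05, 10))  # more features give a larger endspan
```

With the default endspan_alpha of 0.05 and a single feature, as in our data, the 7 most extreme points at each end are not eligible as knots.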

endspan : int, optional (default=-1)

The number of extreme data values of each feature not eligible as knot locations. If endspan is set to -1 (default) then the endspan parameter is calculated based on endspan_alpha (above). If endspan is set to a positive integer then endspan_alpha is ignored.

minspan_alpha : float, optional, probability between 0 and 1 (default=0.05)

A parameter controlling the calculation of the minspan parameter (below). The minspan parameter is calculated as

(int) -log2(-(1.0/(n*count))*log(1.0-minspan_alpha)) / 2.5

where n is the number of features and count is the number of points at which the parent term is non-zero. The minspan_alpha parameter represents the probability of a run of positive or negative error values between adjacent knots separated by minspan intervening data points. See equation 43, Friedman, 1991.
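The minspan expression can be evaluated the same way. In this sketch I assume that the (int) truncation applies to the whole expression, including the division by 2.5; the function name default_minspan is mine.

```python
import math


def default_minspan(minspan_alpha, n_features, count):
    """minspan as documented, truncated to an integer.

    count is the number of points at which the parent term is non-zero.
    """
    inner = -(1.0 / (n_features * count)) * math.log(1.0 - minspan_alpha)
    return int(-math.log2(inner) / 2.5)


print(default_minspan(0.05, n_features=1, count=100))
```

For one feature, 100 non-zero points and the default minspan_alpha of 0.05, this gives a minspan of 4.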

minspan : int, optional (default=-1)

The minimal number of data points between knots. If minspan is set to -1 (default) then the minspan parameter is calculated based on minspan_alpha (above). If minspan is set to a positive integer then minspan_alpha is ignored.

thresh : float, optional (default=0.001)

Parameter used when evaluating stopping conditions for the forward pass. If either RSQ > 1 - thresh or if RSQ increases by less than thresh for a forward pass iteration then the forward pass is terminated.
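The two conditions involving thresh can be written as a single check. This is a sketch of only the rule described above; the real forward pass also stops for other reasons (for example when max_terms is reached), and the function name is made up.

```python
def forward_pass_should_stop(rsq, previous_rsq, thresh=0.001):
    """True if either thresh-based stopping condition is met."""
    fits_well_enough = rsq > 1 - thresh
    improved_too_little = (rsq - previous_rsq) < thresh
    return fits_well_enough or improved_too_little


print(forward_pass_should_stop(0.9995, 0.90))  # RSQ is already above 1 - thresh
print(forward_pass_should_stop(0.80, 0.70))    # still improving, keep going
print(forward_pass_should_stop(0.8005, 0.80))  # gain below thresh, stop
```

A larger thresh makes both conditions easier to satisfy, so the forward pass stops earlier and produces fewer terms. This is the "widening the threshold" effect mentioned earlier.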

zero_tol : float, optional (default=1e-12)

Used when determining whether a floating point number is zero during the forward pass. This is important in determining linear dependence and in the fast update procedure. There should normally be no reason to change zero_tol from its default. However, if nans are showing up during the forward pass or the forward pass seems to be terminating unexpectedly, consider adjusting zero_tol.

min_search_points : int, optional (default=100)

Used to calculate check_every (below). The minimum samples necessary for check_every to be greater than 1. The check_every parameter is calculated as

(int) m / min_search_points

if m > min_search_points, where m is the number of samples in the training set. If m <= min_search_points then check_every is set to 1.

check_every : int, optional (default=-1)

If check_every > 0, only one of every check_every sorted data points is considered as a candidate knot. If check_every is set to -1 then the check_every parameter is calculated based on min_search_points (above).
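The relationship between min_search_points and check_every, and the thinning of candidate knots, can be sketched as follows. The function names are mine, and the sketch ignores the endspan and minspan exclusions, so it shows the idea rather than py-earth's exact knot search.

```python
def default_check_every(n_samples, min_search_points=100):
    """check_every as documented: m // min_search_points, or 1 for small m."""
    if n_samples > min_search_points:
        return n_samples // min_search_points
    return 1


def candidate_knots(values, check_every):
    """Keep one of every check_every sorted data points as a candidate knot."""
    return sorted(values)[::check_every]


print(default_check_every(1000))  # large dataset: check every 10th point
print(default_check_every(80))    # small dataset: check every point
print(candidate_knots([5, 3, 1, 4, 2, 6], check_every=2))
```

On small datasets like ours every point stays a candidate; the thinning only matters once the number of samples exceeds min_search_points.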

allow_linear : bool, optional (default=True)

If True, the forward pass will check the GCV of each new pair of terms and, if it’s not an improvement on a single term with no knot (called a linear term, although it may actually be a product of a linear term with some other parent term), then only that single, knotless term will be used. If False, that behavior is disabled and all terms will have knots except those with variables specified by the linvars argument (see the fit method).

use_fast : bool, optional (default=False)

If True, use the approximation procedure defined in [2] to speed up the forward pass. The procedure uses two hyper-parameters: fast_K and fast_h. See below for more details.

fast_K : int, optional (default=5)

Only used if use_fast is True. As defined in [2], section 3.0, it defines the maximum number of basis functions to look at when we search for a parent, that is we look at only the fast_K top terms ranked by the mean squared error of the model the last time the term was chosen as a parent. The smaller fast_K is, the more gains in speed we get but the more approximate is the result. If fast_K is the maximum number of terms and fast_h is 1, the behavior is the same as in the normal case (when use_fast is False).

fast_h : int, optional (default=1)

Only used if use_fast is True. As defined in [2], section 4.0, it determines the number of iterations before repassing through all the variables when searching for the variable to use for a given parent term. Before reaching fast_h number of iterations only the last chosen variable for the parent term is used. The bigger fast_h is, the more speed gains we get, but the result is more approximate.

smooth : bool, optional (default=False)

If True, the model will be smoothed such that it has continuous first derivatives. For details, see section 3.7, Friedman, 1991.

enable_pruning : bool, optional (default=True)

If False, the pruning pass will be skipped.

feature_importance_type : string or list of strings, optional (default=None)

Specify which kind of feature importance criteria to compute. Currently three criteria are supported: ‘gcv’, ‘rss’ and ‘nb_subsets’. By default (when it is None), no feature importance is computed. Feature importance is a measure of the effect of the features on the outputs. For each feature, the values go from 0 to 1 and sum up to 1. A high value means the feature has, on average (over the population), a large effect on the outputs. See [4], section 12.3 for more information about the criteria.

verbose : int, optional (default=0)

If verbose >= 1, print out progress information during fitting. If verbose >= 2, also print out information on numerical difficulties if encountered during fitting. If verbose >= 3, print even more information that is probably only useful to the developers of py-earth.
