Linear regression with scikit-learn, plus installing Anaconda to gain access to thousands of machine learning and data science packages.

Installing Anaconda

First, go to the Anaconda download page and download the installer for your system. Choose the Python 3.x version, as this is the most recent and widely used version of Python. The download may take a while.

Once you have downloaded the installer, run it to install Anaconda on your computer. When it has finished, open the newly-installed Anaconda Navigator.

When you open Anaconda, you’ll be presented with a choice of applications to start. Select ‘Environments’ on the left to see all of the machine learning libraries that are available in the default environment. Here you can create different environments, each with its own set of libraries, in which to run your code. Go back to Home and launch the Spyder IDE, which we’ll be using today.

Classification and regression

Classification is when a machine learning model uses data to sort items into groups or categories. For example, last time we created a decision tree that classified names into Male or Female.

Regression, on the other hand, is when a model uses data to create a numerical prediction. For example, you might predict a student’s test result percentage whereas with classification you could only predict whether they’d pass or fail.

Linear regression

Linear regression is a regression algorithm. When it is drawn out on a graph it forms a straight line with a constant gradient, hence the name linear regression. For example:

[Image: a scatter plot with a linear regression line of best fit drawn through the datapoints. Image source: David Fumo on Medium]

As you can see, the model basically draws a line of best fit and uses it to make predictions. The obvious limitation of this is that it can only accurately predict simple linear patterns.

Making predictions

To make predictions the model uses an equation such as the following:

p = wx + b

This is a very simple equation. x is the input we give the model, such as hours revised. w is the weight (also called the coefficient) that the model decides on, and b is the bias or offset (where the line crosses the y axis), which is also decided by the model. p is the model’s prediction.

If you recall straight lines from maths you’ll realise that it is the same as the equation for a straight line, where x would be the x axis on the graph, w would be the gradient and b would be the y-intercept.
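For example, here is what that equation looks like in Python, using made-up values for the weight and bias (the real model learns these from data):

w = 1.5   # weight (gradient), a hypothetical value
b = 20.0  # bias (y-intercept), a hypothetical value

hours_revised = 10                   # the input x
prediction = w * hours_revised + b   # p = wx + b
print(prediction)                    # 35.0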

The prediction of a more complex model with multiple input variables could be calculated as:

p = w_1x + w_2y + w_3z + b

Where p is the prediction, x, y and z are the inputs, w_1, w_2 and w_3 are the weights decided upon by the model, and b is the bias as before.
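In Python this is just a weighted sum of the inputs plus the bias. Here is a small sketch with hypothetical weights and inputs:

weights = [0.05, 0.2, 0.01]   # w_1, w_2, w_3 (made-up values)
bias = 3.0                    # b (made-up value)
inputs = [100.0, 40.0, 60.0]  # x, y, z

# p = w_1*x + w_2*y + w_3*z + b
prediction = sum(w * x for w, x in zip(weights, inputs)) + bias
print(prediction)  # 16.6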

Cost Function

In order to optimise the weights, we use a cost function. The cost function essentially determines how wrong the model is, or how badly it is performing. A popular cost function is MSE, which stands for Mean Squared Error (it is also sometimes called the L2 loss):

MSE = \tfrac{1}{n}\sum\limits_{i=1}^{n}(y_i - (mx_i + b))^2

Although it looks complicated at first, it is quite straightforward once you’ve been walked through it.

n is the total number of datapoints in the dataset. \sum\limits_{i=1}^{n} is the summation symbol. It loops through each whole number from the value at the bottom (1) to the value at the top (n), adding 1 to the variable defined at the bottom (i in this case) each time. On each loop, it adds the result of the expression on its right to a running total. For example:

\sum\limits_{i=1}^{5} i = 1 + 2 + 3 + 4 + 5 = 15

Each time it adds the value of i (which increases by one each loop) to the total. It loops five times, as the number at the top says. This could be written in Python like:

total = 0
for i in range(1, 6):  # range stops one before the end, so it has to be 6 to include 5
    total = total + i
print(total)  # 15

(mx_i + b) is our prediction for a datapoint (the ith datapoint) in the training data, like we saw earlier. y_i is the correct output for that datapoint. So (y_i - (mx_i + b)) works out the difference between our model’s prediction and the correct output, or the error for that datapoint. It then gets squared and added to the total.

Lastly, the total returned by the summation is multiplied by \tfrac{1}{n}, which gives us the average of the squared difference between the correct and predicted answer for each item in the training data. So essentially it loops through each datapoint in the training data and performs a prediction on it, working out the error and squaring it each time before calculating the average of all the squared errors.
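To make this concrete, here is a small from-scratch sketch of MSE in Python, assuming the data is stored in plain lists and using hypothetical values for m and b:

def mse(m, b, xs, ys):
    n = len(xs)
    total = 0
    for i in range(n):
        prediction = m * xs[i] + b                  # (mx_i + b)
        total = total + (ys[i] - prediction) ** 2   # squared error for this datapoint
    return total / n                                # multiply by 1/n to get the mean

xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
print(mse(2.0, 0.0, xs, ys))  # 0.025, small because m=2, b=0 fits this toy data well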

The goal of our model is to minimise the score calculated by MSE. Don’t worry if you don’t understand the maths; unless you want to write the algorithms from scratch you don’t really need to know it, as you can just use the models in scikit-learn.

After calculating the cost function, the model uses an algorithm called gradient descent to adjust the weights to minimise the cost. You can find out more about gradient descent here.
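Below is a minimal sketch of gradient descent for the single-variable model p = mx + b, just to illustrate the idea. (Scikit-learn’s LinearRegression actually solves the least-squares problem directly rather than iterating like this, so you won’t see this code in practice.)

# One gradient descent step: the gradients are the partial derivatives
# of MSE with respect to m and b.
def gradient_descent_step(m, b, xs, ys, learning_rate=0.01):
    n = len(xs)
    dm = 0.0
    db = 0.0
    for x, y in zip(xs, ys):
        error = y - (m * x + b)
        dm += -2 * x * error / n  # d(MSE)/dm
        db += -2 * error / n      # d(MSE)/db
    # Move the weights a small step in the direction that reduces the cost
    return m - learning_rate * dm, b - learning_rate * db

m, b = 0.0, 0.0
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
for _ in range(1000):
    m, b = gradient_descent_step(m, b, xs, ys)
print(m, b)  # m and b move towards the values that minimise MSE for this toy data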

Programming

Open the Spyder IDE which came with Anaconda (you can launch it from the Home tab in Anaconda Navigator). We’re going to use linear regression to predict the sales revenue of a company based on its advertising spending. You can download the dataset as a CSV file from here and save it into the same directory as your Python file. The file has four columns: thousands of dollars spent on TV, radio and newspaper advertising, and sales in thousands of units sold.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from numpy import array
from sklearn.metrics import mean_squared_error, r2_score
import math

First, we import everything we’ll need.

Preparing the data

data = pd.read_csv("Advertising.csv")

data['totalSpend'] = data['newspaper'] + data['radio'] +data['TV']
print(data.head())

Next we open the CSV file with pandas, a library commonly used for working with tabular data like CSV files. Then we create a new column in the version of the file stored in the variable data. Because this is our first attempt at linear regression, we’re only going to use one variable to keep it simple, so we add the newspaper, radio and TV spending together and call it totalSpend. Finally, we use the pandas head() function to print the first five rows of the file. It should look like this:

   Unnamed: 0     TV  radio  newspaper  sales  totalSpend
0           1  230.1   37.8       69.2   22.1       337.1
1           2   44.5   39.3       45.1   10.4       128.9
2           3   17.2   45.9       69.3    9.3       132.4
3           4  151.5   41.3       58.5   18.5       251.3
4           5  180.8   10.8       58.4   12.9       250.0

As you can see, there are two columns of index numbers. This is because there was already one in the file and pandas has also added its own. We don’t need either of these, so in the next chunk of code we need to deal with them.

del data['newspaper']
del data['radio']
del data['TV']
data.set_index("totalSpend", inplace= True)
del data['Unnamed: 0']
print(data.head())

Here we delete the newspaper, radio and TV columns as we no longer need them. Then we replace the index created by pandas with the totalSpend column. We then delete the index column that was already in the file. Lastly, we print the first five rows again:

            sales
totalSpend       
337.1        22.1
128.9        10.4
132.4         9.3
251.3        18.5
250.0        12.9

Perfect!

Plotting the data in a graph

data.plot(marker='.',linestyle="None")
plt.show()

We then plot the data in a graph using pyplot:

[Image: scatter plot of totalSpend against sales, showing a clear positive correlation]

As you can see, there is a correlation – the higher the advertising spending, the higher the sales.

Splitting the data

X=array(data.index).reshape(-1,1)
y=array(data['sales']).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,random_state=0)

We then convert the columns into Numpy arrays and reshape them into 2D arrays that can easily be fitted to the model before assigning them to the variables X and y, for input and output (spending and sales). Lastly, we use the scikit-learn function train_test_split() to split it into training and testing data – a third of it will be for testing and two thirds for training.
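If you want to check what train_test_split() produced, you can print the shapes of the resulting arrays. The exact numbers below assume the standard 200-row version of the Advertising dataset:

# Quick sanity check of the split: a third of the rows are held out for testing
print(X_train.shape, X_test.shape)  # (134, 1) (66, 1)
print(y_train.shape, y_test.shape)  # (134, 1) (66, 1)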

Creating and training the model

lr = LinearRegression()
lr.fit(X_train, y_train)

y_test_pred = lr.predict(X_test)

Next we create the model and fit the training data to it (i.e. give it the training data so that it can train). Then, once training has finished, it makes a prediction for every test datapoint and stores them in y_test_pred.

Evaluating the model

print(lr.coef_)
print(lr.intercept_)
print("MSE =",round(mean_squared_error(y_test, y_test_pred),2))
print("Average error =",round(math.sqrt(mean_squared_error(y_test, y_test_pred)),2))
print("R2 score =", round(r2_score(y_test, y_test_pred),2))

On the third line we print out the model’s MSE, and on the fourth line we take its square root to find the average error, i.e. how far off the model’s predictions were on average. Mine was an average of 2.33 off, which is quite good. Next we calculate the R^2 score of the model, which measures how close the data is to the regression line; 1.0 is the score of a perfect model. Our model scored 0.79, which is also quite good. Also, on the first and second lines we print out the model’s weight and bias. Mine were 0.04811286 and 4.49619351. You may remember our equation for calculating the prediction:

p = wx + b

If we substitute the letters for the true values we get how the model actually predicts:

p = 0.04811286x + 4.49619351

We can check this: if we add print(X_test[0]) to the program, we find that the first test datapoint is 108. If we type 0.04811286 * 108 + 4.49619351 into a calculator we get 9.69238239. Now, if we add print(lr.predict([X_test[0]])) to the program, we find that this is the exact output of the model!
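Here is that check written out in code (the numbers in the comments are from my run; yours may differ slightly):

# Verify the model's prediction by hand using the learned weight and bias
x0 = X_test[0]                                        # first test datapoint, e.g. [108.]
manual = lr.coef_[0][0] * x0[0] + lr.intercept_[0]    # w*x + b
print(manual)              # 9.69238239...
print(lr.predict([x0]))    # the model gives the same value

Now for the rest of the code: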

Showing the model on a graph

plt.scatter(X_test,y_test,marker='.')
plt.plot(X_test,y_test_pred,color='red',linewidth=3)
plt.xlabel('Advertising spending (thousands of $s)')
plt.ylabel('Sales (thousands of units)')
plt.show()

Lastly, we create another graph and plot the test data on it. Then we add the red line of the model’s predictions and the labels on the axes:

[Image: scatter plot of the test data with the model’s predictions drawn as a red line]

Here’s all the code together:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from numpy import array
from sklearn.metrics import mean_squared_error, r2_score
import math

data = pd.read_csv("Advertising.csv")

data['totalSpend'] = data['newspaper'] + data['radio'] +data['TV']
print(data.head())
del data['newspaper']
del data['radio']
del data['TV']
data.set_index("totalSpend", inplace= True)
del data['Unnamed: 0']
print(data.head())

X=array(data.index).reshape(-1,1)
y=array(data['sales']).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_test_pred = lr.predict(X_test)

print(lr.coef_)
print(lr.intercept_)
print("MSE =",round(mean_squared_error(y_test, y_test_pred),2))
print("Average error =",round(math.sqrt(mean_squared_error(y_test, y_test_pred)),2))
print("R2 score =", round(r2_score(y_test, y_test_pred),2))

plt.scatter(X_test,y_test,marker='.')
plt.plot(X_test,y_test_pred,color='red',linewidth=3)
plt.xlabel('Advertising spending (thousands of $s)')
plt.ylabel('Sales (thousands of units)')
plt.show()

Using all three features in the dataset

Now we can try training the model with all three features – TV, radio and newspaper. Here’s the new code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from numpy import array
from sklearn.metrics import mean_squared_error, r2_score
import math

data = pd.read_csv("C:\\Users\\brych_000\\Downloads\\Advertising.csv")

del data['Unnamed: 0']
print(data.head())

X=array(data[['TV','radio', 'newspaper']].values.tolist())
y=array(data['sales']).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_test_pred = lr.predict(X_test)

print(lr.coef_)
print(lr.intercept_)
print("MSE =",round(mean_squared_error(y_test, y_test_pred),2))
print("Average error =",round(math.sqrt(mean_squared_error(y_test, y_test_pred)),2))
print("R2 score =", round(r2_score(y_test, y_test_pred),2))

This time we don’t have to create a new column or delete the old ones, since we’re using TV, radio and newspaper separately instead of the total of all three. Also, we don’t set an index, as we’re not making a graph so there’s no need. In addition, we use the .tolist() function to get the TV, radio and newspaper data for X as a 2D array like this:

[[230.1  37.8  69.2]
 [ 44.5  39.3  45.1]
 [ 17.2  45.9  69.3]
 ...
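As an aside, pandas can give you this 2D NumPy array directly, so the same X could also be built without tolist():

# Equivalent way to build X: .values already returns a 2D NumPy array,
# so the round trip through a Python list isn't strictly necessary.
X = data[['TV', 'radio', 'newspaper']].values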

Evaluating and improving the new model

When we run our new code we get a new average error of 1.85, an improvement of 0.48. Our R^2 score has also improved to 0.87, which is 0.09 closer to a perfect model. However, we find that the coefficients (weights) for TV, radio and newspaper are [[0.04433887 0.19659119 0.00264541]]. As you can see, the model has given newspaper a tiny weight, suggesting that it is unimportant. If we remove newspaper from the training data by changing the line that defines X, as shown below, we find that the average error and R^2 score stay the same. This means that we’ve simplified our model with no effect on performance!
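Here is the modified line (nothing else needs to change):

# Drop newspaper and train on TV and radio only
X = array(data[['TV', 'radio']].values.tolist())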

That’s it for today! Next time we’ll try some more complex datasets. Thanks for reading!
