A continuation of the last blog post

Again, sorry for the delay – I ought to set some sort of reminder.

Last time we created a simple program from scratch to solve a simple numeric puzzle. This time, we’re going to use a machine learning library to make it easier.

We’re going to be using Scikit-learn, one of the most popular machine learning libraries for Python.

### Installation

The easiest way to install scikit-learn is by using pip, Python's built-in package manager. Simply type:

`pip install -U scikit-learn`

into a console, terminal or cmd window. If you don’t already have NumPy and SciPy installed, you’ll need to install them first, also using pip.

You can check if it’s installed correctly by typing

```
>>> import sklearn
>>> sklearn.__version__
```

into an IDLE window.

### Easy puzzle

Firstly, let’s solve the puzzle we did last time:

| Input 1 | Input 2 | Input 3 | Output |
|---------|---------|---------|--------|
| 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 |
| 0 | 1 | 1 | 0 |

#### Decision trees

A decision tree is an algorithm that can be used to classify data. It is a series of if/else statements that sort data into groups. Decision trees get their name because, when drawn as a diagram, they look like an upside-down tree. Here is a very simple decision tree that decides whether you are an adult or not:
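In code, a tree this small is nothing more than a single if/else. A minimal sketch (the age threshold of 18 and the labels come from the diagram; the function name is my own):

```python
def is_adult(age):
    # The whole "tree" is one decision node: a single if/else
    if age > 18:
        return "adult"      # Yes branch
    else:
        return "not adult"  # No branch

print(is_adult(25))  # adult
print(is_adult(10))  # not adult
```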

```
          [Age > 18?]
         /No       \Yes
        /           \
  [Not an adult]  [Adult]
```
This is sometimes called a decision stump, because it only makes one decision. A more complex decision tree, for deciding whether you would have survived the Titanic, might look like this: (image source: Big Whale Learning)

Decision trees are built by picking a feature to split on at each node. For example, at the root of this tree, we could have split the data based on gender, class, age or port of embarkation. But how do we know which feature to split on? To decide, we calculate the purity that each candidate split gives. A ‘pure’ node is a node that doesn’t have a mix of classes; it contains only one class, such as died or survived. For every possible split, we try it and measure the purity of its two child nodes. The split with the purest child nodes gets picked and is added to the tree. This is done using a measure called entropy (go here to find out more).
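To make the purity idea concrete, here is a sketch of how entropy scores a set of labels, using the standard formula H = −Σ p·log2(p) (the helper name is my own; scikit-learn computes this internally, so you never need to write it yourself):

```python
from math import log2

def entropy(labels):
    # H = -sum(p * log2(p)) over each class; 0.0 means a perfectly pure node
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log2(p)
    return result

# A pure node (all one class) has entropy 0
print(entropy(["died", "died", "died"]))   # 0.0
# A 50/50 mix is maximally impure for two classes
print(entropy(["died", "survived"]))       # 1.0
```

The split whose child nodes have the lowest entropy is the one the tree-building algorithm picks.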

Now we can try solving the puzzle we did last time. Open a new Python file and type in the following:

```
from sklearn import tree

x = [[0,0,1], [1,1,1], [1,0,1], [0,1,1]] #Training puzzles
y = [0, 1, 1, 0] #Training puzzle answers
```

Firstly, we import the tree module from scikit-learn, and then we store the training puzzles and their answers in two lists.

```
dt = tree.DecisionTreeClassifier() #Create the tree
dt = dt.fit(x,y) #attach (fit) the training data to the tree
```

Then we create a decision tree classifier in a variable called ‘dt’ and fit the training data to it; fitting is what actually builds the tree.

```
output = dt.predict([[1,0,0]]) #try a new puzzle
print(output) #print the tree's attempt
```

Lastly, we give the tree a new situation (puzzle), it makes a prediction, and the prediction gets printed.

When you run all the code above, you should find that `[1]` is printed. This is correct, since the output is the same as input 1. You can now try any puzzle, and the tree should get it right (assuming it’s a valid puzzle).

Now we can see a representation of what the tree’s doing. Add the following code to the end of your file:

```
picture = tree.export_graphviz(dt, out_file=None, feature_names=['input 1','input 2','input 3'], class_names=['0','1'])
print(picture)
```

This should print some text such as:

```
digraph Tree {
node [shape=box] ;
0 [label="input 1 <= 0.5\ngini = 0.5\nsamples = 4\nvalue = [2, 2]\nclass = 0"] ;
1 [label="gini = 0.0\nsamples = 2\nvalue = [2, 0]\nclass = 0"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="gini = 0.0\nsamples = 2\nvalue = [0, 2]\nclass = 1"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
}
```

This is your tree represented in DOT format, a format that lets you describe graphs as text. Copy the text into Webgraphviz and select ‘Generate Graph!’, and it will draw a diagram of your tree.

As you can see, this is a very simple tree. All it does is split the data based on whether input 1 is smaller than or equal to 0.5. If it is, the tree outputs 0; otherwise, it outputs 1. This is the correct answer for the puzzle. Now we can try a harder problem.

### Harder problem

Now we’re going to try to create a decision tree that can predict your gender based on your name. However, here we hit two of the biggest problems in machine learning: getting training data, and deciding which features to use.

Neither of these was an issue in the last puzzle. There was so little training data we could just type it in manually, and the features were simply the inputs. Now, though, we need hundreds of names. Where can we get hundreds of names from? Also, most machine learning algorithms, including decision trees, can’t accept textual data – only numeric data. So how can we represent names as numbers?

Fortunately, there is a list of male names and a list of female names (the URLs appear in the code below), created by Mark Kantrowitz in 1991 and added to by Bill Ross. Just in case you’re wondering, these lists were created for AI training, not baby-name browsing.

We need to pick features (properties) of the name to represent as numbers. After a lot of research and playing about, I’ve decided to use the length of the name, the second-to-last letter (as a number from 1–26) and the last letter. Now we can start the code:

```
import urllib.request
import random
from sklearn import tree
```

We need to import urllib.request to get the names, random to shuffle the data and Scikit-Learn’s tree library. I think you can guess why we need the last one.

```
def extractFeatures(name): #function that extracts length, second to last char and last char from name
    last_char = ord(name[-1]) - ord('a') + 1 #find last char as number
    sec_last_char = ord(name[-2]) - ord('a') + 1 #find second last char as number
    return [len(name), sec_last_char, last_char] #return as list
```

Then we define a function called extractFeatures() that takes a name as an argument. It finds the last and second-to-last characters and the length of the name, and returns them in a list.
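To see what a feature vector looks like, you can call the function on a sample name (the function is re-declared here so the snippet runs on its own; "Alice" is just an example name, not from the training lists):

```python
def extractFeatures(name):  # same function as above
    last_char = ord(name[-1]) - ord('a') + 1
    sec_last_char = ord(name[-2]) - ord('a') + 1
    return [len(name), sec_last_char, last_char]

print(extractFeatures("Alice"))  # [5, 3, 5]  (length 5, 'c' = 3, 'e' = 5)
```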

```
male = []
for line in urllib.request.urlopen("https://www.cs.cmu.edu/Groups/AI/areas/nlp/corpora/names/male.txt"): #get male names
    line = line.decode("utf-8").rstrip() #decode bytes to text and remove \n
    if line and line[0] != '#': #if it isn't a blank line or a comment
        male = male + [line] #add name to male list
```

Then we open the URL of the list of male names using urllib. For each line on the page, we decode it from UTF-8 and strip the trailing newline, then add it to the list of male names if it isn’t a blank line or a comment.

```
female = []
for line in urllib.request.urlopen("https://www.cs.cmu.edu/Groups/AI/areas/nlp/corpora/names/female.txt"): #get female names
    line = line.decode("utf-8").rstrip() #decode bytes to text and remove \n
    if line and line[0] != '#': #if it isn't a blank line or a comment
        female = female + [line] #add name to female list
```

We then repeat the process for the female names.

```
random.shuffle(male) #shuffle both lists to make order random instead of alphabetical
random.shuffle(female)
```

We then shuffle both lists randomly so that they aren’t in alphabetical order.

```
train_names = male[:1500] + female[:1500] #create a list of 1500 male and 1500 female names for training
train_names_gender = ['M']*1500 + ['F']*1500 #create a list with 'M' or 'F' for each name
print(train_names[:1]) #see sample of data
```

We then add the first 1500 shuffled male names and the first 1500 shuffled female names to a list to use to train the tree. We also create a list that, for each name in the training list, has either ‘M’ or ‘F’ which we’ll also use to train the tree.

```
test_names = male[1500:2000] + female[1500:2000] #create a list of 500 male and 500 female names for testing
test_names_gender = ['M']*500 + ['F']*500 #create a list with 'M' or 'F' for each name
```

We then create a list with 500 different male names and 500 different female names in order to calculate the accuracy of the tree at the end. We also create another list of ‘M’ and ‘F’ values to check if the tree’s predictions for the test names are correct (in order to calculate the accuracy).
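As an aside, scikit-learn has a helper that does this shuffle-and-split in one call: `train_test_split` from `sklearn.model_selection`. A sketch with a few made-up feature rows (the data here is illustrative, not from the name lists):

```python
from sklearn.model_selection import train_test_split

x = [[4, 8, 14], [5, 3, 5], [3, 15, 13], [4, 9, 11]]  # made-up feature rows
y = ['M', 'F', 'M', 'F']

# Hold out 25% of the data for testing; shuffling happens automatically
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
print(len(x_train), len(x_test))  # 3 1
```

We do the split by hand in this post so every step is visible, but the helper is handy once you're comfortable with the idea.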

```
temp = []
for name in train_names: #for every name,
    temp.append(extractFeatures(name)) #extract features and add to temp
train_names = temp #replace list of real names with list of features
print(train_names[:1]) #see data after feature extraction
```

Then, for every name in the training data, we call extractFeatures() on it and add the resulting list of features to temp. We then replace the list of names in train_names with the 2D list of features for each name.

```
dt = tree.DecisionTreeClassifier(max_depth=4) #Create the tree
dt = dt.fit(train_names, train_names_gender) #attach (fit) the training data to the tree
```

We then create the decision tree itself in a variable called dt and fit the training data to it. We limit the depth of the tree to 4 to prevent overfitting (I’ll come back to this later).

```
temp = []
for name in test_names: #for every name in test list,
    temp.append(extractFeatures(name)) #extract features and add to temp
test_names = temp #replace real names with features
```

We then loop through the test names and extract their features like we did to the training names earlier on.

```
correct, total = 0, 0
for i in range(len(test_names)): #for every name in test list
    output = dt.predict([test_names[i]]) #try predicting gender with tree
    if output[0] == test_names_gender[i]: #if it's the same as the actual gender
        correct += 1 #count correct prediction
    total += 1 #count every prediction
print("ACCURACY:", (correct/total)*100, '%') #calculate and print accuracy
```

We then loop through the list of test name features and ask the tree to perform a prediction for each name. If the tree’s output matches the corresponding value in the list of genders for the test names, we add 1 to the correct variable that counts the number of test names the tree correctly predicted the gender for. We then add 1 to total, regardless of whether it predicted the name correctly or not. After looping through each name, we calculate the accuracy as a percentage and print it.
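scikit-learn can also do this loop for you: a fitted classifier's `score()` method returns the fraction of correct predictions, which is exactly the correct/total calculation. A sketch on the toy puzzle from earlier:

```python
from sklearn import tree

x = [[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]]  # training puzzles
y = [0, 1, 1, 0]                                  # training answers

dt = tree.DecisionTreeClassifier()
dt.fit(x, y)

# score() = fraction of correct predictions, same as the correct/total loop
print(dt.score([[1, 0, 0], [0, 0, 0]], [1, 0]) * 100, '%')  # 100.0 %
```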

```
test = input("\nEnter a name to test (or -1 to quit): ") #allow user to enter name
while test != '-1': #keep going until the user requests to quit
    features = extractFeatures(test) #extract features from user-entered name
    output = dt.predict([features]) #predict gender of name
    print("PREDICTION:", output[0], '\n') #print prediction
    test = input("Enter a name: ") #prompt to enter another name
```

We then allow the user to enter a name, and we perform feature extraction on it. Then the tree predicts the name’s gender and its prediction gets printed. It keeps prompting the user to enter a name until they enter ‘-1’ to show they want to quit. Here’s all the code together:

```
import urllib.request
import random
from sklearn import tree

def extractFeatures(name): #function that extracts length, second to last char and last char from name
    last_char = ord(name[-1]) - ord('a') + 1 #find last char as number
    sec_last_char = ord(name[-2]) - ord('a') + 1 #find second last char as number
    return [len(name), sec_last_char, last_char] #return as list

male = []
for line in urllib.request.urlopen("https://www.cs.cmu.edu/Groups/AI/areas/nlp/corpora/names/male.txt"): #get male names
    line = line.decode("utf-8").rstrip() #decode bytes to text and remove \n
    if line and line[0] != '#': #if it isn't a blank line or a comment
        male = male + [line] #add name to male list

female = []
for line in urllib.request.urlopen("https://www.cs.cmu.edu/Groups/AI/areas/nlp/corpora/names/female.txt"): #get female names
    line = line.decode("utf-8").rstrip() #decode bytes to text and remove \n
    if line and line[0] != '#': #if it isn't a blank line or a comment
        female = female + [line] #add name to female list

random.shuffle(male) #shuffle both lists to make order random instead of alphabetical
random.shuffle(female)

train_names = male[:1500] + female[:1500] #create a list of 1500 male and 1500 female names for training
train_names_gender = ['M']*1500 + ['F']*1500 #create a list with 'M' or 'F' for each name
print(train_names[:1]) #see sample of data

test_names = male[1500:2000] + female[1500:2000] #create a list of 500 male and 500 female names for testing
test_names_gender = ['M']*500 + ['F']*500 #create a list with 'M' or 'F' for each name

temp = []
for name in train_names: #for every name,
    temp.append(extractFeatures(name)) #extract features and add to temp
train_names = temp #replace list of real names with list of features
print(train_names[:1]) #see data after feature extraction

dt = tree.DecisionTreeClassifier(max_depth=4) #Create the tree
dt = dt.fit(train_names, train_names_gender) #attach (fit) the training data to the tree

temp = []
for name in test_names: #for every name in test list,
    temp.append(extractFeatures(name)) #extract features and add to temp
test_names = temp #replace real names with features

correct, total = 0, 0
for i in range(len(test_names)): #for every name in test list
    output = dt.predict([test_names[i]]) #try predicting gender with tree
    if output[0] == test_names_gender[i]: #if it's the same as the actual gender
        correct += 1 #count correct prediction
    total += 1 #count every prediction
print("ACCURACY:", (correct/total)*100, '%') #calculate and print accuracy

test = input("\nEnter a name to test (or -1 to quit): ") #allow user to enter name
while test != '-1': #keep going until the user requests to quit
    features = extractFeatures(test) #extract features from user-entered name
    output = dt.predict([features]) #predict gender of name
    print("PREDICTION:", output[0], '\n') #print prediction
    test = input("Enter a name: ") #prompt to enter another name
```

To see a diagram of this tree, add the following code to the end of your file (note that class_names must be listed in sorted order, so ‘F’ comes before ‘M’):

```
picture = tree.export_graphviz(dt, out_file=None, feature_names=['length','sec_last_char','last_char'], class_names=['F','M'])
print(picture)
```

and copy the printed text into Webgraphviz again.

If you run it, you should get an accuracy of around 74 to 76%. This is pretty good, considering it’s only the second time we’ve used a decision tree, we’ve only used 3 features and a decision tree is a comparatively simple machine learning algorithm. Now you can try entering your own name and see if it predicts your gender correctly.
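If you want a more robust accuracy estimate than a single train/test split, scikit-learn’s `cross_val_score` averages the accuracy over several different splits of the same data. A sketch on made-up feature rows standing in for extracted name features (the data here is illustrative only):

```python
from sklearn import tree
from sklearn.model_selection import cross_val_score

# Made-up feature rows standing in for extracted name features
x = [[4, 8, 14], [5, 3, 5], [3, 15, 13], [4, 9, 11],
     [6, 5, 14], [4, 14, 1], [5, 18, 11], [7, 14, 1]]
y = ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F']

dt = tree.DecisionTreeClassifier(max_depth=4)
scores = cross_val_score(dt, x, y, cv=4)  # 4-fold cross-validation
print(scores.mean())  # mean accuracy across the 4 folds
```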

When deciding which features to use for a decision tree, you have to be careful: too few features means the tree doesn’t have enough information to make accurate predictions, while too many can make it worse, because the tree can become too specific to the training data and lose accuracy. You also have to make sure the features you pick are actually related to what you want to predict, or the tree won’t be able to learn anything from them.

For example, if you wanted to predict how many children someone has, you can’t only use their age. Likewise, you can’t use their age, average monthly income, average monthly outgoings, how old their car is, the temperature inside their greenhouse, the material their house is made of, whether or not they own a boomerang, their favourite colour, the size of their car’s wheels, how many instruments they play, their favourite sport, whether they prefer Coke or Pepsi, the number of pets they have, the number of potted plants in their house, whether or not they can speak Spanish and the colour of the blinds in their bedroom. This would just make the decision tree more complicated and probably less accurate. Lastly, obviously, you can’t use only unrelated features such as the temperature inside their greenhouse, whether or not they own a boomerang and whether they prefer Coke or Pepsi.

#### Overfitting

A major disadvantage of decision trees is that they can easily become overfitted. Overfitting is when a machine learning model becomes too exact or specialised to the training data, and hence doesn’t work well with any other data. This can be seen in the following picture from Wikipedia: the green line represents an overfitted model – it exactly sorts the training data, even the irregularities. The green line would have 100% accuracy if tested against the training data. However, if you introduced different data, it would most likely perform worse than the black line.

One way of preventing decision trees from becoming overfitted is by limiting the size of the tree. This is what we did by setting `max_depth=4` when we created the decision tree: it stops the tree from growing more than 4 decision nodes deep down any path. If we removed the max_depth argument, however, the tree would become massively overcomplicated as a result of overfitting – and the picture shows only part of that tree!