This lab practises the ideas of recursive computation with tree structures and the core concepts of machine learning, by learning a decision tree for diagnosing breast cancer.
Submission instructions:
For this lab, you will need to submit 2 different files, as described below.
In the 21st century, data is everywhere. Being able to make sense of large data sets and use them to make predictions is becoming increasingly relevant in many domains including scientific and medical research, government, and commerce.
Machine learning is the study of computer algorithms that learn from data. Commonly, we are given some training data containing a set of sample input and output pairs and want to use this to make predictions for future inputs whose true output value is unknown. Using a set of labelled training data to learn in this manner is known as supervised learning.
Supervised learning algorithms use training data to create a model which describes a probabilistic relationship between input and output data, and are then able to use this model to make predictions. The fundamental goal in supervised learning is to create models that generalise well. This means that they must make accurate predictions when given future, unknown inputs rather than simply modelling the training data perfectly [1].
In this tutorial you will investigate a popular model called the decision tree, and see how it can be applied as a diagnostic tool in medicine.
The training data set we will use in this tutorial is a subset of the Breast Cancer Wisconsin (Original) Data Set, courtesy of the UCI Machine Learning Repository [4]. Each training example contains a set of measurements, called features, and a corresponding classification as either malignant (true) or benign (false). The 9 features were obtained from Fine-Needle Aspiration Cytology (FNAC) measurements of human tissue. Each feature has been normalised to lie in the range 1-10, with high valued features thought to represent more malignant (cancerous) cell characteristics. All feature values are integers.
#   Feature                       Domain
--  ----------------------------  ------
1.  Clump Thickness               1 - 10
2.  Uniformity of Cell Size       1 - 10
3.  Uniformity of Cell Shape      1 - 10
4.  Marginal Adhesion             1 - 10
5.  Single Epithelial Cell Size   1 - 10
6.  Bare Nuclei                   1 - 10
7.  Bland Chromatin               1 - 10
8.  Normal Nucleoli               1 - 10
9.  Mitoses                       1 - 10
In this tutorial we will represent a single training example as a list containing 10 elements - the 9 input features and
the corresponding class. For example
represents a training example
whose class is benign. The training data set is then a list of training examples - a list of lists.
A test example, e.g. from a new patient, however, contains only the 9 input features. For example,
represents a test example
whose class is unknown. The test data set is then a list of test examples - also a list of lists in Snap. Your overall goal is to learn good decision rules from a training dataset, so as to classify the test examples correctly.
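For concreteness, here is what these lists might look like if written in Python (illustrative only: the lab itself uses Snap! lists, and the feature values below are invented for the sake of the example).

```python
# Illustrative only: the lab uses Snap! lists, and these feature values
# are invented for the sake of the example.
training_example = [5, 1, 1, 1, 2, 1, 3, 1, 1, False]  # 9 features + class (benign)
test_example     = [5, 1, 1, 1, 2, 1, 3, 1, 1]         # 9 features, class unknown

training_data = [training_example]  # the training set is a list of lists
```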
Trees are one of the most widely used data structures in computer science. They are used for searching, sorting, representing hierarchical relationships and even finding routes for data over the Internet. Recall that a tree consists of a single root node, a number of internal nodes and leaf nodes, and that these nodes are connected with edges.
A decision tree is a special type of tree used to classify input data into one of a number of classes. Each internal node (or decision node) of a decision tree contains a decision rule, and each leaf node contains a class label. Here is a simple decision tree for our FNAC domain with 4 decision nodes and 5 leaf nodes:
This decision tree takes a set of features as input and produces a classification of either Benign or Cancer as
output. To decide the class of a new input vector we start at the root node and continue down the tree, repeatedly
applying decision rules. Each decision rule compares a particular input feature with a fixed threshold, and depending on
the feature value we continue down either the left or right subtree. When we reach a leaf node we stop and this becomes
the classification for our input vector.
The decision function at each decision node in our tree can therefore be described by two parameters: a feature number and a split point. The number in brackets next to the feature description at each decision node is the feature number. The number next to the edges leaving a decision node is the split point.
be classified as Benign or Cancer?

A tree is an example of a recursive data type. The children of a node in a tree are themselves trees - large trees are
created simply by joining smaller trees together. In the example above, the subtree rooted at the node labelled Clump
thickness and the subtree rooted at the node labelled Bare nuclei are both perfectly valid decision trees in their
own right. We were able to construct a larger decision tree by joining these two trees with a new choice node labelled
Marginal adhesion.
Defining trees in this recursive manner allows us to write beautifully simple algorithms that achieve non-trivial
results. For example to classify a new input vector, all a choice node needs to do is pass the input vector to its
left or right child and ask it to perform the classification. In this task, you will add a classify method to both
the choice nodes and leaf nodes and see how this works.
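The recursive idea can be sketched in Python (a hedged illustration only - the lab itself uses Snap! blocks, and the class and method names here are made up):

```python
# Hypothetical Python sketch (the lab uses Snap! blocks; names are made up).
# A leaf node simply reports its stored class; a choice node compares one
# input feature with its split point and delegates to the matching child.

class Leaf:
    def __init__(self, label):
        self.label = label            # True (cancer) or False (benign)

    def classify(self, example):
        return self.label             # a leaf always reports its own class

class Choice:
    def __init__(self, feature, split, left, right):
        self.feature = feature        # 1-based feature number
        self.split = split            # split point (threshold)
        self.left = left              # subtree for values <= split
        self.right = right            # subtree for values > split

    def classify(self, example):
        if example[self.feature - 1] <= self.split:
            return self.left.classify(example)
        return self.right.classify(example)

# a two-level tree: split on feature 1 (clump thickness) at 5
tree = Choice(1, 5, Leaf(False), Leaf(True))
```

Because `classify` on a choice node just calls `classify` on one of its children, arbitrarily deep trees are handled with no extra code.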
In this task the tree structure is fixed (later, you will see how the tree structure can be learned from training data).
The code around the top of the scripting area will create a fixed decision tree and show it on the stage.
Opening the block will show you its definition in Snap!, which corresponds to its shape drawn on the stage. If this looks very busy, don’t worry - the rest of this task will break down and explain each component in turn.
We start by implementing the simplest possible decision tree - the leaf node. We also use this part to look at the concept of implementing a function as part of a list (i.e. a gray ring).
The input to a leaf node is a test example; the output is a decision (true or false).
There are two types of leaf nodes: One that always returns
(cancer) or one that always returns
(benign). Therefore the expected behaviour of a true leaf node, invoked with given input data, is shown below
That is, the leaf node should always report a class label, and this class label should be the same as the type of the leaf node (i.e. TRUE or FALSE in the angle slot).
Look at the snap workspace to see how this is implemented.
Your next task is to complete the choice node:
Currently, it will also report Not implemented when asked to classify
an input vector. Your goal is to edit the node type blocks so the decision tree reports either
or
for an input vector according to the structure of the decision tree. Similar to the leaf node, replace
with your code.
Two utility blocks to help you visualise the process:
draws a decision tree to the stage.
can be used to visualise how a given decision tree should classify a new input vector. For example, the visualization of a small decision tree
shows:
Some more hints:
or
directly, a decision node should delegate this decision to one of its children.
or
.
and
every time you test your code.

The following three tests should help you troubleshoot and test your program as you write it:
should reproduce the two-level tree drawn above, and report
.
prints a message to the bottom of the stage. For example:
This prints the classification result of the original tree above; it should say “Classification: false”.
On the other hand, a slightly different test example
should say “Classification: true”.
When you think your implementation is working run the following test:
. This will
test your implementation with a range of different inputs and report Pass or Fail.
.
Notice that this tree is the exact converse of the 3-level decision tree earlier – whenever the other tree says ‘cancer’, this tree will diagnose ‘benign’. So which tree is correct?
The next few tasks will explore this question.

Remember that our goal in decision tree learning is to build a decision tree automatically based on our training data. How can we do that?
In computer science, a greedy algorithm is an algorithm that finds a solution to a problem by making locally optimal decisions. Greedy algorithms make a series of choices that seem best at the time but depending on the problem may not produce a globally optimal solution. These algorithms are often useful when finding a globally optimal solution is not computationally feasible. In the next two tasks, we will use a greedy algorithm to build our decision tree, choosing decision functions at each node that appear to best split the data we have at the time.
We are going to start with all our training data, and build our tree by splitting our data set over and over until we reach some stopping criteria. The algorithm we will use to choose a feature and split point given a set of training examples is as follows:
Input: dataset with N features
Output: (feature, split point) pair

for each feature 1 to N:
    sort dataset by feature
    for each split point in sorted dataset:
        evaluate the gain of the split
        if gain is the highest seen so far:
            remember gain, feature and split point
return the highest gain, and the corresponding feature and split point
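As a rough Python translation of this pseudocode (illustrative only), the gain measure can be passed in as a function, since the text only defines it further on in terms of entropy:

```python
# Rough Python translation of the pseudocode (illustrative only). The
# gain measure is passed in as a function, since the text defines it
# later in terms of entropy. Each example is a list of feature values.

def choose_split(dataset, num_features, gain):
    best = None  # (gain, feature, split point)
    for f in range(1, num_features + 1):
        ordered = sorted(dataset, key=lambda ex: ex[f - 1])
        for ex in ordered[:-1]:        # candidate split points
            t = ex[f - 1]
            g = gain(dataset, f, t)
            if best is None or g > best[0]:
                best = (g, f, t)       # remember the best so far
    return best
```

For example, with a toy gain function that prefers split points near 5, `choose_split([[1], [4], [9]], 1, lambda d, f, t: -abs(t - 5))` reports `(-1, 1, 4)`.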
The gain of splitting on a given feature at a given split point is a measure of the quality of the split. Equivalently, we can think of each split as having a cost; at each stage, we want to choose the split with the lowest cost. To define this cost concretely, we need to understand the concept of entropy.
Entropy gives us a measure of the uncertainty associated with a probability distribution. Suppose we roll a die and wish to predict which outcome in the set $C=\{1,2,3,4,5,6\}$ will occur next. If the die is fair, each outcome in $C$ has an equal probability of occurring ($p(x) = \frac{1}{6}$). Because of this, it is difficult to predict which number will come up next - the outcome is uncertain. We say that the distribution of outcomes has high entropy. Conversely, the distribution of an unfair die (that is more likely to roll some numbers than others) has lower entropy, since there is less uncertainty associated with rolling it.
Given a random variable $X$ with probability mass function $p(x)$, the entropy $\mathbb{H}(X)$ is defined as:
$$ \mathbb{H}(X) = -\sum_x p(x)\log_2 p(x) $$
taking $p(x)\log_2 p(x)=0$ whenever $p(x)=0$ [3]. The more uniform the distribution $p(x)$, the higher $\mathbb{H}$ will be. $\mathbb{H}$ is at its highest when all outcomes are equally likely.
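As a quick numerical check of this formula, a few lines of Python (illustrative only):

```python
import math

# A quick numerical check of the entropy formula (illustrative only).
def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_die = [1/6] * 6
unfair_die = [1/2, 1/10, 1/10, 1/10, 1/10, 1/10]

print(entropy(fair_die))    # log2(6), about 2.585: the maximum for 6 outcomes
print(entropy(unfair_die))  # about 2.161: lower, the outcome is less uncertain
```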
In this exercise, we define the cost of a split in terms of the entropy of the sublists that the split creates. If we consider the proportion of training examples from each of the two classes as an approximation of a probability distribution over the classes, then a set of training examples from mostly the same class has low entropy and a set with an equal number from each class has high entropy. We therefore define our cost function as:
cost of split = (proportion of examples in left sublist) * (entropy of left sublist) +
(proportion of examples in right sublist) * (entropy of right sublist)
Minimising this sum will result in both sublists having low entropy and therefore gives us a split with the best possible class separation [2].
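This cost function is easy to express in Python (an illustrative sketch; here the class label, True for malignant and False for benign, is assumed to be the last element of each example):

```python
import math

# Illustrative Python sketch of the cost function. The class label
# (True = malignant, False = benign) is the last element of each example.

def entropy(examples):
    n = len(examples)
    if n == 0:
        return 0.0
    p = sum(1 for e in examples if e[-1]) / n   # proportion of class True
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def cost_of_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# a pure split (each sublist all one class) has cost 0
print(cost_of_split([[2, False], [3, False]], [[8, True], [9, True]]))  # 0.0

# a maximally mixed split costs 1 bit
mixed = [[2, False], [8, True]]
print(cost_of_split(mixed, mixed))                                      # 1.0
```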
The block
implements the feature selection function. The block takes a list of training data as input and should report the index of the optimum feature as well as the optimum split point for that feature. To report a pair of values in Snap!, report a list with two elements:
.
We encourage you to look at the implementation of the feature split block
and note the following technical aspects.
is used to keep track of the possible split points when iterating over the sorted data.
has been created to implement counting sort, since comparison sorting algorithms are incredibly slow in Snap!. Note that this block relies on the features being integers from 1 to 10 and will not work otherwise.

Even using counting sort, this procedure can be quite slow to run, so it is useful to add status messages such as:
to view the progress of the algorithm.
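The counting-sort idea itself is simple; here is a Python sketch (illustrative, with the same assumption that feature values are integers from 1 to 10):

```python
# Illustrative counting sort over one feature, relying (like the Snap!
# block) on the feature values being integers from 1 to 10.

def counting_sort_by_feature(examples, f):
    buckets = [[] for _ in range(10)]   # one bucket per value 1..10
    for ex in examples:
        buckets[ex[f] - 1].append(ex)   # drop each example in its bucket
    return [ex for bucket in buckets for ex in bucket]
```

Counting sort runs in linear time precisely because it never compares two examples - it only indexes by value - which is also why it breaks if a feature falls outside 1-10.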
Now that we have a block to tell us the best feature and split point given a set of data, all that is left to do is to greedily build our decision tree. In this task, we will use a recursive algorithm to do this.
The tree building algorithm we will use is as follows [3]:
Input: training dataset, maximum tree depth
Output: root node of a decision tree constructed greedily using minimum entropy

if depth = 1 or entropy of dataset = 0:
    # stopping criteria reached
    decision = most common class in dataset
    report a leaf node with this decision
else:
    j, t = choose a feature and split point for this dataset
    left sublist, right sublist = split dataset according to (j, t)
    left child = call this procedure recursively on the left sublist with depth - 1
    right child = call this procedure recursively on the right sublist with depth - 1
    report a decision node with the above split and children
Notice that this algorithm is recursive - it contains calls to itself.
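A hedged Python sketch of this recursive builder (the lab implements it in Snap!; here a tree is represented by nested tuples, and every name is made up for illustration):

```python
import math
from collections import Counter

# Hedged Python sketch of the recursive builder (the lab implements it in
# Snap!). A tree is a nested tuple: ("leaf", label) or
# ("choice", feature_index, split, left, right). Each example is a list of
# feature values with its class label (True/False) last. Assumes the data
# can always be split while the stopping criteria are not met.

def entropy(examples):
    n = len(examples)
    p = sum(1 for e in examples if e[-1]) / n
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def choose_split(examples):
    # greedy search for the (feature, split point) with the lowest cost
    best = None
    n = len(examples)
    for f in range(len(examples[0]) - 1):
        for t in sorted({e[f] for e in examples})[:-1]:
            left = [e for e in examples if e[f] <= t]
            right = [e for e in examples if e[f] > t]
            cost = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if best is None or cost < best[0]:
                best = (cost, f, t)
    return best[1], best[2]

def build_tree(examples, depth):
    if depth == 1 or entropy(examples) == 0:
        # stopping criteria reached: leaf with the most common class
        label = Counter(e[-1] for e in examples).most_common(1)[0][0]
        return ("leaf", label)
    f, t = choose_split(examples)
    left = build_tree([e for e in examples if e[f] <= t], depth - 1)
    right = build_tree([e for e in examples if e[f] > t], depth - 1)
    return ("choice", f, t, left, right)

def classify(node, features):
    if node[0] == "leaf":
        return node[1]
    _, f, t, left, right = node
    return classify(left, features) if features[f] <= t else classify(right, features)
```

Note how the structure of `build_tree` mirrors the pseudocode exactly: the two stopping criteria produce a leaf, and otherwise the dataset is split and the procedure calls itself on each sublist.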
Load Snap! project here.
You will also need to import the two blocks from Task 1 for this task to work. Namely,
and
.
Implement the algorithm above when depth = 1.
Test your code by learning on the following simple dataset with one feature, and making the tree classify two new examples, as shown below.
Hints/discussions:
Now let’s implement the tree with depth=2.
Hints/discussions:
Refining your implementation: If you tell the tree to learn from the same 3-example dataset, but with depth=3, what should happen? Do you actually get a tree with depth 3? Why or why not?
Test your implementation with the blocks below, with the 3-example dataset, and a 4-example dataset of 2 features.
Complete the implementation of block
. This block takes a list of training data and a maximum tree depth as input and should report the root node of a tree constructed according to the greedy algorithm above. Test it by executing this block
, and speed it up using “warp” once the code is correct.
When you have constructed the tree, use
to visualise your tree. Does this tree look reasonable? Does it conform to the meaning of the FNAC features, i.e., that higher feature values represent more malignant cell characteristics?
Use
to check your results. This block takes a decision tree as input and checks the classification of some input vectors against a reference implementation.
In this task we used two stopping criteria:
What other stopping criteria could be used? What impact do the stopping criteria have on the performance of our model on future, unknown input data?
In computer science, an optimal solution is a solution to a problem that is at least as good as any other solution.
The decision tree you have constructed in this tutorial classifies the training data perfectly. Is this a good thing? If our training set instead consisted of 1,000,000 samples, are we better off with a deep, complex decision tree that fits our training data perfectly or a simpler tree that makes some errors on the training set?
Creating incredibly complex models that match our training data perfectly is known as overfitting, and is something that we must actively avoid in machine learning. Remember that our goal in machine learning is to achieve good generalisation behaviour - we want our model to predict future, unknown inputs well, not just perform well on the training set (in other words, not just remember what it was taught).
We cannot judge generalisation performance using the same data set we used to train our model. For this we must use a separate set of labelled examples which we call a test set [1].
Complete the implementation of block
that takes a decision tree and a labelled data set and reports the fraction of misclassifications that the decision tree makes on the data.
For example,
should return 0.5, as the first example is correctly classified, while the second is wrongly classified.
Use a few more example trees and datasets to test this block.
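What such an error-rate block computes can be sketched in Python (illustrative; the tree representation and all names here are made up):

```python
# Illustrative sketch of the error-rate computation (names and tree
# representation are made up). A tree is ("leaf", label) or
# ("choice", feature_index, split, left, right); each labelled example
# is a feature list with the class label last.

def classify(node, features):
    if node[0] == "leaf":
        return node[1]
    _, f, t, left, right = node
    return classify(left, features) if features[f] <= t else classify(right, features)

def error_rate(tree, labelled_data):
    wrong = sum(1 for ex in labelled_data if classify(tree, ex[:-1]) != ex[-1])
    return wrong / len(labelled_data)

tree = ("choice", 0, 5, ("leaf", False), ("leaf", True))
print(error_rate(tree, [[3, False], [8, False]]))  # 0.5: second example misclassified
```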
Training-testing protocol.
Build a decision tree using the existing data and your block
.
Draw the tree on stage using
Measure training error. Use your block
to measure how many mistakes this tree has made on training data.
Download test data. Use the
block to download the breast-cancer-50.data data set. This data set contains 50 new labelled examples. All datasets in this lab reside in the same web directory; you can download them and assign them to a new variable like this:
Measure test error. Use your block
to measure how many mistakes this tree has made on test data.
In your report:
max-depth=5 using the original training data, what will the tree look like? Draw it in your report or describe it in words.

The effect of training data
Build two decision trees up to depth=4: one using the original training data, the other using breast-cancer-train2.data.
Test them on two datasets: breast-cancer-50.data as above, and data-50-2.data.
In your report:
So how do we generally ensure that the learned model is of good quality? One commonly used technique is cross-validation. Read about this technique in Section 1.3 of Bishop’s book [1] – this section is included in the preview here.
Note: TBD.