Lab 8: Learning to diagnose - Machine learning with decision trees


This lab practises the ideas of recursive computation with tree structures and core concepts of machine learning, by learning a decision tree for diagnosing breast cancer.

Submission instructions:

For this lab, you will need to submit 2 different files, as described below.

  • Task 1 - export an XML containing the Snap! blocks for the leaf node and choice node (no need to submit this separately).
  • Tasks 2 and 3 - submit an XML containing the leaf node and choice node from task 1, the “build a decision tree … ” block, and the “Test performance” block.
  • Task 3 - in addition, submit a short report (written electronically in an editor of your choice, and exported to PDF), with answers and analysis for questions 3.2, 3.3, and 3.4.

Data, data everywhere

In the 21st century, data is everywhere. Being able to make sense of large data sets and use them to make predictions is becoming increasingly relevant in many domains including scientific and medical research, government, and commerce.

Machine learning is the study of computer algorithms that learn from data. Commonly, we are given some training data containing a set of sample input and output pairs and want to use this to make predictions for future inputs whose true output value is unknown. Using a set of labelled training data to learn in this manner is known as supervised learning.

Supervised learning algorithms use training data to create a model which describes a probabilistic relationship between input and output data, and are then able to use this model to make predictions. The fundamental goal in supervised learning is to create models that generalise well. This means that they must make accurate predictions when given future, unknown inputs rather than simply modelling the training data perfectly [1].

In this tutorial you will investigate a popular model called the decision tree, and see how it can be applied as a diagnostic tool in medicine.

A data set for medical diagnosis

The training data set we will use in this tutorial is a subset of the Breast Cancer Wisconsin (Original) Data Set, courtesy of the UCI Machine Learning Repository [4]. Each training example contains a set of measurements, called features, and a corresponding classification as either malignant (true) or benign (false). The 9 features were obtained from Fine-Needle Aspiration Cytology (FNAC) measurements of human tissue. Each feature has been normalised to lie in the range 1-10, with high valued features thought to represent more malignant (cancerous) cell characteristics. All feature values are integers.

  #  Feature                       Domain
  -- -----------------------------------------
  1. Clump Thickness               1 - 10
  2. Uniformity of Cell Size       1 - 10
  3. Uniformity of Cell Shape      1 - 10
  4. Marginal Adhesion             1 - 10
  5. Single Epithelial Cell Size   1 - 10
  6. Bare Nuclei                   1 - 10
  7. Bland Chromatin               1 - 10
  8. Normal Nucleoli               1 - 10
  9. Mitoses                       1 - 10

In this tutorial we will represent a single training example as a list containing 10 elements - the 9 input features and the corresponding class. For example, training-example represents a training example whose class is benign. The training data set is then a list of training examples - a list of lists.

A test example, e.g. from a new patient, however, contains only the 9 input features. For example,
input-vector represents a test example whose class is unknown. The test data set is then a list of test examples - also a list of lists in Snap!. Your overall goal is to learn good decision rules from a training data set, so as to classify the test examples correctly.
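As an illustration, here is the same layout written out in Python (the feature values below are made up; the lab itself uses Snap! lists):

```python
# A training example is a list of 10 elements: 9 feature values (each 1-10)
# followed by the class label (True = malignant, False = benign).
training_example = [5, 1, 1, 1, 2, 1, 3, 1, 1, False]   # labelled benign

# A training data set is a list of such lists.
training_data = [
    [5, 1, 1, 1, 2, 1, 3, 1, 1, False],
    [8, 10, 10, 8, 7, 10, 9, 7, 1, True],
]

# A test example carries only the 9 features; its class is unknown.
test_example = [4, 2, 1, 1, 2, 1, 2, 1, 1]
```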

Task 1: Decision trees and classification (40%)

Trees are one of the most widely used data structures in computer science. They are used for searching, sorting, representing hierarchical relationships and even finding routes for data over the Internet. Recall that a tree consists of a single root node, a number of internal nodes and leaf nodes, and that these nodes are connected with edges.

A decision tree is a special type of tree used to classify input data into one of a number of classes. Each internal node (or decision node) of a decision tree contains a decision rule, and each leaf node contains a class label. Here is a simple decision tree for our FNAC domain with 4 decision nodes and 5 leaf nodes:

Example of a decision tree

This decision tree takes a set of features as input and produces a classification of either Benign or Cancer as output. To decide the class of a new input vector we start at the root node and continue down the tree, repeatedly applying decision rules. Each decision rule compares a particular input feature with a fixed threshold, and depending on the feature value we continue down either the left or right subtree. When we reach a leaf node we stop and this becomes the classification for our input vector.

The decision function at each decision node in our tree can therefore be described by two parameters: a feature number and a split point. The number in brackets next to the feature description at each decision node is the feature number. The number next to the edges leaving a decision node is the split point.

Discussion question

  • Under this decision tree, would the input vector input-vector be classified as Benign or Cancer?

Recursion with trees

A tree is an example of a recursive data type. The children of a node in a tree are themselves trees - large trees are created simply by joining smaller trees together. In the example above, the subtree rooted at the node labelled Clump thickness and the subtree rooted at the node labelled Bare nuclei are both perfectly valid decision trees in their own right. We were able to construct a larger decision tree by joining these two trees with a new choice node labelled Marginal adhesion.

Defining trees in this recursive manner allows us to write beautifully simple algorithms that achieve non-trivial results. For example to classify a new input vector, all a choice node needs to do is pass the input vector to its left or right child and ask it to perform the classification. In this task, you will add a classify method to both the choice nodes and leaf nodes and see how this works.
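To make this delegation concrete, here is a minimal Python sketch of the recursive classification. The dictionary-based node representation is our own illustration, not the lab's actual Snap! blocks:

```python
# A leaf is just a class label (True/False); a choice node is a dict holding
# a feature index, a split point, and left/right subtrees.

def classify(node, example):
    if isinstance(node, bool):      # leaf node: the base case of the recursion
        return node
    if example[node["feature"]] <= node["split"]:
        return classify(node["left"], example)    # delegate to the left child
    return classify(node["right"], example)       # delegate to the right child

# A tiny tree: split on feature 0 at value 4; small values are benign.
tree = {"feature": 0, "split": 4, "left": False, "right": True}
classify(tree, [2, 1, 1, 1, 1, 1, 1, 1, 1])   # → False
```

Note how a choice node never reports true or false directly; it only decides which child to ask.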

In this task the tree structure is fixed (later, you will see how the tree structure can be learned from training data).

Instructions

  1. The Snap! project link for this task is here.
  2. The code around the top of the scripting area will create a fixed decision tree and show it on the stage. init-tree

    Opening the block will show you its definition in Snap!, which corresponds to the shape drawn on stage. If this looks very busy, don’t worry - the rest of this task will break down and explain each component in turn. fixed_tree_task1

  3. We start by implementing the simplest decision tree of all - the leaf node. We also use this part to look at the concept of implementing a function as part of a list (i.e. a gray ring).

    decision-node

    The input to a leaf node is a test example; the output is a decision (true or false). There are two types of leaf nodes: one that always reports true (cancer) and one that always reports false (benign). Therefore the expected behaviour of a true leaf node, invoked with given input data, is shown below

    ask_leaf_true

    Therefore, the leaf node should always report a class label. This class label should also match the type of the leaf node (i.e. TRUE or FALSE in the angle slot).

    Look at the snap workspace to see how this is implemented.

  4. Your next task is to complete the choice node:

    choice-node

    Currently, it will also report Not implemented when asked to classify an input vector. Your goal is to edit the node type blocks so the decision tree reports either true or false for an input vector according to the structure of the decision tree. Similar to the leaf node, replace report-not-implemented with your code.

    Two utility blocks to help you visualize the process:

    • tell-artist-draw-tree draws a decision tree to the stage.
    • tell-artist-draw-classification-blank can be used to visualise how a given decision tree should classify a new input vector. For example, the visualization of a small decision tree tell-artist-draw-classification shows:

      Example of `draw_classification` command

    Some more hints:

    • Remember, you are implementing a recursive algorithm. Rather than report true or false directly, a decision node should delegate this decision to one of its children.
    • For leaf nodes, the implementation forms the base case for your recursion and should report either true or false.
    • Because you are changing the code that creates tree nodes, your changes will not impact trees that have already been built unless you tell Snap! to build them again. Make sure Snap! is calling choice-node and decision-node every time you test your code.

    The following three tests should help you troubleshoot and test your program as you write it:

    • smalltree-classify should reproduce the two-level tree drawn above, and report false .

    • tell-artist-print prints a message to the bottom of the stage. For example, tell-artist-print-false prints the classification result of the original tree above; it should say “Classification: false”.

    • On the other hand, a slightly different test example tell-artist-true should say “Classification: true”.

  5. When you think your implementation is working run the following test:

    • test-classification . This will test your implementation with a range of different inputs and report Pass or Fail.

Discussion

  • Execute the blocks at the end of the Snap! workspace: converse tree. Notice that this tree is the exact converse of the 3-level decision tree earlier – whenever the other tree says ‘cancer’, this tree will diagnose ‘benign’ … So which tree is correct? The next few tasks will explore this question.

Reading - Choosing a decision rule

Remember that our goal in decision tree learning is to build a decision tree automatically based on our training data. How can we do that?

Greed is good

In computer science, a greedy algorithm is an algorithm that finds a solution to a problem by making locally optimal decisions. Greedy algorithms make a series of choices that seem best at the time but depending on the problem may not produce a globally optimal solution. These algorithms are often useful when finding a globally optimal solution is not computationally feasible. In the next two tasks, we will use a greedy algorithm to build our decision tree, choosing decision functions at each node that appear to best split the data we have at the time.

We are going to start with all our training data, and build our tree by splitting our data set over and over until we reach some stopping criterion. The algorithm we will use to choose a feature and split point, given a set of training examples, is as follows:

Input: dataset with N features
Output: (feature, split point) pair

for each feature 1 to N:
    sort dataset by feature
    for each split point in sorted dataset:
        evaluate the gain of split
        if gain is the highest seen so far:
            remember gain, feature and split point
return highest gain, and the corresponding feature and split point

The gain of splitting on a given feature at a given split point is a measure of the quality of the split. At each stage, we want to choose the split with the highest gain, or equivalently the lowest cost. To define this cost concretely, we need to understand the concept of entropy.

Discussion questions
  • Why do we sort the data set before looping over possible split points?
  • Is finding the globally optimal split possible, and computationally feasible?

Information entropy

Entropy gives us a measure of the uncertainty associated with a probability distribution. Suppose we roll a die and wish to predict which outcome in the set $C={1,2,3,4,5,6}$ will occur next. If the die is fair, each outcome in $C$ has an equal probability of occurring ($p(x) = \frac{1}{6}$). Because of this, it is difficult to predict which number will come up next - the outcome is uncertain. We say that the distribution of outcomes has high entropy. Conversely, the distribution of an unfair die (that is more likely to roll some numbers than others) has lower entropy since there is less uncertainty associated with rolling it.

High entropy: a fair die

Low entropy: an unfair die

Given a random variable $X$ with probability mass function $p(x)$, the entropy $\mathbb{H}(X)$ is defined as:

$$ \mathbb{H}(X) = -\sum_x p(x)\log_2 p(x) $$

taking $p(x)\log_2 p(x)=0$ whenever $p(x)=0$ [3]. The more uniform the distribution $p(x)$, the higher $\mathbb{H}$ will be. $\mathbb{H}$ is at its highest when all outcomes are equally likely.
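The definition translates directly into code. Here is a quick Python check of the die example (the unfair probabilities below are made up for illustration):

```python
from math import log2

def entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x), taking 0 * log2(0) = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

fair = entropy([1/6] * 6)                              # maximal: log2(6) ≈ 2.585 bits
unfair = entropy([0.5, 0.2, 0.1, 0.1, 0.05, 0.05])     # lower than the fair die
```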

Cost function revisited

In this exercise, we define the cost of a split in terms of the entropy of the sublists that the split creates. If we consider the proportion of training examples from each of the two classes as an approximation of a probability distribution over the classes, then a set of training examples from mostly the same class has low entropy and a set with an equal number from each class has high entropy. We therefore define our cost function as:

cost of split = (proportion of examples in left sublist) * (entropy of left sublist) + 
                (proportion of examples in right sublist) * (entropy of right sublist)

Minimising this sum will result in both sublists having low entropy and therefore gives us a split with the best possible class separation [2].
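A minimal Python sketch of this cost, treating class labels as booleans (the helper names here are ours, not the lab's blocks):

```python
from math import log2

def class_entropy(labels):
    """Entropy of the empirical class distribution of a list of True/False labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)                 # proportion of True labels
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def cost_of_split(left_labels, right_labels):
    """Size-weighted sum of the two sublists' entropies."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * class_entropy(left_labels)
            + len(right_labels) / n * class_entropy(right_labels))

# A pure split costs 0; a maximally mixed split costs 1 bit.
cost_of_split([True, True], [False, False])     # → 0.0
cost_of_split([True, False], [True, False])     # → 1.0
```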

Discussion questions
  • Why do we weight each term by the relative size of the sublist?
  • What other metrics could we use to quantify the quality of a split?

The block choose-feature-split implements the feature selection function. The block takes a list of training data as input and should report the index of the optimum feature as well as the optimum split point for that feature. To report a pair of values in Snap!, report a list with two elements: report-feature-split .
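Putting the pieces together, here is a compact Python sketch of the whole greedy search; the function and variable names are illustrative only, and the split-point convention assumed here (report the left boundary value, with "<= split" going left) is one reasonable choice:

```python
from math import log2

def _entropy(labels):
    """Entropy of the empirical class distribution of boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def choose_feature_split(dataset, n_features=9):
    """Report a (feature index, split point) pair minimising the
    weighted-entropy cost, scanning split points in sorted order."""
    best_cost, best_pair = None, None
    for f in range(n_features):
        data = sorted(dataset, key=lambda ex: ex[f])
        labels = [ex[-1] for ex in data]
        n = len(data)
        for i in range(1, n):
            if data[i - 1][f] == data[i][f]:
                continue   # can only split between distinct feature values
            cost = (i / n) * _entropy(labels[:i]) + ((n - i) / n) * _entropy(labels[i:])
            if best_cost is None or cost < best_cost:
                best_cost, best_pair = cost, (f, data[i - 1][f])
    return best_pair

# One feature, three examples: the best split separates the classes exactly.
choose_feature_split([[2, True], [3, True], [6, False]], n_features=1)   # → (0, 3)
```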

Notes

We encourage you to look at the implementation of the feature split block choose-feature-split and note the following technical aspects.

  • Sorting is used so we can efficiently check all the possible split points for a given feature. The variable split is used to keep track of the possible split points when iterating over the sorted data.
  • The block sort-dataset-by-feature has been created to implement counting sort, since comparison sorting algorithms are incredibly slow in Snap!. Note that this block relies on the features being integers from 1 to 10 and will not work otherwise.
  • Even using counting sort this procedure can be quite slow to run, so it is useful to add status messages such as:

    The `print_status` command can be used for debugging

    to view progress of the algorithm.

Discussion question

  • Why does this block report the cost rather than the information gain as described in class? Do they lead to the same results?

Task 2: Building a decision tree (40%)

Now that we have a block to tell us the best feature and split point given a set of data, all that is left to do is to greedily build our decision tree. In this task, we will use a recursive algorithm to do this.

The tree building algorithm we will use is as follows [3]:

Input: training dataset, maximum tree depth
Output: root node of a decision tree constructed greedily using minimum entropy

if depth = 1 or entropy of list = 0:
    # Stopping criteria reached
    decision = most common class in dataset
    report a leaf node with this decision
else:
    j, t = choose a feature and split point for this dataset
    left sublist, right sublist = split dataset according to (j, t)
    left child = call this procedure recursively on the left sublist with depth - 1
    right child = call this procedure recursively on the right sublist with depth - 1
    return a decision node with the above split and children

Notice that this algorithm is recursive - it contains calls to itself.
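The algorithm can be sketched in Python as follows; we pass the split chooser in as a parameter and represent leaves as plain booleans, which are our own illustrative conventions rather than the lab's Snap! blocks:

```python
from math import log2

def entropy(labels):
    """Entropy of the empirical class distribution of boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def build_tree(dataset, depth, choose_split):
    """Greedy recursive builder; choose_split(dataset) reports a
    (feature index, split point) pair."""
    labels = [ex[-1] for ex in dataset]
    if depth == 1 or entropy(labels) == 0:
        # Stopping criterion reached: leaf with the most common class.
        return labels.count(True) > labels.count(False)
    f, t = choose_split(dataset)
    left = [ex for ex in dataset if ex[f] <= t]
    right = [ex for ex in dataset if ex[f] > t]
    return {"feature": f, "split": t,
            "left": build_tree(left, depth - 1, choose_split),
            "right": build_tree(right, depth - 1, choose_split)}

# One feature, three examples; a fixed chooser that splits feature 0 at value 3.
tree = build_tree([[2, True], [3, True], [6, False]], 2, lambda d: (0, 3))
```

With depth 2 this dataset is separated perfectly; with depth 1 the same call collapses to a single leaf carrying the majority class.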

Instructions

  1. Load Snap! project here.

    You will also need to import the two blocks from Task 1 for this task to work. Namely, decision-node and choice-node .

  2. Implement the algorithm above when depth = 1.

    • This block should report a decision tree of depth 1 (what kind of node is this?)

    Test your code by learning on the following simple dataset with one feature, and making the tree classify two new examples, as shown below. level1-tree

    Hints/discussions:

    • What should the ‘tree’ here look like? Can you express it using the leaf node block?
    • What does it do on the input with feature “2”? (true)
    • What does it do on the input with feature “6”? (true) What should it do? (false) – the next step will fix this.
  3. Now let’s implement the tree with depth=2. level2-tree

    Hints/discussions:

    • What should the ‘tree’ here look like? Can you express it using the leaf node and choice node blocks?
    • Does it now classify the inputs with feature “2” and “6” correctly?
  4. Refining your implementation: if you tell the tree to learn from the same 3-example dataset, but with depth=3, what should happen? Do you actually get a tree of depth 3? Why or why not?

    Test your implementation with the blocks below, with the 3-example dataset, and a 4-example dataset of 2 features.

    level3a-tree

    level3b-tree

  5. Complete the implementation of block build-decision-tree . This block takes a list of training data and a maximum tree depth as input and should report the root node of a tree constructed according to the greedy algorithm above. Test it by executing this block build-decision-tree, and speed it up using “warp” once the code is correct.

  6. When you have constructed the tree, use tell-artist-d to visualise your tree. Does this tree look reasonable? Does it conform to the meanings of the FNAC features, i.e., higher feature values represent more malignant cell characteristics?

  7. Use test-decision-tree to check your results. This block takes a decision tree as input and checks the classification of some input vectors against a reference implementation.

Extension questions

  • In this task we used two stopping criteria:

    • Maximum depth
    • Entropy equal to zero.

    What other stopping criteria could be used? What impact do the stopping criteria have on the performance of our model on future, unknown input data?

  • In computer science, an optimal solution is a solution to a problem that is at least as good as any other solution.

    • How would you define an optimal solution to the decision tree problem given a set of training data?
    • The problem of finding an optimal partitioning of the data is NP-complete [3]. Loosely speaking, this means we do not currently know of an efficient algorithm to solve the problem for large data sets and it is quite likely that there isn’t one. Can you describe a ‘brute force’ algorithm that reports the optimal solution given a set of training data? How do you think the running time of this algorithm would compare to that of the greedy algorithm you have implemented in this task?

Task 3: Analysing performance (20%)

The decision tree you have constructed in this tutorial classifies the training data perfectly. Is this a good thing? If our training set instead consisted of 1,000,000 samples, are we better off with a deep, complex decision tree that fits our training data perfectly or a simpler tree that makes some errors on the training set?

Creating incredibly complex models that match our training data perfectly is known as overfitting, and is something that we must actively avoid in machine learning. Remember that our goal in machine learning is to achieve good generalisation behaviour - we want our model to predict future, unknown inputs well, not just perform well on the training set (in other words, not just remember what it was taught).

We cannot judge generalisation performance using the same data set we used to train our model. For this we must use a separate set of labelled examples which we call a test set [1].

Exercises

  1. Complete the implementation of block test-performance that takes a decision tree and a labelled data set and reports the fraction of misclassifications that the decision tree makes on the data.

    For example, test-toy-example should return 0.5, as the first example is correctly classified, while the second is wrongly classified. Use a few more example trees and datasets to test this block.

  2. Training-testing protocol.

    Build a decision tree using the existing data and your block build-decision-tree .

    Draw the tree on stage using tell-artist-d

    Measure training error: use your block test-performance to measure how many mistakes this tree makes on the training data.

    Download test data: use the get-dataset-from-url block to download the breast-cancer-50.data data set. This data set contains 50 new labelled examples. All datasets in this lab reside in the same web directory; you can download them and assign them to a new variable like this: test data 50

    Measure test error: use your block test-performance to measure how many mistakes this tree makes on the test data.

    In your report:

    • Include a picture (Snap! stage) of the learned decision trees, at depths 4, 3, and 2, respectively.
    • Include a table containing the errors these trees made on the training and test data, respectively.
    • What are the training accuracies? Should they be 1.0?
    • Why are the test accuracies not 1.0? Should they be?
    • What seems to be the right depth for this decision tree, and why?
    • Given these results, if you were to train a tree of max-depth=5 using the original training data, what would the tree look like? – Draw it in your report or describe it in words.
    • Discuss (but do not implement) the following scenario: if you were to write a modified decision tree learning routine, in which each choice node can split on one feature into up to three branches (e.g. feature 1 <= 2, 2 < feature 1 <= 4, feature 1 > 4), what would need to change in our program? Would you get a different decision tree? Would this generate a fundamentally better tree than a binary tree?
  3. The effect of training data

    Build two decision trees up to depth=4: one using the original training data, the other using breast-cancer-train2.data.

    Test them on two datasets: breast-cancer-50.data as above, and data-50-2.data.

    In your report:

    • Include pictures of the two decision trees being compared here.
    • Include a small table of the testing accuracy of the two trees on the two datasets.
    • Which decision tree performs better on the two datasets? Can you offer some explanations as to why?

Discussion:

So how do we generally ensure that the learned model is of good quality? One commonly used technique is cross-validation. Read about this technique in Section 1.3 of Bishop’s book [1] – this section is included in the preview here.

References

  1. Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.
  2. Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
  3. Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
  4. Wolberg, W., Mangasarian, O.: Breast Cancer Wisconsin (Original) Data Set (1992), https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29 (accessed May 2014).

20 April, 2016