DECISION TREES

What we do is find the attribute that is the most important, meaning that splitting on this attribute gives us the most information gain of all the attributes. This becomes the root. So, in our restaurant example, our most important attribute was how full the restaurant was. In the training data, if there were no patrons, the answer was always NO. If there were some patrons, the answer was always YES, and if the restaurant was full, then the answer was sometimes yes and sometimes no. So, we have to use recursion to continue building this tree.

The root is the "Patrons" attribute, with three children, corresponding to "None", "Some", and "Full". Since None always gives no and Some always gives yes, these are leaves of the tree: they do not have children, and we don't worry about them any more. Since Full still requires other tests to determine the answer, what we do is take the subset of the data for which the Patrons attribute is Full and try again (we no longer consider the Patrons attribute). We once again go through the data to find the attribute that is most important, and this time it turns out the attribute for whether we are hungry is the most important, and we do the same thing. If we are not hungry, the answer is always no, and if we are hungry, then it could be yes or no, so we continue.

The tree generated from the training data works for every single piece of training data. This does not mean it always gives the right answer for every possible test case. One problem: what if a real case falls off the tree without reaching a YES or NO? This means that a certain combination of attributes does not exist in the training data, and so wasn't given a branch on the tree, but this combination does exist in the real data. The way to fix this is that when you build the decision tree, you simply don't leave blank paths.
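The "most important attribute" test above can be sketched in Python. This is a minimal illustration of entropy and information gain, not the full tree-building algorithm; the function and variable names (and the toy six-row dataset) are my own, chosen to echo the restaurant example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of yes/no labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute, labels):
    """Reduction in entropy from splitting `examples` on `attribute`.

    `examples` is a list of dicts mapping attribute names to values;
    `labels` is the parallel list of "yes"/"no" answers.
    """
    total = len(examples)
    # Group the labels by the value this attribute takes.
    groups = {}
    for ex, lab in zip(examples, labels):
        groups.setdefault(ex[attribute], []).append(lab)
    # Expected entropy remaining after the split.
    remainder = sum((len(subset) / total) * entropy(subset)
                    for subset in groups.values())
    return entropy(labels) - remainder

# Toy data: None is always no, Some is always yes, Full is mixed,
# so splitting on Patrons removes a lot of uncertainty.
examples = [
    {"Patrons": "None"}, {"Patrons": "Some"}, {"Patrons": "Some"},
    {"Patrons": "Full"}, {"Patrons": "Full"}, {"Patrons": "Full"},
]
labels = ["no", "yes", "yes", "yes", "no", "no"]
print(round(information_gain(examples, "Patrons", labels), 3))  # → 0.541
```

The attribute with the largest gain becomes the root; the same computation is then repeated recursively on each non-pure subset.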
If you find a branch of an important attribute that has no matching training cases, you simply give that leaf a value (yes or no) based on the majority rule over the training cases being considered at the parent. If I have six training cases when I'm evaluating Patrons, and three contain Some and three contain Full, then when I'm making the None branch I still have to give it a value even though there are no training cases at that point. So, I take a look at the six cases at the Patrons attribute, see how many have yeses and how many have noes, and just go with majority rule.

Sometimes the tree can be too large and you want to shrink it a bit. (Shrinking the tree will result in the tree no longer being 100% accurate on the training data, but you can move through a smaller tree faster, and the tradeoff might be worth it.) Shrinking a tree is called "pruning". Unfortunately, pruning a decision tree happens at the leaf level: you remove one attribute from the very bottom and replace it with a simple yes or no. Maybe after pruning a number of these leaf-level attributes, you now have enough information to prune an attribute one level higher.

The idea is that you prune attributes which, in the long run, don't really give a lot of useful information anyway. Let's say by the time we get to an attribute, we have y yeses and n noes in our training data at this point. If that attribute splits the training data into classes based on its values, and those smaller classes also have their yeses and noes in roughly the same ratio (y/(y+n) yeses and n/(y+n) noes), then that attribute didn't help us much after all, and we may be able to prune it from the tree.
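Both ideas above, the majority-rule fallback for empty branches and the "did this split change the yes/no ratio?" pruning test, are easy to sketch. This is a hedged illustration: the function names and the `tolerance` threshold are my own assumptions, not something the notes specify.

```python
from collections import Counter

def majority_value(labels):
    """Majority yes/no among the parent's training cases; used to
    label a branch that matched no training cases at all."""
    return Counter(labels).most_common(1)[0][0]

def looks_prunable(parent_labels, child_label_sets, tolerance=0.05):
    """Heuristic from the notes: if every child's yes ratio is close
    to the parent's y/(y+n), the split told us little and the node is
    a pruning candidate. `tolerance` is an assumed cutoff."""
    parent_ratio = parent_labels.count("yes") / len(parent_labels)
    return all(
        abs(child.count("yes") / len(child) - parent_ratio) <= tolerance
        for child in child_label_sets
        if child  # skip empty branches; they get a majority-rule leaf
    )

# Parent node saw 4 yeses and 2 noes, so an empty branch becomes
# a leaf predicting "yes".
parent_labels = ["yes", "yes", "no", "yes", "no", "yes"]
print(majority_value(parent_labels))  # → yes

# A split whose children mirror the parent's 50/50 ratio is prunable;
# a split that cleanly separates yes from no is not.
print(looks_prunable(["yes", "no", "yes", "no"],
                     [["yes", "no"], ["yes", "no"]]))   # → True
print(looks_prunable(["yes", "no", "yes", "no"],
                     [["yes", "yes"], ["no", "no"]]))   # → False
```

Note the connection between the two paragraphs: a split that passes `looks_prunable` has near-zero information gain, so pruning it collapses the subtree to exactly the majority-rule leaf.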