Decision Tree Pruning --> Prune at the leaf level. --> Prune those attributes that don't contribute much to the whole. --> Keep going backwards if need be. --> There are formulas to apply to determine whether an attribute is insignificant enough to prune.

"Noise" means essentially a bad or extremely unlikely piece of data.

ALWAYS build the complete tree first and then prune from the leaf level working back up. Don't prune it as you build it. Imagine a situation in which a test doesn't really distinguish much on its own, but a following test does.

Imagine a simple tree that should return YES if the string has an A and a B but not two of either one. The training data:

AB  BA  AA  BB

First test: Is the first letter an A?
    YES: Is the second letter an A?
        YES: No
        NO: Yes
    NO: Is the second letter an A?
        YES: Yes
        NO: No

Your gut impulse may be to eliminate that first test because it doesn't really decide much: it answers Yes half the time and No half the time, and each resulting group is about equally split between yeses and noes. That first test splits the training data into {AB, AA} and {BA, BB}, and each group contains one yes and one no. If I had half yeses and half noes going into the first test, and I still have half yeses and half noes after the first test, then the first test, on its own, didn't yield much information, and so a pruning algorithm would flag it as unnecessary. However, the first test in conjunction with the second test turns out to be very useful.

If you were pruning on the way down, you would be inclined to eliminate that first test. If we build the whole tree first and then prune going back up, we aren't going to eliminate the second test, because it gives a definitive Yes/No answer; it gives us a lot of information. So, since the second test can't be eliminated, the first test can't be either, since it's never going to be at the leaf level.

TRAINING A DECISION TREE / UPDATING A DECISION TREE

Suppose you have a large amount of test data for the tree.
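The "formula" alluded to above is typically information gain, i.e. the reduction in entropy a test produces. Here is a minimal sketch (the helper names are my own, not from the notes) showing that the first test in the AB/BA example yields exactly zero gain on its own, which is why a prune-as-you-go strategy would wrongly discard it:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def info_gain(examples, test):
    """Entropy before the test minus the weighted entropy after it."""
    before = entropy([label for _, label in examples])
    yes = [label for s, label in examples if test(s)]
    no = [label for s, label in examples if not test(s)]
    after = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(examples)
    return before - after

# The four training strings: Yes iff exactly one A and one B.
data = [("AB", True), ("BA", True), ("AA", False), ("BB", False)]

first_is_a = lambda s: s[0] == "A"   # the "useless-looking" first test
print(info_gain(data, first_is_a))   # 0.0 -- no information on its own
```

Each branch after the first test still holds one yes and one no, so the weighted entropy is unchanged and the gain is zero, even though the second test then separates those branches perfectly.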
Why might you not want to use all of the test data to build your tree? If I use too much training data, I get a very large, complicated decision tree. Sure, it's 100% accurate on the training data (which is all of the test data). But because it's complicated, it's going to be slower, and it's going to be overspecific. Weird anomalies in the training data will be accommodated by specific branches in the tree that go against the normal trend and may not be important or accurate as more data emerges. In other words, if that test data contains "noise" or bad data, then the tree will give the wrong answer for similar forms of bad data that it encounters later on.

However, what we can do is build a tree on a subset of the data, say 10%, and test it against the other 90% to see how accurate it is. Then we can build another tree based on another 10%, and again test it on the other 90% to see how accurate that one is. After a few of these rounds, we go with the most accurate tree. It should be a simpler tree (and therefore faster), with fewer anomalous branches that could give wrong answers on the real data.

NOW, suppose you want the tree to learn as it's going, to become more accurate down the line. Every so often, you add new training data to make adjustments to the tree. Why is this a problem? It is a problem because our tree algorithm doesn't allow for "adjustments." You have to build a whole new tree from scratch. You add new training data, and hey, maybe your tree gets a new root.

Other issues with decision trees:

Missing or new attribute values. I have programmed my decision tree to allow for French, Italian, Thai, and burger as Types of Restaurant. At some point, some training data comes in and the type of restaurant is Mexican. No one told me there was going to be Mexican, and there it is. (This is a problem with having an attribute be non-binary.)
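The "train on 10%, score on the other 90%, repeat, keep the best" procedure above is a form of repeated holdout selection. A quick sketch of that loop, with toy stand-ins for the tree builder and scorer (the notes don't specify these, so `build` here just learns the majority label):

```python
import random

def best_model(data, build, accuracy, rounds=5, frac=0.10, seed=0):
    """Repeated holdout: train on a small random subset, score on the
    rest, and keep whichever model scores best on its held-out 90%."""
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(rounds):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * frac))
        train, test = shuffled[:cut], shuffled[cut:]
        model = build(train)           # e.g. grow a decision tree
        score = accuracy(model, test)  # fraction correct on held-out data
        if score > best_score:
            best, best_score = model, score
    return best, best_score

# Toy stand-ins: "build" learns the majority label, "accuracy" checks it.
def build(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def accuracy(model, test):
    return sum(1 for _, y in test if y == model) / len(test)

data = [(i, i % 3 != 0) for i in range(100)]  # mostly-True toy labels
model, score = best_model(data, build, accuracy)
```

The same loop works unchanged if `build` is a real tree-induction routine; only the two stand-in functions are hypothetical.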
Continuous attributes: Suppose I have "distance from my house" as an attribute (3.5 miles, 9.6 miles, whatever), because this would make a difference in whether I want to stay if I have to wait for a table. However, since this attribute has a potentially infinite number of values, it would have an infinite number of children in my decision tree. The obvious way to address this is to replace it with an attribute with a finite set of values:

x < 10, 10 <= x < 20, 20 <= x < 30, x >= 30

Non-continuous attributes, but very large: ZIP codes are an example of this. There are a lot of ZIP codes, but breaking them into ranges as if they were continuous is probably not useful; ZIP codes from 10000 to 19999 may not be a particularly meaningful group. Some analysis is going to be required to decide how to separate them, and perhaps grouping around certain discrete examples might be the best way to do this.

Anyway, decision trees aren't necessarily the best way to handle machine learning / artificial intelligence.

DEEP LEARNING aka Neural Networks

It's called Deep Learning because it's layered: you make one set of decisions, go into the next layer and make another set of decisions, and so forth. With neural networks, numeric data is passed to a node from several other nodes; the node computes a mathematical function of those inputs and passes the result on to one or possibly several other nodes. The "learning" happens by modifying the functions, or even sometimes the connections, of and between these nodes.

A classic (and very simple) example of this would be, say, a Boolean circuit:

X-----------v
Y---------->OR----v
Z---------------->AND---> ____

Computing (X || Y) && Z. If X is 1, Y is 0, and Z is 1, then X||Y will be 1. Since Z is also 1, (X||Y) && Z is also 1.

Imagine an OR-gate computes a mathematical function: X + Y - XY. And an AND-gate computes XY.
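Those arithmetic gates can be wired together directly. A quick sketch checking that the arithmetic versions agree with the Boolean circuit on every 0/1 input:

```python
from itertools import product

def or_gate(x, y):
    return x + y - x * y   # X + Y - XY, equals OR on 0/1 inputs

def and_gate(x, y):
    return x * y           # XY, equals AND on 0/1 inputs

def circuit(x, y, z):
    """The two-layer circuit (X || Y) && Z from the diagram."""
    return and_gate(or_gate(x, y), z)

# Agrees with the Boolean circuit on all eight 0/1 inputs:
for x, y, z in product((0, 1), repeat=3):
    assert circuit(x, y, z) == int((x or y) and z)

print(circuit(1, 0, 1))  # 1, matching the worked example above
```

Expressing the gates as ordinary arithmetic is what makes the "learning by modifying the functions" idea possible: unlike a hard Boolean gate, an arithmetic node can be adjusted continuously.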