Decision Trees

We build a decision tree from "training data," and we measure the amount of information required in bits.

Suppose I toss a fair coin. It comes up either heads or tails. How many bits of information do I need to store how the coin came up? One bit, since one bit distinguishes two possibilities.

If I toss a fair four-sided (tetrahedral) die, how many bits of information do I need to store the result? Two bits: two bits have four combinations, so they can represent four possibilities.

Now, suppose I toss a three-sided die. How many bits of information do I need to store the result of the toss? You might think we need two, and in the discrete world of computers, you do need two. But two is overkill: two bits store four possibilities, and we only need three; one bit stores two possibilities, which is not enough. In a world where you must use a whole number of bits, two is the best we can do. In information theory, however, we have the notion of entropy, which measures information in amounts of bits that need not be whole numbers. The number of "bits" needed to store the result of a three-sided die falls somewhere between one and two. We can measure the "entropy" of any random variable.
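The coin, four-sided die, and three-sided die examples above can be checked with a minimal sketch of the Shannon entropy formula, H = -sum(p * log2(p)); the `entropy` function name here is just illustrative:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Fair coin: two equally likely outcomes -> 1 bit
coin = entropy([0.5, 0.5])

# Fair four-sided die: four equally likely outcomes -> 2 bits
d4 = entropy([0.25] * 4)

# Fair three-sided die: log2(3), roughly 1.585 bits,
# which falls between one and two as described above
d3 = entropy([1/3] * 3)

print(coin, d4, d3)
```

Running this prints 1.0 for the coin, 2.0 for the four-sided die, and about 1.585 for the three-sided die, confirming that a whole number of bits is not always the true information content.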