Question/Max Score I (20) II (25) III (25) IV (20) V (20) Total Actual Score Take Home Examination. This is an open book, open notes examination. The examination is due at 12 noon on Monday, December 17, 2007. Submit your exam to the instructor by email or submit to the digital drop box on Oak. In either case, please send a separate email to your instructor as soon as you have submitted your exam.
Late submissions will be penalized at the rate of 5 points off for every minute the submission is late. No excuses will be entertained.
Be sure to answer questions clearly. Explicitly state all your assumptions, make sure you have clearly outlined the problem you are solving, and be sure to explain and justify all of the steps in your solution process. We reserve the right to not grade answers that do not address the question asked in the exam. On the other hand, don't get carried away. A long drawn out answer is often not the best one. Avoid showing a lot of repetitive steps. I am more interested in your approach to solving the problem, your justification for the steps you choose, and a good explanation for what the result is.
Your answers to the exam questions must be your own work. You cannot discuss the exam with other students, colleagues, or faculty at the university or elsewhere. You may only consult the instructor or the TA for clarifications and help on the exam. If you use other sources, such as papers and material on the web to derive your answer, please be sure to acknowledge them in a proper way. Do not attempt to download answers to questions directly from the web. It is very important that the work you submit be based entirely on your own efforts.
The honor pledge below must accompany all examination submissions. Otherwise the examination will not be graded.
I pledge my honor that I have neither given nor received aid on this work.
Convert the following sentences for FOL expressions.
If a program cannot be told a fact, then it cannot learn that fact.
∀ p, f ¬∃ x
program(p) ∧ fact(f) ∧
tells(x, p, f) ⇒
¬learns(p, f)
All blocks on top of blocks that have been moved or that are attached to blocks that have been moved have also been moved.
∀ x, y block(x) ∧ block(y) ∧
(on(x, y) ∨ attached(x, y)) ∧
moved(y) ⇒ moved(x)
Every dog who loves one of its brothers is happy.
∀ x ∃ y dog(x) ∧ brother(y, x) ∧
loves(x, y) ⇒ happy(x)
Consider the following story:
Tony, Mike, and John belong to the Alpine Club. Every member of the Alpine Club is either a skier or a mountain climber or both. No mountain climber likes rain, and all skiers like snow. Mike dislikes whatever Tony likes, and likes whatever Tony dislikes. Tony likes rain and snow. Is there a member of the Alpine Club who is a mountain climber but not a skier? Who is it?
Formulate as predicate calculus expressions the facts given.
Tony, Mike, and John belong to the Alpine Club.
member('Tony', 'Alpine Club') ∧
member('Mike', 'Alpine Club') ∧
member('John', 'Alpine Club')
Every member of the Alpine Club is either a skier or a mountain climber or both.
∀ x member(x, 'Alpine Club') ⇒
skier(x) ∨ climber(x)
No mountain climber likes rain.
∀ x climber(x) ⇒ ¬likes(x, 'rain')
All skiers like snow
∀ x skier(x) ⇒ likes(x, 'snow')
Mike dislikes whatever Tony likes.
∀ x likes('Tony', x) ⇒ ¬likes('Mike', x)
Mike likes whatever Tony dislikes.
∀ x ¬likes('Tony', x) ⇒ likes('Mike', x)
Tony likes rain and snow.
likes('Tony', 'rain') ∧ likes('Tony', 'snow')
Use resolution refutation with answer-extraction to find the answer. Be sure to show all of the steps in the derivation process, starting from converting the English sentences to FOL expressions, generating clauses, and so on till you have the complete proof tree.
First, the previous statements need to be transformed into conjunctive normal form:
member('Tony', 'Alpine Club') ∧
member('Mike', 'Alpine Club') ∧
member('John', 'Alpine Club')
∀ x member(x, 'Alpine Club') ⇒
skier(x) ∨ climber(x)
member(x, 'Alpine Club') ∨
skier(x) ∨ climber(x)
∀ x climber(x) ⇒ ¬likes(x, 'rain')
climber(x) ∨ ¬likes(x, 'rain')∀ x skier(x) ⇒ likes(x, 'snow')
skier(x) ∨ likes(x, 'snow')∀ x likes('Tony', x) ⇒ ¬likes('Mike', x)
likes('Mike', x) ∨ ¬likes('Tony', x)∀ x ¬likes('Tony', x) ⇒ likes('Mike', x)
likes('Tony', x) ∨ likes('Mike', x)likes('Tony', x) ∨ likes('Mike', x)likes('Tony', 'rain') ∧ likes('Tony', 'snow')
To answer the question: "Is there a member of the Alpine Club who is a mountain climber but not a skier?", first convert the statement to predicate calculus.
There is a member of the Alpine Club who is a mountain climber but not a skier.
member(x, 'Alpine Club') ∧
climber(x) ∧ ¬skier(x)
Using resolution refutation then, that statement is negated and combined with existing predicates to attempt to reduce to a null set which shows a contradiction (and a verification of the statement). This particular resolution is done in a narrow tree where one predicate is added per level.
¬(member(p, 'Alpine Club') ∧
climber(p) ∧ ¬skier(p))
|
¬member(p, 'Alpine Club') ∨
¬climber(p) ∨ skier(p)
|
¬member(x, 'Alpine Club') ∨
skier(x) ∨ climber(x) ⇒
|
¬member(p, 'Alpine Club') ∨
skier(p)
|
member('Mike', 'Alpine Club') {p⇒'Mike'} ⇒
|
skier('Mike')
|
¬skier(x) ∨ likes(x, 'snow') ⇒
|
likes('Mike', 'snow')
|
¬likes('Mike', x) ∨ ¬likes('Tony', x) ⇒
|
¬likes('Tony', 'snow')
|
likes('Tony', 'snow') ⇒
|
| ∅ |
Consider the following FOL expressions:
Can a forward chaining algorithm infer Q(a)? If so, show how, if not why not?
In forward chaining, all fully known implications are expanded. The only fully known implication is Q(x) ⇒ P(F(x)) for which {x ⇒ a}. That expansion is:
Q(b) ⇒ P(F(b)) ⇒ P(b)
No implications involving a are fully known, so forward chaining will never generate the query, Q(a), and find a solution.
Backward chaining will return true if we ask query P(b)? Explain why or why not.
In backward chaining, implications that could generate a predicate are generated until they are a set of known predicates. This example only has one implication that will generate P(F(x)) ⇒ P(x). The only expansion possible is:
P(b) ⇐ P(F(b)) ⇐ Q(b)
Q(b) is in the knowledge base, so the algorithm succeeds.
Is this true or false — "If forward chaining has failed to infer a given literal, then it is not entailed by the KB." Explain.
Forward chaining will eventually generate all possible predicates. If it fails then it is not possible to arrive at the query from the given facts.
Is this true or false — "If backward chaining does not return a true given a query literal, then it is not entailed by the KB." Explain.
Backward chaining will examine all inferences related to a given set of facts. Eventually it will explore all possible inference paths and if there is a path, it will find it.
Represent the following sentences by using and extending the (knowledge) representation schemes defined in Chapter 10 of Russell and Norvig.
Water is a liquid between 0 and 100 degrees Celsius.
w ∈ Water ∧ t ∈ Temperature ∧
Centigrade(t) ∧ TemperatureOf(t, w) ⇒
(Liquid(w) ⇔ >(t, Centigrade(0)) ∧
<(t, Centigrade(100)))
The water in John's bottle is frozen.
w ∈ Water ∧ b ∈ Bottle ∧
Owns('John', b) ∧ In(w, b) ⇒
Frozen(w)
Suppose you have a STRIPS representation for operators a1 and a2, and you want to define the STRIPS representation for the composite action a1:a2, which means you first do a1 and then do a2.
What is the add list for the composite action?
If a1 and as are execute sequentially, the operations would be:
MeetPre(a1),
Del(a1), Add(a1),
MeetPre(a2),
Del(a2), Add(a2)
The elements added by this process will be:
Add(a1:a2) =
(Add(a1) - Del(a2)) +
Add(a2)
In set notation, this would be expressed as:
Add(a1:a2) =
(Add(a1) ⋂ ¬Del(a2)) ⋃
Add(a2)
What is the delete list?
Del(a1:a2) =
Del(a1) + Del(a2) =
Del(a1) ⋃ Del(a2)
What are the preconditions for the composite action?
The composite of the two actions is only possible if a precondition for a2 is not removed by a1:
Pre(a2) ⋂
(Del(a1) ⋂ ¬Add(a1)) = ∅
This specification permits an element to be in both the Add and Del lists for a1. This shouldn't generally happen, but could.
Assuming that the above precondition is met:
Pre(a1:a2) =
Pre(a1) + (Pre(a2) - Add(a1)) =
Pre(a1) ⋃
(Pre(a2) ⋂ ¬Add(a1))
Consider the delivery robot domain.
The relations that are true of situations or states are as follows:
at(Obj,Loc): Object Obj is at location Loccarrying(Ag,Obj): Agent Ag is carrying Objunlocked(Door): if Door is unlockedlocked(Door): if Door is lockedopens(Key,Door): if Key can open Dooradjacent(Pos1,Pos2): position Pos1 is adjacent to position Pos2 and the robot can move to it in one stepbetween(Door,Pos1,Pos2): if Door is between position Pos1 and position Pos2. If Door is unlocked, the two positions are adjacent. between is assumed to be static, i.e., you cannot create new doors.Define the following actions as STRIPS operators (precondition, add, and delete lists) in this domain.
move(Ag,From,To):
Precon(at(Ag,From), adjacent(From,To))Del(at(Ag,From))Add(at(Ag,To)))pickup(Ag,Obj):
Precon(at(Ag,x), at(Obj,x))Del(at(Obj,x))Add(carrying(Ag,Obj)))putdown(Ag,Obj):
Precon(at(Ag,x), carrying(Ag,Obj))Del(carrying(Ag,Obj))Add(at(Obj,x)))unlock(Ag,Door):
Precon(at(Ag,x), at(Door,x),
opens(k,Door), carrying(Ag,k),
locked(Door), between(Door,Pos1,Pos2))
Del(locked(Door))Add(adjacent(Pos1,Pos2), adjacent(Pos2,Pos1),
unlocked(Door)))
lock(Ag,Door):
Precon(at(Ag,x), at(Door,x),
opens(k,Door), carrying(Ag,k),
unlocked(Door), between(Door,Pos1,Pos2))
Del(unlocked(Door),
adjacent(Pos1,Pos2), adjacent(Pos2,Pos1))
Add(locked(Door)))Give the STRIPS representation for the composite action fetch made up of the three primitive actions:
move(Ag,Pos1,Pos2)pickup(Ag,Obj)move(Ag,Pos2,Pos1)fetch(Obj,ObjPos):
Precon(at(Ag,Pos), at(Obj,ObjPos),
adjacent(Pos,ObjPos)
Del(at(Obj,ObjPos))Add(carrying(Ag,Obj)))Consider the initial state of the system depicted in the figure below.
This can be described by: (though from the figure it appears that the door isn't locked)
at(rob, o109)at(parcel, storage)at(k1, mail)locked(door1)We also know the following static facts about the domain:
between(door1, o103, lab2)opens(k1, door1)We also know rob is an agent, and parcel and mail are objects. In addition, the figure gives us a number of relations, such as:
adjacent(lab2, o109)adjacent(r101, r103)between(Door, P1, P2) ∧ unlocked(Door) ⇒
adjacent(P1, P2)
adjacent(mail, o103)adjacent(mail, stairs)adjacent(mail, r131)Given the goal condition: at(rob, lab2), carrying(rob, parcel); apply regression planning using the STRIPS operators to generate a complete plan. You don't need to show me all of the gory details of each step of the plan (because it is going to be very repetitive), but highlight the important steps and the major issues in generating the plan and the plan that is generated.
The diagram is somewhat confusing because it is not clear of all positions are present or not. There are positions outside of rooms 103, 109 and 111 ('o109' and 'o111'), but there's no position outside the rest of the rooms. The closest position outside of room 115 is 'o109', however it doesn't really make sense to say that 'o109' is adjacent to 'r115'. The closest place to room 115 is room 113 and from the problem specification, it seems there should be a predicate adjacent(r113,r115). This would not allow both the two doors that separate those two spaces to be locked. Unlocking one door or the other adds predicates saying the two spaces are adjacent. Two adjacent spaces can be moved between, and that isn't true if there's still a locked door in the way. A more reasonable representation would be to define a 'oxyz' which is the area immediately outside the door to room xyz.
That being said, increasing the number of states will significantly increase the depth to which a planner will have to search since each position can be reached from several directions (representing a branch in the search tree). For this reason, the only positions will be those on the map and doors without labels are not considered and assumed to be open.
[Extra Credit] If you wanted to avoid this very repetitive one-step move operator, and define a more composite MOVE operator, i.e., MOVE(Ag, From, To), can you come up with a recursive definition for this composite operator using the primitive move operator? Discuss.
This question requires something of an alteration of the basic STRIPS operator. Normally the precondition, add and delete lists are a set of facts that are present in the knowledge base. There is no concept of performing an action from within another. STRIPS would easily allow the specification of a multiple step move, for example the composite action:
move_2(Ag, From, To):
Precon(at(Ag,From), adjacent(From,x)
adjacent(x,To))Del(at(Ag,From))Add(at(Ag,To)))To allow a recursive MOVE, there are two questions to be answered:
The recursion has to happen in the preconditions. If this was not the case then MOVE could be called with nonexistent positions and it would execute.
The second issue is more difficult. STRIPS only allows conjunctions or operators. This means that the MOVE operator would always be recursively executed and the recursion would be infinite. There are a couple options that would permit ending recursion, but one of the more useful additions would be short-circuit disjunctions. This concept exists in many programming languages, and it is simply that in an ∨ed expression, the first element to evaluate to true stops evaluation of the expression. If these existed for STRIPS, MOVE would be:
MOVE(Ag, From, To):
Precon(at(Ag,From) ∧
(adjacent(From,To) ∨
(adjacent(From,x) ∧ MOVE(Ag, x,To)))
Del(at(Ag,From))Add(at(Ag,To)))Consider the game of peg solitaire, as shown in the figure below (see en.wikipedia.org/wiki/Peg_solitaire) . The goal is to remove all of the pieces but one, which must be left in the center hole C. A piece is removed by hopping over an adjacent piece into an empty adjacent hole. For example, piece A can hop over piece B into hole C, removing piece B from the board.
You have to formulate peg solitaire as a partial-order planning problem using STRIPS operators.
Describe the start and finish steps for this problem (there is no need to list EVERY literal one by one, but just say what literals are needed.) You will need to choose a vocabulary. You may assume that the predicate Line(u,v,w) (u,v, and w are consecutive locations along a line) is available and that the constant C refers to the center location.
It is assumed all jumps are either horizontal or vertical. There are no diagonal jumps (or lines). The peg positions are assigned identifiers based on their position if they were embedded in a 7×7 grid:
The knowledge base then contains line predicates:
Line(13, 14, 15)Line(23, 24, 25)Line(13, 23, 33)Line(31, 32, 33)Line(32, 33, 34)Line(33, 34, 35)Lines are defined left to right. Either an inverted predicate could be defined for each predicate or an additional action that permits reversing could be defined:
reverse_order(x,z):
Precon(Line(x,y,z))
Del()
Add(Line(z,y,x)))
There are also predicates about the presence of pegs:
Peg(13)Peg(14)There is one empty space at C which is 44 in the numbering scheme defined above.
Empty(44)Now write the STRIPS operator for making a move. Be sure to strictly observe the syntactic restrictions on STRIPS operators.
jump(x,z):
Precon(Peg(x), Empty(z),
Line(x,y,z), Peg(y))
Del(Peg(x), Peg(y),
Empty(z))
Add(Empty(x), Empty(y),
Peg(z)))
Describe how a clobbering conflict can occur in the course of partial-order planning with this particular formulation or explain why one cannot occur.
A clobbering conflict can occur. There are multiple ways to remove a particular peg. One plan might plan to remove it horizontally and the other vertically. Once a peg is removed it prevents the jump in the other plan. Similarly, any series of jumps begins with a single peg. If that peg is removed then the series is impossible. Finally, a series of jumps relies on positions being empty if another plan places a peg in a position a plan was going to jump into then that plan is no longer possible.
Consider the Bayesian network for car diagnosis as shown below.
Whether the engine starts doesn't just depend on the spark plug and petrol; it also depends on the starter motor is working. The starter motor can get frozen in icy weather, and icy weather can also increase the likelihood of the battery dying. Extend the network with the boolean variables IcyWeather and StarterMotorFrozen.
Give reasonable conditional probability tables for all the nodes. [Here are some values to help you along:
| P(Icy Weather) |
|---|
| 0.05 |
| P(Petrol) |
|---|
| 0.995 |
| Icy Weather | P(Battery | Icy Weather) |
|---|---|
| T | 0.8 |
| F | 0.95 |
| Battery | P(Radio | Battery) |
|---|---|
| T | 0.995 |
| F | 0.05 |
| Battery | P(Spark Plug | Battery) |
|---|---|
| T | 0.9 |
| F | 0.001 |
| Icy Weather | P(Starter Frozen | Icy Weather) |
|---|---|
| T | 0.5 |
| F | 0.001 |
Note that unlike the other properties, the starter being frozen reduces the chances of the engine starting.
| Spark Plug | Petrol | Starter Frozen | P(Engine Starts | Spark Plug, Petrol, Starter Frozen) |
|---|---|---|---|
| T | T | T | 0.08 |
| T | T | F | 0.995 |
| T | F | T | 10-7 |
| T | F | F | 10-5 |
| F | T | T | 10-6 |
| F | T | F | 10-4 |
| F | F | T | 10-12 |
| F | F | F | 10-10 |
| Engine Starts | P(Car Moves | Engine Starts) |
|---|---|
| T | 0.999 |
| F | 0.005 |
How many independent values are contained in the joint probability distribution for eight boolean nodes, assuming that no conditional independence relations are known to hold among them?
For independent probabilities:
Each Xi requires a table of two values to capture its information. There are eight values, so the number of independent values considered in the joint probability is 2n = 16.
How many independent probability values do your network tables contain?
For describing the probability as an equation, the following abbreviations are used:
Given the Bayes net above:
The number of values required for each of these distributions is:
Write out the marginal distribution for P(Car Moves, Radio, Battery)
The notation used is capitol variables (W) represent distributions and lower case variables (w) represent the members of the distribution:
The concept of summation then is:
Given the Bayes net above:
Apply the variable elimination algorithm to the query P(Car Moves | Icy Weather, Radio). Show the entire computation process, though you do not have to do the actual arithmetic calculations.
Given the Bayes net above:
The reductions are made using a concept called factors. Factors are generated starting from the left of the equation and generating going right. The first factor is: (note that the syntax used here for factors differs slightly from (and is less ambiguous than) Russel and Norvig)
The next factor is more complex. For each axis along which a conditional probability varies, a new dimension is added to the matrix. So, the next factor 2×2×2×2 rather than 2×2:
The factors then are combined into factors using a process called pointwise multiplication. In pointwise multiplication, factor elements are joined by variables they have in common. The number of dimensions in the result are the number of unique factors. (note that elements have been written lenghtwise to get them to fit)
This rapidly growing factor is then reduced in size using summation. In this case, it is a summation over E. (Which is really long and I ran out of time.)
An agent lives in a 3×3 grid world surrounded by walls. (1,1) is the bottom left and (3,1) is the bottom right. The agent's action A can be left, up, right, and down. The precept is an integer N indicating the number of adjacent walls. The agent's state is just its (X, Y) position and the prior over X and Y is uniform. At each step the agent receives a precept, and chooses its action.
Below are the first three time slices of a dynamic Bayes net. Add links to make the best possible representations of the problem. (You need not add parents for A because the agent chooses its value.)
Suppose the wall sensor is accurate and the agent's movement is deterministic. Given N1 = 1, A1 = right, N2 = 2, where could the agent be at time 2?
Something unspecified in the problem is the effect of a robot attempting to drive through a wall. For example if a robot is at (1,1) and decides to move up. An interesting effect would be if it would rotate 90° to either side with some probability (and change the meanings of up, right, etc.). That would be significantly harder to model though, so it is assumed if the robot attempts to drive through a wall it simply doesn't move.
Since the robot is in a 3×3 grid, there are three possibilities for N:
Moving right means going from (x,y) to (x + 1,y).
This leaves the robot at either (2,1) or (2,3).
Suppose the wall sensor detects each wall with independent probability p < 1, but detects the absence of a wall correctly. Given N1 = 1, A1 = right, N2 = 1, where could the agent not have been at time 1?
That the sensor correctly determines the absence of a wall perfectly but may have an error detecting a wall means a wall may be missed, but there are no phantom walls. The effect is the detected value of N is a lower bound.
So, N1 and N2 were at least 1.
Ultimately, the only positions that can be eliminated are:
Is it true that given the motion model for part (b) and the sensor model for part (c), it is impossible for the agent ever to know exactly where it is with certainty. Explain your answer.
Since there is no mention of a robot not moving as it intended or getting turned, if a robot goes right twice and down twice it will know definitively it is in the lower right corner. It can get to any corner by a similar method. After it knows which corner it is in, since navigation is perfect, it can reliably move to any position without relying on its sensors at all.
Can a purely unsupervised learning agent learn? How do an unsupervised learning agent and a reinforcement learning agent differ? Be specific.
An unsupervised agent generally starts off with a dataset it is attempting to identify patterns within. The process begins by extracting some random elements from the dataset and setting them aside as a test set. It then uses the remaining learning set to formulate concepts about the patterns in the whole set. Next, it makes predictions about the elements it set aside in the test set and from the accuracy of these tests the learning agent can estimate how well it learned.
Reinforcement learning differs from unsupervised learning in that in unsupervised learning generally deals with finding patterns in a set of data. Reinforcement learning deals with the actions an agent should take to maximize the amount of some reward. Unsupervised agents know the entire state of the world and are attempting to find patterns whereas reinforcement learners have incomplete knowledge and are only able to discover the effects of a action by performing the action and observing.
Reinforcement agents are attempting to learn the set of actions to maximize reward and unsupervised agents are attempting to identify a pattern to maximize prediction efficiency. The difference between an unsupervised agent and a reinforcement agent whose reward is correct prediction is the completeness of the information the agent has at its disposal.
Why is Occam's razor important to learning by observation?
For any given set of points there are an infinite number of curves that could be fitted to pass through them. Occam's razor, by specifying that the simplest complete solution is the best, defines a criteria for choosing a specific solution.
Can all boolean functions be represented by small decision trees? Why or why not?
A boolean function simply means there are two outcomes. Imagine the boolean function gender. It has two outputs, 'male' or 'female.' Now, imagine that the system does not have access to chromosomal information, or that the system uses a modern interpretation of gender which divorces it from sex (a drag queen would be referred to using female pronouns and called a woman despite being genetically male). This system would rely on behavioral characteristics (wears a dress, wears makeup, etc.) and physical characteristics (hair length, height, waist to hip ratio, etc.). There are hundreds of thousands of potential characteristics that provide some level of separation, but particularly if sampling was done across cultures the correlation between any one characteristic and definitively knowing gender is small. With a cross-culture sample size of a couple hundred thousand people, this tree could potentially require hundreds of characteristics and levels to predict the boolean characteristic of gender.
Draw a decision tree for the following problem, where we have a table of data that describes eight pieces of fruit. Use the features, Taste, Size, and Color in this order to generate the decision tree.
| Number | Taste | Size | Color | Result |
|---|---|---|---|---|
| 1 | Sweet | Medium | Red | Apple |
| 2 | Sour | Medium | Green | Apple |
| 3 | Sour | Medium | Red | Apple |
| 4 | Sweet | Medium | Yellow | Apple |
| 5 | Sweet | Small | Red | Grape |
| 6 | Sour | Small | Green | Grape |
| 7 | Sour | Large | Yellow | Grapefruit |
| 8 | Sour | Small | Yellow | Lemon |
Compute the information gain for each attribute of the three features: taste, size, and color.
Use this information to choose the root node for a more compact decision tree. Continue this process and complete the tree. Compare against the tree that you generated in part (d).
Information content of a set X in bit-based information entropy information is:
The entropy of a partition, C, of the set X is:
It is also useful to recall that:
The information gain isn't really needed. The gain is the information content of the whole set minus the entropy of a partition. Since the information total is constant, one can simply pick the partition with minimum entropy. Since it is requested though, the information content of the entire set is:
For the top of the decision tree (where the entire dataset is considered), the values for I(x) would be:
| Chr. | Value | Apple | Grape | G.fruit | Lemon | I(c) | E(C) | G(C) = I(X) - E(C) |
|---|---|---|---|---|---|---|---|---|
| Taste | Sweet | 2 | 1 | 0 | 0 | 0.204434 | ||
| Sour | 2 | 1 | 1 | 1 | ||||
| Size | Large | 0 | 0 | 1 | 0 | 0 | 1.405639 | |
| Medium | 4 | 0 | 0 | 0 | 0 | |||
| Small | 0 | 2 | 0 | 1 | 0.918296 | |||
| Color | Red | 2 | 1 | 0 | 0 | 0.918296 | 1.156861 | 0.593139 |
| Green | 1 | 1 | 0 | 0 | 1 | |||
| Yellow | 1 | 0 | 1 | 1 | 1.5 |
A lower entropy is better (since the gain is the total information minus the subdivision entropy), so size is chosen as the characteristic to separate on. Two of the three sizes (large and medium) are perfect predictors for a fruit and don't need to be subdivided. For the third another calculation needs to be performed on the remaining characteristic. For this table, only small fruits are being considered.
| Chr. | Value | Apple | Grape | G.fruit | Lemon | I(c) | E(C) |
|---|---|---|---|---|---|---|---|
| Taste | Sweet | 0 | 1 | 0 | 0 | 0 | |
| Sour | 0 | 1 | 0 | 1 | 1 | ||
| Color | Red | 0 | 1 | 0 | 0 | 0 | 0 |
| Green | 0 | 1 | 0 | 0 | 0 | ||
| Yellow | 0 | 0 | 0 | 1 | 0 |
Color comes out with the best entropy and it is a perfect predictor (0 entropy) so there are no ambiguities left. The resultant tree is:
This tree is less bushy than the previous tree and shallower. For apples and grapefruits only a single computation is needed. For everything else there are two. There is no classification that can't be done more quickly on this tree than the previous one.