MACHINE LEARNING

What is learning?


A computer program learns if it
improves its performance at some
task through experience (T. Mitchell,
1997)
Any change in a system that allows it
to perform better (Simon 1983)
What do we learn:
Descriptions
Rules for recognizing or classifying
objects, states, events
Rules for transforming an initial
situation to achieve a goal (final state)
How do we learn:





Rote learning - storage of computed information.
Taking advice from others. (Advice may need to be
operationalized.)
Learning from problem-solving experiences: remembering experiences and generalizing from
them. (May add efficiency but not new knowledge.)
Learning from examples. (May or may not involve
a teacher.)
Learning by experimentation and discovery.
(Decreasing burden on teacher, increasing burden
on learner.)
Approaches to Machine
Learning
• Symbol-based
• Connectionist Learning
• Evolutionary learning
Inductive Symbol-Based Machine
Learning
Concept Learning

Version space search

Decision trees: ID3 algorithm

Explanation-based learning

Supervised learning

Reinforcement learning
Version space search for concept
learning

Concepts – describe classes of
objects

Concepts consist of feature sets

Operations on concept descriptions


Generalization: Replace a feature with a
variable
Specialization: Instantiate a variable with a
feature
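As a rough illustration of these two operations, the sketch below represents a concept description as a tuple of features and uses the string "?" as the variable. The function names and the sample object are assumptions made for this example only.

```python
ANY = "?"  # assumed marker for a variable (matches any feature)


def generalize(description, position):
    """Generalization: replace the feature at `position` with a variable."""
    items = list(description)
    items[position] = ANY
    return tuple(items)


def specialize(description, position, feature):
    """Specialization: instantiate the variable at `position` with a feature."""
    if description[position] != ANY:
        raise ValueError("only a variable can be instantiated")
    items = list(description)
    items[position] = feature
    return tuple(items)


# Hypothetical object described by (size, color, shape).
ball = ("small", "red", "round")
any_color_ball = generalize(ball, 1)
red_again = specialize(any_color_ball, 1, "red")
```

Here `any_color_ball` describes every small round object, and specializing it again with "red" gives back the original description.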
Positive and Negative examples of
a concept


The concept description has to match
all positive examples
The concept description has to be
false for the negative examples
Plausible descriptions


The version space represents all the
alternative plausible descriptions of
the concept
A plausible description is one that is
applicable to all known positive examples
and no known negative example.
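The plausibility test can be sketched in a few lines. This is a minimal sketch, assuming descriptions and examples are tuples of equal length and that "?" stands for a variable; `covers`, `is_plausible` and the sample data are hypothetical names and values chosen for illustration.

```python
ANY = "?"  # assumed marker: the attribute may take any value


def covers(description, example):
    """True if every position of the description matches the example."""
    return all(d == ANY or d == e for d, e in zip(description, example))


def is_plausible(description, positives, negatives):
    """Applicable to all known positives and to no known negative."""
    return (all(covers(description, p) for p in positives)
            and not any(covers(description, n) for n in negatives))


# Hypothetical examples described by (size, color, shape).
positives = [("small", "red", "round"), ("large", "red", "round")]
negatives = [("small", "blue", "square")]

ok = is_plausible((ANY, "red", ANY), positives, negatives)
too_specific = is_plausible(("small", "red", "round"), positives, negatives)
too_general = is_plausible((ANY, ANY, ANY), positives, negatives)
```

The first description is plausible, the second misses a positive example, and the third also covers the negative example.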
Basic Idea

Given:
A representation language
A set of positive and negative examples
expressed in that language


Compute: A concept description that is
consistent with all the positive examples and
none of the negative examples
Hypotheses
The version space contains two sets of
hypotheses:
G – the most general hypotheses
that match the training data
S – the most specific hypotheses
that match the training data
Each hypothesis is represented as
a vector of values of the known attributes
Example of Version space
Consider the task of obtaining a description of the
concept: Japanese Economy car.
The attributes under consideration are:
Origin, Manufacturer, Color, Decade, Type
Training data:
Positive ex: (Japan, Honda, Blue, 1980, Economy)
Positive ex: (Japan, Honda, White, 1980, Economy)
Negative ex: (Japan, Toyota, Green, 1970, Sports)
Example continued
A general hypothesis that matches
the positive data and does not match the
negative data is:
(?, Honda, ?, ?, Economy)
The symbol ‘?’ means that the attribute may take any value.
The most specific hypothesis that matches
the positive examples is:
(Japan, Honda, ?, 1980, Economy)
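The two hypotheses above can be checked mechanically against the three training examples. The sketch below is an illustration under assumed conventions: tuples of strings, "?" as the wildcard, and the hypothetical helpers `covers` and `most_specific`.

```python
ANY = "?"

positives = [
    ("Japan", "Honda", "Blue", "1980", "Economy"),
    ("Japan", "Honda", "White", "1980", "Economy"),
]
negatives = [("Japan", "Toyota", "Green", "1970", "Sports")]


def covers(hypothesis, example):
    """True if the hypothesis matches the example position by position."""
    return all(h == ANY or h == e for h, e in zip(hypothesis, example))


def most_specific(examples):
    """Keep a value where all examples agree, otherwise use '?'."""
    result = examples[0]
    for example in examples[1:]:
        result = tuple(r if r == e else ANY for r, e in zip(result, example))
    return result


specific = most_specific(positives)
general = (ANY, "Honda", ANY, ANY, "Economy")
consistent = (all(covers(general, p) for p in positives)
              and not any(covers(general, n) for n in negatives))
```

`specific` reproduces (Japan, Honda, ?, 1980, Economy), and `consistent` confirms that (?, Honda, ?, ?, Economy) matches both positive examples and rejects the negative one.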
Algorithm: Candidate elimination

Initialize G to contain one element: the
most general description (all features are
variables).

Initialize S to empty.

Accept a new training example.
Process positive examples
Remove from G any descriptions
that do not cover the example.
Generalize S as little as possible so
that the new training example is
covered.
Remove from S all elements that
cover negative examples.

Process negative examples
Remove from S any descriptions
that cover the negative example.
Specialize G as little as possible so
that the negative example is not
covered.
Remove from G all elements that
do not cover the positive examples.

Algorithm continued
Continue processing new training examples until
one of the following occurs:
• Either S or G becomes empty: there are no consistent
hypotheses over the training space. Stop.
• S and G are both singleton sets.
If they are identical, output their value and stop.
If they are different, the training cases were inconsistent.
Output this result and stop.
• No more training examples remain and G has several
hypotheses.
The version space is a disjunction of hypotheses. If the
hypotheses agree on a new example, we can classify
the example. If they disagree, we can take the majority
vote.
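The steps of candidate elimination can be sketched as follows. This is a simplified sketch, not a full implementation: it assumes conjunctive hypotheses written as tuples with "?" as the variable, it seeds S with the first positive example (as the worked car example does) rather than with an empty set, and it keeps S as a single hypothesis. All function names are hypothetical.

```python
ANY = "?"


def covers(h, x):
    """True if hypothesis h matches example x ('?' matches anything)."""
    return all(a == ANY or a == b for a, b in zip(h, x))


def more_general_or_equal(h1, h2):
    """True if h1 covers everything that h2 covers."""
    return all(a == ANY or a == b for a, b in zip(h1, h2))


def candidate_elimination(examples):
    """examples: list of (attribute tuple, is_positive); first one is positive."""
    first, label = examples[0]
    assert label, "this sketch seeds S with a first positive example"
    G = [tuple(ANY for _ in first)]
    S = [first]
    for x, positive in examples[1:]:
        if positive:
            # Drop general hypotheses that miss the positive example.
            G = [g for g in G if covers(g, x)]
            # Minimally generalize S: mismatching features become variables.
            S = [tuple(a if a == b else ANY for a, b in zip(s, x)) for s in S]
            # Keep only members of S that some member of G still bounds.
            S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
        else:
            # Drop specific hypotheses that cover the negative example.
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # Minimally specialize g: fix one variable to a value taken
                # from S that differs from the negative example.
                for s in S:
                    for i, (gv, sv) in enumerate(zip(g, s)):
                        if gv == ANY and sv != ANY and sv != x[i]:
                            new_G.append(g[:i] + (sv,) + g[i + 1:])
            new_G = list(dict.fromkeys(new_G))  # remove duplicates, keep order
            # Remove hypotheses that are less general than another one.
            G = [g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)]
        if not S or not G:
            break  # no consistent hypothesis remains
    return G, S


cars = [
    (("Japan", "Honda", "Blue", "1980", "Economy"), True),
    (("Japan", "Toyota", "Green", "1970", "Sports"), False),
    (("Japan", "Toyota", "Blue", "1990", "Economy"), True),
    (("USA", "Chrysler", "Red", "1980", "Economy"), False),
    (("Japan", "Honda", "White", "1980", "Economy"), True),
]
G, S = candidate_elimination(cars)
```

On these five car examples both boundary sets end up as the single hypothesis (Japan, ?, ?, ?, Economy), so the sketch stops in the "S and G are identical singletons" case.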
Learning the concept of
"Japanese economy car"

Features:
Origin, Manufacturer, Color, Decade, Type
POSITIVE EXAMPLE:
(Japan, Honda, Blue, 1980, Economy)
Initialize G to singleton set that includes everything
Initialize S to singleton set that includes first
positive example
G = {(?, ?, ?, ?, ?)}
S = {(Japan, Honda, Blue, 1980, Economy)}
Example continued

NEGATIVE EXAMPLE:


(Japan, Toyota, Green, 1970, Sports)
Specialize G to exclude negative example
G = {(?, Honda, ?, ?, ?),
(?, ?, Blue, ?, ?),
(?, ?, ?, 1980, ?),
(?, ?, ?, ?, Economy)}
S = {(Japan, Honda, Blue, 1980, Economy)}

Example continued

POSITIVE EXAMPLE:



(Japan, Toyota, Blue, 1990, Economy)
Remove from G descriptions inconsistent
with positive example
Generalize S to include positive example
G = {(?, ?, Blue, ?, ?),
(?, ?, ?, ?, Economy)}
S = {(Japan, ?, Blue, ?, Economy)}
Example continued

NEGATIVE EXAMPLE:

(USA, Chrysler, Red, 1980, Economy)
Specialize G to exclude negative
example (but staying within version
space, i.e., staying consistent with S)
G = {(?, ?, Blue, ?, ?),
(Japan, ?, ?, ?, Economy)}
S = {(Japan, ?, Blue, ?, Economy)}

Example continued

POSITIVE EXAMPLE:




(Japan, Honda, White, 1980, Economy)
Remove from G descriptions inconsistent with
positive example
Generalize S to include the positive example
G = {(Japan, ?, ?, ?, Economy)}
S = {(Japan, ?, ?, ?, Economy)}
S = G, both singleton => done!
Decision trees


A decision tree is a structure that
represents a procedure for classifying
objects based on their attributes.
Each object is represented as a set of
attribute/value pairs and a
classification.
Example
A set of medical symptoms might be represented
as follows:
Name   Cough  Fever  Weight  Pain     Classification
Mary   no     yes    normal  throat   flu
Fred   no     yes    normal  abdomen  appendicitis
Julie  yes    yes    skinny  none     flu
Elvis  yes    no     obese   chest    heart disease
The system is given a set of training instances
along with their correct classifications and
develops a decision tree based on these
examples.
Attributes


If a crucial attribute is not represented,
then no decision tree will be able to learn
the concept.
If two training instances have the same
representation but belong to different
classes, then the attribute set is said to be
inadequate. It is impossible for the
decision tree to distinguish the instances.
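An inadequate attribute set can be detected before any tree is built. The sketch below is one possible check, using the four patients from the medical table; the function name `find_conflicts` and the tuple layout (Cough, Fever, Weight, Pain) are assumptions made for this example.

```python
def find_conflicts(instances):
    """Return (attributes, first class, other class) for clashing instances.

    instances: list of (attribute tuple, class label).
    """
    seen = {}
    conflicts = []
    for attributes, label in instances:
        if attributes in seen and seen[attributes] != label:
            conflicts.append((attributes, seen[attributes], label))
        seen.setdefault(attributes, label)
    return conflicts


# Attribute order: (Cough, Fever, Weight, Pain).
patients = [
    (("no", "yes", "normal", "throat"), "flu"),            # Mary
    (("no", "yes", "normal", "abdomen"), "appendicitis"),  # Fred
    (("yes", "yes", "skinny", "none"), "flu"),             # Julie
    (("yes", "no", "obese", "chest"), "heart disease"),    # Elvis
]

# Same patients with the Pain attribute dropped.
without_pain = [(attributes[:3], label) for attributes, label in patients]

full_conflicts = find_conflicts(patients)
reduced_conflicts = find_conflicts(without_pain)
```

With all four attributes there is no conflict. Without Pain, Mary and Fred share the representation (no, yes, normal) but belong to different classes, so the reduced attribute set is inadequate.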
ID3 Algorithm (Quinlan, 1986)
ID3(R, C, S)
// R – list of attributes,
// C – categorical attribute, S – examples
If all examples from S belong to the same class Cj,
    return a leaf labeled Cj
If R is empty,
    return a node with the most frequent value of C
Else
    select the “best” decision attribute A in R with values
    v1, v2, …, vn for the next node
    divide the training set S into S1, …, Sn according to
    values v1, …, vn
    call ID3(R – {A}, C, S1), ID3(R – {A}, C, S2), …,
    ID3(R – {A}, C, Sn), i.e. recursively build subtrees T1,
    …, Tn for S1, …, Sn
    return a node labelled A with the subtrees
    T1, T2, …, Tn as children
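The recursion above can be sketched in Python. This is a minimal sketch under several assumptions: examples are dictionaries, the tree is returned as nested dictionaries, and the "best" attribute is taken to be the one that leaves the least entropy after the split (equivalent to maximal information gain). The helper names are hypothetical, and the data is the medical table from the earlier example.

```python
from collections import Counter
from math import log2


def entropy(rows, target):
    """Entropy of the class distribution of `rows`."""
    counts = Counter(row[target] for row in rows)
    return sum(-(n / len(rows)) * log2(n / len(rows)) for n in counts.values())


def partition(rows, attribute):
    """Group rows by their value of `attribute`."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row)
    return subsets


def remaining_info(rows, attribute, target):
    """Weighted entropy of the subsets produced by splitting on `attribute`."""
    subsets = partition(rows, attribute).values()
    return sum(len(s) / len(rows) * entropy(s, target) for s in subsets)


def id3(attributes, target, rows):
    classes = [row[target] for row in rows]
    if len(set(classes)) == 1:
        return classes[0]                              # leaf labeled Cj
    if not attributes:
        return Counter(classes).most_common(1)[0][0]   # most frequent class
    best = min(attributes, key=lambda a: remaining_info(rows, a, target))
    rest = [a for a in attributes if a != best]
    return {best: {value: id3(rest, target, subset)
                   for value, subset in partition(rows, best).items()}}


patients = [
    {"Cough": "no", "Fever": "yes", "Weight": "normal", "Pain": "throat", "Class": "flu"},
    {"Cough": "no", "Fever": "yes", "Weight": "normal", "Pain": "abdomen", "Class": "appendicitis"},
    {"Cough": "yes", "Fever": "yes", "Weight": "skinny", "Pain": "none", "Class": "flu"},
    {"Cough": "yes", "Fever": "no", "Weight": "obese", "Pain": "chest", "Class": "heart disease"},
]
tree = id3(["Cough", "Fever", "Weight", "Pain"], "Class", patients)
```

On this tiny data set the sketch picks Pain as the root, because every Pain value occurs for exactly one patient and each resulting subset is pure.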




Entropy
S: a sample of training examples
Entropy(S) = expected number of bits needed to encode the
classification of an arbitrary member of S
Information theory: an optimal-length code assigns
-log2 p bits to a message having probability p
Generally, for c different classes, where p_i is the proportion of S in class i:
$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$
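The entropy formula translates almost directly into code. A small sketch, assuming the sample is given as a list of class labels (the function name and the labels are illustrative):

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Sum over classes of -p_i * log2(p_i), with p_i the class proportion."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


two_equal_classes = entropy(["flu", "flu", "cold", "cold"])
four_equal_classes = entropy(["a", "b", "c", "d"])
single_class = entropy(["flu", "flu", "flu"])
```

Two equally likely classes need 1 bit, four equally likely classes need 2 bits, and a sample with a single class needs 0 bits.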
Entropy of the Training Set
T: a set of records partitioned into classes
C1, C2, …, Ck on the basis of the
categorical attribute C.
Probability of each class:
$p_i = |C_i| / |T|$
$\mathrm{Info}(T) = -p_1 \log_2 p_1 - \dots - p_k \log_2 p_k$
Info(T) is the information needed to classify an
element.
How helpful is an attribute?


X: a non-categorical attribute
{T1, …, Tn}: the split of T according to the values of X
The entropy of each Tk is:
$\mathrm{Info}(T_k) = -\frac{|T_{k1}|}{|T_k|} \log_2 \frac{|T_{k1}|}{|T_k|} - \dots - \frac{|T_{kc}|}{|T_k|} \log_2 \frac{|T_{kc}|}{|T_k|}$
where c is the number of partitions in Tk produced by the
categorical attribute C, and Tkj is the j-th such partition
For any k, Info(Tk) reflects how the categorical attribute C
splits the set Tk
Information Gain
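$\mathrm{Info}(X,T) = \frac{|T_1|}{|T|}\,\mathrm{Info}(T_1) + \frac{|T_2|}{|T|}\,\mathrm{Info}(T_2) + \dots + \frac{|T_n|}{|T|}\,\mathrm{Info}(T_n)$
$\mathrm{Gain}(X,T) = \mathrm{Info}(T) - \mathrm{Info}(X,T) = \mathrm{Entropy}(T) - \sum_{i=1}^{n} \frac{|T_i|}{|T|}\,\mathrm{Entropy}(T_i)$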
Information Gain


Gain(X,T) - the expected reduction in entropy
caused by partitioning the examples of T according
to the attribute X.
Gain(X,T) - a measure of the effectiveness of an
attribute in classifying the training data

The best attribute has maximal Gain(X,T)
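Information gain can be computed with a short function. The sketch below assumes records are tuples and attributes are addressed by column index; the records themselves are hypothetical and chosen so that one attribute is perfectly informative and the other is useless.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows, attribute, target):
    """Entropy(T) minus the weighted entropy of the subsets split on `attribute`."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical records: (shape, size, label).
rows = [
    ("round", "small", "yes"),
    ("round", "large", "yes"),
    ("square", "small", "no"),
    ("square", "large", "no"),
]
shape_gain = gain(rows, 0, 2)  # shape separates the classes completely
size_gain = gain(rows, 1, 2)   # size leaves both subsets evenly mixed
```

Splitting on shape removes the full 1 bit of entropy, while splitting on size removes none, so shape would be selected as the best attribute.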
Example (1)
Name   Hair    Height   Weight   Lotion  Result
Sarah  blonde  average  light    no      sunburned (positive)
Dana   blonde  tall     average  yes     none (negative)
Alex   brown   short    average  yes     none
Annie  blonde  short    average  no      sunburned
Emily  red     average  heavy    no      sunburned
Pete   brown   tall     heavy    no      none
John   brown   average  heavy    no      none
Katie  blonde  short    light    yes     none
Example (2)


Attribute “hair”
Blonde: T1 = {Sarah, Dana, Annie, Katie}
Brown: T2 = {Alex, Pete, John}
Red: T3 = {Emily}
T1 is split by C into 2 sets:
T11 = {Sarah, Annie}, T12 = {Dana, Katie}
Info(T1) = -2/4 * log2(2/4) - 2/4 * log2(2/4) = -log2(1/2) = 1
In a similar way we compute
Info(T2) = 0, Info(T3) = 0
Info(‘hair’, T) = |T1|/|T| * Info(T1) + |T2|/|T| * Info(T2) + |T3|/|T| * Info(T3)
= 4/8 * Info(T1) + 3/8 * Info(T2) + 1/8 * Info(T3)
= 4/8 * 1 = 0.50
This is the lowest value of Info(X, T) among the four attributes, so hair has the maximal gain and is the best attribute.
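The same computation can be repeated for all four attributes of the sunburn table. This is an illustrative sketch; the constant and function names are assumptions, and the rows are copied from the table in Example (1).

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ("name", "hair", "height", "weight", "lotion", "result")
ROWS = [
    ("Sarah", "blonde", "average", "light", "no", "sunburned"),
    ("Dana", "blonde", "tall", "average", "yes", "none"),
    ("Alex", "brown", "short", "average", "yes", "none"),
    ("Annie", "blonde", "short", "average", "no", "sunburned"),
    ("Emily", "red", "average", "heavy", "no", "sunburned"),
    ("Pete", "brown", "tall", "heavy", "no", "none"),
    ("John", "brown", "average", "heavy", "no", "none"),
    ("Katie", "blonde", "short", "light", "yes", "none"),
]


def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def info(attribute):
    """Info(attribute, T): weighted entropy of the subsets T1..Tn."""
    column = COLUMNS.index(attribute)
    subsets = defaultdict(list)
    for row in ROWS:
        subsets[row[column]].append(row[-1])  # collect the Result values
    return sum(len(s) / len(ROWS) * entropy(s) for s in subsets.values())


info_by_attribute = {a: info(a) for a in ("hair", "height", "weight", "lotion")}
best = min(info_by_attribute, key=info_by_attribute.get)
```

Rounded to three decimals, the values are 0.500 for hair, 0.689 for height, 0.939 for weight and 0.607 for lotion, so `best` is "hair".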
Example (3)
Hair color
  blonde: Lotion
    yes: none
    no: sunburned
  red: sunburned
  brown: none
Split Ratio

GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)
where SplitInfo(D,T) is the information
due to the split of T when D is treated as the
categorical attribute
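A possible reading of this definition is sketched below on the sunburn data. It assumes that SplitInfo(D,T) is the entropy of the distribution of D's own values in T, which is how the gain ratio is commonly defined; that formula is not spelled out in the text above. The names are illustrative, and only the hair, lotion and result columns are kept.

```python
from collections import Counter
from math import log2

# (hair, lotion, result) columns of the sunburn table.
ROWS = [
    ("blonde", "no", "sunburned"),   # Sarah
    ("blonde", "yes", "none"),       # Dana
    ("brown", "yes", "none"),        # Alex
    ("blonde", "no", "sunburned"),   # Annie
    ("red", "no", "sunburned"),      # Emily
    ("brown", "no", "none"),         # Pete
    ("brown", "no", "none"),         # John
    ("blonde", "yes", "none"),       # Katie
]
HAIR, LOTION, RESULT = 0, 1, 2


def entropy(values):
    total = len(values)
    return sum(-(n / total) * log2(n / total) for n in Counter(values).values())


def gain(column):
    """Entropy of the results minus the weighted entropy after the split."""
    remainder = 0.0
    for value in set(row[column] for row in ROWS):
        subset = [row[RESULT] for row in ROWS if row[column] == value]
        remainder += len(subset) / len(ROWS) * entropy(subset)
    return entropy([row[RESULT] for row in ROWS]) - remainder


def split_info(column):
    # Assumed definition: entropy of the attribute's own value distribution.
    return entropy([row[column] for row in ROWS])


def gain_ratio(column):
    return gain(column) / split_info(column)


hair_ratio = gain_ratio(HAIR)
lotion_ratio = gain_ratio(LOTION)
```

Under this assumption hair still has the larger gain, but lotion has the larger gain ratio (about 0.364 against about 0.323), which agrees with lotion appearing at the root of the split ratio tree.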
Split Ratio Tree
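[Figure: decision tree with root "lotion" (branches: no, yes) and a second-level test on hair color (branches: blonde, red, brown); leaves are labelled "sunburn" or "none".]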
More Training Examples