Continuous Conceptual Set Covering:
Learning Robot Operators From Examples
Carl Myers Kadie
Knowledge-Based Systems Group,
Department of Computer Science & Beckman Institute,
University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
[email protected]
Abstract
Continuous Conceptual Set Covering (CCSC) is
an algorithm that uses engineering knowledge to
learn operator effects from training examples.
The program produces an operator hypothesis
that, even in noisy and nondeterministic
domains, can make good quantitative
predictions. An empirical evaluation in the
tray-tilting domain shows that CCSC learns faster
than an alternative case-based approach. The
best results, however, come from integrating
CCSC and the case-based approach.
1. INTRODUCTION
An important open problem in Machine Learning is
learning the effects of robot operators from examples.
Previous research has provided a partial solution.
Case-based approaches, for example, have been effective
in some domains [Moore, 1990]. A limitation of the
case-based approach is that it is very sensitive to the
number of attributes used to describe a case. Machine
Learning offers a number of generalization techniques
that are relatively insensitive to the number of attributes
[Quinlan, 1986; Kadie, 1990]. Most of these techniques,
however, work only on discrete concept-learning
problems. The CCSC algorithm extends these
generalization methods to work on continuous problems.
The key extensions involve the application of
background knowledge and the automatic determination
of an error threshold.
2. PHYSICAL-WORLD OPERATOR-EFFECT LEARNING
Conceptually, the effects of an operator are a function
from the state of a system and a set of parameters to a
new system state. For example, the tray-tilting domain
system is made up of a Puma robot holding a square 11"
× 11" tray (figure 1). The tray contains a single round
puck. The state of the world is represented by the x and y
coordinates of the puck. The values of the coordinates
are continuous and range from about 30.0 to about 240.0.
Figure 1. Experimental Set Up
Initially, the robot knows how to physically execute the
tilt_operator. It does not, however, know the effects of
the operator. When the tray-tilt operator is executed, the
robot tips the tray down 30° from the horizontal in the
direction of tilt. The new position of the puck is hard to
predict because of uncertainty in the initial conditions
(the initial position of the puck and the tilting angle are
continuous values subject to measurement error). In
addition, the puck's movement can be complex; it can
slide, bounce and even roll (along an edge of the tray).
The performance task in the tray-tilting domain is to take
the initial position of the puck, <x_0, y_0>, and the tilt
angle, tilt, and predict the puck's final position, <x_1, y_1>.
The goodness of this prediction will be measured as the
Euclidean distance between the actual final position and
the predicted final position. The input to the learning
task is 1) a set of training examples of the form
{<<state_i, parms_i>, state_{i+1}>}, where state_{i+1} is the result
of applying the operator to state_i with parameters parms_i,
and 2) background knowledge, K.
The output of the learning task is an (operator)
hypothesis, H. The operator hypothesis should be an
executable function, for example, a Lisp lambda
expression (for an example, look ahead to figure 4).
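The training-example format and the error measure described above can be sketched in Python. This is a minimal illustration, not code from the original system: the names State, Example, prediction_error, and mean_error are assumptions introduced here, and the two sample examples are made-up values used only to exercise the functions.

```python
import math
from typing import Callable, List, Tuple

State = Tuple[float, float]            # puck position <x, y>
Example = Tuple[State, float, State]   # <<state_i, parms_i>, state_{i+1}>, parms = tilt angle
Hypothesis = Callable[[State, float], State]


def prediction_error(predicted: State, actual: State) -> float:
    """Euclidean distance between the predicted and the actual final position."""
    return math.hypot(predicted[0] - actual[0], predicted[1] - actual[1])


def mean_error(hypothesis: Hypothesis, examples: List[Example]) -> float:
    """Average prediction error of an operator hypothesis over a set of examples."""
    total = sum(prediction_error(hypothesis(state, tilt), nxt)
                for state, tilt, nxt in examples)
    return total / len(examples)


def stay_still(state: State, tilt: float) -> State:
    """The stay-still expert: the puck is predicted not to move."""
    return state


# Hypothetical examples, not taken from the tray-tilting data set.
examples: List[Example] = [
    ((100.0, 100.0), 0.0, (103.0, 104.0)),
    ((50.0, 60.0), 90.0, (50.0, 60.0)),
]
print(mean_error(stay_still, examples))
```

Any expert or learned hypothesis with the same (state, tilt) signature could be scored the same way.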
Background knowledge here is in the form of
engineering experts. Each expert is a program, Ej, that
makes (sometimes bad) predictions about the puck's
movement. Referring to figure 2, the experts used in the
tray-tilting experiments are:
• stay-still (position a).
• go-and-stick (position b).
• go-and-slam (position c).
• slider: The puck will slide from position b toward
position c. The distance it slides will be a linear
function of the distance from position a to position b
and cos(φ), where φ is the angle of incidence.
Figure 2. Geometry of Tray Tilting for an Initial
Puck Position a and Tilt Angle tilt - If the puck
traveled in a straight line it would contact the
wall at point b with an angle of incidence of φ.
If the puck then slid along the wall, it would
reach point c.
The experts may be fixed or they may be the output of
other learning programs. For example, slider is the
output of a program that takes the training examples as
input and then uses multiple-linear regression to find the
best linear relation.
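The text says that slider is produced by multiple linear regression but does not show the fitting step. The sketch below is one way such a fit might look, assuming the slide distance is modeled as w0 + w1*dist_ab + w2*cos(φ). The function names and the sample rows are hypothetical, and ordinary least squares is solved here through the normal equations in plain Python.

```python
def fit_linear(rows, targets):
    """Least-squares fit of target ~ w0 + w1*f1 + w2*f2 + ... (normal equations)."""
    X = [[1.0] + list(r) for r in rows]          # prepend the intercept term
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting (assumes a non-singular system).
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def make_slider(weights):
    """Return a function predicting the slide distance from dist_ab and cos(phi)."""
    w0, w1, w2 = weights
    return lambda dist_ab, cos_phi: w0 + w1 * dist_ab + w2 * cos_phi


# Hypothetical (dist_ab, cos_phi) rows with targets from a known linear relation.
rows = [(10.0, 1.0), (20.0, 0.5), (30.0, 0.8), (40.0, 0.2)]
targets = [2.0 + 0.5 * d - 3.0 * c for d, c in rows]
weights = fit_linear(rows, targets)
slider_distance = make_slider(weights)
```

Because the expert is refit from the training examples, its predictions change as more examples arrive, which is what the text means by experts that are created dynamically.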
Because this is a physical-world domain, no expert's
predictions will likely ever be perfect. But because some
of the experts are created dynamically, all examples
should be well predicted by at least one expert. The heart
of the learning task is thus selecting the right expert for a
given example.
The operator hypothesis that results from learning can be
represented as a decision list of the form:
If d_1 then apply E_{k_1},
else if d_2 then apply E_{k_2},
...
else apply E_{k_m}
where the d_j's are decision rules and the E_k's are the experts.
Section 3 details CCSC, an algorithm for doing expert
selection. But first, here is a review of related work.
Related Work - CCSC differs from most Machine
Learning research in that it creates hypotheses that
predict continuous values. Continuous value prediction
is seldom perfect and never completely wrong. Instead, it
is correct to a lesser or greater degree.
CCSC is similar to work in quantitative discovery,
especially the ABACUS system [Falkenhainer and
Michalski, 1986]. Unlike most quantitative discovery
systems, CCSC works with whatever experts (or expert
constructors) it is given. CCSC can, for example, work
with equations, look-up tables, and Lisp programs. It
could even work with ABACUS's equation constructor.
CCSC also differs from most quantitative discovery
systems in that it automatically adapts to error.
Learning in a tray-tilting domain is described in
[Christiansen, et al, 1990; Mason, et al, 1989]. In their
version of the problem the domain is discrete. They
divide the tray into nine subsquares (like a tic-tac-toe
board) and the tilt heading into 24 angles. The goal of
their work is to build a robot that learns from
experimentation how to improve its planning ability. The
inductive learning component of the system is
case-based. The output of the learner is a Markov model.
Given one of the initial subsquares, an angle, and a final
subsquare, the Markov model gives the probability of that
move.
The case-based, or exemplar, approach to operator
learning is also explored in [Moore, 1990]. Moore's
system efficiently uses the same nearest-neighbor metric
used in this paper. Like all case-based approaches,
however, this approach is sensitive to the number of
attributes and has difficulty accepting background
knowledge.
The Grasper system demonstrates explanation-based
learning in a robot domain [Bennett, 1990]. Grasper is
given an approximate domain theory. In contrast with
CCSC's more empirical approach, Grasper uses
explanation-based methods to help it tune scalar
parameters such as the initial width of a robot's grasper.
Grasper requires much more background knowledge than
CCSC (it must be given an approximate domain theory),
but fewer training examples.
3. LEARNING ALGORITHMS
Three algorithms were created and tested. This
section describes each one in turn.
3.1. CASE-BASED LEARNING
The simplest algorithm for operator learning is a
case-based, nearest-neighbor approach. In the
experiments, case nearness was measured by scaling all
values to the interval [0,1] and then measuring Euclidean
distance.
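The nearest-neighbor metric just described can be sketched as follows. This is an illustrative reading rather than the experimental code: scaling by the minimum and maximum of the stored cases is an assumption about how values were mapped to [0,1], and the function names and sample cases are hypothetical.

```python
import math


def make_scaler(inputs):
    """Build a function that rescales each attribute to [0, 1] using stored-case ranges."""
    lows = [min(col) for col in zip(*inputs)]
    highs = [max(col) for col in zip(*inputs)]

    def scale(vector):
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(vector, lows, highs)]

    return scale


def nearest_neighbor_predict(cases, query):
    """Return the outcome of the stored case closest to the query.

    cases is a list of (input_vector, outcome) pairs; distance is Euclidean
    distance measured after scaling every attribute to [0, 1].
    """
    scale = make_scaler([inp for inp, _ in cases])
    q = scale(query)
    _, outcome = min(cases, key=lambda case: math.dist(scale(case[0]), q))
    return outcome


# Hypothetical cases: (x, y, tilt) -> final (x, y).
cases = [
    ((30.0, 30.0, 0.0), (240.0, 30.0)),
    ((240.0, 240.0, 180.0), (30.0, 240.0)),
    ((100.0, 100.0, 90.0), (100.0, 240.0)),
]
prediction = nearest_neighbor_predict(cases, (95.0, 110.0, 80.0))
```

Every attribute contributes equally to the distance, so attributes that carry no information dilute the metric; this is the dimensionality sensitivity discussed in the text.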
In a second set of experiments, the same similarity
metric was applied to a set of constructed attributes.
Symmetry allowed each case to represent eight cases.
Given enough examples, case-based learning will
converge to a hypothesis with minimal error. The
disadvantages of case-based learning are twofold. First,
the record of past cases is bulky and nearly
incomprehensible to humans. Second, the case-based
approach is very sensitive to the number of attributes
used to describe the state and the parameters. In other
words, as the dimensionality of input space increases,
performance decreases.
3.2. GREEDY SELECTION OF EXPERTS
If engineering experts (in the form of programs) are
available, then a simple greedy algorithm can be used to
learn.
For each training example,
<<state_i, parms_i>, state_{i+1}>, find the expert, E_{j_i}, that best
covers the example and record the index j_i of that expert. The
result is a new set of training examples
{<<state_i, parms_i>, j_i>}. These new training examples
can be given to a multiple-concept learning system. The
result will be a decision function D that, when given a
prediction problem <state_new, parms_new>, predicts the
index, j_new, of the best expert for that problem. Applying
E_{j_new} to <state_new, parms_new> produces state_{new+1}, the
predicted result of applying the operator to
<state_new, parms_new>. In the experiments of section 4 the
decision function D was of the form of a decision list
where each decision rule was produced by a version of
the ID3 program [Quinlan, 1986].
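The relabeling step of the greedy method can be sketched as below. The two toy experts and the sample examples are hypothetical stand-ins; the real experts are those of section 2, and the relabeled examples would then be handed to a concept learner such as ID3, which is not shown here.

```python
import math


def greedy_labels(examples, experts):
    """Replace each outcome with the index of the expert whose prediction is closest.

    examples: list of (state, parms, next_state) triples.
    experts:  list of functions (state, parms) -> predicted next state.
    Returns a list of ((state, parms), best_expert_index) pairs.
    """
    labeled = []
    for state, parms, next_state in examples:
        errors = [math.dist(expert(state, parms), next_state) for expert in experts]
        labeled.append(((state, parms), errors.index(min(errors))))
    return labeled


def stay_still(state, parms):
    """Toy expert: the puck does not move."""
    return state


def move_ten_right(state, parms):
    """Toy expert (hypothetical): the puck moves 10 units in +x."""
    return (state[0] + 10.0, state[1])


toy_examples = [
    ((0.0, 0.0), 0.0, (0.0, 0.0)),
    ((0.0, 0.0), 0.0, (10.0, 0.0)),
    ((5.0, 5.0), 0.0, (14.0, 5.0)),
]
labels = [j for _, j in greedy_labels(toy_examples, [stay_still, move_ten_right])]
```

Each example is labeled independently of its neighbors, which is why the resulting decision function can be more complex than necessary.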
This greedy learning method can produce concise and
comprehensible hypotheses. Moreover, because ID3 is
good on problems with high dimensionality, the
hypothesis should be less sensitive to the input-space
dimensionality. The problem with this approach is that
the decision function D might be more complex than
necessary, that is, the best expert on a particular training
example may not be the best expert for similar examples.
The result of using only the locally best experts may be
a more complex, less accurate decision function.
3.3. CCSC: CONCEPTUAL SELECTION OF EXPERTS
In general, a learning program should be willing to trade
fit with the training examples for greater hypothesis
simplicity. The Conceptual Set Covering (CSC)
algorithm [Kadie, 1990] can make this tradeoff, but only
in discrete, deterministic, errorless domains.
For each example, CSC chooses one expert from the set
of experts that cover that example. It tries to make this
choice so that the syntactic complexity of the final
decision function is minimized.
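The following sketch is not the CSC algorithm of [Kadie, 1990], which minimizes the syntactic complexity of the decision function. It substitutes a much cruder proxy, a greedy set cover that reuses the expert covering the most still-unassigned examples, only to illustrate why choosing among several covering experts can yield a simpler assignment than picking per example. The function name and the cover sets are hypothetical.

```python
from collections import Counter


def choose_experts(cover_sets):
    """Assign one covering expert to each example, favoring widely shared experts.

    cover_sets[i] is the set of expert indices that cover example i.
    Returns a list giving the chosen expert index for each example.
    """
    remaining = set(range(len(cover_sets)))
    choice = {}
    while remaining:
        counts = Counter(j for i in remaining for j in cover_sets[i])
        if not counts:
            raise ValueError("some example is covered by no expert")
        # Most frequent expert first; ties are broken by the lower index.
        best = min(counts, key=lambda j: (-counts[j], j))
        for i in [i for i in remaining if best in cover_sets[i]]:
            choice[i] = best
            remaining.discard(i)
    return [choice[i] for i in range(len(cover_sets))]


# Hypothetical cover sets for four examples and three experts.
assignment = choose_experts([{0, 1}, {1}, {1, 2}, {2}])
```

Here two experts suffice for the four examples, whereas an arbitrary per-example choice could have used all three.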
The result is a decision list D that, when given a new
problem <state_new, parms_new>, predicts which expert will
cover that problem. Applying the expert to the problem
produces a predicted result.
In continuous, nondeterministic domains, experts are
unlikely to exactly cover an example, so the notion of
coverage must be relaxed. The simplest approach is to
define cover in terms of an error cutoff. Specifically, an
expert E is said to cover a training example
<<state_i, parms_i>, state_{i+1}> if
|E(state_i, parms_i) - state_{i+1}| < cutoff
Because an acceptable error cutoff for one domain will not
necessarily be an acceptable error cutoff in another,
CCSC determines the error cutoff automatically.
The idea of the cutoff is to separate acceptable error
from unacceptable error. Toward this goal, CCSC makes
two assumptions. First, it assumes that the error of the
best expert on a particular example is usually acceptable.
Call the error of the best expert on example i best_i. The
distribution of all the best_i's can be plotted as in figure 3.
Second, it assumes that the error of the other experts on a
particular example is usually unacceptable. Call the
errors of the other experts on example i {other_i}. The
distribution of all the other_i's can also be plotted. CCSC
sets the cutoff to the value that is at the p-th percentile of
the best distribution and at the (100%-p)-th percentile of
the other distribution. If, on a particular example, no
expert meets the cutoff, the best expert is accepted.
Figure 3.
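One way to read the cutoff rule is sketched below. The paper does not give the search procedure, so this is an assumption: candidate cutoffs are taken midway between consecutive observed errors, and the one chosen is where the fraction of best errors below the cutoff most nearly equals the fraction of other errors above it. All names and the error table are hypothetical.

```python
def split_errors(error_table):
    """Split per-example expert errors into best-expert errors and all other errors.

    error_table[i][j] is the error of expert j on example i.
    """
    best, other = [], []
    for row in error_table:
        k = row.index(min(row))
        best.append(row[k])
        other.extend(e for j, e in enumerate(row) if j != k)
    return best, other


def find_cutoff(best, other):
    """Pick the cutoff balancing the two percentile conditions described in the text."""
    values = sorted(set(best) | set(other))
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])]

    def imbalance(c):
        frac_best_below = sum(b < c for b in best) / len(best)
        frac_other_above = sum(o > c for o in other) / len(other)
        return abs(frac_best_below - frac_other_above)

    return min(candidates, key=imbalance)


def covering_experts(row, cutoff):
    """Experts whose error is under the cutoff; fall back to the best expert if none."""
    covered = [j for j, e in enumerate(row) if e < cutoff]
    return covered or [row.index(min(row))]


# Hypothetical errors for four examples and three experts.
error_table = [
    [1.0, 4.0, 20.0],
    [2.0, 6.0, 25.0],
    [3.0, 30.0, 35.0],
    [12.0, 40.0, 45.0],
]
best, other = split_errors(error_table)
cutoff = find_cutoff(best, other)
```

With this table, 75% of the best errors fall below the chosen cutoff and 75% of the other errors fall above it, so p is 75. The last example has no expert under the cutoff and therefore keeps its best expert.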
CCSC generally produces hypotheses that are more
concise and comprehensible than those produced by the
greedy method. Figure 4 shows an operator hypothesis
produced by CCSC.
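For readers less familiar with Lisp, the decision-list structure of the Figure 4 hypothesis can be mirrored in Python. The thresholds below are copied from Figure 4; the function name is an assumption, and the sketch returns the name of the selected expert instead of calling it, since the expert programs themselves are not listed in the paper.

```python
def select_expert(dist1, incid, dist2, n2):
    """Return the name of the expert chosen by the Figure 4 decision list.

    Rules are tried in order; the first rule whose condition holds wins,
    and slider is the default when no rule fires.
    """
    stick = (
        (176.13756 <= dist1 < 213.34853 and incid < 13.5)
        or (dist1 < 213.34853 and dist2 < 91.29100 and 13.5 <= incid < 45.0)
        or (incid >= 45.0 and dist1 < 37.96659 and n2 < 25)
        or (dist1 < 213.34853 and incid >= 45.0 and n2 >= 25)
    )
    if stick:
        return "go-and-stick"

    slam = (
        (dist1 < 7.66117 and incid < 36.5)
        or (dist1 >= 7.66117 and incid < 63.5)
    )
    if slam:
        return "go-and-slam"

    return "slider"


print(select_expert(dist1=200.0, incid=10.0, dist2=0.0, n2=0))
```

Rule order matters: an input that satisfies both the first and the second condition is handled by go-and-stick, because the list is evaluated top to bottom.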
CCSC is, however, only as good as its experts. The error
of its hypotheses converges to the minimum error of the
experts, not to the overall minimal error. The solution is
to integrate CCSC with case-based learning by using a
case-based learner as one of CCSC's experts. The next
section evaluates these algorithms in practice.
(LAMBDA (X Y TILT DIST1 INCID DIST2 N1 N2)
  (COND ((OR (AND (>= DIST1 176.13756)
                  (< DIST1 213.34853)
                  (< INCID 13.5))
             (AND (< DIST1 213.34853)
                  (< DIST2 91.29100)
                  (>= INCID 13.5)
                  (< INCID 45.00000))
             (AND (>= INCID 45.0000)
                  (< DIST1 37.96659)
                  (< N2 25))
             (AND (< DIST1 213.34853)
                  (>= INCID 45.00000)
                  (>= N2 25)))
         (GO-AND-STICK X Y TILT DIST1 INCID DIST2 N1 N2))
        ((OR (AND (< DIST1 7.66117)
                  (< INCID 36.5))
             (AND (>= DIST1 7.66117)
                  (< INCID 63.5)))
         (GO-AND-SLAM X Y TILT DIST1 INCID DIST2 N1 N2))
        (T (SLIDER X Y TILT DIST1 INCID DIST2 N1 N2))))
Figure 4. An Operator Hypothesis Produced by
CCSC - This hypothesis has a mean error of
less than 7.0.
4. EVALUATION
Three series of experiments were used to test the
algorithms.
4.1. RAW DATA EXPERIMENTS
In the raw data experiments, three operator learners were
tested: the case-based learner, the simple CCSC learner
(CCSC with the simple experts of section 2), and the
combined learner (CCSC using the simple experts as
well as the case-based learner as an expert). To help
measure each learner's sensitivity to the dimensionality
of the input space, between zero and three additional
input attributes were added. The value of each of these
attributes was chosen randomly according to the uniform
distribution over the range 0 to 100. When the total
number of attributes was three, all the learners did about
the same. But as the number of attributes was increased to
six, the CCSC learners needed fewer examples to
produce hypotheses that made better predictions (figure
5).
Figure 5. Error Curves for Learning on the Raw Data
4.2. TRANSFORMED DATA EXPERIMENTS
The second series of experiments tested the same
algorithms on transformed data. The transformation
allowed all the learners to exploit the symmetry of the
tray problem and made it easier to measure the learners'
convergent behavior.
Referring back to figure 2, the attributes of the
transformed data are 1) the distance from a to b, 2) the
incident angle φ, and 3) the distance from b to c.
Figure 6. Error Curves for Learning on the
Transformed Data (The x- and y-axes differ.)
When the number of attributes was three and the number
of examples was small, all the learners achieved about the
same accuracy. As predicted, as the number of examples
increased, case-based learning and CCSC with case-based
learning did best (figure 6a). As the number of
attributes was increased to six, the CCSC learners showed
much faster learning (figure 6b).
4.3. GREEDY EXPERT SELECTION VERSUS CONCEPTUAL EXPERT SELECTION
Section 3.3 contained a prediction that CCSC's
conceptual expert selection method would produce
hypotheses that were more accurate than those produced
by greedy expert selection. This prediction was tested
with repeated runs of CCSC and greedy expert selection
on the raw tray data with no extra attributes. CCSC often
performed significantly better than the greedy algorithm.
On average, CCSC's error rate was 10% lower than the
greedy method's error rate.
5. CONCLUSION
This paper described the problem of learning the effects
of operators from examples. The operators may have
parameters (for example, a parameter that specifies the
direction of the tilt). They may also be noisy and
nondeterministic. The input to the learning system also
includes a set of experts (in the form of programs) some
of which may be created automatically. The learning
program tries to learn which expert to apply to any
particular problem.
Several learning algorithms were considered. The best
algorithm was a hybrid in which CCSC used the
case-based algorithm as one of its experts. This
algorithm learned significantly faster than the
case-based learner and, unlike the first version of CCSC,
converged toward the minimal error.
Work in progress addresses three limitations of CCSC's
current implementation. First, the operator hypotheses
produced by CCSC should do more than make a
prediction; they should also estimate the error of the
prediction. Second, CCSC should be evaluated on more
problems, including synthetic problems generated from
mathematical models. Work on such models has begun.
Third, some experts should be constructed from
primitives such as translate_point. The beginnings of
such a system, for discrete, errorless domains, are
described in [Kadie, 1988].
Despite these limitations, CCSC offers immediate
benefits to those who wish to learn operator effects. It
shows how background knowledge can be used to
improve this type of inductive learning. It is especially
useful when the dimensionality of the input space is
high.
Acknowledgments
Support was provided by the Fannie and John Hertz
Foundation and ONR grant N00014-88-K124. Thanks to
Alan Christiansen of the Carnegie Mellon University
School of Computer Science for providing the
tilting-tray data.
References
[Bennett, 1990] Scott W. Bennett. Reducing real-world
failures of approximate explanation-based rules. In
Proceedings of the Seventh International Conference on
Machine Learning, pages 226-234, Morgan Kaufmann
Publishers, June 1990.
[Christiansen, et al, 1990] Alan D. Christiansen,
Matthew T. Mason, and Tom M. Mitchell. Learning
reliable manipulation strategies without initial physical
models. In IEEE International Conference on Robotics
and Automation, Cincinnati, May 1990.
[Falkenhainer and Michalski, 1986] Brian Falkenhainer
and Ryszard S. Michalski. Integrating qualitative and
quantitative discovery: the Abacus system. Machine
Learning, 1(4), 1986.
[Kadie, 1988] Carl M. Kadie. Diffy-S: learning robot
operator schemata from examples. In Proceedings of the
Fifth International Conference on Machine Learning,
pages 430-436, Morgan Kaufmann Publishers, June
1988.
[Kadie, 1990] Carl M. Kadie. Conceptual set covering:
improving fit-and-split algorithms. In Proceedings of the
Seventh International Conference on Machine Learning,
pages 40-48, Morgan Kaufmann Publishers, June 1990.
[Mason, et al, 1989] M. T. Mason, A. D. Christiansen,
and T. M. Mitchell. Experiments in robot learning. In
Proceedings of the Sixth International Workshop on
Machine Learning, Ithaca, NY, June 1989.
[Moore, 1990] Andrew W. Moore. Acquisition of
dynamic control knowledge for a robot manipulator. In
Proceedings of the Seventh International Conference on
Machine Learning, pages 244-252, Morgan Kaufmann
Publishers, June 1990.
[Quinlan, 1986] J. Ross Quinlan. Induction of decision
trees. Machine Learning, 1(1), 1986.