Thesis

Action Recognition for Robot Learning
ALESSANDRO PIEROPAN
Doctoral Thesis
Stockholm, Sweden 2015
TRITA-CSC-A-2015:09
ISSN-1653-5723
ISRN-KTH/CSC/A–15/nr-SE
ISBN 978-91-7595-561-2
Computer Vision and Active Perception
School of Computer Science and Communication
KTH Royal Institute of Technology
SE-100 44 Stockholm, Sweden
Copyright © 2015 by Alessandro Pieropan except where otherwise stated.
Tryck: Universitetsservice US-AB 2015
Abstract
This thesis builds on the observation that robots cannot be programmed
to handle any possible situation in the world. Like humans, they need mechanisms to deal with previously unseen situations and unknown objects. One
of the skills humans rely on to deal with the unknown is the ability to learn
by observing others. This thesis addresses the challenge of enabling a robot
to learn from a human instructor. In particular, it focuses on objects. How can a robot find previously unseen objects? How can it track objects with its gaze? How can objects be employed in activities? Throughout this
thesis, these questions are addressed with the end goal of allowing a robot
to observe a human instructor and learn how to perform an activity. The
robot is assumed to know very little about the world and it is supposed to
discover objects autonomously. Given visual input, object hypotheses are formulated by leveraging common contextual knowledge often used by humans (e.g. gravity, compactness, convexity). Moreover, unknown objects are tracked and their appearance is updated over time, since only a small fraction of each object is visible to the robot initially. Finally, object functionality is inferred by observing how the human instructor manipulates objects and how objects are used in relation to one another. All the methods included in this thesis have been evaluated on datasets that are publicly available or that we collected, showing the importance of these learning abilities.
Sammanfattning
This thesis builds on the premise that robots cannot be programmed to handle every conceivable situation. Like humans, they need mechanisms for dealing with previously unseen situations and unknown objects. One of the skills humans rely on to handle new situations is the ability to learn by observing other people. The theme of this thesis is to enable a robot to learn from a human instructor. We focus in particular on objects. How can a robot find previously unseen objects? How can it follow objects with its gaze? How can objects be used for different purposes? These questions are addressed with a focus on robot learning from demonstration: the robot observes a human instructor and learns to perform an activity. The robot is assumed to know very little about the world and must discover objects on its own. The robot observes and formulates object hypotheses by exploiting the contextual knowledge about physical objects that is often used by humans (e.g. gravity, compactness, convexity). The object hypotheses are tracked over time and the robot gradually learns more about the objects' appearance, since only a small part of each object is visible to the robot initially. Finally, the robot infers the functionality of objects by observing how the human instructor manipulates them and how objects interact with one another. All methods included in this thesis have been evaluated on publicly available datasets, demonstrating the importance of these learning abilities.
Acknowledgments
This journey was not easy at all. Two years ago I couldn’t even imagine myself
being at this point. There are some people I would really like to thank for helping me make it this far. First, thanks to Hedvig for:
• Offering me this opportunity.
• Guiding me along the journey.
• Believing in me even when I did not.
• Tolerating my complaints about writing.
Furthermore, thanks to Carl Henrik for sharing his energy, positive attitude and
for being there when I needed it the most. Thanks to Niklas for having so many roles
in my life during these years: colleague, landlord, supervisor and more importantly
friend. (What is next?). Thanks to Renaud for being such a great travelling
companion during conferences (good job my friend!). Cheng, the greatest office
mate, you really helped me with your hopeless optimism. Michele for sharing so
many coffee breaks with me. And Virgile for appearing always right on time.
Giampi for the wonderful collaboration started during a coffee break. Magnus for
being a great friend even though I am not from Åkersberga. Miro for showing me the joy
of playing beach volley in Sweden. Puren for being such a cheerful office mate. Karl
for sharing with me his knowledge about tracking and Belgian beers. I would like to
thank Professor Ishikawa for being so welcoming and hosting me in his laboratory in Tokyo. The bandy group for so many fun and competitive matches. The
Amazon team for such a fun experience: Mr.Hang, Francisco, Johannes, Karl and
Michele. Thanks to all past and current Cvappers who contributed to making
our laboratory such a great working place: Alper, Jeannette, Xavier, Xavi, Marin,
Heydar, Vahid, Hossein, Gert, Oscar, Raresh, Nils, Yasemin, Omid, Andrej, Martin,
Sergio, Johan, Ivan, Marianna, Alejandro, Akshaya, Yuquan, Patric, Aaron, Erik,
Emil, Ali, Anastasia, Fredrik, Kristoffer, Lazaros, Florian, Ioannis, John, Christian,
Petter, Josephine, Atsuto, Mårten, Frine, Jan-Olof, Stefan, Tove, Ramviyas, Zhang
and Mikael. Thanks to all reviewers who helped me with invaluable feedback about
my work, improving my research and helping me to become the researcher I am.
Thanks go to Dani, who offered me, an unknown student writing from Italy, a Master's thesis project.
Finally I would like to thank all my friends and my family. My parents Giorgio
and Carla who supported me in all my choices. My sister Francesca who is always
present. Furthermore, I am extremely grateful to Serena for always trusting in me
and standing by my side all these years. Without you I would not be here writing
this.
Contents

I  Introduction

1  Introduction
   1  Thesis Contributions
   2  Thesis Outline

2  Robot Learning and Object Understanding
   1  Problem Statement
   2  Example Scenario
   3  Challenges

3  Summary of Papers
   A  Unsupervised Object Exploration Using Context
   B  Robust 3D Tracking of Unknown Objects
   C  Functional Object Descriptors for Human Activity Modeling
   D  Recognizing Object Affordances in Terms of Spatio-Temporal Object-Object Relationships
   E  Audio-Visual Classification and Detection of Human Manipulation Actions

4  Conclusions
   1  Future Work

Bibliography

II  Included Publications

A  Unsupervised Object Exploration Using Context
   1  Introduction
   2  Related Work
   3  Contextual Segmentation
   4  Experiments
   5  Conclusions
   References

B  Robust 3D Tracking of Unknown Objects
   1  Introduction
   2  Related Work
   3  Method
   4  Experiments
   5  Conclusions and Future Work
   References

C  Functional Object Descriptors for Human Activity Modeling
   1  Introduction
   2  Related Work
   3  Extraction of Object Hypotheses
   4  Functional Object Representation
   5  Results
   6  Conclusions
   References

D  Recognizing Object Affordances in Terms of Spatio-Temporal Object-Object Relationships
   1  Introduction
   2  Related Work
   3  Methodology
   4  Representation
   5  Experiments
   6  Conclusions
   References

E  Audio-Visual Classification and Detection of Human Manipulation Actions
   1  Introduction
   2  Related Work
   3  Dataset
   4  Pose Estimation
   5  Model
   6  Features
   7  Experimental Results
   8  Conclusions
   References
Part I
Introduction
Chapter 1
Introduction
Robots are useful for a wide variety of tasks, but introducing them comes at a high financial cost. There are three main criteria robotics engineers consider when evaluating the need for robots in a certain application; if a situation meets one of these criteria, robots most likely can help. Those criteria are dull, dirty and dangerous, and they are known as the three Ds of robotics. As an example, the work on an assembly line meets the first criterion, dull, as it is very boring for a human since it needs to be repeated over and over. Thus, car manufacturing is one of the most successful examples of the application of robotics.
However, there are two major limitations to the application of robots even in such a relatively well-structured scenario. Firstly, an assembly line requires a team of specialized engineers to set up the whole machinery in order to produce a defined product. Nothing is left to chance; everything needs to be exactly where it is supposed to be. As can be seen in Figure 1, the chassis of a vehicle is positioned on a moving platform that is programmed to stop at defined locations for a fixed amount of time. When the platform stops, the robots positioned around the chassis execute the task they were programmed for. Once the task is completed, the moving platform transports the chassis to the next checkpoint. Every robot performs its task on a specific location of the chassis. Therefore, every time the assembly line needs to switch to a different car model, the team of engineers has to retune the line according to the new design. This results in downtime that affects the productivity of the assembly line; the shorter the downtime, the more profitable a line involving robots is. Secondly, those robots need to operate in cages, since it is very dangerous for a human to be around them while the line is working. Clearly, these limitations reduce the applicability of robotics, because either the setup and changeover times make the assembly line unprofitable, or the product requires a human in the process who is able to operate close to robots.
Only recently have companies started investing in the development of robots that can work side by side with humans and that can be trained interactively by people without special training, e.g. Rethink Robotics. This is already a step forward. Such adaptive robots could be deployed in more flexible assembly lines, allowing an existing line to be used to produce a new product and minimizing the changeover time.

Figure 1: Example of an assembly line. The chassis of a car model is placed on a moving platform that stops at predetermined positions. The robotic arms at each location execute the actions they are programmed for on the chassis. The robotic assembly line is surrounded by walls to avoid any injury to employees.
Production is not the only field that can benefit from such adaptive robots. An application that meets the second criterion, dirty, is the navigation of sewers and the detection of clogged sewer pipes. In such a scenario, however, robots are still controlled remotely, given the unstructured environment they need to explore. Robots can also be used in natural or man-made disaster emergencies where human intervention is not possible due to high health risks; this meets the third criterion: dangerous.
The disaster of Fukushima was a wake-up call for the robotics community. The Japanese media wondered why a country well known for its cutting-edge robotics technology was unable to respond to the emergency.¹ It was not possible to
attempt to repair the damaged reactors because the levels of radiation were too
dangerous for the emergency crew. Since Japan had no robot able to operate in
the scenario, the iRobot company provided four robots (two PackBot 510s and
two Warrior 710s) to assist. The robots could provide video from inside the power plant; however, they were not able to execute any of the tasks needed to slow the meltdown.

¹ Domestic robots failed to ride to rescue after No. 1 plant blew, The Japan Times.
Robots that can adapt and operate in a scenario similar to Fukushima could be really helpful in preventing further disasters in the future. Moreover, scenarios like the one in Fukushima may have areas where it is not possible to control a robot remotely; therefore it is necessary to have autonomous robots. The Defense Advanced Research Projects Agency (DARPA) is currently investing a lot of effort in this direction.
“An oncoming demographic inversion will leave us with fewer people to do all kinds of jobs”
— Professor Rodney Brooks
Another sector that does not meet any of the introduced criteria, but that can benefit from the use of robotics, is healthcare. It has been noticed in the past decade that the median age of the population of western countries is increasing while the birthrate is decreasing. This phenomenon is known as population ageing: the number of elderly people is growing while the number of people of working age is shrinking [20]. It is probable that in the future there will not be sufficient facilities and workforce to take care of all patients [11]. If we can have adaptive robots in a household environment, it becomes possible to provide continuous monitoring and assistance in the home, making aging-in-place a reality [11].
The limiting factor that prevents the deployment of robots in those situations is
the environment itself. A home environment is so unstructured and unpredictable
that a robot needs to adapt continuously in order to operate on its own. What is
missing today is robots’ ability to deal with the unknown and adapt.
Humans, on the other hand, can adapt to new situations very quickly. One of the fundamental abilities they rely on is learning from the people around them: understanding how to perform a certain action, or how something works, by observing others. This form of learning is especially prominent during childhood [5]. Such skills allow a human being to deal with the unknown by using another person as a model to emulate. Clearly, in order to have robots fully integrated into our daily routines, it is fundamental to implement this ability in robots.
Let us take a home robot as an example. Ideally, it should be possible to instruct this robot to make an espresso just by showing it how to use the coffee machine once, as any human being is able to do. Once the robot has learned that, it should be able to prepare an espresso even if the position of the coffee machine changes or if the old cups are replaced with new ones. A robot that can adapt in this way is deployable in any household environment.
Learning from demonstration is then required [3]. Robot learning from demonstration (also known as imitation learning) is the paradigm for enabling a robot to learn new tasks by observing an instructor, without being specifically programmed [40]. First, the robot needs the ability to observe the instructor performing a task. The robot needs to interpret the raw input data acquired through its sensors by detecting complex patterns and associating them with discrete classes. This is known in neuroscience as the binding problem [34]. Using our espresso example, if the robot is supposed to learn how to prepare coffee, it needs to observe the instructor and detect the coffee machine; otherwise it will learn something else. Second, the robot should learn what a task means and what its effect on the world is. The robot should understand all the objects required to perform the activity "make an espresso", and it should understand when the task has been completed with the right outcome. Third, the robot imitates the instructor and achieves the same end goal.
1
Thesis Contributions
The work in this thesis explores various aspects of the first two steps required in imitation learning: observe and learn. In particular, we focus on learning manipulation activities and contribute solutions to the following challenges:
Object discovery. In this thesis we focus on the unsupervised discovery of objects from visual sensory input. Our main contribution is a method that segments a scene into regions and estimates which regions correspond to manipulable objects; this can then be used for reasoning about objects and their functionality in manipulation activities. This is a fundamental capability for a robot that needs to learn a task from a human demonstrator. The task is challenging, however, since the problem of defining what an object is is ill-posed, as the nature of human perception is not fully understood. Looking at the same scene, there can be multiple interpretations of what objects are present [45]. We constrain the problem to objects that can be grasped and moved by a human (or a robot) and rely on contextual knowledge to find plausible object hypotheses without losing generality.
Object tracking. We propose a tracking algorithm that is robust and versatile: it can track unknown objects by learning the appearance of the object while it is tracked. Such a capability is fundamental for robot learning in a realistic scenario, given the unstructured nature of the environment. Still, few methods actually employ fully general tracking systems [31]. Instead, experiments are performed in simplified environments [2], or trackers work only with a limited set of specific objects [33].
Object affordance. We present object descriptors that can be used to understand how an object can be employed in human activity, rather than the more common approach of characterizing an object by its appearance. This is motivated by the fact that in order to learn a task successfully from an instructor, a robot needs to observe an action and its effect on the world, and then perform an action that has the same effect. Therefore, an agent should be able to reason about objects in terms of the current activity [15], by understanding the effect of an object on the world. This is known as object affordance [30].
Multi-modal action understanding. We propose a multi-modal approach that merges audio and visual input to understand an observed action. Such an approach mitigates the limitations of visual recognition, e.g., object occlusion and the fact that actions have to be performed within the field of view. Clearly, multi-modal understanding is a desirable capability in autonomous robots. However, while much effort has been spent on developing methods for visual perception [19, 24, 25], few studies have addressed robot understanding using auditory input, e.g., [42]. Moreover, to our knowledge, the task of audio-visual action recognition has only rarely been addressed in robotics, e.g., in [43].
2
Thesis Outline
The rest of the thesis is structured as follows:
2.1
Chapter 2: Robot Learning and Object Understanding
Chapter 2 motivates the importance of the challenges presented in the introduction by giving a practical example scenario where robot learning is applicable. The
challenges that arise in such a scenario are presented with a particular focus on the
ones addressed in this work.
2.2
Chapter 3: Summary of Papers
Chapter 3 introduces the papers included in the second part of this work, and how the problems are addressed with respect to previous work in the same area.
2.3
Chapter 4: Conclusion
We conclude with a discussion of the results achieved by this work and potential directions for continuing it in the future.
2.4
Part II: Included Papers
Five publications are included in Part II. The abstracts of these papers, along with the contributions made by the author, are given here. A short summary of each
paper is given in Chapter 3. The contributions are presented in the same order as
the challenges introduced in Section 1.
Paper A: Unsupervised Object Exploration Using Context
Alessandro Pieropan and Hedvig Kjellström. In Proceedings of the 2014
IEEE International Symposium on Robot and Human Interactive
Communication (ROMAN’14), Edinburgh, UK, August 2014.
Abstract:
In order for robots to function in unstructured environments in interaction with
humans, they must be able to reason about the world in a semantically meaningful way. An essential capability is to segment the world into semantically plausible object
hypotheses. In this paper we propose a general framework which can be used for
reasoning about objects and their functionality in manipulation activities. Our system employs a hierarchical segmentation framework that extracts object hypotheses
from RGB-D video. Motivated by cognitive studies on humans, our work leverages
on contextual information, e.g., that objects obey the laws of physics, to formulate
object hypotheses from regions in a mathematically principled manner.
Contribution by the author:
Designed and implemented the RGB-D segmentation algorithm. Designed an object
hypothesis formulation method based on contextual knowledge. Performed the
evaluation of the system on a public dataset.
Paper B: Robust 3D Tracking of Unknown Objects
Alessandro Pieropan, Niklas Bergström, Masatoshi Ishikawa and Hedvig
Kjellström. In Proceedings of the 2015 IEEE International Conference on
Robotics and Automation (ICRA’15), Seattle, USA, May 2015.
Abstract:
Visual tracking of unknown objects is an essential task in robotic perception, of
importance to a wide range of applications. In the general scenario, the robot has
no full 3D model of the object beforehand, just the partial view of the object visible
in the first video frame. A tracker with this information only will inevitably lose
track of the object after occlusions or large out-of-plane rotations. The way to
overcome this is to incrementally learn the appearances of new views of the object.
However, this bootstrapping approach is sensitive to drifting due to occasional
inclusion of the background into the model.
In this paper we propose a method that exploits 3D point coherence between
views to overcome the risk of learning the background, by only learning the appearances at the faces of an inscribed cuboid. This is closely related to the popular
idea of 2D object tracking using bounding boxes, with the additional benefit of
recovering the full 3D pose of the object as well as learning its full appearance from
all viewpoints.
We show quantitatively that the use of an inscribed cuboid to guide the learning
leads to significantly more robust tracking than with other state-of-the-art methods.
We show that our tracker is able to cope with 360 degree out-of-plane rotation, large
occlusion and fast motion.
Contribution by the author:
Designed a real-time tracking algorithm for unknown objects. Designed an adaptive mechanism to learn new appearances of an object. Acquired a dataset used
for evaluation. Manually labeled the dataset with ground truth. Performed evaluation of the system on the dataset. Benchmarked against state-of-the-art tracking algorithms.
Paper C: Functional Object Descriptors for Human Activity Modeling
Alessandro Pieropan, Carl Henrik Ek and Hedvig Kjellström. In
Proceedings of the 2013 IEEE International Conference on Robotics and
Automation (ICRA’13), Karlsruhe, Germany, May 2013.
Abstract:
The ability to learn from human demonstration is essential for robots in human
environments. The activity models that the robot builds from observation must
take both the human motion and the objects involved into account. Object models designed for this purpose should reflect the role of the object in the activity
– its function, or affordances. The main contribution of this paper is to represent objects directly in terms of their interaction with human hands, rather than
in terms of appearance. This enables the direct representation of object affordances/function, while being robust to intra-class differences in appearance. Object hypotheses are first extracted from a video sequence as tracks of associated
image segments. The object hypotheses are encoded as strings, where the vocabulary corresponds to different types of interaction with human hands. The similarity
between two such object descriptors can be measured using a string kernel. Experiments show these functional descriptors to capture differences and similarities in
object affordances/function that are not represented by appearance.
Contribution by the author:
Recorded the dataset used for evaluating the method presented in the paper. Designed a temporal feature descriptor to capture object functionality. Designed the
method to segment and track objects used to extract the feature descriptor. Performed evaluation of the system comparing the results with appearance based descriptors.
Paper D: Recognizing Object Affordances in Terms of Spatio-Temporal
Object-Object Relationships
Alessandro Pieropan, Carl Henrik Ek and Hedvig Kjellström. In
Proceedings of the 2014 IEEE/RAS International Conference on Humanoid
Robots (HUMANOIDS’14), Madrid, Spain, November 2014.
Abstract:
In this paper we describe a probabilistic framework that models the interaction
between multiple objects in a scene. We present a spatio-temporal feature encoding
pairwise interactions between each object in the scene. By the use of a kernel
representation we embed object interactions in a vector space which allows us to
define a metric comparing interactions of different temporal extent. Using this
metric we define a probabilistic model which allows us to represent and extract the
affordances of individual objects based on the structure of their interaction. In this
paper we focus on the presented pairwise relationships but the model can naturally
be extended to incorporate additional cues related to a single object or multiple
objects. We compare our approach with traditional kernel approaches and show a
significant improvement.
Contribution by the author:
Designed a spatio-temporal descriptor that captures pairwise object relationships.
Designed a probabilistic model to infer object functionality. Performed the evaluation outperforming the results achieved in Paper C.
Paper E: Audio-Visual Classification and Detection of Human Manipulation Actions
Alessandro Pieropan, Giampiero Salvi, Karl Pauwels and Hedvig
Kjellström. In Proceedings of the 2014 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS’14), Chicago, USA, September
2014.
Abstract:
Humans are able to merge information from multiple perceptual modalities and
formulate a coherent representation of the world. Our thesis is that robots need
to do the same in order to operate robustly and autonomously in an unstructured
environment. It has also been shown in several fields that multiple sources of
information can complement each other, overcoming the limitations of a single
perceptual modality. Hence, in this paper we introduce a dataset of actions that
includes both visual data (RGB-D video and 6DOF object pose estimation) and
acoustic data. We also propose a method for recognizing and segmenting actions
from continuous audio-visual data. The proposed method is employed for extensive
evaluation of the descriptive power of the two modalities, and we discuss how they
can be used jointly to infer a coherent interpretation of the recorded action.
Contribution by the author:
Acquired a dataset used to measure the quality of the framework. Manually labeled
the ground truth. Designed a model to perform recognition of pre-segmented sub-actions. Designed a model to perform recognition of sub-actions in a continuous data
stream. Implemented a multi-modal framework for action recognition. Evaluated
the quality of the model using the dataset recorded.
Other Publications. Apart from the papers included in this thesis, the following
publications have been produced during my Ph.D. studies.
Alessandro Pieropan, Niklas Bergström, Masatoshi Ishikawa and Hedvig
Kjellström. A Robust Object Tracker Using Structured Learning. In review
for the Advanced Robotics Journal.
Alessandro Pieropan, Niklas Bergström, Hedvig Kjellström and Masatoshi
Ishikawa. Robust Tracking Through Learning. In Annual Conference of the
Robotics Society of Japan, Japan, 2014
Alessandro Pieropan, Giampiero Salvi, Karl Pauwels and Hedvig
Kjellström. A dataset of human manipulation actions. In Grasping
Challenge Workshop at International Conference on Robotics and
Automation, Hong Kong, 2014.
Alessandro Pieropan, Carl Henrik Ek and Hedvig Kjellström. On Object
Affordances. In Grasping Challenge Workshop at International Conference
on Robotics and Automation, Hong Kong, 2014.
Chapter 2
Robot Learning and Object Understanding
1
Problem Statement
The fundamental problem addressed in this thesis is that of object understanding
in the context of robot learning from a human demonstrator. A robot should be
able to understand how to divide the sensory input into entities, i.e., objects, and it
should be able to reason about them in a way that is effective for modeling human
activity. In order to provide a clear idea of the problem we present an example
scenario.
2
Example Scenario
You want to have a robot cooking okonomiyaki¹ in your restaurant. You bought a
dual arm robot and you hired a professional cook to show the robot how to prepare
an okonomiyaki (Figure 1). The robot comes from the factory with a set of limited
motor capabilities: it can move the arms to reach a certain location expressed in
coordinates, it can open and close the gripper and it can move the head to direct
the gaze. In a regular industrial environment this is enough to fulfill many tasks.
As an example, the following sequence of motor primitives can solve a pick and
place task: move arm to location A, close gripper, move arm to location B, open
gripper. In this scenario the robot operates blindly without any real perception of
the environment around it, it has no concept of object, action or task completion.
Without any perception there are many things that can go wrong in this situation.
The object may be misplaced so that when the gripper closes at location A it does
not grasp anything. The container at location B may be misplaced so that when
the gripper opens the object drops on the floor. The robot can break something or
harm someone if they stand in the path from A to B. Clearly a robot that operates
in a restaurant, sharing the space with people, needs a perceptual mechanism to constantly check actions and effects in order to adapt to perturbations.

¹ A typical Japanese dish.

Figure 1: Example of a cooking scenario where the robot learns from a human instructor. The robot needs to detect the possible objects present in the scene; once they are found, it needs to track them while the human moves them. Finally, it needs to understand the activity the human performed and reproduce it with its own actuators.
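To make the limitation concrete, the blind pick-and-place sequence above can be written as a fixed, open-loop program in which nothing checks whether a grasp succeeded or whether the path is clear. The Robot class below is a hypothetical stand-in for the factory-provided motor interface, not a real robot API.

```python
# Open-loop pick-and-place, as described above: a fixed sequence of motor
# primitives with no perception. The Robot class is a hypothetical stand-in
# for the factory-provided motor interface.

class Robot:
    def move_arm(self, xyz):
        print(f"moving arm to {xyz}")

    def close_gripper(self):
        print("closing gripper")

    def open_gripper(self):
        print("opening gripper")

def blind_pick_and_place(robot, location_a, location_b):
    robot.move_arm(location_a)   # assumes the object is exactly at A
    robot.close_gripper()        # nothing verifies that a grasp succeeded
    robot.move_arm(location_b)   # assumes the path A -> B is free of people
    robot.open_gripper()         # assumes the container is exactly at B

if __name__ == "__main__":
    blind_pick_and_place(Robot(), (0.4, 0.1, 0.2), (0.4, -0.3, 0.2))
```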
In this specific scenario, the robot is also supposed to learn something new from a
cook. First of all the robot needs to detect the instructor to learn from. Then it
needs to detect the objects in order to handle a possible misplacement. The actions
and sub-actions that are required to complete the recipe need to be detected and
learned by the robot with a particular focus on the end goal achieved rather than
the exact way they are achieved. This is motivated by the fact that two people do
not perform an activity the exact same way. The small differences in our body shape
and proportions influence the way we perform an action. There are also more subtle
factors that influence our behavior such as culture, society or just family education.
One person may grasp a glass of wine by the body while another by the stem; despite this difference, the end goal is the same: to drink. A robot has a different
embodiment compared to a human, e.g., it can have a gripper with three fingers
rather than five. Its activity representation should be invariant to such differences,
and focus on the outcome of the activity. This way it can reproduce the action
using its motor primitives and make sure that the same end goal is achieved.
3
Challenges
The example previously described presents many challenges that need to be addressed. The input sensory data, no matter what type of sensor, need to be processed. This thesis focuses on visual and auditory data, but much can be done using other inputs such as tactile sensing. The task presented in the example can be split into three steps: observe, learn and reproduce. This thesis focuses on the first two, trying to address the observe and learn problem in a manner similar to that motivated in neuroscience [6]:

segregation: the complex patterns in the data need to be discretized into classes. This is done by extracting entities from the data, such as objects, the human instructor, the human hands and so on.

combination: the extracted entities need to be combined to understand the experience; in our case, understanding the whole action in order to reproduce it.

Figure 2: Example of segmentation algorithms. (a) Segmentation: the most general approach is taken; pixels in the image are clustered together only by color coherence. (b) Object detection: a supervised approach is taken; the objects are known beforehand, so the problem reduces to object detection.
The end goal of this work is to understand meaningful semantic concepts that
can be transferred to the robot. Therefore, when the robot knows the goal that
needs to be achieved it can use its own way to achieve it. There are many ways to
address the problems introduced. A possible approach, very industrial in a sense, would consist in having a robot that recognizes exactly the tools present in your restaurant, with tools and ingredients placed in predefined positions. This thesis aims for the opposite: it strives for generality.
3.1
Object Discovery
In the first part of our work (presented in Paper A), semantic and contextual information are used to find object hypotheses. Such an approach is motivated by recent
studies in Neuroscience that stress the importance of context in helping humans
recognize and detect objects [7, 14, 17]. Concepts such as gravity, convexity
or symmetry are helpful for humans in finding objects. As an example, given an
image of a completely white room with no furniture, it can be very difficult for a
human to determine which surface is the floor and where the walls are. Some may
say that the floor is the bottom surface assuming that the photo has been taken
horizontally. However, if a human is in the room, they immediately determine where the floor is, because they perceive the vertical direction through gravity.
Yet, the two dominant approaches within the Computer Vision community exploit
context very little. On one side, there is a task-driven approach that consists in
training a robot to detect a predefined set of objects that are needed for modelling
the range of tasks that the robot should learn [33] (see Figure 2b). In general such
an approach can produce good results but it is not feasible in reality. Just consider
a simple object like a cup: how many cups of different appearance may be out
there? Clearly it is not possible to pre-train a robot to recognize every single cup
in the world. A mechanism that can generalize is needed.
In contrast, the second common approach consists in partitioning an image into coherent regions which are hypothesized to correspond to objects. This is called segmentation. The process is considered to be successful in finding objects when the segmentation boundaries coincide with the object boundaries. The simplest way to segment an image is to apply a clustering algorithm (see Figure 2a) such as k-means [29]. More complex methods represent the image as a graph and produce clusters by cutting its connectivity when certain conditions are not met [13, 37].
The main limitation of all the above methods is that the concept of object is ill-defined. Sometimes an object corresponds to exactly one cluster; in that case the segmentation is successful. However, sometimes an object corresponds to multiple segments, which is quite common with highly textured objects; this is defined as oversegmentation. Alternatively, multiple objects can be grouped into the same segment (undersegmentation); this may happen when they are very similar in color. The recent advent of cheap depth cameras has improved the task of generating segments that respect object boundaries. Some recent works have shown that it is possible to obtain a robust matching with object boundaries in images by detecting edges in the range images and applying a simple flood-fill algorithm [1].
However, even though such an approach contributes considerably to preserving object edges, it is more suited to finding smooth surfaces than objects. A cereal box, for example, is going to be segmented into six separate surfaces. Clearly, one more step is needed to merge the surfaces into the object. It is not possible to find something if we do not know what we are looking for. This thesis is concerned with finding objects without losing generality; this is where contextual knowledge comes into play. In the context of actions involving the manipulation of objects, a robot can exploit contextual knowledge such as gravity, compactness, symmetry, size or relationships between surfaces [35] to find complex patterns that correspond to objects in the sensory input.
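As a concrete illustration of how such cues can be combined (and not the algorithm of Paper A), a candidate cluster of 3D points can be given a simple objectness score from its graspable size, its compactness, and whether it rests on the supporting surface. All thresholds below are hypothetical.

```python
import numpy as np

# Illustrative "objectness" score for a candidate cluster of 3D points,
# combining contextual cues of the kind listed above: graspable size,
# compactness, and support against gravity. Thresholds and weights are
# hypothetical, not those of the method in Paper A.

def objectness(points: np.ndarray, support_height: float) -> float:
    """points: (N, 3) array in meters, with z pointing up (against gravity)."""
    extent = points.max(axis=0) - points.min(axis=0)

    # Graspable size: largest side roughly between 2 cm and 30 cm.
    size_cue = 1.0 if 0.02 < extent.max() < 0.30 else 0.0

    # Compactness: how evenly the cluster fills all three directions
    # (close to 1.0 for a blob-like shape, low for thin sheets).
    compact_cue = extent.min() / max(extent.max(), 1e-6)

    # Gravity/support: the cluster should rest on the supporting plane,
    # not float above it.
    gap = points[:, 2].min() - support_height
    support_cue = 1.0 if abs(gap) < 0.02 else 0.0

    return (size_cue + compact_cue + support_cue) / 3.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cup = rng.uniform([0.0, 0.0, 0.70], [0.08, 0.08, 0.80], size=(500, 3))
    print(f"objectness of a cup-like cluster: {objectness(cup, support_height=0.70):.2f}")
```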
Figure 3 shows an example where context is important. If we are looking for chairs, we do not consider the object on the desk as a potential candidate,
or the lamp hanging from the ceiling, since that is not the context in which we usually find chairs. This figure is also interesting in terms of affordance (see Section 3.3).

Figure 3: The "chair challenge" [10]. A synthetic environment where many chairs and chair-like objects are present. Relying only on appearance will not help in finding real chairs; the context, however, can help in the discrimination. The small object on the desk cannot be a chair, and neither can the one hanging from the ceiling.
3.2
Challenges with Tracking Objects
Once the robot formulates valid object hypotheses, it should be able to follow the
objects with its gaze if they move; this problem is known as tracking. Since we
strive for generality, we contribute with a method (presented in Paper B) able to
track unknown objects robustly. If a robot detects a new unknown object, only a
small fraction of it is initially visible. Therefore the robot needs to adapt and build
its knowledge of the object when new segments of the object become visible. Our
method learns new appearances of the object assuming its appearance may vary.
Visual tracking of unknown objects is a challenging task studied actively within
the Vision community. In the spirit of our approach, some methods learn the appearance of the model continuously so that the algorithm can adapt to changes in
appearance due to the movement of the object [21, 23]. However, such an approach
Figure 4: Example of different tracking algorithms that learn the object model
while it is manipulated. Some of the algorithms learn the hands as part of the
object model and drift away from the object.
presents an aggravating factor to the problem of tracking, often referred to as the
drifting problem: upon learning a new instance of the object it is crucial to understand what part of the new instance effectively belongs to the object and what
portion belongs to something else, like the background or another object occluding
the one of interest. Without a mechanism to discriminate between good and bad
appearance candidates, sooner or later the algorithm will learn the background as part of the object model, and the tracker will lose focus on the object, as shown in Figure 4. Our method employs a mechanism that limits the drifting problem and bounds the complexity of the learning procedure, removing the need to learn continuously.
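One simple way to limit drifting in the spirit described above (though not the exact mechanism of Paper B) is to add new appearance features to the model only when the current pose estimate is trusted and the features fall well inside the estimated object region. A sketch, with hypothetical thresholds:

```python
import numpy as np

# Sketch of a conservative appearance-update rule to limit drifting: new
# feature descriptors are added to the object model only when the tracker is
# confident and the features lie well inside the estimated object region.
# This illustrates the general idea discussed above, not the exact mechanism
# of Paper B; all thresholds are hypothetical.

def update_model(model_descriptors, new_descriptors, new_points,
                 object_center, object_radius, tracker_confidence,
                 conf_threshold=0.8, margin=0.8, max_model_size=2000):
    if tracker_confidence < conf_threshold:
        return model_descriptors                 # unreliable pose: learn nothing

    dists = np.linalg.norm(new_points - object_center, axis=1)
    inside = dists < margin * object_radius      # ignore points near the boundary,
                                                 # which are likely background
    accepted = new_descriptors[inside]

    model = np.vstack([model_descriptors, accepted])
    return model[-max_model_size:]               # bound the model size

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    model = rng.normal(size=(100, 32))
    new_desc = rng.normal(size=(20, 32))
    new_pts = rng.normal(loc=0.0, scale=0.05, size=(20, 3))
    updated = update_model(model, new_desc, new_pts,
                           object_center=np.zeros(3), object_radius=0.1,
                           tracker_confidence=0.9)
    print(updated.shape)
```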
There are also works that assume that the object appearance does not change, so that there is no need to learn new appearances [32]. Such an approach is very robust when the underlying assumption is met; however, in case of drastic changes in appearance, the object is lost. An extensive overview of state-of-the-art unknown-object trackers is presented by Wu et al. [46], who thoroughly evaluate the best performing trackers.
An alternative and more restrictive approach consists in limiting the tracking problem to a set of objects that are known beforehand [33]. In such a scenario the
problem can be divided into detection and tracking. The 3D object model or a set of images of the object is used to extract features, which allow the known object to be detected in new images. Usually, very robust features such as the scale-invariant feature transform (SIFT) [27] are used; however, since tracking requires real-time performance, especially in robotics, recently faster but less robust
features are becoming popular. BRISK [26] and ORB [38] are the most popular
among these. The tracking problem is formulated as estimating the motion of the
object by computing the optical flow between two images. Some methods estimate a dense optical flow [9] to determine the position of the object, while others rely only on the estimated movement of a limited set of points [28]. In general, a dense approach is more robust in estimating the motion correctly, while a sparse approach is faster.

Figure 5: Chairs of various shapes and colors. Some are very common and easy to identify as chairs. Some are more difficult.
As discussed before, this work aims for generality; therefore, assuming that an object is known beforehand is very limiting.
3.3
Challenges with Understanding Object Functionality
In the previous sections we analyzed the challenges of finding and tracking unknown objects. Once the robot is capable of focusing on the objects, it then needs to understand what can be done with them. We contribute with two methods (presented in Papers C and D) that build upon the belief that objects should be modelled in terms of how they are used by the human demonstrator. In Paper C we show that it is possible to understand an object's functionality by looking at how it is manipulated by the human demonstrator during an action. In Paper D we present how the spatio-temporal relationships between different objects involved in the same activity encode information about their role in the action. Our methods are motivated by an inspiring work [16] that stresses how the functionality of objects directly correlates with the actions they are involved in.
This is confirmed by studies in psychology [15] which state that humans and other agents recognize objects in terms of their function, i.e., what kinds of use the objects afford. Figure 5 shows a set of chairs. Some are common, but the last two on the right are designer furniture that is not seen very often in a home. It is very easy to classify all of them as chairs by looking at this image. Still, an interesting question arises: would it be easy to recognize the last chair out of this context? Probably the fact that it appears in an image with other chairs biases our judgement. However, in a real scenario, if that piece of furniture is placed close to a table, or if a person is sitting on it, a human can easily infer its functionality. Figure 3 shows another illustration of the affordance concept: there are many chair-like objects in the scene, but only a few of them are valid chairs that afford sitting.
To a certain extent it is possible to understand the functionality of an object just by looking at its appearance. An approach that relies solely on visual features to reason about objects is more common within the Computer Vision community. For
an overview of such works see [12]. For tasks such as image retrieval or image
classification it is effective to exploit visual features; however, this is limiting for action recognition tasks. Visual features can be used to classify functional classes of
objects only under the strong assumption that each object maps to a single semantic class. This is not always true: there are classes of objects that afford different
functionalities in different contexts. Using a knife on a zucchini has a completely
different meaning compared to using it on another person.
Some works are in the spirit of our approach. [16] exploits the human pose to recognize objects that afford sitting. [44] understands functional categories of image
regions, such as roads and walkways, based on the behaviors of moving objects in
the vicinity. [41] proposes a framework that uses contextual interaction between
image regions to learn the contextual models of regions. Paper C presents how the
functionality of objects can be understood by looking at how the instructor manipulates them. Similarly to [16], the object functionality is characterized by how the human relates to it. Paper D, in the spirit of [41], leverages the spatio-temporal relationships between the objects to infer their respective roles.
3.4
Challenges with Multi-Modal Action Understanding
Figure 6: Sensors available in robots.
So far, all the challenges presented have been addressed from a visual perspective. However, vision, like any other source of input, is noisy and has limitations (e.g., objects may be occluded from view). We contribute with a method (presented in Paper E) that makes use of a second source of data, in our case audio, to compensate for the limitations of vision and improve the task of action recognition. First, such an approach allows us to evaluate the descriptive power of each source of information in solving a predefined task. Second, it is a first step towards a multi-modal approach that takes into account all the sensory input a robot can acquire (Figure 6).
Our approach is motivated by studies in Neuroscience that show that multiple
sensory inputs are merged in our brain with the purpose of making a coherent interpretation of the world [47, 34]. Yet, in Robotics, an approach that takes into
account more than one sensor is rarely taken into consideration. Furthermore, very little work in robot understanding exploits auditory information, even though previous attempts have shown that much can be done from that perspective. [39] learns object affordances by exploiting a linguistic description of the scene. Similarly, [43] uses language to recognize actions. We instead propose to use the sounds produced by actions as a distinctive signature for robot understanding. This is in the spirit of a similar work [42], but we move one step further with a multi-modal approach that merges audio and vision for action recognition. Such an approach mitigates the limitations of a single source: from the visual perspective, actions may be performed out of the field of view, or the object involved in an action may be occluded by other objects or by the human instructor; from the auditory point of view, some actions may not produce any sound.
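A minimal form of such fusion is a weighted combination of the per-modality class posteriors, so that whichever modality happens to be uninformative for a given action (a silent action, an occluded object) contributes little. The sketch below only illustrates this general idea; it is not the model of Paper E, and the action classes are hypothetical.

```python
import numpy as np

# Late fusion of per-modality action posteriors. Each modality outputs a
# probability distribution over action classes; the fused distribution is a
# weighted geometric mean. This only illustrates the general idea, not the
# model proposed in Paper E.

ACTIONS = ["pour", "chop", "stir", "idle"]   # hypothetical action classes

def fuse(p_visual: np.ndarray, p_audio: np.ndarray, w_visual: float = 0.5) -> np.ndarray:
    w_audio = 1.0 - w_visual
    fused = (p_visual ** w_visual) * (p_audio ** w_audio)
    return fused / fused.sum()

if __name__ == "__main__":
    p_visual = np.array([0.30, 0.30, 0.30, 0.10])   # object occluded: vision unsure
    p_audio = np.array([0.05, 0.85, 0.05, 0.05])    # chopping sound is distinctive
    p = fuse(p_visual, p_audio)
    print(dict(zip(ACTIONS, np.round(p, 2))), "->", ACTIONS[int(p.argmax())])
```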
Other works motivate the importance of a multi-modal approach. [18] infers the
content of a container while it is being grasped, exploiting visual and tactile input.
Moreover, [8] estimates the stability of a grasp using tactile and visual feedback.
Chapter 3
Summary of Papers
A
Unsupervised Object Exploration Using Context
This work tackles the problem of unsupervised unknown object detection since
the thesis strives for generality, as discussed in Section 3.1. This problem can be
addressed by clustering the pixels of an image into plausible object hypotheses. An
object is considered to be found if the algorithm generates one and only one cluster
corresponding to it. The paper builds upon a general graph-based segmentation
algorithm that generates clusters by cutting the minimum spanning tree [13] when
certain conditions are met. However, a general approach is not enough to find
objects since it is highly probable that an object may end up oversegmented or
undersegmented. Moreover, the definition of object is ill-posed per se. [45] has
shown that it is very difficult to have an objective evaluation of what an object is.
Figure 1: Example of a simple general segmentation algorithm. Without any contextual knowledge it is very unlikely that each object corresponds to a segment.
Highly textured objects are going to be oversegmented.
Figure 2: Unsupervised object discovery. (a) Input RGB-D image. (b) Superpixels
generated by an oversegmentation step. (c) Concatenation of superpixels into 3D
surface segments, or facets, according to surface orientation. (d) Concatenation of
facets into convex 3D shapes. (e) Measuring segment objectness; the image shows
segments with high objectness.
Therefore, the strategy in this paper consists in leveraging contextual knowledge to find suitable object hypotheses. The context is given by the task that needs to be executed. In our case, given that we desire a robot able to learn manipulation activities, the definition of objects is that they are convex assemblies of smooth surfaces, of the right size to be grasped and moved. The proposed algorithm generates an oversegmentation of the scene and uses the contextual knowledge in a hierarchical manner to merge the segments into plausible object hypotheses. First, segments are merged into surfaces, assuming that objects have piecewise smooth curvature. Then, surfaces are grouped according to the convexity, size and compactness of the merged cluster. Finally, gravity is used to determine the likelihood of a cluster being an object hypothesis: graspable objects are much more likely to be found on a planar surface than floating in the air. The steps of the algorithm are shown in Figure 2.
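A rough sketch of the agglomeration idea (with a placeholder predicate, not the actual tests of Paper A): start from an oversegmentation and repeatedly merge neighbouring segments that satisfy a contextual predicate, first for smooth curvature, then for convexity, size and compactness.

```python
# Sketch of one agglomeration pass over an oversegmentation: any neighbouring
# pair of segments accepted by the predicate can_merge is merged. Running such
# passes with successively stricter contextual predicates gives the hierarchy
# described above. The predicate used in the demo is a placeholder.

def merge_pass(segments, neighbours, can_merge):
    """segments: {id: set of pixels}; neighbours: set of (id, id) pairs."""
    merged = True
    while merged:
        merged = False
        for a, b in list(neighbours):
            if a in segments and b in segments and can_merge(segments[a], segments[b]):
                segments[a] = segments[a] | segments[b]      # absorb b into a
                del segments[b]
                neighbours = {(x, y) for (x, y) in neighbours if b not in (x, y)}
                merged = True
                break
    return segments, neighbours

if __name__ == "__main__":
    # Toy example: segments 1 and 2 are of similar size and merge;
    # the much larger segment 3 is left alone by this predicate.
    segments = {1: {"a", "b"}, 2: {"c", "d"}, 3: set("efghijklm")}
    neighbours = {(1, 2), (2, 3), (1, 3)}
    similar_size = lambda s, t: abs(len(s) - len(t)) <= 1
    print(merge_pass(segments, neighbours, similar_size)[0])
```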
B
Robust 3D Tracking of Unknown Objects
In this paper we address the problem of tracking previously unseen objects, presented in Sec. 3.2. The paper assumes that the robot does not know the object beforehand, so only the part facing the robot is visible. No assumption is made on the shape or appearance of the object. Therefore, the algorithm tries to adapt and learn new instances of the object appearance while tracking it.
Tracking is performed by a combination of motion estimation using optical flow and object detection based on local features. An aggravating factor that increases the difficulty of the problem is the learning mechanism that updates the object model with new views of the object. Many state-of-the-art algorithms approach this problem by first estimating the position of a bounding box that ideally wraps the object in the image. The bounding box is then used to extract the new appearance of the object. This procedure is sensitive to drifting, since the bounding box may include the background due to a wrong estimation of the object position. Additional mechanisms are usually implemented to filter out bad candidates for learning.

Figure 3: The tracker uses feature points to estimate the current 3D pose of the object. The feature points in green are already known as part of the object model. The points in red belong to a new appearance candidate of the object, so they are included in the object model.

Figure 4: Behavior of different tracking algorithms. The result of the algorithm proposed in this work is shown in red. Yellow shows the result of a tracker that does not learn new appearances of the object. Blue and green do learn new instances, but they drift due to the object manipulation.
In this paper we propose to use a 3D bounding cuboid rather than a 2D bounding box. It is then possible to estimate not only the object position but also its rotation relative to the initial view, as well as its scale. All this information is used in the learning procedure, making the algorithm more robust and less prone to drifting. Our tracker is able to cope with 360-degree out-of-plane rotation, large occlusions and fast motion. Some experimental results can be seen in Figure 4.
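As a small illustration of the cuboid idea (not the code of Paper B), a feature point can be accepted into the object model only if, once transformed into the object's current coordinate frame, it falls inside the cuboid. The pose and cuboid size would come from the tracker; here they are given directly.

```python
import numpy as np

# Sketch of cuboid-guided learning: a feature point is accepted into the
# object model only if, after transforming it into the object's current
# coordinate frame, it falls inside the cuboid. The pose (R, t) and the
# cuboid half-extents would come from the tracker's estimate; here they are
# given directly. This illustrates the idea, not the code of Paper B.

def inside_cuboid(points_world, R, t, half_extents):
    """points_world: (N, 3); object pose such that world = R @ obj + t."""
    points_obj = (points_world - t) @ R          # rotate into the object frame
    return np.all(np.abs(points_obj) <= half_extents, axis=1)

if __name__ == "__main__":
    R = np.eye(3)                                # object axis-aligned, at origin
    t = np.zeros(3)
    half = np.array([0.05, 0.05, 0.10])          # a 10 x 10 x 20 cm cuboid
    pts = np.array([[0.01, 0.02, 0.05],          # on the object -> learned
                    [0.20, 0.00, 0.00]])         # background point -> rejected
    print(inside_cuboid(pts, R, t, half))        # [ True False]
```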
C
Functional Object Descriptors for Human Activity Modeling
The main purpose of this work is to let the robot understand the role of objects within an activity: their function, or affordance. Usually this task is addressed by the Computer Vision community by extracting visual features and mapping the appearance of objects to their affordance. However, this approach suffers from two limitations. First, objects that share similar appearance features may belong to completely different affordance classes (Figure 5); e.g., a knife and a cucumber share the common feature of having an elongated shape, but they have completely different roles in a cooking activity. Second, an object may afford multiple functionalities. Since commonly used appearance descriptors cannot capture functionality properly, our idea instead consists in designing a new functional descriptor that can be used to understand object affordance. In order to do that, we propose to take human motion into account and represent objects directly in terms of their interaction with human hands. A dataset of human kitchen activities was recorded, since none of the datasets existing at the time provided information about both the objects and the position of the human executing the activities. We assume that the objects involved in the activity are unknown. It would be possible to reason about activities taking the object class into account; however, this approach would not scale and is in contrast with the main purpose of this thesis of aiming for generality. Therefore, a segmentation algorithm is used to extract object hypotheses from the first frame
Figure 5: Appearance-based vs. function-based object modeling. Left: If objects are
characterized by appearance (shape, color, texture) features, a set of objects might
be grouped into elongated, round and square. Right: Using features reflecting the
object function in human activity, the objects might instead be grouped into tools,
ingredients and support areas.
Figure 6: Example of the discretized symbol labelling procedure, which considers
the object and hand positions.
The appearance of the object hypotheses is learned and used to
track the segments during the whole video. The idea of this work is to use
the relative position between the objects and the hands to encode the functional
properties of each object. The continuous distance measurement at each time step
in the video is associated with a discretized representation defined by a limited
vocabulary of symbols (Figure 6). The set of states is: idle, approaching, close to
the human hands, in use, and leaving. Our proposed functional object descriptor
then consists of a string of symbols. The similarity between two object descriptors
can be measured using a string kernel. Experiments show that the functional descriptor
captures affordance similarities that are not captured by appearance features.
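The following Python sketch illustrates the idea in a simplified form: the per-frame hand-object distance is discretized into the symbols listed above, and two resulting descriptor strings are compared with a spectrum-style string kernel that counts shared substrings of a fixed length. The thresholds, the one-character symbol encoding and the particular kernel are assumptions made for this example, not the exact choices of the paper.

from collections import Counter

SYMBOLS = {"idle": "I", "approaching": "A", "close": "C", "in_use": "U", "leaving": "L"}

def to_symbols(distances, close_thr=0.05, far_thr=0.30, motion_eps=0.005):
    """Map a sequence of hand-object distances (metres) to a symbol string.
    Thresholds are illustrative assumptions."""
    out = []
    for i, d in enumerate(distances):
        delta = distances[i] - distances[i - 1] if i > 0 else 0.0
        if d <= close_thr:
            # Close to the hand: motion suggests the object is in use.
            out.append(SYMBOLS["in_use"] if abs(delta) > motion_eps else SYMBOLS["close"])
        elif d >= far_thr:
            out.append(SYMBOLS["idle"])
        else:
            out.append(SYMBOLS["approaching"] if delta < 0 else SYMBOLS["leaving"])
    return "".join(out)

def spectrum_kernel(s1, s2, k=3):
    """Similarity between two descriptors: number of shared length-k substrings."""
    c1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    c2 = Counter(s2[i:i + k] for i in range(len(s2) - k + 1))
    return sum(c1[g] * c2[g] for g in c1 if g in c2)

Two objects that are handled in a similar way (e.g. two different tools) yield similar symbol strings and hence a high kernel value, even if their appearance is very different.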
D. Recognizing Object Affordances in Terms of Spatio-Temporal Object-Object Relationships
In this paper we revisit the problem of capturing object affordance from a different
perspective. The idea is that an object's functionality is determined not only by the
way a human manipulates the object but also by the other objects involved in an
activity, as shown in the example in Figure 7. The object segments extracted from
the dataset presented in Paper C are used to calculate the spatial relationships of
the objects involved in the activity.
Figure 7: Object functionality is to a high degree defined by the interaction with
other objects. In this toy example, the functionality of a hammer depends highly
on the context in which it is used. Together with a nail, the hammer affords
hammering (the activity it is designed for). However, together with a beer bottle,
the hammer also affords opening. Furthermore, together with a piggy bank, the
hammer affords breaking. These three affordances are conceptually different, and
tied to the other object that interacts with the hammer. We thus propose to
represent object affordances in terms of object-object relationships.
30
CHAPTER 3. SUMMARY OF PAPERS
Figure 8: Illustration of the joint functional object classification. Each node (O1,
..., O4) in the graphical model corresponds to an object functional class; unary terms
such as P(X3 | O3) link each object to its observation. Each edge carries the joint
probability of the connected objects, such as P(O3, O4 | X3, X4), observed from the
video by using the pairwise object distances.
The main contributions of this paper are twofold. First, the spatio-temporal
relationships of objects are encoded without the use of a discrete set of symbols.
This is possible since the path kernel used to measure the similarities applies a
strategy similar to that of the string kernel to streams of continuous data.
Second, we propose a probabilistic model that takes into account the structure of the
interaction between objects to infer the affordance of individual objects (Figure 8).
The main assumption behind our experiments is that the objects within the same
activity are pairwise dependent, which motivates the design of our model. We show
a significant improvement in the classification rate using such an approach compared
to our previous work.
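The flavour of this joint inference can be conveyed by a brute-force sketch over a small number of objects and functional classes: given unary terms P(X_i | O_i) and pairwise terms P(O_i, O_j | X_i, X_j), the jointly most probable assignment of classes to all objects is selected. The interfaces, the class names and the exhaustive search are simplifications for illustration and do not reproduce the model or the inference procedure used in the paper.

import itertools
import numpy as np

def joint_classify(unary, pairwise, classes):
    """Brute-force MAP estimate over joint functional-class assignments.

    unary    : dict  obj -> {cls: P(X_obj | cls)}  (observation terms)
    pairwise : dict (obj_i, obj_j) -> {(cls_i, cls_j): P(cls_i, cls_j | X_i, X_j)}
    classes  : list of functional classes, e.g. ["tool", "ingredient", "support"]
    """
    objects = sorted(unary)
    best, best_score = None, -np.inf
    for assignment in itertools.product(classes, repeat=len(objects)):
        labels = dict(zip(objects, assignment))
        score = sum(np.log(unary[o][labels[o]]) for o in objects)
        score += sum(np.log(pw[(labels[i], labels[j])])
                     for (i, j), pw in pairwise.items())
        if score > best_score:
            best, best_score = labels, score
    return best

Because the pairwise terms couple the objects, an ambiguous object (say, an elongated segment) can be pushed towards the tool class when it repeatedly interacts with an object that is confidently classified as an ingredient.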
E. Audio-Visual Classification and Detection of Human Manipulation Actions
This paper tackles the problem of sensory fusion for activity recognition. The idea
is that any sensor suffers from some limitations, but these can be compensated for
by fusing its data with data from a source of a different nature. Very little work has
been done previously in robotics, mostly focused on fusing tactile sensing with
vision [8].
To our knowledge, the recognition of actions by fusing auditory and visual
information, as proposed in this work, has not been addressed before. The first
contribution of this paper is a publicly available dataset of human actions that
includes visual data, in terms of RGB-D video and 6 degrees of freedom (DOF)
object pose estimation, and acoustic input recorded with an array of microphones
(Figures 9 and 10).
The second contribution consists of a method, inspired by speech recognition,
to detect and segment actions from continuous audio-visual data. Moreover, the
model has been used to extensively evaluate the descriptive power of each individual
source employed in the recognition. Finally, our experiments show that the joint
use of audio and vision outperforms action recognition with a single sensor.
Figure 9: Complementary information can help to mitigate the limitations of a single
source of information. Many actions produce distinctive sounds that can be used
in recognition.
Figure 10: Examples taken from the dataset to show the variety in performing the
actions.
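A minimal way to see why fusion helps is score-level (late) fusion: each modality scores the observed segment against every action model and the scores are combined before taking the decision. The sketch below is an illustration under assumed interfaces; the per-modality classifiers and the weighting are placeholders and do not correspond to the HMM-inspired detector actually used in the paper.

def fuse_and_classify(audio_logp, visual_logp, actions, audio_weight=0.5):
    """Late fusion of per-action log-likelihoods from two modalities.

    audio_logp, visual_logp : dict action -> log-likelihood of the observed
                              segment under that action's model (placeholder
                              per-class classifiers, e.g. HMMs)
    actions                 : list of action labels
    audio_weight            : relative weight of the audio stream (assumed)
    """
    fused = {a: audio_weight * audio_logp[a] + (1.0 - audio_weight) * visual_logp[a]
             for a in actions}
    return max(fused, key=fused.get)

# An action that is visually ambiguous (e.g. partially occluded) but produces a
# distinctive sound can still be recognized because the audio term dominates,
# and vice versa for silent but visually distinctive actions.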
Chapter 4
Conclusions
This thesis explored the problem of object detection, tracking and modelling in the
context of robot learning from demonstration. In order to reason about actions,
there is a chain of correlated problems that need to be tackled, starting from the
interpretation of the raw sensory data up to a representation of the high-level semantics.
Each method presented addressed a different part of this task. First, we presented a
method to formulate possible object hypotheses given an RGB-D image. The method
exploits task-driven contextual knowledge to find plausible objects. Second, we
presented a tracking algorithm that tracks a generic object and learns its appearance
over time, since only a small portion of the object is visible in front of the camera at
the beginning. Third, we infer the object functionality by looking at how the human
uses the objects and how the spatial relationships of the objects change over time. As a
last step we fused auditory and visual input to understand activities. The research
performed on these tasks underlined some important findings:
Object definition is ill-posed. It is, in the general case, not possible to have a
completely bottom-up, data-driven algorithm to detect objects. Some top-down
knowledge, in terms of contextual or task-driven knowledge, needs to be
included. This is in line with recent findings in neuroscience showing
that humans recognize objects in their context.
To learn or not to learn? This is the big question to answer when tracking objects.
A tracker that does not learn (i.e., adapt over time) may be very
robust under the assumption that the appearance of objects does not change.
However, in general this assumption does not hold and a learning mechanism
is needed. In that case it is very important to detect bad learning candidates
in order to limit the drifting problem.
Appearance versus functionality. In the context of action recognition it is
more important to know what you can do with an object than what it looks
like. By looking directly at how humans use objects it is possible to
understand functionality better than by relying on appearance.
Sensory fusion. Any sensor has limitations; a sensory-fusion approach is able to
improve the performance in the completion of a task, compared to the results
of a single source.
Future Work
In this work we addressed the object-centered problems of robot learning. However,
there are a number of subjects not involving objects that are important to address
in order to understand and learn actions. One interesting topic consists of detecting
human hands and estimating their pose. Having the pose of human hands can help
the task of robot learning for two main reasons. First, object segmentation is very
difficult while objects are being manipulated. Segmentation may produce poor
results when the object is heavily occluded or when its color is similar to that of the
human hands. Having a supportive system that correlates the estimation of object
boundaries with the estimation of the hand pose may mitigate the noise produced
by manipulation [36]. Second, hand pose estimation related to grasping may help
in understanding an observed activity. As an example, it is very unlikely that a
person will drink from a glass if it is being grasped from the top.
Another topic we would like to explore further is sensory fusion. We have shown in
Paper E that the limitations of a single sensor can be compensated for by including
sensory input gathered with another source. In our study we used audio to
understand actions. It would be interesting to continue in this direction by including speech
recognition. An instructor usually uses verbal explanations to clarify an action they
are showing to a student. Such a possibility should be helpful for robots as well
[22]. Moreover, sensory fusion should not be limited to audio-visual fusion. Much
can be done using other sensors. As an example, thermal cameras can be used to
help segment human hands and objects during manipulation activities.
The last topic we would like to investigate is the application of Deep Neural Networks
to action recognition. Neural network-based methodologies have been shown
to outperform other methods in the context of image classification and object
detection [4]. It would be interesting to investigate their application to temporal
sequences and compare the results to those achieved with the Hidden Markov Models
(HMMs) that we used in our experiments.
Bibliography
[1] A. Ückermann, R. Haschke, and H. Ritter. Real-time 3D segmentation of cluttered scenes for robot grasping. In International Conference on Humanoid Robots, 2012.
[2] E. E. Aksoy, A. Abramov, F. Wörgötter, and B. Dellen. Categorizing object-action relations from semantic scene graphs. In IEEE International Conference on Robotics and Automation, 2010.
[3] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 2009.
[4] H. Azizpour and S. Carlsson. Self-tuned visual subclass learning with shared samples an incremental approach. ArXiv, 1405.5732, 2014.
[5] A. Bandura. Influence of models' reinforcement contingencies on the acquisition of imitative responses. Journal of Personality and Social Psychology, 1965.
[6] A. Bandura. Social foundations of thought and action: A social cognitive theory. Prentice Hall, 1986.
[7] M. Bar. Visual objects in context. Nature Reviews Neuroscience, 5(8), 2004.
[8] Y. Bekiroglu, D. Song, L. Wang, and D. Kragic. A probabilistic framework for task-oriented grasp stability assessment. In IEEE International Conference on Robotics and Automation, 2013.
[9] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17, 1981.
[10] I. Bülthoff and H. H. Bülthoff. Image-based recognition of biological motion,
scenes and objects. In Perception of Faces, Objects, and Scenes: Analytic and
Holistic Processes. 2003.
[11] P. Cheek, L. Nikpour, and H. D. Nowlin. Aging well with smart technology.
Nursing Administration Quarterly, 29, 2005.
[12] L. Fei-Fei, R. Fergus, and A. Torralba. Recognizing and Learning Object Categories: Short course, 2009.
[13] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2), 2004.
[14] M. J. Fenske, E. Aminoff, N. Gronau, and M. Bar. Top-down facilitation
of visual object recognition: object-based and context-based contributions.
Progress in brain research, 155, 2006.
[15] J. J. Gibson. The Ecological Approach to Visual Perception. Lawrence Erlbaum
Associates, 1979.
[16] H. Grabner, J. Gall, and L. van Gool. What makes a chair a chair? In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition,
2011.
[17] N. Gronau, M. Neta, and M. Bar. Integrated contextual representation for
objects’ identities and their locations. Journal of Cognitive Neuroscience, 20
(3), 2008.
[18] P. Guler, Y. Bekiroglu, X. Gratal, K. Pauwels, and D. Kragic. What’s in the
container? Classifying object contents from vision and touch. In IEEE/RSJ
International Conference on Intelligent Robots and Systems, 2014.
[19] A. Gupta, A. Kembhavi, and L. S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10), 2009.
[20] J. Guttler, C. Georgoulas, T. Linner, and T. Bock. Towards a Future Robotic
Home Environment: A Survey. Gerontology, 2014.
[21] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured output tracking with
kernels. In IEEE International Conference on Computer Vision, 2011.
[22] M. Johnson-Roberson, J. Bohg, G. Skantze, J. Gustafson, R. Carlson, B. Rasolzadeh, and D. Kragic. Enhanced visual scene understanding through human-robot dialog. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011.
[23] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7), 2012.
[24] H. Kjellström, J. Romero, and D. Kragic. Visual object-action recognition:
Inferring object affordances from human demonstration. Computer Vision and
Image Understanding, 115(1), 2011.
[25] H. Kjellström, J. Romero, D. Martínez, and D. Kragic. Simultaneous visual
recognition of manipulation actions and manipulated objects. In European
Conference on Computer Vision, volume 2, 2008.
[26] S. Leutenegger, M. Chli, and R. Y. Siegwart. BRISK: Binary robust invariant scalable keypoints. In IEEE International Conference on Computer Vision, 2011.
[27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 2004.
[28] B. D. Lucas and T. Kanade. An iterative image registration technique with
an application to stereo vision. In International Joint Conference on Artificial
Intelligence, 1981.
[29] J. B. MacQueen. Some methods for classification and analysis of multivariate
observations. In Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[30] L. Montesano, M. Lopes, A. Bernardino, and J. Santos-Victor. Learning object
affordances: From sensory motor coordination to imitation. IEEE Transactions
on Robotics, 24(1), 2008.
[31] T. Mörwald, J. Prankl, A. Richtsfeld, M. Zillich, and M. Vincze. BLORT - The blocks world robotic vision toolbox. In Best practice in 3D perception and modeling for mobile manipulation (in conjunction with ICRA 2010), 2010.
[32] G. Nebehay and R. Pflugfelder. Consensus-based matching and tracking of
keypoints for object tracking. In IEEE Winter Conference on Applications of
Computer Vision, 2014.
[33] K. Pauwels, L. Rubio, J. Diaz, and E. Ros. Real-time model-based rigid object
pose estimation and tracking combining dense and sparse visual cues. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition,
2013.
[34] A. Revonsuo and J. B. Newman. Binding and consciousness. Consciousness
and Cognition, 8(2), 1999.
[35] A. Richtsfeld, T. Mörwald, J. Prankl, M. Zillich, and M. Vincze. Segmentation of unknown objects in indoor environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
[36] J. Romero, T. Feix, C. H. Ek, H. Kjellström, and D. Kragic. Extracting postural synergies for robotic grasping. IEEE Transactions on Robotics, 29(6), 2013.
[37] C. Rother, V. Kolmogorov, and A. Blake. “GrabCut": Interactive foreground
extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3),
2004.
[38] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In IEEE International Conference on Computer Vision, 2011.
[39] G. Salvi, L. Montesano, A. Bernardino, and J. Santos-Victor. Language bootstrapping: Learning word meanings from perception-action association. IEEE
Transactions on Systems, Man, and Cybernetics, 2012.
[40] S. Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3, 1999.
[41] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2010.
[42] J. Stork, L. Spinello, J. Silva, and K. O. Arras. Audio-based human activity recognition using non-Markovian ensemble voting. In International Symposium on Robot and Human Interactive Communication, 2012.
[43] C. L. Teo, Y. Yang, H. Daumé III, C. Fermüller, and Y. Aloimonos. Towards a Watson that sees: Language-guided action recognition for robots. In IEEE International Conference on Robotics and Automation, 2012.
[44] M. W. Turek, A. Hoogs, and R. Collins. Unsupervised learning of functional categories in video scenes. In European Conference on Computer Vision, 2010.
[45] R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward objective evaluation
of image segmentation algorithms. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 29(6), 2007.
[46] Y. Wu, J. Lim, and M-H. Yang. Online object tracking: A benchmark. IEEE
Computer Society Conference on Computer Vision and Pattern Recognition,
2013.
[47] S. Zmigrod and B. Hommel. Feature integration across multimodal perception
and action: a review. Multisensory research, 26, 2013.