Action Recognition for Robot Learning

ALESSANDRO PIEROPAN

Doctoral Thesis
Stockholm, Sweden 2015

TRITA-CSC-A-2015:09
ISSN-1653-5723
ISRN-KTH/CSC/A–15/nr-SE
ISBN 978-91-7595-561-2

Computer Vision and Active Perception
School of Computer Science and Communication
KTH Royal Institute of Technology
SE-100 44 Stockholm, Sweden

Copyright © 2015 by Alessandro Pieropan except where otherwise stated.
Printed by: Universitetsservice US-AB 2015

Abstract

This thesis builds on the observation that robots cannot be programmed to handle any possible situation in the world. Like humans, they need mechanisms to deal with previously unseen situations and unknown objects. One of the skills humans rely on to deal with the unknown is the ability to learn by observing others. This thesis addresses the challenge of enabling a robot to learn from a human instructor. In particular, it focuses on objects. How can a robot find previously unseen objects? How can it track an object with its gaze? How can the object be employed in activities? Throughout this thesis, these questions are addressed with the end goal of allowing a robot to observe a human instructor and learn how to perform an activity.

The robot is assumed to know very little about the world and is supposed to discover objects autonomously. Given a visual input, object hypotheses are formulated by leveraging common contextual knowledge often used by humans (e.g. gravity, compactness, convexity). Moreover, unknown objects are tracked and their appearance is updated over time, since only a small fraction of an object is visible to the robot initially. Finally, object functionality is inferred by looking at how the human instructor manipulates objects and how objects are used in relation to others. All the methods included in this thesis have been evaluated on datasets that are publicly available or that we collected, showing the importance of these learning abilities.
Summary (Sammanfattning)

This thesis builds on the premise that robots cannot be programmed to handle every conceivable situation. Like humans, they need mechanisms to handle previously unseen situations and unknown objects. One of the skills humans rely on to handle new situations is the ability to learn by observing other people. The theme of this thesis is to enable a robot to learn from a human instructor. We focus in particular on objects. How can a robot find previously unseen objects? How can it follow objects with its gaze? How can objects be used for different purposes? These questions are addressed with a focus on robot learning from demonstration: the robot observes a human instructor and learns to perform an activity.

The robot is assumed to know very little about the world and must discover objects on its own. The robot observes and formulates object hypotheses by exploiting the contextual knowledge about physical objects that humans often use (e.g. gravity, compactness, convexity). The object hypotheses are tracked over time, and the robot learns more about the appearance of the objects as time passes, since only a small part of an object is visible to the robot at first. Finally, the robot infers the functionality of objects by looking at how the human instructor manipulates them and how objects interact with each other. All methods included in this thesis have been evaluated on publicly available datasets, and show the importance of these learning abilities.

Acknowledgments

This journey was not easy at all. Two years ago I could not even imagine myself being at this point. There are some people I would really like to thank for helping me make it this far. First, thanks to Hedvig for:

• Offering me this opportunity.
• Guiding me along the journey.
• Believing in me even when I did not.
• Tolerating my complaints about writing.
Furthermore, thanks to Carl Henrik for sharing his energy and positive attitude, and for being there when I needed it the most. Thanks to Niklas for having so many roles in my life during these years: colleague, landlord, supervisor and, more importantly, friend. (What is next?) Thanks to Renaud for being such a great travelling companion during conferences (good job my friend!). Cheng, the greatest office mate, you really helped me with your hopeless optimism. Michele, for sharing so many coffee breaks with me. And Virgile, for always appearing right on time. Giampi, for the wonderful collaboration that started during a coffee break. Magnus, for being a great friend even though I am not from Åkersberga. Miro, for showing me the joy of playing beach volley in Sweden. Puren, for being such a cheerful office mate. Karl, for sharing with me his knowledge about tracking and Belgian beers. I would like to thank Professor Ishikawa for being so welcoming and hosting me in his laboratory in Tokyo. The bandy group, for so many funny and competitive matches. The Amazon team, for such a fun experience: Mr. Hang, Francisco, Johannes, Karl and Michele.

Thanks to all past and current Cvappers who contributed to making our laboratory such a great working place: Alper, Jeannette, Xavier, Xavi, Marin, Heydar, Vahid, Hossein, Gert, Oscar, Raresh, Nils, Yasemin, Omid, Andrej, Martin, Sergio, Johan, Ivan, Marianna, Alejandro, Akshaya, Yuquan, Patric, Aaron, Erik, Emil, Ali, Anastasia, Fredrik, Kristoffer, Lazaros, Florian, Ioannis, John, Christian, Petter, Josephine, Atsuto, Mårten, Frine, Jan-Olof, Stefan, Tove, Ramviyas, Zhang and Mikael.

Thanks to all the reviewers who gave me invaluable feedback on my work, improving my research and helping me to become the researcher I am. Thanks go to Dani, who offered me, an unknown student writing from Italy, a Master thesis project. Finally, I would like to thank all my friends and my family. My parents Giorgio and Carla, who supported me in all my choices.
My sister Francesca, who is always present. Furthermore, I am extremely grateful to Serena for always trusting in me and standing by my side all these years. Without you I would not be here writing this.

Contents

I Introduction

1 Introduction
  1 Thesis Contributions
  2 Thesis Outline

2 Robot Learning and Object Understanding
  1 Problem Statement
  2 Example Scenario
  3 Challenges

3 Summary of Papers
  A Unsupervised Object Exploration Using Context
  B Robust 3D Tracking of Unknown Objects
  C Functional Object Descriptors for Human Activity Modeling
  D Recognizing Object Affordances in Terms of Spatio-Temporal Object-Object Relationships
  E Audio-Visual Classification and Detection of Human Manipulation Actions

4 Conclusions
  1 Future Work

Bibliography

II Included Publications

A Unsupervised Object Exploration Using Context
  1 Introduction
  2 Related Work
  3 Contextual Segmentation
  4 Experiments
  5 Conclusions
  References

B Robust 3D Tracking of Unknown Objects
  1 Introduction
  2 Related Work
  3 Method
  4 Experiments
  5 Conclusions and Future Work
  References

C Functional Object Descriptors for Human Activity Modeling
  1 Introduction
  2 Related Work
  3 Extraction Of Object Hypotheses
  4 Functional Object Representation
  5 Results
  6 Conclusions
  References

D Recognizing Object Affordances in Terms of Spatio-Temporal Object-Object Relationships
  1 Introduction
  2 Related Work
  3 Methodology
  4 Representation
  5 Experiments
  6 Conclusions
  References

E Audio-Visual Classification and Detection of Human Manipulation Actions
  1 Introduction
  2 Related Work
  3 Dataset
  4 Pose Estimation
  5 Model
  6 Features
  7 Experimental Results
  8 Conclusions
  References

Part I
Introduction

Chapter 1
Introduction

Robots are useful for a wide variety of tasks, but introducing them comes at a high financial cost. There are three main criteria robotics engineers consider when evaluating the need for robots in a certain application. If a situation meets one of these criteria, robots can most likely help. The criteria are dull, dirty and dangerous, and they are known as the three Ds of robotics. As an example, work on an assembly line meets the first criterion, dull: it is very boring for a human since it needs to be repeated over and over. Accordingly, car manufacturing is one of the most successful examples of the application of robotics. However, there are two major limitations to the application of robots even in such a relatively well structured scenario.

Firstly, an assembly line requires a team of specialized engineers to set up the whole machinery in order to produce a defined product. Nothing is left to chance; everything needs to be exactly where it is supposed to be. As can be seen in Figure 1, the chassis of a vehicle is positioned on a moving platform that is programmed to stop at defined locations for a fixed amount of time. When the platform stops, the robots positioned around the chassis execute the task they were programmed for. Once the task is completed, the moving platform transports the chassis to the next checkpoint. Every robot performs its task working on an exactly specified location of the chassis.
Therefore, every time the assembly line has to switch to a different car model, the team of engineers has to tune the line according to the new design. This results in downtime that affects the productivity of the assembly line. The shorter the downtime, the more convenient a line involving robots is. Secondly, those robots need to operate in cages since it is very dangerous for a human to be around them while the line is working. Clearly, these limitations reduce the application of robotics, because either the setup and changeover time make the assembly line unprofitable or the product requires a human in the process who is able to operate close to robots.

Figure 1: Example of an assembly line. The chassis of a car model is placed on a moving platform that stops at predetermined positions. The robotic arms at each location execute the actions they are programmed for on the chassis. The robotic assembly line is surrounded by walls to avoid any injury to employees.

Only recently have companies started investing in the development of robots that can work side by side with humans and that can be trained interactively by people without special training, e.g. Rethink Robotics. This is already a step forward. Such adaptive robots could be deployed in more flexible assembly lines, making it possible to reuse an existing line for a new product while minimizing the changeover time.

Production is not the only field that can benefit from such adaptive robots. An application that meets the second criterion, dirty, is the navigation of sewers and the detection of clogged sewer pipes. However, in such a scenario, robots are still controlled remotely given the unstructured environment they need to explore. Robots can also be used in natural or man-made disaster emergencies where human intervention is not possible due to high health risks; this meets the third criterion, dangerous. The disaster of Fukushima was a wake-up call for the robotics community.
The Japanese media wondered why a country well known for its cutting-edge technology in robotics was unable to respond to the emergency¹. It was not possible to attempt to repair the damaged reactors because the levels of radiation were too dangerous for the emergency crew. Since Japan had no robot able to operate in the scenario, the iRobot company provided four robots (two PackBot 510s and two Warrior 710s) to assist. The robots could provide video from inside the power plant; however, they were not able to execute any of the tasks needed to slow the meltdown. Robots that can adapt and operate in a scenario similar to Fukushima can be really helpful in preventing other disasters in the future. Moreover, scenarios like the one in Fukushima may have areas where it is not possible to control a robot remotely, so autonomous robots are necessary. The Defense Advanced Research Projects Agency (DARPA) is now investing a lot of effort in this direction.

¹ Domestic robots failed to ride to rescue after No. 1 plant blew, The Japan Times

"An oncoming demographic inversion will leave us with fewer people to do all kinds of jobs" (Professor Rodney Brooks)

Another sector that does not meet any of the introduced criteria, but that can benefit from the use of robotics, is healthcare. Over the past decade it has been observed that the median age of the population of Western countries is increasing while the birthrate is decreasing. This phenomenon is known as population ageing: the number of elderly people is growing while the number of active working people is shrinking [20]. It is probable that in the future there will not be sufficient facilities and workforce to take care of the patients [11]. If we can have adaptive robots in a household environment, continuous monitoring and assistance at home become possible, making aging in place a reality [11].
The limiting factor that prevents the deployment of robots in those situations is the environment itself. A home environment is so unstructured and unpredictable that a robot needs to adapt continuously in order to operate on its own.

What is missing today is the ability of robots to deal with the unknown and adapt. A person can adapt to new situations very quickly. One of the fundamental abilities on which humans rely is learning how to perform a certain action, or understanding how something works, by observing the people around them. This form of learning is especially evident during childhood [5]. Such skills allow a human being to deal with the unknown using another person as a model to emulate. Clearly, this ability is fundamental if robots are to be fully integrated in our daily routine.

Let us take a home robot as an example. Ideally, it should be possible to instruct this robot in making an espresso just by showing it once how to use a coffee machine, as any human being is able to learn. Once the robot has learned that, it should be able to prepare an espresso even if the position of the coffee machine changes or if the old cups are replaced with new ones. A robot that can adapt in this way is deployable in any household environment. Learning from demonstration is therefore required [3].

Robot learning from demonstration (also known as imitation learning) is the paradigm for enabling a robot to learn new tasks by observing an instructor, without being specifically programmed [40]. First, the robot needs the ability to observe the instructor performing a task. The robot needs to interpret the raw input data acquired through its sensors by detecting complex patterns and associating them with discrete classes. This is known in Neuroscience as the binding problem [34].
Using our espresso example, if the robot is supposed to learn how to prepare coffee, it needs to observe the instructor and detect the coffee machine; otherwise it will learn something else. Second, the robot should learn what a task means and its effect on the world. The robot should understand all the objects required to perform the activity "make an espresso", and it should understand when the task has been completed with the right outcome. Third, the robot imitates the instructor and achieves the same end goal.

1 Thesis Contributions

The work in this thesis explores various aspects of the first two steps required in imitation learning: observe and learn. In particular, we focus on learning manipulation activities and we contribute solutions to the following challenges:

Object discovery. In this thesis we focus on the unsupervised discovery of objects from a visual sensory input. Our main contribution is a method that segments a scene into regions and estimates the manipulable objects in the scene, which can be used for reasoning about objects and their functionality in manipulation activities. This is a fundamental capability for a robot that needs to learn a task from a human demonstrator. However, such a task is challenging since the problem of defining what an object is remains ill-defined, as the nature of human perception is not fully understood. Looking at the same scene, there can be multiple interpretations of what objects are present [45]. We constrain the problem to objects which can be grasped and moved by a human (or a robot) and we rely on contextual knowledge to find plausible object hypotheses without loss of generality.

Object tracking. We propose a tracking algorithm that is robust and versatile, since it can track unknown objects by learning the appearance of the object while it is tracked. Such capabilities are fundamental for robot learning in a realistic scenario given the unstructured nature of the environment.
Still, few methods actually employ fully working general tracking systems [31]. Instead, experiments are performed in simplified environments [2], or trackers work only with a limited set of specific objects [33].

Object affordance. We present object descriptors that can be used to understand how an object can be employed in human activity, rather than the more common approach of characterizing an object by its appearance. This is motivated by the fact that in order to learn a task successfully from an instructor, a robot needs to observe an action and its effect on the world, and then perform an action that has the same effect. Therefore, an agent should be able to reason about objects in terms of the current activity [15] by understanding the effect of an object on the world. This is known as object affordance [30].

Multi-modal action understanding. We propose a multi-modal approach that merges audio and visual input to understand an observed action. Such an approach mitigates the limitations of visual recognition, e.g., object occlusion and the requirement that actions be performed within the field of view. Clearly, multi-modal understanding is a desirable capability in autonomous robots. However, while much effort has been spent on the development of methods for visual perception [19, 24, 25], few studies have addressed robot understanding using auditory input, e.g., [42]. Moreover, to our knowledge the task of audio-visual action recognition has been addressed only very rarely in Robotics, e.g., in [43].

2 Thesis Outline

The rest of the thesis is structured as follows:

2.1 Chapter 2: Robot Learning and Object Understanding

Chapter 2 motivates the importance of the challenges presented in the introduction by giving a practical example scenario where robot learning is applicable. The challenges that arise in such a scenario are presented with a particular focus on the ones addressed in this work.
2.2 Chapter 3: Summary of Papers

Chapter 3 introduces the papers included in the second part of this work, and how the problems are addressed with respect to previous work in the same area.

2.3 Chapter 4: Conclusion

We conclude with a discussion of the results achieved by this work and of potential directions for continuing it in the future.

2.4 Part II: Included Papers

Five publications are included in Part II. The abstracts of these papers, along with the contributions made by the author, are given here. A short summary of each paper is given in Chapter 3. The contributions are presented in the same order as the challenges introduced in Section 1.

Paper A: Unsupervised Object Exploration Using Context

Alessandro Pieropan and Hedvig Kjellström. In Proceedings of the 2014 IEEE International Symposium on Robot and Human Interactive Communication (ROMAN’14), Edinburgh, UK, August 2014.

Abstract: In order for robots to function in unstructured environments in interaction with humans, they must be able to reason about the world in a semantic meaningful way. An essential capability is to segment the world into semantic plausible object hypotheses. In this paper we propose a general framework which can be used for reasoning about objects and their functionality in manipulation activities. Our system employs a hierarchical segmentation framework that extracts object hypotheses from RGB-D video. Motivated by cognitive studies on humans, our work leverages on contextual information, e.g., that objects obey the laws of physics, to formulate object hypotheses from regions in a mathematically principled manner.

Contribution by the author: Designed and implemented the RGB-D segmentation algorithm. Designed an object hypothesis formulation method based on contextual knowledge. Performed the evaluation of the system on a public dataset.

Paper B: Robust 3D Tracking of Unknown Objects

Alessandro Pieropan, Niklas Bergström, Masatoshi Ishikawa and Hedvig Kjellström.
In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA’15), Seattle, USA, May 2015.

Abstract: Visual tracking of unknown objects is an essential task in robotic perception, of importance to a wide range of applications. In the general scenario, the robot has no full 3D model of the object beforehand, just the partial view of the object visible in the first video frame. A tracker with this information only will inevitably lose track of the object after occlusions or large out-of-plane rotations. The way to overcome this is to incrementally learn the appearances of new views of the object. However, this bootstrapping approach is sensitive to drifting due to occasional inclusion of the background into the model. In this paper we propose a method that exploits 3D point coherence between views to overcome the risk of learning the background, by only learning the appearances at the faces of an inscribed cuboid. This is closely related to the popular idea of 2D object tracking using bounding boxes, with the additional benefit of recovering the full 3D pose of the object as well as learning its full appearance from all viewpoints. We show quantitatively that the use of an inscribed cuboid to guide the learning leads to significantly more robust tracking than with other state-of-the-art methods. We show that our tracker is able to cope with 360 degree out-of-plane rotation, large occlusion and fast motion.

Contribution by the author: Designed a real-time tracking algorithm for unknown objects. Designed an adaptive mechanism to learn new appearances of an object. Acquired a dataset used for evaluation. Manually labeled the dataset with ground truth. Performed evaluation of the system on the dataset. Benchmarked the system against state-of-the-art tracking algorithms.

Paper C: Functional Object Descriptors for Human Activity Modeling

Alessandro Pieropan, Carl Henrik Ek and Hedvig Kjellström.
In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA’13), Karlsruhe, Germany, May 2013.

Abstract: The ability to learn from human demonstration is essential for robots in human environments. The activity models that the robot builds from observation must take both the human motion and the objects involved into account. Object models designed for this purpose should reflect the role of the object in the activity – its function, or affordances. The main contribution of this paper is to represent objects directly in terms of their interaction with human hands, rather than in terms of appearance. This enables the direct representation of object affordances/function, while being robust to intra-class differences in appearance. Object hypotheses are first extracted from a video sequence as tracks of associated image segments. The object hypotheses are encoded as strings, where the vocabulary corresponds to different types of interaction with human hands. The similarity between two such object descriptors can be measured using a string kernel. Experiments show these functional descriptors to capture differences and similarities in object affordances/function that are not represented by appearance.

Contribution by the author: Recorded the dataset used for evaluating the method presented in the paper. Designed a temporal feature descriptor to capture object functionality. Designed the method to segment and track objects used to extract the feature descriptor. Performed evaluation of the system, comparing the results with appearance-based descriptors.

Paper D: Recognizing Object Affordances in Terms of Spatio-Temporal Object-Object Relationships

Alessandro Pieropan, Carl Henrik Ek and Hedvig Kjellström. In Proceedings of the 2014 IEEE/RAS International Conference on Humanoid Robots (HUMANOIDS’14), Madrid, Spain, November 2014.

Abstract: In this paper we describe a probabilistic framework that models the interaction between multiple objects in a scene. We present a spatio-temporal feature encoding pairwise interactions between each object in the scene. By the use of a kernel representation we embed object interactions in a vector space which allows us to define a metric comparing interactions of different temporal extent. Using this metric we define a probabilistic model which allows us to represent and extract the affordances of individual objects based on the structure of their interaction. In this paper we focus on the presented pairwise relationships but the model can naturally be extended to incorporate additional cues related to a single object or multiple objects. We compare our approach with traditional kernel approaches and show a significant improvement.

Contribution by the author: Designed a spatio-temporal descriptor that captures pairwise object relationships. Designed a probabilistic model to infer object functionality. Performed the evaluation, outperforming the results achieved in Paper C.

Paper E: Audio-Visual Classification and Detection of Human Manipulation Actions

Alessandro Pieropan, Giampiero Salvi, Karl Pauwels and Hedvig Kjellström. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’14), Chicago, USA, September 2014.

Abstract: Humans are able to merge information from multiple perceptional modalities and formulate a coherent representation of the world. Our thesis is that robots need to do the same in order to operate robustly and autonomously in an unstructured environment. It has also been shown in several fields that multiple sources of information can complement each other, overcoming the limitations of a single perceptual modality. Hence, in this paper we introduce a dataset of actions that includes both visual data (RGB-D video and 6DOF object pose estimation) and acoustic data.
We also propose a method for recognizing and segmenting actions from continuous audio-visual data. The proposed method is employed for extensive evaluation of the descriptive power of the two modalities, and we discuss how they can be used jointly to infer a coherent interpretation of the recorded action.

Contribution by the author: Acquired a dataset used to measure the quality of the framework. Manually labeled the ground truth. Designed a model to perform recognition of pre-segmented sub-actions. Designed a model to perform recognition of sub-actions in a continuous data stream. Implemented a multi-modal framework for action recognition. Evaluated the quality of the model using the recorded dataset.

Other Publications

Apart from the papers included in this thesis, the following publications were produced during my Ph.D. studies.

Alessandro Pieropan, Niklas Bergström, Masatoshi Ishikawa and Hedvig Kjellström. A Robust Object Tracker Using Structured Learning. In review for the Advanced Robotics Journal.

Alessandro Pieropan, Niklas Bergström, Hedvig Kjellström and Masatoshi Ishikawa. Robust Tracking Through Learning. In Annual Conference of the Robotics Society of Japan, Japan, 2014.

Alessandro Pieropan, Giampiero Salvi, Karl Pauwels and Hedvig Kjellström. A dataset of human manipulation actions. In Grasping Challenge Workshop at International Conference on Robotics and Automation, Hong Kong, 2014.

Alessandro Pieropan, Carl Henrik Ek and Hedvig Kjellström. On Object Affordances. In Grasping Challenge Workshop at International Conference on Robotics and Automation, Hong Kong, 2014.

Chapter 2
Robot Learning and Object Understanding

1 Problem Statement

The fundamental problem addressed in this thesis is that of object understanding in the context of robot learning from a human demonstrator.
A robot should be able to understand how to divide the sensory input into entities, i.e., objects, and it should be able to reason about them in a way that is effective for modeling human activity. In order to provide a clear idea of the problem we present an example scenario. 2 Example Scenario You want to have a robot cooking okonomiyaki1 in your restaurant. You bought a dual arm robot and you hired a professional cook to show the robot how to prepare an okonomiyaki (Figure 1). The robot comes from the factory with a set of limited motor capabilities: it can move the arms to reach a certain location expressed in coordinates, it can open and close the gripper and it can move the head to direct the gaze. In a regular industrial environment this is enough to fulfill many tasks. As an example, the following sequence of motor primitives can solve a pick and place task: move arm to location A, close gripper, move arm to location B, open gripper. In this scenario the robot operates blindly without any real perception of the environment around it, it has no concept of object, action or task completion. Without any perception there are many things that can go wrong in this situation. The object may be misplaced so that when the gripper closes at location A it does not grasp anything. The container at location B may be misplaced so that when the gripper opens the object drops on the floor. The robot can break something or harm someone if they stand in the path from A to B. Clearly a robot that operates 1 typical Japanese dish 13 14 CHAPTER 2. ROBOT LEARNING AND OBJECT UNDERSTANDING Figure 1: Example of a cooking scenario where the robot learns from a human instructor. The robot needs to detect the possible objects present on the scene, once they are found it needs to track them while the human is moving them. Finally it needs to understand the activity the human performed and reproduce it with its own actuators. 
in a restaurant sharing the space with people needs a perceptual mechanism to constantly check actions and their effects in order to adapt to perturbations. In this specific scenario, the robot is also supposed to learn something new from a cook. First of all the robot needs to detect the instructor to learn from. Then it needs to detect the objects in order to handle a possible misplacement. The actions and sub-actions that are required to complete the recipe need to be detected and learned by the robot, with a particular focus on the end goal rather than on the exact way it is achieved. This is motivated by the fact that no two people perform an activity in exactly the same way. The small differences in our body shape and proportions influence the way we perform an action. There are also more subtle factors that influence our behavior, such as culture, society or just family education. For example, one person may grasp a wine glass by the bowl while another grasps it by the stem. Despite this difference the end goal is the same: to drink. A robot has a different embodiment compared to a human, e.g., it can have a gripper with three fingers rather than five. Its activity representation should be invariant to such differences, and focus on the outcome of the activity. This way it can reproduce the action using its motor primitives and make sure that the same end goal is achieved.

3 Challenges

The example described above presents many challenges that need to be addressed. The input sensory data, no matter the type of sensor, need to be processed. This thesis focuses on visual and auditory data, but much can be done using other inputs such as tactile data. The task presented in the example can be split into three steps: observe, learn and reproduce. This thesis focuses on the first two, addressing the observe and learn problem in a manner similar to that motivated in neuroscience [6]:

segregation: the complex patterns in the data need to be discretized into classes. This is done by extracting entities such as objects, the human instructor, human hands and so on from the data.

combination: the extracted entities need to be merged to understand the experience; in our case, this means understanding the whole action in order to reproduce it.

Figure 2: Example of segmentation algorithms. (a) Segmentation: the most general approach is taken, and pixels in the image are clustered together by color coherence only. (b) Object detection: a supervised approach is taken. The objects are known beforehand, so the problem reduces to object detection.

The end goal of this work is to understand meaningful semantic concepts that can be transferred to the robot. Once the robot knows the goal that needs to be achieved, it can use its own way to achieve it. There are many ways to address the problems introduced. A possible approach, very industrial in a sense, would be to have a robot that can recognize exactly the tools present in your restaurant, and to place tools and ingredients in predefined positions. This thesis aims for the opposite: it strives for generality.

3.1 Object Discovery

In the first part of our work (presented in Paper A), semantic and contextual information is used to find object hypotheses. Such an approach is motivated by recent studies in Neuroscience that stress the importance of context in helping humans recognize and detect objects [7, 14, 17]. Concepts such as gravity, convexity or symmetry are helpful for humans in finding objects. As an example, given an image of a completely white room with no furniture, it can be very difficult for a human to determine which surface is the floor and where the walls are. Some may say that the floor is the bottom surface, assuming that the photo has been taken horizontally.
However, if a human is in the room they immediately determine where the floor is, because the human perceives the vertical direction through gravity. Yet, the two dominant approaches within the Computer Vision community exploit context very little. On one side, there is a task-driven approach that consists in training a robot to detect a predefined set of objects that are needed for modelling the range of tasks that the robot should learn [33] (see Figure 2b). In general such an approach can produce good results, but it is not feasible in reality. Just consider a simple object like a cup: how many cups of different appearance may be out there? Clearly it is not possible to pre-train a robot to recognize every single cup in the world. A mechanism that can generalize is needed. In contrast, the second common approach consists in partitioning an image into coherent regions which are hypothesized to correspond to objects. This is called segmentation. The process is considered successful in finding objects when the segmentation boundaries coincide with the object boundaries. The simplest way to segment an image is to apply a clustering algorithm such as k-means [29] (see Figure 2a). More complex methods represent the image as a graph and produce clusters by cutting its connectivity when certain conditions are not met [13, 37]. The main limitation of all the above methods is that the concept of object is ill-defined. Sometimes an object corresponds to exactly one cluster; in that case the segmentation is successful. However, sometimes an object corresponds to multiple segments, which is common with highly textured objects; this is defined as oversegmentation. Alternatively, multiple objects can be grouped in the same segment (undersegmentation); this may happen when they are very similar in color. The recent advent of cheap depth cameras has made it easier to generate segments that respect object boundaries.
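The colour-clustering segmentation and its two failure modes described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the function name `kmeans_colors`, the deterministic initialisation and the toy one-row "image" (a red and white striped mug on a white table) are all hypothetical and are not taken from any method evaluated in this thesis.

```python
from typing import List, Tuple

Color = Tuple[float, float, float]


def sq_dist(a: Color, b: Color) -> float:
    """Squared Euclidean distance between two RGB colours."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_colors(pixels: List[Color], k: int, iters: int = 20) -> List[int]:
    """Cluster pixels by colour only; returns one cluster label per pixel."""
    # Assumption: deterministic initialisation with the first k distinct colours.
    centers: List[Color] = []
    for p in pixels:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break
    if len(centers) < k:
        raise ValueError("fewer distinct colours than clusters")

    labels = [0] * len(pixels)
    for _ in range(iters):
        # Assignment step: each pixel goes to the nearest colour centre.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in pixels]
        # Update step: each centre becomes the mean colour of its members.
        new_centers: List[Color] = []
        for c in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(ch) / len(members) for ch in zip(*members)))
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


# Toy one-row image: white table, a striped mug (red, white, red), white table.
WHITE, RED = (255, 255, 255), (255, 0, 0)
row = [WHITE] * 3 + [RED, WHITE, RED] + [WHITE] * 3
labels = kmeans_colors(row, k=2)
mug_labels = labels[3:6]
```

In this toy example the three mug pixels end up in two different clusters (oversegmentation), while the white stripe of the mug shares a cluster with the table (undersegmentation). Colour alone carries no notion of what an object is.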
Recent work has shown that segments can be made to match object boundaries robustly by detecting edges in the range images and applying a simple flood fill algorithm [1]. However, even though such an approach helps considerably in preserving object edges, it is more suited to finding smooth surfaces than objects. A cereal box, as an example, is going to be segmented into six separate surfaces. Clearly one more step is needed to merge the surfaces into the object. It is not possible to find something if we do not know what we are looking for. This thesis is concerned with finding objects without losing generality; this is where contextual knowledge comes into play. In the context of actions involving the manipulation of objects, a robot can exploit contextual knowledge such as gravity, compactness, symmetry, size or relationships between surfaces [35] to find complex patterns that correspond to objects in the sensory input. Figure 3 shows an example where context is important. If we are looking for chairs we do not consider the object on the desk as a potential candidate, or the lamp in the ceiling, since that is not the context in which we usually find chairs. This figure is also interesting in terms of affordance (see Section 3.3).

Figure 3: The "chair-challenge" [10]. A synthetic environment where many chairs and chair-like objects are present. Relying only on appearance will not help in finding the real chairs; however, the context can help in the discrimination. The small object on the desk cannot be a chair, and neither can the one hanging from the ceiling.

3.2 Challenges with Tracking Objects

Once the robot formulates valid object hypotheses, it should be able to follow the objects with its gaze if they move; this problem is known as tracking. Since we strive for generality, we contribute with a method (presented in Paper B) able to track unknown objects robustly.
If a robot detects a new unknown object, only a small fraction of it is initially visible. Therefore the robot needs to adapt and build its knowledge of the object as new parts of it become visible. Our method learns new appearances of the object, assuming that its appearance may vary. Visual tracking of unknown objects is a challenging task that is actively studied within the Computer Vision community. In the spirit of our approach, some methods learn the appearance of the model continuously, so that the algorithm can adapt to changes in appearance due to the movement of the object [21, 23]. However, such an approach aggravates the tracking problem through what is often referred to as the drifting problem: upon learning a new instance of the object, it is crucial to determine which part of the new instance actually belongs to the object and which part belongs to something else, such as the background or another object occluding the one of interest. Without a mechanism to discriminate between good and bad appearance candidates, sooner or later the algorithm will learn the background as part of the object model, and the tracker will lose the object, as shown in Figure 4. Our method employs a mechanism that limits the drifting problem and bounds the complexity of the learning procedure, removing the need to learn continuously.

Figure 4: Example of different tracking algorithms that learn the object model while it is manipulated. Some of the algorithms learn the hands as part of the object model and drift away from the object.

There are also works that assume that the object appearance does not change, so there is no need to learn new appearances [32]. Such an approach is very robust when the underlying assumption is met; however, in case of drastic changes in appearance, the object is lost. An extensive overview of the state-of-the-art unknown object trackers is presented by Wu et al.
[46], who thoroughly evaluate the best performing trackers. An alternative and more restrictive approach consists in limiting the tracking problem to a set of objects that are known beforehand [33]. In such a scenario the problem can be divided into detection and tracking. The 3D object model or a set of images of the object is used to extract features. The features make it possible to detect the known object in new images. Usually, very robust features such as the scale invariant feature transform (SIFT) [27] are used; however, since tracking requires real-time performance, especially in robotics, faster but less robust features have recently become popular. BRISK [26] and ORB [38] are the most popular among these. The tracking problem is formulated as estimating the motion of the object by computing the optical flow between two images. Some methods estimate a dense optical flow [9] to determine the position of the object, while others rely only on the estimation of the movement of a limited set of points [28]. In general, a dense approach is more robust in estimating the motion correctly, while a sparse approach is faster. As discussed before, this work aims for generality, so assuming that an object is known beforehand is very limiting.

3.3 Challenges with Understanding Object Functionality

In the previous sections we analyzed the challenges of finding and tracking unknown objects. Once the robot is capable of focusing on the objects, it needs to understand what can be done with them. We contribute with two methods (presented in Papers C and D) that build upon the belief that objects should be modelled in terms of how they are used by the human demonstrator.

Figure 5: Chairs of various shapes and colors. Some are very common and easy to identify as chairs. Some are more difficult.

In Paper C we show that it is possible to understand the functionality of an object by looking at how it is manipulated by the human demonstrator during an action.
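As a rough illustration of this idea, the distance between a hand and an object over time can be turned into a string of symbols, and two such strings can then be compared. The sketch below is hypothetical in every detail: the five states (idle, approaching, close to the hands, in use, leaving) follow the vocabulary summarized for Paper C, but the thresholds `near`, `touch` and `eps`, the function names, and the choice of a simple k-spectrum kernel as the string kernel are assumptions made only for this example.

```python
from collections import Counter
from math import sqrt
from typing import List


def encode(distances: List[float], near: float = 0.15,
           touch: float = 0.03, eps: float = 0.01) -> str:
    """Map hand-object distances (metres, one per frame) to a symbol string.

    I = idle, A = approaching, C = close to the hand, U = in use, L = leaving.
    All thresholds are illustrative assumptions.
    """
    symbols = []
    prev = distances[0]
    for d in distances:
        if d <= touch:
            symbols.append("U")      # hand is on the object
        elif d <= near:
            symbols.append("C")      # hand is close to the object
        elif d < prev - eps:
            symbols.append("A")      # distance is shrinking
        elif d > prev + eps:
            symbols.append("L")      # distance is growing
        else:
            symbols.append("I")      # nothing is happening
        prev = d
    return "".join(symbols)


def spectrum_kernel(s: str, t: str, k: int = 2) -> float:
    """Normalised k-spectrum kernel: cosine similarity of substring counts."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    dot = sum(cs[g] * ct[g] for g in cs)
    norm = sqrt(sum(v * v for v in cs.values()) * sum(v * v for v in ct.values()))
    return dot / norm if norm else 0.0


knife = encode([0.60, 0.60, 0.40, 0.20, 0.10, 0.02, 0.02, 0.10, 0.30, 0.50, 0.50])
spoon = encode([0.50, 0.30, 0.12, 0.02, 0.02, 0.02, 0.12, 0.40, 0.40])
shelf = encode([0.60, 0.60, 0.60, 0.60, 0.60])
```

With these made-up distance profiles, the two tools that are picked up, used and put down receive similar strings and a high kernel value, while the object that is never approached is far from both, regardless of what any of them looks like.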
In Paper D we present how the spatio-temporal relationships between different objects involved in the same activity encode information about their role in the action. Our methods are motivated by an inspiring work [16] that stresses that the functionality of objects directly correlates with the actions they are involved in. This is confirmed by studies in psychology [15] stating that humans and other agents recognize objects in terms of their function, i.e., what kinds of use the objects afford. Figure 5 shows a set of chairs. Some are common, but the last two to the right are design furniture of a kind that is not seen very often in a house. It is very easy to classify all of them as chairs by looking at this image. Still, an interesting question arises: would it be easy to recognize the last chair out of this context? Probably the fact that it is in an image with other chairs biases our judgement. However, in a real scenario, if that piece of furniture is placed close to a table or if a person is sitting on it, a human can easily infer its functionality. Figure 3 shows another illustration of the affordance concept. There are many chair-like objects in the scene, but only a few of them are valid chairs that afford sitting. To a certain extent it is possible to understand the functionality of an object just by looking at its appearance. An approach that relies solely on visual features to reason about objects is more common within the Computer Vision community. For an overview of such works see [12]. For tasks such as image retrieval or image classification it is effective to exploit visual features; however, this is limiting for action recognition tasks. Visual features can be used to classify functional classes of objects only under the strong assumption that each object maps to a single semantic class. This is not always true: there are classes of objects that afford different functionalities in different contexts.
Using a knife on a zucchini has a completely different meaning compared to using it on another person. Some works are in the spirit of our approach. [16] exploits the human pose to recognize objects that afford sitting. [44] understands functional categories of image regions, such as roads and walkways, based on the behaviors of moving objects in the vicinity. [41] proposes a framework that uses contextual interaction between image regions to learn the contextual models of regions. Paper C presents how the functionality of objects can be understood by looking at how the instructor manipulates them. Similarly to [16], the object functionality is characterized by how the human relates to it. Paper D, in the spirit of [41], leverages the spatio-temporal relationships between the objects to infer their respective roles.

3.4 Challenges with Multi-Modal Action Understanding

So far all the challenges presented have been addressed from a visual perspective. However, vision, like any other source of input, is noisy and has some limitations (e.g., objects may be occluded from view). We contribute with a method (presented in Paper E) that makes use of a second source of data, in our case audio, to compensate for the limitations of vision and improve action recognition. First, such an approach makes it possible to evaluate the descriptive power of each source of information in solving a predefined task. Second, it is a first step towards a multi-modal approach that takes into account all the sensory input a robot can acquire (Figure 6).

Figure 6: Sensors available in robots.

Our approach is motivated by studies in Neuroscience showing that multiple sensory inputs are merged in our brain with the purpose of making a coherent interpretation of the world [47, 34]. Yet, in Robotics, an approach that takes into account more than one sensor is rarely taken into consideration.
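To make the idea of compensating one modality with another concrete, the following sketch combines per-action log-likelihoods from two classifiers with a weighted sum (a simple late-fusion rule). This is an assumption chosen for illustration and is not the model developed in Paper E; the function name `fuse`, the weight `w_audio`, the action labels and the scores are all hypothetical.

```python
from typing import Dict


def fuse(audio_ll: Dict[str, float], video_ll: Dict[str, float],
         w_audio: float = 0.5) -> str:
    """Return the action with the highest weighted sum of log-likelihoods.

    w_audio = 0 uses vision only, w_audio = 1 uses audio only.
    """
    scores = {
        action: w_audio * audio_ll[action] + (1.0 - w_audio) * video_ll[action]
        for action in audio_ll
    }
    return max(scores, key=scores.get)


# Made-up scores: the object is occluded, so vision cannot tell pouring from
# stirring, while the sound of pouring is distinctive.
video_ll = {"pour": -4.0, "stir": -3.9, "open": -9.0}
audio_ll = {"pour": -2.0, "stir": -6.0, "open": -5.0}

vision_only = fuse(audio_ll, video_ll, w_audio=0.0)
both = fuse(audio_ll, video_ll, w_audio=0.5)
```

With these invented numbers, vision alone picks "stir" by a small margin, whereas the fused decision is "pour". Varying `w_audio` between 0 and 1 is also one simple way to probe how much each source contributes to a given task.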
Furthermore, very little work in robot understanding exploits auditory information, even though previous attempts have shown that much can be done from that perspective. [39] learns object affordances by exploiting a linguistic description of the scene. Similarly, [43] uses language to recognize actions. We instead propose to use the sounds produced by actions as a distinctive signature for robot understanding. This is in the spirit of a similar work [42], but we move one step further with a multi-modal approach that merges audio and vision for action recognition. Such an approach mitigates the limitations of a single source. From the visual perspective, actions may be performed out of the field of view, or the object involved in an action may be occluded by other objects or by the human instructor. From the auditory point of view, some actions may not produce any sound. Other works motivate the importance of a multi-modal approach. [18] infers the content of a container while it is being grasped, exploiting visual and tactile input. Moreover, [8] estimates the stability of a grasp using tactile and visual feedback.

Chapter 3
Summary of Papers

A. Unsupervised Object Exploration Using Context

This work tackles the problem of unsupervised detection of unknown objects, since the thesis strives for generality as discussed in Section 3.1. This problem can be addressed by clustering the pixels of an image into plausible object hypotheses. An object is considered to be found if the algorithm generates one and only one cluster corresponding to it. The paper builds upon a general graph-based segmentation algorithm that generates clusters by cutting the minimum spanning tree [13] when certain conditions are met. However, a general approach is not enough to find objects, since it is highly probable that an object ends up oversegmented or undersegmented. Moreover, the definition of object is ill-posed per se.
[45] has shown that it is very difficult to have an objective evaluation of what an object is.

Figure 1: Example of a simple general segmentation algorithm. Without any contextual knowledge it is very unlikely that each object corresponds to a segment. Highly textured objects are going to be oversegmented.

Figure 2: Unsupervised object discovery. (a) Input RGB-D image. (b) Superpixels generated by an oversegmentation step. (c) Concatenation of superpixels into 3D surface segments, or facets, according to surface orientation. (d) Concatenation of facets into convex 3D shapes. (e) Measuring segment objectness; the image shows segments with high objectness.

Therefore, the strategy in this paper consists in leveraging contextual knowledge to find suitable object hypotheses. The context is given by the task that needs to be executed. In our case, given that we desire a robot able to learn manipulative activities, objects are defined as convex assemblies of smooth surfaces of the right size to be grasped and moved. The proposed algorithm generates an oversegmentation of the scene and uses the contextual knowledge in a hierarchical manner to merge the segments into plausible object hypotheses. First, segments are merged into surfaces, assuming that objects have piecewise smooth curvatures. Then surfaces are grouped according to the convexity, size and compactness of the merged cluster. Finally, gravity is used to determine the likelihood of a cluster being an object hypothesis: graspable objects are very likely to be found on a planar surface rather than floating in the air. The steps of the algorithm are shown in Figure 2.

B. Robust 3D Tracking of Unknown Objects

In this paper we address the problem of tracking previously unseen objects presented in Section 3.2.
The paper assumes that the robot does not know the object beforehand, so only the part facing the robot is visible. No assumption is made about the shape or the appearance of the object. Therefore the algorithm tries to adapt and learn new instances of the object appearance while tracking it. Tracking is performed by a combination of motion estimation using optical flow and object detection based on local features. An aggravating factor that increases the challenge of the problem is the learning mechanism that updates the object model to include new views of the object. Many state-of-the-art algorithms approach this problem by first estimating the position of a bounding box that ideally wraps the object in the image. The bounding box is then used to extract the new appearance of the object. This procedure is sensitive to drifting since the bounding box may include the background due to a wrong estimation of the object position. Additional mechanisms are usually implemented to filter bad candidates for learning.

Figure 3: The tracker uses feature points to estimate the current 3D pose of the object. The feature points in green are already known as part of the object model. The points in red belong to a new appearance candidate of the object, so they are included in the object model.

Figure 4: Behavior of different tracking algorithms. The result of the algorithm proposed in this work is shown in red. Yellow shows the result of a tracker that does not learn new appearances of the object. Blue and green do learn new instances, but they drift due to the object manipulation.

In this paper we propose to use a 3D bounding cube rather than a bounding box. It is then possible to estimate not only the object position but also its rotation relative to the initial scenario and its scale. All this information is used in the learning procedure, making the algorithm more robust and less prone to drifting.
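One way in which an estimated position, rotation and scale could help to reject bad learning candidates is sketched below: a candidate 3D point is mapped from the camera frame into the object frame and kept only if it falls inside the bounding cube. This is a minimal sketch under assumptions, not the filtering procedure of Paper B; the function names, the pose convention (a rotation matrix `R` that maps object coordinates to camera coordinates, a translation `t` and a scalar `scale`) and the numbers are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def to_object_frame(p: Vec3, R: Mat3, t: Vec3, scale: float) -> Vec3:
    """Map a camera-frame point into the object frame: q = R^T (p - t) / scale."""
    d = [p[i] - t[i] for i in range(3)]
    # R maps object coordinates to camera coordinates; its transpose maps back.
    return tuple(sum(R[r][c] * d[r] for r in range(3)) / scale for c in range(3))


def inside_cube(p: Vec3, R: Mat3, t: Vec3, scale: float, half_extent: Vec3) -> bool:
    """True if p lies within the oriented bounding cube of the tracked object."""
    q = to_object_frame(p, R, t, scale)
    return all(abs(q[i]) <= half_extent[i] for i in range(3))


# Hypothetical pose: an elongated object rotated by 90 degrees about the camera
# z axis, centred at (1, 0, 2), with half extents of 20 x 5 x 5 cm.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = (1.0, 0.0, 2.0)
half = (0.20, 0.05, 0.05)

on_object = inside_cube((1.0, 0.15, 2.0), R, t, 1.0, half)     # along the long axis
on_background = inside_cube((1.15, 0.0, 2.0), R, t, 1.0, half)  # off to the side
```

In this example the first point is accepted and the second is rejected, although both are 15 cm from the object centre. A test that ignores the rotation could not separate them, which hints at why knowing the full pose can make the choice of learning candidates less prone to drifting.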
Our tracker is able to cope with 360 degree out-of-plane rotation, large occlusions and fast motion. Some experimental results can be seen in Figure 4.

C. Functional Object Descriptors for Human Activity Modeling

The main purpose of this work is to let the robot understand the role of an object within an activity, i.e., its function or affordance. Usually this task is addressed by the Computer Vision community by extracting visual features and mapping the appearance of objects to their affordance. However, this approach suffers from two limitations. First, objects that share similar appearance features may belong to completely different affordance classes (Figure 5). For example, a knife and a cucumber share the feature of having an elongated shape, but they have completely different roles in a cooking activity. Second, an object may afford multiple functionalities. Since the commonly used appearance feature descriptors cannot capture functionalities properly, our idea instead consists in designing a new functional descriptor that can be used to understand object affordance. In order to do that, we propose to take human motion into account and represent objects directly in terms of their interaction with human hands. A dataset of human kitchen activities has been recorded, since none of the datasets existing at the time provided information about the objects and the position of the human executing the activities. We assume that the objects involved in the activity are unknown. It would be possible to reason about activities taking the object class into account; however, this approach would not scale, and it is in contrast with the main purpose of this thesis of aiming for generality. Therefore a segmentation algorithm is used to extract object hypotheses from the first frame of each video. The appearance of the object hypotheses is learned and used to track the segments during the whole video. The idea of this work consists in using the relative position between the objects and the hands to encode the functional properties of each object. The continuous distance measurement at each time step in the video is associated with a discretized representation defined by a limited vocabulary of symbols (Figure 6). The set of states is: idle, approaching, close to human hands, in use, leaving. Our proposed functional object descriptor then consists of a string of symbols. The similarity between two object descriptors can be measured using a string kernel. Experiments show that the functional descriptor can capture affordance similarities that are not captured by appearance features.

Figure 5: Appearance-based vs. function-based object modeling. Left (appearance feature space): if objects are characterized by appearance (shape, color, texture) features, a set of objects might be grouped into elongated, round and square. Right (functional feature space): using features reflecting the object function in human activity, the objects might instead be grouped into tools, ingredients and support areas.

Figure 6: Example of the discretized symbol labelling procedure, which considers the object and hand positions.

D. Recognizing Object Affordances in Terms of Spatio-Temporal Object-Object Relationships

In this paper we revisit the problem of capturing object affordance from a different perspective. The idea is that the functionality of an object is not only determined by the way a human manipulates the object but also by the other objects involved in an activity, as shown in the example in Figure 7. The object segments extracted from the dataset presented in Paper C are used to calculate the spatial relationships of the objects involved in the activity.

Figure 7: Object functionality is to a high degree defined by the interaction with other objects.
In this toy example, the functionality of a hammer depends highly on the context in which it is used. Together with a nail, the hammer affords hammering (the activity it is designed for). However, together with a beer bottle, the hammer also affords opening. Furthermore, together with a piggy bank, the hammer affords breaking. These three affordances are conceptually different, and tied to the other object that interacts with the hammer. We thus propose to represent object affordances in terms of object-object relationships.

Figure 8: Illustration of the joint functional object classification (graph with nodes O1 to O4; an example node is labelled P(X3 | O3) and an example edge P(O3, O4 | X3, X4)). Each node in the graphical model corresponds to an object functional class. Each edge is the joint probability of the connected objects, observed from the video by using the pairwise object distances.

The main contributions of this paper are twofold. First, the spatio-temporal relationships of objects are encoded without the use of a discrete set of symbols. This is possible since the path kernel used to measure the similarities applies to streams of continuous data a strategy similar to the one used by the string kernel. Second, we propose a probabilistic model that takes into account the structure of the interaction between objects to infer the affordance of individual objects (Figure 8). The main assumption behind our experiments is that the relationships between the pairs of objects within the same activity are dependent. That motivates the design of our model. We show a significant improvement in the classification rate using such an approach compared to our previous work.

E. Audio-Visual Classification and Detection of Human Manipulation Actions

This paper tackles the problem of sensory fusion for activity recognition.
The idea is that any sensor suffers from some limitations, but they can be compensated for by fusing data coming from another source of a different nature. Very little work has been done previously in robotics, mostly focused on fusing tactile sensing with vision [8]. To our knowledge, the recognition of actions by fusing auditory and visual information, as proposed in this work, has not been addressed before. The first contribution of this paper is a publicly available dataset of human actions that includes visual data, in terms of RGB-D video and 6 degree of freedom (DOF) object pose estimation, and acoustic input recorded with an array of microphones (Figures 9 and 10). The second contribution consists in a method, inspired by speech recognition, to detect and segment actions from continuous audio-visual data. Moreover, the model has been used to extensively evaluate the descriptive power of each single source employed in the recognition. Finally, our experiments show that the joint use of audio and vision outperforms the recognition of actions done with a single sensor.

Figure 9: Complementary information can help to mitigate the limitations of a single source of information. Many actions produce distinctive sounds that can be used in recognition.

Figure 10: Examples taken from the dataset to show the variety in performing the actions.

Chapter 4
Conclusions

This thesis explored the problem of object detection, tracking and modelling in the context of robot learning from demonstration. In order to reason about actions there is a chain of correlated problems that need to be tackled, from the interpretation of the raw sensory data to a representation of the high-level semantics. Each method presented addressed a different part of this task. First we presented a method to formulate possible object hypotheses given an RGB-D image. The method exploits task-driven contextual knowledge to find plausible objects.
Second, we presented a tracking algorithm that tracks a generic object and learns its appearance, since only a small portion of the object is visible in front of the camera at the beginning. Third, we inferred the object functionality by looking at how the human uses the objects and how the spatial relationships of the objects change over time. As a last step we fused auditory and visual input to understand activities. The research performed on these tasks underlined some important findings:

Object definition is ill-posed. It is, in the general case, not possible to have a completely bottom-up, data-driven algorithm to detect objects. Some top-down knowledge, in terms of contextual or task-driven knowledge, needs to be included. This is in line with recent findings in neuroscience showing that humans recognize objects in their context.

To learn or not to learn? This is the big question to answer when tracking objects. A tracker that does not learn (i.e., adapt over time) may be very robust under the assumption that the appearance of objects does not change. However, in general this assumption does not hold and a learning mechanism is needed. In that case it is very important to detect bad learning candidates in order to limit the drifting problem.

Appearance against functionality. In the context of action recognition it is more important to know what you can do with an object than to know its appearance. By looking directly at how humans use objects it is possible to understand functionality better than by relying on appearance.

Sensory fusion. Any sensor has limitations; a sensory-fusion approach is able to improve the performance in the completion of a task compared to the results of a single source.

1 Future Work

In this work we addressed the object-centered problems of robot learning. However, there are a number of subjects not involving objects that are important to address in order to understand and learn actions.
One interesting topic consists in detecting human hands and estimating their pose. Having the pose of human hands can help the task of robot learning for two main reasons. First, object segmentation is very difficult while objects are being manipulated. It may produce poor results when the object is heavily occluded or has color similarities with the human hands. Having a supportive system that correlates the estimation of object boundaries with the estimation of the hand pose may mitigate the noise produced by manipulation [36]. Second, hand pose estimation related to grasping may help in understanding an observed activity. As an example, it is very unlikely that a person will drink from a glass if it is being grasped from the top. Another topic we would like to explore more is sensory fusion. We have shown in Paper E that the limitations of a single sensor can be compensated for by including sensory input gathered from another source. In our study we used audio to understand actions. It will be interesting to continue in this direction by including speech recognition. An instructor usually uses verbal explanation to clarify an action they are showing to a student. Such a possibility should be helpful for robots as well [22]. Moreover, sensory fusion should not be limited to audio-visual fusion. Much can be done using other sensors. As an example, thermal cameras can be used to help segment human hands and objects during manipulation activities. The last topic we would like to investigate is the application of Deep Neural Networks to action recognition. Neural network-based methodologies have been shown to outperform other methods in the context of image classification and object detection [4]. It would be interesting to investigate their application to temporal sequences and compare the results to those achieved with the Hidden Markov Models (HMM) that we used in our experiments.

Bibliography

[1] A. Ückermann, R. Haschke, and H. Ritter.
Real-time 3D segmentation of cluttered scenes for robot grasping. In International Conference on Humanoid Robots, 2012.
[2] E. E. Aksoy, A. Abramov, F. Wörgötter, and B. Dellen. Categorizing object-action relations from semantic scene graphs. In IEEE International Conference on Robotics and Automation, 2010.
[3] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 2009.
[4] H. Azizpour and S. Carlsson. Self-tuned visual subclass learning with shared samples an incremental approach. ArXiv, 1405.5732, 2014.
[5] A. Bandura. Influence of models reinforcement contingencies on the acquisition of imitative response. Journal of Personality and Social Psychology, 1965.
[6] A. Bandura. Social foundations of thought and action: A social cognitive theory. Prentice Hall, 1986.
[7] M. Bar. Visual objects in context. Nature Reviews Neuroscience, 5(8), 2004.
[8] Y. Bekiroglu, D. Song, L. Wang, and D. Kragic. A probabilistic framework for task-oriented grasp stability assessment. In IEEE International Conference on Robotics and Automation, 2013.
[9] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17, 1981.
[10] I. Bülthoff and H. H. Bülthoff. Image-based recognition of biological motion, scenes and objects. In Perception of Faces, Objects, and Scenes: Analytic and Holistic Processes. 2003.
[11] P. Cheek, L. Nikpour, and H. D. Nowlin. Aging well with smart technology. Nursing Administration Quarterly, 29, 2005.
[12] L. Fei-Fei, R. Fergus, and A. Torralba. Recognizing and Learning Object Categories: Short course, 2009.
[13] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2), 2004.
[14] M. J. Fenske, E. Aminoff, N. Gronau, and M. Bar.
Top-down facilitation of visual object recognition: object-based and context-based contributions. Progress in Brain Research, 155, 2006.
[15] J. J. Gibson. The Ecological Approach to Visual Perception. Lawrence Erlbaum Associates, 1979.
[16] H. Grabner, J. Gall, and L. van Gool. What makes a chair a chair? In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011.
[17] N. Gronau, M. Neta, and M. Bar. Integrated contextual representation for objects' identities and their locations. Journal of Cognitive Neuroscience, 20(3), 2008.
[18] P. Guler, Y. Bekiroglu, X. Gratal, K. Pauwels, and D. Kragic. What's in the container? Classifying object contents from vision and touch. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2014.
[19] A. Gupta, A. Kembhavi, and L. S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10), 2009.
[20] J. Guttler, C. Georgoulas, T. Linner, and T. Bock. Towards a Future Robotic Home Environment: A Survey. Gerontology, 2014.
[21] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured output tracking with kernels. In IEEE International Conference on Computer Vision, 2011.
[22] M. Johnson-Roberson, J. Bohg, G. Skantze, J. Gustafson, R. Carlson, B. Rasolzadeh, and D. Kragic. Enhanced visual scene understanding through human-robot dialog. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011.
[23] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7), 2012.
[24] H. Kjellström, J. Romero, and D. Kragic. Visual object-action recognition: Inferring object affordances from human demonstration. Computer Vision and Image Understanding, 115(1), 2011.
[25] H. Kjellström, J. Romero, D. Martínez, and D. Kragic.
Simultaneous visual recognition of manipulation actions and manipulated objects. In European Conference on Computer Vision, volume 2, 2008.
[26] S. Leutenegger, M. Chli, and R. Y. Siegwart. BRISK: Binary robust invariant scalable keypoints. In IEEE International Conference on Computer Vision, 2011.
[27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 2004.
[28] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, 1981.
[29] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[30] L. Montesano, M. Lopes, A. Bernardino, and J. Santos-Victor. Learning object affordances: From sensory motor coordination to imitation. IEEE Transactions on Robotics, 24(1), 2008.
[31] T. Mörwald, J. Prankl, A. Richtsfeld, M. Zillich, and M. Vincze. BLORT - the blocks world robotic vision toolbox. In Best practice in 3D perception and modeling for mobile manipulation (in conjunction with ICRA 2010), 2010.
[32] G. Nebehay and R. Pflugfelder. Consensus-based matching and tracking of keypoints for object tracking. In IEEE Winter Conference on Applications of Computer Vision, 2014.
[33] K. Pauwels, L. Rubio, J. Diaz, and E. Ros. Real-time model-based rigid object pose estimation and tracking combining dense and sparse visual cues. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2013.
[34] A. Revonsuo and J. B. Newman. Binding and consciousness. Consciousness and Cognition, 8(2), 1999.
[35] A. Richtsfeld, T. Mörwald, J. Prankl, M. Zillich, and M. Vincze. Segmentation of unknown objects in indoor environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
[36] J. Romero, T. Feix, C. H. Ek, H. Kjellström, and D. Kragic.
Extracting postural synergies for robotic grasping. IEEE Transactions on Robotics, 29(6), 2013.
[37] C. Rother, V. Kolmogorov, and A. Blake. "GrabCut": Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3), 2004.
[38] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In IEEE International Conference on Computer Vision, 2011.
[39] G. Salvi, L. Montesano, A. Bernardino, and J. Santos-Victor. Language bootstrapping: Learning word meanings from perception-action association. IEEE Transactions on Systems, Man, and Cybernetics, 2012.
[40] S. Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3, 1999.
[41] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
[42] J. Stork, L. Spinello, J. Silva, and K. O. Arras. Audio-based human activity recognition using non-Markovian ensemble voting. In International Symposium on Robot and Human Interactive Communication, 2012.
[43] C. L. Teo, Y. Yang, H. Daumé III, C. Fermüller, and Y. Aloimonos. Towards a Watson that sees: Language-guided action recognition for robots. In IEEE International Conference on Robotics and Automation, 2012.
[44] M. W. Turek, A. Hoogs, and R. Collins. Unsupervised learning of functional categories in video scenes. In European Conference on Computer Vision, 2010.
[45] R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward objective evaluation of image segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 2007.
[46] Y. Wu, J. Lim, and M-H. Yang. Online object tracking: A benchmark. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2013.
[47] S. Zmigrod and B. Hommel. Feature integration across multimodal perception and action: a review.
Multisensory Research, 26, 2013.