INRS-Télécommunications
Institut national de la recherche scientifique

Object and Event Extraction for Video Processing and Representation in On-Line Video Applications

by Aishy Amer

Thesis presented for the degree of Philosophiae Doctor (Ph.D.) in Telecommunications

Evaluation jury:
External examiner: Dr. D. Nair, National Instruments, Texas
External examiner: Prof. H. Schröder, Universität Dortmund, Germany
Internal examiner: Prof. D. O'Shaughnessy, INRS-Télécommunications
Research co-director: Prof. A. Mitiche, INRS-Télécommunications
Research director: Prof. E. Dubois, Université d'Ottawa

© Aishy Amer, Montréal, 20 December 2001

Acknowledgments

I have enjoyed and benefited from the pleasant and stimulating research environment at INRS-Télécommunications. Thank you to all the members of the centre for all the good times.

I am truly grateful to my advisor Prof. Eric Dubois for his wise suggestions and continuously encouraging support during the past four years. I would also like to express my sincere gratitude to Prof. Amar Mitiche for his wonderful supervision and for his active contributions to improving the text of this thesis. I also thank Prof. Konrad for his help during the initial work of this thesis. I am also indebted to my colleagues, in particular Carlos, François, and Souad, for the interesting discussions and for their active help in enhancing this document. My special thanks go to each member of the dissertation jury for having accepted to evaluate such a long thesis. I have sincerely appreciated your comments and questions.

Some of the research presented in this text was initiated during my stay at Universität Dortmund, Germany. I would like to thank Prof. Schröder for advising my initial work. I am also grateful to my German friends, former students, and colleagues, in particular to Dr. H. Blume, who greatly helped me in Dortmund. Danke schön.

My deep gratitude to my sisters, my brothers, my nieces, and my nephews, who have supplied me with love and energy throughout my graduate education. My special thanks go also to Sara, Alia, Marwan, Hakim, and to all my friends for being supportive despite the distances. I would like to express my warm appreciation to Hassan for caring and being patient during the past four years. His understanding provided the strongest motivation to finish this writing.

Abstract

As the use of video becomes increasingly popular and widespread through, for instance, broadcast services, the Internet, and security-related applications, fast, automated, and effective techniques to represent video based on its content, such as objects and meanings, are important topics of research. In a surveillance application, for instance, object extraction is necessary to detect and classify object behavior, and with video databases, effective retrieval must be based on high-level features and semantics.
Automated content representation would significantly facilitate the use, and reduce the cost, of video retrieval and surveillance by humans. Most video representation systems are based on low-level quantitative features or focus on narrow domains. There are few representation schemes based on semantics; most of these are context-dependent, focus on the constraints of a narrow application, and therefore lack generality and flexibility. Most systems assume simple environments, for example, without object occlusion or noise.

The goal of this thesis is to provide a stable content-based video representation rich in terms of generic semantic features and moving objects. Objects are represented using quantitative and qualitative low-level features. Generic semantic features are represented using events and other high-level motion features. To achieve higher applicability, content is extracted independently of the type and the context of the input video. The proposed system is aimed at three goals: flexible content representation, reliable and stable processing that foregoes the need for precision, and low computational cost. The proposed system targets video of real environments such as those with object occlusions and artifacts.

To achieve these goals, three processing levels are proposed: video enhancement to estimate and reduce noise, video analysis to extract meaningful objects and their spatio-temporal features, and video interpretation to extract context-independent semantics such as events. The system is modular and layered from low level to middle level to high level, where levels exchange information. The reliability of the proposed system is demonstrated by extensive experimentation on various indoor and outdoor video shots. Reliability is due to noise adaptation and to the correction or compensation of estimation errors at one step by processing at subsequent steps where higher-level information is available. The proposed system provides a response in real time for applications with a rate of up to 10 frames per second on a shared computing machine. This response is achieved by dividing each processing level into simple but effective tasks and avoiding complex operations.

Extraction d'objets et d'événements pour le traitement et la représentation de séquences vidéo dans des applications en ligne

by Aishy Amer

Résumé

Table of contents
I. Context and objective (Page ix)
II. Overview of related work (Page xi)
III. Proposed approach and methodology (Page xii)
IV. Results (Page xviii)
V. Conclusion (Page xix)
VI. Possible extensions (Page xxii)

I. Context and objective

Visual information has become integrated into every sector of modern communication, even low-bandwidth services such as mobile communication. Effective techniques for the analysis, description, manipulation, and retrieval of visual information are therefore important and practical research topics. Video is subject to different interpretations by different observers, and content-based video representation can vary with observers and applications. Several existing systems address these problems by trying to develop a solution that is general for all video applications. Others concentrate on solving complex situations but assume a simple environment, for example, one without occlusion, noise, or artifacts.
Research in video processing has considered video data mainly as pixels, blocks, or global structures to represent video content. This is not sufficient for advanced video applications. In a surveillance application, for example, object-related video representation requires the automatic detection of activities. For video databases, retrieval must be based on semantics. Consequently, content-based video representation has become a highly active field of research. Examples of this activity are multimedia standards such as MPEG-4 and MPEG-7 and various surveillance and visual-database retrieval projects [129, 130].

Given the ever-increasing amount of stored video data, developing automatic and effective techniques for content-based video representation is a problem of growing importance. Such video representation aims at a significant reduction of the amount of video data by transforming a video shot of some hundreds or thousands of images into a small set of information. This data reduction has two advantages: first, a large video database can be efficiently searched based on its content and, second, memory usage is reduced significantly.

Developing content-based representation requires the resolution of two key issues: defining what video content is of interest and what attributes are suitable to represent this content. The study of the properties of the human visual system (HVS) helps answer some of these questions. When viewing a video, the HVS is most sensitive to moving areas and, more generally, to moving objects and their features. The interest in this thesis is first in high-level object features (i.e., semantics) and then in low-level ones (e.g., texture). The main question is then: what level of object semantics and which features are most important for content-based video applications? For example, are high-level intentional descriptions needed?

An important observation is that the subject of the majority of video is a moving object [105, 72, 56] that performs activities and acts to create events. Moreover, the HVS is able to search for interesting activities and events by quickly scanning a video. Some video representation systems implement such "flipping" by skipping images on the basis of low-level features (for example, using color-based key-image extraction) or by extracting global cues. This can, however, miss interesting data; "flipping" should instead be based on object activities or events combined with low-level features such as shape. This would allow flexible video retrieval and surveillance.
To effectively represent video based on its content, such as objects and events, three video processing systems are needed: video enhancement to reduce noise and artifacts, video analysis to extract low-level video features, and video interpretation to describe semantic content.

II. Overview of related work

Recently, video systems incorporating content-based video representations have been developed. Most of the related research concentrates on video analysis techniques without integrating them into a functional video system. This leads to interesting techniques that are, however, often disconnected from practical concerns. Furthermore, these systems represent video by basic global features such as global motion or key images.

Few video representation systems have addressed this subject using objects; indeed, most of them use only basic features such as motion or shape. The most developed object-based video representations concentrate on narrow domains (for example, soccer scenes or road-traffic monitoring). Systems that incorporate object-based video representations use a quantitative description of the video data and objects. Users of advanced video applications such as video retrieval do not know exactly what the video they are looking for looks like; they do not have exact quantitative information about motion, shape, or texture. Therefore, qualitative video representations that are easy to use for video retrieval or surveillance are essential. In video retrieval, for example, most existing retrieval tools ask the user for a sketch or an example video shot, for instance after browsing the database. However, browsing large databases can be very time-consuming, especially for complex scenes (i.e., real-world scenes). Providing users with means to describe a video by qualitative descriptors is essential for the success of such applications.

There are few representation schemes concerned with the activities and events of video shots. Much of the work on event detection and classification concentrates on how to express events using artificial-intelligence techniques such as reasoning and inference. Other event-based video representation systems are developed for specific domains.

Despite the great improvement in the quality of modern video acquisition systems, noise remains a problem that complicates video signal processing algorithms. In addition, various coding artifacts are found in digitally transmitted video. Therefore, noise and artifact reduction is still an important task in video applications. Both noise and coding artifacts affect the quality of video representation and should be taken into account.
While noise reduction has been the subject of many publications (few methods deal with real-time constraints), the impact of coding artifacts on the performance of video processing has not been sufficiently studied.

Owing to progress in micro-electronics, it is possible to include sophisticated video processing techniques in services and devices. However, the real-time aspect of these new techniques is crucial for their general application. Many video applications requiring high-level video content representation operate in real-time environments and thus demand real-time performance. Few content-based representation approaches take this constraint into consideration.

III. Proposed approach and methodology

The objective of this thesis is to develop a system for content-based video representation through an automated object extraction system integrated with an event detection system, without user interaction. The goal is to provide a content-based representation rich in terms of generic events and to handle a wide range of practical video applications. Objects are represented by quantitative and qualitative low-level features. Events, in turn, are represented by high-level object features such as activities and actions. This study raises three important issues:

1. flexible object representations that can easily be searched for video summarization, indexing, and manipulation,
2. reliable and stable video interpretation that foregoes the need for precision, and
3. low computational cost.

This requires the contribution of algorithms that answer these three issues for the realization of content-based, consumer-oriented video applications such as surveillance and video database retrieval. These algorithms must concentrate on practical video analysis issues oriented to the needs of object- and event-based video systems.

Figure 1: Overview of the proposed system (video shot → video enhancement → object-oriented video analysis → event-oriented video interpretation → object and event descriptors).

The proposed system is designed for real situations with object occlusions, illumination changes, noise, and artifacts. To produce a high-level video representation, the proposed framework involves three stages (see Figure 1): enhancement, analysis, and interpretation. The original video is presented at the input of the video enhancement module, whose output is an enhanced version of it. This enhanced video is then processed by the video analysis module, which produces a low-level description of the video. The video interpretation module receives these low-level descriptions and produces a high-level description of the original video. The results of one stage are integrated to support subsequent stages, which in turn correct or support the preceding steps. For example, an object tracked at one stage is supported by the low-level segmentation; the tracking results are in turn integrated into the segmentation to confirm it.
By analogy with the human visual system (HVS), this approach finds objects in such a way that partial detection and identification provide a new context, which in turn supports further identification [103, 3]. The system can be viewed as a framework of methods and algorithms to build automatic dynamic-scene interpretation systems. The robustness of the proposed methods is demonstrated by extensive experimentation on well-known video sequences. Robustness results from adaptation to video noise and artifacts, and from processing that takes the errors made at one step into account for correction or compensation at subsequent steps.

The framework proposed in this thesis is designed for applications where an interpretation of the input video is needed ("what is this sequence about?"). This can be illustrated by two examples: surveillance and video retrieval. In a video surveillance system, an alarm can be activated when the proposed system detects a particular object behavior. In a video retrieval system, users can search for a video by providing a qualitative description, using information such as object attributes (e.g., shape), spatial relationships (e.g., object i is close to object j), location (e.g., object i is at the bottom of the image), and semantic or high-level features (e.g., object action: object i moves left and is then occluded; event: removal or deposit of objects). The retrieval system can then find the frames whose content best matches the qualitative description. A desirable property of video representation strategies is to answer simple observation-based questions, for example, to select objects (who is in the scene), describe their action (what is he/she doing), and determine their location (where the action took place) [72, 56]. In the absence of a specific application, a generic model must be adaptable (for example, to new definitions of actions and events).

Without real-time considerations, a content-based video representation approach could lose its applicability. Furthermore, robustness to noise and coding artifacts is important for practical use of the solution. The proposed system is designed to achieve a balance between effectiveness (solution quality) and computation time. The system shown in Figure 2 is described as follows:

• Video enhancement first classifies noise and artifacts in video and then uses a new method for noise estimation and another for spatial noise reduction (Chapter 2). The proposed noise estimation technique produces reliable estimates in images with smooth and/or structured regions. It is a block-based method that takes the image structure into account and uses a measure other than the variance to decide whether a block is homogeneous. It uses no thresholds and automates the procedure by which block-based methods average block variances.
The new spatial noise reduction technique uses a low-complexity low-pass filter to eliminate spatially uncorrelated noise. The basic idea is to use a set of high-pass filters to detect the most appropriate filtering direction. The proposed filter reduces image noise while preserving structure and is adapted to the estimated amount of noise.

• Video analysis is mainly concerned with extracting meaningful objects and quantitative low-level features from the video. The method proceeds in four steps (Chapters 3-6):

◦ object segmentation based on motion detection,
◦ object-based motion estimation,
◦ region merging,
◦ object tracking based on a non-linear combination of spatio-temporal features.

The proposed algorithm extracts the important video objects that can be used as indices in videos based on flexible object representations, and analyzes the video to detect object-related events for semantic representation and interpretation.

• The proposed object segmentation method classifies the pixels of the video images as belonging to distinct objects based on motion and contour features (Chapter 4). It consists of simple procedures and is carried out in four steps:

◦ binarization of the input images based on motion detection,
◦ morphological boundary detection,
◦ contour analysis and skeletonization,
◦ object labeling.

The most critical task is the binarization, which must be reliable throughout the video sequence. The binarization algorithm memorizes previously detected motion to adapt the process. Boundary detection relies on new morphological operations whose computational cost is significantly reduced. The advantage of the morphological detection is the generation of continuous boundaries one pixel wide. The contour analysis transforms boundaries into contours and eliminates unwanted contours. Small contours are, however, eliminated only if they cannot be associated with previously extracted contours, that is, if a small contour has no corresponding contour in the previous image. Small contours lying completely inside a large contour are merged with the latter according to homogeneity criteria.

• Motion estimation determines the extent and direction of the motion of each extracted object (Chapter 5). In the proposed approach, the information extracted from the object (for example, size, minimum bounding box (MBB), position, and motion direction) is used in a rule-based process consisting of object matching followed by MBB motion estimation based on the displacement of the MBB sides, which makes the estimation process independent of the intensity signal and of the type of object motion.

• The tracking method tracks and links moving objects and records their temporal features. It transforms the segmented objects produced by the segmentation process into video objects (Chapter 6). The main problem of tracking systems is their reliability in the case of occlusion, shadowing, and object splitting.
The proposed tracking method is based on a non-linear voting system to resolve the problem of multiple correspondences (a simplified sketch of this voting step is given after this list). The occlusion problem is alleviated by a simple detection procedure based on the estimated displacements of the object's MBB, followed by a median-based prediction procedure that provides a reasonable estimate for (partially or completely) occluded objects. Objects are tracked from the moment they enter the scene and also during occlusion, which is very important for activity analysis. Plausibility rules for coherence, error allowance, and control are proposed for effective tracking over long periods. An important contribution at the tracking level is the reliable region merging, which improves the performance of the entire video analysis system. The proposed algorithm has been developed for content-based video applications such as surveillance or indexing and retrieval.

• Video interpretation is mainly concerned with the extraction of qualitative and semantic video features (Chapter 7). Its main objective is to provide representation tools that combine low-level features with high-level video data. This integration is essential to cope with the enormous generic visual content contained in a video sequence. The emphasis is therefore on realizing a simple, robust, and automatic generic event detection procedure. To identify events, a qualitative description of object motion is an important step towards linking low-level features to high-level feature retrieval. To this end, the motion behavior of video objects is analyzed to represent events as well as important actions. This means that low-level features can be combined in such a way as to contribute to high-level ones. First, qualitative descriptions of the low-level object features and of the relations between objects are derived. Then, automatic methods for event-based high-level video content representation are proposed.

The purpose of video is, in general, to document events and the activities of objects or sets of objects. Users generally look for video objects that convey a certain message [124], and they perceive and retain in memory [72, 56]: 1) events ("what happened"), 2) objects ("who is in the scene"), 3) locations ("where it happened"), and 4) time ("when it happened"). Users are thus attracted to objects and their features and focus first on the high-level motion-related features. Consequently, the proposed video analysis is designed to:

◦ make decisions on lower-level data to support subsequent processing levels,
◦ qualitatively represent objects and their spatial, temporal, and relational features,
◦ extract the generally useful semantic features of objects, and
◦ provide a response automatically and efficiently (real-time operation).
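To make the voting idea concrete, the following is a minimal sketch in Python, assuming a small set of MBB-based features that each cast a binary vote; the class, the feature choices, and the thresholds (Obj, min_votes, the size-ratio bounds) are illustrative assumptions and not the exact features or rules of Chapter 6.

    # Minimal sketch of feature-voting correspondence (illustrative only; the names,
    # features, and vote rules are assumptions, not the thesis' exact formulation).
    from dataclasses import dataclass

    @dataclass
    class Obj:
        cx: float; cy: float      # MBB centre
        w: float; h: float        # MBB width and height
        area: float
        vx: float; vy: float      # last estimated displacement

    def votes(tracked: Obj, candidate: Obj) -> int:
        """Each feature casts one binary vote; summing votes (instead of averaging
        errors) gives a simple non-linear combination of the features."""
        v = 0
        # 1. Position: candidate centre close to the predicted centre.
        px, py = tracked.cx + tracked.vx, tracked.cy + tracked.vy
        if abs(candidate.cx - px) < tracked.w and abs(candidate.cy - py) < tracked.h:
            v += 1
        # 2. Size: areas of the two MBBs are similar.
        if 0.5 < candidate.area / max(tracked.area, 1.0) < 2.0:
            v += 1
        # 3. Shape: aspect ratios are similar.
        if abs(candidate.w / max(candidate.h, 1.0) - tracked.w / max(tracked.h, 1.0)) < 0.5:
            v += 1
        # 4. Motion: displacement direction is consistent with the previous one.
        dx, dy = candidate.cx - tracked.cx, candidate.cy - tracked.cy
        if dx * tracked.vx + dy * tracked.vy >= 0:
            v += 1
        return v

    def match(tracked: Obj, candidates: list, min_votes: int = 3):
        """Return the candidate with the most votes, or None if no match is plausible."""
        best = max(candidates, key=lambda c: votes(tracked, c), default=None)
        if best is None or votes(tracked, best) < min_votes:
            return None   # no plausible match: possible occlusion
        return best

When no plausible candidate is found, the occlusion procedure described above would take over, for example by predicting the object position from the median of its past displacements.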
Figure 2: Block diagram of the proposed framework for object- and event-based video representation (video enhancement: noise estimation and reduction, image stabilization, global feature extraction, background update; video analysis: global-motion compensation, motion-based object segmentation (pixels to objects), object-based motion estimation, voting-based object tracking (objects to video objects), yielding spatio-temporal object descriptors and global shot descriptors; video interpretation: analysis and interpretation of the low-level descriptors, event detection and classification, serving object- and event-based applications such as event-based decision-making). The contributions are shown as gray blocks and the interactions between modules are marked by dashed arrows. σn is the standard deviation of the image noise.

IV. Results

In real-time video applications, fast, unsupervised, object-oriented video analysis is needed. Both objective and subjective evaluations and comparisons show the robustness of the proposed video analysis method on noisy images as well as on images with illumination changes, while its complexity remains low. The method uses few parameters, which are automatically adjusted to the noise and to temporal changes in the video sequence (Figure 3 illustrates an example of the video analysis of the 'Autoroute' sequence).

This thesis proposes an event-oriented video interpretation scheme. To detect events, perceptual descriptions of events that are common to a wide range of applications are proposed. The detected events include: {enter, appear, exit, disappear, move, stop, occlude/is occluded, removed/was removed, deposited/was deposited, abnormal motion}. To detect events, the proposed system monitors the behavior and the features of each object in the scene. If specific conditions are met, the events associated with these conditions are detected. Event analysis is done on-line, that is, events are detected as they happen. Specific features such as object motion or size are memorized for each image and compared with subsequent images in the sequence. Event detection is not based on the geometry of the objects but on their features and relations over time.

The thesis proposes approximate but effective models to define useful events. In various applications, these approximate models, even if not precise, are adequate. Experiments on well-known video sequences have verified the effectiveness of the proposed scheme (see, for example, Figure 4). The detected events are sufficiently common to a wide range of video applications to aid video surveillance and retrieval. For example: 1) the removal or deposit of objects at a surveillance site can be monitored and detected as soon as it happens, 2) the displacement of moving objects can be monitored and announced, and 3) the behavior of customers in stores or underpasses can be monitored.
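As an illustration of this condition-based detection, here is a minimal sketch in Python; the state fields (age, lost, vx, vy, occluded, was_moving), the border margin, and the speed threshold are assumptions chosen for the example, not the conditions actually used in Chapter 7.

    # Illustrative sketch of condition-based event detection (the state fields,
    # thresholds, and event names below are assumptions for illustration).
    def detect_events(obj, frame_w, frame_h, border=10, stop_speed=0.5):
        """Return the events triggered by one tracked object in the current frame.
        `obj` is assumed to expose simple per-frame attributes kept by the tracker."""
        events = []
        near_border = (obj.cx < border or obj.cy < border or
                       obj.cx > frame_w - border or obj.cy > frame_h - border)
        if obj.age == 1:
            # First time the object is seen: at the image border -> "enter",
            # in the middle of the image -> "appear" (e.g., a deposited object).
            events.append("enter" if near_border else "appear")
        if obj.lost:
            events.append("exit" if near_border else "disappear")
        speed = (obj.vx ** 2 + obj.vy ** 2) ** 0.5
        if obj.was_moving and speed < stop_speed:
            events.append("stop")
        elif not obj.was_moving and speed >= stop_speed:
            events.append("move")
        if obj.occluded:
            events.append("occluded")
        return events

Called once per frame for every tracked object, for example detect_events(SimpleNamespace(cx=5, cy=40, vx=0.0, vy=0.0, age=1, lost=False, occluded=False, was_moving=False), 352, 288) with types.SimpleNamespace would report ["enter"]. In the same spirit, the removed/deposited events could be triggered by relating the appearance or disappearance of a stationary object to a nearby moving object.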
The entire system (video analysis and interpretation) requires on average between 0.12 and 0.35 seconds to process the data between two images. Typically, surveillance video is recorded at a rate of 3 to 15 frames per second. The proposed system produces a real-time response for surveillance applications at 3 to 10 frames per second. To increase the performance of the system for applications with higher frame rates, code optimization is needed. The process can be accelerated i) by optimizing the implementation of occlusion and object-splitting handling, ii) by optimizing the implementation of the change detection techniques, and iii) by working with integer values instead of reals (where appropriate) and with additions instead of multiplications.

V. Conclusion

This study has contributed a new, context-independent model for video processing and representation based on objects and events. Object- and event-based video processing and representation are needed for automatic video database retrieval and for video surveillance. This model represents the video sequence in terms of objects and a rich set of generic events able to support content-based, user-oriented video applications. It allows efficient and flexible analysis and interpretation of video in real environments where occlusions, illumination changes, noise, and artifacts may occur.

In the proposed model, processing is organized in levels, going from low level to high level through an intermediate level. Each level is organized in a modular fashion and is responsible for a number of specific aspects of the analysis. The processing results of a lower level are integrated to support processing at the higher levels. The three processing levels are:

Video enhancement. A new structure-preserving, low-complexity method for spatial noise filtering has been developed. This filtering method is supported by a procedure that reliably estimates the noise in the image. The estimated noise is also used to support the subsequent video analysis (Chapter 2).

Video analysis. A method for extracting meaningful video objects and their detailed features has been developed. It is based on reliable and computationally efficient segmentation and on object tracking. It is error-tolerant and can detect and correct errors. The system can provide a real-time response for surveillance applications at 3 to 10 frames per second. The tracking method is effective and applicable to a wide class of video sequences. The effectiveness of the video analysis system has been demonstrated by several challenging experiments (Chapters 3-6).
Video interpretation. A context-independent video interpretation system has been developed. It allows a video representation that is rich in terms of generic events and qualitative object features, which makes it usable for a wide range of applications. Qualitative object descriptors are extracted by quantizing the parametric descriptions of the objects. To extract events, changes in motion and in features are continuously processed; events are detected when the conditions that define them are met. Experiments on well-known video sequences have demonstrated the effectiveness of the proposed technique (Chapter 7).

Figure 3: Object trajectories in the 'Autoroute' sequence: (a) the object trajectories in the image plane, (b) the trajectories in the horizontal direction, and (c) the trajectories in the vertical direction. Figures 3(b) and (c) allow an interpretation of the objects' motion behavior: for example, O2 starts at the left of the image and moves to the image border. Various objects enter the scene repeatedly; some objects move quickly while others are slower. The system tracks all objects consistently.

Figure 4: Key-event images of the 'Hall' sequence. This sequence is an example of an indoor surveillance application. Key event: O6 is deposited by object O1.

VI. Possible extensions

There are a number of issues to consider in order to improve the performance of the proposed system and to widen its fields of application.

• Execution time: the implementation can be optimized for faster execution.

• Object segmentation: in the context of MPEG video coding, motion vectors are available. An immediate extension of the proposed segmentation technique is to integrate the motion information from the MPEG stream to support object segmentation. The aim of this integration would be to improve the segmentation without a significant increase in computational cost.

• Motion estimation: the proposed motion model can be further improved to allow more precise estimation. A direct extension would be to examine the displacements of the diagonal extensions of the object and to adapt the previously estimated motion for greater stability.

• Highlights and shadows: the system can benefit from the detection of shadows and the compensation of their effects.

• Image stabilization: image stabilization techniques can be used to allow the analysis of video data with camera motion and changing backgrounds.

• Video interpretation: a larger set of events can be considered in order to serve a larger set of applications.
An interface can be designed to facilitate the interaction between the system and the user; the definition of such an interface requires a study of the needs of the users of these video applications. A classification of moving objects versus clutter motion, such as the motion of trees in the wind, can be used to reject events. One possible classification is to differentiate between purposeful motion (that of a vehicle or a person) and purposeless motion.

Contents

Résumé vii

1 Introduction 1
1.1 Background and objective 1
1.2 Review of related work 3
1.3 Proposed approach and methodology 4
1.4 Contributions 8
1.5 Thesis outline 9

2 Video Enhancement 11
2.1 Motivation 11
2.2 Noise and artifacts in video signals 13
2.3 Modeling of image noise 15
2.4 Noise estimation 16
2.4.1 Review of related work 16
2.4.2 A homogeneity-oriented noise estimation 18
2.4.3 Evaluation and comparison 20
2.4.4 Summary 24
2.5 Spatial noise reduction 24
2.5.1 Review of related work 24
2.5.2 Fast structure-preserving noise reduction method 28
2.5.3 Adaptation to image content and noise 29
2.5.4 Results and conclusions 31
2.5.5 Summary 33

3 Object-Oriented Video Analysis 39
3.1 Introduction 39
3.2 Fundamental issues 41
3.3 Related work 42
3.4 Overview of the proposed approach 44
3.5 Feature selection 46
3.5.1 Selection criteria 46
3.5.2 Feature descriptors 48
3.6 Summary and outlook 51
3.6.1 Summary 51
3.6.2 Outlook 52

4 Object Segmentation 53
4.1 Motivation 53
4.2 Overall approach 54
4.3 Motion detection 54
4.3.1 Related work 56
4.3.2 A memory-based motion detection method 58
4.3.3 Results and comparison 60
4.4 Thresholding for motion detection 61
4.4.1 Introduction 61
4.4.2 Review of thresholding methods 61
4.4.3 Artifact-adaptive thresholding 64
4.4.4 Experimental results 66
4.5 Morphological operations 67
4.5.1 Introduction 67
4.5.2 Motivation for new operations 71
4.5.3 New morphological operations 72
4.5.4 Comparison and discussion 75
4.5.5 Morphological post-processing of binary images 76
4.6 Contour-based object labeling 77
4.6.1 Contour tracing 77
4.6.2 Object labeling 81
4.7 Evaluation of the segmentation method 81
4.7.1 Evaluation criteria 81
4.7.2 Evaluation and comparison 82
4.8 Summary 84

5 Object-Based Motion Estimation 91
5.1 Introduction 91
5.2 Review of methods and motivation 92
5.3 Modeling object motion 94
5.4 Motion estimation based on object-matching 95
5.4.1 Overall approach 95
5.4.2 Initial estimation 96
5.4.3 Motion analysis and update 98
5.5 Experimental results and discussion 102
5.5.1 Evaluation criteria 102
5.5.2 Evaluation and discussion 103
5.6 Summary 107

6 Voting-Based Object Tracking 109
6.1 Introduction 109
6.2 Review of tracking algorithms 110
6.3 Non-linear object tracking by feature voting 112
6.3.1 HVS-related considerations 112
6.3.2 Overall approach 113
6.3.3 Feature selection 116
6.3.4 Feature integration by voting 117
6.3.5 Feature monitoring and correction 122
6.3.6 Region merging 125
6.3.7 Feature filtering 127
6.4 Results and discussions 129
6.5 Summary and outlook 131

7 Video Interpretation 145
7.1 Introduction 145
7.1.1 Video representation strategies 147
7.1.2 Problem statement 149
7.1.3 Related work 150
7.1.4 Proposed framework 153
7.2 Object-based representation 153
7.2.1 Spatial features 153
7.2.2 Temporal features 155
7.2.3 Object-relation features 156
7.3 Event-based representation 163
7.4 Results and discussions 165
7.4.1 Event-based video summary 169
7.4.2 Key-image based video representation 170
7.5 Summary

8 Conclusion 181
8.1 Review of the thesis background 181
8.2 Summary of contributions 181
8.3 Possible extensions 184

Bibliography 185

A Applications 195
A.1 Video surveillance 196
A.2 Video databases 196
A.3 MPEG-7 198

B Test Sequences 200
B.1 Indoor sequences 200
B.2 Outdoor sequences 201

C Abbreviations 203

Chapter 1
Introduction

Video is becoming integrated in various personal and professional applications such as entertainment, education, tele-medicine, databases, security applications, and even low-bandwidth wireless applications. As the use of video becomes increasingly popular, automated and effective techniques to represent video based on its content, such as objects and semantic features, are important topics of research. Automated and effective content-based video representation is significant in dealing with the explosion of visual information through broadcast services, the Internet, and security-related applications. For example, it would significantly facilitate the use, and reduce the costs, of video retrieval and surveillance by humans.

This thesis develops a framework for automated content-based video representation rich in terms of object and semantic features. To keep the framework generally applicable, objects and semantic features are extracted independently of the context of a video application. To test the reliability of the proposed framework, both indoor and outdoor real video environments are used.

1.1 Background and objective

Given the ever-increasing amount of video and the related storage, maintenance, and processing needs, developing automatic and effective techniques for content-based video representation is a problem of increasing importance. Such video representation aims at a significant reduction of the amount of video data by transforming a video shot of some hundreds or thousands of images into a small set of information based on its content. This data reduction has two advantages: large video databases can be efficiently searched based on video content, and memory usage is reduced significantly.
Despite the many contributions in the field of video and image processing, the scientific community has debated their low impact on applications: video is subject to different interpretations by different observers, and video description can vary according to observers and applications [141, 103, 78, 73]. Many video processing and representation techniques address problems by trying to develop a solution that is general for all video applications. Some focus on solving complex situations but assume a simple environment, for example, without object occlusion, noise, or artifacts.

Video processing and representation research has mainly extracted video data in terms of pixels, blocks, or some global structure to represent video content. This is not sufficient for advanced video applications. In a surveillance application, for instance, object-related video representation is necessary to automatically detect and classify object behavior. With video databases, advanced retrieval must be based on high-level object features and semantic interpretation. Consequently, advanced content-based video representation has become a highly active field of research. Examples of this activity are the setting of multimedia standards such as MPEG-4 and MPEG-7, and various video surveillance and retrieval projects [129, 130].

Developing advanced content-based video representation requires the resolution of two key issues: defining what are interesting video contents and what features are suitable to represent these contents. Properties of the human visual system (HVS) help in solving some aspects of these issues: when viewing a video, the HVS is, in general, attracted to moving objects and their features; it focuses first on the high-level object features (e.g., meaning) and then on the low-level features (e.g., shape). The main questions are: what level of object features and semantic content is most important and most common for content-oriented video applications? Are high-level intentional descriptions such as what a person is thinking needed? Is the context of the video data necessary to extract useful content?

An important observation is that the subject of the majority of video is related to moving objects, in particular people, that perform activities and interact, creating object meaning such as events [105, 72, 56]. A second observation is that the HVS is able to search a video by quickly scanning ("flipping") it for activities and interesting events. In addition, to design widely applicable content-based video representations, the extraction of video content independently of the context of the video data is required. It can be concluded that objects and event-oriented semantic features are important and common for a wide range of video applications. To effectively represent video, three video processing levels are required: video enhancement to reduce noise and artifacts, video analysis to extract low-level video features, and video interpretation to describe content in semantic-related terms.

1.2 Review of related work

Recently, video systems supporting content-based video representations have been developed; pertinent literature and specific applications of the proposed methods and algorithms are reviewed in the respective sections of the main chapters of this thesis. Most of these systems focus on video analysis techniques without integrating them into a functional video system. These techniques are interesting but often irrelevant in practice. Furthermore, many systems represent video either by low-level global features, such as global motion, or by key-images.
Some video representation systems implement flipping of video content by skipping some images based on low-level features (e.g., using color-based key-image extraction) or by extracting global features. However, this may miss important data. Video flipping based on object activities or related events, combined with low-level features such as shape, allows a more focused yet flexible video representation for retrieval or surveillance.

Few video representation systems are based on objects; most of these use only low-level features such as motion or shape to represent video. In addition, many object-based video representations focus on narrow domains (e.g., soccer games or traffic monitoring). Furthermore, some assume a simple environment, for example, without object occlusion, noise, or artifacts. Moreover, systems that address object-based video representations use a quantitative description of the video data and objects. Users of advanced video applications such as retrieval do not know exactly what the video they are searching for looks like. They do not have (do not memorize) exact quantitative information about motion, shape, or texture. Therefore, user-friendly qualitative video representations for retrieval or surveillance are essential. In video retrieval, most existing tools ask the user to sketch the video shot he or she is looking for, to select an example shot (e.g., after browsing the database), or to specify quantitative features of the shot. Browsing in large databases can, however, be time-consuming, and sketching is a difficult task, especially for complex scenes (i.e., real-world scenes). Providing users with means to describe a video by qualitative descriptors is essential for the success of such applications.

There are few representation schemes concerning events occurring in video shots. Much of the work on event detection and classification focuses on how to express events using artificial intelligence techniques, for instance, reasoning and inference methods. In addition, most high-level video representation techniques are context-dependent. They focus on the constraints of a narrow application and therefore lack generality and flexibility.

Despite the large improvement in the quality of modern acquisition systems, noise is still a problem that complicates video processing algorithms. In addition, various coding artifacts are introduced in digitally transmitted video, for example, video coded using the MPEG-2 standard. Therefore, noise and artifact reduction is still an important task and should be addressed. While noise reduction has been the subject of many publications (where few methods deal with real-time constraints), the impact of coding artifacts on the performance of video processing is not sufficiently studied.

Due to progress in micro-electronics, it is possible to include sophisticated video processing techniques in video services and devices. Still, the real-time aspect of new techniques is crucial for their wide application. Many video applications that need high-level video content representation occur in real-time environments, so that real-time performance is a critical requirement. Few content-based representation approaches take this constraint into account.
1.3 Proposed approach and methodology

The objective of this thesis is to develop a modular, automatic, low-complexity functional system for content-based video representation with integrated automated object and event extraction, without user interaction. The goal is to provide a stable representation of video content rich in terms of generic semantic features and moving objects. Objects are represented using quantitative and qualitative low-level features. The emphasis is on stable moving objects rather than on the accuracy of their boundaries. Generic semantic meaning is represented using events and other high-level object motion features, such as trajectory. The system should provide stable video representation for a broad range of practical video applications in indoor and outdoor real environments of different contexts. The proposed end-to-end system is oriented to three requirements:

1. flexible object representations that can easily be searched for video summarizing, indexing, and manipulation,
2. reliable, stable processing of video that foregoes the need for precision, and
3. low computational cost.

This thesis contributes algorithms that answer these three issues for the realization of content-based and consumer-oriented video applications such as surveillance and video database retrieval. It focuses on practical issues of video analysis oriented to the needs of object- and event-oriented video systems, i.e., it focuses on the so-called "original core" of the problem as defined in [141]. The proposed processing and representation target video of real environments such as those with object occlusions, illumination changes, noise, or artifacts.

To achieve these requirements, the proposed system involves three processing modules (Fig. 1.1): enhancement, analysis, and interpretation. The input to the video enhancement module is the original video and its output is an enhanced version of it. This enhanced video is then processed by the video analysis module, which outputs low-level descriptions of the enhanced video. The video interpretation module takes these low-level descriptions and produces high-level descriptions of the original video.

Figure 1.1: Abstract diagram of the proposed system (video shot → video enhancement → object-oriented video analysis → event- and object-oriented video interpretation → event and object descriptors). σn is the estimated standard deviation of the input image noise.
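To make the data flow of Fig. 1.1 concrete, the following is a minimal sketch in Python of how the three modules could be chained for one shot; the function signatures are assumptions for illustration, not the interfaces of the actual implementation.

    from typing import Any, Callable, Iterable

    # Data-flow skeleton of the three processing levels of Fig. 1.1 (a sketch only:
    # the module interfaces are assumed here; the real modules are those of Chapters 2-7).
    def process_shot(frames: Iterable[Any],
                     enhance: Callable[[Any], tuple],
                     analyze: Callable[[list, list], list],
                     interpret: Callable[[list], list]) -> list:
        """Chain video enhancement -> video analysis -> video interpretation.
        `enhance` is assumed to return (enhanced_frame, sigma_n); the noise estimate
        sigma_n is passed on because it also supports the later analysis steps."""
        enhanced, sigmas = [], []
        for f in frames:
            g, sigma_n = enhance(f)
            enhanced.append(g)
            sigmas.append(sigma_n)
        video_objects = analyze(enhanced, sigmas)   # pixels -> objects -> video objects
        return interpret(video_objects)             # video objects -> events and descriptors

In such a layered design each callable can remain simple, and information such as σn flows forward to support the next level, as described in the text.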
In a video retrieval system, users can query a video by qualitative description, using information such as object features (e.g., shape), spatial relationships (e.g., object i is close to object j), location (e.g., object i is at the bottom of the image), and semantic or high-level features (e.g., object action: object i moves left and then is occluded; event: removal or deposit of objects, or object j stops and changes direction). The retrieval system can then find the video frames whose contents best match the qualitative description. An advantage of such a representation strategy is that it allows the construction of user-friendly queries, based on the observation that the interpretation of most people is often imprecise. When viewing a video, they mainly memorize objects and related semantic features: for example, who is in the scene, what is he or she doing, and where does the action take place? People do not usually memorize quantitative object features [72, 56]. In the absence of a specific application, such a generic model allows scalability (e.g., by introducing new definitions of object actions or events).
The proposed system is designed to balance demands for effectiveness (solution quality) and efficiency (computational cost). Without real-time consideration, a content-based video representation approach could lose its applicability. Furthermore, robustness to image noise and coding artifacts is important for successful use of the proposed solution. These goals are achieved by adaptation to noise and artifacts, by detection and correction or compensation of estimation errors at the various processing levels, and by dividing the processing system into simple but effective tasks so that complex operations are avoided. In Fig. 1.2, a block diagram of the proposed system is displayed, where contributions are underlaid with gray boxes, module interactions are marked by dashed arrowed lines, R(n) represents the background image of the video shot, and σ_n is the noise standard deviation. The system modules are:
Figure 1.2: The proposed framework for object- and event-based video representation: video enhancement (noise estimation and reduction, image stabilization by global-motion compensation and background update R(n), global feature extraction), object-oriented video analysis (motion-based object segmentation, object-based motion estimation, voting-based object tracking, spatio-temporal object descriptors and global shot descriptors), and event- and object-oriented video interpretation (event detection and classification), whose results and requests serve an object- and event-based application, e.g., event-based decision-making.
• The video enhancement module is based on new methods to estimate and to reduce the image noise in order to facilitate subsequent processing.
• Image stabilization is the process of removing unwanted image changes. There are global changes due to camera motion or jitter [53] and local changes due to unwanted object motion (e.g., moving background objects). Image stabilization facilitates object-oriented video analysis by removing irrelevant changes. It can be performed by global motion compensation or by object update techniques. Global motion can be the result of camera motion or illumination change. The latter can produce apparent motion.
Robust estimation techniques aim at estimating accurate motion from an image sequence. Basic camera motions are pan (right/left motion), zoom (focal length change), and tilt (up/down motion). Different parametric motion models can be used to estimate global motion [54, 25, 68, 144]. In practice, as a compromise between complexity and flexibility, 2-D affine motion models are used [54]. Global motion compensation stabilizes the image content by removing camera motion while preserving object motion. Several studies show the effectiveness of using global motion compensation in the context of motion-based segmentation [85, 6, 144, 54, 25, 68]. Also, background update is needed in object segmentation that uses image differencing based on a background image (cf. Section 4.3). In such object segmentation, the background image needs to be updated, for example, when background objects move or when objects are added to or subtracted from the background image. Various studies have addressed background update and shown its usefulness for segmentation [35, 70, 65, 36, 50].
• The video analysis module extracts video objects and their low-level quantitative features. The method consists of four steps: motion-detection-based object segmentation, object-based motion estimation, region merging, and object tracking based on a non-linear combination of spatio-temporal features. The object segmentation classifies pixels of the video images into objects based on motion and contour features. To focus on meaningful objects, the proposed object segmentation uses a background image, which can be extracted using a background update method. The motion estimation determines the magnitude and direction of the motion, both translational and non-translational, of the extracted objects. The tracking method tracks and links objects as they move and registers their temporal features. It transforms the segmented image objects of the object segmentation module into video-wide objects. The main issue in tracking systems is reliability in the case of occlusion and object segmentation errors. The proposed method focuses on solutions to these problems. Representations of object and global video features to be used in a low-level content-based video representation are the output of the video analysis method.
• The video interpretation module extracts semantic-related and qualitative video features. This is done by combining low-level features and high-level video data. Semantic content is detected by integrating analysis and interpretation of video content. Semantic content is represented by generic events, independently of the context of an application. To identify events, a qualitative description of the object motion is an important step towards linking low-level features to high-level feature retrieval. For this purpose, the motion behavior and low-level features of video objects are analyzed to represent important events and actions. The results of this processing step are qualitative descriptions of object features and high-level descriptions of video content based on events.
Within the proposed framework, results of one step are integrated to support subsequent steps, which in turn correct or support previous steps. For example, object tracking is supported by low-level segmentation. Results of the tracking are in turn integrated into segmentation to support it.
This approach is analogous to the way the HVS finds objects, where partial detection and recognition introduce a new context which in turn supports further recognition ([103], Fig. 3.2, [3]). The robustness of the proposed methods will be demonstrated by extensive experimentation on commonly referenced video sequences. Robustness is the result of adaptation to video noise and artifacts and of processing that accounts for errors at one step by correction or compensation at subsequent steps where higher-level information is available. The proposed system provides a response in real-time for surveillance applications with a rate of up to 10 frames per second on a multitasking SUN UltraSPARC 360 MHz without specialized hardware.
1.4 Contributions
Because of the extensive growth in technical publications in the field of video processing, it is difficult to possess a comprehensive overview of the field and of published methods. The following list states which parts of this thesis are, to the knowledge of the author, original.
• A new approach to estimate white noise in an image is proposed. The novelty here is twofold: first, a new homogeneity measure to detect intensity-homogeneous blocks; second, a new way to automate the averaging of the estimated noise variances of the various blocks.
• A new enhanced filter for computationally efficient spatial noise reduction in video signals is proposed. The filter is based on the concept of implicitly finding homogeneous image structure to adapt the filtering. The novelty is twofold: effective detection of image structure to adapt the filter [74]; effective adaptation of the filter parameters, such as window size and weights, to the estimated noise variance.
• A new object segmentation method based on motion and contour data using
  ◦ a memory-based motion detection method,
  ◦ a fast noise-adaptive thresholding method for motion detection,
  ◦ a set of new morphological binary operations, and
  ◦ a robust contour tracing technique.
• A new efficient object-based motion estimation designed for video applications such as video surveillance. The method aims at a meaningful representation of object motion towards high-level interpretation of video content. The main contribution is the approximate estimation of object scaling and acceleration.
• A new object tracking method that solves the correspondence problem effectively for a broad class of image sequences and controls the quality of the segmentation and motion estimation techniques:
  ◦ Voting system: the correspondence or object matching problem is solved based on a voting system by combining feature descriptions non-linearly.
  ◦ Multiple objects: the new tracking contributes a solution to tracking objects in the case of multi-object occlusion. The method can simultaneously track various objects, as soon as they enter the field of the camera.
  ◦ Error detection and correction: the proposed tracking process is fault-tolerant: it takes into account possible errors from the object segmentation methods and compensates for their effects.
  ◦ Region merging: the proposed tracking process contributes a reliable region merging technique based on geometrical relationships, temporal coherence, and matching of objects rather than on single local features.
• A new context-independent video interpretation technique which provides a high-level video representation rich in terms of generic events and qualitative object features.
This representation is well chosen because it represents a good compromise between containing too many special operators and being too small a set of generic operators. The interpretation technique consists of
  ◦ an object- and motion-based event detection and classification method,
  ◦ a key-image extraction method based on events, and
  ◦ qualitative descriptions of video objects, their features, and their related generic events.
In addition, this thesis addresses the targeted applications of the proposed system and contributes definitions of frameworks for advanced event-based video surveillance and retrieval.
1.5 Thesis outline
This thesis demonstrates the performance of the three proposed video processing and representation levels in the respective sections of each chapter (video enhancement in Chapter 2, video analysis in Chapters 3-6, and video interpretation in Chapter 7). Pertinent literature and specific applications of the proposed methods are also reviewed in the respective sections. Furthermore, each proposed algorithm is summarized in its respective section.
• Chapter 2 first classifies noise and artifacts in video into various categories and then presents a novel method for noise estimation and a novel spatial noise reduction method.
• In Chapter 3, after reviewing related techniques, the proposed approach for video analysis is described. Possible implementations of an image stabilization algorithm are then discussed, followed by representations of object and global video features to be used in a low-level content-based video representation.
• The steps of the video analysis are presented in detail in Chapter 4 (object segmentation), Chapter 5 (object-based motion estimation), and Chapter 6 (object tracking).
• In Chapter 7, qualitative descriptions of low-level object features and of object relationships are first derived. Then automatic methods for high-level content interpretation based on motion and events are proposed.
• At the end of the thesis, Chapter 8 reviews the background, goal, and achievements of the thesis. It furthermore summarizes key results and mentions possible extensions.
• In Appendix A, Section A.1 defines an object- and event-based surveillance system and Section A.2 designs a content-based retrieval system. Section A.3 describes relations to MPEG-7 activities.
• A detailed description of the test sequences used is given in Appendix B.
Chapter 2
Video Enhancement
2.1 Motivation
Image enhancement is a fundamental task in various imaging systems such as cameras, broadcast systems, TV and HDTV receivers, and other multimedia systems [120]. Many enhancement methods have been proposed, ranging from sharpness improvement to more complex operations such as temporal image interpolation (for a broad overview of video enhancement methods see [120]). Each of these methods is important with respect to the imaging system it is used for. For example, noise reduction is usually used in various imaging systems as an enhancement technique. Noise reduction is often a preprocessing step in a video system, and it is important that its computational cost stays low while its performance is reliable. For example, preserving image content such as edges, textures, or moving areas, to which the HVS is sensitive, is an important performance feature. Because of the significant improvement in the quality of modern analogue video acquisition and receiver systems, studies show that TV viewers have become more critical even of low noise levels [97, 107].
In digital cameras, the image noise may increase because of the higher sensitivity and the longer exposure times of the new CCD cameras [88]. Noise reduction is, therefore, still a fundamental and important task in image and video processing. It is an attractive feature, especially under sub-optimal reception conditions. This calls for effective noise reduction techniques. The real-time aspect of new techniques will be a very attractive property for consumer devices such as digital cameras and TV receivers. The focus of a noise reduction technique is not to remove noise completely (which is difficult or impossible to achieve) but to reduce the influence of noise until it becomes almost imperceptible to the HVS.
Noise occurs both in analogue and in digital devices, such as cameras. Noise is always present in an image; when not visible, it is only masked by the HVS. Its visibility can increase with the camera sensor sensitivity or under low lighting conditions, and especially in images taken at night. Noise can be introduced into an image in many ways (Section 2.2) and can significantly affect the quality of image and video analysis algorithms.
Noise reduction techniques attempt to recover an underlying true image from a degraded (noisy) copy. Accordingly, in a noise reduction process, assumptions are made about the actual structure of the true image. The reduction of noise can be performed by both linear and nonlinear operators (Fig. 2.1) which use correlations within and between images. Spatio-temporal noise reduction algorithms have also been devised to exploit both temporal and spatial signal correlations (see Fig. 2.1 and [13, 47, 46, 74]). Spatial methods, which are computationally less costly than temporal methods, are widely used in various video applications. Temporal methods, which are more demanding computationally and require more memory, are mainly used in TV receivers. Temporal noise reduction algorithms that use motion have the disadvantage that little, if any, noise reduction is performed in strongly moving areas. To compensate for this drawback, spatial noise filters can be used [74].
Several noise reduction methods make, implicitly or explicitly, assumptions regarding image formation and image noise which are associated with a particular algorithm and thus usually perform best for a particular class of images. For example, some methods assume low noise levels or low image variations and fail in the presence of high noise levels or textured image segments.
Figure 2.1: A classification of noise reduction methods into temporal, spatial, and hybrid approaches, with static and adaptive variants (e.g., linear, median, and morphological filters; cascade-connected and combined filters; motion-, motion-vector-, and content-adaptive filters).
Noise reduction techniques make various assumptions, depending on the type of images and the goals of the reduction. Considering that a noise reduction method is a preprocessing step, an effective practical noise reduction method should take the following observations into account (cf. also [87]):
• it should incorporate few assumptions about the distribution of the corrupting noise. For example, in the case of iterative noise reduction, the characteristics of the noise might be modified and any assumption can lose its significance,
• it should be parallel at the pixel level, meaning that the value of the processed pixel is computed in a small window centered on it.
Iterations of the same procedure extend, however, the "region of influence" beyond the small window,
• it should preserve significant discontinuities. This can be done by adapting the algorithms to conform to image discontinuities and noise,
• it should take the behavior and reaction of the HVS into account. Each element of a visual field (an image), for example piece-wise constant gray-level areas or textures, has a different influence on human visual perception. The behavior of the HVS is not well understood. Nevertheless, the following list describes some of the known properties of the HVS that can be of interest in designing a noise reduction technique:
  ◦ the HVS adapts to environmental conditions,
  ◦ in high-frequency structures, the HVS is less sensitive to artifacts than in low-frequency structures,
  ◦ the HVS is very sensitive to processing errors at image discontinuities such as object contours,
  ◦ the HVS is not as sensitive to the diagonal orientation as to the horizontal and vertical orientations (oblique effect). Structures in real images are more horizontally and vertically oriented, i.e., the spectra of real image contents are mainly concentrated along the horizontal and vertical frequency axes, and
  ◦ motion of objects can mask some artifacts in an image sequence.
Using some of these observations, this thesis develops a noise estimation method (Section 2.4) to support a noise reduction algorithm, and then contributes a novel spatial noise reduction method (Section 2.5) that is fast and adaptive to image content. In [13], a temporal noise filter is proposed that adapts the reduction process to the high and low frequencies of the image content. The methods proposed in this thesis make no assumption on the underlying image model except that the image noise is white Gaussian noise, which is the most common in images. In the next section, a classification of noise and artifacts in video is given. The proposed noise estimation and reduction methods are described in the following sections.
2.2 Noise and artifacts in video signals
An image can be corrupted by noise and artifacts due to image acquisition, recording, processing, and transmission (Table 2.1). Other image artifacts can be due to reflections, blinking lights, shadows, or natural image clutter. Acquisition noise may be generated by signal pick-up in the camera or by film grain, especially under bad lighting conditions. Here, different types of noise are added due to the amplifiers and other physical effects in the camera. Furthermore, noise can be added to the signal by transmission over analogue channels, e.g., satellite or terrestrial broadcasting. Further noise is added by image recording devices; in these devices Gaussian noise or, in the case of tape drop-outs, impulse noise is added to the signal. Digital transmission inserts other distortions which may also have a noisy characteristic. 'Blocking' refers to block structures which become visible in an MPEG-2 image due to the block-based and motion-compensated MPEG coding scheme. These block structures appear in an image sequence as a 'Dirty Window'. The boundaries of blocking and dirty-window artifacts are small details which are located in the high frequencies. The quantization of the DCT coefficients in MPEG-2 coding causes overshoot on object contours, which is called 'Ringing'. The 'Mosquito effect' on high-frequency image parts is caused by different quantization results in successive images and by faulty block-based motion estimation at object contours.
As mentioned earlier, the HVS is very sensitive to abrupt changes and details which are located in the high frequencies, and artifact reduction is needed in modern video applications to reduce the effect of these artifacts.
Besides input artifacts in an image, intermediate artifacts and errors (e.g., false motion data or edges) and end-result errors (e.g., reduced reliability) in video and image analysis are unavoidable and can have a large impact on the final results of the analysis. Since corrupted image data may significantly deviate from the assumed image analysis model, the effectiveness of successive image analysis steps can be significantly reduced. This calls for modeling of these artifacts and errors. In many video applications, however, exact modeling of errors is not necessary.
Analogue channel, recording, film grain, and CCD-camera noise can be modeled as white Gaussian noise, which is usually of low amplitude and can be reduced by linear operations. High-amplitude noise such as impulse noise can be generated, for example, through satellite transmission; it requires different approaches, such as adaptive median operators. Because of the various kinds of noise and artifacts, it is important, in a TV receiver or in an imaging system such as surveillance, to perform a reduction of this noise and these artifacts. Attention has to be paid to using methods that do not introduce artifacts into the enhanced images. For example, non-adaptive low-pass filtering would reduce high-frequency noise and artifacts but may deteriorate object edges.
Artifact origin | Artifact type | Reduction method
Sampling: camera (CCD), film grain | white noise | temporal, spatial
Recording: video tape noise | white noise | temporal, spatial
Recording: tape drop-out | impulse noise | median
Disturbed image material: film damage | pattern noise | median
Disturbed image material: bit error | impulse noise | edge-based median
Analogue transmission: cable, terrestrial | white noise | temporal, spatial
Analogue transmission: satellite | FM-noise, pattern ('Fish') noise | temporal, spatial, median, edge-based
Digital coding (MPEG) | Blocking, Dirty Window, Ringing, Mosquito effects | spatial, object-based, temporal
Digital transmission | bit errors, block and image dropout | bit-error protection, object-based, error concealment
Processing artifacts | false motion data or edges | -
Table 2.1: Noise and artifacts in images.
2.3 Modeling of image noise
The noise signal can be modeled as a stochastic signal which is additive or multiplicative to the image signal. Furthermore, it can be modeled as signal-dependent or signal-independent. Quantization and CCD noise are modeled as additive and signal-independent. Image noise can have different spectral properties; it can be white or colored. Most commonly, the noise signal in images is assumed to be independent identically distributed (iid), additive, and stationary zero-mean noise (i.e., white Gaussian noise),
I(n) = S(n) + \eta(n)    (2.1)
where S(n) is the original (true) image signal at time instant n, I(n) is the observed noisy image signal at the same time instant, and \eta(n) is the noise signal. In practice, I(n) and S(n) are defined on an X × Y lattice, and each pixel I(i, j, n) (row i, column j) is an integer value between 0 and 255. The proposed noise estimation and noise reduction methods operate under the above assumptions. The main difficulty when estimating or reducing noise arises in images that contain fine structures or textures.
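As a small illustration of the degradation model in Eq. (2.1), the following Python sketch (not part of the thesis) adds white Gaussian noise of a given standard deviation to a "true" image; the clipping to the 8-bit range [0, 255] reproduces the saturation effect discussed later in Section 2.4.3.

```python
import numpy as np

def add_white_gaussian_noise(S, sigma_n, rng=None):
    """Degrade a true image S according to Eq. (2.1): I = S + eta, eta ~ N(0, sigma_n^2).

    Values are clipped back to [0, 255]; with large sigma_n this clipping makes
    the observed noise deviate slightly from zero mean (saturation effect).
    """
    rng = rng or np.random.default_rng()
    eta = rng.normal(0.0, sigma_n, size=S.shape)
    I = S.astype(np.float64) + eta
    return np.clip(I, 0, 255).astype(np.uint8)
```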
To evaluate the quality of a video enhancement technique, different numerical measures such as the Signal-to-Noise Ratio (SNR, defined in [108]) or the Mean Squared Error (MSE) can be used. These measures compare an enhanced (e.g., noise-reduced) image with an original image. The basic idea is to compute a single number that reflects the quality of the enhanced image. Enhanced images with higher SNR are assumed to have better subjective quality. SNR measures do not, however, necessarily reflect human subjective perception. Several research groups are working on perceptual measures, but no standard measures are known, and signal-to-noise measures are widely used because they reflect image improvements and are easier to compute. A better measure, less dependent on the input signal, is the Peak Signal-to-Noise Ratio (PSNR) as defined in Eq. (2.2). The PSNR is a standard criterion for objective noise measurement in video systems. Here, the image size is X × Y, and I_p(i, j, n) and I_r(i, j, n) denote the pixel amplitudes of the processed and reference image, respectively, at position (i, j):
\mathrm{PSNR} = 10 \cdot \log_{10} \frac{255^2}{\frac{1}{XY}\sum_{i=1}^{X}\sum_{j=1}^{Y}\bigl(I_p[i,j,n] - I_r[i,j,n]\bigr)^2}.    (2.2)
Typical PSNRs of TV video signals range between 20 and 40 dB. They are usually reported to two decimal places (e.g., 36.61 dB). A threshold of 0.5 dB PSNR can be used to decide whether a method delivers an image improvement that would be visible (the MPEG committee used this informal threshold; reasons not to rely on PSNR are described in [66]). The PSNR indicates the signal-to-noise improvement in dB, but unweighted with respect to visual perception. PSNR measurements should therefore be accompanied by subjective image comparisons. The use of either SNR or PSNR as a measure of image quality is certainly not ideal, since it generally does not correlate well with perceived image quality. Nevertheless, they are commonly used in the evaluation of filtering and compression techniques, and they do provide some measure of relative performance. The PSNR quality measure will be used throughout this thesis.
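For reference, a minimal Python implementation of the PSNR of Eq. (2.2) for 8-bit images could look as follows; the function name and interface are illustrative only.

```python
import numpy as np

def psnr(I_p, I_r):
    """Peak Signal-to-Noise Ratio (Eq. 2.2) between a processed image I_p and a reference I_r."""
    diff = I_p.astype(np.float64) - I_r.astype(np.float64)
    mse = np.mean(diff ** 2)              # (1/(XY)) * sum of squared differences
    if mse == 0:
        return float("inf")               # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)
```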
2.4 Noise estimation
2.4.1 Review of related work
The effectiveness of video processing methods can be significantly reduced in the presence of noise. For example, the performance of compression techniques can decrease due to noise in the image. Intensity variations due to noise may introduce motion estimation errors. Furthermore, the detection of high-frequency image content such as edges can be significantly disturbed. When information about the noise becomes available, processing can be adapted to the amount of noise to provide stable processing methods. For instance, edge detection [31], image segmentation [143, 112], motion estimation [91], and smoothing [13, 47, 87, 74] can be significantly improved when the noise variance can be estimated. In current TV receivers the noise is typically estimated in the black lines of the TV signal [47]. In other applications, the noise estimate is provided by the user, and only a few methods have been proposed for automated robust noise estimation.
Noise can be estimated within an image (intra-image estimation) or between two or more successive images (inter-image estimation) of an image sequence. Inter-image estimation techniques require more memory (to store one or more images) and are, in general, more computationally demanding [52]. Intra-image noise estimation methods can be classified as smoothing-based or block-based. In the smoothing-based methods, the image is first smoothed, for example using an averaging filter, and the difference between the noisy and the enhanced image is assumed to be the noise; the noise is then estimated at each pixel where the gradient is larger than a given threshold. These methods have difficulties in images with fine texture and they tend to overestimate the noise variance. In the block-based methods, the variance over a set of blocks of the image is calculated and the average of the smallest variances is taken as the estimate. Different implementations of block-based methods exist. In general, they tend to overestimate the noise variance in good-quality images and to underestimate it in highly noisy images. In some cases, no estimate is possible at all [98, 52]. The block-based method in its basic form is less complex and several times faster than the smoothing-based method [98, 52]. The main difficulty with the block-based methods is that their estimate may vary significantly depending on the input image and noise level. In [98], an evaluation of noise estimation methods is given. There, the averaging methods were found to perform well at high noise levels. No technique was found to perform best for all noise levels and input images. Some noise estimation methods determine the noise variance within the larger context of an image processing system. Such techniques are then adapted to the specific needs of the imaging system (e.g., in the context of coding [77], TV signal processing [47], and image segmentation [126]). Many noise estimation methods have difficulties estimating noise in highly noisy images and in highly textured images [98]. Such a lack of accuracy can be a problem for noise-adaptive image processing methods. Some methods use thresholds, for example, to decide whether an edge is present at a particular image position [98].
The purpose of this section is to introduce a fast noise estimation technique which gives reliable estimates in images with smooth and textured areas. This technique is a block-based method that takes image structure into account and uses a measure other than the variance to determine whether a block is homogeneous. It uses no thresholds and automates the way that block-based methods stop the averaging of block variances. The method selects intensity-homogeneous blocks in an image by rejecting blocks with line structure, using newly proposed masks to detect lines.
2.4.2 A homogeneity-oriented noise estimation
The method proposed in this section estimates the noise variance σ_n^2 from the variances of a set of regions classified as the most intensity-homogeneous regions in the image I(n), i.e., the regions showing the lowest variation in structure. The method uses a new homogeneity measure ξ_{B_h} to determine whether an image region has uniform intensities, where uniformity is equated to piece-wise constant gray-level pixels. This novel noise estimation operates on the input image data: 1) without any prior knowledge of the image or noise, 2) without context, i.e., it is designed to work for different image processing domains, and 3) without thresholds or user interaction. The only underlying assumption is that there exist neighborhoods in the image (usually chosen as a 2-dimensional (2-D) rectangular window or W × W block) with smooth intensities (i.e., with the proposed homogeneity measure ξ_{B_h} ≈ 0). This assumption is realistic since real-world images have well-defined regions of distinct properties, one of which is smoothness.
The proposed noise estimation operates as follows:
• Detection of intensity-homogeneous blocks: the pixels in an intensity-homogeneous block B_h = {I(i,j)}_{(i,j)∈W_{ij}} are assumed to be independent identically distributed (iid) but not necessarily zero-mean. W_{ij} denotes the rectangular window of size W × W. These uniform samples {I(i,j)} of the image have variance σ_{B_h}^2, which is assumed to represent the variance of the noise. The signal in a homogeneous block is approximately constant, and the variation is due to noise. With the iid property, their empirical mean and variance are defined as
\mu_h = \frac{1}{W \times W} \sum_{(i,j)\in W_{ij}} I(i,j), \qquad \sigma_h^2 = \frac{1}{W \times W} \sum_{(i,j)\in W_{ij}} \bigl(I(i,j) - \mu_h\bigr)^2.    (2.3)
With l = W × W and by the law of large numbers,
\lim_{l\to\infty} \sigma_{B_h}^2 = \sigma_n^2.    (2.4)
• Averaging: to estimate the global image noise variance σ_n^2, the local variances of the m most homogeneous blocks {B_h} are averaged, σ_n^2 = \mu_{\sigma_{B_h}^2} = \frac{1}{m}\sum_{h=1}^{m} \sigma_{B_h}^2. Since the noise is assumed to be stationary, the average of the variances of the m most homogeneous regions can be taken as representative of the noise in the whole image. To achieve faster noise variance estimation, ξ_{B_h} is calculated for a subset of the image pixels by skipping every s-th pixel of an image row. Simulations were carried out using different skipping steps; they show that a good compromise between efficiency (computational cost) and effectiveness (solution quality) is obtained with s = 5.
• Adaptive averaging: since the most homogeneous blocks could show strongly variable homogeneities and hence highly variable variances, only blocks which show similar homogeneities, and hence variances σ_{B_h}^2 similar to a reference representative variance σ_{B_r}^2, are included in the averaging process. This stabilizes the averaging and adapts the number of blocks to the structure of the image. Therefore, no threshold is needed to stop the averaging process. To decide whether the reference variance and a current variance are similar, a threshold t_σ is used, i.e., σ_{B_h}^2 is similar to σ_{B_r}^2 if |σ_{B_r}^2 − σ_{B_h}^2| < t_σ. This threshold t_σ is relatively easy to define and does not depend on the input image content. It can be seen as the maximal affordable difference (i.e., error) between the true variance and the estimated variance. For example, in noise reduction in TV receivers a t_σ between 3 and 5 is common [13, 46]. In the simulations of this study, t_σ is set to 3.
Detection of homogeneous blocks
The image is first divided into blocks {B_h} of size W × W. In each block B_h a homogeneity measure ξ_{B_h} is then computed using a local image analyzer based on high-pass operators that measure homogeneity in eight different directions, as shown in Fig. 2.2; special masks for corners are also considered, which stabilize the homogeneity estimation. In this local uniformity analyzer, high-pass operators with coefficients {−1, −1, ..., (W−1), ..., −1, −1} (e.g., for W = 3 the coefficients are {−1, 2, −1}) are applied along all directions for each pixel of the image. If the image intensities are uniform in one direction, the result of the high-pass operator along that direction is close to 0. To calculate the homogeneity measure over all eight directions, the eight quantities are added, and this sum provides the measure ξ_{B_h} of homogeneity. In this thesis, these masks are also used to adapt the spatial noise reduction, as will be discussed in Section 2.5 (see also [74]). The operation to determine the homogeneity measure can be expressed as a second derivative of the image function I.
The following example illustrates this in the horizontal direction:
I_o(i) = -I(i-\Delta i) + 2\,I(i) - I(i+\Delta i) = -I'(i) + I'(i-\Delta i) = -\bigl(I'(i) - I'(i-\Delta i)\bigr),    (2.5)
where I'(i) = I(i+\Delta i) - I(i) denotes the first-order difference. Therefore, I_o(i) is a second-order finite-difference operator which acts as a high-pass operator. Note that the detection of homogeneity is done along edges and never across edges. Various simulations (Section 2.4.3) show that this proposed homogeneity measure performs better than using the variance to decide whether a block has uniform intensities. A variance-based homogeneity measure fails in the presence of fine structures and textures (see Figs. 2.6(c) and 2.6(i)).
Figure 2.2: Directions of the local intensity homogeneity analyzer: eight directional masks (masks 1-8) around the current pixel.
Defining a reference variance σ_{B_r}^2
To stabilize the averaging process, the reference variance is chosen as the median of the variances of the three most homogeneous blocks (i.e., the blocks with the smallest sum). The first three values are taken because they are the most representative of the noise variance, since they are calculated from the three most homogeneous blocks. Higher-order median operators can also be used. Instead of the median, the mean can be used to reduce computation. Simulations show that better estimation is achieved using the 3-tap (i.e., order-3) median operator. In some cases, the difference between the first three variances can be large, and a median filter would still result in a good estimate of the true reference variance. Further investigation could determine the best order of the median filter, or examine whether there are cases where the mean operator would give better results.
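The following Python sketch (an illustration, not the thesis implementation) summarizes the estimator described above under simplifying assumptions: only four of the eight directional masks of Fig. 2.2 are used, the corner masks and the pixel skipping (s = 5) are omitted, and the stop rule of the adaptive averaging follows one plausible reading of the description, with W = 5 and t_σ = 3 as defaults.

```python
import numpy as np

def estimate_noise_variance(I, W=5, t_sigma=3.0):
    """Homogeneity-oriented white-noise variance estimate (simplified sketch of Section 2.4.2)."""
    I = I.astype(np.float64)
    H, Wd = I.shape
    blocks = []
    for r in range(0, H - W + 1, W):
        for c in range(0, Wd - W + 1, W):
            B = I[r:r + W, c:c + W]
            # second-difference (high-pass) responses along four directions
            hp  = np.abs(-B[:, :-2] + 2 * B[:, 1:-1] - B[:, 2:]).sum()        # horizontal
            hp += np.abs(-B[:-2, :] + 2 * B[1:-1, :] - B[2:, :]).sum()        # vertical
            hp += np.abs(-B[:-2, :-2] + 2 * B[1:-1, 1:-1] - B[2:, 2:]).sum()  # diagonal \
            hp += np.abs(-B[:-2, 2:] + 2 * B[1:-1, 1:-1] - B[2:, :-2]).sum()  # diagonal /
            blocks.append((hp, B.var()))              # (homogeneity measure, block variance)
    blocks.sort(key=lambda t: t[0])                   # most homogeneous blocks first
    ref_var = np.median([v for _, v in blocks[:3]])   # reference variance (order-3 median)
    # adaptive averaging: keep blocks, in homogeneity order, whose variance stays
    # within t_sigma of the reference (one plausible reading of the stop rule)
    kept = []
    for _, v in blocks:
        if abs(v - ref_var) < t_sigma:
            kept.append(v)
        else:
            break
    return float(np.mean(kept)) if kept else float(ref_var)
```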
The average µEn and the standard deviation σEn of the estimation error are then computed from all the measures, as a function of the input noise3 as follows: PN PN (En (i) − µEn )2 i=1 En (i) µEn = ; σEn = i=1 (2.6) N N where N is the number of tested images and En is the estimation error for a particular noise variance σn on a single image. The reliability of a noise estimation method can be measured by the standard deviation σEn or the average of the estimation error µ En . Evaluation results are given in Table 2.2. As can be seen, the proposed method is reliable for both high and low input noise levels. In [98], a evaluation of noise estimation methods is given. When our results are compared to those of Table 1 in [98], the comparison suggests that the proposed method outperforms the block-variance-based method, which has been found in [98] to be a good compromise between efficiency and effectiveness. Moreover, the proposed method adapts thresholds to the image whereas the block-based method requires the specification of the percentage of the image area to be considered (which has been set to 10 in [98]). As noted in [98], the performance of the method can be improved by tuning this parameter value. As Table 2.2 reveals, the estimation errors of the proposed method remain reliable even in the worst-case when deviation is around 1.81. The method remains suitable for noise-based parameter adaptation such as in noise reduction or segmentation techniques. For example, in high-quality noise reduction techniques, the adjustment is done in an interval of 2-5 dB [13, 21, 47]. Fig. 2.3(a) reveals that the average estimation error using the proposed method is lower than that of the block-variance method for all input noise variances. Interestingly, the standard deviation of the estimation error using the proposed method is significantly less than that of the block-based method, as shown in Fig. 2.3(b). Recently, a new interesting averaging-based noise estimation method has been proposed [109]. The main difficulty with this new method is its heavy computational 3 Some studies use the averaged squared error instead of the average absolute error as a quality criterion. The variance of this error among different test images is, however, an important indicator for the stability of the estimation. 22 Noise estimation PSNR σn µ En σEn noiseless 0 1.85 1.73 50 45 40 35 30 25 20 0.80 1.43 2.55 4.53 8.06 14.33 25.50 1.99 1.78 1.32 1.04 0.90 1.45 2.55 1.81 1.40 1.16 0.71 1.40 1.37 1.27 average 1.61 1.36 Table 2.2: The average µEn and the standard deviation σEn of the estimation error as a function of the input noise σn (W = 5). 20 8 Proposed Block−based 18 7 16 6 14 5 12 µE σE PSNR 10 4 PSNR 8 3 6 2 4 1 2 Proposed Block−based 0 20 25 30 35 40 In−PSNR(dB) 45 50 (a) Average of the estimation error. 55 0 20 25 30 35 40 In−PSNR(dB) 45 50 55 (b) Std. deviation of the estimation error. Figure 2.3: Comparison of the block-based and the proposed method (W = 5). cost even when using some optimization procedures. The success of this method seems to depend heavily on many parameters to fix, for example, on the number of process iterations, or the shape of the fade-out cosine function to evaluate the variance histogram (Eq. 9 and 10 in [109]). Furthermore, no information is given about the fluctuation of the estimation error En , i.e, about the σEn , which is an important criterion when evaluating noise estimation methods. 
We have carried out simulations to evaluate the proposed method using different window sizes W = 3, 5, 7, 9, 11. As shown in Fig. 2.4, using a window size of 3 × 3 results in better estimation in less noisy images (PSNR > 40 dB), whereas a window size of 5 × 5 gives better results in noisy images (the results in Fig. 2.4 are shown for a subset of the test images displayed in Fig. 2.6). This is reasonable since, in noisy images, larger samples are needed to calculate the noise accurately. The choice of the window size can be guided by image information if available. As a compromise between efficiency and effectiveness, a window size of 5 × 5 is used, which gives good results compared to other estimation methods. If a reduction in computational cost is required, the proposed noise estimation can be carried out only in the horizontal direction, i.e., along one line, for example using a 3 × 1 or 5 × 1 window.
Figure 2.4: The performance (µ_{E_PSNR}) of the proposed method using different window sizes (3 × 3 and 5 × 5) as a function of the input PSNR.
The effectiveness of the proposed estimation method is further confirmed when it is applied to motion images with various motions such as pan and zoom. These simulations show the stability of the algorithm throughout an image sequence. For example, the sequence 'Train' (Fig. 2.12(a)) was overlaid with 30 dB PSNR white noise; throughout the sequence the PSNR is estimated to be between 29.40 dB and 31.18 dB. These results show the stability of the method and its suitability for temporal video applications in which the adjustment of parameters is oriented to the amount of noise in the image (for example, [13, 47]).
Table 2.3 summarizes the performance (effectiveness and complexity) of the proposed, the block-based, and the average methods. As shown, both the average and the standard deviation of the estimation error of the proposed method are significantly better than those of the reference methods. The computational cost of the method was investigated in simulations using images of different sizes and noise levels. The results (Table 2.3) show that the proposed method (without using special optimization techniques) is four times faster than the block-based method, which was found in [98] to be the most computationally efficient among the tested noise estimation methods (the proposed method needs on average 0.02 seconds on a 'SUN-SPARC-5 360 MHz').
Method | average of µ_{E_n} | average of σ_{E_n} | T_c
Average method | 2.22 | 2.51 | 6× slower than block-based
Block-based | 4.45 | 3.25 | -
Proposed | 1.61 | 1.36 | 4× faster than block-based
Table 2.3: Effectiveness and complexity comparison between the proposed method and other methods presented in [98]. T_c is the computational time for one 512 × 512 image.
2.4.4 Summary
This thesis contributes a reliable real-time method for estimating the variance of white noise in an image. The method requires a 5 × 5 mask followed by averaging over blocks of similar variances. The proposed mask for homogeneity measurement is separable and can be implemented using simple FIR filters. The local image analyzer used is based on high-pass operators which allow the automatic implicit detection of image structure; it measures the high-frequency image components. In the case of noise, the directional filtering compensates for the noise along the different directions and stabilizes the selection of homogeneous blocks. The method performs well even in textured images (e.g., Figs. 2.6(i) and 2.6(f)) and in images with few smooth areas, like the Cosine2 image in Fig. 2.6(c).
As shown in Fig. 2.5, for a typical image quality with PSNR between 20 and 40 dB, the proposed method outperforms the other methods significantly, and the worst-case PSNR estimation error is approximately 3 dB, which is suitable for real video applications such as surveillance or TV signal broadcasting.
Figure 2.5: The performance (µ_{E_PSNR}, W = 5) of the proposed method in the typical PSNR range, compared to the block-based method.
The method has been applied to estimate white noise in uncompressed input images. The performance of the method on compressed images, for instance using MPEG-2, has to be studied further. The estimation of the noise after applying a smoothing filter is also an interesting point for further investigation.
Figure 2.6: Test images used for the noise estimation comparison: (a) Uniform, (b) Cosine1, (c) Cosine2, (d) Synthetic, (e) Portrait, (f) Baboon, (g) Aerial, (h) Field, (i) Trees.
2.5 Spatial noise reduction
2.5.1 Review of related work
The introduction of new imaging media such as 'Radio with Picture' or 'Telephone with Picture' makes real-time spatial noise reduction an important research issue. Studies show that with digital cameras image noise may increase because of the higher sensitivity and the longer exposure times of the new CCD cameras [88] (as the sensitivity of the camera sensor to light is increased, so is its sensitivity to noise). Spatial noise reduction is, therefore, an attractive feature in modern cameras, video recorders, and other imaging systems [88]. Real-time performance will be an attractive property for digital cameras, TV receivers, and other modern image receivers. This thesis develops a spatial noise reduction method with low complexity that is intended for real-time imaging applications.
Spatial noise reduction is usually a preprocessing step in a video analysis system (for a review of spatial noise reduction methods see [125, 122]). It is important that it preserves image structures such as edges and texture. Structure-preserving noise reduction methods (e.g., Gaussian and Sigma filtering) estimate the output pixel gray value I_o from a weighted average of neighboring pixel gray values as follows:
I_o(i,j) = \frac{\sum_{(l,m)\in W_{ij}} w_{\sigma_n}(l,m)\, I(l,m)}{\sum_{(l,m)\in W_{ij}} w_{\sigma_n}(l,m)},    (2.7)
where:
• σ_n is related to the image degradation; in the case of white noise reduction it is the standard deviation of the noise,
• W_{ij} denotes the neighborhood system, usually chosen as a 2-D rectangular window of size W × W containing the neighbors of the current pixel gray value I(i, j),
• w_{σ_n}(l, m) is the weighting factor, which ranges between 0 and 1 and acts as the probability that two neighboring pixels belong to the same type of image structure,
• I_o(i, j) represents the current noise-reduced pixel gray value, and
• I(l, m) represents a noisy input neighboring pixel gray value.
A large difference between neighboring pixel gray values implies an edge and, therefore, a low probability of belonging to the same structure. Such pixel values will then have a minor influence on the estimated values of their neighbors. With a small difference, meaning that the two pixels presumably belong to the same structure type, the opposite effect takes place. The weighting factor depends on the parameter σ_n, which quantifies the notions of "large" and "small".
Structure-preserving filtering can usually be applied iteratively to reduce noise further. The Gaussian filter [69] weights neighboring pixel values with a spatial Gaussian distribution as follows:
w_{\sigma_n}(l,m) = \exp\!\left\{-\frac{1}{4\sigma^2}\,\bigl[I(l,m) - I(i,j)\bigr]^2\right\}.    (2.8)
The aim here is to reduce small-scale structures (corresponding to high spatial frequencies) and noise without distorting large-scale structures (corresponding to lower spatial frequencies). Since the Gaussian mask is smooth, it is particularly good at separating high and low spatial frequencies without using information from a larger area of the image than necessary. An increase in noise reduction using linear filters such as the Gaussian filter corresponds, however, to an increase in image blurring, especially at fine details (see [122], Section 4). The Sigma filter [79] averages neighboring pixel values that have been found to have the same type of structure as follows:
w_{\sigma_n}(l,m) = \begin{cases} 1 & : \ (I(i,j) - 2\sigma_n) \le I(l,m) \le (I(i,j) + 2\sigma_n) \\ 0 & : \ \text{otherwise.} \end{cases}    (2.9)
Therefore, the Sigma filter averages only those neighboring pixels whose values lie within 2σ_n of the central pixel value, where σ_n is determined once for the whole image. This attempts to average a pixel only with those neighbors whose values are "close" to it, compared with the image noise standard deviation. Another well-known structure-preserving noise filter is the anisotropic diffusion filter [106]. This filter uses the local image gradient to perform anisotropic diffusion, where smoothing is prevented from crossing edges. Because pixels on both sides of an edge have a high gradient associated with them, thin lines and corners are degraded by this process. This method is computationally expensive and is usually not considered for real-time video applications.
Figure 2.7: Spatial noise filtering: (a) a symmetrical 3-tap FIR filter with coefficients (1−c)/2, c, (1−c)/2 applied to an input of noise variance σ_in^2; (b) noise reduction gain R in dB as a function of the central coefficient c.
In summary, current structure-preserving spatial noise filters are either computationally expensive, require some kind of manual user intervention, or still blur the image structure, which is critical when the noise filter is a preprocessing step in a multi-step video processing system where robustness along edges is needed.
2.5.2 Fast structure-preserving noise reduction method
In this section, a new technique for spatial noise reduction is proposed which uses a simple low-pass filter of low complexity to eliminate spatially uncorrelated noise from spatially correlated image content. Such a filter can be implemented, e.g., as a horizontal, vertical, or diagonal 3-tap FIR filter with central coefficient c (Fig. 2.7(a)). If the input noise variance of the filter is σ_n^2, the output noise variance σ_o^2 is given by Eq. (2.10):
\sigma_o^2 = c^2 \cdot \sigma_n^2 + 2\left(\frac{1-c}{2}\right)^2 \sigma_n^2.    (2.10)
The gain R (the ratio of the noise variances at the input and the output) of this filter can be computed by Eq. (2.11):
R\,[\mathrm{dB}] = 10 \cdot \log_{10}\frac{\sigma_n^2}{\sigma_o^2} = 10 \cdot \log_{10}\frac{2}{3c^2 - 2c + 1}.    (2.11)
This noise reduction gain clearly depends on the choice of the central coefficient c. This dependency is depicted in Fig. 2.7(b). For a cos²-shaped filter (i.e., c = 1/2), a noise reduction gain of R = 4.26 dB can be achieved. The maximum of R is achieved by a mean filter, i.e., c = 1/3, which is suitable for homogeneous regions.
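A short numerical check (not from the thesis) of Eqs. (2.10)-(2.11), assuming white input noise, reproduces the gain quoted above for c = 1/2 and shows the maximum at the mean filter c = 1/3:

```python
import numpy as np

def three_tap_gain_db(c):
    """Noise-reduction gain of the symmetric 3-tap filter [(1-c)/2, c, (1-c)/2] for white noise."""
    var_ratio = c**2 + 2 * ((1 - c) / 2) ** 2   # sigma_o^2 / sigma_n^2  (Eq. 2.10)
    return 10 * np.log10(1.0 / var_ratio)       # = 10 log10(2 / (3c^2 - 2c + 1))

print(round(three_tap_gain_db(0.5), 2))    # cos^2-shaped filter: 4.26 dB
print(round(three_tap_gain_db(1 / 3), 2))  # mean filter: 4.77 dB (maximum)
```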
In structured regions, the coefficient has to be selected adaptively so as not to blur edges and lines of the image. The spatial filter is applied only along object boundaries or in unstructured areas, but not across them. To achieve this adaptation, an image analysis step has to be applied to control the direction of the low-pass filter, as depicted in Fig. 2.8.
Figure 2.8: Proposed noise- and structure-adaptive spatial noise reduction: the input image I(n) feeds a structure analyzer (mask selection) and a noise estimation stage (σ_n, coefficient control), which together steer the adaptive spatial filter that outputs the quality-enhanced image I_o(n).
2.5.3 Adaptation to image content and noise
Adaptation to image content
Several algorithms for the effective detection of structure have been proposed [122, 125, 79, 82]. In [82], a comparison of the accuracy of many different edge detectors is given; in Section VI of that work it is shown that precise edge detection is computationally expensive and therefore not suitable for real-time video systems. Other, computationally less expensive algorithms either need some manual tuning (cf. [122, 79]) to adapt to different images or are not precise enough. In this thesis, a computationally efficient method for detecting edge directions is proposed (see also [74]). The basic idea is to use a set of high-pass filters to detect the most suitable direction among the eight directions defined in Fig. 2.2, and then to perform the noise reduction along this direction. To take structure at object corners into account, four additional corner masks are defined (Fig. 2.2) which preserve the sharpness of structure at corners. For each pixel of the image, high-pass filters with coefficients {−1, 2, −1} are first applied along all eight directions. Then the direction with the lowest absolute high-pass output is chosen, and the noise reduction is applied by weighted averaging along this direction. In this way, the averaging is adapted to the most homogeneous direction and image blurring is implicitly avoided.
Adaptation to image noise
For optimal noise reduction, the averaging process along an edge direction should be adapted to the amount of noise in the image. As shown in Fig. 2.9, the gain of the spatial noise reduction can be roughly doubled when adapting to the estimated noise. Especially in images with higher PSNR, the noise estimator stabilizes the performance of the spatial filter.
Figure 2.9: Noise reduction gain in dB averaged over three sequences (each of 60 images), with and without noise adaptation; the gain can be roughly doubled when adapting to the estimated noise.
Noise adaptation can be done by weighting the central pixel. Assume, for example, that the most homogeneous edge direction is the horizontal one; the weighted average is then
I_o(i,j) = \frac{I(i,j-1) + w(\sigma_n)\cdot I(i,j) + I(i,j+1)}{w(\sigma_n) + 2}.    (2.12)
This weighting should be low for highly noisy images and high (emphasis on the central pixel) for less noisy images. To keep the implementation cost low, the following adaptation is chosen:
w(\sigma_n) = a \cdot \sigma_n, \qquad a < 1.    (2.13)
Thus the spatial filter automatically adapts to the source noise level, which is estimated by the new method described in Section 2.4. This estimation algorithm measures the video noise and can be implemented in simple hardware.
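A compact Python sketch of the structure- and noise-adaptive filtering described above is given below. It is an illustration under simplifying assumptions, not the thesis implementation: only four of the eight directions of Fig. 2.2 are evaluated, the corner masks and window-size adaptation are omitted, the central weight follows Eq. (2.13) as printed with an assumed a = 0.1, and an explicit per-pixel loop is kept for clarity rather than speed.

```python
import numpy as np

def directional_denoise(I, sigma_n, a=0.1):
    """Structure- and noise-adaptive 3-tap spatial filtering (simplified sketch of Section 2.5)."""
    I = I.astype(np.float64)
    out = I.copy()
    w = a * sigma_n                                   # central-pixel weight, Eq. (2.13)
    # neighbour offsets for horizontal, vertical and the two diagonal directions
    dirs = [((0, -1), (0, 1)), ((-1, 0), (1, 0)), ((-1, -1), (1, 1)), ((-1, 1), (1, -1))]
    H, W = I.shape
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            best, pair = None, None
            for (di1, dj1), (di2, dj2) in dirs:
                p, q = I[i + di1, j + dj1], I[i + di2, j + dj2]
                hp = abs(-p + 2 * I[i, j] - q)        # {-1, 2, -1} high-pass output
                if best is None or hp < best:
                    best, pair = hp, (p, q)           # keep the most homogeneous direction
            p, q = pair
            out[i, j] = (p + w * I[i, j] + q) / (w + 2.0)   # weighted average, Eq. (2.12)
    return out
```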
Fig. 2.10(a) shows the effect of adapting the weight to the estimated noise: higher weighting achieves better noise reduction in less noisy images, and lower weighting achieves better noise reduction in noisier images. In addition, higher noise reduction can be achieved if the window size (in terms of the number of taps of the FIR filter [121]) is also adapted to the estimated amount of noise. In the case of highly noisy images, better noise reduction can be achieved if a larger window size is used; a larger window size means more averaging and a higher noise reduction gain. Fig. 2.10(b) illustrates this discussion and shows the effectiveness of the noise adaptation.
Figure 2.10: Comparison of the proposed method with different weights and windows: (a) average gain for different weights (higher weights are suitable for less noisy images); (b) average gain for different window sizes (a larger window size is needed in highly noisy images).
2.5.4 Results and conclusions
Quantitative evaluation
In simulations, the new spatial noise reduction achieves an average PSNR gain of 1-3.5 dB. The actual gain depends on the contents of the image. In structured areas a higher gain is achieved than in areas of complex structure. With complex structure, the spatial analysis filter may fail to detect edges; in such a case, the mask selection is not uncorrelated with the noise. This leads to lower achieved gains than predicted by the theory (Fig. 2.7(b)). Noise is, however, reduced even in unstructured images.
Noise adaptation to the estimated input PSNR achieves a higher noise reduction. This gain is especially notable in images with both high and low noise levels and in structured areas. This adaptation can achieve gains of up to 5 dB (Fig. 2.9).
An objective (quantitative) comparison between the Sigma filter with a window size of 3 × 3 and the proposed filter (Fig. 2.11) shows that higher PSNR can be achieved using the proposed spatial noise filter, especially in strongly noisy images. Further simulations show that with a 5 × 5 window the Sigma filter achieves a higher PSNR than with a 3 × 3 window; this, however, increases the image blur significantly in some image areas. This suggests that the parameters of the Sigma filter need to be tuned manually.
Figure 2.11: Noise reduction gain as a function of the input PSNR: proposed filter versus Sigma filter.
Subjective evaluation
To show the performance of the proposed spatial filter, critical images that include both fine structure and smooth areas are used (Fig. 2.12). The performance of the proposed method with and without noise adaptation is compared subjectively in Fig. 2.13 using the mean-square error (MSE). As can be seen in Fig. 2.13(a), significantly higher mean-square errors result without noise adaptation. These results emphasize the advantage of the proposed noise adaptation in the spatial filter. The proposed method has also been subjectively compared to the Sigma filter. As shown in Figures 2.14(a), 2.15(a), and 2.16(a), the Sigma filter blurs edges and produces granular artifacts both in smooth (Fig. 2.14(a)) and structured (Fig. 2.16(a)) areas, while the proposed filter reduces noise while protecting edges and structure (e.g., Fig. 2.16(b)).
The reason for this is that the structure-preserving component of the Sigma filter is global to the image, whereas the proposed filter adapts to local image structure.

Computational efficiency

In general, the Sigma filter requires more computations (Table 2.4) than the proposed method. In addition, the computational cost of the Sigma filter strongly depends on the size of the window used, while the cost of the proposed filter increases only slightly when a larger window is used. The algorithms were coded in C and no special efforts were devoted to accelerating their execution.

Algorithm                          average execution time
Noise estimation                   0.14
Proposed noise filter (3-tap)      0.22
Sigma filter (win. size 3 × 3)     0.47
Proposed noise filter (5-tap)      0.24
Sigma filter (win. size 5 × 5)     0.75
Table 2.4: Average computational cost in seconds on a 'SUN-SPARC-5 360 MHz' for a PAL image.

2.5.5 Summary

Current structure-preserving spatial noise filters require user intervention, and are either computationally expensive or blur the image structure. The proposed filter is suitable for real-time video applications (e.g., noise reduction in TV receivers or video surveillance). The proposed noise filter reduces image noise while preserving structure and retaining thin lines, without the need to model the image. For example, the filter reduces noise both in moving and non-moving areas, as well as in structured and unstructured ones. The proposed method first applies a local image analyzer along eight directions and then selects a suitable direction for filtering. The filtering process is adapted to the amount of noise by different weights and window sizes. Quantitative and qualitative simulations show that the proposed filtering method is more effective at reducing Gaussian white noise without structure degradation than the reference filters. Therefore, this method is well suited for video preprocessing, for example, for video analysis (Chapter 3) or temporal noise reduction [13].

Figure 2.12: Images for subjective evaluation. (a) Original image. (b) 25 dB Gaussian noisy image.
Figure 2.13: Subjectively, the proposed noise adaptation (b) produces lower mean-square errors than no noise adaptation (a). (a) MSE without noise adaptation (inverted). (b) MSE with noise adaptation (inverted).
Figure 2.14: The proposed noise filter gives subjectively better noise reduction. (a) Sigma noise filtering: note the granular effects. (b) Proposed noise filtering.
Figure 2.15: Performance in a smooth area: the proposed noise filter achieves higher noise reduction than the Sigma filter. (a) Sigma noise filtering (zoomed). (b) Proposed noise filtering (zoomed).
Figure 2.16: Performance in a structured area: the proposed noise filter protects edges and structure better than the Sigma filter. (a) Sigma noise filtering (zoomed). (b) Proposed noise filtering (zoomed).

Chapter 3
Object-Oriented Video Analysis

3.1 Introduction

Typically, a video is a set of stories, scenes, and shots (Fig. 3.1). Examples of a video are a movie, a news clip, or a traffic surveillance clip. In movies, each scene is semantically connected to the previous and following ones. In surveillance applications, a video does not necessarily have semantic flows.

Figure 3.1: Video units: video, stories, scenes, shots, and objects & meaning.

A video usually contains thousands of shots. To facilitate extraction and analysis of video content, a video has to be first segmented into shots ([93, 25]).
A shot is a (finite) sequence of images recorded contiguously (usually without viewpoint change) and represents a continuous, in time and space, action or event driven by moving objects (e.g., an intruder moving, an object stopping at a restricted site). There is little semantic change in the visual content of a shot, i.e., within a shot there is short-term temporal consistency. Two shots are separated by a cut, which is a transition at the image boundary between two successive shots. A cut can be thought of as an "edge" in time. A shot consists, in general, of multiple objects, their semantic interpretation (i.e., the objects' meaning), their dynamics (i.e., the objects' movement, activities, actions, or related events), and their syntax (i.e., the way objects are spatially and temporally related, e.g., 'a person is close to a vehicle')¹.

The demand for shot analysis becomes crucial as video is integrated in various applications. An object-oriented video analysis system aims at extracting objects and their features for an object-based video representation that is more searchable than pixel- or block-based representations. Such object-based representations allow advanced video processing and manipulation. Various research results show that high efficiency can be achieved by integrating extracted relevant objects and their features in video processing [26, 55, 45]. For example, video coding techniques, such as MPEG-2 and MPEG-4, use video analysis to extract motion and object data from the video to achieve better coding and representation quality. In content-based video representation, video analysis is an important first step towards high-level understanding of the video content. Real-time implementation and robustness demands make video analysis, however, an especially difficult task.

Two levels of video analysis can be distinguished: low-level and high-level. In low-level analysis, high-performance operators use low-level image features such as edges or motion. An example is video coding, where the goal is to achieve low bit-rates, and low-level features can support high-quality coding. High-level analysis is required to determine the perceptually significant features of the video content. For example, in video retrieval higher-level features of the object are needed for effective results.

The goal of this thesis is to develop a high-level modular video analysis system that extracts video objects robustly with respect to noise and artifacts, reliably with respect to the precision needed for surveillance and retrieval applications, and efficiently with regard to computational and memory cost. The focus is on automated fast analysis that foregoes precise extraction of image objects.

Figure 3.2: Diagram for human visual processing: a set of parallel processors (retinal or low-level vision followed by high-level vision; motion, shape, orientation, color, memory), each analyzing some particular aspect of the visual stimulus [3].

¹ In the remainder of this thesis, the term video refers to a video shot.

The structure of the proposed video analysis technique is modular, where the results of the analysis levels are combined to achieve the final analysis. Results of lower-level processing are integrated to support higher processing. Higher levels support lower levels through a memory-based feedback loop. This is similar to human visual perception as shown in Fig. 3.2, where visual data are analyzed and simplified to be integrated for higher-level analysis.
The HVS finds objects by partial detection; recognition introduces new context, which in turn supports further recognition.

3.2 Fundamental issues

Video analysis aims at describing the data in successive images of a video in terms of what is in the real scene, where it is located, when it occurred, and what its features are. It is a first step towards understanding the semantic contents of the video. Efficient video analysis remains a difficult task despite progress in the field. This difficulty originates in several issues that can complicate the design and evaluation of video analysis methods.

1. Generality. Much research has been concerned with the development of analysis systems that are of general application. Specific applications require, however, specific parameters to be fixed, and even the designers of general systems can have difficulty adapting the system parameters to a specific application. Therefore, it seems more appropriate to develop analysis methods that focus on a well-defined range of applications.

2. Interpretation. Object-oriented video analysis aims at extracting video objects and their spatio-temporal features from a video. To extract objects, technically the following are given: 1) a video is a finite set of images and each image consists of an array of pixels; 2) the aim of analysis is to give each pixel a label based on some properties; and 3) an object consists of a connected group of pixels that share the same label. The technical definition of an object may not be, however, the one that interpretation needs. For instance, does interpretation consider a vehicle with a driver one object or two objects? Is a person moving body parts one object or several? These questions indicate that there is no single object-oriented video analysis method that is valid for all applications. Analysis is subjective and can vary between observers, and the analysis of one observer can vary in time. This subjectivity cannot always be captured by a precise mathematical definition of an analysis concept that humans themselves cannot define uniquely. For some applications the use of heuristics is an unavoidable part of solution approaches [100, 37].

3. Feature selection and filtering. A key difficulty in selecting features to solve an analysis task, such as segmentation or matching, is to find useful features that stay robust throughout an image sequence. Main causes of inaccuracy are sensitivity to artifacts, object occlusion, object deformation, articulated and non-rigid objects, and view change. The choice of these features and their number varies across applications and within the various tasks of the video analysis. In some tasks a small number of features is sufficient; in other tasks a large number of features may be needed. A general rule is to select features that do not significantly change over a video and that can be combined to compensate for each other's weaknesses. For example, the most significant features can be noisy and thus difficult to analyze, requiring a filtering procedure to exclude them from the analysis.

4. Feature integration. Since features can be noisy, incomplete, and variant, the issue is to find ways to effectively combine these features for robustness. The most used methods for feature integration are linear. The HVS performs many vision tasks, however, in a non-linear way.
In high-level video analysis, HVS-oriented integration is needed. Another important issue is to choose an integration method that is task-specific. Chapters 4 and 6 consider two such integration strategies.

5. Trade-off: quality versus efficiency. Video analysis is further complicated by various image and object changes, such as noise, artifacts, clutter, illumination changes, and object occlusion. This aggravates the conflict between two requirements: precision and simplicity. In a surveillance application, for instance, the emphasis is on the robustness of the extracted features with respect to varying image and object conditions rather than on precise estimation. In object-based retrieval, on the other hand, obtaining pixel-precise objects is not necessary, but the representation of objects must have some meaning. Besides robustness, complexity has an impact on the design of analysis systems. Offline applications, such as object-based coding, tolerate analysis algorithms that need processing power and time. Other applications, such as surveillance, require real-time analysis. In general, however, the wide use of an analysis tool strongly depends on its computational efficiency [61].

3.3 Related work

Video analysis techniques can be classified into contour-, region-, and motion-based methods. An analysis based on a single simple feature (such as edges) cannot deal with complex object structures. Various features are therefore often combined to achieve useful object-oriented video analysis.

The MPEG-4 oriented analysis method proposed in [89] operates on binary edge images generated with the Canny operator and tracks objects using a generalized Hausdorff distance. It uses enhancements such as adaptive maintenance of a background image over time and improvement of boundary localization. In the video retrieval system VideoQ [33], an analysis based on the combination of color and edges is used. Region merging is performed using optical flow estimation. Optical flow methods are not applicable in the presence of large motion and object occlusion. Region merging produces regions that have similar motion. An object may consist, however, of regions that move differently. In such a case, an object may be divided into several regions, which complicates subsequent processing steps.

The retrieval and surveillance AVI system [38] uses motion detection information to extract objects based on a background image. Results show that the motion detection method used in AVI is sensitive to noise and artifacts. The system can operate in simple environments where one human is tracked and translational motion is assumed. It is limited to applications in indoor environments and cannot deal with occlusion.

In the retrieval system Netra-V [48], a semi-automatic object segmentation (the user has to specify a scale factor to connect region boundaries) is applied based on color, texture, and motion. This is not suitable for large collections of video data. Here, the shot is first divided into image groups; the first image of a group is spatially segmented into regions, followed by tracking based on a 2-D affine motion model. Motion estimation is done independently in each group of images. The difficulty of this image-group segmentation is to automatically estimate the number of images in a group. In [48], this number is manually fixed. This introduces several artifacts, for example, when the cut is made at images where important object activities are present.
Further, objects disappearing (respectively appearing) just before (respectively after) the first image of a group are not processed in that group. In [48], regions are merged based on coherent-motion criteria. A difficulty arises when different objects with the same motion are erroneously merged. This complicates subsequent analysis steps.

Recently, and within the framework of the video analysis model of the COST-211 Group, a new state-of-the-art object-oriented video analysis scheme, the COST-AM scheme, has been introduced [85, 6, 60]. The basic steps of its current implementation are camera-motion compensation, color segmentation, and motion detection. Subjective and objective evaluation suggests that this method produces good object-oriented analysis of input video (see [127]). Difficulties arise when the combination of features (here motion and color) fails and strong artifacts are introduced in the resulting object masks (Figs. 4.24–4.27). Moreover, the method produces outliers at object boundaries, where large areas of the background are estimated as belonging to objects. Large portions of the background are added to the object masks in some cases. In addition, the algorithm fails in some cases to produce temporally reliable results. For example, the method loses some objects and no object mask can be generated. This can be critical in tracking and event-based interpretation of video.

Most video analysis techniques are rarely tested in the presence of noise and other artifacts. Also, most of them are tested on a limited set of video shots.

3.4 Overview of the proposed approach

The proposed video analysis system is designed for both indoor and outdoor real environments, has a modular structure, and consists of: 1) motion-detection-based object segmentation, 2) object-based motion estimation, 3) feature representation, 4) region merging, and 5) voting-based object tracking (Fig. 3.3). The object segmentation module extracts objects based on motion data. In the motion estimation module, temporal features of the extracted objects are estimated. The feature representation module selects spatio-temporal object features for subsequent analysis steps. The tracking module combines spatial and temporal object features and tracks objects as they move through the video shot. Segmentation may produce objects that are split into sub-regions. Region merging intervenes to improve segmentation based on tracking results. This improvement is used to enhance the tracking. The proposed system performs region merging based on temporal coherence and matching of objects rather than on single or combined local features.

The core of the proposed system is the object segmentation step. Its robustness is crucial for robust analysis. Video analysis may produce incorrect results, and much research has been done to enhance given video analysis techniques. This thesis proposes to compensate for possible errors of low-level techniques by higher-level processing, because at higher levels more information is available that is quite useful and more reliable for the detection and correction of errors. In each module of the proposed system, complex operations are avoided and particular attention is paid to intermediate errors of analysis. The modules cooperate by exchanging estimates of video content, thereby enhancing the quality of results.

Figure 3.3: Video analysis: from pixels to video objects (motion-based object segmentation by change detection and thresholding; object labeling by morphological edge detection and contour analysis; object-based motion estimation and feature extraction/selection; region merging and voting-based object matching for multi-feature object tracking). R(n) is a background image, I(n) and I(n − 1) are successive images of the shot, D(n) and D(n − 1) are the difference images, and O(n) and O(n − 1) are the lists of objects and their features in I(n) and I(n − 1).
The proposed video analysis system results in a significant reduction of the large amount of video data: it transforms a video of hundreds of images into a list of objects described by low-level features (details in Section 3.5). For each extracted object, the system provides the following information between successive images: Identity number to uniquely identify the object throughout the shot, Age to denote its life span, minimum bounding box (MBB) to identify its borders, Area (initial, average, and present), Perimeter (initial, average, and present), Texture (initial, average, and present), Position (initial and present), Motion (initial, average, and present), and finally Corresponding object to indicate its corresponding object in the next image. This list of features can be extended or modified for other applications. When combined, these object features provide a powerful object description to be used in interpretation (Chapter 7); a sketch of such an object record is given below.
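The following C structure illustrates one possible layout of such a per-object record. Field names and types are assumptions of this sketch; the thesis does not prescribe a concrete data layout.

/* Illustrative object record for the per-object description listed above. */
typedef struct {
    int x0, y0, x1, y1;            /* top-left and bottom-right corners */
} MBB;

typedef struct VideoObject {
    int   id;                      /* identity number, unique within the shot */
    int   age;                     /* life span in images */
    MBB   bbox;                    /* minimum bounding box */
    float area_init,  area_avg,  area_cur;
    float perim_init, perim_avg, perim_cur;
    float tex_init,   tex_avg,   tex_cur;      /* texture measure, Section 3.5.2 */
    float pos_init_x, pos_init_y;              /* initial centroid */
    float pos_cur_x,  pos_cur_y;               /* present centroid */
    float mot_init_x, mot_init_y;              /* initial displacement */
    float mot_avg_x,  mot_avg_y;               /* average displacement */
    float mot_cur_x,  mot_cur_y;               /* present displacement */
    int   next_id;                 /* corresponding object in the next image */
} VideoObject;

A record of this size occupies on the order of a hundred bytes per object and image, which is consistent with the data-reduction figures given next.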
This data reduction has advantages: large video databases can be searched efficiently and memory requirements are significantly reduced. For instance, a video shot of one second at 10 frames per second containing three objects is represented by hundreds of bytes rather than megabytes. Extensive experiments (see Sections 4.7, 5.5, and 6.4) using 10 video shots containing a total of 6071 images illustrate the good performance of the proposed video analysis. Indoor and outdoor real environments including noise and coding artifacts are considered.

Algorithmic complexity of video analysis systems is an important issue even with the significant advancements in modern micro-electronic and computing devices. As the power of computing devices increases, larger problems will be addressed. Therefore, the need for faster running algorithms will remain of interest. Research oriented to real-time and robust video analysis is and will stay both important and practical. For example, the large size of video databases requires fast analysis systems and, in surveillance, images must be processed as they arrive. Computational costs² of the proposed analysis modules are given in Table 3.1.

Algorithm              min. cost    max. cost
Object segmentation    0.11         0.18
Motion estimation      0.0005       0.003
Tracking               0.001        0.2
Table 3.1: Computational cost in seconds of the analysis steps.

² Algorithms of this thesis are implemented in C and run on a multitasking SUN UltraSPARC 360 MHz without specialized hardware. Computational cost is measured in seconds and given per CIF (352×288) image. No attention was paid to optimizing the software.

As shown, the current non-optimized implementation of the proposed video analysis requires on average between 0.11 and 0.35 seconds to analyze the content of two images. This time includes noise estimation, change detection, edge detection, contour extraction, contour filling, feature extraction, motion estimation, object tracking, and region merging. In the presence of severe occlusion, the processing time of the proposed method increases. This is mainly due to the handling of occlusion, especially in the case of multiple occluded objects.

Typically, surveillance video is recorded at a rate of 3–15 frames per second. The proposed system provides a response in real-time for surveillance applications with a rate of up to 10 frames per second. To accelerate the system performance for higher frame-rate applications, optimization of the code is needed. Acceleration can be achieved, for example, by optimizing the implementation of occlusion handling and object separation, working with integer values (where appropriate), and using additions instead of multiplications. The current version (v.4x) of the state-of-the-art reference method, COST-AM [85, 6, 60], takes on average 175 seconds to segment the objects of an image. COST-AM includes color segmentation, global motion estimation, global motion compensation, scene cut detection, change detection, and motion estimation.

3.5 Feature selection

One of the fundamental challenges in video processing is to select a set of features appropriate to a broad class of applications. In video retrieval and surveillance, for example, it is important that features for the matching of objects exploit properties of the HVS as to the perception of similar objects. The objective in this section is to define local and global features that are suitable for real-time applications.

3.5.1 Selection criteria

Features can be divided into low-level and high-level features. Low-level features include texture, shape, size, contour, MBB, center of gravity, and object and camera motion. They are extracted using video analysis methods. High-level features include movement (i.e., trajectory), activity (i.e., a statistical sequence of movements such as 'pitching a ball'), action (i.e., the meaning of a movement related to the context of the movement, such as 'following a player' [22]), and events (i.e., a particular behavior of a finite set of objects, such as 'depositing an object'). They are extracted by interpretation of low-level features. High-level features provide a step toward semantic-based understanding of a video shot. Low-level features are generic and relatively easy to extract [81]. By themselves, they are not sufficient for video understanding [81]. High-level features may be independent (e.g., deposit) or dependent (e.g., 'following' or 'being picked up by a car') of the context of an application. They are difficult to extract but are an important basis for semantic video description and representation.

Low-level features can be further classified into global and local features. Local features, such as shape, are related to objects in an image or between two images. Global features, such as camera motion or dominant objects (e.g., an object related to an important event), refer to the entire video shot. It is now recognized that, in many domains, content-based video representation cannot be carried out using only local or only global features (cf. [124, 115, 8, 5, 142]).

A key problem in feature selection is stability throughout a shot. Reasons for feature instability are noise, occlusion, deformation, and articulated movements.
This thesis uses the following criteria to select stable feature descriptions:

Uniqueness: Features must be unique to the entities they describe (objects or video shots). Unique features do not change significantly over time.

Robustness: Since the image of an object undergoes various changes, such as gray-level and scale changes, it is important to select feature descriptions that are robust or invariant to relevant image and object transformations (e.g., translation, rotation, scale change, or overall lightness change). For example, area and perimeter are invariant to rotation, whereas ratios of sizes and some shape properties (see next section) are invariant to geometrical magnification (details in [111]). Object changes such as rotation can be easily modeled. There are changes, such as reflection or occlusion, that are, however, difficult to model, and a good analysis scheme should take these changes into account to ensure robust processing based on the selected feature representation.

Completeness: As discussed in the introduction of this thesis, there has been a debate over the usefulness of developing video analysis methods that are general and applicable to all video applications. The same arguments can be used when selecting features. Since the need for object features may differ between applications (a feature that is important for one application can be useless for another), a description of a video object or a video can only be complete when it is chosen based on a specific application.

Combination: Since features can be incomplete, noisy, or variant to transformation, the issue is to find ways to effectively combine these features to solve a broad range of problems. The most widely used technique is a weighted combination of the features. However, the HVS is non-linear. In Chapter 6, a new effective way to combine features based on a non-linear voting system is proposed.

Filtering: Selection of good features is still not a guarantee that matching applications will give the expected results. One reason is that even expressive features, such as texture, can get distorted and complicate processing. A second reason is that good features can be occluded. Such features should be detected and excluded from processing. Therefore, it is important to monitor and, if needed, temporally filter the features that are used in the analysis of an image sequence. For example, occluded features can be ignored (cf. Section 6.3.4).

Efficiency: Extraction of feature descriptions can be time consuming. In real-time applications such as video retrieval or surveillance, a fast response is expected and simple descriptions are required.

3.5.2 Feature descriptors

Since the human perception of object and video features is subjective, there is no single best description of a given feature. A feature description characterizes a given feature from a specific perspective of that feature. In the following paragraphs, low-level object and shot feature descriptions are proposed that are easy to compute and match. The proposed descriptors are simple, but combined in an efficient way (as will be shown in Chapter 6) they provide a robust tool for the matching of objects. In this section, models for feature representation are proposed that balance the requirements of being effective and efficient for a real-time application. In the following, let Oi represent an object of an image I(n).
Size: The size of an object is variant to some transformations such as scale change, but combined with other features such as shape it can compensate for errors in these features, for instance, when objects get smaller or when noise is present.

• Local size descriptors:
◦ area Ai, i.e., the number of pixels of Oi,
◦ perimeter Pi, i.e., the number of contour (border) points of Oi (both area and perimeter are invariant under translation and rotation),
◦ width Wi, i.e., the maximal horizontal extent of Oi, and
◦ height Hi, i.e., the maximal vertical extent of Oi (both width and height are invariant under translation).

• Global size descriptor: Two descriptors are proposed: 1) the initial and the last size (e.g., area) of the object across the shot and/or 2) the median of the object sizes across the video shot. The optimal selection depends on the application. This descriptor can be used to query shots based on objects. For example, the absolute value of the difference between the query size and the indexed size can be used to rank the objects in terms of similarity.

Shape: Shape is one of the most important characteristics of an object, and it is particularly difficult to describe both quantitatively and qualitatively. There is no generally accepted methodology of shape description, especially because it is not known which element of shape is important for the HVS. Shape descriptions can be based on boundaries or intensity variations of the object. Boundary-based features tend to fail under scale change or when noise is present, whereas region-based features are more stable in these cases. The use of shape in matching is difficult because estimated shapes are rarely exact due to algorithmic error and because few of the known shape feature measures are accurate predictors of human judgments of shape similarity. Shape cannot be characterized by a single measure. This thesis uses the following set of simple measures to describe object shape (see the sketch below):

• Minimum bounding box (MBB) BOi: the MBB of an object is the smallest rectangle that includes all pixels of Oi, i.e., Oi ⊂ BOi. It is parameterized by its top-left and bottom-right corners.
• Extent ratio: ei = Hi / Wi, the ratio of height to width.
• Compactness: ci = Ai / (Hi · Wi), the ratio of the object area Ai to the MBB area Hi · Wi.
• Irregularity (also called elongation or complexity): ri = Pi² / (4πAi). The perimeter Pi is squared to make the ratio independent of the object size. This ratio increases when the shape becomes irregular or when its boundaries become jerky. ri is invariant to translation, rotation, and scaling [111].
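The following C fragment is a minimal sketch of the shape descriptors ei, ci, and ri defined above, computed from the local size descriptors. The ObjectSize type is an assumption of this sketch, not a structure defined in the thesis.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct {
    double area;        /* Ai: number of pixels */
    double perimeter;   /* Pi: number of contour points */
    double width;       /* Wi: maximal horizontal extent */
    double height;      /* Hi: maximal vertical extent */
} ObjectSize;

double extent_ratio(const ObjectSize *o)   /* ei = Hi / Wi */
{
    return o->height / o->width;
}

double compactness(const ObjectSize *o)    /* ci = Ai / (Hi * Wi) */
{
    return o->area / (o->height * o->width);
}

double irregularity(const ObjectSize *o)   /* ri = Pi^2 / (4*pi*Ai) */
{
    return (o->perimeter * o->perimeter) / (4.0 * M_PI * o->area);
}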
Texture: Texture is a rich object feature that is widely used. The difficulty is how to find texture descriptors that are unique and expressive. Despite many research efforts, no unique definition of texture exists and no texture descriptors are widely applicable. The difficulty of using texture for tracking or similarity comes from the fact that it is blurred by strong motion or noise, and its expressive power thus becomes limited. In such cases, shape features are more reliable. This thesis uses the following simple texture measure, µt(Oi), shown in [39] to be as effective as the more computationally demanding co-occurrence matrices. The average gray-value difference, µg(p), for each pixel p ∈ Oi is defined as

µg(p) = (1 / Wd) · Σ_{l=1}^{L} |I(p) − I(q_{dl})|   (3.1)

and

µt(Oi) = (1 / Ai) · Σ_{l=1}^{Ai} µg(pl),   (3.2)

where {q_{d1} · · · q_{dL}} is a set of points neighboring p at a distance of d pixels and Ai is the area of Oi. The best choice of d depends on the coarseness of texture within the object, which can vary over the image of the object. In this thesis, d was fixed to one pixel, and the neighborhood size Wd was fixed to be the 4-neighborhood of p.

Spatial homogeneity of an object: A spatial homogeneity measure of an object Oi describes the connectivity of its pixels. In this thesis, the following simple measure is selected [58]:

H(Oi) = (Ai − AR) / Ai   (3.3)

where Ai is the area of the object Oi and AR is the total area of all holes (regions) inside Oi. A hole or region Ri is inside an object Oi if Ri is completely surrounded by Oi. H(Oi) = 1 when Oi contains no regions.

Center-of-gravity: Accurate estimation of the center of gravity (centroid) of an object is time consuming. The level of accuracy depends on the application. A simple estimate is the center of BOi. This estimate is quickly determined and suffers from gross errors only under certain conditions, such as gross segmentation errors.

Location: The spatial position of an image object can be represented by the coordinates of its centroid or by its MBB. The temporal position of an object or event can be given by specifying the start and end image.

Motion: Object motion is an important feature that the HVS uses to detect objects. Motion estimation or detection methods can relate motion information to objects.

• Object motion:
◦ The motion direction δ = (δx, δy) and displacement vector w = (wx, wy).
◦ Object trajectory, approximated by the motion of the estimated centroid of Oi. A trajectory is a set of tuples {(xn, yn) : n = 1 · · · N} where (xn, yn) is the estimated centroid of Oi in the nth image of the video shot. The trajectory is estimated via object matching (Chapter 5).
◦ Average absolute value of the object motion throughout a shot, µw = (µx, µy) with µx = (1/N) · Σ_{n=1}^{N} w_{xn} and µy = (1/N) · Σ_{n=1}^{N} w_{yn}. This feature is by analogy to HVS motion perception. The HVS integrates local object motion at different positions of an object into a coherent global interpretation. One form of this integration is vector averaging [4].

• Camera motion: Basic camera motions are pan, zoom, and tilt. In practice, as a compromise between complexity and flexibility, 2-D affine motion models are used to estimate camera motion (cf. Page 6 of Section 1.3).
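The C fragment below sketches the texture measure of Eqs. 3.1-3.2 and the spatial homogeneity measure of Eq. 3.3 defined above, with d = 1 and the 4-neighborhood of p as used in the thesis. The Pixel type, the object pixel list, and the hole-area argument are assumptions of this sketch.

#include <math.h>

typedef struct { int i, j; } Pixel;

static float clamped(const float *img, int w, int h, int i, int j)
{
    if (i < 0) i = 0; if (i >= h) i = h - 1;
    if (j < 0) j = 0; if (j >= w) j = w - 1;
    return img[i * w + j];
}

/* Eq. 3.1: average gray-value difference of pixel p to its 4-neighbors */
static float mu_g(const float *img, int w, int h, Pixel p)
{
    float c = clamped(img, w, h, p.i, p.j);
    float sum = fabsf(c - clamped(img, w, h, p.i - 1, p.j))
              + fabsf(c - clamped(img, w, h, p.i + 1, p.j))
              + fabsf(c - clamped(img, w, h, p.i, p.j - 1))
              + fabsf(c - clamped(img, w, h, p.i, p.j + 1));
    return sum / 4.0f;                    /* Wd = 4 */
}

/* Eq. 3.2: object texture = mean of mu_g over all object pixels */
float texture_measure(const float *img, int w, int h,
                      const Pixel *obj, int area)
{
    float sum = 0.0f;
    for (int l = 0; l < area; l++)
        sum += mu_g(img, w, h, obj[l]);
    return sum / (float)area;             /* Ai = area */
}

/* Eq. 3.3: spatial homogeneity; hole_area is the total area of holes in Oi */
float homogeneity(int area, int hole_area)
{
    return (float)(area - hole_area) / (float)area;
}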
Why not use color? Some video analysis systems rely heavily on color features when extracting video content. In general, luminance is a better detector of small details, and chrominance performs better in rendering coarser structures and areas. In this thesis, the analysis system does not rely on color for the following reasons. First, color processing requires high computation and memory, which can be critical for many applications. Second, color data may not be available, as is often the case in video surveillance, especially at night or under low-light conditions. Third, color is not useful when objects are small or when color variations are high. In video retrieval, the user is often asked to specify color features, and the use of color causes difficulties: first, the human memory of color is not very accurate, and absolute color is difficult to discriminate and describe quantitatively. Second, the perceived color of an object depends on the background, the illumination conditions, and monitor display settings. Third, the judgment of the perceived color is not the same as the color data represented in computers or in the HVS. Therefore, a user can request a perceived color different from the computer-recorded color.

3.6 Summary and outlook

3.6.1 Summary

This chapter introduces a method to extract meaningful video objects and their low-level features. The method consists of motion-detection-based object segmentation (Chapter 4), object-based motion estimation (Chapter 5), and object tracking (Chapter 6) based on a non-linear combination of spatio-temporal features. State-of-the-art studies show that object segmentation and tracking is a difficult task, particularly in the presence of artifacts and articulated objects. The methods of the proposed video analysis system are tested using critical sequences and under different environments using analog, encoded (e.g., MPEG-2), and noisy video (see Sections 4.7, 5.5, and 6.4). Evaluations of the algorithms show reliable and stable object and feature extraction throughout video shots. In the presence of segmentation errors, such as object merging or multi-object occlusion, the method accounts for these errors and gives reliable results. The proposed system performs region merging based on temporal coherency and matching of objects rather than on single local features.

The proposed method extracts meaningful video objects which can be used to index video based on flexible objects (Section 3.5) or to detect events related to objects for semantic-oriented video interpretation and representation. In this context, the usefulness of the proposed object and feature extraction will be shown in Chapter 7. To focus on meaningful objects, the proposed video analysis system needs a background image. A background image is available in surveillance applications. In other applications, a background update method has to be used which must adapt to different environments (cf. Page 6 of Section 1.3).

3.6.2 Outlook

In a multimedia network, video data is encoded (e.g., using MPEG-2) and either transmitted to a receiver (e.g., a TV) or stored in a video database. Effective coding, receiver-based post-processing, and retrieval of video data require video analysis techniques. For example, effective coding requires motion and object data [136]. In a receiver, motion information is needed for advanced video post-processing, such as image interpolation [120]. Effective retrieval of large video databases requires effective analysis of video content [129, 8]. In a multimedia network, several models of the use of video analysis are possible (Fig. 3.4): 1) different video analysis for the encoder and the receiver, 2) the same video analysis for both encoder and receiver, and 3) cooperative video analysis. For example, motion information extracted for coding can be used to support retrieval or post-processing techniques. Studies show that MPEG-2 motion vectors are not accurate enough for receiver-based video post-processing [9, 21]. These studies suggest that the use of the second model of Fig. 3.4 may not produce interesting results. In this chapter, we have proposed a video analysis system for the first model of Fig. 3.4. An interesting subject of research is the integration of extracted coding-related video content to support video analysis for a video retrieval or surveillance application (the third model of Fig. 3.4).
Figure 3.4: Models of video analysis in a multimedia network. Model 1: separate video analysis for encoding and processing. Model 2: the same video analysis for encoding and processing (e.g., use of MPEG vectors or segmented objects). Model 3: combined video analysis for coding and processing (e.g., MPEG-vector-based object segmentation).

Chapter 4
Object Segmentation

4.1 Motivation

Many advanced video applications require the extraction of objects and their features. Examples are object-based motion estimation, video coding, and video surveillance. Object segmentation is, therefore, an active field of research that has produced a large variety of segmentation methods [119, 129, 11, 128, 127]. Each method puts emphasis on different issues. Some methods are computationally expensive but give, in general, accurate results, and others have low computation but fail to provide reliable segmentation. Few of the methods are adequately tested, particularly on a large number of video shots, and evaluated throughout long shots. Furthermore, many methods work only if the parameters are fine-tuned for the various sequences by experts. A drawback common to most methods is that they are not tested on noisy images and images with artifacts.

An object segmentation algorithm classifies the pixels of a video image into a certain number of classes that are homogeneous with respect to some characteristic (e.g., texture or motion). It aggregates image pixels into objects. Some methods focus on color features and others on motion features. Some methods combine various features aiming at better results. The use of more features does not, however, guarantee better results, since some features can become erroneous or noisy and complicate the achievement of a good solution.

The objective in this section is to propose an automated modular object segmentation method that stays stable throughout an image sequence. This method uses a small number of features for segmentation, but focuses on their robustness to varying image conditions such as noise. This foregoes precise segmentation, such as at object boundaries. This interpretation of segmentation is most appropriate for applications such as surveillance and video database retrieval. In surveillance applications, robustness with respect to varying image and object conditions is of more concern than accurate segmentation. In object-based retrieval, the detailed outline of objects is often not necessary, but the semantic meaning of these objects is important.

4.2 Overall approach

The proposed segmentation method consists of simple but effective tasks, some of which are based on motion and object contour information. Segmentation is realized in four steps (Fig. 4.1): motion-detection-based binarization of the input gray-level images, morphological edge detection, contour analysis and tracking, and object labeling. The critical task is the motion-based binarization, which must stay reliable throughout the video shot. Here, the algorithm memorizes previously detected motion to adapt the current motion detection. Edge detection is performed by novel morphological operations with a significantly reduced number of computations compared to traditional morphological operations.
The advantage of morphological detection is that it produces gap-free and single-pixel-wide edges without the need for post-processing. Contour analysis transforms edges into contours and uses data from previous frames to adaptively eliminate noisy and small contours. Small contours are only eliminated if they cannot be matched to previously extracted contours, i.e., if a small contour has no corresponding contour in the previous image. Small contours lying completely inside a large contour are merged with that large contour according to a spatial homogeneity criterion, as will be shown in Section 4.6.1. The elimination of small contours is adapted spatially to the homogeneity criterion of an object and temporally to corresponding objects in previous images. This is different from methods that delete small contours and objects based on fixed thresholds (see, for example, [61, 119, 118, 85, 70]).

This object segmentation method is evaluated in the presence of MPEG-2 coding artifacts, white and impulsive noise, and illumination changes. Its robustness to these artifacts is shown in various simulations (Section 4.7). The computational cost is low and the results are reliable. Few parameters are used; these are adjusted automatically to the amount of noise and to the local image content. The result of the segmentation process is a list of objects with descriptors (Section 3.5) to be used for further object-based video processing. To reduce storage space, object and contour points are compressed using a differential run-length code.

Figure 4.1: Four-step object segmentation. (a) Block diagram: binarization (motion detection), morphological edge detection, contour analysis, and object labeling (contour filling). (b) Original image. (c) Binarization. (d) Edge detection: gap-free edges. (e) Contour analysis: small-contour and noise reduction. (f) Object labeling: objects with unique labels.

4.3 Motion detection

In a real video scene there are, generally, several objects which move differently against a background. Motion plays a fundamental role in segmentation by the HVS. Motion information can be extracted by motion estimation or motion detection. Motion estimation computes motion vectors using successive images, and points with similar motion are grouped into objects. There are several drawbacks to using motion estimation for segmentation. First, most motion estimation techniques tend to fail at object boundaries. Second, motion estimation techniques are, generally, too computationally expensive to serve real-time applications.

Motion detection aims at finding which pixels of an image have moved in order to group them into objects. Motion can be detected based on inter-frame differencing followed by thresholding. The problem, however, is that changes between images occur not only due to object motion but also due to local illumination variations, shadows, reflections, coding, and noise or artifacts. The main goal is to detect image changes that are due to object motion only.

Detection of motion using inter-frame differencing is common in many applications, including object segmentation, coding, video surveillance (e.g., of intruders or vehicles), satellite imagery (e.g., to measure land erosion), and medical imaging (e.g., to measure cell distribution). It is also used for various TV applications such as noise reduction and image interpolation.
This thesis develops a fast motion detection method that is adaptive to noise and robust to artifacts and local illumination changes. The proposed method uses a thresholding technique to reduce to a minimum the typical errors of motion detection, for instance, errors associated with shadows and reflections. The performance of the proposed method will be shown against other methods.

4.3.1 Related work

Motion detection methods often use a reference image R(n). R(n) can be a background image or a neighboring image of the sequence, I(n ± k)¹. Assume that the camera is static or the video images are stabilized (cf. Page 6 of Section 1.3). Assume further that background changes are much weaker than object changes, so that moving objects can be detected by thresholding a difference² image D(n) generated by subtracting the current image I(n) from the reference image R(n). The value of a pixel of D(n), D(i, j, n), can be expressed as:

D(i, j, n) = LP( Σ_{(i,j)∈W} |I(i, j, n) − R(i, j, n)| )   (4.1)

where W describes a neighborhood of the current pixel and LP is a low-pass filter. Large values in the difference map indicate locations of significant change. All pixels above a threshold are classified as changing. This results in a binary image, B(n), representing objects against the background.

¹ Depending on the application, motion detection can be performed between images in the short term (e.g., k = ±1), medium term (e.g., k = ±3), or long term (e.g., k = ±10).
² The difference is a map indicating the amount and sign of change for each pixel.

In [135], the image difference between two successive images is filtered by a 2-D median filter followed by deletion of small regions. This method is not robust to noise and produces objects whose outlines deviate significantly from the real boundaries. In [49], the difference image is low-pass filtered, thresholded, and post-processed by a 5×5 median filter. Changed regions that are too small are removed, and all unchanged regions lying inside a changed region are merged into the changed region so that holes are closed.

Much work based on statistical hypothesis tests and Bayesian formulations has been done to make differencing-based motion detection more robust (for reviews cf. [76, 123, 112]). In [1], a statistical, model-based technique is used. This method computes a global threshold to segment the difference image into moving and non-moving regions. This is done according to a model of the noise probability density function of the difference image. Detection is refined by the maximum a posteriori probability (MAP) criterion. Despite the refinement, over-segmented images are often produced. Moreover, MAP-based techniques require a large amount of computation. The method in [145] uses a background image and a statistical motion detection method [1] to segment objects. The parameters used need to be manually adjusted to account for noise, artifacts, and illumination changes. The method in [84] improves on the accuracy of the motion detection method introduced in [135] by a local adaptive relaxation technique that smoothes boundaries. This method considers previous masks and assumes that pixels that were detected as moving in the last images should be classified as moving in the current image. This introduces a form of temporal coherence. This method misses, however, relevant objects and produces inaccurate detection (see Section 4.7). Because of scene complexity, illumination variations, noise, and artifacts, accurate motion detection remains a challenge.
Three types of detection errors need investigation. The first type of error occurs when pixels are misclassified because of noise. If the misclassified pixels lie between objects, so that objects become connected, or if the image is overlaid with high noise, this misclassification can produce errors that complicate subsequent processing steps. The second type occurs when objects have shadows. The third type of error occurs when objects and background have a similar gray-level pattern. This thesis develops a motion detection method that aims at reducing these types of errors.

Figure 4.2: Motion detection schemes. (a) Original image. (b) Motion detection using a successive image. (c) Motion detection using a background image.

4.3.2 A memory-based motion detection method

As discussed earlier, motion can be detected either between successive images or between an image and a background image of the scene. A major difficulty with techniques using consecutive images is that they depend on inter-image motion being present between every image pair. Any moving object (or part of an object, in the case of articulated objects) that becomes stationary or uncovered is erroneously merged with the background. Furthermore, temporal changes between two successive images may be detected in areas that do not belong to objects but are close to object boundaries and in uncovered background, as shown in Fig. 4.2(b). In addition, removed or deposited objects cannot be correctly detected using successive images. This thesis develops an effective fast motion detection method based on image differencing with respect to a background image. The background image can be updated using a background updating technique (cf. Page 6 of Section 1.3). The disadvantage of using a background image is that shadows and reflections of moving objects can be highlighted in the difference image. As will be shown, simulations show that the proposed approach successfully reduces the effect of both artifacts.

Global illumination changes can be additive and multiplicative. Assume the images of a video shot are affected by global illumination change and by Gaussian noise. Then two successive images of the shot are modeled by:

I(n) = S(n) + η(n)
I(n + 1) = S(n + 1) + η(n + 1) + ξ(n + 1)   (4.2)

where S(n) and S(n + 1) are the projections of the scene onto the image plane, η(n) and η(n + 1) are additive noise, and ξ(n + 1) = a + bS(n + 1) represents the additive and multiplicative illumination changes. The constants a and b describe the strength of the global illumination changes. Thus an image difference may include artifacts due to noise and illumination changes.

The basic concept for detecting motion between the current image I(n) and the background image R(n) is shown in Fig. 4.3. The method comprises spatial filtering of the difference image using a 3 × 3 average filter, a 3 × 3 maximum filter (Eq. 4.3), and spatio-temporal adaptation based on thresholding:

D(n) = max(LP(|I(n) − R(n)|))   (4.3)

where LP is the averaging operator and max the maximum operator.

Figure 4.3: Diagram of the motion detection technique: the absolute difference of I(n) and R(n) is passed through a spatial averaging filter and a spatial MAX filter, and is then binarized by a spatio-temporally adapted, noise-adaptive threshold T(n) to produce B(n).

In real images, the difference |I(n) − R(n)| includes artifacts, for example, due to noise and illumination changes. To increase the spatial accuracy of detection, an average and a maximum filter are used. Averaging causes a linear addition of the correlated true image data, whereas the noise is uncorrelated and is reduced by averaging. Hence, motion detection becomes less sensitive to noise and the difference image becomes smoother. The maximum operator limits motion detection to a neighborhood of the current pixel, causing stability around object boundaries and reducing granular noisy points.
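The following C fragment is a minimal sketch of the difference-image computation of Eq. 4.3, i.e., the absolute difference followed by a 3 × 3 averaging filter and a 3 × 3 maximum filter. Buffer handling and border clamping are assumptions of this sketch.

#include <math.h>

static float at(const float *img, int w, int h, int i, int j)
{
    if (i < 0) i = 0; if (i >= h) i = h - 1;
    if (j < 0) j = 0; if (j >= w) j = w - 1;
    return img[i * w + j];
}

/* diff, avg and out must each hold w*h floats allocated by the caller. */
void difference_image(const float *I, const float *R, int w, int h,
                      float *diff, float *avg, float *out)
{
    /* absolute difference |I(n) - R(n)| */
    for (int k = 0; k < w * h; k++)
        diff[k] = fabsf(I[k] - R[k]);

    /* 3x3 averaging (LP) */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < w; j++) {
            float s = 0.0f;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    s += at(diff, w, h, i + di, j + dj);
            avg[i * w + j] = s / 9.0f;
        }

    /* 3x3 maximum filter */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < w; j++) {
            float m = 0.0f;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++) {
                    float v = at(avg, w, h, i + di, j + dj);
                    if (v > m) m = v;
                }
            out[i * w + j] = m;
        }
}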
To partially compensate for global illumination changes, an adaptation to the difference image is proposed in Section 4.4. To increase the temporal stability of detection throughout the video, a memory component is added to the motion detection process, as shown next.

Spatio-temporal adaptation

Two main difficulties with traditional motion detection methods based on differencing are: 1) they do not distinguish between object motion and other changes, for example, due to background movement (as with shaking tree leaves) or illumination changes, and 2) they do not account for changes occurring throughout a long video. Usually a fixed threshold is used for all images of the video shot. A fixed-threshold method fails when the amount of moving regions changes significantly. To address these difficulties, this thesis proposes a three-step thresholding method.

1. Adaptation to noise: To adapt the detection to image content and noise, an image-wide spatial threshold, Tn, is estimated using a robust noise-adaptive method (Section 4.4).

2. Quantization: To stabilize the thresholding spatio-temporally, the threshold Tn is quantized to Tq using m values. This quantization partly compensates for background and local illumination changes, significantly reduces fluctuations of the threshold, and hence stabilizes the binary output of the motion detector. Experiments using different quantization levels performed on different video shots suggest that the following three-level quantization is a good choice:

       Tmin : Tn ≤ Tmin
  Tq = Tmid : Tmin < Tn ≤ Tmid          (4.4)
       Tmax : otherwise.

Other quantization functions, such as using the middle values of the intervals instead of the limits Tmin, Tmid, Tmax, can also be used.

3. Temporal integration: To adapt motion detection to temporal changes throughout a video shot, the following temporal integration (memory) function is proposed:

         Tmin     : Tq ≤ Tmin
  T(n) = T(n − 1) : Tq < T(n − 1)       (4.5)
         Tq       : otherwise.

This function examines whether there has been a significant motion change in the current image, i.e., Tq > T(n − 1), and, if so, the current threshold Tq is selected. If no significant temporal change is detected, i.e., Tq < T(n − 1), the previous threshold T(n − 1) is selected. When no or little motion is detected, i.e., Tq ≤ Tmin, Tmin is selected. This temporal integration stabilizes the detection of binary object masks throughout a video shot. It favors changes due to strong motion and rejects changes due to small variations or artifacts. Note that other temporal integration functions, such as integration over more than one image, could also be used.
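The C fragment below sketches the three-level quantization of Eq. 4.4 and the temporal integration of Eq. 4.5. The limits T_MIN, T_MID, and T_MAX are placeholder values; the thesis determines them experimentally and does not list them here.

#define T_MIN 10.0f   /* assumed */
#define T_MID 20.0f   /* assumed */
#define T_MAX 35.0f   /* assumed */

/* Eq. 4.4: quantize the noise-adaptive threshold Tn to three levels. */
static float quantize_threshold(float Tn)
{
    if (Tn <= T_MIN) return T_MIN;
    if (Tn <= T_MID) return T_MID;
    return T_MAX;
}

/* Eq. 4.5: temporal integration; T_prev is T(n-1) from the last image. */
static float integrate_threshold(float Tq, float T_prev)
{
    if (Tq <= T_MIN)  return T_MIN;     /* no or little motion */
    if (Tq < T_prev)  return T_prev;    /* no significant change: keep memory */
    return Tq;                          /* significant motion change */
}

/* Per image: T(n) = integrate_threshold(quantize_threshold(Tn), T(n-1)). */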
4.3.3 Results and comparison

In this section, results of the proposed motion detection method are given and compared to a statistical motion detection method [145], which builds on the well-known method in [1]. This method determines a global threshold for binarization by a statistical test of hypothesis using a noise model. It compares the statistical behavior of a neighborhood of a pixel to an assumed noise probability density function. This comparison is done using a significance test that requires the specification of a threshold α representing the significance level. This method works very well in images where the noise fits the assumed model. However, various experiments using this method show that the parameters of the significance test and the noise model need to be tuned for different images, especially when illumination changes, shadows, and other artifacts are present in the scene. The method provides no indication as to how to adapt these parameters to the input images³.

³ In the following experiments, fixed parameters are used, i.e., the significance level α = 0.1, being the probability of rejecting the true hypothesis that at a specific pixel there are no moving objects, and the noise standard deviation σn = 15.0.

As can be seen in Figs. 4.4 and 4.5, the proposed method displays better robustness, especially in images with local illumination change (for example, when the sun shines, Fig. 4.5, objects enter the scene, Fig. 4.4, or doors are opened, Fig. 4.5). Also, the proposed method remains reliable in the presence of noise because it compensates for noise by adapting its parameters automatically to the amount of noise estimated in the image. Another important factor is the introduction of the temporal adaptation, which makes the procedure reliable throughout the whole video shot. An additional advantage of the proposed method is its low computational cost. For example, it requires an average of 0.1 seconds compared to 0.25 seconds for the reference method on a SUN-SPARC-5 360 MHz.

Figure 4.4: Motion detection comparison for the 'Survey' sequence (original images I(42) and I(145), background image, proposed method, reference method).
Figure 4.5: Motion detection comparison for the 'Stair' and 'Hall' sequences (original images, background images, proposed method, reference method).

4.4 Thresholding for motion detection

4.4.1 Introduction

Thresholding methods⁴ for segmentation are useful when separating objects from a background, or when discriminating objects from other objects that have distinct gray levels. This is also the case with difference images. Threshold values are critical for motion detection. A low threshold will cause either over-segmentation⁵ or noisy segmentation. A high threshold suppresses significant changes due to object motion and causes either under-segmentation or incomplete objects. In both cases the shape of the object can be grossly affected. Therefore, a threshold must be chosen automatically to adapt to image changes. In this section, a non-parametric robust thresholding operator is proposed which adapts to image noise. The proposed method is shown to be robust under various conditions.

⁴ For thorough surveys see [117, 137].
⁵ Over-segmentation is common to most motion-based segmentation because of the aperture problem, i.e., different physical motions are indistinguishable [4, 83].

4.4.2 Review of thresholding methods

A thresholding method classifies, depending on a threshold T, each pixel D(i, j, n) in a difference image D(n) as belonging to an object, labeled white in a binary image B(n), or to the background, labeled black:

               1 : D(i, j, n) > T
  B(i, j, n) =                          (4.6)
               0 : D(i, j, n) ≤ T.

Thresholding methods can be divided into global, local, and dynamic methods.
In global methods, a gray-level image is thresholded based on a single threshold T. In local methods, the image is partitioned into sub-images and each sub-image is thresholded by a single threshold. In dynamic methods, T depends on the spatial coordinates of the pixel to which it is applied. The study in [2] further classifies thresholding methods into parametric and non-parametric. Based on the gray-level distribution of the image, parametric approaches try to estimate the parameters of the image probability density function. Such estimation is computationally expensive. Non-parametric approaches try to find the optimal threshold based on some criteria such as variance or entropy. Such methods have been proven to be more effective than parametric methods [2]. Dynamic and parametric approaches have high computational costs [41, 17]. Local approaches are more sensitive to noise, artifacts, and illumination changes. For effective fast threshold determination, a combination of global and local criteria is needed.

There are several strategies to determine a threshold for binarization of an intensity image [137, 117, 41, 17, 94, 99, 2]. Most methods make assumptions about the histogram of the intensity signal (e.g., some methods assume a Gaussian distribution). The most common thresholding methods are based on histograms. For a bimodal histogram it is easy to fix a threshold between the local maxima. Most real images do not, however, have a bimodal histogram. A difference image, moreover, differs from intensity images, and thresholding methods for intensity images may not be appropriate for difference images. There are few thresholding methods for motion detection. The methods presented in [112] have some drawbacks. First, fine tuning of parameters, such as window size, is required. Second, adaptation to image noise is problematic. In addition, the methods are computationally expensive. Finally, these methods do not consider the problem of adapting the threshold throughout the image sequence. To overcome these problems, a non-parametric thresholding method is proposed which uses both global (block-based) and local (block-histogram-based) decision criteria. In this way, the threshold is adapted to the image contents and can change throughout the image sequence (e.g., for noisy and MPEG-2 images, Fig. 4.22).

4.4.3 Artifact-adaptive thresholding

Fig. 4.6 gives an overview of the proposed thresholding method for motion detection. The image is first divided into K equal blocks of size W × H. For each block, the histogram is computed and divided into L equal partitions or intervals. For each histogram partition, the most frequent gray level gpl, l ∈ {1, …, L}, is fixed. This is done to take small regions, noise, and illumination changes into account. To take global image content into the thresholding function, an average gray level µk, k ∈ {1, …, K}, of each block is calculated. Finally, the threshold Tg is calculated by averaging all the gpl and all the µk over the K blocks (Eq. 4.7). Simulations show this thresholding function is reliable with respect to image changes:

    Tg = [ Σ_{k=1..K} ( Σ_{l=1..L} gpl + µk ) ] / (K·L + K),         (4.7)

with

    µk = [ Σ_{i=1..W} Σ_{j=1..H} D(i, j) ] / (W·H).                   (4.8)

Figure 4.6: Extraction of the image-global threshold. The K block averages provide adaptation to global object changes (e.g., contrast change); the K×L most frequent gray levels of the block-histogram intervals provide adaptation to local changes and to small gray-level regions.
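To make the block and histogram computation of Eqs. 4.7 and 4.8 concrete, the following sketch computes the block averages µk, the most frequent gray level per histogram interval gpl, and the resulting global threshold Tg. It is only an illustration in Python/NumPy under assumed parameter choices (a 4 × 4 block grid and L = 4 intervals); it is not the implementation used in this thesis.

```python
import numpy as np

def global_threshold(D, blocks_per_side=4, L=4):
    """Compute the image-global threshold Tg (Eq. 4.7) from a difference image D.

    D is split into K = blocks_per_side**2 blocks; for each block the average
    difference mu_k (Eq. 4.8) and, per histogram interval, the most frequent
    gray level g_pl are collected.  Tg averages all g_pl and mu_k values.
    """
    rows, cols = D.shape
    bh, bw = rows // blocks_per_side, cols // blocks_per_side
    peaks, means = [], []
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            block = D[by*bh:(by+1)*bh, bx*bw:(bx+1)*bw]
            means.append(block.mean())                      # mu_k, Eq. 4.8
            hist, _ = np.histogram(block, bins=256, range=(0, 256))
            step = 256 // L
            for l in range(L):                              # most frequent level per interval
                seg = hist[l*step:(l+1)*step]
                peaks.append(l*step + int(np.argmax(seg)))  # g_pl
    K = blocks_per_side**2
    return (sum(peaks) + sum(means)) / (K*L + K)            # Eq. 4.7
```

Because every block and every histogram interval contributes one term, the resulting threshold reflects both global image content and small local gray-level regions, which is the property exploited in the adaptation steps described next.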
Adaptation to image noise

The threshold Tg is adapted to the amount of image noise as follows: if noise is detected, the threshold Tg is set higher accordingly. This adaptation is a function of the estimated noise standard deviation σn, taking into consideration that low sensitivity to small σn is needed. The following positive-quadratic weighting function (Fig. 4.7) is used:

    Tn = Tg + a · σn²,                                               (4.9)

where a < 1 and a depends on the maximum, practically assumed, noise variance (e.g., max(σn²) = 25).

Figure 4.7: Weighting functions used in implementations (Tn as a function of σn). SW represents static, LW linear, PQW positive quadratic, NQW negative quadratic, and IQW inverse quadratic weighting.

Adaptation to local changes by weighting the difference image

To account for local image structure and changes, especially at object borders, the average of the differences in a block k, µk as in Eq. 4.7, is weighted using a monotonically increasing function as follows:

    µk^n = µk + θ · µk,   with   θ = θmax − b · (µk / µmax),          (4.10)

where µmax is the maximum average, b < 1, and θmax < 0.5. When µk is high, meaning the image difference is high, block k is assumed to be inside the object and the threshold Tg should be slightly increased (by setting θ low). When µk is low, which can be due to artifacts or to block k being at the object boundary, the threshold Tg should increase more (by setting θ high). This means a bias is given to higher differences. The constants a, b, and θmax are experimentally determined and were fixed for all the simulations. Different values of these constants do not affect the performance of the whole segmentation algorithm. In a few cases, they affect the accuracy of object boundaries, which is not of importance for the intended applications of the proposed binarization.

4.4.4 Experimental results

The proposed thresholding procedure has been compared to the thresholding methods [99, 2], which have been used in various image processing systems (for example, [36]). They provide a good compromise between quality of binarization and computational cost (see Table III in [137]). Simulations show that the proposed method outperforms the reference methods in case of noise and illumination changes in the image. For images with no change due to object motion, the proposed method is more stable for motion detection applications (Fig. 4.8). To give a fair comparison, all simulation results in this section do not include the temporal adaptation of the threshold defined in Eq. 4.5. On average, the proposed algorithm needs 0.05 seconds on a SUN-SPARC-5 360 MHz. The method in [99] needs on average 0.05 seconds and the method in [2] needs 0.67 seconds.

Fig. 4.10 summarizes comparative results of these methods. As can be seen, the proposed method separates the bright background and the dark objects. Further, the algorithm was tested on noisy images and MPEG-2 encoded images, showing that it remains robust (Fig. 4.10). The good performance of the proposed thresholding function comes from the fact that it takes into account all areas of the image through its block partition and the division of each block into sub-regions.
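The noise adaptation of Eq. 4.9 and the quantization and memory steps of Eqs. 4.4 and 4.5 combine into a short per-frame update. The routine below is a minimal sketch of that combination; the constant a and the quantization limits Tmin, Tmid, Tmax are illustrative placeholders, not the values used in the thesis experiments.

```python
def adapt_threshold(Tg, sigma_n, T_prev, Tmin=10, Tmid=20, Tmax=35, a=0.04):
    """Noise adaptation (Eq. 4.9), quantization (Eq. 4.4), temporal memory (Eq. 4.5)."""
    Tn = Tg + a * sigma_n ** 2          # Eq. 4.9: raise the threshold with estimated noise
    if Tn <= Tmin:                      # Eq. 4.4: quantize the spatial threshold to three levels
        Tq = Tmin
    elif Tn <= Tmid:
        Tq = Tmid
    else:
        Tq = Tmax
    if Tq <= Tmin:                      # Eq. 4.5: temporal integration (memory)
        return Tmin                     # little or no motion detected
    if Tq < T_prev:
        return T_prev                   # keep the previous threshold
    return Tq                           # significant change: adopt the new threshold
```

The quantization step is what keeps the binary detector output from fluctuating frame to frame, while the memory step lets the threshold react only to strong temporal changes.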
Considering all gray levels, and not just the lower or higher ones, stabilizes the algorithm. Furthermore, the adaptation to image noise and the weighting of the difference signal stabilize the thresholding function.

Since a binary image resulting from thresholding for motion detection may contain some artifacts, many motion-detection-based segmentation techniques have a post-processing step, usually performed by non-linear filters such as median or morphological opening and closing. The effectiveness of such operations within the proposed object segmentation method will be discussed in Section 4.5.5.

4.5 Morphological operations

Detection of object motion results in binary images which indicate the contours and the object masks. In this section, a fast edge detection method and new operational rules for binary morphological erosion and dilation of reduced complexity are proposed.

4.5.1 Introduction

The basic idea of a morphological operation is to analyze and manipulate the structure of an image by passing a structuring element over the image and marking the locations where it fits. In mathematical morphology, neighborhoods are, therefore, defined by the structuring element, i.e., the shape of the structuring element determines the shape of the neighborhood in the image. Structuring elements are characterized by a well-defined shape (such as line, segment, or ball), size, and origin. The hardware complexity of implementing morphological operations depends on the size of the structuring elements; in some cases the complexity increases even exponentially. Known hardware implementations of morphological operations are capable of processing structuring elements only up to 3 × 3 pixels [63]. If higher-order structuring elements are needed, they are decomposed into smaller elements. One decomposition strategy is, for example, to represent the structuring element as successive dilations of smaller structuring elements. This is known as the “chain rule for dilation” [69]. It should be noted that not all structuring elements can be decomposed.

Figure 4.11: Erosion (S − E, eroded pixels) and dilation (S + E, expanded pixels) of an input image S by a kernel E; note that they are applied here to the black pixels.

The basic morphological operations are dilation and erosion (Fig. 4.11). These operations are expressed by a kernel operating on an input image.
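The decomposition strategy mentioned above can be checked directly with off-the-shelf binary morphology. The toy sketch below uses scipy.ndimage, which serves only as an illustration of the chain rule for dilation and is not the implementation used in this thesis.

```python
import numpy as np
from scipy import ndimage

# Chain rule for dilation: dilating twice with a 3x3 square kernel gives the
# same result as dilating once with the 5x5 square it decomposes into.
A = np.zeros((15, 15), dtype=bool)
A[7, 7] = True                                   # a single white pixel

S3 = np.ones((3, 3), dtype=bool)
S5 = np.ones((5, 5), dtype=bool)

twice_small = ndimage.binary_dilation(ndimage.binary_dilation(A, S3), S3)
once_large = ndimage.binary_dilation(A, S5)
assert np.array_equal(twice_small, once_large)   # identical binary images
```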
Erosion and dilation work conceptually by translating the structuring element to various points in the input image and examining the intersection between the translated kernel coordinates and the image coordinates. When specific conditions are met, the image content is manipulated using the following rules (for set-theoretical definitions see [69, 64]):

• Standard dilation: Move a kernel K line-wise over the binary image B. If the origin of K intersects a white pixel in B, then set all pixels covered by K in B to white if the respective pixel in K is set white.

• Standard erosion: Move a kernel K line-wise over the binary image B. If the origin of K intersects a white pixel in B and if all pixels of K intersect white pixels in B (i.e., K fits), then keep the pixel of B that intersects the origin of K white. Otherwise set that pixel to black.

Dilation is an expansion operator that enlarges objects into the background. Erosion is a thinning operator that shrinks objects. By applying erosion to an image, narrow regions can be eliminated while wider ones are thinned. In order to restore regions, dilation can be applied using a mask of the same size. Erosion and dilation can be combined to solve specific filtering tasks; widely used combinations are opening and closing, as well as edge detection. Opening (erosion followed by dilation) filters details and simplifies images by rounding corners from inside the object where the kernel fits. Closing (dilation followed by erosion) protects coarse structures, closes small gaps, and rounds concave corners.

Morphological operations are very effective for the detection of edges in a binary image B, where white pixels denote uniform regions and black pixels denote region boundaries [64, 69]. Usually, the following detectors are used:

    E = B − E[B, Km×m],   E = D[B, Km×m] − B,   or   E = D[B, Km×m] − E[B, Km×m],            (4.11)

where B is the binary image in which white pixels denote uniform regions and black pixels denote region boundaries, E is the edge image, E (D) is the erosion (dilation) operator (erosion is often represented by ⊖ and dilation by ⊕), Km×m is the m × m erosion (dilation) kernel used, and − denotes the set-theoretical subtraction.

4.5.2 Motivation for new operations

Motivation for new erosion and dilation

Standard morphological erosion and dilation are defined around an origin of a structuring element. The position of this origin is crucial for the detection of edges. For each step of an erosion or dilation, one pixel is set (at a time) in B. To achieve precise edges with single-pixel width, 3 × 3 kernels (defined around the origin) are used (kernel examples are shown in Fig. 4.12): when a 3 × 3 cross kernel is used, an incomplete corner detection is obtained (Fig. 4.13); a 3 × 3 square kernel gives complete edges but requires more computation (which grows rapidly with increased input data, Fig. 4.17(a)); and the use of a 2 × 2 square kernel produces incomplete edges (Fig. 4.14).

Figure 4.12: A 3 × 3 square, a 3 × 3 cross, and a 2 × 2 square kernel.

To avoid these drawbacks, new operational rules for edge detection by erosion or dilation are proposed. A fixed-size (2 × 2 square) kernel is used and the rules set all four pixels of this kernel at a time in B. With edge detection based on the new rules, accurate and complete edges are achieved and the computational cost is significantly reduced.
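For reference, the standard detectors of Eq. 4.11 can be expressed directly with off-the-shelf binary morphology. The sketch below uses scipy.ndimage with a 3 × 3 square structuring element; it is one possible rendering of the standard erosion-based detector, not the operators proposed in this thesis.

```python
import numpy as np
from scipy import ndimage

def standard_morph_edges(B):
    """Eq. 4.11 (first form) with a 3x3 square kernel: E = B - erosion(B)."""
    B = B.astype(bool)
    S = np.ones((3, 3), dtype=bool)                 # 3x3 square structuring element
    eroded = ndimage.binary_erosion(B, structure=S)
    # Edge pixels are white pixels whose 3x3 neighborhood does not fit entirely in B.
    return B & ~eroded
```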
Motivation for conditional operations

When extracting binary images from gray-level ones, the binary images are often enhanced by applying morphological operations, which are effective and efficient and, therefore, widely used [51]. Applying standard morphological operations for enhancement, however, can connect some object areas or erode some important information. This thesis contributes basic definitions of conditional morphological operations to solve this problem.

4.5.3 New morphological operations

Erosion

Definition: proposed erosion. Move the 2 × 2 square kernel line-wise over the binary image B. If at least one of the four pixels inside the kernel is black, then set all four pixels in the output image E to black. If all four pixels inside the 2 × 2 kernel are white, then set all four pixels in E (at a time) to white if they were not eroded previously.

Set-theoretical formulation. An advantage of the proposed erosion is that it can be formally defined based on set-theoretical intersection, union, and translation, in analogy to the formal definitions of the standard erosion [64]. The standard erosion satisfies the following property [64]: the erosion of an image by the union of kernels is equivalent to erosion by each kernel independently and then intersecting the results (Eq. 4.12). So, given an image A and kernels B and C in R²,

    Es[A, B ∪ C] = Es[A, B] ∩ Es[A, C],                               (4.12)

where Es denotes the standard erosion. The proposed erosion is then defined as follows:

    Ep[A, K2×2] = Es[A, S3×3]
                = Es[A, Kul ∪ Kur ∪ Kll ∪ Klr]
                = Es[A, Kul] ∩ Es[A, Kur] ∩ Es[A, Kll] ∩ Es[A, Klr],  (4.13)

where Ep denotes the proposed erosion, S3×3 is a 3 × 3 square kernel, and Kul (equivalently Kur, Kll, Klr) is the 2 × 2 kernel with origin at the upper-left (upper-right, lower-left, lower-right) corner (cf. Fig. 4.12). Thus the proposed erosion gives the same results as the standard erosion when using a 3 × 3 square kernel. However, the proposed erosion is significantly faster. Using a 3 × 3 cross kernel with the standard erosion accelerates processing but gives incomplete results, especially at corners (Fig. 4.13).

Figure 4.13: Proposed versus standard erosion on an original image (erosion, proposed detection, standard detection); the standard erosion uses a 3 × 3 cross kernel.

Dilation

Definition: proposed dilation. Move the 2 × 2 kernel line-wise over the binary image B. If at least one of the four binary-image pixels inside the kernel is white, then set all four pixels in the output image E (at a time) to white.

Set-theoretical formulation. In analogy to the standard erosion, the standard dilation satisfies the following property [64]: the dilation of an image by the union of kernels corresponds to dilation by each kernel and then taking the union of the resulting images (Eq. 4.14). This means that, given an image A and kernels B and C in R²,

    Ds[A, B ∪ C] = Ds[A, B] ∪ Ds[A, C],                               (4.14)

where Ds denotes the standard dilation. The proposed dilation is then given by:

    Dp[A, K2×2] = Ds[A, S3×3]
                = Ds[A, Kul ∪ Kur ∪ Kll ∪ Klr]
                = Ds[A, Kul] ∪ Ds[A, Kur] ∪ Ds[A, Kll] ∪ Ds[A, Klr],  (4.15)

where Dp denotes the new dilation.

Figure 4.14: Proposed versus standard dilation (the standard dilation uses a 2 × 2 kernel with origin at the upper-left pixel).

Binary edge detection

In this section, the need to use two operations (Eq.
4.11) for a binary morphological edge detection is questioned. When detecting binary edges, erosion and subtraction can be performed implicitly. Such an implicit detection is proposed in the next definition to reduce the complexity of morphological edge detection.

Definition: proposed edge detection. Move the 2 × 2 kernel over the binary image B. If at least one of the four pixels of the 2 × 2 kernel is black, then set the four pixels at the same positions in the output edge image E to white if their equivalent pixels in B are white; otherwise set those pixels to black. If the 2 × 2 kernel fits in a white area, the area is implicitly eroded, but edges (where the kernel does not fit) are kept.

Fig. 4.17(b) gives a complexity comparison of the new binary edge detection, edge detection with the proposed erosion, and edge detection using the standard erosion (with a 3 × 3 square kernel). As shown, the cost of edge detection is significantly reduced.

Conditional erosion and dilation

Usually object segmentation requires post-processing to simplify binary images. The most popular post-processing filters are the median and morphological filters such as opening or closing, because of their efficiency. The difficulty with these, however, is that they may connect or disconnect objects. To support morphological filters, this thesis suggests conditional dilation and erosion for the purpose of object segmentation. They are topology-preserving filters in the sense that they are applied only if specific conditions are met.

Conditional erosion. Using conditional erosion, a white pixel is eroded only if it has at least three black neighbors. This ensures that objects are not connected. It is performed mainly at object boundaries. The basic idea is that if the majority of the points covered by the 2 × 2 kernel are black, then this is most probably a border point and can be eroded. This is useful when holes inside the object have to be kept.

Conditional dilation. With conditional dilation, a black pixel is set to white if the majority of the 2 × 2 kernel pixels are white. If this condition is met, it is more likely that this pixel is inside an object and not a border pixel. Conditional dilation sets pixels mainly inside the object and stops at object boundaries to avoid connecting neighboring objects. This condition ensures that objects are not connected in the horizontal and vertical directions. In some cases, however, objects may be connected diagonally, as shown in Fig. 4.15: both ◦ pixels become connected, and so do the two object regions.

Figure 4.15: Cases where objects are connected using conditional dilation.

Figure 4.16: Edge detection comparison: (a) binary image; (b) Canny edge detection; (c) classical morphological edge detection using a 3 × 3 cross kernel; (d) proposed morphological edge detection. Note the shape distortion when using the Canny detector. The proposed detection gives more accurate results than the classical morphological detector using a 3 × 3 cross kernel.

4.5.4 Comparison and discussion

The proposed edge detectors have been compared to gradient-based methods such as the Canny method [31]. The Canny edge detector is a powerful method that is widely used in various imaging systems. The difficulty of using this method is that its parameters need to be tuned for different applications and images. Compared to the Canny edge detector, the proposed methods show higher detection accuracy, resulting in better shapes (Fig. 4.16(d)).
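A direct reading of the proposed edge-detection rule can be written as a small sliding-window routine. The following sketch is a plain, unoptimized NumPy illustration of that rule (function and variable names are ours), not the thesis implementation.

```python
import numpy as np

def proposed_edge_detection(B):
    """Proposed rule: for every 2x2 window that contains at least one black pixel,
    copy the window's white pixels into the edge image E; windows lying entirely
    in a white area are implicitly eroded (their pixels stay black in E)."""
    E = np.zeros_like(B)
    for i in range(B.shape[0] - 1):
        for j in range(B.shape[1] - 1):
            win = B[i:i+2, j:j+2]
            if not win.all():            # kernel does not fit: at least one black pixel
                E[i:i+2, j:j+2] |= win   # keep the white pixels as edge points
    return E
```

Because erosion and subtraction are folded into a single pass, no separate eroded image has to be computed, which is the source of the complexity reduction reported in Fig. 4.17(b).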
With appropriately tuned parameters, the Canny method can achieve better shape accuracy; such tuning is, however, not appropriate for automated video processing, and the remaining shape distortion is mainly due to the smoothing filter used by the Canny detector. In addition, the proposed edge detectors have lower complexity and produce gap-free edges, so that no edge linking is necessary. The proposed edge detectors are also significantly faster and give more accurate results (Fig. 4.16(c)) than the classical morphological edge detector using a 3 × 3 cross kernel. The proposed morphological edge detectors have the same performance as the standard morphological detectors but significantly reduced complexity, as Fig. 4.17 shows. This is confirmed using various natural image data. Fig. 4.17(a) shows that the computational cost using the standard erosion with a 3 × 3 square kernel grows rapidly with the amount of input data, while the cost of the proposed erosion stays almost constant. Computations can be further reduced by applying the novel morphological edge detection with implicit erosion (Fig. 4.17(b)).

Figure 4.17: Computational efficiency comparison, plotting time (in seconds) against the amount of data (% of white pixels): (a) proposed versus standard erosion; (b) proposed versus standard detection (standard erosion-based, proposed erosion-based, and proposed direct detection).

4.5.5 Morphological post-processing of binary images

A binary image B resulting from a binarization of a gray-level image may contain artifacts, particularly at object boundaries. Many segmentation techniques that use binarization (e.g., for motion detection as in Section 4.3) have a post-processing step, usually performed by non-linear filters such as median or morphological opening and closing. Non-linear filters are effective and efficient and, therefore, widely used [51]. This thesis examines the usefulness of applying a post-processing filter to the binary image. Erosion, dilation, closing, opening, and a 3 × 3 median operation were applied to the binary image and the results were compared. The temporal stability of these filters throughout an image sequence has been tested and evaluated. The following conclusions are drawn:

• Erosion can delete some important details and dilation can connect objects.
• Standard opening with a 3 × 3 cross kernel smoothes the image, but some significant object details can be removed and objects may get disconnected.
• Standard closing performs better smoothing but may connect objects.
• Conditional closing (using the conditional operations defined above) is significantly faster than standard closing and is more conservative in its smoothing. It may connect objects diagonally, as illustrated in Fig. 4.15.

To compensate for the disadvantages of the discussed operations, two solutions were tested:

• Conditional erosion followed by conditional closing.
• Erosion, a 3 × 3 median filter, and a conditional dilation.

Erosion before closing does not connect objects but filters out many details and may change the object shape. Erosion, median, and conditional dilation perform better by preserving edges and corners. In conclusion, applying smoothing filters can introduce artifacts, remove significant object parts, or disconnect object parts.
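The conditional operations above can likewise be written as simple majority tests on 2 × 2 windows. The sketch below is a rough illustration of one possible reading, in which the majority condition is evaluated on the four pixels of each 2 × 2 window of a 0/1 mask; the exact neighborhood used in the thesis implementation may differ.

```python
import numpy as np

def conditional_dilation(B):
    """One reading of conditional dilation: a black pixel becomes white when the
    majority of its 2x2 window is white, so growth happens mainly inside objects
    and stops at object boundaries."""
    out = B.copy()                       # conditions are tested on B, written to a copy
    for i in range(B.shape[0] - 1):
        for j in range(B.shape[1] - 1):
            if B[i:i+2, j:j+2].sum() == 3:      # one black pixel in a mostly white window
                out[i:i+2, j:j+2] = 1
    return out

def conditional_erosion(B):
    """One reading of conditional erosion: a white pixel is removed only when at
    least three pixels of its 2x2 window are black (a likely border point)."""
    out = B.copy()
    for i in range(B.shape[0] - 1):
        for j in range(B.shape[1] - 1):
            if B[i:i+2, j:j+2].sum() == 1:      # single white pixel among black ones
                out[i:i+2, j:j+2] = 0
    return out
```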
Such filtering artifacts complicate subsequent object-based video processing such as object tracking and object-based motion estimation. These effects are more severe when objects are small or when their parts are thin compared to the morphological or median masks used. Use of the above operations is recommended when objects and their connected parts are large. Such information is, however, rarely known a priori. Therefore, this thesis does not apply an explicit post-processing step but implicitly removes noise within the contour tracing procedure, as will be shown in Section 4.6.

4.6 Contour-based object labeling

This section deals with the extraction of contours from edges (Section 4.6.1) and with the labeling of objects based on contours (Section 4.6.2).

4.6.1 Contour tracing

The proposed morphological edge detection (Section 4.5) gives an edge image, E(n), where edges are marked white and points inside the object or of the background are black. The important advantage of morphological edge detection techniques is that they never produce edges with gaps. This facilitates contour tracing techniques. To identify the object boundaries in E(n), the white points belonging to a boundary have to be grouped into one contour, C. An algorithm that groups the points into a contour is called contour tracing. The result of tracing edges in E(n) is a list, C(n), of contours and their features, such as starting point and perimeter.

A contour, C ∈ C(n), is a finite set of points, {p1, …, pn}, where for every pair of points pi and pj in C there exists a sequence of points s = {pi, …, pj} such that i) s is contained in C and ii) every pair of successive points in s are neighbors of each other. In an image defined on a rectangular sampling lattice, two types of neighborhoods are distinguished: the 8-neighborhood and the 4-neighborhood. In an 8-neighborhood, all eight neighboring points around a point are considered. In a 4-neighborhood, only the four neighboring points to the right, left, up, and down are considered. A contour can be represented by the point coordinates or by a chain code. A list of point coordinates is needed for on-line object analysis. A chain code requires less memory than a coordinate representation and is desirable in applications that require storing the contours for later use [101].

Different contour tracing techniques are reported in the literature ([101, 7, 110]). Many methods are developed for specific applications such as pattern recognition. Some are defined to trace contours of simple structure. A commonly used technique is described in [102]. A drawback of this method is that it ignores contours inside other contours, fails to extract contours containing some 8-neighborhoods, and fails in case of contours with dead branches.

A procedure for tracing complex contours in real images

The proposed procedure aims at tracing contours of complex structure, such as those containing dead or inner branches, as illustrated in Fig. 4.19. The proposed tracing algorithm uses plausibility rules i) to locate the starting point of a contour, ii) to find neighboring points, and iii) to decide whether to select a contour for subsequent processing (Fig. 4.18). Tracing proceeds in a clockwise manner, and the algorithm looks for a neighboring point of the current point in the 8-neighborhood starting at the rightmost neighbors. This rule forces the algorithm to move around the object by looking for the rightmost point and never inside the object (see Rule 2).
The algorithm records both the contour chain code and the point coordinates.

Figure 4.18: Proposed contour tracing: locate a ‘starting point’ in E(n), find its connected neighboring points, then perform contour matching and selection against C(n−1) to obtain C(n).

In the following, let E(n) be the edge image of the original image I(n), C(n − 1) the list of contours of the objects of the original image I(n − 1), C(n) the list of contours of the objects of the original image I(n), Cc ∈ C(n) the current contour with starting point ps, Pc the length of Cc, i.e., the number of points of Cc, pc the current point, pi an 8-neighbor of it, and pp its previous neighbor.

Rule 1 - Locating a starting point. Scan the edge image, E(n), from left to right and from top to bottom until a white point pw is found. pw must be unvisited (i.e., not reached before) and must have at least one unvisited neighbor. If such a point is found, i) set ps = pw, ii) set pc = ps, and iii) perform Rule 2. If no starting point is found, end tracing. In case objects contain other objects, the given scanning direction forces the algorithm to trace first the outward and then the inward contours (Fig. 4.19(a)). Note that due to the image scanning direction (from left to right and from top to bottom), the object boundaries always lie to the left of the tracing direction.

Figure 4.19: Illustrating the effects of the proposed tracing rules: (a) Rule 1; (b) Rule 3, dead branch; (c) Rule 4, pi visited.

Rule 2 - Finding a neighboring point. The basic idea is to locate the rightmost neighbor of the current point. This ensures that object contours are traced from the outside and that tracing never enters branches inside the object. The definition of the rightmost neighbor of a current point depends on the current direction of tracing, as defined by the previous and the current points. If, for example, the previous point lies to the left of the current point, then the algorithm looks for a neighboring point, pi, within the five neighbors of pc displayed at the upper left of Fig. 4.20. The other neighbors of pc were neighbors of pp and were already visited, so there is no need to consider them. Based on the position of pp, eight search neighborhoods are defined (Fig. 4.20). Note that the remaining neighbors of pc that are not considered were already visited when tracing pp; since the algorithm is designed to close contours when a visited point is reached (see Rule 4), these points should not be considered. Depending on the position of pp, look for the next neighboring point pi of pc in the respective neighborhood as given in Fig. 4.20. If a pi is found, i) mark pc as visited if it is not already marked, ii) set pp = pc, iii) set pc = pi, and perform Rule 4. If no pi is found, perform Rule 3, i.e., delete a dead branch.

Figure 4.20: Neighborhoods of the current point, defined by the position of the previous point.
Rule 3 - Deleting dead branches. This rule is activated only if pc has no neighbor except pp. In this case, pc is assumed to be at the end of a dead branch of the contour and the following steps are performed: i) eliminate pc from E(n), ii) set pc = pp, iii) set pp to its previous neighbor (which can be easily derived from the stored chain code), and iv) perform Rule 2. Dead branches are points at the end of a contour that are not connected to the remaining contour (an example is given in Fig. 4.19(b)). In some rare cases these single-point-wide branches are part of the original object contour. In applications where reliable object segmentation foregoes the need for precision, these points provide no important information and can be deleted. Note that only single-point-wide dead branches are deleted using this rule. The elimination of dead branches facilitates subsequent steps of object-oriented video analysis.

Rule 4 - Closing a contour. If pc = ps or pc is marked visited, close Cc, eliminate its points from E(n), and perform Rule 5. If pc is marked visited, eliminate the remaining dead points of Cc from E(n). If Cc is not closed, i) store the coordinates and chain code of pc and ii) look for the next point, i.e., perform Rule 2. Note that this rule closes a contour even if the starting point is not reached. This is important in case of errors in previous segmentation steps that produce, for instance, dead branches (Fig. 4.19(c)).

Rule 5 - Selecting a contour. Do not add Cc to C(n) if 1) Pc is too small, i.e., Pc < tpmin where tpmin is a threshold, 2) Pc is small (i.e., tpmin1 < Pc < tpmin2, where tpmin2 is a threshold) and Cc has no corresponding contour in C(n − 1), or 3) Cc is inside a previously traced contour Cp such that the spatial homogeneity (Eq. 3.3) of the object of Cp is low. Otherwise add Cc to C(n). In both cases, perform Rule 1. With Rule 5, small contours are assumed to be the result of noise or erroneous thresholding and are, therefore, eliminated if they have no corresponding contours in the previous image. The elimination of small contours (representing small objects) is adapted spatially to the homogeneity criterion of an object and temporally to corresponding objects in previous images. This is different from many methods that delete small objects based on a fixed threshold (see, for example, [61, 119, 118, 85, 70]).

4.6.2 Object labeling

Object contours, characterized by contour points and their spatial relationship, are not sufficient for further object-based video processing (e.g., object tracking or object manipulation), which requires the positions of all object points. Therefore, extracted contours are filled to reconstruct the object and identify its exterior points. Contour filling attempts to recreate each point of the original object given its contour data [101, 102]. In video analysis this is needed, for example, when profile, area, or region-based shape or size are required. Finding the interior of a region when its contour is given is one of the most common tasks in image analysis. Several methods for contour filling exist. The two most widely used are the seed-based method and the scan-line method [59, 101, 102]. In the seed-based method, an interior contour point is needed as a start point; the contour is then filled in a recursive way.
The seed-based method is not automated and is therefore not suitable for on-line video applications. The scan-line method is an automated technique that fills a contour line by line. This thesis uses an enhanced, efficient version of the scan-line method as described in [7, 110].

4.7 Evaluation of the segmentation method

4.7.1 Evaluation criteria

Evaluation criteria for object segmentation can be divided into two groups:

i) Criteria based on implementation and architecture efficiency: implementation efficiency is measured by memory use and computational cost, i.e., the time needed to segment an image. Important parameters are image size, frame rate, and computing system (e.g., multitasking computers or computers with specialized hardware). Architectural performance is evaluated by the level of human supervision, the level of parallelism, and regularity, which means that similar operations are performed at each pixel.

ii) Criteria based on the quality of the segmentation results, i.e., spatial accuracy and temporal stability.

Since object segmentation is becoming integrated in many applications, it is important to be able to evaluate segmentation results using objective numerical measures, similar to the use of PSNR in comparing coding and enhancement algorithms. Such a measure would facilitate research and exchange between researchers. It would also reduce the cost of evaluation by humans.

Recently, an objective measure for segmentation quality has been introduced [140]. It measures the spatial accuracy, temporal stability, and temporal coherence of the estimated object masks relative to reference masks. The spatial accuracy (sQM (dB)) is measured in terms of the number and location of the differences between estimated and reference masks. The sQM is 0 for an estimated segmentation identical to the reference and grows with deviation from the reference, indicating lower segmentation quality. The temporal stability (vQM (dB)) is measured in terms of the fluctuation of the spatial accuracy with respect to the reference segmentation. The temporal coherency (vGC (dB)) is measured by the relative variation of the gravity centers of the reference and estimated masks. Both vQM and vGC are zero for perfect segmentation throughout the image sequence; the higher the values of vQM and vGC, the less stable the estimated segmentation over time. If all values are zero, then the segmentation is perfect with respect to the reference.
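As a concrete, much simpler example of such objective measures, mask disagreements can be counted against a reference segmentation. The sketch below computes a plain per-frame error fraction and a crude frame-to-frame fluctuation; it only illustrates the idea and is not the sQM/vQM/vGC measure of [140] used in the evaluation.

```python
import numpy as np

def mask_error(estimated, reference):
    """Fraction of pixels where the estimated binary object mask disagrees
    with the reference mask (0 = perfect agreement)."""
    assert estimated.shape == reference.shape
    return float(np.count_nonzero(estimated != reference)) / estimated.size

def temporal_instability(errors):
    """Mean absolute frame-to-frame fluctuation of the spatial error,
    a crude analogue of a temporal-stability measure."""
    diffs = np.abs(np.diff(np.asarray(errors, dtype=float)))
    return float(diffs.mean()) if diffs.size else 0.0
```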
4.7.2 Evaluation and comparison

In this section, simulation results using commonly referenced shots are given and discussed. These results are compared with the current state-of-the-art segmentation method [60, 6, 85], the COST-AM method, based on both objective and subjective evaluation criteria. The reference method is described in Section 3.3.

Automatic operation. The proposed method does not use a priori knowledge, and significant low-level parameters and thresholds are adaptively calculated.

Parallelism and regularity. All the elements of the algorithm have a regular structure: motion detection with filtering and thresholding, morphological edge detection, contour tracing, and filling. A rough analysis shows that motion detection and thresholding can be performed using parallel processing units. Edge detection, contour tracing, and filling are sequential methods.

    Binarization                   0.1 - 0.15
    Morphological edge detection   0.01
    Contour analysis               0.01
    Contour filling                0.001 - 0.01

Table 4.1: Segmentation time in seconds for the proposed method.

Computational cost. The proposed algorithm needs on average 0.15 seconds per image on a SUN-SPARC-5 360 MHz. As shown in Table 4.1, most of the computational cost is for motion detection and thresholding. The morphological operations and contour analysis require very little computation. As reported in [127], the reference method, COST-AM, needs on average 45 seconds on a PC-Pentium-II 333 MHz. Simulations on the SUN-SPARC-5 360 MHz show that the reference method needs roughly 95 seconds (not including global motion estimation).

Quality of results. In the evaluation, both indoor and outdoor test sequences and small and large objects are considered. All simulations are performed with the same parameter settings. The performance of the proposed segmentation is evaluated in the presence of MPEG-2 artifacts and noise (cf. Section 2.2). Fig. 4.22 demonstrates the robustness of the proposed method in such environments. The method stays robust in the presence of MPEG-2 artifacts (the MPEG-2 images have an average PSNR of 25.90 dB, which means that they are strongly compressed and include many artifacts) and noise (Gaussian white noise at 30 dB was added to the original sequence).

The proposed segmentation is objectively compared to the current version (4.x) of the COST-AM method (the evaluation software and the reference masks are courtesy of the COST-211 group, http://www.tele.ucl.ac.be/EXCHANGE/). Fig. 4.23 gives the comparison results based on the criteria mentioned in Section 4.7.1. As shown, the proposed segmentation is better with respect to all three criteria; in particular, it yields higher spatial accuracy. The three indicators of performance depend strongly on the reference object segmentation. For example, in the Hall test sequence, the reference segmentation focuses only on the two moving objects and disregards a third object deposited by one of the moving objects. Fig. 4.21 shows how the spatial accuracy sQM (dB) of the proposed method improves when the evaluation is focused on the moving objects.

Figure 4.21: Spatial accuracy as a function of the reference object segmentation (sQM (dB) over the sequence, evaluated against a two-object and a three-object reference).

In Figures 4.24-4.27, subjective results are given for sample sequences and compared to results of the reference algorithm COST-AM. The segmented object masks of the proposed method are more accurate than those of the reference method. In the masks generated by the COST-AM method, parts of moving objects are often not detected, or large background areas are added to the object masks. Both the spatial and the temporal coherence of the estimated object masks are better for the proposed method than for the COST-AM method. In case of small objects and outdoor scenes, the proposed method stays stable and segments all objects of the scene. As shown in Fig. 4.27, the reference method loses objects in some images, which is critical for object tracking applications. In a few cases, the COST-AM method yields more accurate object boundaries than the proposed method. This is a result of the color image segmentation used in the COST-AM method.

Limitations of the proposed segmentation algorithm. In the presence of shadows, the proposed method has difficulties in detecting accurate object shape. Some systems apply strategies to reduce shadows [131, 113]. This might, however, increase the computational cost.
This thesis proposes to compensate for the shadow artifacts in higher-level processing, as will be shown in Chapters 6 and 7.

To focus on meaningful objects, the proposed method needs a background image. A background image is available in surveillance applications. In other applications, a background update method has to be used, which must adapt to different environments. This limitation can be compensated by motion detection information in successive images, as in [85, 89], or by introducing texture and color analysis, as in [123, 6, 145]. Such an extension is worthwhile only if it is robust and computationally efficient.

4.8 Summary

In real-time video applications, fast unsupervised object segmentation is required. This chapter has proposed a fast automated object segmentation method which consists of four steps: motion-detection-based binarization, morphological edge detection, contour analysis, and object labeling. The originality of the proposed approach is that: 1) the segmentation process is divided into simple but effective tasks so that complex operations are avoided, 2) a fast, robust motion detection is proposed which uses a novel memory-based thresholding technique, 3) new morphological operations are introduced that require significantly fewer computations and show performance equal to standard morphological operations, and 4) a new contour analysis method is introduced that effectively traces contours of complex structure, such as those containing dead or inner branches. Both objective and subjective evaluations and comparisons show the robustness of the proposed methods, which are of reduced complexity, in noisy images and in images with illumination changes. The segmentation method uses few parameters, and these are automatically adjusted to noise and temporal changes within a video shot.

Figure 4.22: Object segmentation comparison (original, MPEG-2, and noisy objects) for the ‘Hall’ test sequence in case of MPEG-2 decoded (25 dB) and noisy (30 dB) sequences. The proposed method is robust with respect to noise and artifacts.

Figure 4.23: Objective evaluation obtained for the ‘Hall’ test sequence; the proposed segmentation is better with respect to all three criteria: (a) spatial accuracy comparison, where the proposed method has better spatial accuracy throughout the sequence (average gain ≈ 3.5 dB); (b) temporal stability comparison, where the proposed method has higher temporal stability throughout the sequence (average gain ≈ 1.0 dB); (c) temporal coherency comparison, where, after the first object enters the scene, the proposed method has higher temporal coherency (average gain ≈ 0.5 dB).

Figure 4.24: Comparison of results (COST-AM method versus proposed method) for the indoor ‘Hall’ test sequence. The proposed objects have better spatial accuracy.

Figure 4.25: Comparison of results (COST-AM method versus proposed method) for the outdoor ‘Highway’ test sequence. The reference method has lower temporal and spatial stability compared to the proposed method.

Figure 4.26: Comparison of results (COST-AM method versus proposed method) for the indoor ‘Stair’ test sequence. This sequence has strong object and local illumination changes.
The proposed method remains stable while the reference method has difficulties in providing reliable object masks.

Figure 4.27: Comparison of results (COST-AM method versus proposed method) for the outdoor ‘Urbicande’ test sequence. The scale of the objects changes across this sequence. The reference method loses some objects and its spatial accuracy is poor. The proposed method remains robust to variable object size and is spatially more accurate.

Chapter 5

Object-Based Motion Estimation

Motion estimation plays a key role in many video applications, such as frame-rate video conversion [120, 44, 19], video retrieval [8, 48, 134], video surveillance [36, 130], and video compression [136, 55]. The key issue in these applications is to define appropriate representations that can efficiently support motion estimation with the required accuracy. This chapter is concerned with the estimation of 2-D object motion from a video using segmented object data. The goal is to propose an object-based motion estimation method that meets the requirements of real-time and content-based applications such as surveillance and retrieval. In these applications, a representation of object motion that is meaningful for high-level interpretation, such as event detection and classification, foregoes precision of estimation.

5.1 Introduction

Objects can be classified into three major categories: rigid, articulated, and non-rigid [53]. The motion of a rigid object is a composition of a translation and a rotation. An articulated object consists of rigid parts linked by joints. Most video applications, such as entertainment, surveillance, or retrieval, assume rigid objects.

An image acquisition system projects a 3-D world scene onto a 2-D image plane. When an object moves, its projection is animated by a 2-D motion, to be estimated from the space-time image variation. These variations can be divided into global and local. Global variations can be a result of camera motion or global illumination change. Local variations can be due to object motion, local illumination change, and noise. Motion estimation techniques estimate apparent motion, which is due to true motion or to various artifacts such as noise and illumination change.

2-D object motion can be characterized by the velocity of image points or by their displacement. Let (px, py) denote the spatial coordinates of an object point in I(n−1) and (qx, qy) its spatial coordinates in I(n). The displacement dp = (dx, dy) of (px, py) is given by (dx, dy) = (qx − px, qy − py). The field of optical velocities over the image lattice is often called optical flow; this field associates a velocity to each image point. The correspondence field is the field of displacements over the image lattice. In video processing (e.g., entertainment, surveillance), estimation of the correspondence field is usually considered. The goal of a motion estimation technique is to assign a motion vector (displacement or velocity) to each pixel in an image. Motion estimation relies on hypotheses about the nature of the image or object motion, and is often tailored to application needs. Even with good motion models, a practical implementation may not find a correct estimate. In addition, motion vectors cannot always be reliably estimated because of noise and artifacts. Difficulties in motion estimation arise from unwanted camera motion, occlusion, noise, lack of image texture, and illumination changes.
Motion estimation is an ill-posed problem which requires regularization. A problem is ill-posed if no unique solution exists or if the solution does not depend continuously on the input data [92]. The choice of a motion estimation approach strongly depends on the application and on the nature of the processes that will interpret the estimated motion.

5.2 Review of methods and motivation

Motion estimation methods can be classified into two broad categories: gradient-based and matching methods (for a thorough review see [92]). Both generally assume that objects undergo pure translation, and both have been widely studied and used in the field of video coding and interpolation. Gradient-based approaches use a relationship between image motion and the spatio-temporal derivatives of image brightness. They use computations localized to small regions of the image. As a result, they are sensitive to occlusion. Their main disadvantage is that they are not applicable for motion of large extent unless an expensive multi-resolution scheme is used. A more reliable approach is the estimation of motion of a larger region of support, such as blocks or arbitrarily-shaped regions, based on parametric motion models.

Matching techniques locate and track small, identifiable regions of the image over time. They can estimate motion accurately only in distinguishable image regions. In general, matching techniques are highly sensitive to ambiguity among the structures to be matched. Resolving this ambiguity is computationally costly. Furthermore, it is often computationally impractical to estimate matches for a large number of regions. A matching-based motion estimation method that is frequently used and implemented in hardware is block matching [46, 20, 21]. Here, the motion field is assumed to be constant over rectangular blocks and is represented by a single motion vector in each block. Several refinements of this basic idea have been proposed [53, 43, 18]. In [18], for example, a spatio-temporal update strategy is used to increase the accuracy of the block-matching algorithm, and a median-based smoothing of the block motion is used. Three advantages of block-matching algorithms are: 1) easy implementation, 2) better quality of the resulting motion vector fields compared to other methods such as phase correlation and gradient methods in the presence of large motion, and 3) the possibility of implementation by regular VLSI architectures. An additional important advantage of block matching is that it does not break down totally.

Block matching has, however, some drawbacks. This is particularly true at object boundaries, where these methods assume an incorrect model and produce erroneous motion vectors, leading to discontinuities in the motion vector fields and causing ripped-boundary artifacts in the case of block-matching-based motion compensation. In motion-compensated images, block structures become visible and object boundaries may split. Another drawback is that the resulting motion vectors inside objects or object regions with a single motion are not homogeneous, producing ripped-region artifacts, i.e., structure inside regions can get split or distorted. Additionally, using a block-based algorithm results in block patterns in the motion vector field, causing blocking artifacts. These patterns often result in block motion artifacts in subsequently processed images. The human visual system is very sensitive to such artifacts (especially abrupt changes).
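For comparison with the object-based approach developed below, a minimal full-search block-matching routine is sketched here in plain NumPy, using the sum of absolute differences (SAD) criterion and illustrative block size and search range. It is not the thesis implementation and not an optimized, hardware-oriented design.

```python
import numpy as np

def block_match(prev, curr, block=16, search=8):
    """Full-search block matching: for each block of `curr`, find the displacement
    (dy, dx) into `prev` that minimizes the SAD matching error."""
    H, W = curr.shape
    field = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(H // block):
        for bx in range(W // block):
            y0, x0 = by * block, bx * block
            ref = curr[y0:y0 + block, x0:x0 + block].astype(int)
            best, best_d = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = y0 + dy, x0 + dx
                    if y < 0 or x < 0 or y + block > H or x + block > W:
                        continue                      # candidate block outside the image
                    cand = prev[y:y + block, x:x + block].astype(int)
                    sad = np.abs(ref - cand).sum()
                    if best is None or sad < best:
                        best, best_d = sad, (dy, dx)
            field[by, bx] = best_d
    return field
```

One vector per fixed block, regardless of object boundaries, is exactly what produces the ripped-boundary and blocking artifacts discussed above.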
Various studies show that the integration of object information in the process of estimating object motion reduces block-matching artifacts and enhances the motion vector fields [12, 10, 62, 30, 20, 55, 26, 49]. Block-based and pixel-based motion estimation methods have been widely used in the field of coding and image interpolation. The focus in these applications is on accurate motion and less on a meaningful representation of object motion. For video surveillance and retrieval, the focus is on extracting a flexible content-based video representation, i.e., on estimation that does not require high precision but is reliable and stable throughout an image sequence. Content-based video processing calls for motion estimation based on objects. In an object-based motion estimation algorithm, motion vectors are estimated using information about the shape or structure of segmented objects. This causes the motion vectors to be consistent within the objects and at object boundaries. In addition, since the number of objects is significantly smaller than the number of blocks in an image, object-based motion estimation has lower complexity. Furthermore, given the objects, motion models more accurate than pure translation can be used, for instance, models that include rotation and scaling.

Various object-based motion estimation methods have been proposed that require large amounts of computation [119, 49, 55, 32, 132]. This complexity is mainly due to segmentation, which is difficult and complex [143, 55, 118, 132]. Also, although they generally give good motion estimates, they can fail to interpret object motion correctly or can simply break down, because of their dependence on good segmentation. Several methods use region growing for segmentation, or try to minimize a global energy function whose minimum is difficult to find [143]. Furthermore, these methods include several levels of refinement. In the next sections, a low-complexity object motion estimation technique is introduced that is designed to fit the needs of content-based video representation. It relies on the estimation of the displacements of the sides of the object minimum bounding box (MBB). Two motion estimation steps are considered: an initial coarse estimation, which finds a single displacement for an object using the four sides of the MBB between two successive images, and the detection of non-translational motion followed by its estimation.

5.3 Modeling object motion

To describe the 2-D motion of objects, the definition of a motion model is needed. Two broad categories of 2-D motion models are defined: non-parametric and parametric. Non-parametric models are based on a dense local motion field where one motion vector is estimated for each pixel of the image. Parametric models describe the motion of a region in the image by a set of parameters. The motion of rigid objects, for example, can be described by a parametric motion model. Various simplifications of parametric motion models exist [95, 55, 91]. The models have different complexity and accuracy. In practice, as a compromise between complexity and accuracy, 2-D affine or 2-D ‘simplified linear’ motion models are used. Assuming a static camera or a camera-motion compensated video, local object motion can be described adequately as the composition of translation, rotation, and scaling. Changes in object scale occur when the object moves towards or away from the camera. This thesis uses the so-called ‘simplified linear’ models to describe objects’ motion.
Let (px, py) and (qx, qy) be the initial, respectively the final, position of a point p of an object undergoing motion.

Translation. The translation of p by (dx, dy) is given by

    qx = px + dx,   qy = py + dy.                                     (5.1)

Scaling. The scale change transformation of p is defined by

    qx = s · (px − cx) + cx,   qy = s · (py − cy) + cy,               (5.2)

where s is the scaling factor and (cx, cy) is the center of scaling.

Rotation. The rotational transformation of p is defined by

    qx = cx + (px − cx) cos φ − (py − cy) sin φ,
    qy = cy + (px − cx) sin φ + (py − cy) cos φ,                      (5.3)

where (cx, cy) is the center of rotation and φ the rotation angle.

Composition. If an object Oi is scaled, rotated, and displaced, then the final position of p ∈ Oi is defined by (assuming a small-angle rotation, which gives sin φ ≈ φ and cos φ ≈ 1)

    qx = px + dx + s · (px − cx) − φ · (py − cy),
    qy = py + dy + s · (py − cy) + φ · (px − cx).                     (5.4)

5.4 Motion estimation based on object-matching

A key issue when designing a motion estimation technique is its degree of efficiency with enough accuracy to serve the purpose of the intended video application. For instance, in object tracking and event detection, a tradeoff is required between computational cost and quality of object prediction. In video coding applications, an accurate motion representation is needed to achieve good coding quality at a low bit rate.

The proposed method estimates object motion based on the displacements of the MBB of the object. MBB-based object motion estimation is not a new concept. Usually, MBB-based methods use the displacement of the centroid of the object MBB, which is sensitive to noise and to other image artifacts such as occlusion. Most MBB motion estimators assume translational motion, although the motion type can be important information, as in retrieval, for instance. The contribution here is the detection of the type of object motion (translation, scaling, composition) and the subsequent estimation of one or more motion values per object depending on the detected motion type. In the case of a composition of these primitive motions, the method estimates the motion without specifying its composition. Special consideration is given to object motion in interlaced video and at the image margin. Analysis of the displacements of the four MBB sides further allows the estimation of more complex image motion, as when objects move towards or away from the camera (Section 5.4.3).

5.4.1 Overall approach

The proposed non-parametric motion estimation method is based on four steps (Fig. 5.1): object segmentation, object matching, MBB-based displacement estimation, and motion analysis and update.

Figure 5.1: Diagram of the proposed motion estimation method: object segmentation of I(n−1) and I(n), object matching supported by a motion and object memory, initial estimation of the MBB displacements, and MBB-motion analysis (detection of motion types) yielding the object motion.

Object segmentation has been considered in Chapter 4 and object matching will be discussed in Chapter 6. In the third step (Section 5.4.2), an initial object motion is estimated by considering the displacements of the sides of the MBBs of two corresponding objects (Fig. 5.2), accounting for possible segmentation inaccuracies due to occlusion and splitting of object regions. In the fourth step (Section 5.4.3), the type of the object motion is determined. If the motion is a translation, a single motion vector is estimated.
The proposed motion estimation scheme assumes that the shape of moving objects does not change drastically between successive images and that the displacements lie within a predefined range (the implementation uses the range [−16, +15], but other ranges can easily be adopted). These two assumptions are realistic for most video applications and for the motion of real objects.

5.4.2 Initial estimation

Let I(n) be the observed image at time instant n, defined on an X × Y lattice whose starting pixel, I(1, 1, n), is at the upper-left corner. If the motion is estimated forward, between I(n − 1) and I(n), then the direction of object motion is defined as follows: horizontal motion to the left is negative and to the right positive; vertical motion down is positive and up negative. The initial estimate of an object's motion comes from the analysis of the displacements of the four sides of the MBBs of two corresponding objects (Fig. 5.2), as follows.

Figure 5.2: MBB-based displacement estimation (displacements of the minimum and maximum rows and of the minimum and maximum columns between Op and Oi).

Definitions:
• Mi : Op → Oi a function that assigns to an object Op at time n − 1 an object Oi at time n.
• w = (wx, wy) the current displacement of Oi, between I(n − 2) and I(n − 1).
• (rminp, cminp), (rmaxp, cminp), (rminp, cmaxp), and (rmaxp, cmaxp) the four corners, upper left, lower left, upper right, and lower right, of the MBB of Op (cf. Fig. 5.2).
• rminp and rmaxp the upper and lower rows of Op.
• cminp and cmaxp the left and right columns of Op.
• rmini and rmaxi the upper and lower rows of Oi. If upper occlusion or splitting is detected then rmini = rminp + wy. If lower occlusion or splitting is detected then rmaxi = rmaxp + wy.
• cmini and cmaxi the left and right columns of Oi. If left occlusion or splitting is detected then cmini = cminp + wx. If right occlusion or splitting is detected then cmaxi = cmaxp + wx.
• drmin = rmini − rminp the vertical displacement of the point (rminp, cminp).
• dcmin = cmini − cminp the horizontal displacement of the point (rminp, cminp).
• drmax = rmaxi − rmaxp the vertical displacement of the point (rmaxp, cmaxp).
• dcmax = cmaxi − cmaxp the horizontal displacement of the point (rmaxp, cmaxp).
• dr = drmax − drmin the difference of the vertical displacements.
• dc = dcmax − dcmin the difference of the horizontal displacements.

The initial displacement, wi1 = (wx1i, wy1i), of an object is the mean of the displacements of the horizontal and vertical MBB sides (see the first part of Eqs. 5.5 and 5.6). In the case of segmentation errors, the displacements of parallel sides can deviate significantly, i.e., |dc| > td or |dr| > td. The method detects these deviations and corrects the estimate based on the previous estimate (wx, wy), as given in the second and third parts of Eqs. 5.5 and 5.6.
\[
w_{x_i}^{1} =
\begin{cases}
\dfrac{d_{c_{max}} + d_{c_{min}}}{2} & : |d_c| \le t_d \\[4pt]
\dfrac{d_{c_{min}} + w_x}{2} & : (|d_c| > t_d) \wedge \big[((d_{c_{max}} \cdot d_{c_{min}} > 0) \wedge (d_{c_{max}} > d_{c_{min}})) \vee ((d_{c_{max}} \cdot d_{c_{min}} < 0) \wedge (d_{c_{max}} \cdot w_x < 0))\big] \\[4pt]
\dfrac{d_{c_{max}} + w_x}{2} & : (|d_c| > t_d) \wedge \big[((d_{c_{max}} \cdot d_{c_{min}} > 0) \wedge (d_{c_{max}} \le d_{c_{min}})) \vee ((d_{c_{max}} \cdot d_{c_{min}} < 0) \wedge (d_{c_{max}} \cdot w_x > 0))\big]
\end{cases}
\tag{5.5}
\]

\[
w_{y_i}^{1} =
\begin{cases}
\dfrac{d_{r_{max}} + d_{r_{min}}}{2} & : |d_r| \le t_d \\[4pt]
\dfrac{d_{r_{min}} + w_y}{2} & : (|d_r| > t_d) \wedge \big[((d_{r_{max}} \cdot d_{r_{min}} > 0) \wedge (d_{r_{max}} > d_{r_{min}})) \vee ((d_{r_{max}} \cdot d_{r_{min}} < 0) \wedge (d_{r_{max}} \cdot w_y < 0))\big] \\[4pt]
\dfrac{d_{r_{max}} + w_y}{2} & : (|d_r| > t_d) \wedge \big[((d_{r_{max}} \cdot d_{r_{min}} > 0) \wedge (d_{r_{max}} \le d_{r_{min}})) \vee ((d_{r_{max}} \cdot d_{r_{min}} < 0) \wedge (d_{r_{max}} \cdot w_y > 0))\big]
\end{cases}
\tag{5.6}
\]

This estimated displacement may deviate from the correct value due to inaccurate object shape estimation across the image sequence. To stabilize the estimation throughout the image sequence, the first initial estimate wi1 = (wx1i, wy1i) is compared to the current estimate w = (wx, wy). If they deviate significantly, i.e., the difference |wx − wx1i| > tm or |wy − wy1i| > tm for a threshold tm, acceleration is assumed and the estimate wi is adapted to the current estimate as given in Eq. 5.7, where a represents the maximal allowable acceleration. In this way, the estimated displacement is adapted to the previous displacement to provide stability against inaccuracies in the estimation of the object shape by the object segmentation module.

\[
w_{x_i} =
\begin{cases}
w_{x_i}^{1} & : |w_{x_i}^{1} - w_x| \le t_m \\
w_x + a & : (|w_{x_i}^{1} - w_x| > t_m) \wedge (w_{x_i}^{1} > w_x) \\
w_x - a & : (|w_{x_i}^{1} - w_x| > t_m) \wedge (w_{x_i}^{1} < w_x)
\end{cases}
\qquad
w_{y_i} =
\begin{cases}
w_{y_i}^{1} & : |w_{y_i}^{1} - w_y| \le t_m \\
w_y + a & : (|w_{y_i}^{1} - w_y| > t_m) \wedge (w_{y_i}^{1} > w_y) \\
w_y - a & : (|w_{y_i}^{1} - w_y| > t_m) \wedge (w_{y_i}^{1} < w_y)
\end{cases}
\tag{5.7}
\]
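To make the rule structure of Eqs. 5.5 and 5.7 concrete, the following is a minimal sketch for the horizontal component only; the variable names follow the definitions above, while the threshold values td, tm and the acceleration a are illustrative placeholders, not the thesis parameters.

```python
def initial_horizontal_estimate(dc_min, dc_max, wx, td=2.0):
    """Eq. 5.5: mean of the two column displacements, corrected when the sides disagree."""
    dc = dc_max - dc_min
    if abs(dc) <= td:
        return 0.5 * (dc_max + dc_min)
    # The side displacements deviate (likely a segmentation error): keep the side that
    # is consistent with the previous horizontal displacement wx.
    same_sign = dc_max * dc_min > 0
    if (same_sign and dc_max > dc_min) or (not same_sign and dc_max * wx < 0):
        return 0.5 * (dc_min + wx)
    return 0.5 * (dc_max + wx)


def stabilized_horizontal_estimate(wx1, wx, tm=4.0, a=2.0):
    """Eq. 5.7: limit the change w.r.t. the previous estimate by the acceleration a."""
    if abs(wx1 - wx) <= tm:
        return wx1
    return wx + a if wx1 > wx else wx - a


# Example: the left column moved by +4 pixels, the right column by +12 (e.g., due to an
# occlusion at the right side); the previous displacement was +5 pixels.
wx1 = initial_horizontal_estimate(dc_min=4, dc_max=12, wx=5)
print(wx1, stabilized_horizontal_estimate(wx1, wx=5))
```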
5.4.3 Motion analysis and update

Often, objects correspond to a large area of the image and a simple translational model for object matching is not appropriate; a more complex motion model must be introduced. To achieve this, the motion of the sides of the MBB is analyzed and motion types are detected based on plausibility rules. This analysis detects four types of object motion change: translation, scaling, rotation, and acceleration. If non-translational motion is estimated, the object is divided into several partitions that are assigned different motion vectors. The number of partitions depends on the magnitude of the estimated non-translational motion. Usually the motion within objects does not contain fine details and motion vectors are spatially consistent, so that large object regions have identical motion vectors. Therefore, the number of partitions need not be high.

Figure 5.3: Scaling: symmetrical, (nearly) identical displacements of all MBB sides. (a) Detection of scale change; (b) vertical scaling estimation with displacements interpolated between those of the minimum and maximum rows.

Detection of translation This thesis assumes translational object motion if the displacements of the horizontal and vertical sides of the object MBB are nearly identical, i.e.,

Translation : (|dr| < td) ∧ (|dc| < td).     (5.8)

In this case one motion vector (Eq. 5.7) is assigned to the whole object.

Detection of scaling This thesis takes the scaling center to be the centroid of the segmented object and assumes object scaling if the displacements of the parallel sides of the MBB are symmetrical and nearly identical in magnitude, i.e.,

Scaling : ((|dr| < ts) ∧ ((drmin · drmax) < 0)) ∧ ((|dc| < ts) ∧ ((dcmin · dcmax) < 0)),     (5.9)

with a small threshold ts. For example, if one side is displaced to the right by three pixels, the parallel side is displaced by three pixels to the left. This is illustrated in Fig. 5.3(a). If scale change is detected, the object is divided into sub-regions, where the number of regions depends on the difference |dr|. Each region is then assigned one displacement as follows: the region closest to rmax is assigned drmax and the region closest to rmin is assigned drmin. For in-between regions, the motion is interpolated by increasing or decreasing drmin and drmax (Fig. 5.3). The accurate detection of scaling depends on the performance of the segmentation; Eq. 5.9, however, takes possible segmentation errors into account.

Detection of rotation Rotation about the center can be detected when there is a small difference between the orientations of the horizontal MBB sides and a small difference between the orientations of the vertical sides in the current and previous images (Fig. 5.4).

Figure 5.4: Rotation: similar orientations of the MBB sides of Op and Oi.

Detection of general motion In the case of a composition of motion types, three types of motion are considered: translational motion, non-translational motion, and acceleration. If

|wyi − drmin| > a ∨ |wyi − drmax| > a,     (5.10)

where a is the maximal possible acceleration, then a vertical non-translation is declared and the object is divided into |drmax − drmin| + 1 vertical regions. As with scale change estimation, each region is assigned one displacement as follows: the region closest to rmax is assigned drmax and the region closest to rmin is assigned drmin. The motion of in-between regions is interpolated by increasing or decreasing drmin and drmax (Fig. 5.5); a small sketch of this partitioning is given at the end of this section.

Figure 5.5: Vertical non-translational motion estimation: V1 is the displacement of the minimum row, V4 the displacement of the maximum row, and V2 and V3 are interpolated.

If

|wxi − dcmin| > a ∨ |wxi − dcmax| > a,     (5.11)

then horizontal non-translational motion is declared and the object is divided into |dcmax − dcmin| + 1 horizontal regions. Each region is assigned one displacement as follows: the region closest to cmax is assigned dcmax and the region closest to cmin is assigned dcmin. The motion of the other regions is interpolated based on dcmin and dcmax.

Detection of motion at the image margin MBB-based motion estimation is affected by objects entering or leaving the visual field. Therefore, this condition has to be explicitly detected so that the estimation can be adapted. Motion at the image border is detected by a small motion of the MBB side that lies at the image border. The motion of the object is then defined based on the motion of the MBB side that is not at the image border (cf. Fig. 5.6). This consideration is important for event-based representation of video: it enables tracking and monitoring of object activity as soon as objects enter or leave the image.

Figure 5.6: Object motion at the image border (horizontal displacement of an object Op, Oi whose MBB side lies at the border).
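The partitioning used for scaling and for general non-translational motion (Figs. 5.3 and 5.5) can be sketched as follows; the linear interpolation between drmin and drmax is an assumption consistent with the description above, and the code is illustrative rather than the thesis implementation.

```python
def vertical_region_displacements(dr_min, dr_max):
    """One vertical displacement per region; |dr_max - dr_min| + 1 regions, the region
    nearest the upper MBB row gets dr_min, the one nearest the lower row gets dr_max."""
    n_regions = abs(int(dr_max) - int(dr_min)) + 1
    if n_regions == 1:
        return [float(dr_min)]
    step = (dr_max - dr_min) / (n_regions - 1)
    return [dr_min + k * step for k in range(n_regions)]

# Example: the upper row moved up by 2 pixels and the lower row down by 2 pixels
# (a scaling-like pattern); five regions with interpolated displacements result.
print(vertical_region_displacements(dr_min=-2, dr_max=2))   # [-2.0, -1.0, 0.0, 1.0, 2.0]
```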
Compensation of interlaced artifacts Analog or digital video can be classified as interlaced or non-interlaced (progressive)². Interlacing often disturbs vertically aligned image edges. In interlaced video, vertical motion estimation can be distorted by aliasing, because two successive fields lie on different rasters. The effect is that the vertical motion vector fluctuates by ±1 between two fields. To compensate for this fluctuation, the current and previous vertical displacements are compared; if they deviate by only one pixel, the smaller of the two displacements is selected. Another (computationally more expensive) way to compensate for the effect of interlacing is to interpolate the missing lines of the raster so that both fields lie on the same raster. This interpolation shifts each line of the field and therefore must be done differently for the two fields. Such an approach has been investigated in [18], which shows that the effect of interlace aliasing can be significantly reduced.

² Non-interlaced video is also called progressive scan; most personal computers use progressive scan, where all lines of a frame are displayed in one pass. TV signals are interlaced: each frame consists of two fields displayed in two passes, and each field contains every other horizontal line of the frame. A TV displays the first field of alternating lines over the entire screen, and then displays the second field to fill in the alternating gaps left by the first field. An NTSC field is displayed approximately every 1/60 of a second and a PAL field every 1/50 of a second.

5.5 Experimental results and discussion

5.5.1 Evaluation criteria

Evaluation criteria for motion estimation techniques can be divided into:
1) Accuracy criteria: Two subjective criteria are used to evaluate the accuracy of the estimated motion vectors. The first is to display the estimated vector fields and the original image side by side (Fig. 5.8). The second is based on motion compensation, a non-linear prediction technique where the current image I(n) is predicted from I(n − 1) using the motion estimated between these images (Fig. 5.9).
2) Consistency criteria: The second category of evaluation criteria is the consistency of the motion vectors throughout the image sequence. Motion-based object tracking is one way to measure the consistency of an estimated motion vector (Chapter 6).
3) Implementation criteria: An important implementation-oriented criterion is the cost of computing the motion vectors. This criterion is critical in real-time applications, such as video surveillance, video retrieval, or frame-rate conversion, and methods intended for real-time environments should be evaluated against it.

There are also objective criteria, such as the mean square error (MSE), the root mean square error (RMSE) and the PSNR (cf. [43]), to evaluate the accuracy of motion estimation. The selection of an appropriate evaluation criterion depends on the application. In this thesis, the applications are real-time object tracking for video surveillance and real-time object-based video retrieval. In these applications, objective evaluation criteria are not as appropriate as in coding or noise reduction applications. Objective evaluations using the MSE criterion have nevertheless been carried out; they show that the proposed motion estimation method gives a lower MSE than the block-matching technique.

5.5.2 Evaluation and discussion

Block matching is one of the fastest and relatively reliable motion estimation techniques, is used in many applications, and is likely to remain in wide use. The proposed method is therefore compared in this section to a state-of-the-art block-matching motion estimator that has been implemented in hardware and found to be useful for TV applications, such as noise reduction [43, 46].
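For reference, the objective criteria mentioned in Section 5.5.1 can be computed as in the following generic sketch for 8-bit gray-level images; it is not tied to the thesis implementation.

```python
import numpy as np

def mse(original: np.ndarray, predicted: np.ndarray) -> float:
    """Mean square error between an image and its motion-compensated prediction."""
    diff = original.astype(np.float64) - predicted.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(original: np.ndarray, predicted: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    m = mse(original, predicted)
    return float("inf") if m == 0 else 10.0 * np.log10(peak ** 2 / m)

# Example with two synthetic frames:
a = np.random.randint(0, 256, (288, 352))
b = np.clip(a + np.random.randint(-3, 4, a.shape), 0, 255)
print(mse(a, b), np.sqrt(mse(a, b)), psnr(a, b))   # MSE, RMSE, PSNR
```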
Computational costs Although block matching is a fast technique, the video analysis system presented in this thesis needs still faster techniques. Simulation results show that the computational cost of the object-based motion estimation is about 1/15 of that of a fast block matching [43]; this block-based method itself has a complexity about forty times lower than that of a full-search block matching algorithm as used in various MPEG-2 encoders. Furthermore, the MBB-based object partition and motion estimation are regular, i.e., the same operations are applied for each object. Because of its low computational cost and regular operations, the proposed method is suitable for real-time video applications, such as video retrieval.

Quality of the estimated motion In the case of partial or complete object occlusion, object-oriented video analysis relies on the estimated motion of the object to predict its position or to find it if it is lost. Thus a relatively accurate motion estimate is required. Block matching gives motion estimates for blocks and not for objects, and it can fail if there is insufficient structure (Fig. 5.8(e)). The proposed method, as will be shown in Chapters 6 and 7, provides estimates good enough to be used in object prediction for object tracking and in event extraction. Figs. 5.7, 5.8, and 5.9 show samples of the results. Fig. 5.8 displays the horizontal and vertical components of the motion field between two frames of the sequence 'Highway', estimated by the block matching and object matching methods. The horizontal and vertical components are encoded separately; the magnitudes were scaled for display purposes, with darker gray levels representing negative motion and lighter gray levels representing positive motion. Another measure of the quality of the estimated motion is given by comparing the object predictions obtained with block matching and with object matching. Fig. 5.9 shows that, despite being a simple technique, the proposed method gives good results compared to more sophisticated block matching techniques.

Figure 5.7: Object matching versus block matching: the first row shows horizontal block motion vectors using the method in [43] and object motion vectors for the same sequence; the second row shows the mean-square error between the motion-compensated and original images using block vectors and object vectors.

A drawback of block-matching methods is that they deliver non-homogeneous motion vectors inside objects, which affects motion-based video processing techniques such as noise reduction. The proposed cost-effective object-based motion estimation, on the other hand, provides more homogeneous motion vectors inside objects (Fig. 5.7). Using this motion information, the block-matching motion field can be enhanced [12]. Fig. 5.7 displays an example of incomplete block-matching motion compensation and shows the better performance of object-matching motion compensation. The integration of block and object motion information is an interesting research topic for applications such as motion-adaptive noise reduction or image interpolation.

Figure 5.8: Object versus block motion: (a) original image; (b) horizontal object motion field; (c) vertical object motion field; (d) horizontal block motion field; (e) vertical block motion field. Note the estimated non-translational motion of the left car. Motion is coded as gray levels, where a strong dark level indicates fast motion to the left or up and a strong bright level indicates fast motion to the right or down.
Figure 5.9: Prediction of objects: (a) object-based prediction of I(196), where the objects are correctly predicted; (b) block-based prediction of I(196), with artifacts introduced at object boundaries; (c) object-based prediction of I(6) (zoomed in); (d) block-based prediction of I(6) (zoomed in). Block-based prediction introduces various artifacts, while object-based prediction gives smoother results inside objects and at boundaries.

5.6 Summary

A new real-time approach to object-based motion estimation between two successive images has been proposed in this chapter. The approach consists of an explicit matching of arbitrarily-shaped objects to estimate their motion. Two motion estimation steps are considered: estimation of the displacement of an object by calculating the displacement of the mean coordinates of the object, and estimation of the displacements of the four MBB sides. These estimates are compared; if they differ significantly, a non-translational motion is assumed and different motion vectors are assigned to different image regions. In the proposed approach, extracted object information (e.g., size, MBB, position, motion direction) is used in a rule-based process with three steps: object correspondence; estimation of the MBB motion based on the displacements of the sides of the MBB, so that the estimation process is independent of the intensity signal; and detection of the object motion types (scaling, translation, acceleration) by analyzing the displacements of the four MBB sides and assigning different motion vectors to different regions of the object. Special consideration is given to object motion in interlaced video and at the image margin. Various simulations have shown that the proposed method provides good estimates for object tracking, event detection, and high-level video representation, as will be shown in the following chapters.

Chapter 6

Voting-Based Object Tracking

Object tracking has various applications. It can be used to facilitate the interpretation of video for a high-level description of the temporal behavior of objects (e.g., activities such as entering, stopping, or exiting a scene). Such high-level descriptions are needed in various content-based video applications such as surveillance or retrieval [27, 70, 35, 133, 130]. While object tracking has been extensively studied for surveillance and video retrieval, limited work has been done on temporally integrating or tracking objects or regions throughout an image sequence. Tracking can also be used to assist the estimation of coherent motion trajectories over time and to support object segmentation ([55], Chapter 5). The goal of this chapter is to develop a fast, robust object tracking method that accounts for multiple correspondences and object occlusion. The object tracking module receives input from the motion estimation and object segmentation modules. The main issue in tracking systems is reliability in the presence of shadows, occlusion, and object splitting; the proposed method focuses on solutions to these problems.

6.1 Introduction

The video analysis methods proposed so far, object segmentation and motion estimation, provide low-level spatial and temporal object features for consecutive images of a video.
In various applications, such as video interpretation, low-level, locally limited object features are not sufficient to describe moving objects and their behavior throughout an entire video shot. To achieve a higher-level object description, objects must be tracked and their temporal features registered as they move. Such tracking and description transforms locally-related objects into video objects. Tracking objects throughout the image sequence is possible because of spatial and temporal continuity: objects usually move smoothly and do not disappear or change direction suddenly, so the temporal behavior of moving objects is predictable. Several kinds of changes, however, make tracking objects in real scenes a difficult task:
• Image changes, such as noise, shadows, light changes, surface reflectance and clutter, can obscure object features and mislead tracking.
• The presence of multiple moving objects further complicates tracking, especially when objects have similar features, when their paths cross, or when they occlude each other.
• Non-rigid and articulated objects are yet another factor that confuses tracking, because their features vary.
• Inaccurate object segmentation also hampers tracking.
• Feature changes, e.g., due to object deformation or scale change (object size can change rapidly), can also confuse the tracking process.
• Finally, application-related requirements, such as real-time processing, limit the design freedom of tracking algorithms.
This thesis develops an object tracking method that addresses many of these difficulties using a reliable strategy to select features that remain stable over time. It uses a robust detection of occlusion to update the features of occluded objects, and it integrates features robustly so that noisy features are filtered or compensated.

6.2 Review of tracking algorithms

Applications of object tracking are numerous [80, 42, 71, 16, 65, 75, 55, 50, 40, 70]. Two strategies can be identified: one uses correspondence to match objects between successive images, and the other performs explicit tracking using stochastic methods such as MAP approaches [75, 15, 71, 55]. Explicit tracking approaches model occlusion implicitly but have difficulties detecting entering objects without delay and tracking multiple objects simultaneously. Furthermore, they assume models of the object features that might become invalid [75]. Most such methods have high computational costs and are not suitable for real-time applications. Tracking based on correspondence tracks objects either by estimating their trajectory or by matching their features. In both cases object prediction is needed to define the location of the object along the sequence or to predict occluded objects. Prediction techniques can be based on Kalman filters or on motion estimation and compensation. While the use of a Kalman filter [80, 50, 16, 40] relies on an explicit trajectory model, motion compensation does not require a trajectory model. In complex scenes, the definition of an explicit trajectory model is difficult and can hardly be generalized to many video sequences [71]. Furthermore, basic Kalman filtering is noise sensitive and can hardly recover its target when lost [71]. Extended Kalman filters can estimate tracks in some occlusion cases but have difficulty when the number of objects and artifacts increases.
Correspondence-based tracking establishes correspondences between the features of an object in one image and the features of an object in the next. Tracking methods can be divided into three categories according to the features they use:
• Motion-based methods (e.g., [80]) track objects based on their estimated motion. They assume either a simple translational motion model or more complex parametric models, e.g., affine models. Although robust motion estimation is not an easy task, motion is a powerful cue for tracking and is widely used.
• Contour-based methods [71, 16] represent objects by their boundaries and shapes. Contour-based features are fast to compute and to track and are robust to some illumination changes. Their main disadvantage is sensitivity to noise.
• Region-based methods [65, 16] represent objects through their spatial pattern and its variations. Such a representation is, in general, robust to noise and object scaling. It requires, however, large amounts of computation for tracking (e.g., using correlation methods) and is sensitive to illumination changes. Region-based features are useful for small objects or low-resolution images, where contour-based methods are more strongly affected by noise or by the low resolution.
While earlier tracking algorithms were based on motion and Kalman filtering [80], recent algorithms [50, 65, 70] combine features for more robust tracking. The method in [65] is based on change detection at multiple image scales and a Kalman filter to track segmented objects based on contour and region features. This approach is fast and can track multiple objects. It has, however, a large tracking delay (objects are detected and tracked only after being in the scene for a long time), it does not consider object occlusion, its object segmentation shows large deviations, and its model is oriented to one narrow application (vehicle tracking). In [16], object tracking is performed using active contour models, region-based analysis, and Kalman-based prediction. This method can track one object and relies heavily on Kalman filtering, which is not robust to clutter and artifacts; moreover, it has a high computational cost. The study in [40] uses change detection, morphological dilation with a 5 × 5 kernel, estimation of the center of gravity, and Kalman filters for position estimation. It is able to keep the object of interest in the field of view, but it does not consider object occlusion and its computational cost is high. The method in [70] is designed to track people who are isolated, move in an upright fashion, and are unoccluded. It tracks objects by modeling the motion of their body parts and matching objects whose MBBs overlap. To continue tracking objects after occlusion, statistical features of two persons before occlusion are compared to the features after occlusion to recover the objects. A recent object tracking algorithm is proposed in [50]. It consists of several stages: motion estimation, change detection with background adaptation, spatial clustering (region isolation, merging, filtering, and splitting), and Kalman-filter-based prediction for tracking. The system is optimized to track humans and has a fast response on a workstation with a specialized high-performance graphics card, although the response time depends on the contents of the input sequence. It uses a special procedure to detect shadows and reduce their effects. The system can also track objects in the presence of partial occlusion.
The weak part of this system is the change detection module, which is based on an experimentally fixed threshold to binarize the image difference signal. This threshold remains fixed throughout the image sequence and for all input sequences. Furthermore, all the thresholds used in the system are optimized by a training procedure based on a sample image sequence. As discussed in Section 4.4 and in [112], experimentally selected thresholds are not appropriate for a robust autonomous surveillance system; thresholds should be calculated dynamically based on the changing image and sequence content, which is especially critical in the presence of noise. Many methods for object tracking contribute to solving some difficulties of the object tracking problem. Few have considered real environments with multiple rigid and/or articulated objects, and only limited solutions to the occlusion problem exist (examples are [70, 50]); these methods track objects after, and not during, occlusion. In addition, many methods are designed for specific applications [65, 40, 75] (e.g., tracking based on body-part models or vehicle models) or impose constraints regarding camera or object motion (e.g., upright motion) [70, 50]. In this chapter, a method to track objects in the presence of multiple rigid or articulated objects is proposed. The algorithm is able to solve the occlusion problem in the presence of multiple crossing paths: it assigns pixels to each object involved in the occlusion and tracks objects successfully during and after occlusion. There are no constraints on the motion of the objects or on the camera position. The sample sequences used for evaluation are taken with different camera positions, and objects can move close to or far from the camera. When objects are close to the camera, occlusion is stronger and harder to resolve.

6.3 Non-linear object tracking by feature voting

6.3.1 HVS-related considerations

Visual processing ranges from low-level, or iconic, processes to high-level, or symbolic, processes. Early vision is the first stage of visual processing, where elementary properties of images such as brightness are computed. It is generally agreed that early vision involves the measurement of a number of basic image features such as color or motion. Tracking can be seen as an intermediate processing step between low-level and high-level processing. Tracking is an active field of research that has produced many methods; yet despite attempts to make object tracking robust to mis-tracking, tracking is far from being solved under large image changes, such as rapid unexpected motions, changes in ambient illumination, and severe occlusions. The HVS can solve the task of tracking under strongly ambiguous conditions: it can balance various features and track successfully. It is therefore important to orient tracking algorithms towards what is known about how the HVS tracks objects. The study of eye movements gives some information about the way the HVS tracks objects. Voluntary movements of the human eyes can be classified into three categories: saccade, smooth pursuit, and vergence [138, 104]. The movement of the eyes when jumping from one fixation point in space to another is called a saccade; it brings the image of a new visual target onto the fovea and can be intentional or reflexive. When the human eye keeps a target moving at moderate speed fixed on the fovea, the movement is called smooth pursuit.
The HVS uses multiple cues from the target, such as shape or motion, for robust tracking. Vergence movements add depth by adjusting the eyes so that the optical axes keep intersecting on the same target while depth varies. This ensures that both eyes are fixed on the same target; the fixation is helped by disparity cues, which play an important role. When viewing an image sequence, the HVS focuses mainly on moving objects and is able to coherently combine spatial and temporal information to track objects robustly in a non-linear manner. In addition, the HVS is able to recover quickly from mis-tracking and to continue tracking an object it has lost. The proposed tracking method is oriented towards these properties of the HVS: it focuses on moving objects, it integrates multiple cues from the target, such as shape and motion, for robust tracking, and it uses non-linear feature integration based on a two-step voting scheme. This scheme solves the correspondence problem using contour, region, and motion features. The proposed tracking aims at quick recovery in case an object is lost or partly occluded.

6.3.2 Overall approach

In a video shot, objects in one image are assumed to exist in successive images. In this case, temporally linking, or tracking, the objects of one image to the objects of a subsequent image is possible. In the proposed method, objects are tracked based on the similarity of their features in successive images I(n) and I(n − 1). This is done in four steps: object segmentation, motion estimation, object matching, and feature monitoring and correction (Fig. 6.1). Object segmentation and motion estimation extract objects and their features and represent them in an efficient form (discussed previously in Chapters 4 and 5 and Section 3.5). Then, using voting-based feature integration, each object Op of the previous image I(n − 1) is matched with an object Oi of the current image I(n), creating a unique correspondence Mi = Op → Oi. This means that all objects in I(n − 1) are matched with objects in I(n) (Fig. 6.2). Mi is a function that assigns to an object Op at time n − 1 an object Oi at time n. It provides a temporal linkage between objects, which defines the trajectory of each object throughout the video and allows a semantic interpretation of the input video (Chapter 7). Finally, object segmentation errors and object occlusion are detected and corrected, and the new data are used to update the segmentation and motion estimation steps. For example, the error correction step can produce new objects after detecting occlusions; motion estimation and tracking then need to be performed for these new objects. Each tracked object is assigned a number representing its identity throughout the sequence. This is important for event detection applications, as will be shown in Chapter 7.

Figure 6.1: Block diagram of the proposed tracking method: object segmentation and motion estimation, object matching by feature integration based on voting, and monitoring and correction of object occlusion and segmentation errors (including region merging), with feedback and update, producing object trajectories and temporal links.

Solving the correspondence problem under ambiguous conditions is the challenge of object tracking; the important goal is not to lose any object while tracking.
Ambiguities arise in the case of multiple matches, when one object corresponds to several objects, or in the case of a zero match M0 : Op → ∅, when an object Op cannot be matched to any object in I(n) (Fig. 6.2). This can happen, for example, when objects split, merge, or are occluded. Further ambiguity arises when the appearance of an object varies from one image to the next, as a result of erroneous segmentation (e.g., holes caused by identical gray levels in background and objects) or of changes in lighting conditions or viewpoint. Object correspondence is achieved by matching single object features and then combining the matches based on a voting scheme. Such feature-based solutions need to answer several questions concerning feature selection, monitoring, correction, integration, and filtering. Feature selection defines good features to match. Feature monitoring aims at detecting errors and adapting the tracking process to them. Feature correction aims at compensating for segmentation errors during tracking, especially during occlusion. Feature integration defines ways to combine features efficiently and effectively. Feature filtering is concerned with monitoring and eventually excluding noisy features during tracking over time (Fig. 6.2).

Figure 6.2: The object correspondence problem: objects of I(n − 1) are matched to objects of I(n) within a search area; multiple matches (Mi, Mj) and zero matches (M0) can occur.

In the following sections, strategies for feature selection, integration, and filtering are proposed, together with techniques to handle segmentation errors and ambiguities. Many feature-based object tracking approaches assume that the object topology is fixed throughout the image sequence. In this thesis, the object to be tracked can be of arbitrary shape and can gradually change its topology throughout the image sequence; no prior knowledge is assumed and there are no object models. Tracking is activated once an object enters the scene. An entering object is immediately detected by the change detection module, and the segmentation and motion estimation modules extract the relevant features for the correspondence module. While tracking objects, the segmentation module keeps looking for new objects entering the scene. Once an object is in the scene, it is assigned a new trajectory. Objects that have no corresponding objects are assumed to be new (entering or appearing) and are assigned a new trajectory; once an object leaves the scene, its trajectory ends. In the case of multiple object occlusion, the occlusion detection module first detects occluded and occluding objects, and then continues to track both even if they are completely invisible, i.e., their area is zero because no pixel can be assigned to them. This is because occluded objects may reappear. Despite attempts to make object tracking robust to mis-tracking (for example, against background distractions), tracking can fail under large image changes, such as rapid unexpected motions, big changes in ambient illumination, and severe occlusions. Many of these failures are unavoidable, and even the HVS cannot track objects under some conditions (for instance, when objects move very quickly). If mis-tracking cannot be avoided, the proposed tracking system is designed to at least recover the track of an object it has lost.

6.3.3 Feature selection

In this thesis, four selection criteria are defined. First, unique features are used, i.e., features that do not change significantly over time.
The choice is between two of the most important unique features of an object: motion and shape. Second, feature representations that compensate for estimation errors are selected. It is known [111, 65] that different features are sensitive to different conditions (e.g., noise, illumination changes). For example, object boundaries are known to be insensitive to a range of illumination changes, while some region-based features, e.g., object area, are insensitive to noise [111, 65]. Furthermore, segmentation errors such as holes affect features such as area, but do not significantly affect the perimeter or a ratio such as H/W. Features based on both contours and regions are therefore used in the proposed matching procedure. Third, features are selected within a finite area around the object to be matched; this criterion limits matching errors. Finally, feature representations that balance real-time and effectiveness considerations are selected. Based on these criteria, the following object features and representations are selected for the matching process (details in Section 3.5). The size and shape tests look at local spatial configurations, and their representations have error-compensating properties. The motion test looks at the temporal configuration and is one of the strongest unique features of objects. The distance test limits the feature selection to a finite area. In the case of multiple matches, the confidence measure helps to compensate for matching errors. The representations of the features are fast to compute and to match and, as will be shown in the following, are effective in tracking objects even under multi-object occlusion¹.
• Motion: direction δ = (δx, δy) and displacement w = (wx, wy).
• Size: area A, height H, and width W.
• Shape: extent ratio e = H/W, compactness c = A/(HW), and irregularity r = P²/(4πA), where P is the object perimeter (a small computation sketch is given at the end of this section).
• Distance: Euclidean distance di between the centroids of two objects Op and Oi.
• Confidence measure: degree of confidence ζi of an established correspondence Mi : Op → Oi (Eq. 6.2).

¹ For the reasons for not selecting color features, see Section 3.5.2.

Experimental results show that the use of this set of features gives stable results. Other features, such as texture, color, and spatial homogeneity, were tested but gave no additional stabilization of the algorithm on the tested video sequences.
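As an illustration of these descriptors (using the definitions given above, which are reconstructed here and should be checked against Section 3.5.2), the following sketch computes e, c and r from a binary object mask; the perimeter is a crude 4-neighbour boundary count and the example mask is invented.

```python
import numpy as np

def shape_features(mask: np.ndarray) -> dict:
    """Extent ratio e = H/W, compactness c = A/(H*W), irregularity r = P^2/(4*pi*A)."""
    rows, cols = np.nonzero(mask)
    H = int(rows.max() - rows.min() + 1)          # MBB height
    W = int(cols.max() - cols.min() + 1)          # MBB width
    A = int(mask.sum())                           # object area in pixels
    # Crude perimeter estimate: object pixels with at least one background 4-neighbour.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    P = A - int((mask & interior).sum())
    return {"e": H / W, "c": A / (H * W), "r": P ** 2 / (4 * np.pi * A)}

# Example: a filled 10 x 20 rectangle.
m = np.zeros((30, 40), dtype=np.uint8)
m[5:15, 10:30] = 1
print(shape_features(m))
```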
6.3.4 Feature integration by voting

In multi-feature-based matching of two entities, an important question is how to combine the features for stable tracking. Three requirements are of interest here. First, a linear combination would not take into account the non-linear properties of the HVS. Second, when features are combined, their collective contribution is considered and the distinguishing power of a single feature becomes less effective; it is therefore important to balance the effectiveness of single features against that of multiple features. This increases the spatial tracking accuracy. Third, the quality of features can vary throughout the sequence, so features should be monitored over time and eventually excluded from the matching process. This requirement aims at increasing the temporal stability of the matching. The proposed solution (Fig. 6.3) takes these observations into account by combining spatial and temporal features using a non-linear voting scheme with two steps: voting for object matching (object voting) and voting between two object correspondences (correspondence voting). The second step is applied in the case of multiple matches to find the better one. Each voting step is divided into m sub-matches using m object features. Since features can become noisy or occluded, m varies spatially (across objects) and temporally (throughout the image sequence) depending on spatial and temporal filtering (cf. Section 6.3.7). Each sub-match is performed separately using the appropriate test function. Each time a test is passed, a similarity variable s is increased by one or more votes; if a test fails, a non-similarity variable d is increased by one or more votes. Finally, a majority-type rule compares the two variables and decides the final vote. The simplicity of this two-step non-linear feature combination, which uses distinctive features oriented towards properties of the HVS (Section 6.3.1), provides a good basis for fast and efficient matching, as illustrated in Section 6.4. In the case of a zero match ∅ → Oi, i.e., no object in I(n − 1) can be matched to an object Oi in I(n), a new object is declared entering or appearing in the scene, depending on its location. In the case of a reverse zero match Op → ∅, i.e., no object in I(n) can be matched to an object Op in I(n − 1), Op is declared disappearing or exiting the scene, depending on its location. Both cases are treated in more detail in Chapter 7. The voting system requires the definition of some thresholds. These thresholds are important to allow for variations due to feature estimation errors, and they are adapted to the image and object sizes (see Section 6.3.7).

Figure 6.3: Two-step object matching by voting: for every object Op in I(n − 1) and every object Oi in I(n), the object vote first checks whether Oi lies in the search area of Op and computes the object similarity by voting; if Op already has a correspondence, the correspondence vote compares the feature deviations of the two correspondences and keeps or replaces the old correspondence.

Object voting In this step, three main feature tests are used: shape, size, and motion tests. The shape and size tests include three sub-tests each and the motion test two sub-tests. This design avoids cases where one failing feature causes the tracking to be lost (especially in the case of occlusion).

Definitions:
• Op an object of the previous image I(n − 1),
• Oi the i-th object of the current image I(n),
• Mi = Op → Oi a correspondence and M̄i = Op ↛ Oi a non-correspondence between Op and Oi,
• di the distance between the centroids of Op and Oi,
• tr the radius of a search area around Op,
• wi = (wxi, wyi) the estimated displacement of Op relative to Oi,
• wmax the maximal possible object displacement, which depends on the application (for example, 15 < wmax < 32),
• s the similarity count between Op and Oi,
• d the dissimilarity count between Op and Oi,
• s++ an increase of s, and d++ an increase of d, by one vote.

Then

\[
\begin{cases}
M_i & : (d_i < t_r) \wedge (w_{x_i} < w_{max}) \wedge (w_{y_i} < w_{max}) \wedge (\zeta > t_m) \\
\bar{M}_i & : \text{otherwise}
\end{cases}
\tag{6.1}
\]

with the vote confidence ζ = s/d and tm a real-valued threshold larger than, for example, 0.5. Mi is accepted if Oi lies within the search area of Op, its displacement is not larger than the maximal displacement, and the two objects are similar, i.e., s/d > tm.
The use of this rule instead of the majority rule (i.e., s > d) is to allow the acceptance of Mi even if s < d. This is important when objects are occluded, where some features can be significantly dissimilar, which might otherwise cause the rejection of a good correspondence. Note that this step is followed by a correspondence step, so no error is introduced by accepting correspondences between possibly dissimilar objects. For each correspondence Mi = Op → Oi, a confidence measure ζi, which measures the degree of certainty of Mi, is used, defined as follows:

\[
\zeta_i =
\begin{cases}
\dfrac{d - s}{v} & : s/d < t_m \\[4pt]
\dfrac{s - d}{v} & : s/d > t_m
\end{cases}
\tag{6.2}
\]

where v is the total number of feature votes. To compute s and d, the following feature votes are applied, where tz < 1 and ts < 1 are functions of the image and object sizes (see Eq. 6.18).

Size vote Let

\[
r_{a_i} = \begin{cases} A_p / A_i & : A_p \le A_i \\ A_i / A_p & : A_p > A_i \end{cases}, \qquad
r_{h_i} = \begin{cases} H_p / H_i & : H_p \le H_i \\ H_i / H_p & : H_p > H_i \end{cases}, \qquad
r_{w_i} = \begin{cases} W_p / W_i & : W_p \le W_i \\ W_i / W_p & : W_p > W_i \end{cases},
\]

where Ai, Hi, and Wi are the area, height, and width of an object Oi (see Section 3.5.2). Then

\[
\begin{aligned}
s{+}{+} & : (r_{a_i} > t_z) \vee (r_{h_i} > t_z) \vee (r_{w_i} > t_z) \\
d{+}{+} & : (r_{a_i} \le t_z) \vee (r_{h_i} \le t_z) \vee (r_{w_i} \le t_z).
\end{aligned}
\tag{6.3}
\]

Shape vote Let ep (ei), cp (ci), rp (ri) be, respectively, the extent ratio, compactness, and irregularity of the shape of Op (Oi) (see Section 3.5.2), and let dei = |ep − ei|, dci = |cp − ci|, and dri = |rp − ri|. Then

\[
\begin{aligned}
s{+}{+} & : (d_{e_i} \le t_s) \vee (d_{c_i} \le t_s) \vee (d_{r_i} \le t_s) \\
d{+}{+} & : (d_{e_i} > t_s) \vee (d_{c_i} > t_s) \vee (d_{r_i} > t_s).
\end{aligned}
\tag{6.4}
\]

Motion vote Let δp = (δxp, δyp) and δc = (δxc, δyc) be, respectively, the previous and current motion directions of Op. Then

\[
\begin{aligned}
s{+}{+} & : (\delta_{x_c} = \delta_{x_p}) \vee (\delta_{y_c} = \delta_{y_p}) \\
d{+}{+} & : (\delta_{x_c} \ne \delta_{x_p}) \vee (\delta_{y_c} \ne \delta_{y_p}).
\end{aligned}
\tag{6.5}
\]
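The object-voting step (Eqs. 6.1 and 6.3-6.5) can be sketched as below. This is a simplification: here every sub-test contributes one vote individually, the thresholds are placeholders, and the feature dictionaries merely stand in for the representations of Section 3.5; it is not the thesis implementation.

```python
def object_vote(fp, fi, dist, w, dir_prev, dir_curr,
                tr=50.0, w_max=16, tz=0.7, ts=0.2, tm=0.5):
    """Return (accepted, confidence) for a candidate match Op -> Oi (cf. Eq. 6.1)."""
    # Gates of Eq. 6.1: Oi must lie in the search area and the displacement must be bounded.
    if dist >= tr or abs(w[0]) >= w_max or abs(w[1]) >= w_max:
        return False, 0.0
    s = d = 0
    # Size sub-tests (Eq. 6.3): ratios of area, height and width mapped into (0, 1].
    for key in ("A", "H", "W"):
        r = min(fp[key], fi[key]) / max(fp[key], fi[key])
        if r > tz:
            s += 1
        else:
            d += 1
    # Shape sub-tests (Eq. 6.4): deviations of extent ratio, compactness and irregularity.
    for key in ("e", "c", "r"):
        if abs(fp[key] - fi[key]) <= ts:
            s += 1
        else:
            d += 1
    # Motion sub-tests (Eq. 6.5): the direction implied by the match should agree with
    # the previous direction of Op, per component.
    for axis in (0, 1):
        if dir_curr[axis] == dir_prev[axis]:
            s += 1
        else:
            d += 1
    confidence = s / d if d else float(s)   # vote confidence zeta = s/d
    return confidence > tm, confidence

# Example with illustrative feature dictionaries:
fp = {"A": 400, "H": 25, "W": 16, "e": 1.5, "c": 0.8, "r": 1.2}
fi = {"A": 380, "H": 24, "W": 16, "e": 1.55, "c": 0.78, "r": 1.25}
print(object_vote(fp, fi, dist=6.0, w=(3, 1), dir_prev=(1, 0), dir_curr=(1, 0)))
```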
Correspondence voting Recall that all objects of I(n − 1) are matched to all objects of I(n): first, each object Op ∈ I(n − 1) is matched to each object Oi ∈ I(n). This may result in multiple matches for one object, for example (Mpi : Op → Oi and Mpj : Op → Oj) or (Mpi : Op → Oi and Mqi : Oq → Oi) with Op, Oq ∈ I(n − 1) and Oi, Oj ∈ I(n). If the final correspondence vote results in si ≈ sj, i.e., two objects of I(n) are matched with the same object in I(n − 1), or sp ≈ sq, i.e., two objects of I(n − 1) are matched with the same object in I(n), plausibility rules are applied to resolve the case; this is explained in detail in Section 6.3.7. To decide which of two correspondences is the right one, the following vote is applied. Let si (sj) be a number that describes the suitability of Mi (Mj). Then

\[
\begin{cases}
M_i & : s_i > s_j \\
M_j & : s_i \le s_j.
\end{cases}
\tag{6.6}
\]

A voting majority rule is applied in this voting step. To compute si and sj, the following votes are applied, where tkz < 1, tks < 1, and tkd > 1 are functions of the image and object sizes (see Eq. 6.18). In the following, the index k denotes a vote for correspondence.

Distance vote Let di (dj) be the distance between Op and Oi (Oj), and let dkd = |di − dj|. Then

\[
\begin{aligned}
s_i{+}{+} & : (d_{k_d} > t_{k_d}) \wedge (d_i < d_j) \\
s_j{+}{+} & : (d_{k_d} > t_{k_d}) \wedge (d_i > d_j).
\end{aligned}
\tag{6.7}
\]

The aim of the condition dkd > tkd is to ensure that the vote is applied only if the two features really differ. If the features do not differ, then neither si nor sj is increased.

Confidence vote Let dζ = |ζi − ζj|. Then

\[
\begin{aligned}
s_i{+}{+} & : (d_\zeta > t_\zeta) \wedge (\zeta_i > \zeta_j) \\
s_j{+}{+} & : (d_\zeta > t_\zeta) \wedge (\zeta_i < \zeta_j).
\end{aligned}
\tag{6.8}
\]

The condition dζ > tζ ensures that the vote is applied only if the two features differ significantly.

Size vote Let dka = |rai − raj|, dkh = |rhi − rhj|, and dkw = |rwi − rwj|. Then

\[
\begin{aligned}
s_i{+}{+} & : (d_{k_a} > t_{k_z} \wedge r_{a_i} < r_{a_j}) \vee (d_{k_h} > t_{k_z} \wedge r_{h_i} < r_{h_j}) \vee (d_{k_w} > t_{k_z} \wedge r_{w_i} < r_{w_j}) \\
s_j{+}{+} & : (d_{k_a} > t_{k_z} \wedge r_{a_i} > r_{a_j}) \vee (d_{k_h} > t_{k_z} \wedge r_{h_i} > r_{h_j}) \vee (d_{k_w} > t_{k_z} \wedge r_{w_i} > r_{w_j}).
\end{aligned}
\tag{6.9}
\]

If the features do not differ, i.e., their difference is less than the threshold, then neither si nor sj is increased.

Shape vote Let dke = |dei − dej|, dkc = |dci − dcj|, and dkr = |dri − drj|. Then

\[
\begin{aligned}
s_i{+}{+} & : (d_{k_e} > t_{k_s} \wedge d_{e_i} < d_{e_j}) \vee (d_{k_c} > t_{k_s} \wedge d_{c_i} < d_{c_j}) \vee (d_{k_r} > t_{k_s} \wedge d_{r_i} < d_{r_j}) \\
s_j{+}{+} & : (d_{k_e} > t_{k_s} \wedge d_{e_i} > d_{e_j}) \vee (d_{k_c} > t_{k_s} \wedge d_{c_i} > d_{c_j}) \vee (d_{k_r} > t_{k_s} \wedge d_{r_i} > d_{r_j}).
\end{aligned}
\tag{6.10}
\]

The vote is applied only if the two features differ significantly; otherwise neither si nor sj is increased.

Motion vote
• Direction vote: let δc = (δxc, δyc), δp = (δxp, δyp), and δu = (δxu, δyu) be, respectively, the current (i.e., between I(n) and I(n − 1)), previous (i.e., between I(n − 1) and I(n − 2)), and past-previous (i.e., between I(n − 2) and I(n − 3)) motion directions of Op. Let δi = (δxi, δyi) (δj = (δxj, δyj)) be the motion direction of Op if it is matched to Oi (Oj). Then

\[
\begin{aligned}
s_i{+}{+} & : (\delta_{x_i} = \delta_{x_c} \wedge \delta_{x_i} = \delta_{x_p} \wedge \delta_{x_i} = \delta_{x_u}) \vee (\delta_{y_i} = \delta_{y_c} \wedge \delta_{y_i} = \delta_{y_p} \wedge \delta_{y_i} = \delta_{y_u}) \\
s_j{+}{+} & : (\delta_{x_j} = \delta_{x_c} \wedge \delta_{x_j} = \delta_{x_p} \wedge \delta_{x_j} = \delta_{x_u}) \vee (\delta_{y_j} = \delta_{y_c} \wedge \delta_{y_j} = \delta_{y_p} \wedge \delta_{y_j} = \delta_{y_u}).
\end{aligned}
\tag{6.11}
\]

• Displacement vote: let dmi (dmj) be the displacement of Op relative to Oi (Oj) and dkm = |dmi − dmj|. Then

\[
\begin{aligned}
s_i{+}{+} & : (d_{k_m} > t_{k_m}) \wedge (d_{m_i} < d_{m_j}) \\
s_j{+}{+} & : (d_{k_m} > t_{k_m}) \wedge (d_{m_i} > d_{m_j}).
\end{aligned}
\tag{6.12}
\]

Here dkm > tkm means that the displacements have to differ significantly to be considered for voting. tkm is adapted to detect segmentation errors; for example, in the case of occlusion it is increased. It is also a function of the image and object sizes (see Eq. 6.18). The motion magnitude test can contribute more than one vote to the matching process: if si = sj and the difference dkm is large, then si or sj is increased by 1, 2, or 3 as follows:

\[
\begin{aligned}
s_i{+}1 & : (d_{k_m} < t_{k_{m_{min}}}) \wedge (d_{k_m} > t_{k_m}) \wedge (d_{m_i} < d_{m_j}) \\
s_i{+}2 & : (t_{k_{m_{min}}} < d_{k_m} < t_{k_{m_{max}}}) \wedge (d_{k_m} > t_{k_m}) \wedge (d_{m_i} < d_{m_j}) \\
s_i{+}3 & : (d_{k_m} > t_{k_{m_{max}}}) \wedge (d_{k_m} > t_{k_m}) \wedge (d_{m_i} < d_{m_j}) \\
s_j{+}1 & : (d_{k_m} < t_{k_{m_{min}}}) \wedge (d_{k_m} > t_{k_m}) \wedge (d_{m_i} > d_{m_j}) \\
s_j{+}2 & : (t_{k_{m_{min}}} < d_{k_m} < t_{k_{m_{max}}}) \wedge (d_{k_m} > t_{k_m}) \wedge (d_{m_i} > d_{m_j}) \\
s_j{+}3 & : (d_{k_m} > t_{k_{m_{max}}}) \wedge (d_{k_m} > t_{k_m}) \wedge (d_{m_i} > d_{m_j}).
\end{aligned}
\tag{6.13}
\]

6.3.5 Feature monitoring and correction

Since achieving perfect object segmentation is a difficult task, a segmentation algorithm is likely to output erroneous results. Robust tracking based on object segmentation should therefore take possible errors into account and try to correct or compensate for their effects. Three types of error are of importance: object merging due to occlusion (Fig. 6.4), object splitting due to various artifacts (Fig. 6.6), and object deformation due to viewpoint changes or other changing conditions. Analysis of the displacements of the four MBB sides allows the detection and correction of various object segmentation errors. This thesis detects and corrects these types of errors based on plausibility rules and prediction strategies, as follows.
Correction of erroneous merging

Detection Let:
• Op1, Op2 ∈ I(n − 1),
• Mi : Op1 → Oi, where Oi results from the occlusion of Op1 and Op2 in I(n),
• dp12 the distance between the centroids of Op1 and Op2,
• w = (wx, wy) the current displacement of Op1, i.e., between I(n − 2) and I(n − 1),
and recall (Section 5.4.2) that
• drmax (drmin) is the vertical displacement of the lower (upper) row and
• dcmax (dcmin) is the horizontal displacement of the right (left) column of Op1.
Object occlusion is declared if

\[
\begin{aligned}
& ((|w_y - d_{r_{max}}| > t_1) \wedge (d_{r_{max}} > 0) \wedge (d_{p_{12}} < t_2)) \;\vee\; ((|w_y - d_{r_{min}}| > t_1) \wedge (d_{r_{min}} > 0) \wedge (d_{p_{12}} < t_2)) \;\vee \\
& ((|w_x - d_{c_{max}}| > t_1) \wedge (d_{c_{max}} > 0) \wedge (d_{p_{12}} < t_2)) \;\vee\; ((|w_x - d_{c_{min}}| > t_1) \wedge (d_{c_{min}} > 0) \wedge (d_{p_{12}} < t_2)),
\end{aligned}
\tag{6.14}
\]

where t1 and t2 are thresholds. If occlusion is detected, then both the occluding and the occluded objects are labeled with a special flag. This labeling enables the system to continue tracking both objects in subsequent images even if they are completely invisible; tracking invisible objects is important since they might reappear. The labeling further helps to detect occlusion even if the conditions of Eq. 6.14 are not met.

Figure 6.4: Object occlusion: large outward displacement of an MBB side (two objects Op1 and Op2 of I(n − 1) are occluded and merged into Oi in I(n)).

Correction by object prediction If occlusion is detected, the occluded object Oi is split into two objects. This is done by predicting both Op1 and Op2 onto I(n) using the following displacement estimates:

\[
\begin{aligned}
d_{p_1} &= (\mathrm{MED}(d_{1_{x_c}}, d_{1_{x_p}}, d_{1_{u_x}}),\; \mathrm{MED}(d_{1_{y_c}}, d_{1_{y_p}}, d_{1_{u_y}})) \\
d_{p_2} &= (\mathrm{MED}(d_{2_{x_c}}, d_{2_{x_p}}, d_{2_{u_x}}),\; \mathrm{MED}(d_{2_{y_c}}, d_{2_{y_p}}, d_{2_{u_y}}))
\end{aligned}
\tag{6.15}
\]

where MED is a 3-tap median filter, d1xc (d1yc), d1xp (d1yp), and d1ux (d1uy) are, respectively, the current, previous, and past-previous horizontal (vertical) displacements of Op1, and d2xc (d2yc), d2xp (d2yp), and d2ux (d2uy) are the current, previous, and past-previous horizontal (vertical) displacements of Op2. After splitting occluded and occluding objects, the list of objects of I(n) is updated, for example by adding Op2. A feedback loop then estimates the correspondences for any newly added objects (Fig. 6.1). Two examples of object occlusion detection and correction are shown in Fig. 6.5: the scene shows two objects moving and then occluding each other. The change detection module provides one segment for both objects, but the tracking module is able to correct the error and to track the two objects also during occlusion. Note that in the original images of these examples (Fig. 6.20) the objects appear very small and pixels are missing or misclassified due to difficulties of the change detection module; nevertheless, most pixels of the two objects are correctly classified and tracked.

Figure 6.5: Two examples of tracking two objects during occlusion (zoomed in).
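A minimal sketch of the median-based prediction of Eq. 6.15 follows; the example values are invented.

```python
def med3(a, b, c):
    """3-tap median filter."""
    return sorted((a, b, c))[1]

def predict_displacement(dx_hist, dy_hist):
    """dx_hist / dy_hist hold the (current, previous, past-previous) displacements of one
    object; the prediction is the component-wise 3-tap median, as in Eq. 6.15."""
    return med3(*dx_hist), med3(*dy_hist)

# Example: the object moved right by roughly 3 pixels per frame; a single noisy estimate
# (9 pixels) caused by the occlusion is filtered out by the median.
print(predict_displacement(dx_hist=(9, 3, 2), dy_hist=(0, 1, 0)))   # -> (3, 0)
```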
Correction of erroneous splitting

Detection Assume Op ∈ I(n − 1) is split in I(n) into two objects Oi1 and Oi2 ∈ I(n). Let
• Mi : Op → Oi1,
• di12 the distance between the centroids of Oi1 and Oi2, and
• w = (wx, wy) the current displacement of Op, i.e., between I(n − 2) and I(n − 1).
Then object splitting is declared if

\[
\begin{aligned}
& ((|w_y - d_{r_{max}}| > t_1) \wedge (d_{r_{max}} < 0) \wedge (d_{i_{12}} < t_2)) \;\vee\; ((|w_y - d_{r_{min}}| > t_1) \wedge (d_{r_{min}} < 0) \wedge (d_{i_{12}} < t_2)) \;\vee \\
& ((|w_x - d_{c_{max}}| > t_1) \wedge (d_{c_{max}} < 0) \wedge (d_{i_{12}} < t_2)) \;\vee\; ((|w_x - d_{c_{min}}| > t_1) \wedge (d_{c_{min}} < 0) \wedge (d_{i_{12}} < t_2)).
\end{aligned}
\tag{6.16}
\]

If splitting is detected, then the two object regions Oi1 and Oi2 are merged into one object Oi (Section 6.3.6). After merging the two object regions, the features of Oi and the match Mi : Op → Oi are updated (Fig. 6.1).

Figure 6.6: Object splitting: large inward displacement of an MBB side (an object of I(n − 1) is split into two regions Oi1 and Oi2 in I(n)).

Compensation of deformation and other errors

Detection Let Mi : Op → Oi. If

\[
(|w_y - d_{r_{max}}| > t_1) \;\vee\; (|w_y - d_{r_{min}}| > t_1) \;\vee\; (|w_x - d_{c_{max}}| > t_1) \;\vee\; (|w_x - d_{c_{min}}| > t_1)
\tag{6.17}
\]

and no occlusion or split is detected as described in Eqs. 6.14 and 6.16, then object deformation or another unknown segmentation error is assumed and the object displacement estimate is adapted as described in Chapter 5, Eqs. 5.5-5.6.

6.3.6 Region merging

Objects in a video exhibit specific homogeneity properties, such as motion or texture. Segmentation methods may nevertheless divide one homogeneous object into several regions. The reasons are twofold: optical, i.e., noise, shading, illumination changes, and reflections; and physical, when an object includes regions with different features (for example, human body parts have different motion). When a segmentation method assumes homogeneity of a single feature, or when it does not take optical errors into account, it will fail to extract objects correctly. Region merging is thus an unavoidable step in segmentation methods. It is a process where regions are compared to determine whether they can be merged into one or more objects [61, 118]. It is desirable because sub-regions complicate the process of object-oriented video analysis and interpretation, and merging may reduce the total number of regions, which improves performance if applied correctly. Regions can be merged based on i) spatial homogeneity features such as texture or color, ii) temporal features such as motion, or iii) geometrical relationships. Examples of such geometrical relationships are inclusion (one region is contained in another) and size ratio (one region is significantly larger than the other). If, for example, a region lies inside another region and is significantly smaller, it may be merged with it if the two regions show similar characteristics, such as motion. This thesis develops a different merging strategy that is based on geometrical relationships, temporal coherence, and object matching, rather than on single local features such as motion or size. Assume an object Op ∈ I(n − 1) is split in I(n) into two sub-regions Oi1 and Oi2 (Fig. 6.6), and assume the matching process matches Op with Oi1. Then Oi1 and Oi2 are merged into Oi if all of the following conditions are met:
• Equation 6.16 applies.
• Object voting gives Mi : Op → Oi with a lower vote confidence ζ, i.e., ζ > tmmerge with tmmerge < tm (Eq. 6.1).
• If a split is found at one side of the MBB (based on Eq. 6.16), then the displacements of the three other MBB sides of Op do not change significantly when the two objects are merged.
• Oi1 is spatially close to Oi2 and Oi2 to Op; for example, in the case of a downward split as shown in Fig. 6.7, all the distances d, dnc, dxc, and dxr have to be small.
• The geometrical features (size, height, and width) of the merged object Oi = Oi1 + Oi2 match the geometrical features of Op; for example, tmin < Ai/Ap < tmax, with thresholds tmin and tmax (a sketch of such checks is given after this list).
• The motion direction of Op does not change significantly if it is matched to Oi.
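Two of the geometrical conditions above can be sketched as follows; the thresholds and the MBB convention (rmin, cmin, rmax, cmax) are illustrative assumptions, not the thesis parameters.

```python
def close_enough(mbb1, mbb2, max_gap=5):
    """True if the two MBBs, given as (rmin, cmin, rmax, cmax), are within max_gap pixels."""
    r1min, c1min, r1max, c1max = mbb1
    r2min, c2min, r2max, c2max = mbb2
    row_gap = max(0, max(r1min, r2min) - min(r1max, r2max))
    col_gap = max(0, max(c1min, c2min) - min(c1max, c2max))
    return row_gap <= max_gap and col_gap <= max_gap

def size_consistent(area_p, area_i1, area_i2, t_min=0.8, t_max=1.25):
    """Check t_min < A_i / A_p < t_max for the merged area A_i = A_i1 + A_i2."""
    return t_min < (area_i1 + area_i2) / area_p < t_max

# Example: two vertically adjacent sub-regions whose combined area is close to that of Op.
print(close_enough((10, 10, 30, 25), (32, 11, 45, 26)), size_consistent(500, 330, 180))
```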
This simple merging strategy has proven to be powerful in various simulations. The good performance is due to the close interplay of the tracking and merging processes. Each process supports the other based on restricted rules that aim at limiting false merging. It is preferable to leave objects unmerged rather than to merge different objects, which then complicates tracking. The advantage of the proposed merging strategy compared to known merging techniques (cf. [61, 118]) is that it is based on temporal coherency through the tracking process and not on simple features such as motion or texture. Fig. 6.8 shows examples of the good performance of the merging process. The method is successful also when multiple small objects are close to the split object (Fig. 6.8(d)).

Figure 6.8: Performance of the proposed region merging: (a) an object ∈ I(60) is split in two; (b) video objects after matching and merging; (c) objects ∈ I(191); (d) video objects ∈ I(191) after matching and merging. The MBB includes the result of the merging.

6.3.7 Feature filtering

A good object matching technique must take into account noise and estimation errors. Due to various artifacts, the extraction of object features is not perfect (cf. Section 6.1). A new key idea in the proposed matching process is to filter features between two images and throughout the image sequence for robust tracking. This is done by ignoring features that become noisy or occluded across images. With such a model it is possible to discriminate between good and noisy feature estimates, and to ignore estimates taken while the object of interest is occluded. This means that features for tracking are well-conditioned, i.e., two features cannot differ by several orders of magnitude. The following plausibility filtering rules are applied:
• Error allowance:
◦ Feature deviations are possible and should be allowed. Error allowance should, however, be a function of the object size because small objects are more sensitive to image artifacts than larger ones. If two objects are small, then a small error allowance is selected. If, however, objects are large, then a larger error allowance can be tolerated.
◦ The HVS perceives differences between objects, for example, depending on the differentially changing object size. In small objects, for example, a difference of a few percent in the number of pixels is significant, whereas in large objects a small deviation may not be perceived as significant. Therefore, the thresholds of the used feature tests (Eqs. 6.3–6.8) should be a function of the input object size. This adaptation to the object size allows a better distinction at smaller sizes and a stronger matching at larger sizes. The adaptation of the thresholds to the object size is done in a non-linear way, for example (see the sketch after this list):

       { tsmin  : A ≤ Amin
  ts = { f (A)  : Amin < A ≤ Amax        (6.18)
       { tsmax  : A > Amax ,

where the form of the function f (A) depends on the application. Various forms of f (A) are possible (see Fig. 4.7 in Section 4.4.3). In the simulations a linear f (A) was used. The values of Amin and Amax are determined experimentally but they can be changed for specific applications. For example, in applications where objects appear small, as in the sequence ‘Urbanicade’ (Fig. 6.20), these values should be set low. However, in all simulations in this thesis the same parameters were chosen for all test sequences. This shows the stability and good performance of the proposed framework.
• Error monitoring:
◦ To monitor the quality of a feature over time, the dissimilarity of the feature between two successive images is examined; when it grows large, this feature is not included in the matching process.
◦ If two correspondences (Mi and Mj ) of the same object Op have a similar feature deviation, then this feature is excluded from the voting process. For example, for the shape irregularity, dr = |rip − rjp | < tr with rip = ri /rp and rjp = rj /rp (the shape irregularity r is defined in Section 3.5.2).
• Matching consistency:
◦ Objects are tracked once they enter the scene and also during occlusion. This is important for activity analysis.
◦ Object correspondence is performed only if the estimated motion directions are consistent.
◦ If, after applying the correspondence voting scheme, two objects of I(n − 1) are matched with the same object in I(n), the match with the oldest object (i.e., the one with the longer trajectory) is selected.
◦ If, due to splitting, two objects of I(n) are matched with the same object in I(n − 1), the match with the largest size is selected.
◦ If, during the matching process or after object separation due to occlusion, a better correspondence than a previous one is found, the matching is revised, i.e., the previous correspondence is removed and the new one is established (Fig. 6.2).
◦ Due to the fault-tolerant and correction strategy integrated into the tracking, objects that split into disjoint regions or change their topology over time can still be tracked and matched with the most similar object.
◦ An object consisting of a set of disjoint regions in I(n − 1) that becomes connected in I(n) is tracked and matched with the object region most similar to the newly formed object in I(n).
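The size-adaptive threshold of Eq. 6.18 can be written as a small helper function. The sketch below uses the linear f(A) mentioned in the text; the numerical values of Amin, Amax, tsmin, and tsmax are placeholders to be tuned per application.

```python
# Eq. 6.18: threshold as a piecewise function of the object area A.
# a_min, a_max, ts_min, and ts_max are application-dependent placeholders;
# a linear f(A) is used between them, as in the simulations described above.
def adaptive_threshold(area, a_min=100, a_max=2000, ts_min=0.1, ts_max=0.4):
    if area <= a_min:
        return ts_min
    if area > a_max:
        return ts_max
    # linear interpolation between (a_min, ts_min) and (a_max, ts_max)
    frac = (area - a_min) / (a_max - a_min)
    return ts_min + frac * (ts_max - ts_min)
```

The feature tests of Eqs. 6.3–6.8 would then compare feature deviations against adaptive_threshold(A) rather than against a fixed value, giving small objects a tighter tolerance than large ones.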
6.4 Results and discussions

Computational cost The proposed tracking takes between 0.01 and 0.21 seconds to link objects between two successive images. As can be seen in Table 6.1, the main computational cost goes to detecting and correcting segmentation errors such as occlusion. Object projection and separation requires some computation in the case of large objects or when multiple objects are occluded.

Table 6.1: Tracking computation cost in seconds on a ‘SUN-SPARC-5 360 MHz’.
  Object matching     0.0001
  Object correction   0.01–0.21
  Motion estimation   0.0001

Experimental evaluation Few of the object tracking methods presented in the literature have considered real environments with multiple rigid and/or articulated objects, and only limited solutions to the occlusion problem exist. These methods track objects after, and not during, occlusion. In addition, many methods are designed for specific applications (e.g., tracking based on body-part models or vehicle models) or impose constraints regarding camera or object motion (e.g., upright motion). The proposed method is able to solve the occlusion problem in the presence of multiple crossing paths. It assigns pixels to each object in the occlusion process and tracks objects successfully during and after occlusion. There are no constraints regarding the motion of the objects or the camera position. Sample sequences used for evaluation are taken with different camera positions. In the following, simulation results using the proposed tracking method applied on widely used video shots (10 shots containing a total of 6371 images) are presented and discussed. Indoor, outdoor, and noisy real environments are considered.
The presented results illustrate the good performance and robustness of the proposed approach even in noisy images. This robustness is due to the non-linear behavior of the algorithm and to the use of plausibility rules for tracking consistency and for the detection of object occlusion and other segmentation errors. The good performance of the proposed tracking is shown by three methods:
• Tracking in non-successive images: The robustness of the proposed tracking can be clearly demonstrated when tracking objects in non-successive images. As can be seen in Fig. 6.9, the objects are robustly tracked even when only one in every five images is used.
• Visualization of the trajectory of objects: To illustrate the temporal tracking consistency of the proposed algorithm, the estimated trajectory of each object is plotted as a function of the image number (a plotting sketch is given at the end of this section). Such a plot illustrates the reliability of both the motion estimation and tracking methods and allows the analysis and interpretation of the behavior of an object throughout the video shot. For example, the trajectories in Fig. 6.10 show that various objects enter the scene at different times. Two objects (O4 and O2 ) are moving quickly (note that the trajectory curve increases rapidly). In Fig. 6.12, the video analysis extracts three objects. Two objects enter the scene in the first image while the third object enters around the 70th image. O1 moves horizontally to the left and vertically down, O2 moves horizontally right and vertically up, and O5 moves quickly to the left. While objects undergoing straightforward motion (for example, not stopping or depositing something) are easy to follow and interpret, the motion and behavior of persons that perform actions are not easy to follow. For example, in Fig. 6.14, a person enters the scene and removes an object, and in Fig. 6.13, a person enters, moves, deposits an object, and meanwhile changes direction. As can be seen, the trajectories of these two persons are complex.
• Selection of samples of tracking throughout the video: Figures 6.17–6.20 show sample tracking results throughout various test sequences. These results show the reliability of the proposed method in the case of occlusion (Figs. 6.19 and 6.20), object scale variations (Fig. 6.20), and local illumination changes and noise (Figs. 6.18 and 6.17).
The three evaluation methods show the reliability of both the motion estimation and the tracking. Their output allows the detection of events by analyzing the behavior of objects throughout a video shot. This can be done in an intuitive and straightforward manner, as given in Section 7.3.
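The trajectory plots used in this evaluation (Figs. 6.10–6.16) can be generated directly from the stored centroid lists. The following matplotlib sketch assumes a simple {object id: [(image number, cx, cy), ...]} structure, which is an illustrative layout rather than the data format of the implementation.

```python
# Sketch: plot object trajectories as a function of the image number,
# one curve per tracked object (cf. the evaluation plots in this section).
import matplotlib.pyplot as plt

def plot_trajectories(tracks):
    """tracks: dict mapping object id -> list of (image_number, cx, cy)."""
    fig, (ax_x, ax_y) = plt.subplots(2, 1, sharex=True)
    for obj_id, samples in sorted(tracks.items()):
        imgs = [n for n, _, _ in samples]
        xs = [cx for _, cx, _ in samples]
        ys = [cy for _, _, cy in samples]
        ax_x.plot(imgs, xs, label=f"ObjID {obj_id} (StartP {imgs[0]})")
        ax_y.plot(imgs, ys)
    ax_x.set_ylabel("x")
    ax_y.set_ylabel("y")
    ax_y.set_xlabel("Img No.")
    ax_x.legend(loc="best", fontsize="small")
    plt.show()
```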
6.5 Summary and outlook

The issue in tracking systems is reliability in the case of shadows, occlusion, and object split. Few of the tracking methods presented in the literature have considered such real environments. Many methods impose constraints regarding camera or object motion (e.g., upright motion). This chapter develops a robust object tracking method. It is based on a non-linear voting system that solves the problem of multiple correspondences. The occlusion problem is alleviated by a simple detection procedure based on the displacements of the object, followed by a median-based prediction procedure, which provides a reasonable estimate for (partially or completely) occluded objects. Objects are tracked once they enter the scene and also during occlusion. This is important for activity analysis. Plausibility rules for consistency, error allowance, and monitoring are proposed for accurate tracking over long periods of time. An important contribution of the proposed tracking is the reliable region merging, which improves the performance of the whole video algorithm. A possible extension of this method is seen in tracking objects that move in and out of the scene.

The proposed algorithm has been developed for content-based video applications such as surveillance or indexing and retrieval. Its properties can be summarized as follows:
• Both rigid (e.g., vehicles) and articulated (e.g., human) objects can be tracked. Since, in real-world scenes, articulated objects may contain a large number of rigid parts, estimating their motion parameters may result in huge computation (e.g., solving a very large set of non-linear equations).
• The algorithm is able to handle several objects simultaneously and to adapt to their occlusion or crossing.
• No template or model matching is used, but simple rules that are largely independent of object appearance (e.g., matching based purely on the object shape and not on its image content). Further, the technique does not require any trajectory model.
• A confidence measure is maintained over time until the system is confident about the correct matching (especially in the case of occlusion).
• A simple motion estimation guides the tracking without requiring predictive temporal filtering (e.g., a Kalman filter).
• The tracking procedure is independent of how objects are segmented. Any advance in object segmentation will enhance the final results of the tracking but will not influence the way the tracking works.
• There is no constraint regarding the motion of objects or the camera position. Sample sequences used for evaluation are taken with different camera positions. Objects can move close to or far from the camera.

Figure 6.9: Tracking results of the sequence ‘Highway’. To show the robustness of the tracking algorithm, only one in every five images has been used. This shows that the proposed method can track objects that move fast.

Figure 6.10: The trajectories of the objects in the sequence ‘Highway’, where ‘StartP’ represents the starting point of a trajectory. The upper figure gives the trajectory of the objects in the image plane while the two other figures give the trajectories for the vertical and horizontal direction separately. This allows an interpretation of the object motion behavior throughout the sequence. For example, O2 starts at the left of the image, moves, and stops at the edge of the highway. The figures show how this object remains present throughout the whole shot while the other objects start and disappear within the shot. Various vehicles enter the scene at different times. Some objects move quickly while the others are slower. Objects are moving in both directions: away from the camera and towards the camera. The system tracks all objects reliably. See also Fig. 6.17.
Figure 6.11: The trajectories of the objects in the sequence ‘Urbicande’. Many persons enter and leave the scene. One person is walking around. Persons appear very small and occlude each other. The system is reliable even in the presence of multiple occluding objects. O1 starts inside the image at (160,250) and moves around within the rectangle (300,160),(150,80). O5 , for example, starts at image 25 and moves left across the shot, reaching the other end of the image. The original sequence is rotated by 90° to the right to comply with the CIF format. See also Fig. 6.20.

Figure 6.12: The trajectories of the objects in the sequence ‘Survey’. Three persons enter the scene at different instants. The sequence includes reflection and other local image changes. The system is able to track the three objects before, during, and after occlusion. See also Fig. 6.19.

Figure 6.13: The trajectories of the objects in the sequence ‘Hall’. Two persons enter the scene. One of them deposits an object. This sequence contains noise and illumination changes. As can be seen, this shot includes complex object movements. For example, the person on the left side of the image enters, turns left, deposits an object, comes back a little, moves straight for a short while, and then turns left and disappears. See also Fig. 6.18.

Figure 6.14: The trajectories of the objects in the sequence ‘Floort’. An object is removed. The difficulty of this sequence is that it contains coding and interlacing artifacts. Furthermore, the illumination changes along the trajectory of the objects. The change detection splits some objects, but due to the robust merging based on tracking results the algorithm remains stable throughout the sequence even in the case of errors.
Figure 6.15: The trajectories of the objects in the sequence ‘Floorp’. An object is deposited. The shadows are a main concern. They complicate the detection of the correct trajectory, but despite some small deviations the object trajectory is reliable.

Figure 6.16: The trajectories of the objects in the sequence ‘Floor’. An object is first deposited and then removed. This is a long sequence with complex movements, interlacing artifacts, illumination changes, and shadows. The trajectory is complex but the system is able to track the object correctly.

Figure 6.17: Tracking results of the ‘Highway’ sequence. Each object is marked by an ID-number and enclosed in its minimum bounding box. This sequence illustrates successful tracking in the presence of noise, scale, and illumination changes.

Figure 6.18: Tracking results of the ‘Hall’ sequence. Each object is marked by an ID-number and enclosed in its minimum bounding box. The algorithm works correctly despite the various local illumination changes and object shadows.

Figure 6.19: Tracking results of the ‘Survey’ sequence. Each object is marked by an ID-number and enclosed in its minimum bounding box. Despite the multi-object occlusion (O1 , O2 , and O5 ), light changes, and reflections (e.g., car surfaces), the algorithm stays stable. Because of the static traffic sign, the change detection divides the object into two regions; the tracking recovers properly.

Figure 6.20: Tracking results of the ‘Urbanicade’ sequence. Each object is marked by an ID-number and enclosed in its minimum bounding box. In this scene, objects are very small and experience illumination changes and object occlusions (O1 & O6 , O1 & O8 ). However, the algorithm continues to track the objects correctly.

Chapter 7
Video Interpretation

7.1 Introduction

Computer-based interpretation of recorded scenes is an important step towards automated understanding and manipulation of scene content. Effective interpretation can be achieved through integration of object and motion information. The goal in this chapter is to develop a high-level video representation system, useful for a wide range of video applications, that effectively and efficiently extracts semantic information using low-level object and motion features. The proposed system achieves its objective by extracting and using context-independent video features: qualitative object descriptors and events. Qualitative object descriptors are extracted by quantizing the low-level parametric descriptions of the objects. To extract events, the system monitors the change of motion and other low-level features of each object in the scene. When certain conditions are met, events related to these conditions are detected. Both indoor and outdoor real environments are considered.

7.1.1 Video representation strategies

The significant increase of video data in various domains requires effective ways to extract and represent video content.
For most applications, manual extraction is not appropriate because it is costly and can vary between different users depending on their perception of the video. The formulation of rich automated content-based representations is, therefore, an important step in many video services. For a video representation to be useful for a wide range of applications, it must describe video content precisely and accurately, independently of context.

In general, a video shot conveys objects and their low-level and high-level features within a given environment and context¹. Video representation using solely low-level objects does not fully account for the meaning of a video. To fully represent a video, objects need to be assigned high-level features as well. High-level object features are generally related to the movement of objects and are divided into context-independent and context-dependent features. Features that have context-independent components include object movement, activity, and related events. Here, movement is the trajectory of the object within the video shot and activity is a sequence of movements that are semantically related (e.g., pitching a ball) [22]. Context-dependent high-level features include object action, which is the semantic feature of a movement related to a context (e.g., following a player) [22].

An event expresses a particular behavior of a finite set of objects in a sequence of a small number of consecutive images of a video shot. An event consists of context-dependent and context-independent components associated with a time and location. For example, a deposit event has a fixed semantic interpretation (an object is added to the scene) common to all applications, but the deposit of an object can have variable meaning in different contexts. In the simplest case, an event is the appearance of a new object in the scene or the exit of an object from the scene. In more complex cases, an event starts when the behavior of objects changes. An important issue in event detection is the interaction between interpretation and application. Data are subject to a number of different interpretations and the most appropriate one depends upon the requirements of the application. An event-oriented video content representation is complete only if it is developed in a specific context and application.

Figure 7.1: Video representation levels (video analysis extracts global features and moving objects; video interpretation extracts context-independent meaning, e.g., events; video understanding extracts context-dependent behavior).

¹ A video sequence can also contain audio information. In some applications it is useful to use both audio and visual data to support interpretation. On the other hand, audio data is not always available. For example, in some countries the recording of sound by CCTV surveillance systems is outlawed. Therefore, it is important to be able to handle video retrieval and surveillance based on visual data.

Figure 7.2: Interpretation-based video representation: (a) structural representation (a video shot consists of objects with spatio-temporal relations such as close to, start after, depart away, near miss); (b) conceptual representation (a video shot consists of objects and of events related to objects, e.g., deposit).

To extract object features, three levels of video processing are required² (Fig.
7.1): • The video analysis level aims at the extraction of objects and their spatiotemporal low-level and quantitative features. • The video interpretation level targets the extraction of qualitative and semantic features independent of context. A significant semantic features are events that are extracted based on spatio-temporal low-level features. • The video understanding level addresses the recognition of behavior and actions of objects within the context of object motion in the video. The interpretation-based representations can be divided into structural and conceptual representations (Fig. 7.2). Structural representations use spatial, temporal, and relational features of the objects while conceptual representations use objectrelated events. This chapter addresses video interpretation (both structural and conceptual) for on-line video applications such as video retrieval and surveillance. 7.1.2 Problem statement There is a debate in the field of video content representation whether low-level video representations are sufficient for advanced video applications such as video retrieval or surveillance. Some researchers question the need for high-level representations. For some applications, low-level video representation is an adequate tool and the cost of high-level feature computations can be saved. Studies have shown that low-level features are not sufficient for effective video representation [81]. The main restriction of low-level representations is that they rely on the users to perform the high-level 2 Video representations based on global content, such as global motion, are needed in some applications. They can be combined with other representations to support different tasks, for example, in video retrieval [24]. 148 Video interpretation abstractions [81, 124]. The systems in [116, 90] contribute a solution using relevance feedback mechanisms where they first interact with the user by low-level features and then learn the user’s feedback to enhance the system performance. Relevance feedback mechanisms work well for some applications that require small amounts of high-level data, such as in retrieval of texture images. Most users do not, however, extract video content based on low-level features solely and relevance feedback mechanisms are not sufficient for effective automated representation in advanced applications. The main difficulty in extracting high-level video content is the so-called semantic gap. It is the difference between the automatically extracted low-level features and the features extracted by humans in a given situation [124]. Humans look for features that convey a certain message or have some semantic meaning, but automatically extracted features describe the objects quantitatively. To close this gap, methods need to be developed for association high-level semantic interpretation with extracted lowlevel data without relying completely on low-level descriptions to take decisions. For many applications, such as surveillance, there is no need to provide a fully semantic abstraction. It is sufficient to provide semantic features that are important to the users and similar to how the humans find content. Extracting semantic features for a wide range of video applications is important for high-level video processing, especially in costly applications where human supervision is needed. High-level content allows users to retrieve a video based on its qualitative description or to take decisions based on the qualitative interpretations of the video. 
In many video applications, there is a need to extract semantic features from video to enable a video-based system to understand the content of the video. The issue is what level of video semantic features are appropriate for general video applications? For example, are high-level intentional descriptions such as what a person is doing or thinking needed? An important observation is that video context can change over time. It is thus important to provide content representation that has fixed semantic features which are generally applicable for a wide range of applications. The question here is how to define fixed semantic video contents and to extract features suitable to represent it. The purpose of video is, in general, to document events and activities made by objects or a group of objects. People usually look for video objects that convey a certain message [124] and they usually focus and memorize [72, 56]: i) events, i.e., ‘what happened’, ii) objects, i.e., ‘who is in the scene’, iii) location, i.e., ‘where did it happen’, and iv) time, i.e., ‘when did it happen’. Therefore, a generally useful video interpretation should be able to: • take decisions on lower-level data to support subsequent processing levels, • qualitatively represent objects and their spatial, temporal, and relational features, Introduction 149 • extract object semantic features that are generally useful, and • automatically and efficiently provide a response (e.g., real-time operation). 7.1.3 Related work As defined, events include semantic primitives. Therefore, event recognition is widely studied in the artificial intelligence literature, where the focus to is develop formal theories and languages of semantics and inference of actions and events ([23, 67]; for more references see [22, 81]). Dynamic scene interpretation has traditionally been quantitative and typically generates large amounts of temporal qualitative data. Recently, there has been increased interest in higher-level approaches to represent and to reason with such data using structural and conceptual approaches. For example, the studies in [57, 34] focus on structural video representations based on qualitative reasoning methods. Research in the area of detecting, tracking, and identifying people and objects has become a central topic in computer vision and video processing [22, 81, 105]. Research interest shifted towards detection and recognition of activities, actions and events. Narrow-domain systems recognize events and actions, for example, in hand sign applications or in Smart-Cameras based cooking (see the special section in [130], [81, 139, 22]). In these systems, prior knowledge is, usually, inserted in the event recognition inference system and the focus is on recognition and logical formulation of events and actions. Some event-based surveillance application systems also have been proposed. In the context-dependent system in [86] the behavior of moving objects in an airborne video is recognized. The system compensates for global motion, tracks moving objects and defines their trajectory. It uses geo-spatial context information to analyze the trajectories and detect likely scenarios such as passing or avoiding the checkpoint. In the context-dependent system in [14], events, such as removal or siting or use terminal, are detected in a static room and precise knowledge of the location of certain objects in the room is needed. Other examples and references to context-dependent video interpretation can be found in [29, 28]. 
The context-dependent system in [70] tracks several people simultaneously and uses appearance-based models to identify people. It determines whether a person is carrying an object and can segment the object from the person. It also tracks body parts such as head or hands. The system imposes, however, restrictions on the object movements. Objects are assumed to move upright and with little occlusion. Moreover, it can only detect a limited set of events. There has been little work on context-independent interpretation. The system in [38] is based on motion detection and tracking using prediction and nearest-neighbor matching. The system is able to detect basic events such as deposit. It can operate in 150 Video interpretation simple environments where one human is tracked and translational motion is assumed. It is limited to applications of indoor environments, cannot deal with occlusion, and is noise sensitive. Moreover, the definition of events is not widely applicable. For example, the event stop is defined by when an object remains in the same position for two consecutive images. The interpretation system for indoor surveillance applications in [133] consists of object extraction and event detection modules. The event detection module classifies objects using a neural network. The classification includes: abandoned object, person, and object. The system is limited to one abandoned object event in unattended environments. The definition of abandoned object, i.e., remaining in the same position for long time, is specific to a given application. Besides, the system cannot associate abandoned objects and the person who deposited them. The system is limited to surveillance applications of indoor environments. 7.1.4 Proposed framework For an automated video interpretation to be generally useful, it must include features of the video that are context-independent and have a fixed semantic meaning. This thesis proposes an interpretation system that focuses on objects and their related events independent of the context of a specific application. The input of the video interpretation system is a low-level description of the video based on objects and the output is a higher-level description of the video based on qualitative object and event descriptions. With the information provided from the video analysis presented in Chapter 3, events are extracted in a straightforward manner. Event detection is performed by integrating object and motion features, i.e., combining trajectory information with spatial features, such as size and location (Fig. 7.3). Objects and their features are represented in temporally linked lists. Each list contains information about the objects. Information is analyzed as it arrives and events are detected as they occur. An important feature of the proposed system is that it uses a layered approach. It goes from low-level to middle-level to high-level image content analysis to detect events. It integrates results of a lower level to support a higher level, and vice-versa. For example, low-level object segments are used in object tracking and tracking is used to analyze these segments and eventually correct them. In many applications, the location of an object or event is significant for decision making. To provide location information relevant for a specific application, a partition of the scene into areas of interest is required. In the absence of a specific application, this thesis uses two types of location specification (see Fig. 7.4). 
The first type specifies where the border of the scene is. The second type separates the image into nine sectors: center, right, left, up, down, left up, right up, left down, and right down.

Figure 7.3: Video interpretation: from video objects to events (object-oriented video analysis delivers objects and their features; spatio-temporal feature description, motion analysis and interpretation, and event detection and classification turn these into object and event information for an object- and event-based application, e.g., event-based decision-making).

As a result, the proposed video interpretation outputs at each time instant n of a video shot V a list of objects with their features as follows:
• Identity - a tag to uniquely identify an object throughout the shot,
• Low-level feature vector:
◦ Location - where the object appears in the scene (initial and current)
◦ Shape - (initial, average, and current)
◦ Size - (initial, average, and current)
◦ Texture - texture of an object
◦ Motion - where an object is moving (initial, average, and current)
• Trajectory - the set of the estimated centroids of the object throughout the shot.
• Object life span or age - the time interval over which an object is tracked.
• Event descriptions - the location and behavior of the object.
• Spatio-temporal relationship - relation to other objects in space-time.
• Global information - global motion or dominant object.
A video shot, V , is thus represented by {(O, Po ), (E, Pe ), (G)} where
• O is a set of video objects throughout V ,
• Po is a set of features for each Oi ∈ O,
• E is a set of events throughout V ,
• Pe is a set of features of each event, and
• G is a set of global features of the shot.

Figure 7.4: Specification of object locations and directions (the nine image sectors, the image border, and the MBB of an object with its minimum and maximum rows and columns and its centroid).

Objects and events are monitored as the objects enter the scene. Objects are linked to events by describing their relationship to the event. The representation (O, Po ) is defined in Section 7.2 and (E, Pe ) in Section 7.3. A video application can then use this information to search or process raw video (as an example, see the query form in Appendix A, Fig. A.3). In the following, let
• V = {I(1), · · · , I(N )} be the input video shot of N images,
• I(n), I(k), I(l) ∈ V be images at time instant n, k, or l,
• Oi , Oj be objects in I(n),
• Op , Oq be objects in I(n − 1),
• Bi be the MBB of Oi ,
• gi be the age of Oi ,
• ci = (cxi , cyi ) be the centroid of Oi ,
• dij be the distance between the centroids of Oi and Oj ,
• rmin be the upper row of the MBB of an object,
• rmax be the lower row of the MBB of an object,
• cmin be the left column of the MBB of an object, and
• cmax be the right column of the MBB of an object.

7.2 Object-based representation

In this section, qualitative descriptions of features of moving objects are developed. Spatial, temporal, and relational features are proposed.

7.2.1 Spatial features

• Location - the position of object Oi in image I(n).
◦ Qualitative: to permit users to specify qualitative object locations, the image is divided into nine sectors (Fig. 7.4); a sketch of this sector classification is given at the end of this subsection. Oi is declared to be in the center (right, left, top, down, up left, up right, down left, down right, respectively) of the image if its centroid is located in the corresponding sector.
◦ Quantitative: the position of an object is represented by the coordinates of its centroid ci .
• Size, shape, and texture:
◦ Qualitative:
– size descriptors: {small, medium, large}, {tall, short}, and {wide, narrow}.
– shape descriptors: {solid, hollow} and {compact, jagged, elongated}. Classification of shape needs to be made more precise based on an application. In some applications, finer categories are needed to differentiate between objects, e.g., between a person and a vehicle or between a person and another person.
– texture descriptors: {smooth, grainy, mottled, striped}. The categorization of texture is also application-dependent.
◦ Quantitative: see Section 3.5.
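A minimal sketch of the qualitative location and size descriptors above follows. The one-third sector grid, the size-class limits, and the dictionary-style object fields are illustrative assumptions; the thesis itself only prescribes the nine sectors of Fig. 7.4 and the descriptor vocabularies.

```python
# Sketch: map an object centroid to one of the nine image sectors of Fig. 7.4
# and quantize the object area into qualitative size classes. The one-third
# sector grid and the size limits are assumptions for illustration.
def location_descriptor(centroid, img_width, img_height):
    cx, cy = centroid
    col = "left" if cx < img_width / 3 else "right" if cx > 2 * img_width / 3 else ""
    row = "up" if cy < img_height / 3 else "down" if cy > 2 * img_height / 3 else ""
    return (f"{row} {col}".strip()) or "center"

def size_descriptor(area, small=500, large=5000):
    if area < small:
        return "small"
    if area > large:
        return "large"
    return "medium"

# Example: an object with centroid (30, 40) in a CIF image (352 x 288)
# is described as "up left".
```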
7.2.2 Temporal features

Motion is a key low-level feature of video. Suitable motion representation and description play an important role in high-level video interpretation. The interpretation of quantitative object and global motion parameters is needed in video retrieval applications to allow a user to search for objects or shots based on perceived object motion or camera motion.

Global motion Since video shot databases can be very large, pruning is essential for efficient video retrieval. This thesis suggests two methods for pruning: pruning based on qualitative global motion and pruning based on dominant objects (cf. Section 7.4.2). In video retrieval, the user may be asked to specify the qualitative global motion or the dominant object of the shot. This is useful, since a user can better describe a shot based on its qualitative features rather than by specifying a parametric description. Global motion estimation techniques represent global motion by a set of parameters based on a given motion model. Global motion estimation using an affine motion model is used to estimate the dominant global motion. The instantaneous velocity w of a pixel at position p in the image plane is given by a 6-parameter a = (a1 , a2 , a3 , a4 , a5 , a6 ) motion model:

  w(p) = ( a1 )  +  ( a3  a4 ) p        (7.1)
         ( a2 )     ( a5  a6 )

In [25, 24], linear combinations of the parameters in a are analyzed to extract qualitative representations. For example, while a1 and a2 describe the translational motion, the linear combination ½(a2 + a6 ) determines zoom. Rotation is expressed as the combination ½(a5 − a3 ). If the dominant global motion is a pure pan, the only non-zero parameter is supposed to be a1 . In the case of zoom, the linear combination ½(a2 + a6 ) is assumed to be the only non-zero parameter.

Object motion The requirements on motion representation accuracy are not decisive in some applications, but low-complexity processing is essential; retrieval and surveillance applications are examples. The main requirement here is to capture basic motion characteristics, not to achieve the highest possible accuracy of the motion description. For such applications it may be sufficient to classify motion qualitatively (e.g., large translation) or with an interval (e.g., the motion lies within [4, 7]).

Motion quantization and classification A classification of object motion includes translation, rotation, scale change, or a mixture of these motions. The motion estimation technique proposed in Chapter 5 classifies object motion into translation and non-translation.
Scale change can be approximated easily by temporal analysis of the object size. Motion is represented by direction and speed {δ, w}. The speed w is quantized into four descriptions: {static, slow, moderate, and fast}. Directions δ of the object motion are normalized and quantized into eight directions: down left, down right, up left, up right, left, right, down, and top. Trajectory For retrieval purposes, the trajectory (or path) of an object is needed to easily query video shots. Some examples are: objects crossing near, objects moving Object representation 155 left to right, and objects moving far side right to left. Once an object enters the scene, the tracking method assigns to it a new trajectory. When it leaves, the trajectory ends. Object trajectories are constructed from the coordinates of the centroid of the objects. These trajectories are saved and can be used to identify events or interesting objects, to support object retrieval or statistical analysis (e.g., frequent use of a specific trajectory). 7.2.3 Object-relation features Spatial relations The following spatial object relationships are proposed: • Direction: ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ Oi Oi Oi Oi Oi Oi Oi Oi is is is is is is is is to the left of Oj if cxi < cxj . to the right of Oj if cxi > cxj . below Oj if cyi > cyj . above Oj if cyi < cyj . to the left and below Oj if (cxi < cxj ) ∧ (cyi > cyj ). to the left and above Oj if (cxi < cxj ) ∧ (cyi < cyj ). to the right and below Oj if (cxi > cxj ) ∧ (cyi > cyj ). to the right and above Oj if (cxi > cxj ) ∧ (cyi < cyj ). • Containment: ◦ Oi is inside Oj if Oi ⊂ Oj . ◦ Oi contains Oj if Oj ⊂ Oi . • Distance: ◦ Oi is near or close to Oj if dij < td and Oi 6⊂ Oj . Composite spatial relations, such as Oi is inside and to the left of Oj , can be easily detected. Also features, such as Oi is partially inside Oj , are easily derived. Temporal relations The following temporal object relationships are defined: • Oi starts after Oj if Oi enters or appears in the scene at I(n) and Oj at I(k) with n > k. • Oi starts before Oj if Oi enters or appears in the scene at I(n) and Oj at I(k) with n < k. • Oj and Oj start together if both enter or appear at the same I(n). • Oj and Oj end together if both exit or disappear at the same I(n). 156 Video interpretation Possible extensions The following object relations can be also compiled for needs of video applications: • Closeness: behavior may involve more than one object and typically is between objects that are spatially close. The identification of spatial and/or temporal closeness features, such as next to, ahead, adjacent, or behind, is important for some applications. For example, the detection of objects that come close in a traffic scene is important for risk analysis. Objects are generally moving at varying speeds. A static notion of closeness is, therefore, not appropriate. Ideally, the closeness feature should be derived based on the velocity of objects and their distance to the camera. • Collision can be defined as: two objects occlude each other and then the shape of both change drastically. If no 3-D data are available, real-world object collision can be approximated, for example, by collision of the MBB of objects. • Near miss: objects come close but do not collide. • Estimation of time-to-collision based on the interpretation of the object motion and distance. • Relative direction of motion: same, opposing, or perpendicular. 
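The direction, containment, distance, and speed descriptors of this section translate directly into small predicates. The following sketch assumes each object carries its centroid, its pixel set, and a scalar speed in pixels per image; all thresholds and class limits are placeholders, not values from the thesis.

```python
# Sketch of the object-relation predicates of Section 7.2.3 and of the
# qualitative speed classes of Section 7.2.2; thresholds are placeholders.
import math

def left_of(oi, oj):          # direction relations use centroid coordinates
    return oi['centroid'][0] < oj['centroid'][0]

def above(oi, oj):
    return oi['centroid'][1] < oj['centroid'][1]

def inside(oi, oj):           # containment: every pixel of Oi belongs to Oj
    return oi['pixels'] <= oj['pixels']          # sets of (x, y) tuples

def near(oi, oj, t_d=30):     # close, but not contained
    (xi, yi), (xj, yj) = oi['centroid'], oj['centroid']
    return math.hypot(xi - xj, yi - yj) < t_d and not inside(oi, oj)

def speed_class(speed):       # {static, slow, moderate, fast}
    if speed < 0.5:
        return "static"
    if speed < 2:
        return "slow"
    if speed < 6:
        return "moderate"
    return "fast"
```

Composite relations such as "Oi is inside and to the left of Oj" then follow by conjunction of these predicates.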
7.3 Event-based representation This thesis proposes perceptual descriptions of events that are common for a wide range of applications. Event detection is not based on geometry of objects but on their features and relations over time. The thesis proposes approximate but efficient world models to define useful events. In many applications, approximate models even if not accurate, are adequate. To define events, some thresholds are used which can be adapted to a specific application. For example, the event an object enters is defined when an object is visible in the scene for some time, i.e., its age is larger than a threshold. Some applications require the detection of an enter event as soon as a small portion of the object is visible while other applications require the detection of an event when the object is completely visible. In some applications, applicationspecific conditions concerning low-level features, such as size, motion, or age, need to be considered when detecting events. These conditions can be easily added to the proposed system. To detect events, the proposed system monitors the behavior and features of each object in the scene. If specific conditions are met, events related to these conditions are detected. Analysis of the events is done on-line, i.e., events are detected as they occur. Specific object features, such as motion or size, are stored for each image and compared as images of a shot arrive. The following low-level object features are Event representation 157 combined to detect events: • Identity (ID) - a tag to uniquely identify an object throughout the video. • Age - the time interval when the object is tracked. • MBB - (initial, average, and current). • Area - (initial, average, and current). • Location - (initial and current). • Motion - (initial, average, and current). • Corresponding object - a temporal link to the corresponding object. Here following are the definition of the events that the current system detects automatically. The proposed events are sufficiently broad for a wide range of video applications to assist understanding of video shots. Other composite events can be compiled using this set of events to allow a more flexible event-based representation to adapt for the need of specific applications. Enter An object, Oi , enters the scene at time instant n if all the following conditions are met: • Oi ∈ I(n), • Oi ∈ / I(n − 1), i.e, zero match M0 :a Oi meaning Oi cannot be matched to any object in I(n − 1), and • ci is at the image border in I(n) (Fig. 7.4)3 . Examples are given in Figs. 7.6–7.14. This definition aims at detecting object entrance as soon as a portion of the object becomes visible. In some applications, only entering objects of specific size, motion, or age are of interest. In these applications, additional conditions can be added to refine the event enter. Appear An object, Oi , emerges, or appears4 , in the scene at time instant n in I(n) if the following conditions are met: • Oi ∈ I(n), / I(n − 1), i.e., zero match in I(n − 1): M0 :a Oi , and • Oi ∈ 3 This condition should depend on how fast the object is moving which is an important extension of the proposed event detection method. 4 An object can either enter or appear at the same time 158 Video interpretation • ci is not at the image border in I(n). Examples are given in Figs. 7.6–7.14. 
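The enter and appear definitions above reduce to a few checks on an object that has no correspondence in I(n − 1). A minimal sketch is given below; the border margin, the 'matched' flag, and the object fields are illustrative assumptions.

```python
# Sketch of the 'enter' / 'appear' event tests: a new, unmatched object whose
# centroid lies at (enter) or away from (appear) the image border.
def at_border(centroid, img_w, img_h, margin=16):
    cx, cy = centroid
    return (cx < margin or cx > img_w - margin or
            cy < margin or cy > img_h - margin)

def entry_event(obj, img_w, img_h):
    """obj is assumed to carry a centroid and a 'matched' flag (zero match M0)."""
    if obj['matched']:                 # object corresponds to one in I(n-1)
        return None
    if at_border(obj['centroid'], img_w, img_h):
        return "enter"                 # becomes visible at the image border
    return "appear"                    # emerges away from the border
```

Application-specific conditions on size, motion, or age can be added as further guards before the event is reported.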
Exit (leave) An object, Op , exits or leaves the scene at time instant n if the following conditions are met: • Op ∈ I(n − 1), • Op ∈ / I(n), i.e., zero match in I(n): M0 : Op a, • cp is at the image border in I(n − 1), and • gp > tg where gp is the age of Op and tg a threshold. Examples are given in Section 7.4.1. Disappear An object, Op , disappears from the scene at time instant n in I(n) if the following conditions are met: • Op ∈ I(n − 1), / I(n), i.e., zero match in I(n): M0 : Op a, • Op ∈ • cp is not at the image border in I(n − 1), • gp > tg . Examples are given in Section 7.4.1. Move An object, Oi , moves at time instant n in I(n) if the following conditions are met: • Oi ∈ I(n), • Mi : Op → Oi where Op ∈ I(n − 1), and • the median of the motion magnitudes of Oi in the last k images is larger than a threshold tm 5 . Examples are given in Figs. 7.6–7.14. Stop An object, Oi , stops in the scene at time instant n in I(n) if the following conditions are met: • Oi ∈ I(n), • Mi : Op → Oi where Op ∈ I(n − 1), • the median of the motion magnitudes of Oi in the last k images is less than a threshold tms . 5 Typical values of k are three to five and tm is one. Note that there is no delay to detect this event because motion data at previous images are available. Ideally, the value of k should depend on the object’ size as an approximation of the objects distance from the camera. To reduce computation a fixed threshold was, however, used. Event representation 159 Examples are given in Figs. 7.6–7.14. Occlude/occluded In Section 6.3.5, Eq. 6.14, the detection of occlusion is defined. With occlusion, at least two objects are involved where at least one is moving. All objects involved into occlusion have entered or appeared. When two objects occlude each other, the object with the larger area is defined as the occluding object, the other the occluded object. This definition can be adapted to the requirements of particular applications. Examples of occlusion detection are given in Figs. 7.12, 7.6, 7.7, and 7.13. Expose/exposed Exposure is the opposite operation of occlusion. It is detected when occlusion ends. Remove/removed Let Oi ∈ I(n) and Op , Oq ∈ I(n − 1) with Mi : Op → Oi . Op removes Oq if the following conditions are met: • Op and Oq were occluded in I(n − 1), / I(n), i.e., zero match in I(n): M0 : Oq a, and • Oq ∈ • the area of Oq is smaller than that of Oi , i.e., AAqi < ta , ta < 1 being a threshold. Removal is detected after occlusion. When occlusion is detected the proposed tracking technique (Section 6) predicts the occluded objects. In case of removal, the features of the removed object can change significantly and the tracking system may not be able to predict and track the removed objects. Thus the tracking technique may lose these objects. In this case, conditions for removal are checked and if they are met, removal is declared. The object with the larger area is the remover, the other is the removed object. Removal examples are given in Figs. 7.10 and 7.9. Deposit/deposited Let Op ∈ I(n − 1) and Oi , Oj ∈ I(n) with Mi : Op → Oi . 
Oi deposits Oj if the following conditions are met: 160 Video interpretation • Oi has entered or appeared, / I(n − 1), i.e., zero match in I(n − 1) with M0 :a Oj , • Oj ∈ • Aj Ai < ta , ta < 1 being a threshold, • Ai + Aj ' Ap ∧ [(Hi + Hj ' Hp ) ∨ (Wi + Wj ' Wp )], where Ai , Hi , and Wi are area, height, and width of an object Oi , • Oj is close to a side, s, of the MBB of Oi where s ∈ {rmini , rmaxi , cmini , cmaxi } (Oj is then declared as deposited object). Let dis be the distance between the MBB-side s and Oj . Oj is close to the MBB-side s if tcmin < dis < tcmax with the thresholds tcmin and tcmax , and • Oi changes in height or width between I(n − 1) and I(n) at the MBB-side s. If the distance between the MBB-side s and Oj is less than the threshold tcmin , then a split of Oj from Oi is assumed and Oj is merged to Oi . Only if this distance is large is the event deposit considered. This is so because in the real world, a depositor moves away from the deposited object and the deposit detection declares the event after the distance between the two objects is large. To reduce false alarms, deposit is declared if the deposited object remains in the scene for some time, e.g., age larger 7. The system differentiates between stopping objects (e.g., seated person or stopped car) and deposited objects. The system can also differentiate between deposit events and segmentation error due to splitting of objects (see Section 6.3.6). A deposited object remains long in the scene and the distance between the depositor and deposited object increases. Examples of object deposit are in Figs. 7.11, 7.9, and 7.8. Split An object splitting can be real (in case of object deposit) or due to object segmentation errors. The main difference between deposit and split is that a split object is close to the splitter while a depositor moves away from the deposited object and they become afar. The conditions for split are defined in Section 6.3.6 and Eq. 6.16. Objects at an obstacle Often, objects move close to static background objects (called obstacles) that can occlude part of the moving objects. This is particularly frequent in traffic scenes Event representation 161 where objects move close to traffic and other road signs. In this case, a change detection module is not able to detect pixels occluded by the obstacle and objects are split into two or more objects as shown in these figures: ⇒ . This is different from object split because no abrupt, but a gradual, change of object size and shape occurs. This thesis develops a method to detect the motion of objects at obstacles. The method monitors the size of each object, Oi , in the scene. If a continuous decrease or increase of the size of Oi is detected (by comparing the area of two corresponding objects), a flag for Oi is set accordingly. Let Oq , Op ∈ I(n − 1). Then Oq is at an obstacle if the following conditions are met: • Oq and Op have appeared or entered, • Oq has no corresponding object in I(n), i.e., zero match in I(n) with M0 : Oq a, • Aq was monotonically decreasing in the last k images, • Oq has a close object Op where ◦ Ap was continuously increasing in the last k images and ◦ Op has a corresponding object, i.e., Mi : Op → Oi , with Oi ∈ I(n). • Oq and Op have some similarity, i.e., object voting (Section 6.3.4, Eq. 6.1) gives Mp : Oq → Op → Oi with a low confidence, and • motion direction of Oq does not change if matched to Oi . Examples are given in Fig. 7.12. 
Note that while the transition images show two objects, the original object gets its ID back when motion at the obstacle is detected. Abnormal movements An abnormal movement occur when the movement of an object is frequent (e.g., fast motion) or when it is rare (slow motion or long stay). • an object, Oi , stays for long in the scene in the following cases (cf. examples in Figs. 7.13 and 7.6): ◦ gi > tgmax , i.e., Oi does not leave the scene after a given time. tgmax is a function of the frame-rate and the minimal allowable speed. ◦ di < tdmin , i.e, the distance, di , between the current position of Oi in I(n) and its past position in I(l), with l < n less than a threshold tdmin which is a function of the frame-rate, the object motion, and the image size. • an object, Oi , moves too fast (or moves too slow) if the object speed in the last k (for example, five) images is larger (smaller) than a threshold (cf. the example in Fig. 7.6). 162 Video interpretation Dominant object A dominant object • is related to a significant event, • has the largest size of all objects, • has the largest speed, or • has the largest age. Possible extensions Other events and composite events can be easily extracted based on our representation strategy. Also, application-specific conditions can be easily integrated. For example, approach a restricted site can be easily extracted when the location of the restricted site is known. The following list of events can be added to the proposed set of events: • Composite events Examples are: Oi moved, stopped, is occluded, and reverses directions. Oj is exposed, moves, and exits. • Stand/Sit/Walk Standing and sitting are characterized by continuous change in height and width of the object MBB. Sitting is characterized by continual increase of the width and decrease of the height. When an object stands, the width of its MBB continual increase while the height decrease. In both events, height and width must be compared to the values of the height and width at the time instant before they started to change. The event walk can be easily detected as continuous moderate movements of a person. • Approaching a restricted site This is an event that is straightforward to detect. If the location of a restricted site is given, the direction of an object’s motion and distance to the site can be monitored and the event approach a restricted site can be eventually declared. • Object lost/found At a time instant n, an object is declared lost if it has no corresponding object in the current image and occlusion was previously reported (but no removal). It is similar to the event disappear. Some applications require the search for lost objects even if they are not in the scene. To allow the system to find lost objects, features, such as ID, size, or motion, of lost objects need to be stored for future reference. If the event object lost was detected and a new object appears in the scene which shows similar features to the lost object, the objects can be matched and the event object found declared. • Changing direction or speed Based on a registered motion direction, which is registered when the object Results 163 completely enters the scene, the motion direction in the last k images previous to I(n) are compared with the registered motion direction. If the current motion direction is deviating from the motion direction in each of the k images, a change of direction can be declared. Similarly, change of speed can be detected. 
• Normal behavior
Often, a scene contains events which have never occurred before or which occur only rarely; what counts as normal is application dependent. In general, normal behavior can be defined as a chain of simple events: for example, an object enters, moves through the scene, and disappears.
• Object history
For some video applications, a summary or a detailed description of the spatio-temporal object features is needed. The proposed system can provide such a summary. An object history can include: initial location, trajectory, direction and velocity, significant changes in speed, spatial relations to other objects, and the distance between the current location of the object and a previous location.
7.4 Results and discussions
There are few representation schemes concerning high-level features such as events. Most high-level video representations are context-dependent or focus on the constraints of a narrow application, so they lack generality and flexibility (Section 7.1.3). Extensive experiments using widely referenced video shots have shown the effectiveness and generality of the proposed framework. The technique has been tested on 10 video shots containing a total of 6071 images, including sequences with noise and coding artifacts. Both indoor and outdoor real environments are considered. The performance of the proposed interpretation is shown by an automated textual summary of a video shot (Section 7.4.1) and an automated extraction of key images (Section 7.4.2). The proposed events are sufficiently broad for a wide range of video applications to assist surveillance and retrieval of video shots. For example, i) the removal or deposit of objects, such as computing devices, in a surveillance site can be monitored and detected as they happen, ii) the movement of traffic objects can be monitored and reported, and iii) the behavior of customers in stores or subways can be monitored.
The event detection procedure (not including the video analysis system) is fast and needs on average 0.0007 seconds on a SUN-SPARC-5 360 MHz to interpret the data between two images. The whole system, video analysis and interpretation, needs on average between 0.12 and 0.35 seconds to process the data between two images. Typically, surveillance video is recorded at a rate of 3–15 frames per second. The proposed system provides a response in real time for surveillance applications with a rate of up to 10 frames per second. Speed-up can be achieved, for example, by i) optimizing the implementation of the occlusion and object-separation processing, ii) optimizing the implementation of the change detection technique, and iii) working with integers instead of floating-point numbers (where appropriate) and with additions instead of multiplications.
In this thesis, special consideration is given to the processing inaccuracies and errors of a multi-level approach in order to handle specific situations such as false alarms. For example, the system is able to differentiate between deposited objects, split objects, and objects at an obstacle. It also rejects false alarms of entering or disappearing due to segmentation errors (cf. Sections 7.3 and 6.3.5). A critical issue in video surveillance is to differentiate between real moving objects and 'clutter motion', such as trees blowing in the wind and moving shadows. One way to handle these problems is to look for persistent motion; a second way is to classify motion as motion with purpose (vehicles or people) and motion without purpose (trees). The proposed tracking method can implicitly handle the first solution.
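As a rough illustration of the first solution, a persistence test might require an object to be tracked with non-negligible and directionally consistent displacement over several consecutive images before its events are reported. The sketch below is a hypothetical Python illustration, not the thesis implementation; the per-object age and displacement history are assumed to be available from the tracking level, and the thresholds are arbitrary.

```python
import math
from typing import List, Tuple

def is_persistent(displacements: List[Tuple[float, float]],
                  age: int, min_age: int = 5,
                  min_step: float = 0.5, max_angle_deg: float = 60.0) -> bool:
    """Heuristic clutter rejection: treat an object as a real mover only if it
    has been tracked for at least min_age images and its recent displacements
    are large enough and roughly consistent in direction."""
    if age < min_age or len(displacements) < min_age:
        return False
    recent = displacements[-min_age:]
    # every step must exceed a jitter threshold (e.g., waving leaves, shadows)
    if any(math.hypot(dx, dy) < min_step for dx, dy in recent):
        return False
    # directions of consecutive steps must not differ by more than max_angle_deg
    angles = [math.degrees(math.atan2(dy, dx)) for dx, dy in recent]
    for a, b in zip(angles, angles[1:]):
        diff = abs(a - b) % 360.0
        if min(diff, 360.0 - diff) > max_angle_deg:
            return False
    return True
```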
Implementations of the second way need to be developed. In addition, the detection of background objects that move during the shot needs to be explicitly processed. Video summary 7.4.1 165 Event-based video summary The following tables show shot summaries generated automatically by the proposed system. ‘hall_cif’ Shot Summary based on Objects and Events; StartPic 1/EndPic 300 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |Pic | Event | ObjID | Age | Status | Position | Motion | Size | | | | | | | start/present | present | start/present| |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |23 | Appearing completed | 1 | 8 | Move | (68 ,114)/(88 ,147) | (2 ,1 ) | 3878 /3878 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |84 | Appearing completed | 5 | 8 | Move | (234,111)/(224,130) | (-1 ,1 ) | 750 /750 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |146 | is Deposit by ObjID 1 | 6 | 8 | Stop | (117,162)/(117,163) | (0 ,0 ) | 298 /298 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |226 | Occlusion | 6 | 88 | Stop | (117,162)/(117,163) | (0 ,0 ) | 292 /290 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |226 | Occlusion ObjID 6 | 1 | 211 | Move | (68 ,114)/(149,129) | (-1 ,0 ) | 6602 /1507 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |251 | Disappear | 1 | 235 | Disappear | (68 ,114)/(125,128) | (-1 ,1 ) | 6602 /95 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| ‘road1_cif’ Shot Summary based on Objects and Events; StartPic 1/EndPic 300 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |Pic | Event | ObjID | Age | Status | Position | Motion | Size | | | | | | | start/present | present | start/present| |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |8 | Appearing completed | 1 | 8 | Move | (306,216)/(272,167) | (-5 ,-5 ) | 1407 /1407 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |11 | Entering completed | 2 | 8 | Move | (344,205)/(340,199) | (-1 ,-1 ) | 543 /543 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |45 | Appearing completed | 3 | 8 | Stop | (148,39 )/(148,39 ) | (0 ,0 ) | 148 /148 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |158 | Appearing completed | 4 | 8 | Move | (142,69 )/(138,74 ) | (-1 ,1 ) | 157 /157 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |182 | Disappear | 1 | 181 | Disappear | (306,216)/(173,41 ) | (0 ,0 ) | 809 /34 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |200 | Appearing completed | 5 | 8 | Move | (336,266)/(295,191) | (-8 ,-8 ) | 1838 /1838 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |201 | Move Fast | 5 | 9 | Move | (336,266)/(288,181) | (-8 ,-8 ) | 1588 /1588 | |--- | ----------------------| ----- | --- | 
--------- | ------------------- | --------- | -------------| |204 | Exit | 4 | 53 | Exit | (142,69 )/(8 ,273) | (-9 ,14 ) | 191 /471 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |204 | Occlusion | 5 | 12 | Move | (336,266)/(271,157) | (-5 ,-8 ) | 1103 /1103 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |204 | Occlusion by ObjID 5 | 2 | 201 | Stop | (344,205)/(289,129) | (0 ,0 ) | 822 /555 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |269 | Exit | 3 | 231 | Exit | (148,39 )/(3 ,230) | (-6 ,7 ) | 156 /391 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |279 | Abnormal | 2 | 276 | Stop | (344,205)/(291,129) | (0 ,0 ) | 822 /563 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| 166 Video interpretation ‘floor’ Shot Summary based on Objects and Events; StartPic 1/EndPic 826 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |Pic | Event | ObjID | Age | Status | Position | Motion | Size | | | | | | | start/present | present | start/present| |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |36 | Appearing completed | 1 | 8 | Move | (126,140)/(123,135) | (0 ,-1 ) | 320 /320 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |268 | is Deposit by ObjID 1 | 3 | 8 | Stop | (121,140)/(121,140) | (0 ,0 ) | 539 /539 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |405 | Occlusion | 3 | 145 | Stop | (121,140)/(120,141) | (0 ,0 ) | 541 /555 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |405 | Occlusion ObjID 3 | 1 | 377 | Move | (126,140)/(83 ,109) | (1 ,0 ) | 840 /2422 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |411 | Removal by ObjID 1 | 3 | 150 | Removal | (121,140)/(104,132) | (0 ,0 ) | 541 /1451 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |787 | Appearing completed | 18 | 8 | Move | (105,68 )/(108,86 ) | (0 ,0 ) | 91 /91 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |825 | Exit | 1 | 796 | Exit | (126,140)/(9 ,230) | (-10,7 ) | 840 /247 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| ‘floort’ Shot Summary based on Objects and Events; StartPic 1/EndPic 636 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |Pic | Event | ObjID | Age | Status | Position | Motion | Size | | | | | | | start/present | present | start/present| |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |8 | Entering completed | 1 | 8 | Stop | (32 ,136)/(32 ,136) | (0 ,0 ) | 270 /270 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |16 | Appearing completed | 2 | 8 | Move | (55 ,65 )/(57 ,78 ) | (1 ,0 ) | 814 /814 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |185 | Occlusion 
| 1 | 185 | Move | (32 ,136)/(32 ,131) | (0 ,-1 ) | 269 /352 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |185 | Occlusion ObjID 1 | 2 | 177 | Move | (55 ,65 )/(59 ,102) | (-1 ,1 ) | 1267 /2108 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |202 | Removal by ObjID 2 | 1 | 201 | Removal | (32 ,136)/(33 ,116) | (0 ,-1 ) | 269 /374 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |636 | Exit | 2 | 627 | Exit | (55 ,65 )/(277,235) | (9 ,2 ) | 1267 /252 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| ‘floorp’ Shot Summary based on Objects and Events; StartPic 1/EndPic 655 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |Pic | Event | ObjID | Age | Status | Position | Motion | Size | | | | | | | start/present | present | start/present| |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |13 | Entering completed | 1 | 8 | Move | (85 ,234)/(74 ,215) | (2 ,-5 ) | 3671 /3671 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |460 | is Deposit by ObjID 1 | 2 | 8 | Stop | (32 ,135)/(32 ,135) | (0 ,0 ) | 266 /266 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |654 | Disappear | 1 | 648 | Disappear | (85 ,234)/(51 ,108) | (0 ,1 ) | 3327 /85 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| Video summary 167 ‘urbicande_cif’ Shot Summary based on Objects and Events; StartPic 1/EndPic 300 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |Pic | Event | ObjID | Age | Status | Position | Motion | Size | | | | | | | start/present | present | start/present| |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |8 | Appearing completed | 1 | 8 | Move | (246,160)/(249,152) | (0 ,-1 ) | 185 /185 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |30 | Entering completed | 4 | 8 | Move | (331,158)/(322,160) | (0 ,0 ) | 65 /65 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |31 | Entering completed | 5 | 8 | Stop | (337,157)/(337,157) | (0 ,0 ) | 47 /47 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |48 | Appearing completed | 6 | 8 | Move | (235,95 )/(240,107) | (1 ,1 ) | 197 /197 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |52 | Occlusion by ObjID 6 | 1 | 52 | Stop | (246,160)/(260,129) | (0 ,0 ) | 180 /148 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |52 | Occlusion | 6 | 12 | Move | (235,95 )/(243,118) | (1 ,3 ) | 277 /277 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |104 | Appearing completed | 7 | 8 | Move | (120,13 )/(138,48 ) | (1 ,5 ) | 621 /621 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |145 | Occlusion by ObjID 7 | 1 | 145 | Stop | (246,160)/(249,117) | 
(0 ,0 ) | 180 /158 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |145 | Occlusion | 7 | 49 | Move | (120,13 )/(245,106) | (1 ,1 ) | 441 /240 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |158 | Exit | 4 | 135 | Exit | (331,158)/(30 ,194) | (-6 ,2 ) | 61 /28 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |182 | Entering completed | 8 | 8 | Stop | (337,157)/(337,157) | (0 ,0 ) | 49 /49 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |200 | Exit | 6 | 159 | Exit | (235,95 )/(334,172) | (0 ,0 ) | 122 /34 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |215 | Occlusion by ObjID 7 | 5 | 192 | Stop | (337,157)/(302,152) | (0 ,0 ) | 47 /92 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |229 | Entering completed | 9 | 8 | Stop | (337,157)/(337,157) | (0 ,0 ) | 49 /49 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |253 | Abnormal | 1 | 253 | Move | (246,160)/(183,92 ) | (-1 ,-1 ) | 180 /311 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |261 | Occlusion | 9 | 40 | Move | (337,157)/(341,158) | (1 ,0 ) | 49 /98 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |261 | Occlusion by ObjID 9 | 7 | 165 | Stop | (120,13 )/(331,150) | (0 ,0 ) | 441 /67 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |276 | Abnormal | 5 | 253 | Move | (337,157)/(226,143) | (-1 ,0 ) | 47 /259 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |290 | Entering completed | 10 | 8 | Stop | (307,193)/(307,193) | (0 ,0 ) | 550 /550 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| 168 Video interpretation ‘survey_d’ Shot Summary based on Objects and Events;; StartPic 1/EndPic 979 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |Pic | Event | ObjID | Age | Status | Position | Motion | Size | | | | | | | start/present | present | start/present| |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |8 | Entering completed | 2 | 8 | Move | (31 ,161)/(34 ,173) | (1 ,5 ) | 7115 /7115 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |8 | Entering completed | 1 | 8 | Move | (200,29 )/(196,31 ) | (1 ,0 ) | 2156 /2156 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |17 | Entering completed | 3 | 8 | Move | (15 ,173)/(7 ,177) | (-1 ,1 ) | 1103 /1103 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |25 | Entering completed | 4 | 8 | Move | (81 ,195)/(82 ,190) | (1 ,-3 ) | 1967 /1967 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |70 | Occlusion ObjID 1 | 2 | 70 | Move | (31 ,161)/(129,154) | (1 ,0 ) | 3593 /4605 | |--- | ----------------------| ----- | --- | --------- | ------------------- | 
--------- | -------------| |70 | Occlusion | 1 | 70 | Move | (200,29 )/(162,48 ) | (-1 ,1 ) | 2219 /3523 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |81 | Entering completed | 5 | 8 | Move | (283,146)/(275,164) | (0 ,1 ) | 3038 /3038 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |91 | Occlusion | 5 | 18 | Move | (283,146)/(211,163) | (-5 ,0 ) | 2886 /3124 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |91 | Occlusion ObjID 5 | 2 | 91 | Move | (31 ,161)/(153,143) | (1 ,0 ) | 3593 /4189 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |107 | Occlusion by ObjID 1 | 5 | 34 | Move | (283,146)/(121,164) | (-5 ,0 ) | 2886 /2643 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |107 | Occlusion ObjID 5 | 1 | 107 | Move | (200,29 )/(124,66 ) | (-1 ,1 ) | 2219 /5138 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |122 | Exit | 5 | 48 | Exit | (283,146)/(5 ,165) | (-3 ,3 ) | 2886 /340 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |197 | Exit | 1 | 196 | Exit | (200,29 )/(3 ,118) | (-2 ,5 ) | 2219 /123 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |228 | Exit | 2 | 227 | Exit | (31 ,161)/(309,12 ) | (0 ,-2 ) | 3593 /153 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |242 | Entering completed | 7 | 8 | Move | (210,27 )/(200,29 ) | (-1 ,1 ) | 1137 /1137 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |287 | Entering completed | 8 | 8 | Move | (242,24 )/(235,26 ) | (-1 ,1 ) | 1431 /1431 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |355 | Exit | 7 | 120 | Exit | (210,27 )/(4 ,103) | (-1 ,3 ) | 1424 /166 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |432 | Entering completed | 9 | 8 | Move | (245,26 )/(241,27 ) | (-1 ,3 ) | 1304 /1304 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |449 | Exit | 8 | 169 | Exit | (242,24 )/(4 ,167) | (-2 ,3 ) | 1708 /192 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |594 | Entering completed | 12 | 8 | Move | (261,28 )/(255,30 ) | (-1 ,1 ) | 1453 /1453 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |606 | Exit | 9 | 181 | Exit | (245,26 )/(13 ,171) | (-1 ,1 ) | 601 /2320 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |741 | Exit | 12 | 154 | Exit | (261,28 )/(8 ,186) | (-3 ,3 ) | 1436 /1109 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |836 | Entering completed | 13 | 8 | Move | (233,26 )/(226,30 ) | (0 ,2 ) | 1202 /1202 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |975 | Exit | 13 | 146 | Exit | (233,26 )/(3 ,123) | (-3 ,1 ) | 1444 /254 | |--- | ----------------------| ----- | --- | 
--------- | ------------------- | --------- | -------------| Key-image extraction 169 ‘stair_wide_cif’ Shot Summary based on Objects and Events;; StartPic 1/EndPic 1475 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |Pic | Event | ObjID | Age | Status | Position | Motion | Size | | | | | | | start/present | present | start/present| |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |172 | Entering completed | 2 | 8 | Move | (312,248)/(308,230) | (-2 ,-2 ) | 5746 /5746 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |216 | Entering completed | 3 | 8 | Move | (184,186)/(167,169) | (-2 ,-2 ) | 7587 /7587 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |234 | Exit | 2 | 69 | Exit | (312,248)/(337,282) | (3 ,16 ) | 11803 /180 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |479 | Appearing completed | 4 | 8 | Stop | (128,104)/(127,100) | (0 ,0 ) | 211 /211 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |547 | Disappear | 3 | 338 | Disappear | (184,186)/(114,67 ) | (0 ,-1 ) | 6374 /137 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |608 | Entering completed | 7 | 8 | Move | (120,88 )/(125,79 ) | (1 ,1 ) | 2536 /2536 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |678 | Entering completed | 8 | 8 | Move | (138,85 )/(131,85 ) | (-1 ,0 ) | 5308 /5308 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |803 | Exit | 7 | 202 | Exit | (120,88 )/(337,282) | (3 ,20 ) | 3432 /199 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |812 | Disappear | 8 | 141 | Disappear | (138,85 )/(121,92 ) | (0 ,0 ) | 4334 /73 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |916 | Entering completed | 9 | 8 | Move | (11 ,151)/(23 ,159) | (3 ,0 ) | 2866 /2866 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |1179| Exit | 9 | 270 | Exit | (11 ,151)/(127,72 ) | (0 ,0 ) | 4955 /1388 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |1290| Entering completed | 16 | 8 | Stop | (123,72 )/(128,77 ) | (0 ,0 ) | 2737 /2737 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| |1363| Exit | 16 | 80 | Exit | (123,72 )/(83 ,154) | (-5 ,1 ) | 3844 /15910 | |--- | ----------------------| ----- | --- | --------- | ------------------- | --------- | -------------| 7.4.2 Key-image based video representation In a surveillance environment, important events may occur after a long time has passed. During this time, the attention of human operators decreases and significant events may be missed. The proposed system for event detection identifies events of interest as they occur and human operators can focus their attention on moving objects and their related events. This section presents automatic extracted key-images from video shots. 
Key-images are the subset of images which best represent the content of a video sequence in an abstract manner. Key-image video abstraction transforms an entire video shot into a small number of representative images; this way, important content is maintained while redundancies are removed. Key-images based on events are appropriate when the system must report on specific events as soon as they happen. Figures 7.6–7.14 show key-images extracted automatically from video shots. Each image is annotated on its upper left and right corners with the image number, object ID, age, and events. Only objects performing the events are annotated in this application. Note that in the figures no key-images for the events Exit or Disappear are shown because of space constraints. In Fig. 7.13, no appear, enter, exit, and disappear key-images are displayed. Events not displayed are, however, given in the summary tables in Section 7.4.1. In some applications, a detailed description of events using key-images may be required. The proposed system can provide such details. For example, Fig. 7.5 illustrates detailed information during occlusion.
7.5 Summary
There has been little work on context-independent video interpretation. The system in [38] is limited to indoor environments, cannot deal with occlusion, and is noise sensitive; moreover, its definition of events is not widely applicable. The system for indoor surveillance applications in [133] provides only one abandoned object event in unattended environments.
This chapter has introduced a new context-independent video interpretation system that provides a video representation rich in terms of generic events and qualitative object features. Qualitative object descriptors are extracted by quantizing the low-level parametric descriptions of the objects. The thesis proposes approximate but efficient world models to define useful events; in many applications, approximate models, even if not precise, are adequate. To extract events, changes of motion and the behavior of low-level features of the scene's objects are continually monitored. When certain conditions are met, the events related to these conditions are detected. The proposed events are sufficiently broad for a wide range of video applications to assist surveillance and retrieval of video shots. Examples are: 1) the removal/deposit of objects, such as computing devices, in a surveillance site can be monitored and detected as they happen, 2) the movement of traffic objects can be monitored and reported, and 3) the behavior of customers in stores or subways can be monitored. The proposed system can be used in both modes, on-line or off-line. In an on-line mode, such as surveillance, the detection of an event can send related information to a human operator. In an off-line mode, the system stores events and object representations in a database. Experiments on more than 10 indoor and outdoor video shots, containing a total of 6371 images and including sequences with noise and coding artifacts, have demonstrated the reliability and the real-time performance of the proposed system.
Figure 7.5: Key images during occlusion. Each image is annotated with events (upper left-hand corner) and objects are annotated with their MBB and ID. The original sequence is rotated by 90° to the right to comply with the CIF format.
Figure 7.6: Key-event-images of the 'Highway' sequence (300 images).
This sequence is characteristic of a traffic monitoring application. Each image is annotated with events (upper left-hand corner) and objects are annotated with their MBB and ID. Important key events: abnormal movement, O5 moves fast and O2 stops for long.
Figure 7.7: Key-event-images of the 'Highway2' sequence (300 images). Important key event: the appearance of a person, O7, on the highway.
Figure 7.8: Key-event-images of the 'Hall' sequence (300 images). This sequence is characteristic of an indoor surveillance application. Important key event: O6 is deposited by object O1.
Figure 7.9: Key-event-images of the 'Floor' sequence (826 images). Important key events: O1 deposits and then removes O3. Both the depositor/remover and the deposited/removed objects are detected.
Figure 7.10: Key-event-images of the 'FloorT' sequence (636 images). Important key event: removal of O1 by O2.
Figure 7.11: Key-event-images of the 'FloorP' sequence (655 images). Both the key event Deposit and the object that performs the key event are correctly recognized.
Figure 7.12: Key-event-images of the 'Survey' sequence (979 images). This sequence is typical of a parking lot surveillance application. Important key event: occlusion of three objects, O1, O2, and O5.
Figure 7.13: Key-event-images of the 'Urbicande' sequence (300 images), which is characteristic of urban city surveillance. Various objects enter and leave the scene. Important key events: O1 and O5 are moving abnormally and stay for long in the scene (see I(253) & I(276)). Note that the original sequence is rotated by 90° to the right to comply with the CIF format.
Figure 7.14: Key-event-images of the 'Stair' sequence (1475 images). This sequence is typical of an entrance surveillance application. The interesting feature of this application is that objects can enter from three different places, the two doors and the stairs. One of the doors is restricted. To detect specific events, such as entering or approaching a restricted site (see image 964), a map of the scene is needed.
Chapter 8
Conclusion
8.1 Review of the thesis background
This thesis has developed a new framework for high-level video content processing and representation based on objects and events. To achieve high applicability, content is extracted independently of the context of the processed video. The proposed framework targets efficient and flexible representation of video from real (indoor and outdoor) environments where occlusion, illumination change, and artifacts may occur. Most video processing and representation systems have mainly dealt with video data in terms of pixels, blocks, or some global structure. This is not sufficient for advanced video applications. In a surveillance application, for instance, object extraction is necessary to automatically detect and classify object behavior. In video databases, advanced retrieval must be based on high-level object features and object meaning. Users are, in general, attracted to moving objects and focus first on their meaning and then on their low-level features. Several approaches to object-based video representation have been studied, but they often focus on low-level quantitative features or assume a simple environment, for example, one without object occlusion. There are few representation schemes concerning high-level features of video content such as activities and events.
Much of the work on event detection and classification focuses on how to express events using reasoning and inference methods. In addition, most high-level video representations are context-dependent and focus on the constraints of a narrow application, so they lack generality and flexibility.
8.2 Summary of contributions
The proposed system is aimed at three goals: flexible object representations, reliable stable processing that foregoes the need for precision, and low computational cost. The proposed system targets video from real environments such as those with object occlusions or artifacts. This thesis has achieved these goals through adaptation to noise and image content, through the detection and correction or compensation of estimation errors at the various processing levels, and through the division of the processing system into simple but effective tasks that avoid complex operations. This thesis has demonstrated that, based on such a strategy, quality results of video enhancement, analysis, and interpretation can be achieved. The proposed system provides a response in real time for surveillance applications with a rate of up to 10 frames per second on a multitasking SUN UltraSPARC 360 MHz without specialized hardware.
The robustness of the proposed methods has been demonstrated by extensive experimentation on more than 10 indoor and outdoor video shots containing a total of 6371 images, including sequences with noise and coding artifacts. The robustness of the proposed system is a result of adaptation to noise and artifacts and of processing that accounts for errors at one step by correction or compensation at subsequent steps where higher-level information is available. This consideration of the processing inaccuracies and errors of a multi-level approach allows the system to handle specific situations such as false alarms. For example, the system is able to differentiate between deposited objects, split objects, and objects at an obstacle. It also rejects false alarms of entering or disappearing due to segmentation errors.
The proposed system can be viewed as a framework of methods and algorithms to build automatic dynamic scene interpretation and representation. Such interpretation and representation can be used in various video applications. Besides applications such as video surveillance and retrieval, outputs of the proposed framework can be used in a video understanding or symbolic reasoning system. Contributions of this thesis are made at three processing levels: video enhancement to estimate and reduce noise, video analysis to extract meaningful objects and their spatio-temporal features, and video interpretation to extract context-independent semantic features such as events. The system is modular and layered from low level to middle level to high level. Results from a lower level are integrated to support higher levels; higher levels support lower levels through memory-based feedback loops.
Video enhancement
This thesis has developed a spatial noise filter of low complexity which is adaptive to the image structure and the image noise. The proposed method first applies a local image analyzer along eight directions and then selects a suitable direction for filtering. Quantitative and qualitative simulations show that the proposed noise- and structure-adaptive filtering method is more effective at reducing Gaussian white noise without image degradation than the reference filters used.
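A minimal sketch of the idea behind such direction-adaptive filtering is given below, assuming grayscale frames as NumPy arrays. The 3x3 neighborhood, the homogeneity measure (absolute differences along each line), and the three-pixel average are illustrative choices only and do not reproduce the exact analyzer or selection rule of the thesis.

```python
import numpy as np

# Offsets for the four lines through a 3x3 neighborhood (horizontal, vertical,
# and the two diagonals); each line covers two of the eight analysis directions.
LINES = [(0, 1), (1, 0), (1, 1), (1, -1)]

def direction_adaptive_filter(img: np.ndarray) -> np.ndarray:
    """Smooth each pixel along its locally most homogeneous direction.

    The absolute differences between a pixel and its two neighbors along each
    line serve as a simple homogeneity measure; averaging is then done along
    the line with the smallest measure, so structure oriented along that line
    is preserved while noise is reduced."""
    img = img.astype(np.float64)
    padded = np.pad(img, 1, mode='edge')
    h, w = img.shape
    out = np.empty_like(img)
    for r in range(h):
        for c in range(w):
            center = padded[r + 1, c + 1]
            best_cost, best_avg = np.inf, center
            for dr, dc in LINES:
                p = padded[r + 1 + dr, c + 1 + dc]
                q = padded[r + 1 - dr, c + 1 - dc]
                cost = abs(center - p) + abs(center - q)
                if cost < best_cost:
                    best_cost, best_avg = cost, (p + center + q) / 3.0
            out[r, c] = best_avg
    return out
```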
This thesis has also contributed a reliable, fast method to estimate the variance of the white noise. The method first finds homogeneous blocks and then averages the variances of these homogeneous blocks to determine the noise variance. For typical image quality, with PSNR between 20 and 40 dB, the proposed method significantly outperforms other methods, and the worst-case PSNR estimation error is approximately 3 dB, which is suitable for video applications such as surveillance or TV signal broadcast.
Video analysis
The proposed video analysis method extracts meaningful video objects and their spatio-temporal low-level features. It is fault tolerant, can correct inaccuracies, and can recover from errors. The method is primarily based on computationally efficient object segmentation and voting-based object tracking. Segmentation is realized in four steps: motion-detection-based binarization, morphological edge detection, contour analysis, and object labeling. To focus on meaningful objects, the proposed segmentation method uses a background image. The proposed algorithm memorizes previously detected motion data to adapt the current segmentation. The edge detection is performed by novel morphological operations with significantly reduced computations. Edges are gap-free and single-pixel wide. Edges are grouped into contours, and small contours are eliminated if they cannot be matched to previously extracted regions. The tracking method is based on a non-linear voting system to solve the problem of multiple object correspondences. The occlusion problem is alleviated by a median-based prediction procedure. Objects are tracked from the moment they enter the scene until they exit, including the occlusion period. An important contribution of the proposed tracking is the reliable region merging, which significantly improves the performance of the whole proposed video system.
Video interpretation
This thesis has proposed a context-independent video interpretation system. The implemented system provides a video representation rich in terms of generic events and qualitative object features. Qualitative object descriptors are extracted by quantizing the low-level parametric descriptions of the objects. The thesis proposes approximate but efficient world models to define useful events; in many applications, approximate models, even if not precise, are adequate. To extract events, changes of motion and low-level features in the scene are continually monitored. When certain conditions are met, the events related to these conditions are detected. Detection of events is done on-line, i.e., events are detected as they occur. Specific object features, such as motion or size, are stored for each image and compared as the images of a shot come in. Both indoor and outdoor real environments are considered. The proposed events are sufficiently broad for a wide range of video applications to assist surveillance and retrieval of video shots.
8.3 Possible extensions
There are a number of issues to consider in order to enhance the performance of the proposed system and extend its applicability.
• Time of execution and applications
The motion detection and object occlusion processing modules have the highest computational cost of the proposed modular system. The implementation of their algorithms can be optimized to allow faster execution of the whole system. In addition, the proposed system should be applied to a larger set of video shots and environments.
• Object segmentation In the context of MPEG-video coding, motion vectors are available. One of the immediate extensions of the proposed segmentation technique is to integrate motion information from the MPEG-stream to support object segmentation. This integration is expected to enhance segmentation without a significant increase in computational cost. • Motion estimation The proposed model of motion can be further refined to allow more accurate estimation. A straightforward extension is to examine the displacements of the diagonal extents of objects and adapt the estimation to previously estimated motion for greater stability. A possible extension of the proposed tracking method is in tracking objects that move in and out of the scene. • Highlights and shadows The system can benefit from the detection of shadows and compensation of their effects, especially when the source and direction of illumination is known. • Image stabilization Image stabilization techniques can be used to allow the analysis of video data from moving cameras and changing backgrounds. • Video interpretation A wider set of events can be considered for the system to serve a larger set of applications. A program interface can be designed to facilitate user-system interaction. Definition of such an interface requires a study of the needs of users of video applications. A classification of moving objects and ‘clutter motion’, such as trees blowing in the wind, can be considered to reject events. One possible classification is to detect motion as motion with purpose (for example, motion of vehicle or people) and motion without purpose (for example, motion of trees). In addition, the proposed modular framework can be extended to assist context dependent or higher-level tasks such as video understanding or symbolic reasoning. Bibliography [1] T. Aach, A. Kaup, and R. Mester, “Statistical model-based change detection in moving video,” Signal Process., vol. 31, no. 2, pp. 165–180, 1993. [2] A. Abutaleb, “Automatic thresholding of gray-level pictures using two-dimensional entropy,” Comput. Vis. Graph. Image Process., vol. 47, pp. 22–32, 1989. [3] E. Adelson and J. Bergen, “The plenoptic function and the elements of early vision,” in Computational Models of Visual Processing (M. Landy and J. Movshon, eds.), ch. 1, Cambridge: M.I.T. Press, 1991. [4] E. Adelson and J. Movshon, “Phenomenal coherence of moving visual patterns,” Nature, vol. 300, pp. 532–525, Dec. 1982. [5] P. Aigrain, H. Zhong, and D. Petkovic, “Content-based representation and retrieval of visual media: A state-of-the-art review,” Multimedia tools and applications J., vol. 3, pp. 179–192, 1996. [6] A. Alatan, L. Onural, M. Wollborn, R. Mech, E. Tunceland, and T. Sikora, “Image sequence(1) analysis for emerging interactive multimedia services - the European COST 211 Framework,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 802–813, Nov. 1998. [7] A. Amer, “Motion estimation using object segmentation methods,” Master’s thesis, Dept. Elect. Eng., Univ. Dortmund, Dec. 1994. In German. [8] A. Amer, “Object-based video retrieval based on motion analysis and description,” Tech. Rep. 99–12, INRS-T´el´ecommunications, June 1999. [9] A. Amer and H. Blume, “Postprocessing of MPEG-2 decoded image signals,” in Proc. 1st ITG/Deutsche-Telekom Workshop on Multimedia and applications, (Darmstadt, Germany), Oct. 1996. In German. [10] A. Amer and E. Dubois, “Segmentation-based motion estimation for video processing using object-based detection of motion types,” in Proc. 
SPIE Visual Communications and Image Process., vol. 3653, (San Jose, CA), pp. 1475–1486, Jan. 1999. [11] A. Amer and E. Dubois, “Image segmentation by robust binarization and fast morphological edge detection,” in Proc. Vision Interface, (Montr´eal, Canada), pp. 357–364, May 2000. 185 186 Bibliography [12] A. Amer and E. Dubois, “Object-based postprocessing of block motion fields for video applications,” in Proc. SPIE Image and Video Communications and Processing, vol. 3974, (San Jose, CA), pp. 415–424, Jan. 2000. [13] A. Amer and H. Schr¨ oder, “A new video noise reduction algorithm using spatial sub-bands,” in Proc. IEEE Int. Conf. Electron., Circuits, and Syst., vol. 1, (Rodos, Greece), pp. 45–48, Oct. 1996. [14] D. Ayers and M. Shah, “Recognizing human actions in a static room,” in Proc. 4th IEEE Workshop on Applications of Computer Vision, (Princeton, NJ), pp. 42–47, Oct. 1998. [15] A. Azarbayejani, C. Wren, and A. Pentland, “Real-time 3-D tracking of the human body,” in Proc. IMAGE’COM, (Bordeaux, France), pp. 19–24, May 1996. M.I.T. TR No. 374. [16] B. Bascle, P. Bouthemy, R. Deriche, and F. Meyer, “Tracking complex primitives in an image sequence,” in Proc. IEEE Int. Conf. Pattern Recognition, (Jerusalem), pp. 426–431, Oct. 1994. [17] J. Bernsen, “Dynamic thresholding of grey-level images,” in Proc. Int. Conf. on Pattern Recognition, (Paris, France), pp. 1251–1255, Oct. 1986. [18] H. Blume, “Bewegungssch¨atzung in videosignalen mit parallelen o¨rtlich zeitlichen pr¨ adiktoren,” in Proc. 5. Dortmunder Fernsehseminar, vol. 0393, (Dortmund, Germany), pp. 220–231, 29 Sep.- 1 Oct. 1993. In German. [19] H. Blume, “Vector-based nonlinear upconversion applying center weighted medians,” in Proc. SPIE Conf. Nonlinear Image Process., (San Jose, CA), pp. 142–153, Feb. 1996. [20] H. Blume and A. Amer, “Parallel predictive motion estimation using object segmentation methods,” in Proc. European Workshop and Exhibition on Image Format Conversion and Transcoding, (Berlin, Germany), pp. C1/1–5, Mar. 1995. [21] H. Blume, A. Amer, and H. Schr¨ oder, “Vector-based postprocessing of MPEG-2 signals for digital TV-receivers,” in Proc. SPIE Visual Communications and Image Process., vol. 3024, (San Jose, CA), pp. 1176–1187, Feb. 1997. [22] A. Bobick, “Movement, activity, and action: the role of knowledge in the perception of motion,” Tech. Rep. 413, M.I.T. Media Laboratory, 1997. [23] G. Boudol, “Atomic actions,” Tech. Rep. 1026, Institut National de Recherche en Informatique et en Automatique, May 1989. [24] P. Bouthemy and R. Fablet, “Motion characterization from temporal co-occurences of local motion-based measures for video indexing,” in Proc. IEEE Int. Conf. Pattern Recognition, vol. 1, (Brisbane, IL), pp. 905–908, Aug. 1998. [25] P. Bouthemy, M. Gelgon, and F. Ganansia, “A unified approach to shot change detection and camera motion characterization,” Tech. Rep. 1148, Institut National de Recherche en Informatique et en Automatique, Nov. 1997. 187 [26] M. Bove, “Object-oriented television,” SMPTE J., vol. 104, pp. 803–807, Dec. 1995. [27] J. Boyd, J. Meloche, and Y. Vardi, “Statistical tracking in video traffic surveillance,” in Proc. IEEE Int. Conf. Computer Vision, vol. 1, (Corfu, Greece), pp. 163–168, Sept. 1999. [28] F. Br´emond and M. Thonnat, “A context representation for surveillance systems,” in Proc. Workshop on Conceptual Descriptions from Images at the European Conf. on Computer Vision, (Cambridge, UK), pp. 28–42, Apr. 1996. [29] F. Br´emond and M. 
Thonnat, “Issues of representing context illustrated by videosurveillance applications,” Int. J. of Human-Computer Studies, vol. 48, pp. 375–391, 1998. Special Issue on Context. [30] M. Busian, “Object-based vector field postprocessing for enhanced noise reduction,” Tech. Rep. S04–97, Dept. Elect. Eng., Univ. Dortmund, 1997. In German. [31] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Machine Intell., vol. 9, pp. 679–698, Nov. 1986. [32] M. Chang, A. Tekalp, and M. Sezan, “Simultaneous motion estimation and segmentation,” IEEE Trans. Image Process., vol. 6, no. 9, pp. 1326–1333, 1997. [33] S. Chang, W. Chen, H. Meng, H. Sundaram, and D. Zhong, “A fully automatic content-based video search engine supporting multi-object spatio-temporal queries,” IEEE Trans. Circuits Syst. Video Techn., vol. 8, no. 5, pp. 602–615, 1998. Special Issue. [34] A. Cohn and S. Hazarika, “Qualitative spatial representation and reasoning: An overview,” Fundamenta Informaticae, vol. 43, pp. 2–32, 2001. [35] R. Collins, A. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, and O. Hasegawa, “A system for video surveillance and monitoring,” Tech. Rep. CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, May 2000. [36] J. Conde, A. Teuner, and B. Hosticka, “Hierarchical locally adaptive multigrid motion estimation for surveillance applications,” in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, (Phoenix, Arizona), pp. 3365–3368, May 1999. [37] P. Correia and F. Pereira, “The role of analysis in content-based video coding and indexing,” Signal Process., vol. 66, pp. 125–142, 1998. [38] J. Courtney, “Automatic video indexing via object motion analysis,” Pattern Recognit., vol. 30, no. 4, pp. 607–625, 1997. [39] A. Cross, D. Mason, and S. Dury, “Segmentation of remotely-sensed images by a split-and-merge process,” Int. J. Remote Sensing, vol. 9, no. 8, pp. 1329–1345, 1988. [40] A. Cr´etual, F. Chaumette, and P. Bouthemy, “Complex object tracking by visual servoing based on 2-D image motion,” in Proc. IEEE Int. Conf. Pattern Recognition, vol. 2, (Brisbane, IL), pp. 1251–1254, Aug. 1998. 188 Bibliography [41] M. Dai, P. Baylou, L. Humbert, and M. Najim, “Image segmentation by a dynamic thresholding using edge detection based on cascaded uniform filters,” Signal Process., vol. 52, pp. 49–63, Apr. 1996. [42] K. Daniilidis, C. Krauss, M. Hansen, and G. Sommer, “Real time tracking of moving objects with an active camera,” J. Real-Time Imging, vol. 4, pp. 3–20, February 1998. [43] G. de Haan, Motion Estimation and Compensation: An Integrated Approach to Consumer Display Field Rate Conversion. PhD thesis, Natuurkundig Laboratorium, Univ. Delft, Sept. 1992. [44] G. de Haan, “IC for motion compensated deinterlacing, noise reduction and picture rate conversion,” IEEE Trans. Consum. Electron., vol. 42, pp. 617–624, Aug. 1999. [45] G. de Haan, “Progress in motion estimation for consumer video format conversion,” in Proc. IEEE Digest of the ICCE, (Los Angeles, CA), pp. 50–51, June 2000. [46] G. de Haan, T. Kwaaitaal-Spassova, M. Larragy, and O. Ojo, “IC for motion compensated 100 Hz TV with smooth movie motion mode,” IEEE Trans. Consum. Electron., vol. 42, pp. 165–174, May 1996. [47] G. de Haan, T. Kwaaitaal-Spassova, and O. Ojo, “Automatic 2-D and 3-D noise filtering for high-quality television receivers,” in Proc. Int. Workshop on Signal Process. and HDTV, vol. VI, (Turin, Italy), pp. 221–230, 1996. [48] Y. Deng and B. 
Manjunath, “NeTra–V: Towards an object-based video representation,” IEEE Trans. Circuits Syst. Video Techn., vol. 8, pp. 616–27, Sept. 1998. Special Issue. [49] N. Diehl, “Object-oriented motion estimation and segmentation in image sequence,” Signal Process., Image Commun., vol. 3, pp. 23–56, Feb. 1991. [50] S. Dockstader and A. Tekalp, “On the tracking of articulated and occluded video object motion,” J. Real-Time Imging, vol. 7, pp. 415–432, Oct. 2001. [51] E. Dougherty and J. Astola, An Introduction to Nonlinear Image Processing, vol. TT 16. Washington: SPIE Optical Engineering Press, 1994. [52] H. Dreßler, “Noise estimation in analogue and digital trasmitted video signals,” Tech. Rep. S11-96, Dept. Elect. Eng., Univ. Dortmund, Apr. 1997. [53] E. Dubois and T. Huang, “Motion estimation,” in The past, present, and future of image and multidimensional signal processing (R. Chellappa, B. Girod, D. Munson, and M. V. M. Tekalp, eds.), pp. 35–38, IEEE Signal Processing Magazine, Mar. 1998. [54] F. Dufaux and J. Konrad, “Efficient, robust and fast global motion estimation for video coding,” IEEE Trans. Image Process., vol. 9, pp. 497–500, June 2000. [55] F. Dufaux and F. Moscheni, “Segmentation-based motion estimation for second generation video coding techniques,” in Video coding: Second generation approach (L. Torres and M. Kunt, eds.), pp. 219–263, Kluwer Academic Publishers, 1996. 189 [56] M. Ferman, M. Tekalp, and R. Mehrotra, “Effective content representation for video,” in Proc. IEEE Int. Conf. Image Processing, vol. 3, (Chicago, IL), pp. 521–525, Oct. 1998. [57] J. Fernyhough, A. Cohn, and D. Hogg, “Constructing qualitative event models automatically from video input,” Image and Vis. Comput., vol. 18, pp. 81–103, 2000. [58] J. Flack, On the Interpretation of Remotely Sensed Data Using Guided Techniques for Land Cover Analysis. PhD thesis, EEUWIN Center for Remote Sensing Technologies, Feb. 1996. [59] J. Foley, A. van Dam, S. Feiner, and J. Hughes, Computer Graphics: Principles and Practice. Reading, MA: Addison-Wesley, 1990. Second edition. [60] M. Gabbouj, G. Morrison, F. Alaya-Cheikh, and R. Mech, “Redundancy reduction techniques and content analysis for multimedia services - the European COST 211quat Action,” in Proc. Workshop on Image Analysis for Multimedia Interactive Services, (Berlin, Germany), pp. 1251–1255, May 1999. [61] L. Garrido, P. Salembier, and D. Garcia, “Extensive operators in partition lattices for image sequence analysis,” Signal Process., vol. 66, pp. 157–180, 1998. [62] A. Gasch, “Object-based vector analysis for restoration of video signals,” Master’s thesis, Dept. Elect. Eng., Univ. Dortmund, July 1997. In German. [63] A. Gasteratos, “Mathematical morphology operations and structuring elements.” Computer Vision On-line, http://www.dai.ed.ac.uk/CVonline/transf.htm. [64] C. Giardina and E. Dougherty, Morphological Methods in Image and Signal Processing. New Jersey: Prentice Hall, 1988. [65] S. Gil, R. Milanese, and T. Pun, “Feature selection for object tracking in traffic scenes,” in Proc. SPIE Int. Symposium on Smart Highways, vol. 2344, (Boston, MA), pp. 253–266, Oct. 1994. [66] B. Girod, “What’s wrong with mean squared error?,” in Digital Images and Human Vision (A. Watson, ed.), ch. 15, M.I.T. Press, Cambridge, Mar. 1993. [67] F. Golshani and N. Dimitrova, “A language for content-based video retrieval,” Multimedia tools and applications J., vol. 6, pp. 289–312, 1998. [68] M. Hansen, P. Anandan, K. Dana, G. van der Wal, and P. 
Burt, “Real-time scene stabilization and mosaic construction,” in Proc. DARPA Image Understanding Workshop, vol. 1, (Monterry, CA), pp. 457–465, Nov. 1994. [69] R. Haralick and L. Shapiro, Computer and Robot Vision. Reading: Addison-Wesley, 1992. [70] I. Haritaoglu, D. Harwood, and L. S. Davis, “W 4 : Real-time surveillance of people and their activities,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 809–830, Aug. 2000. 190 Bibliography [71] M. Isard and A. Blake, “Contour tracking by stochastic propagation of conditional density,” in Proc. European Conf. Computer Vision, vol. A, pp. 343–356, 1996. [72] R. Jain, A. Pentland, and D. Petkovic, “Workshop report,” in Proc. NSF-ARPA Workshop on Visual Information Management Systems, (Cambridge, MA), June 1995. [73] R. Jain and T. Binford, “Dialogue: Ignorance, myopia, and naivete in computer vision systems,” Comput. Vis. Graph. Image Process., vol. 53, pp. 112–117, January 1991. [74] K. Jostschulte and A. Amer, “A new cascaded spatio-temporal noise reduction scheme for interlaced video,” in Proc. IEEE Int. Conf. Image Processing, vol. 2, (Chicago, IL), pp. 493–497, Oct. 1998. [75] S. Khan and M. Shah, “Tracking people in presence of occlusion,” in Proc. Asian Conf. on Computer Vision, (Taipei, Taiwan), pp. 1132–1137, Jan. 2000. [76] J. Konrad, “Motion detection and estimation,” in Image and Video Processing Handbook (A. Bovik, ed.), ch. 3.8, Academic Press, 1999. [77] K. Konstantinides, B. Natarajan, and G. Yovanof, “Noise estimation and filtering using block-based singular-value decomposition,” IEEE Trans. Image Process., vol. 6, pp. 479–483, Mar. 1997. [78] M. Kunt, “Comments on dialogue, a series of articles generated by the paper entitled ‘ignorance, myopia, and naivete in computer vision’,” Comput. Vis. Graph. Image Process., vol. 54, pp. 428–429, November 1991. [79] J. Lee, “Digital image smoothing and the sigma filter,” Comput. Vis. Graph. Image Process., vol. 24, pp. 255–269, 1983. [80] G. Legters and T. Young, “A mathematical model for computer image tracking,” IEEE Trans. Pattern Anal. Machine Intell., vol. 4, pp. 583–594, Nov. 1982. [81] A. Lippman, N. Vasconcelos, and G. Iyengar, “Human interfaces to video,” in Proc. 32nd Asilomar Conf. on Signals, Systems, and Computers, (Asilomar, CA), Nov. 1998. Invited Paper. [82] E. Lyvers and O. Mitchell, “Precision edge contrast and orientation estimation,” IEEE Trans. Pattern Anal. Machine Intell., vol. 10, pp. 927–937, November 1988. [83] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman and Company, 1982. [84] R. Mech and M. Wollborn, “A noise robust method for segmentation of moving objects in video sequences,” in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, vol. 4, (Munich, Germany), pp. 2657–2660, Apr. 1997. [85] R. Mech and M. Wollborn, “A noise robust method for 2-D shape estimation of moving objects in video sequences considering a moving camera,” Signal Process., vol. 66, no. 2, pp. 203–217, 1998. 191 [86] G. Medioni, I. Cohen, F. Br´emond, and R. N. S. Hongeng, “Event detection and analysis from video streams,” IEEE Trans. Pattern Anal. Machine Intell., vol. 23, no. 8, pp. 873–889, 2001. [87] P. Meer, R. Park, and K. Cho, “Multiresolution adaptive image smoothing,” Graphical Models and Image Process., vol. 44, pp. 140–148, Mar. 1994. 
[88] megapixel.net, “Noise: what it is and when to expect it.” Monthly digital camera web magazine, http://www.megapixel.net/html/articles/article-noise.html, 2001. [89] T. Meier and K. Ngan, “Automatic segmentation of moving objects for video object plane generation,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 525–538, Sept. 1998. Invited paper. [90] T. Minka, “An image database browser that learns from user interaction,” Master’s thesis, M.I.T. Media Laboratory, Perceptual Computing Section, 1996. [91] A. Mitiche, Computational Analysis of Visual Motion. New York: Plenum Press, 1994. [92] A. Mitiche and P. Bouthemy, “Computation and analysis of image motion: a synopsis of current problems and methods,” Intern. J. Comput. Vis., vol. 19, no. 1, pp. 29–55, 1996. [93] M. Naphade, R. Mehrotra, A. Ferman, J. Warnick, T. Huang, and A. Tekalp, “A high performance algorithm for shot boundary detection using multiple cues,” in Proc. IEEE Int. Conf. Image Processing, vol. 2, (Chicago, IL), pp. 884–887, 1998. [94] W. Niblack, An introduction to digital image processing. Prentice Hall, 1986. [95] H. Nicolas and C. Labit, “Motion and illumination variation estimation using a hierarchy of models: Application to image sequence coding,” Tech. Rep. 742, IRISA, July 1993. [96] M. Nieto, “Public video surveillance: Is it an effective crime prevention tool?.” CRB California Research Bureau, California State Library, http://www.library.ca.gov/CRB/97/05/, June 1997. CRB-97-005. [97] A. Oliphant, K. Taylor, and N. Mission, “The visibility of noise in system-I PAL colour television,” Tech. Rep. 12, BBC Research and Development Department, 1988. [98] S. Olsen, “Estimation of noise in images: An evaluation,” Graphical Models and Image Process., vol. 55, pp. 319–323, July 1993. [99] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. Syst., Mach. and Cybern., vol. 9, no. 1, pp. 62–66, 1979. [100] T. Pavlidis, Structural Pattern Recognition. Berlin: Springer Verlag, 1977. [101] T. Pavlidis, “Contour filling in raster graphics,” in Proc. SIGGRAPH, (Dallas, Texas), pp. 29–36, Aug. 1981. 192 Bibliography [102] T. Pavlidis, Algorithms for Graphics and Image Processing. Maryland: Computer Science Press, 1982. [103] T. Pavlidis, “Why progress in machine vision is so slow,” Pattern Recognit. Lett., vol. 13, pp. 221–225, 1992. [104] J. Peng, A. Srikaew, M. Wilkes, K. Kawamura, and A. Peters, “An active vision system for mobile robots,” in Proc. IEEE Int. Conf. on Systems, Man and Cybernetics, (Nashville, TN, USA), pp. 1472–1477, Oct. 2000. [105] A. Pentland, “Looking at people: Sensing for ubiquitous and wearable computing,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 107–119, Jan. 2000. [106] P. Perona and J. Malik, “Scale-space and edge detection using ansotropic diffusion,” IEEE Trans. Pattern Anal. Machine Intell., vol. 12, pp. 629–639, July 1990. [107] R. Poole, “DVB-T transmissions - interference with adjacent-channel PAL services,” Tech. Rep. EBU-Winter-281, BBC Research and Development Department, 1999. [108] K. Pratt, Digital image processing. New York: John Wiley and Sons, Inc, 1978. [109] K. Rank, M. Lendl, and R. Unbehauen, “Estimation of image noise variance,” IEE Proc. Vis. Image Signal Process., vol. 146, pp. 80–84, Apr. 1999. [110] S. Reichert, “Comparison of contour tracing and filling methods,” Master’s thesis, Dept. Elect. Eng., Univ. Dortmund, Feb. 1995. In German. [111] A. Rosenfeld and C. Kak, Digital Picture Processing, vol. 2. 
Orlando: Academic Press, Inc., 1982.
[112] P. Rosin, “Thresholding for change detection,” in Proc. IEEE Int. Conf. Computer Vision, (Bombay, India), pp. 274–279, Jan. 1998.
[113] P. Rosin and T. Ellis, “Image difference threshold strategies and shadow detection,” in Proc. British Machine Vision Conf., (Birmingham, UK), pp. 347–356, 1995.
[114] Y. Rui, T. Huang, and S. Chang, “Digital image/video library and MPEG-7: Standardization and research issues,” in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, (Seattle, WA), pp. 3785–3788, May 1998. Invited paper.
[115] Y. Rui, T. Huang, and S. Chang, “Image retrieval: Current techniques, promising directions and open issues,” J. Vis. Commun. Image Represent., vol. 10, pp. 1–23, 1999.
[116] Y. Rui, T. Huang, and S. Mehrotra, “Relevance feedback techniques in interactive content based image retrieval,” in Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases, (San Jose, CA), pp. 25–36, Jan. 1998.
[117] P. Sahoo, S. Soltani, A. Wong, and Y. Chen, “A survey of thresholding techniques,” Comput. Vis. Graph. Image Process., vol. 41, pp. 233–260, 1988.
[118] P. Salembier, L. Garrido, and D. Garcia, “Image sequence analysis and merging algorithm,” in Proc. Int. Workshop on Very Low Bit-rate Video, (Linkoping, Sweden), pp. 1–8, July 1997. Invited paper.
[119] P. Salembier and F. Marqués, “Region-based representations of image and video: Segmentation tools for multimedia services,” IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 8, pp. 1147–1169, 1999.
[120] H. Schröder, “Image processing for TV-receiver applications,” in Proc. IEE Int. Conf. on Image Processing and its Applications, (Maastricht, The Netherlands), Apr. 1992. Keynote paper.
[121] H. Schröder, Mehrdimensionale Signalverarbeitung, vol. 1. Stuttgart, Germany: Teubner, 1998.
[122] T. Seemann and P. Tischer, “Structure preserving noise filtering of images using explicit local segmentation,” in Proc. Int. Conf. on Pattern Recognition, vol. 2, (Brisbane, Australia), pp. 1610–1612, Aug. 1998.
[123] Z. Sivan and D. Malah, “Change detection and texture analysis for image sequence coding,” Signal Process., Image Commun., vol. 6, pp. 357–376, Aug. 1994.
[124] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content-based image retrieval: the end of the early years,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 1349–1380, Dec. 2000.
[125] S. Smith, Feature Based Image Sequence Understanding. PhD thesis, Robotics Research Group, Department of Engineering Science, Oxford University, 1992.
[126] M. Spann and R. Wilson, “A quad-tree approach to image segmentation which combines statistical and spatial information,” Pattern Recognit., vol. 18, no. 3/4, pp. 257–269, 1985.
[127] COST211ter, “Call for analysis model comparisons.” On-line, http://www.tele.ucl.ac.be/EXCHANGE/.
[128] “Workshop on image analysis for multimedia interactive services.” Proc. COST211ter, Louvain-la-Neuve, Belgium, June 1997.
[129] “Special issue on segmentation, description, and retrieval of video content.” IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, Sept. 1998.
[130] “Special section on video surveillance.” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, Aug. 2000.
[131] J. Stauder, R. Mech, and J. Ostermann, “Detection of moving cast shadows for object segmentation,” IEEE Trans. on Multimedia, vol. 1, no. 1, pp. 65–76, 1999.
[132] C. Stiller, “Object-based estimation of dense motion fields,” IEEE Trans. Image Process., vol. 6, pp. 234–250, Feb.
1997.
[133] E. Stringa and C. Regazzoni, “Content-based retrieval and real time detection from video sequences acquired by surveillance systems,” in Proc. IEEE Int. Conf. Image Processing, (Chicago, IL), pp. 138–142, Oct. 1998.
[134] H. Sundaram and S. Chang, “Efficient video sequence retrieval in large repositories,” in Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases, vol. 3656, (San Jose, CA), pp. 108–119, Jan. 1999.
[135] R. Thoma and M. Bierling, “Motion compensating interpolation considering covered and uncovered background,” Signal Process., Image Commun., vol. 1, pp. 191–212, 1989.
[136] L. Torres and M. Kunt, Video Coding: Second Generation Approach. Kluwer Academic Publishers, 1996.
[137] O. Trier and A. Jain, “Goal-directed evaluation of binarization methods,” IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp. 1191–1201, Dec. 1995.
[138] P. van Donkelaar, “Introductory overview on eye movements.” On-line, http://www.lucs.lu.se/EyeTracking/overview.html, 1998.
[139] N. Vasconcelos and A. Lippman, “Towards semantically meaningful feature spaces for the characterization of video content,” in Proc. IEEE Int. Conf. Image Processing, vol. 1, (Santa Barbara, CA), pp. 25–28, Oct. 1997.
[140] P. Villegas, X. Marichal, and A. Salcedo, “Objective evaluation of segmentation masks in video sequences,” in Proc. Workshop on Image Analysis for Multimedia Interactive Services, (Berlin, Germany), pp. 85–88, May 1999.
[141] P. Zamperoni, “Plus ça va, moins ça va,” Pattern Recognit. Lett., vol. 17, pp. 671–677, June 1996.
[142] H. J. Zhang, C. Low, S. Smoliar, and J. Wu, “Video parsing, retrieval and browsing: An integrated and content-based solution,” in Proc. IEEE Conf. Multimedia, (San Francisco, CA), pp. 15–24, Nov. 1995.
[143] S. Zhu and A. Yuille, “Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation,” IEEE Trans. Pattern Anal. Machine Intell., vol. 18, pp. 884–900, Sept. 1996.
[144] Z. Zhu, G. Xu, Y. Yang, and J. Jin, “Camera stabilization based on 2.5-D motion estimation and inertial motion filtering,” in Proc. IEEE Int. Conf. on Intelligent Vehicles, pp. 329–334, 1998.
[145] F. Ziliani and A. Cavallaro, “Image analysis for video surveillance based on spatial regularization of a statistical model-based change detection,” in Proc. Int. Conf. on Image Analysis and Processing, (Venice, Italy), pp. 1108–1111, Sept. 1999.

Appendix A
Applications

This thesis has proposed a framework for object- and event-based video processing and representation. The framework uses two systems: an object-oriented shot analysis and a context-independent shot interpretation. The resulting video representation includes i) the shot’s global features, ii) objects’ parametric and qualitative low-level features, iii) object relationships, and iv) events. The following are samples of video applications that can benefit from the proposed framework.

• Video databases: retrieval of shots based on specific events and objects.
• Surveillance: automated monitoring of activity in scenes,
  ◦ detecting people, their activities, and related events such as fighting or overstaying,
  ◦ monitoring traffic and related events such as accidents and other unusual events,
  ◦ detecting hazards such as fires, and
  ◦ monitoring flying objects such as aircraft.
• Entertainment and telecommunications:
  ◦ video editing and reproduction,
  ◦ smart video appliances,
  ◦ dynamic video summarization, and
  ◦ browsing of video on the Internet.
• Human motion analysis:
  ◦ dance performance,
  ◦ athletic activities in sports, and
  ◦ smart environments for human interaction.

The next sections address three of these applications in more detail and suggest ways of using the video analysis and interpretation framework proposed in this thesis.

A.1 Video surveillance

Closed-Circuit Television (CCTV), or video surveillance, is a system for monitoring public, private, or commercial sites such as art galleries, residential districts, and stores. (Many video surveillance systems involve no recording of sound [96], which emphasizes the need for stable video analysis procedures.) Advances in video technologies, such as camcorder and digital video technology, have significantly increased the use of surveillance systems. Although surveillance cameras are widely used, video data is still mainly used as an ‘after the event’ tool to manually locate interesting events. Continuous active monitoring of surveillance sites, to alert human operators to events in progress, is required in many applications. Human resources to detect events or to observe the output of a surveillance system are expensive. Moreover, events typically occur at large time intervals, and system operators may lose attention and miss important events. There is, therefore, an increasing and immediate need for automated video interpretation systems for surveillance.

The goal of a video interpretation system for video surveillance is to detect, identify, and track moving objects, analyze their behavior, and interpret their activities (Fig. A.1(a)). A typical scene classification in a surveillance application is depicted in Fig. A.1(b).

Figure A.1: A definition of a content-based video surveillance system: (a) interpretation-based video surveillance, in which real-time video analysis produces objects and features, video interpretation produces scene events, and event-based decision and control drives alarms, data, and the man-machine interface of the surveillance operator; (b) scene classification in video surveillance (e.g., empty/occupied, normal/abnormal, fast, stop, long, deposit).

Interpretation methods for surveillance video need to consider the following conditions:
• Ever-changing conditions: object appearance and shape are highly variable, and many artifacts such as shadows, poor lighting, and reflections are present,
• Object occlusion and unpredictable object behavior,
• Fault tolerance: robust localization and recognition of objects in the presence of occlusion. An error must not stop the whole system, and special consideration should be given to inaccuracies and other sources of error in order to handle specific situations such as false alarms,
• The complexity and particular characteristics of each application, which may limit a wider use of general video processing systems, and
• Real-time processing (typical frame rates are 3-15 frames per second).
Considering these conditions and the definition of a video surveillance system (Fig. A.1), the techniques proposed in this thesis for video analysis and interpretation are suitable to meet the requirements of video surveillance applications.
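To make the data flow of Fig. A.1(a) concrete, the following minimal sketch outlines an interpretation-based monitoring loop. The function names, event labels, and record fields are hypothetical illustrations and not the interfaces of the implemented system; the analysis and interpretation levels of this thesis are abstracted as callables.

ALARM_EVENTS = {"deposit", "removal", "abnormal motion"}   # example event labels

def monitor(frames, analyze_frame, interpret, notify):
    """Process a frame stream, derive events, and alert the operator (cf. Fig. A.1(a))."""
    tracked = {}                                    # object id -> latest features
    for t, frame in enumerate(frames):
        objects = analyze_frame(frame, tracked)     # analysis level: objects and low-level features
        tracked.update({obj["id"]: obj for obj in objects})
        for event in interpret(tracked, t):         # interpretation level: context-independent events
            if event["label"] in ALARM_EVENTS:      # event-based decision and control
                notify(t, event)                    # alarm, data logging, man-machine interface

Keeping the decision rule to a simple membership test on event labels keeps the per-frame cost low, which is consistent with the real-time constraint of 3-15 frames per second mentioned above.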
A.2 Video databases

Considering the establishment of large video archives, such as for the arts, the environment, science, and politics, the development of effective automated video retrieval systems is a problem of increasing importance. For example, one hour of video represents approximately 0.5 Gigabyte and requires approximately 10 hours of manual cataloging and archiving, and a single clip requires 5-10 minutes for viewing, extraction, and annotation.

In a video retrieval system, features of a query shot are computed, compared to the features of the shots in the database, and the shots most similar to the query are returned to the user. Three models for video retrieval can be defined based on the way video content is represented. In the first model (low-level Query-By-Example), the user either sketches or selects a video query, e.g., after browsing the video database. A video analysis module extracts a low-level quantitative video representation. This representation is compared to stored low-level representations, and the video shots most similar to the query are selected. Comparison based on low-level quantitative parameters can be expensive, in particular when the dimension of the parameter vector is high. This model is suitable for unstructured (raw) video and for small databases. In the second model (high-level Query-By-Example), the user selects a video query, and the system finds a high-level video representation and compares high-level features to find similar shots. In the third model (Query-By-Description), the user specifies a qualitative, high-level description of the query, and the system compares this description with the stored descriptions in the database. Such a model is useful when the user cannot specify a video but has memorized a (vague) description of it.

In most existing object-based video retrieval tools, the user either sketches or selects a query example, e.g., by browsing. Browsing a large database can be time consuming, and sketching is a difficult task, especially for complex scenes. Since the subjects of the majority of video shots are objects and related events, this thesis suggests a retrieval framework (Fig. A.2) where the user either selects a shot or gives a qualitative, high-level description of a query shot as in Fig. A.3.

Figure A.2: Object and event-based framework for video retrieval: off-line shot analysis and interpretation (preprocessing, object segmentation, object tracking, motion estimation, global-motion estimation and interpretation, and spatio-temporal object and event interpretation) produce meta-data; on-line shot monitoring and retrieval analyzes and interprets the query and matches it against the stored global-motion and object/event descriptions to return the retrieved shots or objects.

The suggested framework for video retrieval, as given in Fig. A.2, aims at introducing functionalities that are oriented both to the way users usually describe and judge video similarity and to the requirements of efficient and reliable video interpretation that can forego precision. An advantage of the high-level video representation proposed in this thesis is that it allows the construction of user-friendly queries, based on the observation that most people’s interpretation of real-world domains is imprecise and that users, while viewing a video, usually memorize objects, their actions, and their locations, and not exact (quantitative) object features. In the absence of a specific application, such a generic model allows scalability (e.g., by introducing new definitions of object actions or events).

Using the proposed video interpretation of this thesis, users can formulate queries using qualitative object descriptions, spatio-temporal relationship features, location features, and semantic or high-level features (Fig. A.3). The retrieval system can then find video whose content matches some or all of these qualitative descriptions.

Figure A.3: An object and event-based query form (“Find video shots in which ...”), with fields for event specifications (an object appears in, disappears from, moves left/right/up/down in, rests within, or lies in the scene), spatial object locations (the left/right/top/bottom/center of the scene), spatial object features (shape, size, texture), object relations (left of, right of, above, below, inside, or near another object, e.g., within a circle of 50 pixels), and global shot specifications (global motion: zoom, pan, rotation, stationary; dominant objects with event, shape, size, and motion attributes).

Since video shot databases can be very large, pruning techniques are essential for efficient video retrieval. This thesis has suggested two methods for fast pruning: the first is based on the qualitative global motion detected in the scene, and the second on the notion of dominant objects (cf. Section 7.4.2).
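As an illustration of the kind of description-based matching and coarse pruning just outlined, the following minimal sketch compares a qualitative query against stored shot descriptions, discarding shots by global-motion class before testing events and dominant-object attributes. The record fields, labels, and matching rules are hypothetical stand-ins for the qualitative features produced by the interpretation level; they are not the exact representation or procedure of the implemented system.

shot_db = [
    {"id": "hall_001", "global_motion": "stationary",
     "events": {"deposit", "appear"},
     "dominant_object": {"size": "medium", "motion": "slow"}},
    {"id": "highway_017", "global_motion": "stationary",
     "events": {"appear", "disappear"},
     "dominant_object": {"size": "small", "motion": "fast"}},
]

query = {"global_motion": "stationary", "events": {"deposit"},
         "dominant_object": {"size": "medium"}}

def matches(shot, query):
    # Coarse pruning first: discard shots whose global-motion class differs.
    if shot["global_motion"] != query["global_motion"]:
        return False
    # All requested events must be present in the shot description.
    if not query["events"] <= shot["events"]:
        return False
    # Dominant-object attributes must agree wherever the query specifies them.
    return all(shot["dominant_object"].get(k) == v
               for k, v in query["dominant_object"].items())

retrieved = [s["id"] for s in shot_db if matches(s, query)]
print(retrieved)   # -> ['hall_001']

Because the descriptions are short and qualitative, the comparison reduces to set and label tests, which is far cheaper than computing distances between high-dimensional quantitative feature vectors.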
Using the proposed video interpretation of this thesis, users can formulate queries using qualitative object descriptions, spatio-temporal relationship features, location features, and semantic or high-level features (Fig. A.3). The retrieval system can then find video whose content matches some or all these qualitative descriptions. Since video shot databases can be very large, pruning techniques are essential for efficient video retrieval. This thesis has suggested two methods for fast pruning. The first is based on qualitative global motion detected in the scene, and the second on the notion of dominant objects (cf. Section 7.4.2). A.3 MPEG-7 Because of the increasing availability of digital audio and visual information in various domains, MPEG started the work on a new standard on Multimedia Content Description Interface, MPEG-7. The goal is to provide tools for the description of multimedia content where each type of multimedia data is characterized by a set of distinctive features. MPEG7 aims at supporting effective and efficient retrieval of multimedia based on their content features ranging from low-level to high-level features [114]. Fig. A.4 shows a high-level block diagram of a possible MPEG-7 processing chain. Both feature extraction and retrieval techniques are relevant in MPEG-7 activities but not part of the standard. MPEG-7 only defines the standard description of multimedia content and focuses on the inter-operability of internal representations of content descriptions. There are large dependencies between video representation, applications, and access to MPEG-7 tools. For example, tools for extracting and interpreting descriptions are essential for effective use of the upcoming MPEG-7 standard. On the other hand, a well-defined 199 Find video shots in which Event specifications object i Spatial object locations the scene appears in disappears from moves left in right in up in down in rests within lays in the left right top bottom center of the scene Spatial object features ... 11111111 00000000 texture (example) 00000000 11111111 shows Shape size (small, medium, large) object relations object i left right above below inside near within a circle of 50 pixels 50 pixels left/right/ ... to object j global shot specifications global motion: zoom pan rotation stationary dominant objects: 11111111 00000000 Event 00000000 11111111 Shape size (small, medium, large) Motion (slow, medium, fast) Figure A.3: An object and event-based query form. MPEG-7 standard will significantly benefit exchange among various video applications. Effective and flexible multi-level video content models that are user-friendly play an important role in video representation, applications, and MPEG-7. In the proposed system for video representation in this thesis, a video is seen as a collection of video objects, related meaning, local and global features. This supports access to MPEG-7 video content description models. Scope of MPEG-7 Multimedia Content Description extraction ( Feature Extraction & Indexing ) Description-based application Description Standard ( Search & retrieval tool and interface ) Figure A.4: Abstract scheme of a MPEG-7 processing chain. Appendix B Test Sequences B.1 Indoor sequences All test sequences used are real-world images that represent different typical environment of surveillance applications. Indoor sequences represent people walking in different environments. Most scenes include changes in illumination. The target objects have various features such as speed and shape. 
Appendix B
Test Sequences

B.1 Indoor sequences

All test sequences used are real-world image sequences that represent typical environments of surveillance applications. The indoor sequences show people walking in different environments. Most scenes include changes in illumination. The target objects have various features, such as speed and shape, and many of them contain shadowed regions.

‘Hall’: This is a CIF sequence (352 × 288, 30 Hz) of 300 images. It includes shadows, noise, and local illumination changes. A person enters the scene holding an object and deposits it; another person enters and removes an object. The target objects are the two persons and the deposited and removed objects.

‘Stairs’: This is an indoor CIF sequence (352 × 288, 25 Hz) of 1475 images. A person enters from the back door, goes to the front door, and exits. The same person returns and exits through the back door. Another person comes down the stairs, goes to the back door, then to the front door, and exits. The same person returns through the front door and goes up the stairs. This is a noisy sequence with illumination changes (for example, through the glass door) and shadows.

Figure B.1: Images of the ‘Hall’ shot, courtesy of the COST-211 group.
Figure B.2: Images of the ‘Stairs’ shot, courtesy of the COST-211 group.
Figure B.3: Images of the ‘Floor’ shot, INRS-Télécommunications.

‘Floor’: This is an SIF sequence (320 × 240, 30 Hz) of 826 images. It was recorded with an interlaced DV camcorder (320 × 480 pixels at a rate of 60 fields per second) and then converted to AVI. All A-fields are dropped and the resulting YUV sequence is progressive. This sequence contains many coding and interlace artifacts as well as shadows. Other sequences of the same environment were also used for testing.

B.2 Outdoor sequences

The selected test sequences are real-world image sequences. The main difficulty is coping with illumination changes, occlusion, and shadows.

‘Urbicande’: This is a CIF sequence (352 × 288, 4:2:0, 12.5 Hz) of 300 images. Several pedestrians enter, occlude each other, and exit. Some pedestrians enter the scene from buildings. One pedestrian remains in the scene for a long period of time and moves “suspiciously”. The sequence is noisy and has local illumination changes; some local flicker is visible. The objects are very small.

‘Survey’: This is an SIF sequence (320 × 240, 30 Hz) of 976 images. It was recorded with an analog NTSC-based camera and resized on a PC to 320 × 240. It has interlace artifacts, and a number of frames are dropped. The sequence was recorded at 60 Hz interlaced and converted to 30 Hz progressive video by merging the even and odd fields; this was done automatically since the original capture format was MPEG-1. Strong interlace artifacts are present in the constructed frames.

Figure B.4: Images of the ‘Urbicande’ shot, courtesy of the COST-211 group.
Figure B.5: Images of the ‘Survey’ shot, courtesy of the University of Rochester.

‘Highway’: This is a CIF sequence (352 × 288, 4:2:0, 25 Hz) of 600 images. The sequence was taken under daylight conditions from a camera placed on a bridge above the highway. Various vehicles with different features (e.g., speed and shape) are in the scene. The target objects are the moving (entering and leaving) vehicles. The challenge here is the detection and tracking of individual vehicles in the presence of occlusion, noise, and illumination changes.

Figure B.6: Images of the ‘Highway’ shot, courtesy of the COST-211 group.
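The ‘Floor’ and ‘Survey’ sequences above were derived from interlaced material. As a concrete illustration of the field merging mentioned for ‘Survey’, the following minimal sketch weaves an even and an odd field into one progressive frame; it uses NumPy arrays as stand-ins for decoded fields and illustrates the operation only, not the actual conversion tool used to prepare the sequences.

import numpy as np

def weave(even_field: np.ndarray, odd_field: np.ndarray) -> np.ndarray:
    """Interleave an even (top) and an odd (bottom) field into a full frame."""
    h, w = even_field.shape
    frame = np.empty((2 * h, w), dtype=even_field.dtype)
    frame[0::2, :] = even_field   # even lines come from the top field
    frame[1::2, :] = odd_field    # odd lines come from the bottom field
    return frame

# Example: two 120-line fields of a 320 x 240 frame
top = np.zeros((120, 320), dtype=np.uint8)
bottom = np.ones((120, 320), dtype=np.uint8)
print(weave(top, bottom).shape)   # (240, 320)

Weaving preserves full vertical resolution, but when the two fields come from different time instants it produces exactly the comb-like interlace artifacts noted for the constructed ‘Survey’ frames.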
Appendix C
Abbreviations

CCIR: Comité Consultatif International des Radiocommunications
HDTV: High Definition Television
PAL: Phase Alternate Line; television standard used extensively in Europe
NTSC: National Television Standards Committee; television standard used extensively in North America
Y: Luminance, corresponding to the brightness of an image pixel
UV/CrCb: Chrominance, corresponding to the color of an image pixel
YCrCb/YUV: A method of color encoding for transmitting color video images while maintaining compatibility with black-and-white video
MPEG: Moving Picture Experts Group
MPEG-7: A standard for the Multimedia Content Description Interface
COST: Coopération Européenne dans la recherche Scientifique et Technique
AM: Analysis Model
PSNR: Peak Signal-to-Noise Ratio
MSE: Mean Square Error
MBB: Minimum Bounding Box
MED: Median
LP: Low-pass
MAP: Maximum A Posteriori Probability
HVS: Human Visual System
2-D: Two-Dimensional
FIR: Finite Impulse Response
CCD: Charge-Coupled Device
DCT: Discrete Cosine Transform
dB: Decibel
IID: Independent and Identically Distributed
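For reference, the PSNR and MSE listed above are related, for an M × N image with 8-bit samples (peak value 255), by the conventional definitions

\[
\mathrm{MSE} = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\bigl(I(m,n)-\hat{I}(m,n)\bigr)^{2},
\qquad
\mathrm{PSNR} = 10\,\log_{10}\frac{255^{2}}{\mathrm{MSE}}\ \ \mathrm{[dB]},
\]

where \(I\) is the original image and \(\hat{I}\) the processed (e.g., noise-reduced) image.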