Multimedia Content Analysis
Dr. Alan Hanjalic
Information and Communication Theory Group, Department of Mediamatics, Delft University of Technology

What is Multimedia Content Analysis?
• MCA: a research direction within Multimedia Retrieval targeting the extraction of content-related information from multimedia data
[Figure: an algorithm turns audiovisual data into processed audiovisual data carrying content labels such as "News report on topic T", "Alpine landscape" or "Suspicious behavior"]

What is Multimedia Content Analysis?
• Multimedia CA versus Audiovisual CA?
• Search for synergy, not for a simple add-up:
- Integration of information from different modalities already at the low-level processing steps
- Combining features from different modalities in multi-modal content models
- Letting the "small pieces" of information from different modalities complement each other at various levels of the content analysis process, providing reliable input for reasoning at the highest level

MCA Case Study: Video Content Analysis (VCA)
• Video: "true" multimedia
- Synchronized visual, audio (music, speech) and text modalities
- Communicated information: a "synergy" of the information segments carried by the different modalities
• Vast popularity of digital video, driven by
- Compression technology
- High-capacity digital storage systems
- Affordable digital cameras
- Access to Internet and broadband networks
• Emerging huge collections of digital video – digital video libraries (DVL)

Benefits of VCA: Handling Digital Video Broadcasting (1)
• Broadcasters are moving to a digital video production, transmission and receiving chain
• Growing popularity of high-capacity digital storage devices among consumers
→ Huge numbers of video hours instantaneously accessible to the user

Benefits of VCA: Handling Digital Video Broadcasting (2)
• The explosion in consumer choice has important consequences for TV broadcast "consumption"
• Changes in the understanding of the broadcasting mechanism:
- The concept of a "channel" will lose its meaning
- Broadcast material will be recorded routinely and automatically
- Programs will be accessed on demand – from local storage
- Viewing of live TV is likely to diminish drastically over time

Benefits of VCA: Handling Digital Video Broadcasting (3)
• VCA can make a difference!
- Securing maximum transparency of the stored content
- Efficiently and effectively organizing and presenting the stored content to the user → automated video abstraction
- Channeling the stored video material to the user according to his preferences → personalized video delivery

Where else can VCA make a difference?
• Remote education
- Instructional video archives easily manageable, searchable and reusable
• Business
- Summarization and topic-based organization of conference/meeting videos
- Internet video collections easily manageable, searchable and reusable
• Security/Public Safety
- Smart cameras for video surveillance

Video Content Analysis: Data, Features, Semantics
• Features (signal/data properties)
- Color composition
- Shape and texture characteristics
- Camera and object motion intensity/direction
- Speech and audio signal properties
- …
• Semantics (signal/data meaning)
- News report on the Euro
- Car chase through NY
- Dialog between A and B
- An interview
- Happiness
- Romance
- …
• The semantic gap: how to get from signal properties to signal meaning?
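To make the "features" side of the gap concrete, here is a minimal Python sketch (function names, frame sizes and bin counts are invented for illustration) that computes two of the listed signal properties, a normalized color histogram and a crude motion-intensity estimate, for synthetic frames standing in for decoded video:

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Normalized per-channel color histogram of an RGB frame (H x W x 3)."""
    hist = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)]
    h = np.concatenate(hist).astype(float)
    return h / h.sum()  # normalization makes frames of any size comparable

def motion_intensity(prev_frame, frame):
    """Crude motion proxy: mean absolute luminance difference of two frames."""
    gray = lambda f: f.astype(float).mean(axis=2)
    return float(np.abs(gray(frame) - gray(prev_frame)).mean())

# Random arrays stand in for two decoded video frames.
f0 = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
f1 = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
print(color_histogram(f1).round(3), motion_intensity(f0, f1))
```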
Development of the VCA research area
• Low-level VCA:
- Shot-boundary detection (1991 - ): parsing a video into continuous camera shots
- Still and dynamic video abstracts (1992 - ): making video browsable via representative frames (keyframes); generating short clips carrying the essence of the video content
• High-level VCA:
- High-level parsing (1997 - ): parsing a video into semantically meaningful segments
- Automatic annotation (indexing) (1999 - ): detecting prespecified events/scenes/objects in video
- Affective VCA (2001 - ): extracting the moods and emotions conveyed by the video content
• Future: Multimedia Content Mining and Knowledge Discovery!

Toward the Meaning of Video Data: How to bridge the semantic gap?
• The "catch": integration of feature-based evidence and domain knowledge
• Example: finding the appearances of a news reader in TV news
• Assumptions based on domain knowledge:
- Shots with the news reader are characterized by a (more-or-less) constant studio background
- The face of a news reader is the only face in the program appearing multiple times in a (more-or-less) constant setting
- The speech signal characteristics are the same in all shots featuring the same news reader
• Important: the level of prior knowledge
- depends on the context and nature of the application
- determines the flexibility and applicability scope of a method

Accessing Video by Text Labels
• "Classical" video indexing approach: classify video clips according to content labels
• A content label
- serves as an index (annotation) for a given video clip in a video archive
- enables easy retrieval of video clips
- is typically prespecified by the user
• Examples of labels:
- News: "Parliament", "United Nations", "Amsterdam", "Foreign politics"
- Movie: "Action", "Romance", "Dialog", "Car chase"
- Wildlife documentary: "Hunt", "Running lion"
[Figure: video indexing attaches labels such as "News report on topic T", "Dialog" or "Score in a soccer game" to video clips]

Video Indexing: General approach
• Generally a pattern classification problem:
- Assign a label to a clip according to the match between the content model representing the label and the pattern formed by the features of the data in the clip
• A simple, illustrative example (Yeung and Yeo, 1997):
- Apply time-constrained clustering to the shots
- Label all shots from one and the same cluster with the same letter
- Search for content patterns in the resulting series of labels, e.g. in

XYZABABABCDEFEDEGHIABCDEBFGEHIBABCE

the alternation ABABAB signals a dialog and a run of quickly changing labels signals action, with a minimum allowed repetition of labels filtering out "noise" labels (sketched below)
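A minimal sketch of this pattern search (the function name and the min_reps parameter are illustrative choices, not the authors' implementation): it scans the label series for two alternating cluster labels repeated at least a minimum number of times, the dialog signature.

```python
def find_dialogs(labels, min_reps=3):
    """Return [start, end) index pairs of ABAB... runs of two alternating
    shot-cluster labels, repeated at least min_reps times."""
    hits, i = [], 0
    while i < len(labels) - 1:
        a, b = labels[i], labels[i + 1]
        j = i
        # extend while the strict A/B alternation continues
        while j < len(labels) and labels[j] == (a if (j - i) % 2 == 0 else b):
            j += 1
        reps = (j - i) // 2  # number of complete AB pairs
        if a != b and reps >= min_reps:
            hits.append((i, j))
            i = j
        else:
            i += 1
    return hits

# The label series from the slide: the ABABAB run is reported as a dialog.
print(find_dialogs("XYZABABABCDEFEDEGHIABCDEBFGEHIBABCE"))  # -> [(3, 9)]
```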
Video Indexing via Content Modeling
• More indexing robustness and more difficult indexing criteria require a more sophisticated indexing approach
• A promising approach:
- Define a general hierarchy of semantic concepts
- Define probabilistic content models at each level of the hierarchy
- Put the different concepts in probabilistic interrelation with each other using a networked model based on prior knowledge
→ allows flexibility in the feature-concept and concept-concept relations
→ spreads the uncertainty in these relations over different nodes and layers
- Assign label X to a clip if model X is found with a sufficiently high confidence
• NOTE: seamless use of different modalities at different model levels

General Hierarchy of Semantic Concepts
• High-level concepts: "Topics"
- The most general content aspects of video clips
- Examples: action movie scene, suspicious human behavior, news topic T
• Mid-level concepts: "Events"
- Have dynamic audiovisual content and serve as the main components of a "topic"
- Examples: goal, human talking, dialog, moving car, hunt, explosion
• Low-level concepts: "Sites" and "Objects"
- Static site and object instances, computed from the features of the video frames
- Examples: car, cityscape, chair, snow, sky, indoor, lion, sunset, outdoor

Modeling Low-Level Semantic Concepts: Example approach
• The concept of "multijects" (Naphade and Huang, 2002)
• A model of a semantic concept X, e.g. P("Indoor" | features, other multijects)
- Takes as input the features computed in the clip and the weighted probabilities of the presence of other semantic concepts in the clip ("Sky", "Bed", "Chair", "Snow", …)
- Gives as output the probability that concept X is present in the clip
- Can be realized as e.g. a mixture of Gaussians
• The weights reveal the likely co-occurrence of concepts:
- "Sky" and "Snow" reduce the confidence in detecting "Indoor"
- "Indoor" is more likely if "Chair" or "Bed" have already been detected

Modeling Medium-Level Semantic Concepts
• Hidden Markov models (HMMs): a practical and effective mechanism for modeling time-varying patterns
• An HMM-based "event" model is a complex "multiject":
- The observations consist of features
- The prior probabilities depend on the presence of other semantic concepts
- The output represents the posterior probability of the modeled event
• Complex events → complex HMM models
- Event-coupled and hierarchical HMMs
[Figure: a "hunt" HMM with the states non-hunt, beginning of hunt, 2nd hunt shot, valid hunt and end of hunt]

Example Approach
• Detection of "Play" and "Break" segments in baseball (Li and Sezan, 2001)
• Shot classification into "Start-of-play", "In play", "End-of-play" and "No play", with an HMM governing the transitions between these classes

Modeling High-Level Semantic Concepts
• Bring the low- and medium-level content models into probabilistic relation with each other, e.g. via Bayesian belief networks or factor graphs
• Example: the "Multinet" (Naphade and Huang, 2002)
- Multijects (e.g. "Skydiving", "Bird", "Airplane", "Shark", "Indoor", "Water") serve as nodes
- Positive and negative links between the multijects encode prior knowledge regarding
- the co-occurrence of different semantic concepts: "Shark" is unlikely to co-occur with "Bird"; "Shark" is supported by "Water"
- contextual constraints (e.g. spatio-temporal ones): "Sky" is always above "Water"; speech synchronous with facial expressions models the event of "Human talking"

Multi-Segment Video Indexing
• An HMM whose states are the content categories (Category 1 … Category N, plus "Miscellaneous")
• The state transition probabilities, e.g. P(Cat.1 | Cat.N), direct the category changes from one segment to the next
• All segments are classified simultaneously according to the most probable path through the HMM (see the sketch below)
[Figure: a video sequence whose successive segments are labeled Cat.1, Cat.3, Misc., Cat.4, Cat.N, Cat.5]
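The "most probable path" step is standard Viterbi decoding. Below is a sketch under the assumption that per-segment category log-likelihoods and the transition matrix are already available; all numbers are toy values, not a trained model:

```python
import numpy as np

def viterbi(log_emit, log_trans, log_prior):
    """Most probable category sequence for a series of segments.
    log_emit[t, s]: log-likelihood of segment t under category s;
    log_trans[i, j]: log-probability of moving from category i to j."""
    T, S = log_emit.shape
    delta = log_prior + log_emit[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)       # best predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # S x S: predecessor -> successor
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # walk the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy run: 3 categories, 5 segments, uniform prior and transitions.
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(5, 3)),
              np.log(np.full((3, 3), 1 / 3)),
              np.log(np.full(3, 1 / 3))))
```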
Video Indexing via Content Modeling: What did we learn?
• Main principle:
- Use expert knowledge to compose high-level content models
- Train the model parameters using an immense quantity and diversity of training data
- Apply the models to new data to detect occurrences of the modeled semantic concept; if the confidence is high, assign the label!
• The more knowledge, the more sophisticated the labels:
- "Ronaldo making a sudden move to the right and catching the ball with his left leg while scratching his nose with his right arm"

Video Indexing via Content Modeling: Some observations
• (Probabilistic) class modeling is one of the basic approaches in pattern classification (generative classification)
- A straightforward "jump" from pattern classification to MCA
• Increasing tendency to "classify anything one gets hold of!"
- An abundance of narrow-scope/narrow-domain MCA solutions
• Examples of recent results:
- Face detectors that cannot handle all face poses, camera views, occlusions, lighting conditions, complex backgrounds, …
- Goal detectors that can handle specific camera views only
- Tools capable of finding Bill Clinton in front of an American flag (but, please, don't remove the flag!)
• Main problem: the solutions are not (sufficiently) scalable
- Too sensitive to the training data, too inflexible and too much based on heuristics

The curse of domain knowledge
• Bridging the semantic gap by integrating feature-based evidence and domain knowledge
• Example problem: searching for video clips showing a Ferrari
• Narrowing the scope of the problem, e.g. to Formula 1 (red color, keywords, logos): more domain knowledge → a narrower semantic gap
• Advantage:
- A well-defined narrow-scope problem can be solved successfully
• Disadvantage:
- The number of specific scenarios needed to cover the whole problem grows toward ∞, and one solution is needed per scenario
→ Not (always) the way to go!

An analogy: Image compression
• Image-specific compression method:
- Take an image, analyze its pixel content, and use the analysis results to develop a compression method with optimized rate-distortion performance for that image
- May lead to optimized rate-distortion performance for the considered image, but is practically irrelevant!
• Generic compression principle:
- Analyze the general image properties relevant for compression
- Develop a generic compression principle that may work well for every image (e.g. redundancy and irrelevance removal)
- Optimize the PRINCIPLE, not the coding performance for a single image!

Benefits of working on a generic principle
• THE way to
- approach solving the problem COMPLETELY
- increase the robustness of VCA solutions, due to the strong theoretical foundation of the principle
- secure performance constancy across the entire target domain
- make turning research into technology feasible: cross-domain applicability becomes possible
• Concentrating research manpower on the same problem
- Many brains working on one challenge, instead of on many scattered small problems
→ Joint successful standardization activities!

An Example: A Glimpse at a Surveillance Application
• Modeling and classification based on prespecified events (high-level prior knowledge) is possible, but does not lead to a practical, widely applicable solution:
- What is a suspicious event? We don't know a priori!
- And we don't care what it is precisely! Just alert me if it is suspicious!
• A possible alternative approach:
- Use prior knowledge only at the lowest inference levels
- Let the system learn the highest-level inference itself, e.g. based on sporadic feedback from the user
- No scattered event-detection modules, but a generic system based on one basic principle!

Automated surveillance: A concept of an ideal solution
• Audio and video sensor outputs, interpreted with low-level prior knowledge, are fused by a suspiciousness model (an integration of modalities, adapted via relevance feedback)
• The model outputs a level-of-suspiciousness curve over time; a threshold, set according to the desired alertness level, turns every excursion above it (a potential event of interest) into an alert (see the sketch below)
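A hedged sketch of that thresholding step, with a made-up suspiciousness curve and threshold; it returns the index intervals that would raise alerts:

```python
import numpy as np

def alert_intervals(level, threshold):
    """(start, end) index pairs where the fused suspiciousness level
    stays above the threshold set by the desired alertness level."""
    above = np.concatenate(([False], np.asarray(level) > threshold, [False]))
    edges = np.flatnonzero(np.diff(above.astype(int)))
    return list(zip(edges[::2], edges[1::2]))

# A toy fused suspiciousness curve; each interval would trigger an alert.
curve = np.array([0.1, 0.2, 0.8, 0.9, 0.3, 0.1, 0.7, 0.7, 0.2])
print(alert_intervals(curve, threshold=0.5))  # -> [(2, 4), (6, 8)]
```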
Smart Camera's Suspiciousness Model: A generic development approach
• Some "ingredients":
- People detection
- People counting
- People/group motion detection and recognition
- New still-object detection ("somebody left a suitcase!")
- Extremes in the "perceived" audio signals
- …
• Model development (integration):
- Calibration based on context
- Adaptation (learning) using relevance feedback
- Translating the input/computed cues into confidences
- Integrating all confidences into the confidence for the overall suspiciousness level
• System currently under development at the ICT Group!

Where the "curve" idea already works: Affective MCA
• Affective MCA: extraction of affect (feelings, emotions, moods) from the data/signal
- Extracting affect-related features from different modalities
- Combining the features in affect models
- Indexing temporal signal segments in view of their affective properties
• Importance:
- Affect-based indexing ("find me exciting, relaxing, funny content")
- Personalized content recommendation ("I'm in the mood for …")
- Highlights extraction (e.g. "find the 10 most exciting minutes of …")
- Automated surveillance (e.g. detection of aggression, fights, tension)
• Relation to state-of-the-art MCA research:
- So far the emphasis has been on "cognitive" content, i.e. "facts" (temporal content structure, scene composition and type, contained objects/events)

An example: Extracting moods from the AV signal
• A "straightforward" approach:
- Pick a set of moods that you are interested in extracting
- Pick a set of features and a classification mechanism
- Collect training data, train the classifier and classify!
• Problems:
- Affect is too abstract to be modeled explicitly: which color, texture, motion or structure is to be linked to joy, sadness, tension or happiness?
- The prespecified set of moods can hardly be exhaustive
- The immense variety of content corresponding to a given mood: where to get the training data, and how to generalize the obtained results?

Searching for features: Ask people who know more about it!
• Advertising people:
- The power of color to induce product-related mood effects
- Combining color with scene structure to enhance the effect
• Psychophysiology people:
- The impact of motion on the emotional responses of viewers
• Cinematography people:
- The patterning of shot lengths
• HCI people:
- Determining the emotion in speech and audio

From features to moods
• Remaining problems:
- The feature-mood relations are rather vague
- Vast variety of content for a given mood
• A solution inspired by psychophysiology:
- Uncouple the connection between features and moods by introducing an intermediate layer!
• The intermediate layer is defined by
- Arousal (the intensity of affect): the level of excitement
- Valence (the type of affect): from pleasant to unpleasant

The Valence-Arousal paradigm*
• All moods extracted from video can be mapped onto an emotion space created by the arousal and valence axes
- Affective states → points in the 2D affect space
• Arousal and valence
- can be measured using physiological functions (e.g. heart rate, skin conductance)
- can be linked to a number of audiovisual features by means of user tests
* Dietz and Lang: Æffective agents: Effects of agent affect on arousal, attention, liking and learning, 3rd ICTC, 1999

Arousal, Valence and the Affect Curve*
• Video has a temporal content flow, with smooth transitions from one affective state to another
• Measuring arousal and valence along a video yields the arousal and valence time curves!
• Combining the arousal and valence time curves gives the affect curve: a trajectory through the 2D affect space
[Figure: (a) arousal over time, (b) valence over time, (c) the resulting affect curve in the valence-arousal plane]
* Hanjalic and Xu: Affective video content representation and modeling, IEEE Trans. on Multimedia, February 2005

Example: A Model for the Arousal Curve*
• Arousal time curve: A(k) = F(G_i(k), i = 1, …, N)
- N: the number of features considered in the model
- G_i(k): models the arousal variations revealed by feature i
- F: integrates the contributions of all features
• Three features are measured along the video, per frame k:
- motion activity m(k)
- density of cuts c(k)
- sound energy e(k)
• Each feature time curve is smoothed and scaled to obtain G_i(k)
• F is a weighted average of the components G_i(k) (see the sketch below)
* Hanjalic and Xu: Affective video content representation and modeling, IEEE Trans. on Multimedia, February 2005
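A minimal sketch of this arousal model, assuming the three per-frame feature curves have already been extracted. A plain moving average stands in for the smoothing kernel and uniform weights for F, so this illustrates the structure A(k) = F(G_i(k)) rather than the published parameterization:

```python
import numpy as np

def smooth_and_scale(x, win=250):
    """G_i(k): a feature curve smoothed over ~win frames, scaled to [0, 1]."""
    s = np.convolve(x, np.ones(win) / win, mode="same")
    span = s.max() - s.min()
    return (s - s.min()) / (span if span > 0 else 1.0)

def arousal_curve(motion, cut_density, sound_energy, weights=(1/3, 1/3, 1/3)):
    """A(k) = F(G_i(k)): weighted average of the smoothed, scaled components."""
    feats = (motion, cut_density, sound_energy)
    G = [smooth_and_scale(np.asarray(f, float)) for f in feats]
    return sum(w * g for w, g in zip(weights, G))

# Toy per-frame feature curves standing in for m(k), c(k) and e(k).
rng = np.random.default_rng(1)
n = 5000
A = arousal_curve(rng.random(n), rng.random(n) ** 4, rng.random(n))
print(A.shape, round(float(A.max()), 3))
```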
Extracting moods using the Valence-Arousal Paradigm*
• An effective and generic alternative to the content modeling approach
• Possibility to optimize the measurement side and the inference side separately and on-the-fly:
- Measurement side: extract valence- and arousal-related features from the AV signal and combine them into the affect curve
- Inference side: use psychophysiological knowledge to translate a user query ("horror thrill", "hilarious fun", "romantic feel-good", …) into a mood → VA value range, and match that range against the affect curve
* Hanjalic: Extracting Moods from Pictures and Sounds: Towards Truly Personalized TV, IEEE Signal Processing Magazine, March 2006

Personalized video content delivery: Affective user profile generation*
• A user profile is obtained by
- collecting the affect curves of all programs watched in the past
- condensing the prevailing moods into areas of interest in the affect space
• For the affect curve extracted from a new video: is there any overlap between its prevailing mood and the user's areas of interest?
* Hanjalic: Extracting Moods from Pictures and Sounds: Towards Truly Personalized TV, IEEE Signal Processing Magazine, March 2006

Personalized video content delivery: Browsing the 2D affect space*
• The user moves a pointer and scans the affect space
• Each position returns a list of the videos already known to the system that have a similar prevailing mood (video list 1, 2, …)
* Hanjalic: Extracting Moods from Pictures and Sounds: Towards Truly Personalized TV, IEEE Signal Processing Magazine, March 2006

Another example of a paradigm shift: Detecting soccer highlights
• Classical approach:
- Select a number of highlight-like events
- Train a sufficient number of event detectors
- Use the event detectors to detect the highlights
• Paradigm shift: use the link between highlights and excitement!
• An "outside the box" solution:
- Model the excitement along a video as an arousal time curve
- Select the soccer video segments with maximum excitement by thresholding the arousal time curve

Soccer highlights extraction: A simple realization*
• Cut off the peaks of the arousal time curve such that the material above the cut-off line fills the desired duration; the peaks are the highlights (see the sketch below)
* Hanjalic: Generic approach to highlights detection in a sports video, IEEE ICIP 2003
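One possible realization of "cutting off the peaks in the desired duration", sketched under the assumption that a bisection on the height of the cut-off line is acceptable; the paper only requires that the selected peaks fill the desired duration:

```python
import numpy as np

def highlight_threshold(arousal, max_frames, iters=40):
    """Bisect on the height of the cut-off line until the frames above it
    (the highlights) fit within the desired duration."""
    lo, hi = float(arousal.min()), float(arousal.max())
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (arousal > mid).sum() > max_frames:
            lo = mid  # too much material selected: raise the line
        else:
            hi = mid  # selection fits the budget: try lowering the line
    return hi

# Toy arousal curve; the highlights are wherever it pokes above the line.
t = np.linspace(0, 20, 2000)
curve = np.abs(np.sin(t)) + 0.1 * np.sin(7 * t)
thr = highlight_threshold(curve, max_frames=300)
print(round(thr, 3), int((curve > thr).sum()))  # about 300 frames survive
```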
Soccer highlights extraction: A more sophisticated realization*
• Possibility to influence the composition of a highlighting sequence of fixed duration
• Considering the highlight "strength" via the number of "reacting" features: maximum selectiveness requires all features to react, less selectiveness requires fewer
* Hanjalic: Generic approach to highlights detection in a sports video, IEEE ICIP 2003

Maximum Selectiveness
• Weighting the arousal components with respect to the "weakest" one:

G_i'(k) = G_i(k) · w(k), i = 1, …, 3

with

w(k) = (1/2) · [1 + erf((min_i G_i(k) − d) / σ)]

and

erf(x) = (2/√π) · ∫₀ˣ e^(−t²) dt

[Figure: the weighting function w(x), a smooth step rising around x = d with a slope controlled by σ]

Adaptive "filtering" of the arousal curve*
[Figure: the same arousal curve filtered with maximum selectiveness versus less selectiveness]
* Hanjalic: Generic approach to highlights detection in a sports video, IEEE ICIP 2003

Affective MCA: still a Grand Challenge
• Need for more solid links between the affect dimensions and the features
• Need for more sophisticated integrative models of the affect dimensions
• Need for optimal ways of employing affect measurements for personalization:
- Affect curve abstraction and representation
- …

Other Challenges in MCA
• Multimedia Content Mining and Knowledge Discovery
- Extracting key content elements or multimedia keywords (the equivalent of keywords in text)
- Revealing the semantic temporal content structure of a general multimedia data stream (the equivalent of text segmentation)
- Finding semantically meaningful data clusters at different granularity levels
- Indexing the found clusters and segments by multimedia keywords
• Cross-Media Learning and Reasoning
- Linking the persons, objects, events and scenes with the words appearing in the accompanying speech or overlay text
- Learning the text-video-audio links on-the-fly (e.g. self-learning)
• Importance: enabling a "Multimedia Google"

Computing Text Content Similarity (1)
• The similarity between the texts of clips m and n is obtained on the basis of
- the number of shared words
- the similarity of their significance in both clips
• Word significance is computed based on
- TF, the word frequency (how many times a word occurs)
- IDF, the collection frequency (how exclusive or unique a word is for a clip)
- the document length (serves to normalize the above two measures)
• The significance of word k in text t_m is expressed by the weight w(k, t_m)
• The feature vector V used for clip comparison consists of the word weights
• Clip similarity is computed using a cosine measure over K, the "joint" vocabulary of the two clips:

S(m, n, V) = cos(t_m, t_n) = Σ_K w(k, t_m) · w(k, t_n) / √(Σ_K w²(k, t_m) · Σ_K w²(k, t_n))
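A small self-contained sketch of the weighting and cosine measure above, under the usual reading of TF and IDF; the three-"clip" corpus is invented for illustration:

```python
import math
from collections import Counter

def tfidf_weights(doc, all_docs):
    """w(k, t): term frequency times log inverse collection frequency."""
    N = len(all_docs)
    tf = Counter(doc)
    return {k: tf[k] * math.log(N / sum(k in d for d in all_docs)) for k in tf}

def cosine(wm, wn):
    """S(m, n) over the joint vocabulary K of the two clips."""
    K = set(wm) | set(wn)
    num = sum(wm.get(k, 0.0) * wn.get(k, 0.0) for k in K)
    den = math.sqrt(sum(w * w for w in wm.values()) *
                    sum(w * w for w in wn.values()))
    return num / den if den else 0.0

docs = [["volcano", "lava", "eruption"], ["volcano", "ash"], ["euro", "bank"]]
w = [tfidf_weights(d, docs) for d in docs]
print(round(cosine(w[0], w[1]), 3), cosine(w[0], w[2]))  # related vs unrelated
```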
Computing Text Content Similarity (2)
• Improved performance by using more sophisticated text analysis techniques:
- Applying stemming to capture word derivatives (e.g. endings)
- Exploiting dissimilar but semantically related words (e.g. "volcano" and "lava") by using a thesaurus or a collocation network
- Inferring subtle semantic relations between words using e.g. Latent Semantic Analysis (LSA)

From textual to multimedia keywords: An audio example*
• Identify audio "words": clusters of elementary audio segments with similar signal properties
- Requires suitable features, a similarity metric and a clustering mechanism
• Define the similarity between audio words so that they can be "counted":
- Eliminate noise from the feature vectors by working with the dominant features (e.g. via Singular Value Decomposition)
- Probabilistic "counting" based on the level of "word" similarity
- "Expected" TF and IDF instead of the exact ones!
* Lu, Hanjalic: Towards Optimal Audio "Keywords" Detection for Audio Content Analysis and Discovery, ACM Multimedia 2006

Illustration: Text-like Audio Scene Segmentation*
• Compute the co-occurrence and significance of the audio words
• Define the semantic affinity between segments s_i and s_j as a function of
- the co-occurrence Co(e_i, e_j) of their audio elements
- their mutual distance T(s_i, s_j)
- the probabilities P_{e_i}, P_{e_j} that they are key audio segments ("keywords"):

A(s_i, s_j) = Co(e_i, e_j) · e^(−T(s_i, s_j)/T_m) · P_{e_i} · P_{e_j}

• The confidence for an audio scene boundary at time t compares the segments in a left buffer L_t and a right buffer R_t around the potential boundary:

C(t) = (1/W) · Σ_{s_i ∈ L_t} Σ_{s_j ∈ R_t} A(s_i, s_j)

• The weight W = Σ_{s_i ∈ L_t} Σ_{s_j ∈ R_t} P_{e_i} · P_{e_j} serves to unbias the confidence (see the sketch below)
* Lu, Cai and Hanjalic: Audio Elements based Auditory Scene Segmentation, IEEE ICASSP 2006
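A sketch of this boundary-confidence computation as reconstructed above (the factorization of A is read off the garbled slide and may differ in detail from the paper). Segments are modeled as (audio element, time, key-segment probability) triples, and the co-occurrence table is a made-up toy:

```python
import math

def boundary_confidence(left, right, co, T_m):
    """C(t) for a candidate boundary, given the segments in the left and
    right buffers; each segment is an (element_id, time, key_prob) triple."""
    num = W = 0.0
    for ei, ti, pi in left:
        for ej, tj, pj in right:
            affinity = co.get((ei, ej), 0.0) * math.exp(-abs(tj - ti) / T_m)
            num += affinity * pi * pj
            W += pi * pj
    return num / W if W else 0.0

# Toy co-occurrence table and buffers around a candidate boundary at t = 10.
co = {("speech", "speech"): 1.0, ("music", "music"): 1.0,
      ("speech", "music"): 0.2, ("music", "speech"): 0.2}
L = [("speech", 9.0, 0.9), ("speech", 9.5, 0.8)]
R = [("music", 10.2, 0.9), ("music", 10.8, 0.7)]
print(round(boundary_confidence(L, R, co, T_m=16.0), 3))  # low cross-affinity
```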
Illustration: Content discovery in composite audio*
[Figure: a two-stage framework with a text analogy. Stage (I), cf. word parsing and keyword selection in documents/web pages: feature extraction, context-based scaling factors and an iterative spectral clustering scheme yield audio elements; importance measures and index-term selection yield key audio elements. Stage (II), cf. grouping documents with similar topics: BIC-based estimation of the cluster number, auditory scene detection and information-theoretic co-clustering based auditory scene categorization yield auditory scene groups.]
* Cai, Lu, Hanjalic: Unsupervised content discovery in composite audio, ACM Multimedia 2005

References
• Boggs J.M., Petrie D.W.: The Art of Watching Films, 5th ed., Mountain View, CA: Mayfield, 2000
• Cai R., Lu L., Hanjalic A.: Unsupervised content discovery in composite audio, ACM Multimedia 2005
• Dietz R., Lang A.: Æffective agents: Effects of agent affect on arousal, attention, liking and learning, 3rd International Cognitive Technology Conference (CT'99), 1999
• Hanjalic A., Lagendijk R.L., Biemond J.: Automated high-level movie segmentation for advanced video-retrieval systems, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 4, pp. 580-588, June 1999
• Hanjalic A., Xu L.-Q.: Affective video content representation and modeling, IEEE Transactions on Multimedia, February 2005
• Hanjalic A.: Generic approach to highlights detection in a sports video, IEEE ICIP 2003
• Hanjalic A.: Content-Based Analysis of Digital Video, Kluwer/Springer, 2004
• Hanjalic A., Kakes G., Lagendijk R.L., Biemond J.: Indexing and retrieval of TV broadcast news using DANCERS, Journal of Electronic Imaging, October 2001
• Jiang et al.: Video segmentation with the assistance of audio content analysis, IEEE ICME 2000
• Kender J.R., Yeo B.-L.: Video scene segmentation via continuous video coherence, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1998
• Li B., Sezan M.I.: Event detection and summarization in sports video, IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL 2001), pp. 132-138, 2001
• Lu L., Cai R., Hanjalic A.: Audio elements based auditory scene segmentation, IEEE ICASSP 2006
• Naphade M., Huang T.S.: Extracting semantics from audiovisual content: The final frontier in multimedia retrieval, IEEE Transactions on Neural Networks, Vol. 13, No. 4, July 2002
• Rui Y., Huang T.S., Mehrotra S.: Constructing table-of-content for videos, Multimedia Systems, Special Section on Video Libraries, 7(5), pp. 359-368, 1999
• Sundaram H., Chang S.-F.: Determining computable scenes in films and their structures using audio-visual memory models, ACM Multimedia 2000
• Yeung M., Yeo B.-L.: Time-constrained clustering for segmentation of video into story units, Proceedings of the International Conference on Pattern Recognition (ICPR'96), pp. 375-380, 1996
• Yeung M., Yeo B.-L.: Video visualization for compact presentation and fast browsing of pictorial content, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, pp. 771-785, 1997
© Copyright 2024