Character retrieval and annotation in multimedia – or "How to find Buffy"
Andrew Zisserman (work with Josef Sivic and Mark Everingham)
Department of Engineering Science, University of Oxford, UK
http://www.robots.ox.ac.uk/~vgg/

The objective (I)
Retrieve all shots/clips in a video, e.g. a feature-length film, containing a particular person: a visually defined search – on faces. Example: "Pretty Woman" [Marshall, 1990].

Matching faces in video
"Pretty Woman" [Marshall, 1990]: are these faces of the same person?

The objective (II)
Automatically annotate characters in video with their identity, e.g. Edward Lewis (Richard Gere) and Vivian (Julia Roberts) in "Pretty Woman" [Marshall, 1990]. This requires additional information.

Why this is difficult: uncontrolled viewing conditions
Image variations due to:
• pose/scale
• lighting
• partial occlusion
• expression
c.f. standard face databases.

Matching faces
Are these images of the same person? This can be difficult for individual examples, but is easier for sets of faces.

The benefits of video
Expression exemplars can be associated automatically.

Approach
a. Minimize variations due to pose, lighting and partial occlusion by the choice of feature vector; focus on near-frontal faces (use a frontal face detector).
b. Use multiple faces to represent expressions: sets of faces associated by tracking across frames.
c. Represent each identity by a distribution over a face set: efficient matching or classification for sets of faces.

Outline
1. Obtaining and matching face sets (the aim is to obtain large face sets)
   a. face detection and representation
   b. obtaining face sets by tracking within shots
   c. matching face sets within and between shots
2. Application I: retrieval of shots containing a particular person ("Video Google" on faces)
3. Application II: annotation of person identity, using text and speaker detection as weak supervision (multimedia)
4. Extensions to human matching

a. Face detection and representation

Face detection – Viola & Jones [2001]
1. Detect a face in a rectangular window of fixed size: learn a face/non-face classifier from training data.
2. Scan the window face detector over each pixel in the image (translation) and over multiple sizes (scale).
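As a rough sketch of step 2 (an illustration, not the authors' code), the following scans a fixed-size window classifier over all positions and scales. Here classify_window stands in for a trained face/non-face classifier such as a Viola-Jones cascade, and the window size, stride and scale step are arbitrary choices.

    import numpy as np

    def resize_nn(img, new_h, new_w):
        # nearest-neighbour downscaling, to keep the sketch dependency-free
        ys = np.arange(new_h) * img.shape[0] // new_h
        xs = np.arange(new_w) * img.shape[1] // new_w
        return img[ys][:, xs]

    def sliding_window_detect(image, classify_window, win=24, stride=2, scale_step=1.25):
        # Scan a fixed-size face/non-face classifier over every pixel position
        # (translation) and over a pyramid of image sizes (scale).
        detections = []
        img = image.astype(np.float32)
        scale = 1.0
        while min(img.shape) >= win:
            h, w = img.shape
            for y in range(0, h - win + 1, stride):
                for x in range(0, w - win + 1, stride):
                    score = classify_window(img[y:y + win, x:x + win])
                    if score > 0:  # window classified as a face
                        # map the window back into original-image coordinates
                        detections.append((int(x * scale), int(y * scale),
                                           int(win * scale), score))
            # shrinking the image lets the fixed-size window cover larger faces
            scale *= scale_step
            img = resize_nn(img, int(h / scale_step), int(w / scale_step))
        return detections

A real detector would add non-maximum suppression over the overlapping detections, and a cascade so that most windows are rejected cheaply.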
The classifier is learnt from labelled data.
Training data:
• 5,000 faces, all near frontal
• 10^8 non-face windows
• faces are normalized for scale and translation

Face detection
Operate at the high-precision (90%) operating point – few false positives.

Matching faces
Matching is easier if faces are first aligned to remove pose variation: face detector → eyes/nose/mouth localization.

Eyes/nose/mouth detectors
Training data: ~5,000 images with hand-marked facial features.
• Scale is determined by the face detector
• Fixed-size patches are extracted around the feature points

Constellation-like appearance/shape model
Model the shape X (2-D points) and the appearance A (patches at the points in X).
• Appearance and shape are assumed independent
• The appearance a_j of each feature is modelled as a mixture of Gaussians (GMM)
• The joint position x_i of all features is modelled as a (mixture of) Gaussians with full covariance, so the positions of all features interact

Detect face features for rectification
Video with detected features → close-up → rectified face.

Face normalization
Affine-transform the face using the detected features: original detection → rectified face.

Representing faces
Matching is improved once faces are aligned (face detector → eyes/nose/mouth). Faces are then compared by measuring the distance between feature vectors.

SIFT descriptor [Lowe 1999]
On the rectified face, create an array of orientation histograms at each facial feature: 8 orientations x 3x3 spatial bins = 72 dimensions.

Face feature vector – summary
One SIFT descriptor per facial feature, i.e. 5 x 72 = a 360-vector for the entire face. Benefits of local SIFT descriptors:
• SIFT is unaffected by small localization errors of the eyes/nose/mouth detector
• centre weighting de-emphasizes the background (no foreground segmentation needed)
• per-SIFT illumination normalization allows lighting to vary across the face (multiple overlapping SIFTs)

b. Obtaining sets of faces using tracking within shots

Face detection
The face detector operates at the high-precision (90%) point, giving few false positives; detections with the same identity must now be associated across frames.

Example – tracked regions
Affine covariant regions: affine-Harris regions and maximally stable extremal regions (MSER).

Tracking covariant regions – two stages
Goal: develop very long, good-quality tracks.
• Stage I – match regions detected in neighbouring frames (problems: e.g. missing detections)
• Stage II – repair tracks by region propagation
This yields region "tubes".

Connecting face detections temporally
Goal: associate the face detections of each character within a shot.
Approach: agglomeratively merge face detections based on connecting region tubes – require a minimum number of region tubes to overlap both face detections.
Alternatives: KLT, Avidan CVPR 01, Williams et al. ICCV 03.
Example: "Buffy the Vampire Slayer", breakfast scene – raw face detections → face tubes.

c. Matching face sets

Representation of a face set
Represent each tube by its set of 360-vectors in "face space".

Matching face sets within a shot
Use the min-min distance between sets of face descriptors A and B (each a set of 360-vectors):
dist(A, B) = min_{a ∈ A, b ∈ B} dist(a, b)
Goal: match the face tubes of a particular person within a shot (to overcome occlusions and self-occlusions).
Approach: agglomeratively merge face sets using the min-min distance, subject to exclusion constraints.
Exclusion principle: the same character cannot appear twice in the same frame.
Face tubes (tracking only) → intra-shot matching.
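To make this concrete, here is an illustrative reimplementation (not the authors' code) of agglomerative merging of face tubes under the exclusion constraint; the descriptor arrays and the merge threshold are assumptions.

    import numpy as np

    def min_min_distance(A, B):
        # dist(A, B) = min over a in A, b in B of ||a - b||
        # A: (n, 360) and B: (m, 360) arrays of face descriptors
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
        return d.min()

    def merge_face_tubes(tubes, frames, threshold):
        # tubes:  list of (n_i, 360) descriptor arrays, one per face tube
        # frames: list of sets, frames[i] = frame indices where tube i appears
        # Exclusion principle: tubes that overlap in time show two different
        # characters, so they must never be merged.
        tubes, frames = list(tubes), [set(f) for f in frames]
        while True:
            best_d, best_pair = threshold, None
            for i in range(len(tubes)):
                for j in range(i + 1, len(tubes)):
                    if frames[i] & frames[j]:
                        continue  # co-occur in a frame => different people
                    d = min_min_distance(tubes[i], tubes[j])
                    if d < best_d:
                        best_d, best_pair = d, (i, j)
            if best_pair is None:
                return tubes, frames  # no mergeable pair below threshold
            i, j = best_pair
            tubes[i] = np.vstack([tubes[i], tubes[j]])
            frames[i] |= frames[j]
            del tubes[j], frames[j]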
Application I: Video Google – faces

The objective
Retrieve all shots/clips in a video, e.g. a feature-length film, containing a particular person: a visually defined search on faces ("Pretty Woman" [Marshall, 1990]).
Applications:
• intelligent fast-forward on characters
• pull out all videos of "x" from 1000s of digital-camera MPEGs

Preliminaries
Film statistics for "Pretty Woman":
• 170,000 frames
• 1,151 shots
Pre-processing:
• track local regions through every shot
• detect faces in every frame using a 'frontal' face detector (38,0457 face detections)
• obtain face tubes by tracking (659 face tubes)
• after intra-shot matching ("plumbing"): 611 face tubes

Face tube – representation as a single vector
From a descriptor for each face to a compact representation of the entire face set:
• treat the face descriptors as samples from an underlying, unknown pdf
• represent the face tube as a histogram over face exemplars (a non-parametric model of the pdf)

Making the search efficient (Google-like retrieval)
Represent the video by a histogram over facial-feature exemplars for each face tube (e.g. 42 facial-feature exemplars by 5 face tubes, the counts p_k in each column forming a normalized histogram). Exemplars play the role of words and face tubes the role of documents (e.g. web pages) in text retrieval, so text-retrieval techniques can be employed, e.g. ranking (here on the chi-squared distance between histograms).

Demo films: "Pretty Woman", "Groundhog Day", "Casablanca", "Charade".

Retrieval example I – Pretty Woman: query sequence (select a face); retrieved sequences, shown by their first detection.
Retrieval example II – Pretty Woman: query sequence; retrieved sequences, with examples of recognized faces.
Retrieval example III – Groundhog Day: query sequence.
Retrieval example IV – Casablanca: query sequence.

Example: matching across movies
Bill Murray in "Lost in Translation" [Coppola 2003] and "Groundhog Day" [Ramis 1993].
"Lost in Translation" query: query shot with an example face detection and 192 associated face detections.
Finding Bill Murray in "Groundhog Day": of the face detections from the first 36 retrieved face tracks, the first false positive is ranked 42nd, and there are 15 false positives in the first 100 retrieved face tracks (out of a total of 596 face tracks).

Application II: Video Annotation
Cf. "Names and Faces in the News", Berg et al., CVPR 2004; also Alex Hauptmann's slides.
• weak supervision from text
• correspondence problem

Aims
Automatically label appearances of characters in TV/movies ("Buffy the Vampire Slayer") with their names (e.g. Willow, Buffy). Appearance and names must be learnt with no supervision from the user, using only readily available annotation.

Textual annotation: subtitles/closed captions
The DVD contains timed subtitles as bitmaps, which are automatically converted to text using simple OCR. This gives what is said, and when, but not who says it.

Textual annotation: script
Many fan websites publish transcripts; the text is automatically extracted from the HTML. This gives what is said, and who says it, but not when.

Alignment by dynamic time warping
The script has no timing, but the word sequence is preserved, so the two sources can be aligned efficiently by dynamic programming: find the path through the grid of subtitle words versus script words that maximizes the number of matching words.
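A minimal sketch of such an alignment (my illustration; the actual system may differ): a longest-common-subsequence style dynamic program over the two word streams.

    def align_words(sub_words, script_words):
        # Dynamic program maximizing the number of matching words between
        # the subtitle stream and the script stream (word order preserved).
        n, m = len(sub_words), len(script_words)
        score = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if sub_words[i - 1].lower() == script_words[j - 1].lower():
                    score[i][j] = score[i - 1][j - 1] + 1
                else:
                    score[i][j] = max(score[i - 1][j], score[i][j - 1])
        # trace back the optimal path to recover the matched word pairs
        pairs, i, j = [], n, m
        while i > 0 and j > 0:
            if sub_words[i - 1].lower() == script_words[j - 1].lower():
                pairs.append((i - 1, j - 1))
                i, j = i - 1, j - 1
            elif score[i - 1][j] >= score[i][j - 1]:
                i -= 1
            else:
                j -= 1
        return list(reversed(pairs))  # (subtitle index, script index)

Each matched script word carries a speaker name and each matched subtitle word carries a timestamp, so the alignment yields who and when together.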
Subtitle/script alignment
Alignment on the "what" (the words, via the dynamic-programming algorithm above) allows the subtitles to be tagged with identities, giving who and when:

Subtitles:
00:18:55,453 --> 00:18:56,086  Get out!
00:18:56,093 --> 00:19:00,044  - But, babe, this is where I belong. - Out! I mean it.
00:19:00,133 --> 00:19:03,808  I've been doing a lot of reading, and I'm in control of my own power now,...
00:19:03,893 --> 00:19:05,884  ..so we're through.
Script:
HARMONY: Get out.
SPIKE: But, baby... This is where I belong.
HARMONY: Out! I mean it. I've done a lot of reading, and, and I'm in control of my own power now. So we're through.

Ambiguity
Knowledge of the speaker is only a weak cue that the character is visible:
• multiple characters may be in frame (Tara? Buffy?)
• the speaker may not be detected
• the speaker may not be visible
These ambiguities will be resolved using vision-based speaker detection.

Face association
Measure the "connectedness" of a pair of face detections by the number of point tracks (KLT tracker) intersecting both. This does not require contiguous detections and provides independent evidence, so there is no drift. Tracking faces through the spatio-temporal video volume in this way automatically associates facial exemplars into example face tracks.

Facial feature localization
Stabilize the representation by localizing the facial features: the pose of the face varies and the face detector is noisy. An extended "pictorial structure" model is used – a joint model of feature appearance and position.

Speaker detection
The subtitles/script give the speaker's name; the task is to identify who (if anyone) in the video is speaking. If, in this frame, the subtitles/script say Willow is speaking and this person is speaking, it must be Willow.
Detect speaking by measuring the amount of mouth motion, searching across frames around the detected mouth points.
[Plot: SSD of the mouth region over ~100 frames, with bands for "speaking", "don't know" and "not speaking".]
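The mouth-motion cue might be sketched as follows (an illustrative reconstruction; the thresholds and search radius are invented, not the paper's values):

    import numpy as np

    def min_ssd(prev, cur, radius=2):
        # minimum SSD over small shifts, to discount mouth-localization jitter
        h, w = cur.shape
        best = np.inf
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                a = prev[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
                b = cur[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
                best = min(best, float(np.mean((a - b) ** 2)))
        return best

    def speaking_labels(mouth_patches, t_speak=0.010, t_still=0.006):
        # mouth_patches: equal-sized float grayscale patches cut out around
        # the detected mouth points, one per frame of the face track
        labels = []
        for prev, cur in zip(mouth_patches, mouth_patches[1:]):
            ssd = min_ssd(prev, cur)
            if ssd > t_speak:
                labels.append("speaking")
            elif ssd < t_still:
                labels.append("not speaking")
            else:
                labels.append("don't know")
        return labels

The middle "don't know" band avoids committing on frames where the evidence is ambiguous.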
Resolved ambiguity
When the speaker (if anyone) is identified, the ambiguity in the textual annotation is resolved: of the candidate faces (Tara? Tara? Buffy?), the proposed name is assigned only to the face detected as speaking; the non-speaking faces are left unlabelled.

Exemplar extraction
Face tracks detected as speaking and with a single proposed name give exemplars: Buffy (2,300 faces), Willow (1,222 faces), Xander (425 faces), and so on. Names are then assigned to the remaining unlabelled faces by classification based on the extracted exemplars.

Classification by exemplar sets
Classify tracks by their nearest exemplar, estimate the probability of each class from distance ratios, and refuse to predict names for uncertain tracks (a track may be confidently labelled Willow, while an uncertain track is left as "?").

"Refusal to predict"
We want to be able to leave some uncertain face tracks unlabelled. An approximate posterior probability of the label λ for a face track F is defined from these distance ratios; only tracks for which the posterior exceeds a threshold are labelled (the posterior also gives a ranking).
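As a sketch of this classification step (an illustration: the softmax-of-negative-distances posterior below is a stand-in for the paper's distance-ratio definition, and the threshold is arbitrary):

    import numpy as np

    def classify_with_refusal(track, exemplars, threshold=0.8):
        # track:     (n, d) array of face descriptors for one face track
        # exemplars: dict mapping name -> (m, d) array of labelled exemplars
        names = list(exemplars)
        # nearest-exemplar (min-min) distance from the track to each name
        dists = np.array([
            np.min(np.linalg.norm(track[:, None, :] - exemplars[nm][None, :, :],
                                  axis=2))
            for nm in names
        ])
        # turn distances into an approximate posterior over names
        scores = np.exp(-dists)
        post = scores / scores.sum()
        best = int(np.argmax(post))
        if post[best] < threshold:
            return None, post[best]  # refuse to predict: too uncertain
        return names[best], post[best]

Raising the threshold trades recall (fraction of tracks named) for precision (fraction of names correct), which is exactly the precision/recall curve reported below.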
Experiments
Tested on two episodes: 130,000 frames, 50,000 faces, 1,000 tracks. Methods compared:
• Proposed method
• Prior: label all faces with the name occurring most frequently in the script (Buffy)
• Subtitles only: label any track having proposed names with one of the proposed names (no speaking detection), breaking ties by the prior

Example results
No user involvement – just hit "go"...
[Images: labelled faces including Buffy, Joyce, Dawn, Willow, Giles, Tara and Other.]
A wide range of pose, expression and lighting is handled.
[Images: labelled faces including Tara, Dawn, Willow, Buffy, Other, Harmony, Xander, Anya, Riley and Spike.]

Precision/recall
Recall is the proportion of face tracks assigned a name; precision is the proportion of correct names.
[Plots: precision versus recall for episodes 05-02 and 05-05, comparing Prior (Buffy), Subtitles only and the Proposed method.]

Example video
Labelling at 100% recall (all faces labelled): 1,900 frames, 2 errors (1 non-face, 1 wrong name).

Quantitative results
Speaker detection: ~25% of tracks labelled, with 90% accuracy.
Naming: ~80% precision at 80% recall; ~70% precision at 100% recall. Precision (%) at a given recall:

                    Episode 05-02               Episode 05-05
Recall              60%   80%   90%   100%      60%   80%   90%   100%
Proposed method     87.5  78.6  72.9  68.2      88.5  80.1  75.6  69.2
Subtitles only      -     -     -     45.2      -     -     -     45.5
Prior (Buffy)       -     -     -     21.3      -     -     -     36.9

Subtitles only (no vision) thus gives only ~45% precision.

Using an SVM classifier
• large margin, smooth discriminant
• explicit robustness to outliers
[Plots: precision versus recall for episodes 05-02 and 05-05, comparing NN-Auto, NN-Manual and SVM.]

Summary
Quite accurate labelling is achieved using only readily available textual annotation, with no user involvement – just hit "go"... Vision-based speaker detection is essential: cues from the subtitles alone are very weak.

Further work
• generalization across episodes/movies
• extension to non-frontal faces
• better models of appearance and fusion
• learning recognition of actions and interactions

4. Extensions to human matching

Improving coverage – beyond frontal faces
Use of a frontal face detector limits the data which can be labelled. For non-frontal views:
• multi-view face detection [Klaeser, Schmid, Dalal & Triggs]
• facial feature localization in profile
• speaker detection in profile
• transfer of labels from speaker detection, and classification, between frontal and profile views by tracking

Feature localization & speaker detection
The pictorial-structure and speaker-detection methods adapt successfully to profile views.

Profile speaker detection
Transferring speaker detections between frontal and profile views expands the available annotation for both.

Extensions to human matching (II)
"Histograms of Oriented Gradients for Human Detection", Navneet Dalal and Bill Triggs, CVPR 2005.

Extensions to human matching (III)
"Strike a Pose: Tracking People by Finding Stylized Poses", Deva Ramanan, David Forsyth and Andrew Zisserman, CVPR 2005.
Build model: detect a stylized (walking) pose with an edge-based pictorial structure and efficient matching; at small scale, find discriminative features and learn limb classifiers (torso versus background – limb pixels alone are a poor model), coping with unusual poses.
Build model & detect: the learnt limb classifiers (torso, arm, leg, head) label pixels, and a general-pose pictorial structure then tracks the person through unusual poses.
Running example: how well do the classifiers generalize?

References
Sivic, J., Everingham, M. and Zisserman, A. "Person spotting: video shot retrieval for face sets." International Conference on Image and Video Retrieval (CIVR), Singapore, 2005. http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic05a.pdf
Everingham, M., Sivic, J. and Zisserman, A. ""Hello! My name is... Buffy" – Automatic Naming of Characters in TV Video." British Machine Vision Conference, 2006. http://www.robots.ox.ac.uk/~vgg/publications/papers/everingham06a.pdf
Demo for "Charade": http://www.robots.ox.ac.uk/~vgg/research/fgoogle/

END