A Morphological Approach to Scene Change Detection and Digital Video Storage and Retrieval

Woonkyung M. Kim (a), S. Moon-Ho Song (a), Hyeokman Kim (b), Cheeyang Song (b), Byung Woong Kwon (a) and Sun Geun Kim (a)

(a) School of Electrical Engineering, Korea University, Sungbuk-gu Anam-dong 5 Ga 1, Seoul 136-701, Korea
(b) Multimedia Technology Research Laboratory, Korea Telecom, Seocho-gu Woomyeon-dong 17, Seoul 137-792, Korea

ABSTRACT

With the abstraction of digital grayscale video as the corresponding binary video (a process which, upon numerous subjective experiments, seems to preserve most of the intelligibility of the video content), we can pursue a precise and analytic approach to the design of digital video storage and retrieval algorithms that is based upon geometrical (morphological) intuition. The foremost benefit of such an abstraction is the immediate reduction of both the data and the computational complexity involved in implementing various algorithms and databases. The general paradigm presented may be utilized to address all issues pertaining to video library construction, including visualization, optimum feedback query generation, object recognition, etc., but the primary focus of this paper is the detection of fast scene changes (including in the presence of flashlights) and of gradual scene changes such as dissolves, fades, and various special effects such as wipes. In simulation we observed that we can achieve performance comparable to that of other methods with drastic reductions in both storage and computational complexity.
Since the conversion from grayscale to binary video can be performed directly, with minimal additional computation, in the compressed domain by thresholding the DCT DC coefficients themselves (or by using the contour information attached to MPEG-4 formats), the algorithms presented herein are ideally suited for fast (on-the-fly) determination of scene changes, object recognition and/or tracking, and other more intelligent tasks that traditionally place heavy demands on computation and/or storage. The fast determinations may then be used on their own merits, or in conjunction with (or as a complement to) other higher-layer information in the future.

Keywords: morphological signal processing, scene change detection, storage and retrieval

Further author information: (Send correspondence to Woonkyung M. Kim) Woonkyung M. Kim: E-mail: [email protected]

1. INTRODUCTION AND NOTATIONS

Following convention, every digital video sequence $V$ is to be logically segmented as a sequence of scenes $S_i^V$, $i = 0, 1, \ldots, L^V - 1$, i.e.

$$V \overset{\text{def}}{=} (S_0^V, \ldots, S_{L^V-1}^V), \qquad (1)$$

where every scene $S_i^V$ can be further decomposed into a sequence of shots $Sh_{ij}^V$, $j = 0, 1, \ldots, M_i - 1$, i.e.

$$S_i^V \overset{\text{def}}{=} (Sh_{i0}^V, Sh_{i1}^V, \ldots, Sh_{i(M_i-1)}^V), \qquad (2)$$

and every shot $Sh_{ij}^V$ can be further decomposed into a sequence of frames $F_{ijk}^V$, $k = 0, 1, \ldots, N_{ij} - 1$, i.e.

$$Sh_{ij}^V \overset{\text{def}}{=} (F_{ij0}^V, F_{ij1}^V, \ldots, F_{ij(N_{ij}-1)}^V). \qquad (3)$$

In other words, the digital video sequence $V$ is viewed as a sequence of frames $(F_0^V, F_1^V, \ldots, F_{N^V-1}^V)$ ($N^V$ frames) which is divided into $L^V$ groups called scenes, each of which is divided into $M_i$ groups called shots, each of which consists of $N_{ij}$ frames; in particular,

$$N^V = \sum_{i=0}^{L^V-1} \sum_{j=0}^{M_i-1} N_{ij}. \qquad (4)$$

The preliminary goal of scene change detection is to identify a given video $V$ as a collection of shots, some of which are to be labelled as gradual transitions (fade-in and fade-out included), wipes and other special effects.
It is possible to adopt a function-theoretic viewpoint towards unifying the many different existing approaches to scene change detection. Omitting much of the discussion, we point out merely that we envision each frame $F_n^V$ of any digital video $V$ as a "point" in a discrete trajectory* lying in a space $\mathcal{F}$ of suitably large dimension, henceforth termed the frame space. With this conceptualization of digital video, it seems plausible to ask whether there is a metric defined on the corresponding frame (metric) space $\mathcal{F}$ with respect to which the video trajectory maps out a discernible pattern. Within this context, the generic goal of the existing scene change detection algorithms is to generate a similarity metric [1,2] $d : \mathcal{F} \times \mathcal{F} \to \mathbb{R}^+$ defined on $\mathcal{F}$ such that the points of each shot $Sh_{ij}$ map out a "localized" subpath, and between any two unrelated consecutive shots $Sh_{ij}$ and $Sh_{i(j+1)}$ there is a distinct jump in distance. In defining the similarity metric $d(\cdot,\cdot)$, it is of significant practical interest to achieve the following two objectives:

perception correlation: a strong correlation between perceptual and computed similarities is preserved, and
computational tractability: the computational and storage complexity involved in calculating the computed similarity is minimized.

Various metric spaces and corresponding metrics have been proposed to achieve the above two objectives. The metrics in previous works [3-6,1,7,8,2] were proposed to fulfill the requirement of computational tractability, and those in other works [9-11] and other classical works [12] were proposed to fulfill the requirement of perception correlation. The frame space $\mathcal{F}$ can, without any loss of generality, be considered to be a vector space, and the following is an abbreviated overview of the ingredients of the proposed metrics considered to satisfy the above-mentioned objectives:

Fulfillment of perception correlation:
1.
the standard $l_1$, $l_2$, $l_\infty$ norms defined on the frame space $\mathcal{F}$, considered to consist of histograms, raw-image pixel values, DCT coefficients, etc.,
2. the standard inner product defined on the frame space $\mathcal{F}$, considered to consist of histograms, DCT coefficients, etc.,
3. some statistical measures (e.g. motion vector counts, variances) to reflect the activity of each frame.

Fulfillment of computational tractability:
1. decimation of data achieved in transform domains (e.g. neglecting some AC coefficients), spatial domains (e.g. utilization of only some of the macroblocks, averaging of pixel values within a block), and/or other hybrid domains (e.g. histogramming into a small number of bins),
2. utilization of (readily available) compressed-domain features such as DC/AC coefficients (JPEG, MPEG), motion vectors (MPEG), and some derived measures.

It is also noted that, as for the cases of gradual transitions such as dissolve, fade-in and fade-out, and/or of false negatives such as zoom-in, zoom-out, flashlights and panning, when we have a more or less clear production model as in Hampapur [9], the detection of these features involves the detection of associated regularities tracked via some derived quantities (e.g. statistics, interframe distances).

* i.e. a sequence of points

It is possible to conceptualize, within the context of our function-theoretic framework, all of the above proposed (similarity) metric-based scene detection algorithms. For the sake of brevity, and just to convey the flavor, we point out merely (at the risk of inaccuracy) the following:
1. the transformation of a raw-image frame into a histogram frame, followed by some norm and/or inner product operation (between two adjacent frames), entails partitioning the original $\mathcal{F}$ into equivalence classes (each of which includes all of the images under the symmetric group of permutations) defined via the (distinct) histogram frame, and then defining the metric to be the one induced by the standard metric on the histogram frame space.
The hope (and evidence [12,1]) here is that both the object motions and the motions of the camera are absorbed well (for example) by the actions of the symmetric group.
2. the (normalized) inner product proposed to measure frame similarity can be viewed as indicating the angle between the two frames (viewed as vectors) as seen from the origin. As such, the measurement would be suspect for those videos $V$ (viewed as trajectories in frame space) for which the intershot trajectory (representing some portion of $Sh_{ij}$ and $Sh_{i(j+1)}$) lies in line with the origin (some special cases of dissolves, and all cases of fade-in and fade-out).

One avenue hitherto untapped towards developing algorithms for storage and retrieval of digital video (scene change detection algorithms included) is that pertaining to decimation in the pixel-value domain. Our morphological approach to scene change detection and digital video storage and retrieval generally involves abstracting an $N^V$-frame grayscale video $V$ (normally viewed as a sequence of grayscale-valued functions) as the corresponding binary and/or set videos:

Definition 1.1 (Grayscale, Binary, Set Video).
1. (grayscale video): A sequence of grayscale functions $(G_n^V)_{n=0}^{N^V-1}$ where each function is defined by:
$$G_n^V(i,j) = G_{ijn}^V,$$
where each $G_{ijn}^V$ is the $ij$-th (quantized) grayscale pixel value of the $n$-th frame.
2. (binary video): A sequence of binary-valued functions $(B_n^V)_{n=0}^{N^V-1}$ where each function $B_n^V$ is defined by:
$$B_n^V(i,j) = B_{ijn}^V,$$
where each $B_{ijn}^V$ is the $ij$-th binarized pixel value of the $n$-th frame; we assume a generalized grayscale-to-binary converting thresholding in the form of $G_{ijn}^V \to B_{ijn}^V = T_{\mathcal{A}}(G_{ijn}^V)$, where $\mathcal{A}$ stands for a general adaptation logic; e.g. the following representative notations are generally used:
(a) $\mathcal{A} = h^V$ implies that a uniform fixed global threshold is used to convert all pixel values to binary, irrespective of temporal and spatial context, and
(b) $\mathcal{A} = h_n^V$ implies that a time-varying global threshold, presumably taking into account intraframe spatial context, is used to convert all pixels in the frame into their respective binary values.
3. (set video): A sequence of (foreground) sets $(S_n^V)_{n=0}^{N^V-1}$ where each "foreground" subset $S_n^V$ of $\mathbb{Z}^2$ is defined indirectly by:
$$\chi_{S_n^V} = T_{\mathcal{A}}(G_n^V) = B_n^V,$$
where:
(a) $\chi_C$ denotes the standard characteristic function of the set $C$, and
(b) the functions $T_{\mathcal{A}}(G_n^V)$ and $B_n^V$ are defined by:
i. $T_{\mathcal{A}}(G_n^V)(i,j) = T_{\mathcal{A}}(G_{ijn}^V)$, and
ii. $B_n^V(i,j) = B_{ijn}^V$.
Equivalently, the set video may be formulated as a single set $S^V$ consisting of all "foreground" pixels, i.e. the subset $S^V$ of $\mathbb{Z}^3$ is defined as the collection of all pixels constituting the "foreground" image:
$$S^V = \bigcup_{n=0}^{N^V-1} \; \bigcup_{(i,j) \in S_n^V} \{(i,j,n)\}.$$

With the above definitions of binary and set video at hand, we are ready to tackle the problems of scene change detection and of storage and retrieval of digital video.

It is clear that, with the above abstraction of digital video, a rather natural and classical representation of digital video $V$ is as a frame-point trajectory, a sequence of points $(G_n^V)_{n=0}^{N^V-1}$, in a vector space of suitably large dimension. For example, viewed in the context of a natural Banach space (with the $l_2$-norm, say), the following characterizing manifestations are presumed to occur:
1. Abrupt scene changes $\leftrightarrow$ "jumps" in the frame-point trajectory
2. Flashlights $\leftrightarrow$ sudden short-duration (typically 1 or 2 frame-points) "jumps" in the frame-point trajectory
3. Intrashot changes (within $Sh_{ij}^V$) $\leftrightarrow$ "close" neighboring frame-points
4.
Gradual scene changes (some $Sh_{ij}^V$): dissolves, fade-ins, fade-outs $\leftrightarrow$ straight line segment(s) joining the two end frame-points $G_m^V$ and $G_{m+M-1}^V$:

$$G_n^V = \alpha(n,m,M)\, G_m^V + \beta(n,m,M)\, G_{m+M-1}^V, \qquad (5)$$

where $\alpha(n,m,M)$, $\beta(n,m,M)$, such that $0 \le \alpha(n,m,M), \beta(n,m,M) \le 1$, are monotonically decreasing/increasing functions of $n$ with:
(a) $\alpha(m,m,M) = 1$, $\alpha(m+M-1,m,M) = 0$, and
(b) $\beta(m,m,M) = 0$, $\beta(m+M-1,m,M) = 1$,
is a generic model of dissolves and fades. A simple and typical model is the uniform dissolve where:

$$\alpha(n,m,M) = \frac{M-1+m-n}{M-1} = 1 - \beta(n,m,M), \quad \text{and} \quad \beta(n,m,M) = \frac{n-m}{M-1}, \qquad (6)$$

i.e. where the first frame begins to turn off exactly when the second frame begins to turn on, and the first frame turns off completely exactly when the second frame turns on completely. Equation 5, viewed as vector addition, clearly corresponds to a straight line segment joining the two end frame-points.

The morphological approach entails utilizing the homomorphisms:

$$F_n^V \leftrightarrow G_n^V \leftrightarrow B_n^V \leftrightarrow S_n^V,$$

which (upon heuristic observations†) retain the desirable properties of perception correlation and computational tractability, towards developing the following morphological paradigm (which may be used in conjunction with traditional paradigms), which operates directly and exclusively in the binary or Boolean-valued domain (see Figure 1). The goal of the following sections is to examine the validity of such a minimal paradigm for multimedia manipulations involving storage and retrieval of digital video.

† with some exceptions of some wipes and other minor defects which do not register upon binary conversion

Figure 1. Overall Morphological System for Storage and Retrieval

2. RENDERING OF VIDEO FOR PRESENTATION, VERIFICATION AND EDITING

It is clear that when the underlying frame space is partitioned into equivalence classes corresponding to either of the equivalence relations (this is actually a homomorphism):
1. $G_p^V \sim G_q^V \Leftrightarrow T_{h^V}(G_p^V) = T_{h^V}(G_q^V)$, and
2. $G_p^V \sim G_q^V \Leftrightarrow T_{h^V}(\phi_k(G_p^V)) = T_{h^V}(\phi_k(G_q^V)) \;\; \forall k \in I$, the index set, where the $\phi_k$'s are the windowing or masking transformations,
the characterizing manifestations of the last part of the last section remain (more or less) invariant. Such intuition was also verified subjectively by binarizing M-JPEG video and noting that the underlying video content remains (more or less, irrespective of reasonable choices of the thresholding constant $h^V$) "intelligible". Furthermore, a natural (from computational efficiency and real-time decoding considerations) abstraction of digital video results from choosing the existing windowing method (macroblock) and setting:

$$T_{h_n^V}(\phi_k(G_n^V))(i,j) = T_{h_n^V}\Big( \sum_{(p,q) \in MB(k)} G_{pqn}^V \Big) \qquad (7)$$

(which corresponds to thresholding on the DCT DC coefficient of the macroblock $MB(k)$ containing pixel $(i,j)$ of the $n$-th frame), where:
1. $h_n^V = \frac{1}{MN} \sum_{(i,j)} G_{ijn}^V$ (each $n$-th frame of video $V$ is $M \times N$ pixels),
2. $\forall k = 0, 1, 2, \ldots, K-1$ (there are $K$ macroblocks in a given frame),
which requires only a trivial (MPEG/M-JPEG/H.26x) decoding (extraction of the DCT DC coefficient of the corresponding macroblock) followed by thresholding (or logical OR'ing of the highest-order bits).

Adhering to the maxim that "a picture is worth a thousand words", we are currently in the process of efficiently rendering video for verification and editing purposes via the following more compact (computation- and storage-efficient) visual presentations:
1. Sequences that are "projections" (or other suitable reductions) of the elements of the original sequence $(G_n^V)_{n=0}^{N^V-1}$ onto subsampled (2-D) spaces [13], or
2. $S^V$ in $\mathbb{Z}^3$, or the reduced set corresponding to Equation 7; such representations‡ are appropriate for distinguishing interframe variations (e.g. object movements, pans, dissolves, fades, wipes, etc.).
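The macroblock thresholding of Equation 7 can be illustrated with a minimal sketch that works on decoded pixel values rather than on a compressed stream. It assumes that the block mean can stand in for the DCT DC coefficient (the two are proportional), that the threshold is the frame mean $h_n^V$, and that a block counts as foreground when its mean is strictly greater than the threshold; the function names and the strict inequality are illustrative choices, not part of the paper.

```python
from typing import List

Frame = List[List[int]]  # grayscale frame stored as rows of pixel values


def frame_mean(frame: Frame) -> float:
    """Global threshold h_n: mean of all pixel values in the frame."""
    total = sum(sum(row) for row in frame)
    return total / (len(frame) * len(frame[0]))


def binarize_by_macroblock(frame: Frame, block: int = 8) -> List[List[int]]:
    """Return one bit per macroblock: 1 if the block mean exceeds the frame mean.

    Assumption: the block mean is used in place of the DCT DC coefficient,
    which is proportional to it, so the decision is the same up to scaling.
    """
    h = frame_mean(frame)
    rows, cols = len(frame), len(frame[0])
    bits = []
    for r0 in range(0, rows, block):
        bit_row = []
        for c0 in range(0, cols, block):
            # Collect the pixels of the macroblock (edge blocks may be smaller).
            pixels = [frame[r][c]
                      for r in range(r0, min(r0 + block, rows))
                      for c in range(c0, min(c0 + block, cols))]
            bit_row.append(1 if sum(pixels) / len(pixels) > h else 0)
        bits.append(bit_row)
    return bits
```

For a 4 x 4 frame with one bright 2 x 2 corner, `binarize_by_macroblock(frame, block=2)` marks only that corner block as foreground, which is the reduced binary frame the later metrics operate on.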
The above presentations, being able to depict intra- and interframe variations effectively, can serve to complement the representation proposed by Yeung [14], which, more or less, conveys the structural semantics of video. The visual rhythm [13] is a special case of 1., which captures the scale-space variations well, and the rendering in 2. captures (in addition) the geometries (shapes and motion) within the underlying image frames.

3. MORPHOLOGICAL APPROACH TO SCENE CHANGE DETECTION

The primary motivation behind this strategy is that, given a (color) grayscale video $(G_n^V)_{n=0}^{N^V-1}$, the intelligibility of the video is retained by the corresponding binary and/or set videos $(B_n^V)_{n=0}^{N^V-1}$ and $(S_n^V)_{n=0}^{N^V-1}$. It is clear that the new frame space $\tilde{\mathcal{F}}$, in the context of the old frame space $\mathcal{F}$, can be visualized as a "binarized" space in that each original real coordinate axis is mapped into a binary axis consisting only of $\{0,1\}$; equivalently, the new frame space once again induces equivalence classes on the old one, which seem to correspond well§ to the perceptual notion of equivalence of frames (in that the members of the same equivalence class correspond to "virtually" the same images, especially as the dimension of the underlying frame spaces gets larger). In this context, the $l_1$-norm¶ (operating on the representative elements of the equivalence classes) does give a reasonable measure of distance between original frame images. Viewing the new frame space $\tilde{\mathcal{F}}$ in conjunction with the familiar $l_1$-norm, we propose the following three metrics for scene change detection‖:

Definition 3.1 (metrics). The following metrics apply for $(B_n^V)_{n=0}^{N^V-1}$:
conventional $l_1$-metric: $d_1(B_p^V, B_q^V) \overset{\text{def}}{=} \|B_p^V - B_q^V\|_1 = \sum_{(i,j)} |B_{ijp}^V - B_{ijq}^V| = \sum_{(i,j)} |B_{ijp}^V \oplus B_{ijq}^V|$,
translation invariant metric: $d_{ti}(B_p^V, B_q^V) \overset{\text{def}}{=} \big| \sum_{(i,j)} (B_{ijp}^V - B_{ijq}^V) \big|$, and
foreground biased metric: $d_{fbm}(B_p^V, B_q^V) \overset{\text{def}}{=} \sum_{(i,j)} [B_{ijp}^V (1 - B_{ijq}^V)]$.

Equivalently, the following metrics apply for $(S_n^V)_{n=0}^{N^V-1}$:
conventional $l_1$-metric: $d_1(S_p^V, S_q^V) \overset{\text{def}}{=} \big| |S_p^V - S_q^V| + |S_q^V - S_p^V| \big| = |S_p^V \,\triangle\, S_q^V|$ (the cardinality of the symmetric difference),
translation invariant metric: $d_{ti}(S_p^V, S_q^V) \overset{\text{def}}{=} \big| |S_p^V - S_q^V| - |S_q^V - S_p^V| \big|$,
foreground biased metric: $d_{fbm}(S_p^V, S_q^V) \overset{\text{def}}{=} |S_p^V - [S_p^V \cap S_q^V]|$.

In particular, the distance $d_m(\cdot,\cdot)$ between two consecutive frames $B_n^V$ and $B_{n+1}^V$ is measured by:

$$D_n^m \overset{\text{def}}{=} d_m(B_n^V, B_{n+1}^V), \qquad (8)$$

where $n$ stands for any of $0, 1, 2, \ldots, N^V - 2$, and $m$ stands for any of $1$, $ti$, and $fbm$.

It is clear that the computational complexities involved in any of the three metrics above are far below those corresponding to (non-binary-valued) grayscale digital video $(G_n^V)_{n=0}^{N^V-1}$, and the metrics can be implemented trivially utilizing (bitwise) hardware and software operations.

‡ it is clear that we may also use the boundary information contained in MPEG-4 for this particular rendering
§ upon visual inspection of test videos
¶ in fact, any standard norm will serve the same purpose
‖ $\oplus$ stands for exclusive-OR and $\|\cdot\|_1$ stands for the $l_1$-metric

In our implementations, each threshold value $h_n^V$ is set globally (in a prepass) as the mean of all grayscale pixel values within a given range of frames.** When the original video is converted from $V$ to $(B_n^V)_{n=0}^{N^V-1}$ and played back, the overall intelligibility of the scenes within the video seemed to be preserved (even for dark video).

The real advantage of converting to $(B_n^V)_{n=0}^{N^V-1}$ or $(S_n^V)_{n=0}^{N^V-1}$ is that traditional shape-based signal processing [15-17] can then take place to derive additional features (expected to be important in general content-based retrieval applications [14,18-21]) to ensure more accurate determination.

It can be pointed out that the calculations involved in computing the metrics in Equation 8 and Definition 3.1 are trivial.
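To make the "bitwise" remark concrete, the three metrics of Definition 3.1 and the interframe distances of Equation 8 can be sketched with each binary frame packed into a single Python integer, one bit per pixel. The packing scheme and all function names below are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from typing import Callable, List


def pack(bits: List[int]) -> int:
    """Pack a flat list of 0/1 pixel values into one integer bitmask."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def popcount(x: int) -> int:
    """Number of 1 bits, i.e. the foreground area of a packed frame."""
    return bin(x).count("1")


def d_1(bp: int, bq: int) -> int:
    """Conventional l1-metric: count of pixels that differ (XOR)."""
    return popcount(bp ^ bq)


def d_ti(bp: int, bq: int) -> int:
    """Translation invariant metric: difference of foreground areas."""
    return abs(popcount(bp) - popcount(bq))


def d_fbm(bp: int, bq: int) -> int:
    """Foreground biased metric: pixels set in bp but not in bq."""
    return popcount(bp & ~bq)


def interframe_distances(frames: List[int],
                         metric: Callable[[int, int], int]) -> List[int]:
    """D_n = metric(B_n, B_{n+1}) for n = 0 .. N-2, as in Equation 8."""
    return [metric(a, b) for a, b in zip(frames, frames[1:])]
```

With `p = pack([1, 1, 0, 0])` and `q = pack([0, 1, 1, 0])` (the same two-pixel object shifted by one position), `d_1` reports two changed pixels while `d_ti` reports zero, which is the translation invariance the second metric is designed for.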
In fact, any calculation performed on binarized video $(B_n^V)_{n=0}^{N^V-1}$ is expected to be simple, and this point offers yet another reason for binarizing the video at the outset. We comment also that simple manipulations can be performed (comparable to those in compressed-domain approaches [4,22-25]) to extract $(B_n^V)_{n=0}^{N^V-1}$.

3.1. Detection of Abrupt Scene Changes

When the $D_n^1$'s as defined above are utilized for abrupt scene change detection, the performance seems comparable to that reported elsewhere [3,4,2] using substantially more computation, and the performance was surprisingly robust with respect to severe object motion, which was expected from the works of Meng and Nakajima [6,8], who used (thresholded) $l_1$-norms on histogram frames. It is also clear that the metric $d_{ti}(\cdot,\cdot)$ on the new binary frame space $\tilde{\mathcal{F}}$ can be expected to accurately weed out localized (non-entering and non-exiting) bilinear (2-D) object motions:

$$\begin{aligned}
d_{ti}(B_p^V, B_q^V) &\overset{\text{def}}{=} \Big| \sum_{(i,j)} (B_{ijp}^V - B_{ijq}^V) \Big| \\
&= \Big| \sum_{(i,j)} B_{ijp}^V - \sum_{(i,j)} B_{ijq}^V \Big| \\
&= | \text{Area of foreground of } B_p^V - \text{Area of foreground of } B_q^V | \\
&= 0.
\end{aligned}$$

Thus, it can be verified that for a small object movement (either foreground or background) in a saturated (i.e. mostly foreground or mostly background) frame, the above metric is close to 0 and offers a meaningful indication of similarity between consecutive frames within a (non-gradual scene change) shot $Sh_{ij}^V$, i.e.

$$D_n^{ti} \overset{\text{def}}{=} d_{ti}(B_n^V, B_{n+1}^V) = D_{n+1}^{ti} = 0. \qquad (9)$$

From our testing of digital video for abrupt scene changes (with the setting of a global threshold $h^V$), the metric $d_{ti}(\cdot,\cdot)$, designed to alleviate the effects of motion, performed worse (60 percent detection with low false alarm) than $d_1(\cdot,\cdot)$, whose performance was in turn markedly worse (75 percent detection with low false alarm) than that of $d_{fbm}(\cdot,\cdot)$, which performed comparably (90 percent detection with low false alarm) to Yeo's algorithm [4]. Such surprising performance can be attributed to the various nonlocalized (in-frame) motion and noise inherent in (binarizing) real digital video.

3.2. Detection of Dissolves, Fade-ins, Fade-outs

Gradual scene changes as represented by dissolves, by virtue of being scale-space transformations, are ill-posed for detection by (binary) morphological means. Nonetheless, for the simplest case of Equations 5 and 7, a uniform dissolve where the respective fades begin and end simultaneously (other types can be treated similarly with a straightforward generalization), the visualization of the corresponding $S^V$ (or equivalently $(S_n^V)_{n=0}^{N^V-1}$) yields the following simple necessary and sufficient condition for the existence of a dissolve when we use a single global threshold $h^V$ in Definition 1.1 and $\phi_k(\cdot)$ as in Equation 7.
Since (using the linearity of $\phi_k$ and $\alpha = 1 - \beta$):

$$\begin{aligned}
T_{h^V}(\phi_k(G_n^V)) &= T_{h^V}(\phi_k(\alpha(n,m,M)\, G_m^V + \beta(n,m,M)\, G_{m+M-1}^V)) \\
&= T_{h^V}(\alpha(n,m,M)\, \phi_k(G_m^V) + \beta(n,m,M)\, \phi_k(G_{m+M-1}^V)) \\
&= T_{h^V}(\phi_k(G_m^V) + \beta(n,m,M)\, [\phi_k(G_{m+M-1}^V) - \phi_k(G_m^V)]),
\end{aligned} \qquad (10)$$

where $\beta(n,m,M) = \frac{n-m}{M-1}$ as in Equation 6, we have:

$$[T_{h^V}(\phi_k(G_n^V))](i,j) =
\begin{cases}
1 & [T_{h^V}(\phi_k(G_m^V))](i,j) = 1 \wedge [T_{h^V}(\phi_k(G_{m+M-1}^V))](i,j) = 1 \\
0 \uparrow 1 & [T_{h^V}(\phi_k(G_m^V))](i,j) = 0 \wedge [T_{h^V}(\phi_k(G_{m+M-1}^V))](i,j) = 1 \\
1 \downarrow 0 & [T_{h^V}(\phi_k(G_m^V))](i,j) = 1 \wedge [T_{h^V}(\phi_k(G_{m+M-1}^V))](i,j) = 0 \\
0 & [T_{h^V}(\phi_k(G_m^V))](i,j) = 0 \wedge [T_{h^V}(\phi_k(G_{m+M-1}^V))](i,j) = 0
\end{cases} \qquad (11)$$

where $\uparrow$, $\downarrow$ denote the fact that the behaviors are monotonic in $n$, i.e. that as a function of $n$ in the range $[m, m+M-1]$ the values correspond to shifted (inverted) unit step functions.

Utilizing this necessary and sufficient condition for a uniform dissolve (corresponding to a noiseless ideal case), we have derived the following pair of heuristic measures that count and weigh the occurrences of the frame transitions $0 \to 0$, $0 \to 1$, $1 \to 0$ and $1 \to 1$ over a fixed window of duration $M$ (the suspected maximal/likely duration of dissolves) and thereby detect peaks in the "number" of favorable†† patterns of binary sequences corresponding to pixel $(i,j)$ (the weighting-parameter symbols inside the parentheses are not legible in this copy and are shown as "$\cdot$"):

$$Q_{m,1}^V((S_n^V)_{n=m}^{m+M-1}) \overset{\text{def}}{=} \sum_{n=m}^{m+M-2} [(\,\cdot - \cdot\,)\,|S_n^V - S_{n+1}^V| + (\,\cdot - \cdot\,)\,|S_{n+1}^V - S_n^V|], \text{ and} \qquad (12)$$

$$Q_{m,2}^V((S_n^V)_{n=m}^{m+M-1}) \overset{\text{def}}{=} \sum_{n=m}^{m+M-2} [(\,\cdot - \cdot\,)\,|S_n^V - S_{n+1}^V| + (\,\cdot - \cdot\,)\,|S_{n+1}^V - S_n^V|]. \qquad (13)$$

** so as to appear quasi-static; for color video, each color pixel value had to be converted to its grayscale equivalent

With the aid of an elementary (morphological) detector in the form of Equation 12 we were able to clearly discern the occurrences of synthetic dissolves, but for real digital video data there was clearly room for improvement in the selection of:
1. the optimal threshold value $h^V$ (possibly $h_n^V$),
2. the optimal weighting parameters,
3. the pattern detectors $Q_{m,1}^V(\cdot)$ and $Q_{m,2}^V(\cdot)$ themselves.
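The step-function condition of Equation 11 can be checked on a synthetic uniform dissolve with a short sketch. This is not the pair of measures $Q_{m,1}^V$, $Q_{m,2}^V$ (whose weights are not reproduced here); it only counts the fraction of pixels whose thresholded sequence is constant or changes value exactly once. The function names and the strict "greater than" threshold are assumptions made for the example.

```python
from typing import List


def uniform_dissolve(g_start: float, g_end: float, M: int) -> List[float]:
    """Pixel values of one pixel over an M-frame uniform dissolve (M >= 2).

    Follows Equations 5-6 with beta = k / (M - 1), k = n - m = 0 .. M-1.
    """
    return [g_start + (k / (M - 1)) * (g_end - g_start) for k in range(M)]


def threshold(values: List[float], h: float) -> List[int]:
    """Binarize a pixel sequence against a single global threshold h."""
    return [1 if v > h else 0 for v in values]


def is_step(bits: List[int]) -> bool:
    """True if the binary sequence is constant or changes exactly once."""
    changes = sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    return changes <= 1


def step_fraction(pixel_sequences: List[List[int]]) -> float:
    """Fraction of pixels whose binary sequence satisfies the step condition."""
    favorable = sum(1 for seq in pixel_sequences if is_step(seq))
    return favorable / len(pixel_sequences)
```

In the noiseless case every pixel of a uniform dissolve passes `is_step`, so `step_fraction` equals 1; motion or noise inside the window lowers it, which is consistent with the difficulty on real video reported in this section.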
But it is also clear that the computations involved remain extremely simple: more specifically, by Equation 7 the data $(S_n^V)_{n=0}^{N^V-1}$ can be fetched directly from the DCT DC components of the compressed video stream, and the calculations in Equation 12, being virtually bit-level (binary-valued) operations, are trivial and could possibly be implemented via simple dedicated (cellular) hardware at the detector for virtually instantaneous performance.

†† we took the weighting parameters in pairs, with the three pairs set to 1, 0 and -4, respectively

It was clear in our experiments that the proposed solution for detection of dissolves did not work, due to motion within the dissolving shot. The fact that motion remains of paramount concern in the detection of dissolves can be seen by noting that a quick version of ML detection (which ignores structured motion), in the form of recognizing fibers (corresponding to the points on the $(i,j)$-th pixel within the dissolving shot), still fails to detect real dissolves upon experimentation.

More specifically, it can be seen that the a posteriori probability of a given sequence $(B_n^V)_{n=m}^{m+M-1}$ being in a dissolving shot is:

$$\begin{aligned}
APD_m^V((B_n^V)_{n=m}^{m+M-1}) &= P(\text{dissolve} \mid (B_n^V)_{n=m}^{m+M-1}) \\
&= \frac{P((B_n^V)_{n=m}^{m+M-1} \mid \text{dissolve})\, P(\text{dissolve})}{P((B_n^V)_{n=m}^{m+M-1})} \\
&\propto P((B_n^V)_{n=m}^{m+M+1} \mid \text{dissolve}) \\
&= \prod_{i,j} P((B_{ijn})_{n=m}^{m+M+1} \mid \text{dissolve}) \\
&= \prod_{i,j} P(B_{ijm} B_{ij(m+1)} \cdots B_{ij(m+M+1)} \mid \text{dissolve}) \\
&= \prod_{i,j} P(B_{ijm} \cdots B_{ij(m+M+1)} \mid B_{ijm} B_{ij(m+M+1)}\, \text{dissolve})\, P(B_{ijm} B_{ij(m+M+1)} \mid \text{dissolve}) \\
&\propto \prod_{i,j} P(B_{ijm} \cdots B_{ij(m+M+1)} \mid B_{ijm} B_{ij(m+M+1)}\, \text{dissolve}) \\
&= \prod_{i,j} \Big[ \sum_{x \in \{0,1\}^M} P(B_{ij(m+1)} \cdots B_{ij(m+M)},\, x \mid B_{ijm} B_{ij(m+M+1)}\, \text{dissolve}) \Big] \\
&= \prod_{i,j} \Big[ \sum_{x \in \{0,1\}^M} P(x \mid B_{ijm} B_{ij(m+M+1)}\, \text{dissolve})\, P(B_{ij(m+1)} \cdots B_{ij(m+M)} \mid x\, B_{ijm} B_{ij(m+M+1)}\, \text{dissolve}) \Big] \\
&\propto \prod_{i,j} \Big[ \sum_{x \in C_{B_{ijm} B_{ij(m+M+1)}}} P(B_{ij(m+1)} \cdots B_{ij(m+M)} \mid x\, B_{ijm} B_{ij(m+M+1)}\, \text{dissolve}) \Big]
\end{aligned}$$

since (in the last line):

$$P(x \mid B_{ijm} B_{ij(m+M+1)}\, \text{dissolve}) \propto
\begin{cases}
1 & x \in C_{B_{ijm} B_{ij(m+M+1)}} \\
0 & x \notin C_{B_{ijm} B_{ij(m+M+1)}},
\end{cases} \qquad (14)$$

where the sets of (favorable) monotonicity patterns are as follows:

$$\begin{aligned}
C_{00} &\overset{\text{def}}{=} \{0 \cdots 0\}, \\
C_{11} &\overset{\text{def}}{=} \{1 \cdots 1\}, \\
C_{01} &\overset{\text{def}}{=} \{1 \cdots 1,\ 01 \cdots 1,\ \ldots,\ 0 \cdots 0\}, \\
C_{10} &\overset{\text{def}}{=} \{0 \cdots 0,\ 10 \cdots 0,\ \ldots,\ 1 \cdots 1\}.
\end{aligned}$$

Obviously, when we parameterize the probability of (statistically independent) inconsistency with a single parameter $p$, the following classical minimum-distance [12] detection rule arises:

$$\begin{aligned}
APD_m^V((B_n^V)_{n=m}^{m+M-1}) &\propto \prod_{i,j} \Big[ \sum_{x \in C_{B_{ijm} B_{ij(m+M+1)}}} p^{y_{ijm}^x} (1-p)^{M - y_{ijm}^x} \Big] \\
&\propto \prod_{i,j} \Big[ \max_{x \in C_{B_{ijm} B_{ij(m+M+1)}}} p^{y_{ijm}^x} (1-p)^{M - y_{ijm}^x} \Big] \\
&\propto \prod_{i,j} p^{\min_{x \in C_{B_{ijm} B_{ij(m+M+1)}}} y_{ijm}^x} (1-p)^{M - \min_{x \in C_{B_{ijm} B_{ij(m+M+1)}}} y_{ijm}^x},
\end{aligned}$$

where:

$$y_{ijm}^x \overset{\text{def}}{=} d_H(B_{ij(m+1)} \cdots B_{ij(m+M)},\, x)$$

denotes the respective Hamming distance, and if we define:

$$P_{ijm}^{\min} \overset{\text{def}}{=} p^{\min_{x \in C_{B_{ijm} B_{ij(m+M+1)}}} y_{ijm}^x} (1-p)^{M - \min_{x \in C_{B_{ijm} B_{ij(m+M+1)}}} y_{ijm}^x},$$

the likelihood quantity:

$$\Lambda_m^V((B_n^V)_{n=m}^{m+M-1}) \overset{\text{def}}{=} \sum_{i,j} \log P_{ijm}^{\min} \propto \sum_{i,j} \Big[ \min_{x \in C_{B_{ijm} B_{ij(m+M+1)}}} y_{ijm}^x \Big]$$

(the proportionality holding up to an additive constant, with a negative factor $\log\frac{p}{1-p}$ when $p < 1/2$) can be compared against a threshold parameter for dissolve detection, provided that $p$ is small enough.

4. OTHER FEATURES OF THE MORPHOLOGICAL APPROACH

The morphological approach in Section 3 can be adopted throughout to detect higher-layer features, such as those regarding motion (pans, zooms, object motion, etc.) and objects. The morphological feature exploited and manifested in Section 3.2 is that of monotonicity of set sequences:

$$S_n^{V,0 \to 1} \uparrow S_{m+M-1}^{V,0 \to 1} \quad \text{and} \quad S_n^{V,1 \to 0} \downarrow S_{m+M-1}^{V,1 \to 0} = \emptyset$$

for $m \le n \le m+M-1$, where:

$$S_n^{V,0 \to 1} \overset{\text{def}}{=} S_n^V \cap [(S_m^V)^c \cap S_{m+M-1}^V], \quad \text{and} \quad S_n^{V,1 \to 0} \overset{\text{def}}{=} S_n^V \cap [S_m^V \cap (S_{m+M-1}^V)^c].$$

The above morphological features manifest themselves as "cones" of points when visualized in the context of $S^V$. Similarly, camera pans and object motions would be discernible in the context of $S^V$ as "slanted" cylinders, the degree of slant depending directly on the velocity of motion; wipes would manifest themselves as two "wedges" producing slices along the border [13]; and so on.
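The monotonicity feature above is the same one that the minimum-distance rule of Section 3.2 scores per pixel. A minimal sketch of that statistic follows, assuming each pixel is given as its full binary sequence over the window (two end frames plus the interior frames); the function names are illustrative, and the final threshold comparison is left to the caller.

```python
from typing import List


def favorable_patterns(b_start: int, b_end: int, M: int) -> List[List[int]]:
    """Monotone interior patterns of length M for the given end values.

    Equal ends give the single constant pattern (C_00 or C_11); unequal
    ends give the M + 1 step patterns of C_01 or C_10.
    """
    if b_start == b_end:
        return [[b_start] * M]
    return [[b_start] * k + [b_end] * (M - k) for k in range(M + 1)]


def hamming(a: List[int], b: List[int]) -> int:
    """Hamming distance between two equal-length binary sequences."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(seq: List[int]) -> int:
    """Smallest Hamming distance from the interior of seq to a favorable pattern."""
    interior = seq[1:-1]
    patterns = favorable_patterns(seq[0], seq[-1], len(interior))
    return min(hamming(interior, x) for x in patterns)


def dissolve_statistic(pixel_sequences: List[List[int]]) -> int:
    """Sum over pixels of the minimum distances; small values favor a dissolve."""
    return sum(min_distance(seq) for seq in pixel_sequences)
```

A perfectly monotone pixel such as `[0, 0, 1, 1, 1]` contributes 0, while `[0, 1, 0, 1, 1]` contributes 1, so the statistic counts the minimum number of bit flips needed to make every pixel consistent with a dissolve.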
The detection of these geometrical (or morphological) features can be carried out using morphological filters (openings, HMT's, etc.) [15-17] and invariably leads, in contrast to the various computation-intensive techniques for digital video abstracted as $(G_n^V)_{n=0}^{N^V-1}$, to structured spatio-temporal filter designs that are inherently trivial to implement.

We comment briefly that the storage and retrieval of digital video abstracted as $(B_n^V)_{n=0}^{N^V-1}$ can proceed in the same structured fashion, producing an efficient‡‡ and streamlined architecture for various (higher-layer) multimedia applications that are bound to require more of the digital video library for ensuing intelligent processing.

5. CONCLUSIONS

We have summarized continuing work on morphological scene change detection and on storage and retrieval of digital video within a unified and structured framework. Viewed in this context, our set-theoretic approach becomes another probe into what constitutes a good minimal metric (norm) for our frame space $\mathcal{F}$. By a good minimal metric, we mean an easily computable distance measure with respect to which any video $V$ traces out a path composed of "jerked" motion, i.e. the frames of $Sh_{ij}$ constitute a localized (smooth) subpath, whereas there is a distinct jump when we encounter an abrupt scene change. The cases of dissolve, fade-in and fade-out are defined in our definition (Equations 2 and 3) as constituting another shot where (in the context of the underlying $\mathcal{F}$ as a vector space), as noted indirectly and exploited by researchers [3,4,6,8], the subpath traces out a linear motion (from the start frame to the end frame), which is reflected in the (broken) monotonicity behavior of the fibers.

With the set-theoretic framework (where a video $V$ is identified with the set $S^V$), there are numerous structured (non-ad-hoc) and shape-based geometrical approaches to synthesizing scene change detection and more general multimedia algorithms; this forms both the primary motivation and the topic of current research.
‡‡ e.g. we are currently working on a lossless coding strategy that will compress the digital video data 5-10 fold

ACKNOWLEDGMENTS

This research was supported in part by the Information and Communications Research Program (Grant #98-19) from Korea Telecom (KT) and KMIC (Korea Ministry of Information and Communication).

REFERENCES

1. A. H. F. Arman and M. Chiu, "Image processing on compressed data for large video databases," ACM Multimedia, pp. 267-272, 1993.
2. J. Boreczky and L. Rowe, "Comparison of video shot boundary detection techniques," SPIE 2670, pp. 170-178, 1996.
3. C. L. H. Zhang and S. Smoliar, "Video parsing and browsing using compressed data," Multimedia Tools and Applications 1, pp. 89-111, 1995.
4. B. Yeo and B. Liu, "Rapid scene analysis on compressed video," IEEE Transactions on Circuits and Systems for Video Technology 5, pp. 533-544, 1995.
5. O. Gerek and Y. Altunbasak, "Key frame selection from MPEG video data," SPIE 3024, pp. 920-925, 1997.
6. Y. J. J. Meng and S. Chang, "Scene change detection in a MPEG compressed video sequence," SPIE Symposium on Digital Video Compression 24191, pp. 14-25, 1995.
7. Y. L. T.C.T. Kuo and A. Chen, "Efficient shot change detection on compressed video data," IEEE International Workshop on Multimedia Database, pp. 101-108, 1996.
8. K. U. Y. Nakajima and A. Yoneyama, "Universal scene change detection on MPEG-coded data domain," SPIE, pp. 992-1003, 1997.
9. R. J. A. Hampapur and T. Weymouth, "Production model based digital video segmentation," Multimedia Tools and Applications 1, pp. 9-45, 1995.
10. S. S. H.J. Zhang, A. Kankanhalli, "Automatic partitioning of full-motion video," Multimedia Systems 1, pp. 10-28, 1993.
11. J. M. R. Zabih and K. Mai, "A feature-based algorithm for detecting and classifying scene breaks," Multimedia, pp. 189-200, 1995.
12. R. Duda and P. Hart, Pattern Classification and Scene Analysis, John Wiley, New York, 1973.
13. J. L. W. K. H. Kim, S. Park and S. Song, "Processing of partial video data for detection of wipes," in Processing of Partial Video Data for Detection of Wipes, Proc. SPIE, 1999.
14. B. Y. M. Yeung and B. Liu, "Extracting story units from long programs for video browsing and navigation," Proceedings of Multimedia, pp. 296-305, 1996.
15. J. Serra, Image Analysis and Mathematical Morphology, St. Edmundsbury Press Limited, 1989.
16. P. Maragos and R. W. Schafer, "Morphological filters, part I: Their set-theoretic analysis and relations to linear shift-invariant filters," IEEE Trans. on Acoustics, Speech, and Signal Processing ASSP-35-8, pp. 1153-1169, 1987.
17. S. R. M. Haralick and X. Zhuang, "Image analysis using mathematical morphology," IEEE Trans. on Pattern Analysis and Machine Intelligence PAMI-9, pp. 532-550, 1987.
18. B. L. M. Yeung, "Efficient matching and clustering of video shots," International Conference on Image Processing 1, pp. 338-341, 1995.
19. S. S. H.J. Zhang, C.Y. Low and J. Wu, "Video parsing, retrieval and browsing: An integrated and content-based solution," Multimedia, pp. 15-24, 1995.
20. H. Z. D. Zhong and S. Chang, "Clustering methods for video browsing and annotation," SPIE 2670, pp. 239-246, 1996.
21. A. H. F. Arman, R. Depommier and M. Chiu, "Content-based browsing of video sequences," Multimedia, pp. 97-103, 1994.
22. D. L. Gall, "MPEG: A video compression standard for multimedia applications," Commun. ACM 34, pp. 46-58, 1991.
23. G. Wallace, "The JPEG still picture compression standard," Commun. ACM 34, pp. 30-44, 1991.
24. S. Chang and D. Messerschmitt, "Manipulation and compositing of MC-DCT compressed video," IEEE Journal on Selected Areas in Communications 13, pp. 1-11, 1995.
25. B. Smith and L. Rowe, "Algorithms for manipulating compressed images," IEEE Computer Graphics and Applications, pp. 34-42, 1993.