A Morphological Approach to Scene Change Detection and
Digital Video Storage and Retrieval
Woonkyung M. Kima, S. Moon-Ho Songa , Hyeokman Kimb, Cheeyang Songb , Byung Woong Kwona
and Sun Geun Kima
a School of Electrical Engineering
Korea University, Sungbuk-gu Anam-dong 5 Ga 1, Seoul 136-701, Korea
b Multimedia Technology Research Laboratory
Korea Telecom, Seocho-gu Woomyeon-dong 17, Seoul 137-792, Korea
ABSTRACT
With the abstraction of digital grayscale video as the corresponding binary video (a process which, upon numerous
subjective experiments, seems to preserve most of the intelligibility of the video content) we can pursue a precise
and analytic approach to digital video storage and retrieval algorithm design that is based upon geometrical
(morphological) intuition. The foremost benefit of such abstraction is the immediate reduction of both the data and
computational complexities involved in implementing various algorithms and databases. The general paradigm
presented may be utilized to address all issues pertaining to video library construction, including visualization,
optimum feedback query generation, object recognition, etc., but the primary focus of this paper is the
detection of fast scene changes (including those in the presence of flashlights) and of gradual scene changes such as dissolves,
fades, and various special effects such as wipes. Upon simulation we observed that we can achieve performances
comparable to those of others with drastic reductions in both storage and computational complexities. Since the
conversion from grayscale to binary video can be performed directly, with minimal additional computation, in the
compressed domain by thresholding on the DCT DC coefficients themselves (or by using the contour information
attached to MPEG-4 formats), the algorithms presented herein are ideally suited for performing fast (on-the-fly)
determinations of scene changes, object recognition and/or tracking, and other more intelligent tasks traditionally
requiring heavy computational and/or storage complexities. The fast determinations may then be used
on their own merits or in conjunction or complementation with other higher-layer information in the
future.
Keywords: morphological signal processing, scene change detection, storage and retrieval
1. INTRODUCTION AND NOTATIONS
Following convention, every digital video sequence $V$ is to be logically segmented as a sequence of scenes $S_i^V$, $i = 0, 1, \ldots, L_V - 1$, i.e.

$$V \stackrel{\mathrm{def}}{=} (S_0^V, \ldots, S_{L_V - 1}^V), \qquad (1)$$

where every scene $S_i^V$ can be further decomposed into a sequence of shots $Sh_{ij}^V$, $j = 0, 1, \ldots, M_i - 1$, i.e.

$$S_i^V \stackrel{\mathrm{def}}{=} (Sh_{i0}^V, Sh_{i1}^V, \ldots, Sh_{i(M_i - 1)}^V), \qquad (2)$$

and every shot $Sh_{ij}^V$ can be further decomposed into a sequence of frames $F_{ijk}^V$, $k = 0, 1, \ldots, N_{ij} - 1$, i.e.

$$Sh_{ij}^V \stackrel{\mathrm{def}}{=} (F_{ij0}^V, F_{ij1}^V, \ldots, F_{ij(N_{ij} - 1)}^V). \qquad (3)$$

Further author information: (Send correspondence to Woonkyung M. Kim)
Woonkyung M. Kim: E-mail: [email protected]
In other words, the digital video sequence $V$ is viewed as a sequence of frames $(F_0^V, F_1^V, \ldots, F_{N_V - 1}^V)$ ($N_V$ frames)
which is divided into $L_V$ groups called scenes, each of which is divided into $M_i$ groups called shots, each of which
consists of $N_{ij}$ frames; in particular,

$$N_V = \sum_{i=0}^{L_V - 1} \sum_{j=0}^{M_i - 1} N_{ij}. \qquad (4)$$
The preliminary goal of scene change detection is to identify a given video $V$ as a collection of shots, some of which
are to be labelled as gradual transitions (fade-ins and fade-outs included), wipes and other special effects.
It is possible to adopt a function-theoretic viewpoint towards unifying the many different existing approaches
to scene change detection. Omitting much of the discussion regarding this summary, we point out merely that
we envision each frame $F_n^V$ of any digital video $V$ as a "point" in a discrete trajectory lying in a suitably large
dimensional space $F$, henceforth termed frame space.

With this conceptualization of digital video, it then seems plausible to ask whether there is a metric defined on
the corresponding frame (metric) space $F$ with respect to which the video trajectory maps out a discernible pattern.
Within this context, the generic goal of the existing scene change detection algorithms is to generate a similarity
metric1,2 $d: F \times F \to \mathbb{R}^+$ defined on $F$ such that the points of each shot $Sh_{ij}$ map out a "localized" subpath and
between any two unrelated consecutive shots $Sh_{ij}$ and $Sh_{i(j+1)}$ there is a distinct jump in distance. In defining the
similarity metric $d(\cdot,\cdot)$, it is of significant practical interest to achieve the following two objectives:

perception correlation: a strong correlation between perceptual and computed similarities is preserved, and
computational tractability: the computational and storage complexities involved in calculating the computed similarity are minimized.
Various metric spaces and corresponding metrics have been proposed to achieve the above two objectives. The metrics
in previous works3-6,1,7,8,2 were proposed to fulfill the requirement of computational tractability and those in other
works9-11 and other classical works12 were proposed to fulfill the requirement of perception correlation. The frame space
$F$ can, without any loss in generality, be considered to be a vector space, and the following is an abbreviated overview
of the ingredients of the proposed metrics considered to satisfy the above-mentioned objectives:

fulfillment of perception correlation
1. the standard $l_1$, $l_2$, $l_\infty$ norms defined on the frame space $F$ considered to consist of histograms, raw-image
pixel values, DCT coefficients, etc.,
2. the standard inner product defined on the frame space $F$ considered to consist of histograms, DCT coefficients,
etc.,
3. some statistical measures (e.g. motion vector counts, variances) to reflect the activities of each frame.

fulfillment of computational tractability
1. decimation of data achieved in transform domains (e.g. neglecting some AC coefficients), spatial domains
(e.g. utilization of only some of the macroblocks, averaging of pixel values within a block), and/or other
hybrid domains (e.g. histogramming into a small number of bins),
2. utilization of (readily available) compressed-domain features such as DC/AC coefficients (JPEG, MPEG),
motion vectors (MPEG), and some derived measures.

It is also noted that, as for the cases of gradual transitions such as dissolves, fade-ins and fade-outs, and/or of false
negatives such as zoom-ins, zoom-outs, flashlights, and panning, when we have a more or less clear production model as
in Hampapur,9 the detections of these features involve detections of associated regularities tracked via some derived
quantities (e.g. statistics, interframe distances).
(By a trajectory we mean simply a sequence of points.)
It is possible to conceptualize within the context of our function-theoretic framework all of the above proposed
(similarity) metric-based scene detection algorithms. For the sake of brevity, just to convey the flavor, we point out
merely (at the risk of inaccuracy) the following:

1. the transformation of a raw-image frame into a histogram frame followed by some norm and/or inner product
operation (between two adjacent frames) entails partitioning the original $F$ into equivalence classes (each of which
includes all of the images under the symmetric group of permutations) defined via the (distinct) histogram
frame, and then defining the metric to be the one induced by the standard metric on the histogram frame
space. The hope (and evidence12,1) here is that both the object motions and the motions of the camera are
absorbed well (for example) by the actions of the symmetric group.

2. the (normalized) inner product proposed to measure frame similarity can be viewed as indicating the angle
between the two frames (viewed as vectors) as seen from the origin. As such, the measurement would be
suspect for those videos $V$ (viewed as trajectories in frame space) for which the intershot trajectory (representing
some portion of $Sh_{ij}$ and $Sh_{i(j+1)}$) lies in line with the origin (some special cases of dissolves, and all cases of
fade-ins and fade-outs).
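As an illustration of the first construction above, the histogram-plus-norm distance can be sketched in a few lines. This is a hypothetical Python/numpy sketch of ours, not code from this work; the function name and bin count are arbitrary choices:

```python
import numpy as np

def hist_l1_distance(frame_a, frame_b, bins=64):
    """l1 distance between grayscale histograms of two frames.

    Histogramming collapses each frame onto a representative of its
    equivalence class under pixel permutations; the l1 norm then acts
    on the resulting histogram frame space.
    """
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    return int(np.abs(ha - hb).sum())
```

Any frame obtained by permuting the pixels of another maps to the same histogram, so motions that merely rearrange pixels contribute zero distance.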
One avenue hitherto untapped towards developing algorithms for storage and retrieval of digital video (scene change
detection algorithms included) is that pertaining to decimation in the pixel-value domain.

Our morphological approach to scene change detection and digital video storage and retrieval generally involves
abstracting an $N_V$-frame grayscale video $V$ (normally viewed as a sequence of grayscale-valued functions) as the
corresponding binary and/or set videos:
Definition 1.1 (Grayscale, Binary, Set Video).

1. (grayscale video): A sequence of grayscale functions $(G_n^V)_{n=0}^{N_V - 1}$ where each function is defined by:
$$G_n^V(i,j) = G_{ijn}^V,$$
where each $G_{ijn}^V$ is the $ij$-th (quantized) grayscale pixel value of the $n$-th frame.

2. (binary video): A sequence of binary-valued functions $(B_n^V)_{n=0}^{N_V - 1}$ where each function $B_n^V$ is defined by:
$$B_n^V(i,j) = B_{ijn}^V,$$
where each $B_{ijn}^V$ is the $ij$-th binarized pixel value of the $n$-th frame; we assume a generalized grayscale/binary
converting thresholding in the form of $G_{ijn}^V \to B_{ijn}^V = T_A(G_{ijn}^V)$, where $A$ stands for a general adaptation logic;
e.g. the following representative notations are generally used:
(a) $A = h^V$ implies that a uniform fixed global threshold is used to convert all pixel values to binary, irrespective
of temporal and spatial context, and
(b) $A = h_n^V$ implies that a time-varying global threshold, presumably taking into account intraframe spatial
context, is used to convert all pixels in the frame into respective binary values.

3. (set video): A sequence of (foreground) sets $(S_n^V)_{n=0}^{N_V - 1}$ where each "foreground" subset $S_n^V$ of $\mathbb{Z}^2$ is defined
indirectly by:
$$\chi_{S_n^V} = T_A(G_n^V) = B_n^V,$$
where:
(a) $\chi_C$ denotes the standard characteristic function of the set $C$, and
(b) the functions $T_A(G_n^V)$ and $B_n^V$ are defined by:
i. $T_A(G_n^V)(i,j) = T_A(G_{ijn}^V)$, and
ii. $B_n^V(i,j) = B_{ijn}^V$.

Equivalently, the set video may be formulated as a single set $S^V$ consisting of all "foreground" pixels, i.e. the
subset $S^V$ of $\mathbb{Z}^3$ is defined as the collection of all pixels constituting the "foreground" image, i.e.:
$$S^V = \bigcup_{n} \bigcup_{(i,j) \in S_n^V} \{(i,j,n)\}.$$
With the above definitions of binary and set video at hand, we are ready to tackle the problems of scene change
detection and storage and retrieval of digital video.
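The generalized thresholding of Definition 1.1 can be sketched as follows. This is an illustrative Python/numpy sketch of ours, assuming the video is stored as a (frames, height, width) array; the default adaptation logic $A = h^V$ sets a single global threshold to the mean grayscale value, following the prepass described later in the paper:

```python
import numpy as np

def binarize(gray_frames, threshold=None):
    """Convert a grayscale video G (frames, H, W) into the binary video B = T_A(G).

    With threshold=None, the adaptation logic A = h^V is a uniform fixed
    global threshold set to the mean grayscale value over all frames.
    """
    gray_frames = np.asarray(gray_frames, dtype=float)
    if threshold is None:
        threshold = gray_frames.mean()  # A = h^V: one global threshold
    return (gray_frames > threshold).astype(np.uint8)
```

Passing a per-frame threshold instead would correspond to the time-varying logic $A = h_n^V$.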
It is clear that with the above abstraction of digital video, a rather natural and classical representation of a digital
video $V$ is as a frame-point trajectory, a sequence of points $(G_n^V)_{n=0}^{N_V - 1}$, in a suitably large dimensional vector
space. For example, viewed in the context of a natural Banach space (with the $l_2$-norm, say), the following characterizing
manifestations are presumed to occur:
1. Abrupt scene changes ↔ "jumps" in the frame-point trajectory
2. Flashlights ↔ sudden short-duration (typically 1 or 2 frame-points) "jumps" in the frame-point trajectory
3. Intrashot changes (within $Sh_{ij}^V$) ↔ "close" neighboring frame-points
4. Gradual scene changes (some $Sh_{ij}^V$): dissolves, fade-ins, fade-outs ↔ straight line segment(s) joining the two end
frame-points $G_m^V$ and $G_{m+M-1}^V$:
$$G_n^V = \alpha(n,m,M)\, G_m^V + \beta(n,m,M)\, G_{m+M-1}^V, \qquad (5)$$
where $\alpha(n,m,M)$, $\beta(n,m,M)$, with $0 \le \alpha(n,m,M), \beta(n,m,M) \le 1$, are monotonic decreasing/increasing functions
with:
(a) $\alpha(m,m,M) = 1$, $\alpha(m+M-1,m,M) = 0$, and
(b) $\beta(m,m,M) = 0$, $\beta(m+M-1,m,M) = 1$,
is a generic model of dissolves and fades. A simple and typical model is a uniform dissolve where:
$$\alpha(n,m,M) = \frac{M-1+m-n}{M-1} = 1 - \beta(n,m,M), \quad \text{and} \quad \beta(n,m,M) = \frac{n-m}{M-1}, \qquad (6)$$
i.e. where the first frame begins to turn off exactly when the second frame begins to turn on, and the first
frame turns off completely exactly when the second frame turns on completely. Equation 5, viewed as vector
addition, clearly corresponds to a straight line segment joining the two end frame-points.
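The uniform dissolve of Equations 5 and 6 can be synthesized directly. A minimal Python/numpy sketch of ours, using the weights $\beta(n) = (n-m)/(M-1)$ and $\alpha(n) = 1 - \beta(n)$:

```python
import numpy as np

def uniform_dissolve(frame_start, frame_end, M):
    """Synthesize the M-frame uniform dissolve of Equations 5 and 6:
    G_n = alpha(n) * G_m + beta(n) * G_{m+M-1}, with beta(n) = (n - m) / (M - 1).
    """
    frames = []
    for k in range(M):               # k = n - m
        beta = k / (M - 1)
        frames.append((1 - beta) * frame_start + beta * frame_end)
    return np.stack(frames)
```

The resulting sequence of frame-points lies exactly on the straight line segment joining the two end frames.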
The morphological approach entails utilizing the homomorphisms:
$$F_n^V \leftrightarrow G_n^V \leftrightarrow B_n^V \leftrightarrow S_n^V,$$
which (upon heuristic observations†) retain the desirable properties of perception correlation and computational
tractability, towards developing the following morphological paradigm (which may be used in conjunction with
traditional paradigms) which operates directly and exclusively in the binary or boolean-valued domain (see Figure 1).
The goal of the following sections is to examine the validity of such a minimal paradigm for multimedia manipulations involving storage and retrieval of digital video.

† with some exceptions of some wipes and other minor defects which do not register upon binary conversion
Figure 1. Overall Morphological System for Storage and Retrieval
2. RENDERING OF VIDEO FOR PRESENTATION, VERIFICATION AND EDITING
It is clear that when the underlying frame space is partitioned into equivalence classes corresponding to either of the
equivalence relations (each is actually a homomorphism):

1. $G_p^V \sim G_q^V \Leftrightarrow T_{h^V}(G_p^V) = T_{h^V}(G_q^V)$, and
2. $G_p^V \sim G_q^V \Leftrightarrow T_{h^V}(\pi_k(G_p^V)) = T_{h^V}(\pi_k(G_q^V)) \;\; \forall k \in I$, the index set, where the $\pi_k$'s are the windowing or masking
transformations,

the characterizing manifestations of the last part of the last section remain (more or less) invariant. Such intuition
was also verified subjectively by binarizing M-JPEG video and noting that the underlying video content remains
(more or less, irrespective of reasonable choices of the thresholding constant) "intelligible". Furthermore, a natural (from
computational efficiency and real-time decoding considerations) abstraction of digital video results from choosing the
existing windowing method (macroblock) and setting:

$$T_{h_n^V}(\pi_k(G_n^V))(i,j) = T_{h_n^V}\Big(\sum_{(p,q) \in MB(k)} G_{pqn}^V\Big), \qquad (7)$$

which corresponds to thresholding on the DCT DC coefficient of the macroblock $MB(k)$ containing pixel $(i,j)$ of the $n$-th frame,
where:

1. $h_n^V = \frac{\sum_{(i,j)} G_{ijn}^V}{MN}$ (each $n$-th frame of video $V$ is $M \times N$ pixels),
2. $\forall k = 0, 1, 2, \ldots, K-1$ (there are $K$ macroblocks in a given frame),

which requires only a trivial (MPEG/M-JPEG/H.26x) decoding (extraction of the DCT DC coefficient of the corresponding macroblock) followed by thresholding (or logical OR'ing of the highest order bits).
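A sketch of the macroblock reduction of Equation 7. Since a block's mean value equals its DCT DC coefficient up to a constant factor, thresholding block means mimics thresholding on the compressed-domain DC terms; the function name, block size parameter, and divisibility assumption are ours:

```python
import numpy as np

def binarize_by_block_mean(gray_frame, block=8, threshold=None):
    """Binarize one frame by thresholding each block's mean value
    (a stand-in for the DCT DC coefficient, which equals the block
    mean up to scaling). Frame dimensions must be divisible by `block`.
    """
    h, w = gray_frame.shape
    if threshold is None:
        threshold = gray_frame.mean()  # h_n^V: per-frame global mean
    # mean over each block x block tile = DC coefficient up to a constant
    dc = gray_frame.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    bits = (dc > threshold).astype(np.uint8)
    # expand each block decision back to the pixel grid
    return np.kron(bits, np.ones((block, block), dtype=np.uint8))
```

On real MPEG/M-JPEG streams the block means would instead be read directly from the entropy-decoded DC coefficients, with no pixel-domain reconstruction at all.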
Adhering to the maxim that "a picture is worth a thousand words", we are currently in the process of efficiently
rendering video for verification and editing purposes via the following more compact (computation- and storage-efficient)
visual presentations:

1. Sequences that are "projections" (or other suitable reductions) of the elements of the original sequence $(G_n^V)_{n=0}^{N_V - 1}$
onto subsampled (2-D) spaces,13 or
2. $S^V$ in $\mathbb{Z}^3$, or the reduced set corresponding to Equation 7; such representations‡ are appropriate for distinguishing interframe variations (e.g. object movements, pans, dissolves, fades, wipes, etc.).

The above presentations, being able to depict intra- and interframe variations effectively, can serve to complement
the representation proposed by Yeung14 which, more or less, conveys the structural semantics of video. The visual
rhythm13 is a special case of 1. which captures the scale-space variations well, and the rendering in 2. captures (in
addition) the geometries (shapes and motion) within the underlying image frames.
3. MORPHOLOGICAL APPROACH TO SCENE CHANGE DETECTION
The primary motivation behind this strategy is that, given a (color) grayscale video $(G_n^V)_{n=0}^{N_V - 1}$, the intelligibility of
the video is retained by the corresponding binary and/or set videos $(B_n^V)_{n=0}^{N_V - 1}$ and $(S_n^V)_{n=0}^{N_V - 1}$.

It is clear that the new frame space $\tilde{F}$, in the context of the old frame space $F$, can be visualized as a "binarized"
space in that each original real coordinate axis is mapped into a binary axis consisting only of 0 and 1; equivalently, the
new frame space induces once again equivalence classes on the old one which seem to correspond well§ to the perceptual
notion of equivalence of frames (in that the members of the same equivalence class correspond to "virtually" the same
images, especially as the dimension of the underlying frame spaces gets larger). In this context, the $l_1$-norm¶
(operating on the representative elements of the equivalence classes) does give a reasonable measure of distance
between the original frame images.

Viewing the new frame space $\tilde{F}$ in conjunction with the familiar $l_1$-norm, we propose the following three metrics
for scene change detection∥:
Definition 3.1 (metrics). The following metrics apply for $(B_n^V)_{n=0}^{N_V - 1}$:

conventional $l_1$-metric: $d_1(B_p^V, B_q^V) \stackrel{\mathrm{def}}{=} \|B_p^V - B_q^V\|_1 = \sum_{(i,j)} |B_{ijp}^V - B_{ijq}^V| = \sum_{(i,j)} |B_{ijp}^V \oplus B_{ijq}^V|$,

translation invariant metric: $d_{ti}(B_p^V, B_q^V) \stackrel{\mathrm{def}}{=} \big|\sum_{(i,j)} (B_{ijp}^V - B_{ijq}^V)\big|$, and

foreground biased metric: $d_{fbm}(B_p^V, B_q^V) \stackrel{\mathrm{def}}{=} \sum_{(i,j)} [B_{ijp}^V (1 - B_{ijq}^V)]$.

Equivalently, the following metrics apply for $(S_n^V)_{n=0}^{N_V - 1}$:

conventional $l_1$-metric: $d_1(S_p^V, S_q^V) \stackrel{\mathrm{def}}{=} |S_p^V - S_q^V| + |S_q^V - S_p^V| = |S_p^V \oplus S_q^V|$,

translation invariant metric: $d_{ti}(S_p^V, S_q^V) \stackrel{\mathrm{def}}{=} \big| |S_p^V - S_q^V| - |S_q^V - S_p^V| \big|$,

foreground biased metric: $d_{fbm}(S_p^V, S_q^V) \stackrel{\mathrm{def}}{=} |S_p^V - [S_p^V \cap S_q^V]|$.

In particular, the distance $d_m(\cdot,\cdot)$ between two consecutive frames $B_n^V$ and $B_{n+1}^V$ is measured by:

$$D_n^m \stackrel{\mathrm{def}}{=} d_m(B_n^V, B_{n+1}^V), \qquad (8)$$

where $n$ stands for any of $0, 1, 2, \ldots, N_V - 2$, and $m$ stands for any of $1$, $ti$, and $fbm$.
It is clear that the computational complexities involved in any of the three metrics above are far below those
corresponding to the (non-binary-valued) grayscale digital video $(G_n^V)_{n=0}^{N_V - 1}$, and the metrics can be implemented trivially
utilizing (bitwise) hardware and software operations.
‡ it is clear that we may also use the boundary information contained in MPEG-4 for this particular rendering
§ upon visual inspection of test videos
¶ in fact, any standard norm will serve the same purpose
∥ $\oplus$ stands for exclusive-OR and $\|\cdot\|_1$ stands for the $l_1$-metric
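For concreteness, the three metrics of Definition 3.1 reduce to elementary bitwise operations on 0/1 arrays. A minimal Python/numpy sketch of ours (not the authors' implementation):

```python
import numpy as np

def d1(bp, bq):
    """Conventional l1-metric: number of differing pixels (bitwise XOR count)."""
    return int(np.count_nonzero(bp ^ bq))

def dti(bp, bq):
    """Translation invariant metric: |difference of the two foreground areas|."""
    return abs(int(bp.sum()) - int(bq.sum()))

def dfbm(bp, bq):
    """Foreground biased metric: foreground pixels of p that are background in q."""
    return int(np.count_nonzero(bp & (1 - bq)))
```

In hardware, all three amount to XOR/AND gates followed by population counts over packed bit planes.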
In our implementations, each threshold value $h_n^V$ is set globally (in a prepass) as the mean of all grayscale pixel
values within a given range of frames. When the original video is converted from $V$ to $(B_n^V)_{n=0}^{N_V - 1}$ and played back,
the overall intelligibility of the scenes within the video seemed to be preserved (even for dark video).

The real advantage of converting to $(B_n^V)_{n=0}^{N_V - 1}$ or $(S_n^V)_{n=0}^{N_V - 1}$ is that traditional shape-based signal processing15-17
can then take place to derive additional features (expected to be important in general content-based retrieval applications14,18-21) to ensure more accurate determination.
It can be pointed out that the calculations involved in computing the metrics in Equation 8 and Definition 3.1
are trivial. In fact, any calculation performed on the binarized video $(B_n^V)_{n=0}^{N_V - 1}$ is expected to be simple, and this point
offers yet another reason for binarizing the video at the outset. We comment also that simple manipulations can be
performed (comparable to those in compressed-domain approaches4,22-25) to extract $(B_n^V)_{n=0}^{N_V - 1}$.
3.1. Detection of Abrupt Scene Changes
When $D_n^1$ as defined above is utilized towards abrupt scene change detection, the performances seem comparable to
those reported elsewhere3,4,2 using substantially more computation, and the performances were surprisingly robust
with respect to severe object motion, which was expected from the works of Meng and Nakajima6,8 who used
(thresholded) $l_1$-norms on histogram frames. It is also clear that the metric $d_{ti}(\cdot,\cdot)$ on the new binary frame space $\tilde{F}$ can
be expected to accurately weed out localized (non-entering and non-exiting) bilinear (2-D) object motions:

$$d_{ti}(B_p^V, B_q^V) \stackrel{\mathrm{def}}{=} \Big| \sum_{(i,j)} (B_{ijp}^V - B_{ijq}^V) \Big| = \Big| \sum_{(i,j)} B_{ijp}^V - \sum_{(i,j)} B_{ijq}^V \Big| = |\text{Area of foreground of } B_p^V - \text{Area of foreground of } B_q^V| = 0.$$
Thus, it can be verified that for a small object movement (either foreground or background) in a saturated (i.e.
mostly foreground or mostly background) frame, the above metric is close to 0 and offers a meaningful indication of similarity
between consecutive frames within a (non-gradual-scene-change) shot $Sh_{ij}^V$, i.e.

$$D_n^{ti} \stackrel{\mathrm{def}}{=} d_{ti}(B_n^V, B_{n+1}^V) \approx D_{n+1}^{ti} \approx 0. \qquad (9)$$
From our testing of digital video for abrupt scene changes (with the setting of a global threshold $h^V$), the metric
$d_{ti}(\cdot,\cdot)$, designed to alleviate the effects of motion, performed worst (60 percent detection with low false alarm);
$d_1(\cdot,\cdot)$ performed markedly better (75 percent detection with low false alarm); and $d_{fbm}(\cdot,\cdot)$ performed best
(90 percent detection with low false alarm), comparable to Yeo's algorithm.4 Such surprising performance can be seen
to be attributable to the various nonlocalized (in-frame) motion and noise inherent in (binarizing) real digital video.
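An abrupt-cut detector along these lines can be sketched as follows. The fixed ratio threshold is an illustrative choice of ours, not the paper's decision rule:

```python
import numpy as np

def detect_cuts(binary_frames, thresh_ratio=0.2):
    """Flag frame indices n where D_n^fbm jumps above a fixed fraction
    of the frame area, suggesting an abrupt scene change at n -> n+1.
    """
    T, H, W = binary_frames.shape
    cuts = []
    for n in range(T - 1):
        # d_fbm: foreground pixels of frame n that vanish in frame n+1
        d = int(np.count_nonzero(binary_frames[n] & (1 - binary_frames[n + 1])))
        if d > thresh_ratio * H * W:
            cuts.append(n)
    return cuts
```

A practical detector would also suppress isolated one- or two-frame spikes to avoid flagging flashlights as cuts.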
3.2. Detection of Dissolves, Fade-ins, Fade-outs
Gradual scene changes as represented by dissolves, by virtue of their being scale-space transformations, are ill-posed
for detection by (binary) morphological means. Nonetheless, for the simplest case of Equations 5 and 7, the
uniform dissolve, where the respective fades begin and end simultaneously (other types can be treated similarly
with a straightforward generalization), the visualization of the corresponding $S^V$ (or equivalently $(S_n^V)_{n=0}^{N_V - 1}$) yields
the following simple necessary and sufficient condition for the existence of a dissolve when we use a single global
threshold $h^V$ in Definition 1.1 and $\pi_k(\cdot)$ as in Equation 7.
Since:

$$\begin{aligned}
T_{h^V}(\pi_k(G_n^V)) &= T_{h^V}(\pi_k(\alpha(n,m,M)\, G_m^V + \beta(n,m,M)\, G_{m+M-1}^V)) \\
&= T_{h^V}(\alpha(n,m,M)\, \pi_k(G_m^V) + \beta(n,m,M)\, \pi_k(G_{m+M-1}^V)) \\
&= T_{h^V}(\pi_k(G_m^V) + \beta(n,m,M)\, [\pi_k(G_{m+M-1}^V) - \pi_k(G_m^V)]),
\end{aligned} \qquad (10)$$

where $\beta(n,m,M) = \frac{n-m}{M-1}$ as in Equation 6, we have:

$$[T_{h^V}(\pi_k(G_n^V))](i,j) = \begin{cases}
1 & [T_{h^V}(\pi_k(G_m^V))](i,j) = 1 \wedge [T_{h^V}(\pi_k(G_{m+M-1}^V))](i,j) = 1 \\
0 \uparrow 1 & [T_{h^V}(\pi_k(G_m^V))](i,j) = 0 \wedge [T_{h^V}(\pi_k(G_{m+M-1}^V))](i,j) = 1 \\
1 \downarrow 0 & [T_{h^V}(\pi_k(G_m^V))](i,j) = 1 \wedge [T_{h^V}(\pi_k(G_{m+M-1}^V))](i,j) = 0 \\
0 & [T_{h^V}(\pi_k(G_m^V))](i,j) = 0 \wedge [T_{h^V}(\pi_k(G_{m+M-1}^V))](i,j) = 0
\end{cases} \qquad (11)$$

where $\uparrow$, $\downarrow$ denote the fact that the behaviors are monotonic in $n$, i.e. that as a function of $n$ in the range $[m, m+M-1]$
the values correspond to shifted (respectively, inverted) unit step functions.

∗ … so as to appear quasi-static; for color video, each color pixel value had to be converted to its grayscale equivalent

Utilizing such a necessary and sufficient condition for a uniform dissolve (corresponding to a noiseless ideal case) we have derived the following pair of heuristic measures
that count and weigh the occurrences of the frame transitions $0 \to 0$, $0 \to 1$, $1 \to 0$ and $1 \to 1$ over a fixed window of
duration $M$ (the suspected maximal/likely duration of dissolves) and thereby detect peaks of the "number" of favorable††
patterns of binary sequences corresponding to pixel $(i,j)$:
$$Q_{m,1}^V((S_n^V)_{n=m}^{m+M-1}) \stackrel{\mathrm{def}}{=} \sum_{n=m}^{m+M-2} \big[(\alpha - \alpha')\, |S_n^V - S_{n+1}^V| + (\beta - \beta')\, |S_{n+1}^V - S_n^V|\big], \qquad (12)$$

$$Q_{m,2}^V((S_n^V)_{n=m}^{m+M-1}) \stackrel{\mathrm{def}}{=} \sum_{n=m}^{m+M-2} \big[(\gamma - \gamma')\, |S_n^V - S_{n+1}^V| + (\delta - \delta')\, |S_{n+1}^V - S_n^V|\big]. \qquad (13)$$
With the aid of an elementary (morphological) detector in the form of Equation 12 we were able to clearly discern
the occurrences of synthetic dissolves, but for real digital video data there was clearly room for improvement in the
selection of:

1. the optimal threshold value $h^V$ (possibly $h_n^V$),
2. the optimal weighting parameters,
3. the pattern detectors themselves, $Q_{m,1}^V(\cdot)$ and $Q_{m,2}^V(\cdot)$.
But it is also clear that the involved computations remain extremely simple; more specifically, by Equation 7 the
data $(S_n^V)_{n=0}^{N_V - 1}$ can be fetched directly from the DCT DC components of the compressed video stream, and the involved
calculations in Equation 12, being virtually bit-level (binary-valued) operations, are trivial and could possibly be
implemented via simple dedicated (cellular) hardware at the detector for virtually instantaneous performance.
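A detector in the spirit of Equation 12 can be sketched as follows. Since the original weighting parameters are only partly recoverable from the text, the two weights here are generic stand-ins of ours:

```python
import numpy as np

def q1_detector(binary_frames, m, M, w_loss=1.0, w_gain=1.0):
    """Heuristic dissolve measure over the window [m, m+M-1]:
    a weighted count of pixels lost (1 -> 0) and gained (0 -> 1)
    between consecutive binarized frames, after Equation 12.
    """
    total = 0.0
    for n in range(m, m + M - 1):
        bp, bq = binary_frames[n], binary_frames[n + 1]
        lost = int(np.count_nonzero(bp & (1 - bq)))    # |S_n - S_{n+1}|
        gained = int(np.count_nonzero(bq & (1 - bp)))  # |S_{n+1} - S_n|
        total += w_loss * lost + w_gain * gained
    return total
```

Sliding the window over $m$ and looking for peaks of this measure corresponds to the peak-detection step described above.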
It was clear in our experiments that the proposed solution for the detection of dissolves did not work due to motion
within the dissolving shot. The fact that motion remains of paramount concern in the detection of dissolves can be seen
by witnessing that a quick version of ML detection (which ignores structured motion) in the form of recognizing
fibers (corresponding to the points on the $(i,j)$-th pixel within the dissolving shot) still fails to detect real dissolves
upon experimentation.
More specifically, it can be seen that the a posteriori probability of a given sequence $(B_n^V)_{n=m}^{m+M-1}$ being in a
dissolving shot is:

$$\begin{aligned}
APD_m^V((B_n^V)_{n=m}^{m+M-1}) &= P(\text{dissolve} \mid (B_n^V)_{n=m}^{m+M-1}) \\
&= \frac{P((B_n^V)_{n=m}^{m+M-1} \mid \text{dissolve})\, P(\text{dissolve})}{P((B_n^V)_{n=m}^{m+M-1})} \\
&\propto P((B_n^V)_{n=m}^{m+M-1} \mid \text{dissolve}) \\
&= \prod_{i,j} P((B_{ijn})_{n=m}^{m+M-1} \mid \text{dissolve}) \\
&= \prod_{i,j} P(B_{ijm} B_{ij(m+1)} \cdots B_{ij(m+M-1)} \mid \text{dissolve}) \\
&= \prod_{i,j} P(B_{ij(m+1)} \cdots B_{ij(m+M-2)} \mid B_{ijm} B_{ij(m+M-1)}, \text{dissolve})\, P(B_{ijm} B_{ij(m+M-1)} \mid \text{dissolve}) \\
&\propto \prod_{i,j} P(B_{ij(m+1)} \cdots B_{ij(m+M-2)} \mid B_{ijm} B_{ij(m+M-1)}, \text{dissolve}) \\
&= \prod_{i,j} \Big[ \sum_{x \in \{0,1\}^{M-2}} P(B_{ij(m+1)} \cdots B_{ij(m+M-2)},\, x \mid B_{ijm} B_{ij(m+M-1)}, \text{dissolve}) \Big] \\
&= \prod_{i,j} \Big[ \sum_{x \in \{0,1\}^{M-2}} P(x \mid B_{ijm} B_{ij(m+M-1)}, \text{dissolve})\, P(B_{ij(m+1)} \cdots B_{ij(m+M-2)} \mid x, B_{ijm} B_{ij(m+M-1)}, \text{dissolve}) \Big] \\
&\propto \prod_{i,j} \Big[ \sum_{x \in C_{B_{ijm} B_{ij(m+M-1)}}} P(B_{ij(m+1)} \cdots B_{ij(m+M-2)} \mid x, B_{ijm} B_{ij(m+M-1)}, \text{dissolve}) \Big],
\end{aligned} \qquad (14)$$

since (in the last line):

$$P(x \mid B_{ijm} B_{ij(m+M-1)}, \text{dissolve}) \propto \begin{cases} 1 & x \in C_{B_{ijm} B_{ij(m+M-1)}} \\ 0 & x \notin C_{B_{ijm} B_{ij(m+M-1)}} \end{cases},$$

†† we took the first-detector weights to be 1, their primed counterparts 0, and the second-detector weights $-4$
where the sets of (favorable) monotonicity patterns are as follows:

$$\begin{aligned}
C_{00} &\stackrel{\mathrm{def}}{=} \{0 \cdots 0\}, \\
C_{11} &\stackrel{\mathrm{def}}{=} \{1 \cdots 1\}, \\
C_{01} &\stackrel{\mathrm{def}}{=} \{1 \cdots 1,\, 01 \cdots 1,\, \ldots,\, 0 \cdots 0\} \;\text{(all monotonically nondecreasing patterns)}, \\
C_{10} &\stackrel{\mathrm{def}}{=} \{0 \cdots 0,\, 10 \cdots 0,\, \ldots,\, 1 \cdots 1\} \;\text{(all monotonically nonincreasing patterns)}.
\end{aligned}$$
Obviously, when we parameterize the probability of (statistically independent) inconsistency with a single parameter
$p$, the following classical minimum-distance12 detection rule arises:

$$\begin{aligned}
APD_m^V((B_n^V)_{n=m}^{m+M-1}) &\propto \prod_{i,j} \Big[ \sum_{x \in C_{B_{ijm} B_{ij(m+M-1)}}} p^{y_{ijm}^x} (1-p)^{(M-2) - y_{ijm}^x} \Big] \\
&\propto \prod_{i,j} \Big[ \max_{x \in C_{B_{ijm} B_{ij(m+M-1)}}} p^{y_{ijm}^x} (1-p)^{(M-2) - y_{ijm}^x} \Big] \\
&\propto \prod_{i,j} p^{\min_{x \in C_{B_{ijm} B_{ij(m+M-1)}}} y_{ijm}^x}\, (1-p)^{(M-2) - \min_{x \in C_{B_{ijm} B_{ij(m+M-1)}}} y_{ijm}^x},
\end{aligned}$$

where:

$$y_{ijm}^x \stackrel{\mathrm{def}}{=} d_H(B_{ij(m+1)} \cdots B_{ij(m+M-2)},\, x)$$

denotes the respective Hamming distance, and if we define:

$$P_{ijm} \stackrel{\mathrm{def}}{=} p^{\min_{x \in C_{B_{ijm} B_{ij(m+M-1)}}} y_{ijm}^x}\, (1-p)^{(M-2) - \min_{x \in C_{B_{ijm} B_{ij(m+M-1)}}} y_{ijm}^x},$$

the likelihood quantity:

$$\Lambda_m^V((B_n^V)_{n=m}^{m+M-1}) \stackrel{\mathrm{def}}{=} \sum_{i,j} \log P_{ijm} \propto \sum_{i,j} \Big[ \min_{x \in C_{B_{ijm} B_{ij(m+M-1)}}} y_{ijm}^x \Big]$$

can be thresholded against a threshold parameter for dissolve detection provided that $p$ is small enough.
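The per-pixel quantity $\min_x y_{ijm}^x$, the minimum Hamming distance from a pixel's binary fiber to the nearest favorable monotone pattern, can be sketched as follows. For simplicity this sketch of ours searches all monotone patterns rather than only the class $C$ selected by the endpoint bits:

```python
def min_hamming_to_monotone(bits):
    """Minimum Hamming distance from a 0/1 fiber to the nearest monotone
    pattern: all-0, all-1, a step-up 0^k 1^(L-k), or a step-down 1^k 0^(L-k).
    """
    L = len(bits)
    best = L
    for k in range(L + 1):
        step_up = [0] * k + [1] * (L - k)    # 0 -> 1 transition patterns
        step_down = [1] * k + [0] * (L - k)  # 1 -> 0 transition patterns
        for pat in (step_up, step_down):
            d = sum(b != p for b, p in zip(bits, pat))
            best = min(best, d)
    return best
```

Summing this quantity over all pixels and thresholding the total yields the minimum-distance dissolve rule, with the sum small when (nearly) every fiber behaves monotonically over the window.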
4. OTHER FEATURES OF THE MORPHOLOGICAL APPROACH
The morphological approach in Section 3 can be adopted throughout to detect higher-layer features such as those regarding
motion (pans, zooms, object motion, etc.) and objects. The morphological feature exploited and manifested in
Section 3.2 is that of the monotonicity of set sequences:

$$S_n^{V, 0 \to 1} \uparrow S_{m+M-1}^{V, 0 \to 1} \quad \text{and} \quad S_n^{V, 1 \to 0} \downarrow S_{m+M-1}^{V, 1 \to 0} = \emptyset$$

for $m \le n \le m+M-1$, where:

$$S_n^{V, 0 \to 1} \stackrel{\mathrm{def}}{=} S_n^V \cap [(S_m^V)^c \cap S_{m+M-1}^V], \quad \text{and} \quad S_n^{V, 1 \to 0} \stackrel{\mathrm{def}}{=} S_n^V \cap [S_m^V \cap (S_{m+M-1}^V)^c].$$
The above morphological features manifest themselves as "cones" of points when visualized in the context of $S^V$.
Similarly, camera pans and object motions would be discernible in the context of $S^V$ as "slanted" cylinders, the degree
of slant depending directly on the velocity of the motion; wipes would manifest themselves as two "wedges" producing
slices along the border,13 and so on. The detection of these geometrical (or morphological) features can be carried out
using morphological filters (openings, HMT's, etc.)15-17 and invariably leads, in contrast to the various computation-intensive
techniques for digital video abstracted as $(G_n^V)_{n=0}^{N_V - 1}$, to structured spatio-temporal filter designs that are
inherently trivial to implement.
We comment briefly that the storage and retrieval of digital video abstracted as $(B_n^V)_{n=0}^{N_V - 1}$ can proceed along in the
same structured fashion, producing an efficient‡‡ and streamlined architecture for the various (higher-layer) multimedia
applications that are bound to require more of the digital video library for ensuing intelligent processing.
5. CONCLUSIONS
We have summarized continuing work on morphological scene change detection and storage and retrieval of digital
video from a unified and structured framework. Viewed in this context, our set-theoretic approach becomes another
probe into what constitutes a good minimal metric (norm) for our frame space $F$. By a good minimal metric, we mean
an easily computable distance measure with respect to which any video $V$ traces out a path composed of
"jerked" motion, i.e. the frames of $Sh_{ij}$ constitute a localized (smooth) subpath, whereas there is a distinct jump when
we encounter an abrupt scene change. The cases of dissolve, fade-in and fade-out are defined in our definition (Equations
2 and 3) to constitute another shot where (in the context of the underlying $F$ as a vector space), as noted indirectly
and exploited by other researchers,3,4,6,8 the subpath tracks out a linear motion (from the start frame to the end frame)
which is reflected in the (broken) monotonicity behavior of the fibers.

With the set-theoretic framework (where a video $V$ is identified with a set $S^V$), there are numerous structured (non-ad-hoc) and shape-based geometrical approaches to synthesizing scene change detection and more general multimedia
algorithms; this forms both the primary motivation and the topic of current research.
ACKNOWLEDGMENTS
This research was supported in part by the Information and Communications Research Program (Grant #98-19)
from Korea Telecom (KT) and KMIC (Korea Ministry of Information and Communication).
REFERENCES
1. A. H. F. Arman and M. Chiu, "Image processing on compressed data for large video databases," ACM Multimedia, pp. 267-272, 1993.
2. J. Boreczky and L. Rowe, "Comparison of video shot boundary detection techniques," SPIE 2670, pp. 170-178, 1996.
3. C. L. H. Zhang and S. Smoliar, "Video parsing and browsing using compressed data," Multimedia Tools and Applications 1, pp. 89-111, 1995.
4. B. Yeo and B. Liu, "Rapid scene analysis on compressed video," IEEE Transactions on Circuits and Systems for Video Technology 5, pp. 533-544, 1995.
5. O. Gerek and Y. Altunbasak, "Key frame selection from MPEG video data," SPIE 3024, pp. 920-925, 1997.
6. Y. J. J. Meng and S. Chang, "Scene change detection in a MPEG compressed video sequence," SPIE Symposium on Digital Video Compression 2419, pp. 14-25, 1995.
7. Y. L. T.C.T. Kuo and A. Chen, "Efficient shot change detection on compressed video data," IEEE International Workshop on Multimedia Database, pp. 101-108, 1996.

‡‡ e.g. we are currently working on a lossless coding strategy that will compress the digital video data 5-10 fold

8. K. U. Y. Nakajima and A. Yoneyama, "Universal scene change detection on MPEG-coded data domain," SPIE, pp. 992-1003, 1997.
9. R. J. A. Hampapur and T. Weymouth, "Production model based digital video segmentation," Multimedia Tools and Applications 1, pp. 9-45, 1995.
10. S. S. H.J. Zhang, A. Kankanhalli, "Automatic partitioning of full-motion video," Multimedia Systems 1, pp. 10-28, 1993.
11. J. M. R. Zabih and K. Mai, "A feature-based algorithm for detecting and classifying scene breaks," Multimedia, pp. 189-200, 1995.
12. R. Duda and P. Hart, Pattern Classification and Scene Analysis, John Wiley, New York, 1973.
13. J. L. W. K. H. Kim, S. Park and S. Song, "Processing of partial video data for detection of wipes," Proc. SPIE, 1999.
14. B. Y. M. Yeung and B. Liu, "Extracting story units from long programs for video browsing and navigation," Proceedings of Multimedia, pp. 296-305, 1996.
15. J. Serra, Image Analysis and Mathematical Morphology, St. Edmundsbury Press Limited, 1989.
16. P. Maragos and R. W. Schafer, "Morphological filters, part I: their set-theoretic analysis and relations to linear shift-invariant filters," IEEE Trans. on Acoustics, Speech, and Signal Processing ASSP-35-8, pp. 1153-1169, 1987.
17. S. R. M. Haralick and X. Zhuang, "Image analysis using mathematical morphology," IEEE Trans. on Pattern Analysis and Machine Intelligence PAMI-9, pp. 532-550, 1987.
18. B. L. M. Yeung, "Efficient matching and clustering of video shots," International Conference on Image Processing 1, pp. 338-341, 1995.
19. S. S. H.J. Zhang, C.Y. Low and J. Wu, "Video parsing, retrieval and browsing: an integrated and content-based solution," Multimedia, pp. 15-24, 1995.
20. H. Z. D. Zhong and S. Chang, "Clustering methods for video browsing and annotation," SPIE 2670, pp. 239-246, 1996.
21. A. H. F. Arman, R. Depommier and M. Chiu, "Content-based browsing of video sequences," Multimedia, pp. 97-103, 1994.
22. D. L. Gall, "MPEG: a video compression standard for multimedia applications," Commun. ACM 34, pp. 46-58, 1991.
23. G. Wallace, "The JPEG still picture compression standard," Commun. ACM 34, pp. 30-44, 1991.
24. S. Chang and D. Messerschmitt, "Manipulation and compositing of MC-DCT compressed video," IEEE Journal on Selected Areas in Communications 13, pp. 1-11, 1995.
25. B. Smith and L. Rowe, "Algorithms for manipulating compressed images," IEEE Computer Graphics and Applications, pp. 34-42, 1993.