
Learning the
meaning of music
Brian Whitman
MIT Media Lab
April 14 2005
Committee
Barry Vercoe
Professor of Media Arts & Sciences
Massachusetts Institute of Technology
[Music Mind and Machine]
Daniel P.W. Ellis
Assistant Professor of Electrical Engineering
Columbia University
[LabROSA]
Deb Roy
Associate Professor of Media Arts & Sciences
Massachusetts Institute of Technology
[Cognitive Machines]
NEWTON v. DIAMOND
APPENDIX
[James Newton, “Choir” from Axum (ECM Recordings)]
[Beastie Boys, “Pass the Mic” from Check Your Head, Grand Royal]
JEFFREY A. BERCHENKO, SBN 094902
LAW OFFICE OF JEFFREY BERCHENKO
240 Stockton Street, 3rd Floor
San Francisco, California 94108
(415) 362-5700; Fax (415) 362-4119
Attorneys for Plaintiff James W. Newton, Jr. dba Janew Music

UNITED STATES DISTRICT COURT
CENTRAL DISTRICT OF CALIFORNIA

JAMES W. NEWTON, JR. dba JANEW MUSIC, Plaintiff, v. MICHAEL DIAMOND, ADAM HOROVITZ and ADAM YAUCH, dba BEASTIE BOYS, a New York Partnership; CAPITOL RECORDS, INC., a Delaware Corporation; GRAND ROYAL RECORDS, INC., a California Corporation; UNIVERSAL POLYGRAM INTERNATIONAL PUBLISHING, INC., a Delaware Corporation; BROOKLYN DUST MUSIC, an entity of unknown origin; MARIO CALDATO, JR., an individual; JANUS FILMS, LLC, a New York Limited Liability Company; CRITERION COLLECTION, a California Partnership; VOYAGER PUBLISHING COMPANY, INC., a Delaware Corporation; SONY MUSIC ENTERTAINMENT, INC., a Delaware Corporation; BMG DIRECT

Case No. CV 00-04909-NM (MANx)
FIRST AMENDED COMPLAINT
(COPYRIGHT INFRINGEMENT - 17 U.S.C. §101 et seq.)
DEMAND FOR JURY TRIAL
[Chart: daily counts, Jan 11 through Apr 9]
[M.I.A. “Galang” from Arular, XL Recordings]
{
“my favorite song”
“i hate this song”
“four black women in rural arkansas” ✓
“from sri lanka” ✓
“about the falklands war” ✓
“romantic and sweet”
“loud and obnoxious”
“sounds like early XTC”
#1 in the country ✓
“reminds me of my ex-girlfriend”
USER
MODEL
RLSC
“Semantic projection”
Perceptual features (Penny) + Community Metadata → Interpretation:
“My favorite song”
“Romantic electronic music”
“Sounds like old XTC”
★★★★★
Contributions
1 Music retrieval problems
2 Meaning
3 Contextual & perceptual analysis
4 Learning the meaning
5 “Semantic Basis Functions”
1 Music retrieval problems
a Christmas
b Semantic / Signal approaches
c Recommendation
◆ Field for the organization and classification of musical data
◆ Score level, audio level, contextual level
◆ Most popular: “genre ID,” playlist generation, segmentation
Music Retrieval
Christmas
Referential: Genre ID, Style ID, Preference, Artist ID
Absolutist: Audio similarity, Structure extraction (verse/chorus/bridge), Energy, Beat / tempo, Query by humming, Transcription, Key finding
Music Retrieval
“Unsemantic” music-IR (what works)
[Jehan 2005]
[Cooper, Foote 2003]
[Figure: spectrogram similarity matrix and time-indexed novelty score for The Beatles’ “The Magical Mystery Tour,” produced by correlating a checkerboard kernel along the main diagonal of the similarity matrix (Cooper & Foote, WASPAA 2003, New Paltz, NY)]
[Casey 2002]
[Goto 2002-04]
[Tzanetakis 2001]
          classic  country  disco  hiphop  jazz  rock
classic      86       2       0      4      18     1
country       1      57       5      1      12    13
disco         0       6      55      4       0     5
hiphop        0      15      28     90       4    18
jazz          7       1       0      0      37    12
rock          6      19      11      0      27    48
Table 2. Genre classification confusion matrix
[Whitman and Smaragdis 2002]
              choral  orchestral  piano  string 4tet
choral           99        10       16        12
orchestral        0        53        2         5
piano             1        20       75         3
string 4tet       0        17        7        80
Table 2. Classical music classification confusion matrix
Genre Futility
[Diagram: a continuum from 100% Signal to 100% Context, running from genre ID toward personalization]
2 Meaning
a Musical meaning
b Grounding
c Our approach
Meaning: relationship between perception and interpretation.
[Musical score excerpt]

Correspondence: connection between representation and content.
Musical “story”: lyrics, discussion (“About the Falklands war”)
Explicit correspondence: instruments, score etc.

Reference: connection between music and other music.
Similar artists, styles, genres (Genres: rock, pop, world; Styles: IDM, math rock)

Significance: aggregated cultural preference, “meaningful.”
Charts, popularity, critical review (“#1 in America”; buzz, trends, influencing)

Reaction: effect of music on the listener (personal significance).
Personal comments, reviews (“funky, loud, romantic,” “reminds me of my father”)
Usage patterns, ratings
[Mueller]
[Miller]
[All Media Guide]
[Duygulu, Barnard, de Freitas, Forsyth 2002] (learned image annotation: sea, sky, sun, waves; cat, grass, tiger; jet plane, sky)
[Slaney 2002]
[Roy, Hsiao, Mavridis, Gorniak 2001-05]

From [Slaney 2002], “2. The Existing Systems”: SAR (semantic-audio retrieval) learns the connection between a semantic space and an acoustic space. Semantic space maps words into a high-dimensional space; acoustic space describes sounds by a multidimensional vector. In general, the connection between these two spaces will be many-to-many. The training data is tracks totaling 330 minutes of audio recordings of animal sounds; the concatenated name of the CD (e.g., “Horses I”) and track description (e.g., “One horse eating hay and moving around”) form a unique probabilistic semantic label for each track, and the audio from the CD track and the liner notes form a pair of acoustic and semantic documents used to train the SAR system. Horse sounds, for example, might include footsteps and neighs. Annotations that describe sounds are clustered within a hierarchical semantic model that uses multinomial models; the sound files, or acoustic documents, that correspond to each node in the semantic hierarchy are modeled with Gaussian mixture models (GMMs). Given a semantic request, SAR identifies the portion of the semantic space that best fits the request, and then measures the likelihood that each sound in the database fits the request.

An effective way to find an image of the space shuttle is to enter the words “space shuttle jpg” into a text-based web search engine. The original Google system did not know about images but, fortunately, many people created web pages with the phrase “space shuttle” that contained a JPEG image of the shuttle. More recently, both Google and AltaVista for images, and Compusonics for audio, have built systems that automate these searches; they allow people to look for images and sound based on nearby words. The SAR work expands those search techniques by considering the acoustic and semantic similarity of sounds.

[Figure 1: SAR models all of semantic space with a hierarchical collection of multinomial models; each portion in the semantic model is linked to equivalent sound documents in acoustic space (e.g. Horse → Trot, Step, Whinny) with a GMM.]
[Figure 2: SAR describes with words an audio query by partitioning the audio space with a set of hierarchical acoustic models and then linking each set of audio files (or documents) to a probability model in semantic space.]
Grounding
[All Media Guide]
[System diagram]
source → packing → meaning extraction → application
Audio → DSP → semantic basis functions → query by description, recommendation, reaction prediction, long distance song effects
Community Metadata (chat, reviews, explicits, charts, usage, edited info, tagging, word net(s)) → NLP, statistics (cluster, RLSC, LSA, SVM, tfidf, HMM)
3 Contextual & perceptual analysis
a “Community Metadata”
b Usage mining
c Community identification
d Perceptual analysis
"angry loud
guitars"
"The best
album of 2004"
WQHT adds
"My exgirlfriend's
favorite song"
Community Metadata
[Whitman, Lawrence 2002]
Pages for “Context”: search for target context (artistname, songname) → parsing, position → POS tagging, NP chunking → webtext contexts → TF-IDF / Gaussian smoothing → terms per type for “Context”
TF-IDF: $s(f_t, f_d) = \frac{f_t}{f_d}$

Gaussian smoothed: $s(f_t, f_d) = f_t \, e^{-\frac{(\log(f_d) - \mu)^2}{2\sigma^2}}$
[Chart: accuracy of TF-IDF vs. Gaussian-smoothed term scoring across unigram, bigram, noun phrase, adjective, and artist terms]
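A minimal sketch of the two scoring rules above, assuming f_t is the term's frequency for the target artist and f_d its document frequency across the corpus; the mu and sigma defaults are illustrative placeholders, not the thesis settings:

```python
import math

def salience_tfidf(f_t: float, f_d: float) -> float:
    # Plain TF-IDF-style salience: term frequency scaled down by document frequency.
    return f_t / f_d

def salience_gaussian(f_t: float, f_d: float, mu: float = 6.0, sigma: float = 0.9) -> float:
    # Gaussian-smoothed salience: terms whose log document frequency lies far
    # from a "sweet spot" mu are damped, so both ubiquitous and vanishingly
    # rare terms score low. mu and sigma here are illustrative, not canonical.
    return f_t * math.exp(-((math.log(f_d) - mu) ** 2) / (2 * sigma ** 2))
```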
$S(a,b) = \frac{C(a,b)}{C(b)} \left(1 - \frac{C(a) - C(b)}{C(c)}\right)$
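A one-function sketch of that similarity score, treating the counts C(.) as precomputed inputs (the role of C(c), presumably a popularity normalizer, is taken on faith from the formula alone):

```python
def artist_similarity(c_ab: float, c_a: float, c_b: float, c_c: float) -> float:
    # S(a, b) = (C(a, b) / C(b)) * (1 - (C(a) - C(b)) / C(c)):
    # co-occurrence normalized by b's count, with the second factor
    # penalizing a popularity mismatch between the two artists.
    return (c_ab / c_b) * (1.0 - (c_a - c_b) / c_c)
```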
Peer-to-peer crawling
[Ellis, Whitman, Berenzweig, Lawrence 2002]
[Berenzweig, Logan, Ellis, Whitman 2003]
Evaluation: top-rank agreement
Survey (self)    54%
Audio (anchor)   20%
Audio (MFCC)     22%
Expert (AMG)     28%
Playlists        27%
Collections      23%
Webtext          19%
Baseline         12%
[Charts: per-artist weights for the terms “funky,” “heavy metal,” “loud,” and “cello” across artists (Aerosmith, ABBA, Madonna, Portishead, ...)]
[Diagram: splitting a term such as “funky” into listener communities funky0, funky1, ..., funky_p, funky_k]
Community Identification
[Figure 3-7: Mel scale - pitch in mels vs. frequency in Hz]
[Diagram: observation matrix of frames (l) by dimensions (d)]
◆ Audio features: not too specific
◆ High expressivity at a low rate
◆ No assumptions other than biological
◆ “The sound of the sound”
%-"
!-$
%
!-#
!
!"
!-"
!-&
!#
!
!-$
!!-"
!$
!-#
!!-#
!-"
!
!!-"
!&
!!-$
!%!
!!-&
!
"!
#!
'()*+,
$!
!%
!
"!
#!
'()*+,
$!
!%"
!
"!
#!
'()*+,
Figure 3-8: Penny V1, 2 and 3 for the first 60 seconds of “Shipbuilding.”
Audio representation
Modulation cepstra [Ellis / Whitman 2004]: the FFT of the MFCC, mixed to 6 ‘channels.’
To compute modulation cepstra we start with MFCCs at a cepstral frame rate (of between 5 Hz and 100 Hz), returning a vector of 13 bins per audio frame. We then stack successive time samples for each MFCC bin into 64-point vectors and take a second Fourier transform on these per-dimension temporal energy envelopes. We aggregate the resulting modulation spectra into six frequency channels.
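A rough numpy sketch of that pipeline, assuming MFCCs arrive as a (13, n_frames) array; the 64-point window and six channels follow the text and the figure below, while the half-window overlap and the exact bin grouping (tuned for a 100 Hz cepstral rate) are assumptions:

```python
import numpy as np

def modulation_cepstra(mfcc: np.ndarray, n_fft: int = 64) -> np.ndarray:
    """Second Fourier transform over each MFCC bin's temporal envelope,
    aggregated into six modulation channels (0-1.5, 1.5-3, 3-6, 6-12,
    12-25, 25-50 Hz at a 100 Hz cepstral frame rate)."""
    n_bins, n_frames = mfcc.shape
    hop = n_fft // 2                    # overlapping windows, per the text
    edges = [0, 1, 2, 4, 8, 16, 33]     # assumed grouping of the 33 rfft bins
    frames = []
    for start in range(0, n_frames - n_fft + 1, hop):
        block = mfcc[:, start:start + n_fft]       # (13, 64) temporal envelopes
        spec = np.abs(np.fft.rfft(block, axis=1))  # (13, 33) modulation spectra
        chans = [spec[:, lo:hi].mean(axis=1) for lo, hi in zip(edges[:-1], edges[1:])]
        frames.append(np.concatenate(chans))       # 13 bins x 6 channels = 78 dims
    return np.array(frames)
```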
[Figure: Penny feature extraction - the MFCC panel is mixed via a second FFT into six modulation channels: 0-1.5 Hz, 1.5-3 Hz, 3-6 Hz, 6-12 Hz, 12-25 Hz, 25-50 Hz]
[Figure 3-10: Evaluation of five features (MFCC 5Hz, MFCC 20Hz, Penny 20Hz, Penny 5Hz, PSD 5Hz) in a 1-in-20 artist ID task]
Penny is still a valuable feature for us: low data rate and time representation. Because of the overlap in the Fourier analysis of the cepstral frames, the Penny data rate is a fraction of the cepstral rate. In the usual implementation (Penny with a cepstral frame rate of 5 Hz, 300 MFCC frames per minute) we end up with 45 Penny frames per minute of audio. Even if MFCCs outperform at equal cepstral analysis rates, Penny needs far less actual data.
{Featurefight}
4 Learning the meaning
a SVM / Regularized least-squares classification
b A note on computation
c Evaluations
SVM / Kernel methods
The power of the SVM lies in the representer theorem, where a high-dimensional $x$ can be represented fully by a generalized dot product (in a Reproducing Kernel Hilbert Space [7]) between $x_i$ and $x_j$ using a kernel function $K(x_i, x_j)$. The binary classification problem shown in figure 5-1 could be classified by a hyperplane learned by an SVM; however, non-linearly separable data need to consider a new topology, and we can substitute in a kernel function that represents data as

$$K_f(x_1, x_2) = e^{-\frac{|x_1 - x_2|^2}{\sigma^2}}$$

where $\sigma$ is a tunable parameter. Kernel functions can be viewed as a ‘distance function’ among all the high-dimensionality points in your input feature space.
[Figure 5-1: a binary classification problem (rock, dance, rap) and its kernel-space view]
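A small numpy sketch of the Gram matrix for this kernel, with sigma = 0.5 as used throughout the RLSC experiments:

```python
import numpy as np

def gaussian_gram(X: np.ndarray, sigma: float = 0.5) -> np.ndarray:
    # K_f(x1, x2) = exp(-|x1 - x2|^2 / sigma^2) over all pairs of rows of X.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / sigma ** 2)  # clamp tiny negatives
```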
[Diagram: the learning setup - Perception (audio), l frames by d dimensions, paired against Reaction (community metadata), c classes of terms such as “funky,” “is so loud,” “this song sucks,” “guitar,” “falklands war,” with ✓ marks for positive frame-term associations]
◆ Most machine learning classifiers have compute time linear in c
◆ For higher accuracy, DAG or 1 vs. 1 classifiers required - c(c-1) classifiers!
◆ We need to scale to over 30,000 c
◆ Bias, incorrect ground truth, unimportant truth
Multiclass
RLSC
◆ Substitute square loss for hinge loss in the SVM problem
◆ Easily graspable linear algebra - solution is linear
◆ “An SVM where the experimenter defines the support vectors”
◆ New classes can be added after training and each is a simple matrix multiplication!

Training. We usually use the Gaussian kernel

$$K_f(x_1, x_2) = e^{-\frac{|x_1 - x_2|^2}{\sigma^2}} \quad (1)$$

where $|x_1 - x_2|$ is the conventional Euclidean distance between two points and $\sigma$ is a parameter we keep at 0.5. Training an RLSC system consists of solving the system of linear equations

$$\left(K + \frac{I}{C}\right)c = y, \quad (2)$$

where $K$ is the kernel (gram) matrix, $I$ is the identity matrix, and $C$ is a user-supplied regularization constant we keep at 10. The resulting real-valued classification function $f$ is

$$f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i). \quad (3)$$

The crucial property of RLSC is that if we store the inverse matrix $(K + \frac{I}{C})^{-1}$, then for a new right-hand side $y$ (i.e. a new term whose values we are trying to predict), we can compute the new classifier $c$ via a simple matrix multiplication.

Evaluation. Given a set of ‘grounded’ single terms, $f(x) \sim P(\mathrm{term}_i \mid \mathrm{audio}_x)$. To evaluate the model in the “Query-by-Description” task, we compute the testing gram matrix and check each learned $c$ against each audio frame in the test set; RLSC is very well-suited to problems with a set of training observations and a large space of classes, many of which may be biased, incorrect, or unimportant.

Linguistic experts for parameter discovery. Certain knowledge allows us to compute classifiers from sensory input or intrinsic knowledge using a ‘linguistic expert’: if we hear ‘quiet’ audio, knowing which terms are antonymially related gives us a method for uncovering parameter terms and learning the knobs to vary them.
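A sketch of the train-once, add-classes-cheaply property from equations (1)-(3); here K is the training Gram matrix and Y stacks one +1/-1 target column per term:

```python
import numpy as np

def rlsc_fit(K: np.ndarray, Y: np.ndarray, C: float = 10.0) -> np.ndarray:
    # Solve (K + I/C) c = y for all classes at once by storing the inverse;
    # each additional term classifier is then one matrix multiplication.
    l = K.shape[0]
    K_inv = np.linalg.inv(K + np.eye(l) / C)
    return K_inv @ Y  # (l, c) coefficients, one column per term

def rlsc_predict(K_test: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    # f(x) = sum_i c_i K(x, x_i); rows of K_test are kernel values between
    # test frames and the training frames (the "testing gram matrix").
    return K_test @ coeffs
```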
[Charts: compute time vs. number of classes (n = 1000) and memory/disk allocation vs. number of observations, for RLSC and SVM]
Optimizations
◆ Creating the kernel K: the Gaussian kernel can be computed in half the operations, since the kernel matrix K (which by definition is symmetric, and which the regularization term makes fully positive definite) only requires the lower triangle of the matrix to be stored; the computation is easily parallelizable or vectorized.
◆ Solving the system of equations: via Cholesky, $K^{-1} = (LL^T)^{-1}$, where $L$ is derived from the Cholesky decomposition; $K + \frac{I}{C}$ is always symmetric positive definite because of the regularization term, and there are algorithms for both computing the Cholesky decomposition in place and the inverse of the Cholesky factorization. Iterative methods: conjugate gradient, pseudoinverse, etc.
◆ On a single 4GB machine, l < 40,000 ( ((l*(l+1))/2)*4 bytes = 3.2GB ); accuracy of the classifier increases as l goes up.
◆ Random subsampling on obs. space over each node.
In our implementations, we use single-precision LAPACK, computing the inverse from a packed lower triangular matrix (SPPTRI).

Query by description
[Figure: learned responses for “quiet,” “loud,” “funky,” and “lonesome” across 0-5000 Hz]
[Whitman and Rifkin 2002]
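Returning to the Optimizations notes above, a sketch of the Cholesky route for inverting (K + I/C), which is symmetric positive definite by construction; LAPACK's packed routines (SPPTRF/SPPTRI, the latter named on the slide) do the same while storing only the lower triangle:

```python
import numpy as np

def rlsc_inverse(K: np.ndarray, C: float = 10.0) -> np.ndarray:
    # A = K + I/C is symmetric positive definite, so factor A = L L^T and
    # invert via the triangular factor rather than a general-purpose solve.
    A = K + np.eye(K.shape[0]) / C
    L = np.linalg.cholesky(A)
    L_inv = np.linalg.inv(L)      # a triangular inverse (LAPACK's trtri)
    return L_inv.T @ L_inv        # A^{-1} = L^{-T} L^{-1}
```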
Good terms          Bad terms
Electronic   33%    Annoying      0%
Digital      29%    Dangerous     0%
Gloomy       29%    Fictional     0%
Unplugged    30%    Magnetic      0%
Acoustic     23%    Pretentious   1%
Dark         17%    Gator         0%
Female       32%    Breaky        0%
Romantic     23%    Sexy          1%
Vocal        18%    Wicked        0%
Happy        13%    Lyrical       0%
Classical    27%    Worldwide     2%
Baseline = 0.14%
◆ Collect all terms through CM as ground truth against the corresponding artist feature space -- at the artist (broad) level!
◆ Evaluation: on a held-out test set of audio (with known labels), how well does each classifier predict its label?
◆ In the evaluation model, bias is countered: accuracy of positive association times accuracy of negative association = “P(a) overall accuracy”
[Whitman and Ellis 2004]
If $P(a_p)$ is the overall positive accuracy (i.e. given an audio frame, the probability that a positive association to a term is predicted) and $P(a_n)$ indicates overall negative accuracy, $P(a)$ is defined as $P(a_p)P(a_n)$. This measure gives us a tangible feeling for how our models are working against the held-out test set and is useful for grounded term prediction and the review trimming below. However, to rigorously evaluate our term model's performance in a review generation task, we note that this value has an undesirable dependence on the prior probability of each term label and rewards classifiers with a very high natural df, often by chance. Instead, for this task we use a model of relative entropy, using the Kullback-Leibler (K-L) distance to a random-guess probability distribution. We use the K-L distance in a two-class problem described by the four trial counts in a confusion matrix:

             “funky”   “not funky”
funky           a           b
not funky       c           d

$$KL = \frac{a}{N}\log\frac{Na}{(a+b)(a+c)} + \frac{b}{N}\log\frac{Nb}{(a+b)(b+d)} + \frac{c}{N}\log\frac{Nc}{(a+c)(c+d)} + \frac{d}{N}\log\frac{Nd}{(b+d)(c+d)} \quad (3)$$

Table 2. Selected top-performing models of adjective and noun phrase terms used to predict new reviews of music, with their corresponding bits of information from the K-L distance measure.

adj Term     K-L bits    np Term                 K-L bits
aggressive   0.0034      reverb                  0.0064
softer       0.0030      the noise               0.0051
synthetic    0.0029      new wave                0.0039
punk         0.0024      elvis costello          0.0036
sleepy       0.0022      the mud                 0.0032
funky        0.0020      his guitar              0.0029
noisy        0.0020      guitar bass and drums   0.0027
angular      0.0016      instrumentals           0.0021
acoustic     0.0015      melancholy              0.0020
romantic     0.0014      three chords            0.0019
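Both evaluation measures follow mechanically from the four confusion counts; a sketch, assuming a and b are the "funky" row and c and d the "not funky" row, as in the table:

```python
import math

def p_a(a: int, b: int, c: int, d: int) -> float:
    # P(a) = P(a_p) * P(a_n): positive accuracy times negative accuracy.
    return (a / (a + b)) * (d / (c + d))

def kl_bits(a: int, b: int, c: int, d: int) -> float:
    # Equation (3): K-L distance of the confusion matrix from random guessing.
    n = a + b + c + d
    total = 0.0
    for count, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                            (c, a + c, c + d), (d, b + d, c + d)):
        if count:
            total += (count / n) * math.log2((n * count) / (row * col))
    return total
```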
5 Semantic Basis Functions
a Anchor models
b Music intelligence evaluation
c Media intelligence
We use the singular value decomposition (SVD) [33] to compute the eigenvectors and eigenvalues ($w$ is an eigenvector of a matrix $A$ with eigenvalue $\lambda$ if $Aw = \lambda w$; the $\lambda$ are eigenvalues if and only if $\det(A - \lambda I) = 0$):

$$A = U \Sigma V^T \quad (6.3)$$

Here, if $A$ is of size $m \times n$, $U$ is the left singular matrix composed of the singular vectors, of size $m \times n$; $V$ is the right singular matrix, of size $n \times n$; and $\Sigma$ is a diagonal matrix of the singular values $\sigma_k$. The highest singular value will be in the upper left of the diagonal matrix $\Sigma$, in descending order from the top-left. For the covariance matrix input $AA^T$, $U$ and $V^T$ will be equivalent for the non-zero eigenvalued vectors. To reduce the rank of the observation matrix $A$ we simply choose the top $r$ vectors of $U$ and the top $r$ singular values in $\Sigma$.

To compute a weight matrix $w$ from the decomposition we multiply our (cropped) eigenvectors by a scaled version of our (cropped) singular values: [74]

$$w = \sqrt{\Sigma^{-1}}\, U^T \quad (6.4)$$

This $w$ will now be of size $r \times m$. To project your original data (or new data) through the weight matrix you simply multiply $w$ by $A$, resulting in a whitened and rank-reduced matrix $f$ of size $r \times n$. To ‘resynthesize’ rank-reduced matrices projected through $w$ you first compute $w^{-1}$ and multiply this new $w^{-1}$ by $f$.

The intuition behind PCA is to reduce the dimensionality of an observation set; by ordering the eigenvectors and keeping only the top $r$ needed to regenerate the matrix, the experimenter can choose the rate of lossy compression. The compression is achieved through analysis of the correlated dimensions so that dimensions that move in the same direction are minimized. Geometrically, the SVD (and, by extension, PCA) is explained as the top $r$ best rotations of your input data space so that the variance between the dimensions is maximized.

6.2.5 NMF

Non-negative matrix factorization (NMF) [44] is a decomposition that enforces a positivity constraint on the bases. Given a positive input matrix $V$ of size $m \times n$, it is factorized into two matrices $W$ of size $m \times r$ and $H$ of size $r \times n$, where $r \leq m$, and the error of $WH \approx V$ is minimized. The divergence measure is decreased by the following two update rules: [Lee, Seung 1999]

$$H = H \times \frac{W^T \cdot \frac{V}{W \cdot H}}{W^T \cdot 1}, \qquad W = W \times \frac{\frac{V}{W \cdot H} \cdot H^T}{1 \cdot H^T}$$

where $\times$ and the fraction bars are per-element multiplies and divides.

[Diagram: rank reduction vs. statistical reduction - Original, VQ, PCA, NMF]
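Minimal numpy sketches of both reductions as written: the SVD whitening of equations (6.3)/(6.4), and the Lee-Seung divergence updates (the iteration count and initialization here are arbitrary):

```python
import numpy as np

def svd_whiten(A: np.ndarray, r: int):
    # A = U S V^T; crop to the top r components and build w = sqrt(S^-1) U^T,
    # so f = w @ A is the whitened, rank-reduced projection of size (r, n).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    w = np.diag(1.0 / np.sqrt(s[:r])) @ U[:, :r].T
    return w, w @ A

def nmf(V: np.ndarray, r: int, iters: int = 200, seed: int = 0):
    # Multiplicative updates for the divergence objective [Lee, Seung 1999];
    # all multiplies and divides inside the updates are element-wise.
    m, n = V.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, r)) + 1e-3
    H = rng.random((r, n)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]
        W *= ((V / (W @ H)) @ H.T) / H.sum(axis=1)[None, :]
    return W, H
```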
[Heisele, Serre, Pontil, Vetter, Poggio 2001]
... the visual cortex [8]. We also provide the combination classifier with the precise positions of the detected components relative to the upper left corner of the 58x58 window. Overall we have three values per component classifier that are propagated to the combination classifier: the maximum output of the component classifier and the x-y image coordinates of the maximum.
[Figure 2: System overview of the component-based classifier. A 58x58 window is shifted over the input image; component experts (left eye, nose, mouth, ..., each a linear SVM) are shifted over the window, and for each component k its maximum output and location (O_k, X_k, Y_k) within a search region feed a combination classifier (linear SVM) for the final face/background decision.]
[Berenzweig, Ellis, Lawrence 2002]
[Apple 2003]
[Whitman 2003] (2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics)
Semantic Rank Reduction
PCA / NMF / Semantic
[Figure 1: Comparison of the top five bases for each type of decomposition, trained from a set of five-second power spectral density frames. The PCA weights aim to maximize variance, the NMF weights try to find separable additive parts, and the semantic weights map the best possible labels to the generalized observations.]
[Figure 3: Confusion matrices for the four experiments. Top: no dimensionality reduction, and PCA with r = 10. Bottom: NMF with r = 10, and semantic rank reduction with r = 10. Lighter points indicate that the examples from artists on the x-axis were thought to be by artists on the y-axis.]
Rank reduction helps training across the board, with perhaps the NMF hurting the accuracy versus not having a reduced-rank representation at all. For the test case, results vary widely: PCA shows a slight edge over no reduction in the per-observation metric while NMF appears to hurt.
Basis extraction set → community metadata
“What the community hears” / “What the community thinks” / “What are the most important things to a community”
sorted class P(a) outputs: Electronic 33%, Digital 29%, Gloomy 29%, Unplugged 30%, Acoustic 23%, Dark 17%, Female 32%, Romantic 23%, Vocal 18%, Happy 13%, Classical 27%
Semantic Basis Functions
Good terms (sorted class P(a) outputs): Electronic 33%, Digital 29%, Gloomy 29%, Unplugged 30%, Acoustic 23%, Dark 17%, Female 32%, Romantic 23%, Vocal 18%, Happy 13%, Classical 27%
The experimenter chooses r. New audio is represented as the predicted community reaction to the signal:
“Electronic” 0.45
“Digital” 0.21
“Gloomy” -0.12
“Unplugged” -0.45
“Acoustic” 0.84
Semantic Basis Functions
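A sketch of the projection itself, assuming term classifiers were trained as in the earlier RLSC sketch (columns of coeffs ordered by their P(a) scores); all names and shapes here are illustrative:

```python
import numpy as np

def semantic_basis(K_test: np.ndarray, coeffs: np.ndarray,
                   terms_by_pa: list, r: int):
    # Keep only the r best-grounded term classifiers and represent each new
    # audio frame as its predicted community reaction to the signal.
    outputs = K_test @ coeffs[:, :r]              # (n_frames, r) activations
    return [dict(zip(terms_by_pa[:r], row)) for row in outputs]

# A frame might come back as:
# {"Electronic": 0.45, "Digital": 0.21, "Gloomy": -0.12, "Unplugged": -0.45, ...}
```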
Artist ID: rare true ground truth in music-IR!
Test: of a set of c artists / classes, with training data for each, how
many of a set of n songs can be placed in the right class in testing?
◆ Album effect
- Learning producers instead of musical content
◆ Time-aware
- “Madonna” problem
◆ Data density / overfitting
- Sensitive to rate of feature, amount of data per class
◆ Features or learning?
Evaluation
Artist ID accuracy, 1-in-20, obs = 8000:
Semantic rank reduction   67.1%
PCA                       24.6%
No rank reduction         22.2%
NMF                       19.5%
Baseline (random)          3.9%
[Figure 7-3: Top terms for community metadata vectors associated with an image.
adj Term: religious 1.4, human 0.36, simple 0.21, beautiful 0.13, free 0.10, small 0.33.
np Term: austrailia exhibit 0.003, light and shadow 0.003, this incredibly beautiful country 0.002, sunsets 0.002, god’s creations 0.002, the southeast portion 0.002.]

... informed by the probabilities p(i) of each symbol i in X. More ‘surprising’ symbols in a message need more bits to encode, as they are less often seen. This equation commonly gives an upper bound for compression ratios and is often studied from an artistic standpoint. [54] In this model the signal contains all the information: its significance is defined by its self-similarity and redundancy, a very absolutist view. However, we intend instead to consider the meaning of those bits, and by working with other domains, different packing schemes, and methods for synthesizing new data from these significantly semantically-attached representations we hope to bring meaning back into the notion of information.

7.2.1 Images and Video

High Term  Type  Accuracy     Low Term     Type  Accuracy
sea        np    20%          antiquarian  adj   0%
pure       adj   18.7%        boston       np    0%
pacific    np    17.1%        library      np    0%
cloudy     adj   17.1%        analytical   adj   0%
air        np    17.1%        disclaimer   np    0%
colorful   adj   11.1%        generation   np    0%
“My mother loves this album.” We look to the success of our grounded term models for insights into the musicality of reviews: we can find reviews without description and develop a ‘review trimming’ system that summarizes reviews and retains only the most descriptive content. The trimmed reviews can then be fed into further textual understanding systems or read directly by the listener. To trim a review we create a grounding sum term operated on a sentence s of word length n,

$$g(s) = \frac{\sum_{i=0}^{n} P(a_i)}{n} \quad (4)$$

where a perfectly grounded sentence (in which the predictive quality of each term on new music has 100% precision) scores 100%. This upper bound is virtually impossible in a grammatically correct sentence, and we usually see g(s) of {0.1% .. 10%}. The user sets a threshold and the system simply removes sentences under the threshold.

Table 3. Selected sentences and their g(s) in a review trimming experiment. From Pitchfork’s review of Air’s “10,000 Hz Legend.”

g(s)     Sentence
3.170%   The drums that kick in midway are also decidedly more similar to Air’s previous work.
2.257%   But at first, it’s all Beck: a harmonica solo, folky acoustic strumming, Beck’s distinctive, marble-mouthed vocals, and tolls ringing in the background.
2.186%   But with lines such as, “We need to use envelope filters/ To say how we feel,” the track is also an oddly beautiful lament.
1.361%   The beat, meanwhile, is cut from the exact same mold as The Virgin Suicides, from the dark, ambling pace all the way down to the angelic voices coalescing in the background.
0.584%   After listing off his feelings, the male computerized voice receives an abrupt retort from a female computerized voice: “Well, I really think you should quit smoking.”
0.449%   I wouldn’t say she was a lost cause, but my girlfriend needed The doctor like I needed, well, a girlfriend.
0.304%   She’s taken to the Pixies, and I’ve taken to, um, lots of sex.
0.298%   Needless to say, we became well acquainted with the album, which both of us were already fond of to begin with.

We used two separate evaluation techniques to measure the strength of our term predictions. One metric is classifier performance, $P(a) = P(a_p)P(a_n)$, as above. We also established that a random association of the two datasets gives a correlation coefficient of magnitude smaller than r = 0.080 with 95% confidence; thus, these results indicate a very significant correlation between the automatic and ground-truth ratings. The Pitchfork model did not fare as well, with r = 0.127 (baseline of r = 0.082 with 95% confidence). Figure 1 shows the scatter plot/histograms for each experiment; we see that the audio predictions are mainly bunched around the mean of the ground-truth ratings and have a much smaller variance. Visually, it is hard to judge how well the review information has been captured. However, the correlation values demonstrate that the automatic analysis is indeed finding and exploiting informative features.

[Chart: % of review kept vs. g(s) threshold (2.0 down to 0.2), for Pitchfork and AMG reviews]
[June of 44, “Four Great Points”] June of 44's fourth full-length is their most experimental effort to date -- fractured melodies and dub-like rhythms collide in a noisy atmosphere rich in detail, adorned with violins, trumpet, severe phasing effects, and even a typewriter. - Jason Ankeny
4.15

[Arovane, “Tides”] The homeless lady who sits outside the yuppie coffee bar on the corner of my street assures passers-by that the end is coming. I think she’s desperate to convey her message. Though the United States is saber-rattling with the People’s Republic of China, it seems that everyone has overcome their millennial tension, and the eve of destruction has turned to a morning of devil-may-care optimism. Collectively, we’re overjoyed that, without much effort or awareness, we kicked the Beast’s ass. The Beast, as prophesied by some locust-muncher out in the Negev Desert thousands of years ago, was supposed to arrive last year and annihilate us before being mightily smote by our Lord and Savior Jesus Christ. I missed this. Living as I do in America’s capital, the seat of iniquity and corruption, I should have had ring-side seats to the most righteous beatdown of all time. I even missed witnessing the Rapture, the faithful’s assumption to the right hand of God that was supposed to occur just before Satan’s saurian shredded all of creation.... [it goes on like this for a while] - Paul Cooper
0.862
Perceptual Text Analysis
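A sketch of the g(s)-based trimming described above, under obvious simplifications (whitespace tokenization; term_pa maps known grounded terms to their P(a), with unknown words contributing zero):

```python
def grounding_score(sentence: str, term_pa: dict) -> float:
    # g(s) = (sum_i P(a_i)) / n over the n words of the sentence, per eq. (4).
    words = sentence.lower().split()
    return sum(term_pa.get(w, 0.0) for w in words) / max(len(words), 1)

def trim_review(sentences: list, term_pa: dict, threshold: float) -> list:
    # The user sets a threshold; sentences grounded below it are removed.
    return [s for s in sentences if grounding_score(s, term_pa) >= threshold]
```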
◆ “Human” meaning vs. “Computer” meaning: the junior problem
◆ Target scale
- Artist vs. album vs. song
◆ Better audio representation
◆ Other multimedia domains
◆ Human evaluation:
- Community modeling
- Query by description
- Similarity / recommendation
Problems and Future
Thanks:
Barry Vercoe & the MMM group; esp. Youngmoo, Paris, Keith, Michael Casey,
Judy, Mike Mandel, Wei, Victor, John, Nyssim, Rebecca, Kristie, Tamara Hearn.
Dan Ellis & Columbia; Adam Berenzweig, Ani Nenkova, Noemie Elhadad.
Deb Roy.
Ben Recht, Ryan Rifkin, Jason, Mary, Tristan, Rob A, Hugo S, Ryan McKinley,
Aggelos, Gemma & Ayah & Tad & Limor, Hyun, Cameron, Peter G. Dan P., Chris
C, Dan A, Andy L., Barbara, Push, Beth Logan. ex-NECI: Steve Lawrence, Gary
Flake, Lee Giles, David Waltz.
Kelly Dobson, Noah Vawter, Ethan Bordeaux, Scott Katz, Tania & Ruth,
Lauren Kroiz. Drew Daniel, Kurt Ralske, Lukasz L., Douglas Repetto.
Bruce Whitman, Craig John and Keith Fullerton Whitman. Stanley and Albert
(mules), Wilbur (cat), Sara Whitman and Robyn Belair.
Sofie Lexington Whitman:
Selected Publications:
Whitman, Brian and Daniel P.W. Ellis. “Automatic Record Reviews.” In Proceedings of ISMIR 2004 - 5th International Conference on Music Information Retrieval. 10-14 October 2004, Barcelona, Spain.
Berenzweig, Adam, Beth Logan, Daniel Ellis and Brian Whitman. “A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures.” Computer Music Journal, Summer 2004, 28(2), pp. 63-76.
Whitman, Brian. “Semantic Rank Reduction of Music Audio.” In Proceedings of the 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). 19-22 October 2003, New Paltz, NY. pp. 135-138.
Whitman, Brian, Deb Roy and Barry Vercoe. “Learning Word Meanings and Descriptive Parameter Spaces from Music.” In Proceedings of the HLT-NAACL03 Workshop on Learning Word Meaning from Non-Linguistic Data. 26-31 May 2003, Edmonton, Alberta, Canada.
Whitman, Brian and Ryan Rifkin. “Musical Query-by-Description as a Multiclass Learning Problem.” In Proceedings of the IEEE Multimedia Signal Processing Conference. 8-11 December 2002, St. Thomas, USA.
Ellis, Daniel, Brian Whitman, Adam Berenzweig and Steve Lawrence. “The Quest for Ground Truth in Musical Artist Similarity.” In Proceedings of the 3rd International Conference on Music Information Retrieval. 13-17 October 2002, Paris, France.
Whitman, Brian and Paris Smaragdis. “Combining Musical and Cultural Features for Intelligent Style Detection.” In Proceedings of the 3rd International Conference on Music Information Retrieval. 13-17 October 2002, Paris, France.
Whitman, Brian and Steve Lawrence. “Inferring Descriptions and Similarity for Music from Community Metadata.” In “Voices of Nature,” Proceedings of the 2002 International Computer Music Conference, pp. 591-598. 16-21 September 2002, Göteborg, Sweden.
Questions?