Slides

A Bayesian Framework for Estimating Properties of Network Diffusions
Varun Embar (1), Rama Kumar Pasumarthi (2), Indrajit Bhattacharya (1)
(1) IBM Research India
(2) American Express (work done when at IBM)
IISc MLSIG Lunch Talk 2015
(based on SIGKDD 2014 paper)
Embar, Pasumarthi, Bhattacharya
Network Diffusion Properties
IISc MLSIG 2015
1 / 42
Quick Summary
Network Diffusion
Example 1 : Spread of Ebola in West Africa
Example 2 : Spread of hashtags among Twitter users
Entities of interest
- Network: Captures connections between people / Twitter users
- Diffusion Process: Stochastic mechanism of 'infection' spread
- Diffusion Cascades: Time-stamped infection paths over network
Long studied in epidemiology, sociology, econometrics, marketing
Recent interest in computer science
Problems in Network Diffusion Analysis
Evaluating Properties of Observed Network and Cascades
- Centrality and reach of individual nodes
- Viral marketing seeds (Kempe KDD03, Goyal CIKM08, PVLDB11)
- Community structures (Mehmood ECML13, Barbieri ICDM13)
- Likelier diffusion mechanism (Milling SIGMETRICS12)
Inferring Network from Partially-observed Cascades
- Estimate network connections and strengths given cascades
- Maximum likelihood estimation
- Saito (AML09), Gomez-Rodriguez (KDD10, ICML11,13, WSDM13), Du (NIPS12), Netrapalli (SIGMETRICS12), Wang (ECML12), Kutzkov (KDD13), Daneshmand (ICML14)
Example Property: Leaders of tribes (LoT)
Leader of Tribes (Goyal CIKM08)
- Network property: High-weight paths to large 'tribe' of nodes
- Cascade property: Frequent transmissions over these paths
Not tractable even given complete observations
Weak LoT
- Network property: High-weight edges to tribe nodes
- Cascade property: Frequent transmissions over these edges
Easy to compute given complete observations
Still interesting for marketing, epidemiology
Weak LoT Distribution in Random Graphs
Varies with network structure
Evaluating Joint Properties: Challenges
Network and Diffusion Cascade
- $\alpha_{uv} \in \mathbb{R}^+$: connection strength between nodes $u$, $v$
- Cascade of infections: infected node $u_i$, parent node $z_i$, time $t_i$
Diffusion Process: Independent Cascade Model
- Infected node proposes infection time for uninfected neighbors
- Uninfected node catches infection with earliest proposed time
- Multiple infections of same node (Splitting model) (Wang ECML12)
Hidden variables
- Network: Connection strengths and sometimes edges unobserved
- Cascades: Infection sources $z_i$ unobserved
Network Diffusion Properties and a Possible Approach
Network Diffusion Property
Function f (α, z) defined on network α and cascade z
Frequentist Plug-in
- Point estimate of the network, e.g. MLE
- Point estimate of cascade given network estimate
- Evaluate property using point estimates
Disadvantages
- MLE overfits for infrequent edges
- Property not always one-to-one: the most likely value of the property need not correspond to the most likely network and cascade
Example: Weak LoT Reconstruction
Does not work very well!
A Bayesian Solution
Evaluate expectation of property under posterior distribution $p(z, \alpha \mid \{c^o\})$ given partially observed cascades
$$\bar{f}(z, \alpha) = \mathbb{E}_{p(z, \alpha \mid \{c^o\})}[f(z, \alpha)]$$
Example: Weak LoT Reconstruction
Many-to-one function
Bayesian approach recovers the signature shape of the distribution, by
considering less likely networks
Example: Network Inference
One-to-one function
Error           Bayesian    Frequentist
CorePeriphery   0.116       2.553
Hierarchical    0.884       3.210
Random          0.147       17.483
ForestFire      0.329       736.821
Bayesian approach significantly reduces error by avoiding over-fitting
Computational Tractability
Two marginalization operations
- Integration over network strengths
- Summation over cascade paths
Characterization of properties (for Independent Cascade Model)
- Not nice at all: Neither can be done efficiently
- Partially-nice: One of the marginalizations can be done efficiently
- Totally-nice: Joint marginalization can be done efficiently
Interesting properties in all of these classes
Monte Carlo approximation framework when not totally nice
Map-reduce framework for large scale network diffusion analysis
“the rest are details” ...
Problem Definition
Network and Diffusion Cascades
Network
- $G = (V, E)$ with nodes $V$ and edges $E$
- $\alpha_{uv} \in \mathbb{R}^+$: connection strength between nodes $u$, $v$
Diffusion Cascades
- Collection of cascades: $C = \{c\}$
- Cascade $c$: set of infections $(u_i, z_i, t_i)$
- Infected node $u_i$, parent node $z_i$, time $t_i$
- All cascades observed until time $T$
Diffusion Process: Continuous-time Independent Cascade Model
Generative Process
- Seed nodes infected initially
- Infected node proposes infection time for uninfected neighbors
- Uninfected node catches infection with earliest proposed time
- Multiple node infections allowed (Splitting model) (Wang ECML12)
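The generative process above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes exponential delays, allows only one infection per node (the splitting model is omitted), and uses a hypothetical `simulate_cascade` helper with the network given as a dict of edge strengths.

```python
import heapq
import itertools
import random


def simulate_cascade(alpha, seeds, horizon, rng):
    """Simulate one cascade with exponential delays (no re-infection).

    alpha:   dict mapping (u, v) -> connection strength alpha_uv > 0
    seeds:   nodes infected at time 0
    horizon: observation window T
    Returns a list of infections (node, parent, time), ordered by time.
    """
    neighbors = {}
    for (u, v), strength in alpha.items():
        neighbors.setdefault(u, []).append((v, strength))

    tie = itertools.count()  # tie-breaker so heap entries never compare nodes
    queue = [(0.0, next(tie), s, None) for s in seeds]
    heapq.heapify(queue)
    infected = set()
    cascade = []
    while queue:
        t, _, node, parent = heapq.heappop(queue)
        if t > horizon:
            break  # all remaining proposals fall outside the window
        if node in infected:
            continue  # node already caught an earlier proposal
        infected.add(node)
        cascade.append((node, parent, t))
        # The newly infected node proposes an infection time to each neighbor.
        for v, strength in neighbors.get(node, []):
            if v not in infected:
                proposal = t + rng.expovariate(strength)
                heapq.heappush(queue, (proposal, next(tie), v, node))
    return cascade


example_alpha = {("a", "b"): 2.0, ("b", "c"): 1.0, ("a", "c"): 0.5}
example_cascade = simulate_cascade(example_alpha, ["a"], 1000.0, random.Random(0))
```

The priority queue makes "earliest proposed time wins" explicit: a node is infected by whichever proposal is popped first, and later proposals for the same node are discarded.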
Likelihood
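- Delay pdf $f(t_i \mid u_i, u_j, t_j; \alpha_{u_j u_i})$
$$p(c \mid \alpha) = \prod_i H(t_i \mid t_{z_i}; \alpha_{u_{z_i} u_i}) \prod_{j \in \pi_i} S(t_i \mid t_j; \alpha_{u_j u_i}) \prod_{v : l_v < t_i} S(T \mid t_i; \alpha_{u_i v})$$
- $F(t)$: CDF, $H(t)$: Hazard, $S(t)$: Survival function for $f(t)$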
Modeling the Delay Distribution
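Exponential: Chances of infection die off very quickly
$$f(t_i \mid t_j; \alpha) = \alpha e^{-\alpha (t_i - t_j)}$$
$$H(t_i \mid t_j) = \alpha; \quad S(t_i \mid t_j) = e^{-\alpha (t_i - t_j)}$$
Rayleigh: infection chances rise to peak and then die off quickly
$$f(t_i \mid t_j; \alpha) = \alpha (t_i - t_j) e^{-\frac{1}{2} \alpha (t_i - t_j)^2}$$
$$H(t_i \mid t_j) = \alpha (t_i - t_j); \quad S(t_i \mid t_j) = e^{-\frac{1}{2} \alpha (t_i - t_j)^2}$$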
Power law: heavier tails
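As a small illustration of the exponential and Rayleigh cases, the sketch below evaluates pdf, hazard and survival for a delay `dt = t_i - t_j`. The function names are hypothetical; the only relation used beyond the slide formulas is the standard identity pdf = hazard × survival.

```python
import math


def exponential_delay(alpha, dt):
    """Return (pdf, hazard, survival) for an exponential delay dt = t_i - t_j."""
    survival = math.exp(-alpha * dt)
    hazard = alpha  # constant hazard
    return hazard * survival, hazard, survival


def rayleigh_delay(alpha, dt):
    """Return (pdf, hazard, survival) for a Rayleigh delay dt = t_i - t_j."""
    survival = math.exp(-0.5 * alpha * dt ** 2)
    hazard = alpha * dt  # hazard grows linearly with the delay
    return hazard * survival, hazard, survival
```

Working with hazard and survival rather than the pdf is convenient because the cascade likelihood is written directly in terms of $H$ and $S$.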
Inference Problems for Network Diffusion
Cascade Path Inference / Parent Inference
- Observed variables in infections $c^o = \{(u_i, t_i)\}$
- Infer unobserved parent $z_i$
$$p(z \mid \{c^o\}, \alpha) = \prod_i \frac{H(t_i \mid t_{z_i}; \alpha_{u_{z_i} u_i})}{\sum_{j \in \pi_i} H(t_i \mid t_j; \alpha_{u_j u_i})}$$
Decouples into terms involving individual infection parents $z_i$
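Because the parent posterior decouples, each infection can be handled on its own by normalizing the hazards of its candidate parents. The sketch below assumes a hypothetical representation (infections as a time-sorted list of `(node, time)` pairs, the network as a dict of edge strengths) and treats every earlier infection on an in-neighbor as a candidate parent.

```python
def exponential_hazard(strength, dt):
    return strength


def rayleigh_hazard(strength, dt):
    return strength * dt


def parent_posterior(i, infections, alpha, hazard=exponential_hazard):
    """Posterior over the parent of infection i, given the network alpha.

    infections: list of (node, time) pairs
    alpha:      dict mapping (u, v) -> connection strength
    Returns a dict mapping candidate infection index j -> p(z_i = j).
    """
    node_i, t_i = infections[i]
    weights = {}
    for j, (node_j, t_j) in enumerate(infections):
        if t_j < t_i and (node_j, node_i) in alpha:
            weights[j] = hazard(alpha[(node_j, node_i)], t_i - t_j)
    total = sum(weights.values())
    # Seeds have no candidates, so an empty dict is returned for them.
    return {j: w / total for j, w in weights.items()}
```

For exponential delays the hazard is just the edge strength, so the posterior over parents is proportional to $\alpha_{u_j u_i}$.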
Network Inference
- Network connection strength matrix $\alpha$ is unobserved
- Maximum likelihood estimate given $\{c^o\}$
$$\hat{\alpha} = \arg\max_{\alpha} \log p(\{c^o\} \mid \alpha) = \arg\max_{\alpha} \log \sum_z p(C \mid \alpha)$$
Network Diffusion Properties and Expectations
- Property: Function $f(\alpha, z)$ defined on network $\alpha$ and cascade $z$
- Model $\alpha$ and $z$ as random variables
Expected property
Expectation $\bar{f}$ under posterior distribution $p(z, \alpha \mid \{c^o\})$
$$\bar{f}(z, \alpha) = \mathbb{E}_{p(z, \alpha \mid \{c^o\})}[f(z, \alpha)]$$
- Expectation under $\alpha$-marginal $p(\alpha \mid \{c^o\})$ for properties only of $\alpha$
- Expectation under $z$-marginal $p(z \mid \{c^o\})$ for properties only of $z$
Bayesian Framework
Posterior Distribution of Network
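- Network strengths $\alpha$: random variable with prior $p(\alpha)$
- IID assumption: $p(\alpha) = \prod_{uv} p(\alpha_{uv})$
Posterior Distribution
$$p(\alpha \mid \{c^o\}, z) = \prod_{uv} \frac{\bar{H}_{uv} \bar{S}_{uv}\, p(\alpha_{uv})}{\int_{\alpha_{uv}} \bar{H}_{uv} \bar{S}_{uv}\, p(\alpha_{uv})\, d\alpha_{uv}}$$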
Decouples into terms involving individual edge strengths αuv
Conjugate Prior for Network
Require analytical integration wrt αuv
Conjugate prior
- Rayleigh and Exponential special cases of Weibull distribution
- Gamma is conjugate for Weibull (with given shape parameter)
$$p(\alpha_{uv}) = \mathrm{Gamma}(\alpha_{uv}; a, b) = \frac{b^a}{\Gamma(a)} \alpha_{uv}^{a-1} \exp\{-b \alpha_{uv}\}$$
Posterior
$$p(\alpha \mid \{c^o\}, z) = \prod_{uv} \mathrm{Gamma}(a + \rho_{uv}(z), b + \Delta_{uv})$$
$\rho_{uv}(z)$: number of $u$-$v$ infections; $\Delta_{uv}$: cumulative $u$-$v$ infection delay
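The conjugate update is a two-line computation per edge. The sketch below uses hypothetical helper names and the shape/rate parameterization of the Gamma density given above; the example values of `rho` and `delta` are made up for illustration, while `a` and `b` match the prior parameters quoted in the experiments.

```python
def gamma_posterior(a, b, rho, delta):
    """Posterior (shape, rate) of one edge strength under a Gamma(a, b) prior.

    rho:   number of u-v infections under the current parents z
    delta: cumulative u-v infection delay
    """
    return a + rho, b + delta


def gamma_mean(shape, rate):
    """Mean of a Gamma distribution in the shape/rate parameterization."""
    return shape / rate


# Edge with no observed transmission: posterior mean is close to zero.
quiet_mean = gamma_mean(*gamma_posterior(1e-5, 0.1, rho=0, delta=5.0))

# Edge with many transmissions: posterior mean is close to rho / delta.
busy_mean = gamma_mean(*gamma_posterior(1e-5, 0.1, rho=20, delta=10.0))
```

With a small prior shape `a`, an edge that never transmits keeps a posterior mean near zero, while a frequently used edge ends up close to `rho / delta`, the standard maximum likelihood estimate of an exponential rate.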
Suitability for Network Inference
Modeling sparse networks
Given no transmission evidence, very little belief in edge existence
- For edges with no transmission, $\rho_{uv} = 0$
- For $a < 1$ and suitable $b$, posterior peaked sharply at 0
For large data volumes, mean approaches MLE
- For $\rho_{uv} \ge 1$, unimodal posterior with mean $(a + \rho_{uv})/(b + \Delta_{uv})$
Avoiding bias
Gamma prior non-informative for $a, b \ll 1$
Niceness of Properties
Network-nice Properties
Definition: nice-α
A property $f(\alpha, z)$ is nice-$\alpha$ if it can be written as $g(z) \prod_{u,v} h_{uv}(\alpha_{uv}, z)$ or as $g(z) \sum_{u,v} h_{uv}(\alpha_{uv}, z)$, where $\int h_{uv}(\alpha_{uv}, z)\, p(\alpha_{uv} \mid z, \{c^o\})\, d\alpha_{uv}$ can be performed analytically $\forall u, v$.
- Decomposes over parents and individual connection strengths
- Amenable to analytical integration with $p(\alpha_{uv} \mid z, \{c^o\})$
Theorem
Let $f(\alpha, z)$ be nice-$\alpha$. Then computing the $z$-marginal $\bar{f}_z(z) = \int_{\alpha} f(\alpha, z)\, p(\alpha \mid z, \{c^o\})\, d\alpha$ is $O(|E|)$.
Cascade-nice Properties
Definition: nice-z
A property $f(\alpha, z)$ is nice-$z$ if it can be written either as $g(\alpha) \prod_i h_i(z_i, \alpha)$ or as $g(\alpha) \sum_i h_i(z_i, \alpha)$.
Property decomposes over individual infection parents $z_i$
Theorem
Let $f(\alpha, z)$ be nice-$z$. Then the $\alpha$-marginal $\bar{f}_{\alpha}(\alpha) = \sum_z f(\alpha, z)\, p(z \mid \alpha, \{c^o\})$ can be computed in $O(\pi|C|)$ time, where $\pi$ is the maximum number of potential parents over all infections.
Totally-nice Properties
Definition: nice-z, α
A property $f(\alpha, z)$ is nice-$z, \alpha$ if it can be written as
$$\prod_{u,v} g_{uv}(\alpha_{uv}) \prod_{i=1}^{|D|} \frac{h_i(z_i)}{\alpha_{u_{z_i} u_i}}$$
where $\int g_{uv}(\alpha_{uv})\, p(\alpha_{uv} \mid z, \{c^o\})\, d\alpha_{uv}$ can be performed analytically $\forall u, v$.
- $\alpha_{uv}$ and $z_i$ terms are decoupled in the expectation
- $\alpha_{uv}$ terms are Gamma integrable
Theorem
Let $f(\alpha, z)$ be nice-$z, \alpha$. Then the expectation $\bar{f}(\alpha, z)$ can be computed in $O(\pi|C|) + O(|E|)$ time, up to a multiplicative constant.
Network Diffusion Properties
Interesting Network Diffusion Properties
Network-centric Properties: Scores for nodes, edges, etc
- Nodes with large path-reach (approx. LoT): not nice at all
- Nodes with large edge-reach (weak LoT): network-nice only
- Strong frequent edges: network-nice only
- Network Inference: network-nice, cascade-nice, not totally nice
- Edges that are strong or frequent but not both: totally nice
Cascade-centric Properties: Scores for individual infections, infection paths, etc
- Infections by strongest neighbor: cascade-nice only
- Infection parent inference: cascade-nice only
- Complete likelihood, Likelihood: cascade-nice only
Network-centric Properties
Network-centric Properties: Details
Scores for entities in the network, e.g., nodes, edges, etc
Building Blocks
$\alpha^*_{uv}$: Indirect network connection strength between $u$ and $v$
$$\alpha^{(r)}_{uv} = \sum_w \alpha^{(r-1)}_{uw} \alpha_{wv}; \quad \alpha^*_{uv} = \sum_{r=1}^{R} \alpha^{(r)}_{uv}$$
$\rho^*_{uv}$: Indirect cascade transmission frequency between $u$ and $v$
Node-centric properties
Node influence score (approx. LoT)
$$f_u(\alpha, z; a, r) = \sum_v I(\alpha^*_{uv} > a)\, I(\rho^*_{uv}(z) \ge r)$$
Not nice at all, even for $R = 2$
Node influence score for direct infections (weak LoT)
$$f_u(\alpha, z; a, r) = \sum_v I(\alpha_{uv} > a)\, I(\rho_{uv}(z) \ge r)$$
Network-nice, but not totally-nice
Cascade-only version : Network-nice, but not totally-nice
Network-only version : Network-nice, cascade-nice, but not totally-nice
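To make the "network-nice" claim for weak LoT concrete: for fixed parents $z$, the expectation of $I(\alpha_{uv} > a)$ under a Gamma posterior is a Gamma tail probability, so the strength integral reduces to one term per edge. The sketch below is an illustration under assumptions, not the authors' code. It uses a hypothetical `gamma_sf` helper based on the standard series for the regularized lower incomplete gamma function (adequate for moderate arguments), and names the strength threshold `strength_min` to avoid clashing with the prior shape `a`.

```python
import math


def gamma_sf(shape, rate, x):
    """P(alpha > x) for alpha ~ Gamma(shape, rate), via a power series."""
    y = rate * x
    if y <= 0.0:
        return 1.0
    term = math.exp(shape * math.log(y) - y - math.lgamma(shape + 1.0))
    if term == 0.0:
        # Underflow: far in the upper tail if y > shape, else far in the lower tail.
        return 0.0 if y > shape else 1.0
    total, n = term, 1
    while term > 1e-16 * total:
        term *= y / (shape + n)
        total += term
        n += 1
    return max(0.0, 1.0 - total)


def expected_weak_lot(u, rho, delta, a, b, strength_min, freq_min):
    """Expected weak LoT score of node u over alpha, for fixed parents z.

    rho, delta: dicts mapping (u, v) -> transmission count / cumulative delay
    a, b:       Gamma prior shape and rate
    """
    score = 0.0
    for (src, dst), count in rho.items():
        if src == u and count >= freq_min:
            score += gamma_sf(a + count, b + delta[(src, dst)], strength_min)
    return score
```

The remaining sum over parents $z$ is what makes the property not totally nice; it would still have to be approximated, for example by sampling.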
Edge-centric properties
Edge strength-frequency distribution
$$f(\alpha, z) = \sum_{u,v} I(a_1 < \alpha_{uv} < a_2)\, I(r_1 \le \rho_{uv}(z) < r_2)$$
Network-nice, but not totally nice
Edge strength distribution: Network-nice, not totally nice
Edge freq. distribution: Network-nice, cascade-nice, not totally nice
Network Inference
f (α, z) = α
Network-nice, cascade-nice, not totally nice
One Nice Edge-centric Property
Identifying strange edges
Edges that are strong but infrequent, or weak but frequent
$$f_{uv}(\alpha, z) = \alpha_{uv}^{-\rho_{uv}(z)}$$
Totally nice : Expectation can be computed exactly and efficiently
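As a partial check of why the strength integral is tractable here, consider fixed parents $z$ and the Gamma posterior from the conjugate-prior slide, writing $A = a + \rho_{uv}$ and $B = b + \Delta_{uv}$. This covers only the integral over $\alpha_{uv}$ (the full expectation also sums over $z$) and assumes $a > 0$:

```latex
\begin{aligned}
\mathbb{E}\left[\alpha_{uv}^{-\rho_{uv}} \mid z, \{c^o\}\right]
  &= \int_0^\infty \alpha^{-\rho_{uv}} \, \frac{B^{A}}{\Gamma(A)} \, \alpha^{A-1} e^{-B\alpha} \, d\alpha \\
  &= B^{\rho_{uv}} \, \frac{\Gamma(A - \rho_{uv})}{\Gamma(A)} \\
  &= (b + \Delta_{uv})^{\rho_{uv}} \, \frac{\Gamma(a)}{\Gamma(a + \rho_{uv})}
\end{aligned}
```

The factor $\alpha_{uv}^{-\rho_{uv}}$ cancels exactly the power of $\alpha_{uv}$ that the transmissions contribute to the posterior, which is why the result is available in closed form.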
Cascade-centric Properties
Cascade-centric Properties: Details
Scores for entities in cascade, e.g., individual infections, infection paths
Infection property
Infections by strongest neighbor of $u$: $\arg\max_v \alpha_{uv}$
$$f(\alpha, z) = \sum_i I(u_{z_i} = \arg\max_v \alpha_{v u_i})$$
Infection parent identification
$$f(\alpha, z)_{iu} = 1 \text{ if } z_i = u; \quad 0 \text{ otherwise}$$
Both cascade-nice, but not totally nice
Cascade property
Complete likelihood, Likelihood : Both only cascade-nice
Approximation via MCMC
Approximate Evaluation of Not-nice Properties
Use Monte Carlo marginalization for properties that are not totally-nice.
Only Network-nice
Marginalize over network efficiently; Monte Carlo sum over paths
$$\bar{f}(\alpha, z) \approx \frac{1}{S} \sum_s \bar{f}_z(z^{(s)}), \quad \text{where } z^{(s)} \sim p(z \mid \{c^o\}),\ s = 1 \ldots S$$
Only Cascade-nice
Marginalize over paths efficiently; Monte-Carlo integration over network
Not nice
Monte Carlo marginalization over both network and paths
Gibbs Sampling for Network Diffusion
Full / Uncollapsed Gibbs Sampling
Iterate over all $\alpha$ and $z$ variables, sampling a new value for each from its conditional distribution, given the current values of all other variables
$$p(z_i = j \mid z_{-i}, \alpha, \{c^o\}) \propto \alpha_{u_j u_i}$$
$$\alpha_{uv} \mid z, \alpha_{-uv}, \{c^o\} \sim \mathrm{Gamma}(\rho_{uv} + a, \Delta_{uv} + b)$$
For sparse networks, sample αuv only when ρuv > 0
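One sweep of the uncollapsed sampler can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: delays are exponential, each parent choice is represented by the edge it uses, the cumulative delays `delta` are precomputed, and the function name `gibbs_sweep` is hypothetical. Python's `random.gammavariate` takes a scale parameter, so the rate `delta + b` is inverted.

```python
import random


def gibbs_sweep(z, alpha, candidates, delta, a, b, rng):
    """One uncollapsed Gibbs sweep over parents z and strengths alpha.

    z:          list, z[i] is the edge (u, v) currently blamed for infection i
    alpha:      dict mapping edge -> current strength (updated in place)
    candidates: list, candidates[i] is the list of edges that could explain i
    delta:      dict mapping edge -> cumulative infection delay
    a, b:       Gamma prior shape and rate
    """
    # Sample each parent with probability proportional to the edge strength.
    for i, edges in enumerate(candidates):
        weights = [alpha[e] for e in edges]
        z[i] = rng.choices(edges, weights=weights, k=1)[0]

    # Count transmissions per edge under the sampled parents.
    rho = {e: 0 for e in alpha}
    for e in z:
        rho[e] += 1

    # Sample each strength from Gamma(shape = rho + a, rate = delta + b).
    for e in alpha:
        alpha[e] = rng.gammavariate(rho[e] + a, 1.0 / (delta[e] + b))
    return z, alpha
```

This sketch resamples every edge on every sweep for simplicity; the sparse-network shortcut of sampling $\alpha_{uv}$ only when $\rho_{uv} > 0$ would replace the final loop.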
Collapsed Gibbs sampling for network-nice properties
Integrate out $\alpha$ analytically, sample only $z$ variables
$$p(z_i = j \mid z_{-i}, \{c^o\}) \propto \frac{\rho^{-i}_{u_j u_i}(z) + a}{\Delta_{u_j u_i} + b}$$
Experiments
Experiments: Algorithms
Bayesian Expectation
Gamma prior parameters: a = 0.00001, b = 0.1
Frequentist Plug-in
- Take point estimate $\hat{\alpha}$ of network
- Most likely infection parents given $\hat{\alpha}$: $\hat{z} = \arg\max_z p(z \mid \hat{\alpha}, \{c^o\})$
- Evaluate $f(\hat{\alpha}, \hat{z})$
- For $\hat{\alpha}$, use MONET
Exponential distribution for all experiments
Experiments: Synthetic Data
- Evaluate accuracy against a gold-standard
- Analysis for various random graph models
Data Generation
- Forest Fire, Random, Hierarchical, Core-Periphery
- 1000 nodes, ~2000 edges
- $\alpha_{uv} \sim U(0.01, 10)$
- 20 splitting cascades with 2 random seeds, ~50,000 infections
Evaluation
- Parent inference: accuracy against true parent $z^*$
- Strength inference: Best achievable given true parents
- Property: Error wrt $f(\alpha^*, z^*)$
Synthetic Data Experiments: Weak LoT distribution
Bayesian recovery significantly better!
Synthetic Data Experiments: Approx. LoT distribution
Bayesian recovery significantly better!
Synthetic Data Experiments: One-to-one Properties
Loglikelihood (BE: Bayesian Expectation, FP: Frequentist Plug-in)
Network         Test BE    Test FP    Train BE   Train FP
CorePeriphery   1.0e4      0.6e4      2.8e4      3.6e4
Hierarchical    6.5e3      2.4e3      2.0e4      2.2e4
Random          1.1e4      -1.5e4     2.3e4      2.9e4
ForestFire      1.2e4      926        2.8e4      3.3e4
Network and Parent Inference
Network         Network Inf BE   Network Inf FP   Parent Inf BE   Parent Inf FP
CorePeriphery   0.116            2.553            0.533           0.406
Hierarchical    0.884            3.210            0.861           0.783
Random          0.147            17.483           0.757           0.646
ForestFire      0.329            736.821          0.770           0.674
Bayesian approach avoids overfitting for one-to-one properties
Real Data Experiments
Meme-Tracker
- Meme diffusion between 5000 blogs, news sites (Mar 11 - Feb 12)
- 5 topics: Basketball, Alcohol, Technology, NBA, Occupy
- Long cascades: length > 30
- 80-20 train-test split; infections of new users pruned in test
Test Loglikelihood
Topic     BE        FP
Bball     -1.5e6    -3.5e6
Alcohol   -5.8e5    -8.9e5
Tech      -6.6e5    -2.6e6
NBA       -8.9e5    -1.1e7
Occupy    -5.1e5    -1.2e6
Bayesian approach generalizes much better
Scaling Experiments
- Map reduce implementation; 12 core server
- Randomly sampled cascades from Meme-Tracker
Scaling (roughly) linear in no. of cores
LIVE at Wimbledon 2014
Identify most influential Twitter handles on Wimbledon-related topics
Challenges
- Scaling to millions of tweets and users
- Online updates to influence scores
- Considering textual content of tweets for analyzing influence
- Interpretability of the scores
Digression: IBM Debating Technologies
- Develop technologies that assist humans to debate and reason using current world knowledge when there are no clear yes or no answers.
- E.g. Should smoking be banned altogether?
- Use cases: Government policy making, health-care, legal, finance
- Visit our webpage and see (unfortunately not the latest) demo online (google 'IBM Debating Technologies')