Click-Through Rate prediction:
TOP-5 solution for the Avazu contest
Dmitry Efimov
April 21, 2015
Outline
Provided data
Likelihood features
FTRL-Proximal Batch algorithm
Factorization Machines
Final results
Provided data
Basic features of the competition dataset, grouped by layer:
Device layer: id, model, type
Time layer: day, hour
Site layer: id, domain, category
Connection layer: ip, type
Banner layer: position, C1, C14-C21
Application layer: id, domain, category
Notations
$X$: $m \times n$ design matrix
$m_{\text{train}} = 40\,428\,967$
$m_{\text{test}} = 4\,577\,464$
$n = 23$
$y$: binary target vector of size $m$
$x^j$: column $j$ of matrix $X$
$x_i$: row $i$ of matrix $X$
$\sigma(z) = \frac{1}{1 + e^{-z}}$: sigmoid function
Evaluation
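Logarithmic loss for $y_i \in \{0, 1\}$:
\[ L = -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right) \]
$\hat{y}_i$ is the prediction for sample $i$.
Logarithmic loss for $y_i \in \{-1, 1\}$:
\[ L = \frac{1}{m} \sum_{i=1}^{m} \log\left( 1 + e^{-y_i p_i} \right) \]
$p_i$ is the raw score from the model, and $\hat{y}_i = \sigma(p_i)$, $\forall i \in \{1, \ldots, m\}$.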
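The two forms of the loss give the same value once the labels are mapped with $y \mapsto 2y - 1$ and the prediction is $\hat{y}_i = \sigma(p_i)$. A minimal sketch that checks this numerically follows; the scores and labels are made-up toy values, not competition data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logloss_01(y, y_hat):
    """Log loss for labels in {0, 1} and predicted probabilities."""
    total = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, y_hat))
    return -total / len(y)

def logloss_pm1(y, p):
    """Log loss for labels in {-1, 1} and raw scores."""
    return sum(math.log1p(math.exp(-yi * pi)) for yi, pi in zip(y, p)) / len(y)

scores = [-2.0, -0.5, 0.3, 1.7]             # toy raw scores p_i
labels01 = [0, 1, 0, 1]                     # toy labels in {0, 1}
labels_pm1 = [2 * y - 1 for y in labels01]  # same labels in {-1, 1}

loss_a = logloss_01(labels01, [sigmoid(p) for p in scores])
loss_b = logloss_pm1(labels_pm1, scores)
```

Both `loss_a` and `loss_b` should agree up to floating-point rounding.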
Feature engineering
- Blocks: combination of two or more features
- Counts: number of samples for different feature values
- Counts unique: number of different values of one feature for a fixed value of another feature
- Likelihoods: $\min_{\theta_t} L$, where $\theta_t = P(y_i \mid x_{ij} = t)$
- Others
Feature engineering (algorithm 1)
function SPLITBYROWS(M)
    get partition of the matrix M into $\{M_t\}$ by rows such that each $M_t$ has identical rows
Require:
    $J = \{j_1, \ldots, j_s\} \subset \{1, \ldots, n\}$
    $K = \{k_1, \ldots, k_q\} \subset \{1, \ldots, n\} \setminus J$
$Z \leftarrow (x_{ij}) \subset X$, $i \in \{1, 2, \ldots, m\}$, $j \in J$
for $I = \{i_1, \ldots, i_r\} \in$ SPLITBYROWS(Z)
    $c_{i_1} = \ldots = c_{i_r} = r$
    $p_{i_1} = \ldots = p_{i_r} = \frac{y_{i_1} + \ldots + y_{i_r}}{r}$
    $b_{i_1} = \ldots = b_{i_r} = t$
    $A_t = (x_{ik}) \subset X$, $i \in I$, $k \in K$
    $T(A_t) = \mathrm{size}(\mathrm{SPLITBYROWS}(A_t))$
    $u_{i_1} = \ldots = u_{i_r} = T(A_t)$
Steps shown in the diagram:
Step 1: select the columns $j_1, \ldots, j_s$ of $X$ to form $Z$.
Step 2: split $Z$ by rows into blocks $Z_1, \ldots, Z_{T(Z)}$ of identical rows; block $Z_t$ consists of rows $i_1, \ldots, i_r$.
Step 3: for the rows of $Z_t$, set the count $c$, the mean target $p$ and the block index $b$.
Step 4: take the columns $k_1, \ldots, k_q$ of the same rows to form $A_t$ and split it into blocks $B_1, \ldots, B_{T(A_t)}$.
Step 5: set $u$ to $T(A_t)$ for the rows of $Z_t$.
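A minimal Python sketch of algorithm 1, assuming the design matrix fits in memory as a list of rows. The names `split_by_rows` and `algorithm1` and the toy data are illustrative and not taken from the original solution; column indices are 0-based here, while the slides use 1-based indices.

```python
from collections import defaultdict

def split_by_rows(rows, cols):
    """SPLITBYROWS: group row indices whose values in `cols` are identical."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[c] for c in cols)].append(i)
    return list(groups.values())

def algorithm1(X, y, J, K):
    """Return count c, mean target p, block index b and unique count u per row."""
    m = len(X)
    c, p, b, u = [0] * m, [0.0] * m, [0] * m, [0] * m
    for t, I in enumerate(split_by_rows(X, J), start=1):
        r = len(I)
        mean_y = sum(y[i] for i in I) / r
        # A_t: rows of this block restricted to columns K; T(A_t) = number of distinct rows
        n_unique = len(split_by_rows([X[i] for i in I], K))
        for i in I:
            c[i], p[i], b[i], u[i] = r, mean_y, t, n_unique
    return c, p, b, u

# Toy data (made up): columns are (site, device, banner)
X = [
    ["s1", "d1", "b1"],
    ["s1", "d2", "b1"],
    ["s2", "d1", "b2"],
    ["s1", "d1", "b1"],
]
y = [1, 0, 0, 0]
c, p, b, u = algorithm1(X, y, J=[0], K=[1])
```

With `J=[0]` and `K=[1]`, `c` counts the samples per site, `p` is the mean click rate per site, and `u` is the number of distinct devices seen on each site.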
Feature engineering (algorithm 2)
function SPLITBYROWS(M)
    get partition of the matrix M into $\{M_t\}$ by rows such that each $M_t$ has identical rows
Require:
    parameter $\alpha > 0$
    $J \leftarrow (j_1, \ldots, j_s) \subset \{1, \ldots, n\}$
    increasing sequence $V \leftarrow (v_1, \ldots, v_l) \subset \{1, \ldots, s\}$, $v_1 < s$
$f_i \leftarrow \frac{y_1 + \ldots + y_m}{m}$, $\forall i \in \{1, \ldots, m\}$
for $v \in V$ do
    $J_v = \{j_1, \ldots, j_v\}$, $Z = (x_{ij}) \subset X$, $i \in \{1, 2, \ldots, m\}$, $j \in J_v$
    for $I = \{i_1, \ldots, i_r\} \in$ SPLITBYROWS(Z) do
        $c_{i_1} = \ldots = c_{i_r} = r$
        $p_{i_1} = \ldots = p_{i_r} = \frac{y_{i_1} + \ldots + y_{i_r}}{r}$
    $w = \sigma(-c + \alpha)$ (weight vector)
    $f_i = (1 - w_i) \cdot f_i + w_i \cdot p_i$, $\forall i \in \{1, \ldots, m\}$
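A minimal Python sketch of algorithm 2 under the same in-memory assumption. The weight is taken exactly as written on the slide, $w = \sigma(-c + \alpha)$; the function names and toy data are illustrative, and column indices are 0-based.

```python
import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def split_by_rows(rows, cols):
    """SPLITBYROWS: group row indices whose values in `cols` are identical."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[c] for c in cols)].append(i)
    return list(groups.values())

def algorithm2(X, y, J, V, alpha):
    """Blend group-level mean targets into one likelihood feature f."""
    m = len(X)
    f = [sum(y) / m] * m                     # start from the global mean target
    for v in V:
        Jv = J[:v]                           # first v columns of J
        c, p = [0] * m, [0.0] * m
        for I in split_by_rows(X, Jv):
            r = len(I)
            mean_y = sum(y[i] for i in I) / r
            for i in I:
                c[i], p[i] = r, mean_y
        for i in range(m):
            w = sigmoid(-c[i] + alpha)       # weight as written on the slide
            f[i] = (1 - w) * f[i] + w * p[i]
    return f

# Toy data (made up): columns are (site, device, banner)
X = [
    ["s1", "d1", "b1"],
    ["s1", "d2", "b1"],
    ["s2", "d1", "b2"],
    ["s1", "d1", "b1"],
]
y = [1, 0, 0, 0]
f = algorithm2(X, y, J=[0, 1], V=[1, 2], alpha=2.0)
```

Each `f[i]` is a convex combination of mean targets, so it stays in $[0, 1]$, and rows with identical values in the columns of `J` receive the same value.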
FTRL-Proximal model
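Weight updates:
\[ w_{i+1} = \arg\min_w \left( \sum_{r=1}^{i} g_r \cdot w + \frac{1}{2} \sum_{r=1}^{i} \tau_r \|w - w_r\|_2^2 + \lambda_1 \|w\|_1 \right) \]
\[ = \arg\min_w \left( w \cdot \sum_{r=1}^{i} (g_r - \tau_r w_r) + \frac{1}{2} \|w\|_2^2 \sum_{r=1}^{i} \tau_r + \lambda_1 \|w\|_1 + \text{const} \right), \]
where
\[ \sum_{r=1}^{i} \tau_{rj} = \frac{\beta + \sqrt{\sum_{r=1}^{i} g_{rj}^2}}{\alpha} + \lambda_2, \quad j \in \{1, \ldots, N\}, \]
$\lambda_1, \lambda_2, \alpha, \beta$ are parameters, $\tau_r = (\tau_{r1}, \ldots, \tau_{rN})$ are learning rates, and $g_r$ is the gradient vector for step $r$.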
FTRL-Proximal Batch model
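Require: parameters $\alpha, \beta, \lambda_1, \lambda_2$
$z_j \leftarrow 0$ and $n_j \leftarrow 0$, $\forall j \in \{1, \ldots, N\}$
for $i = 1$ to $m$ do
    receive sample vector $x_i$ and let $J = \{j \mid x_{ij} \neq 0\}$
    for $j \in J$ do
        $w_j = \begin{cases} 0 & \text{if } |z_j| \leq \lambda_1 \\ -\left( \frac{\beta + \sqrt{n_j}}{\alpha} + \lambda_2 \right)^{-1} \left( z_j - \mathrm{sign}(z_j) \lambda_1 \right) & \text{otherwise} \end{cases}$
    predict $\hat{y}_i = \sigma(x_i \cdot w)$ using the $w_j$ and observe $y_i$
    if $y_i \in \{0, 1\}$ then
        for $j \in J$ do
            $g_j = \hat{y}_i - y_i$ (gradient direction of the loss w.r.t. $w_j$)
            $\tau_j = \frac{1}{\alpha} \left( \sqrt{n_j + g_j^2} - \sqrt{n_j} \right)$
            $z_j = z_j + g_j - \tau_j w_j$
            $n_j = n_j + g_j^2$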
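The loop above can be sketched in Python as follows. This is a minimal illustration rather than the competition code: it assumes one-hot (binary) features, so a sample is a list of active feature indices and the per-feature gradient is $\hat{y}_i - y_i$; the class name, parameter values and toy stream are hypothetical. A sample without a label (for example a test row in the sorted stream) yields a prediction but no update.

```python
import math

def sigmoid(z):
    z = max(min(z, 35.0), -35.0)  # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(-z))

class FTRLProximal:
    def __init__(self, alpha, beta, l1, l2):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # z_j accumulators
        self.n = {}  # n_j: sums of squared gradients

    def weight(self, j):
        z, n = self.z.get(j, 0.0), self.n.get(j, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        scale = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - math.copysign(self.l1, z)) / scale

    def step(self, active, y=None):
        """Predict for one sample; update the state only if a label is given."""
        w = {j: self.weight(j) for j in active}
        y_hat = sigmoid(sum(w.values()))
        if y in (0, 1):
            g = y_hat - y
            for j in active:
                n = self.n.get(j, 0.0)
                tau = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
                self.z[j] = self.z.get(j, 0.0) + g - tau * w[j]
                self.n[j] = n + g * g
        return y_hat

# Toy stream (made up): feature 0 co-occurs with clicks, feature 1 with
# non-clicks, and feature 2 is active in every sample.
model = FTRLProximal(alpha=0.1, beta=1.0, l1=0.1, l2=1.0)
for active, label in [([0, 2], 1), ([1, 2], 0)] * 200:
    model.step(active, label)

p_click = model.step([0, 2])      # no label: prediction only
p_no_click = model.step([1, 2])
```

After training, `p_click` should exceed `p_no_click`, and repeated unlabeled calls leave the model state unchanged.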
Performance
Description | Leaderboard score
dataset is sorted by app id, site id, banner pos, count1, day, hour | 0.3844277
dataset is sorted by app domain, site domain, count1, day, hour | 0.3835289
dataset is sorted by person, day, hour | 0.3844345
dataset is sorted by day, hour with 1 iteration | 0.3871982
dataset is sorted by day, hour with 2 iterations | 0.3880423
Factorization Machine (FM)
Second-order polynomial regression:
\[ \hat{y} = \sigma\left( \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} w_{jk} x^j x^k \right) \]
Low-rank approximation (FM):
\[ \hat{y} = \sigma\left( \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} (v_{j1}, \ldots, v_{jH}) \cdot (v_{k1}, \ldots, v_{kH}) \, x^j x^k \right) \]
$H$ is the number of latent factors.
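A minimal sketch of the low-rank score for one sample, before the sigmoid is applied. `fm_score` evaluates the double sum directly. `fm_score_fast` uses the well-known identity $\sum_{j<k} \langle v_j, v_k \rangle x^j x^k = \frac{1}{2} \sum_h \left[ \left( \sum_j v_{jh} x^j \right)^2 - \sum_j (v_{jh} x^j)^2 \right]$, which is not on the slides and is included only as a cross-check. The latent vectors are made-up numbers.

```python
def fm_score(x, V):
    """Direct pairwise sum: sum over j < k of <V[j], V[k]> * x[j] * x[k]."""
    n = len(x)
    s = 0.0
    for j in range(n - 1):
        for k in range(j + 1, n):
            dot = sum(a * b for a, b in zip(V[j], V[k]))
            s += dot * x[j] * x[k]
    return s

def fm_score_fast(x, V):
    """Same value in O(n * H) time using the sum-of-squares identity."""
    n, H = len(x), len(V[0])
    s = 0.0
    for h in range(H):
        lin = sum(V[j][h] * x[j] for j in range(n))
        sq = sum((V[j][h] * x[j]) ** 2 for j in range(n))
        s += lin * lin - sq
    return 0.5 * s

x = [1.0, 0.0, 2.0]                            # toy feature vector, n = 3
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]     # toy latent factors, H = 2
slow = fm_score(x, V)
fast = fm_score_fast(x, V)
```

The full second-order model needs $n(n-1)/2$ weights $w_{jk}$, while the factorized form needs only $n \cdot H$ latent values.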
Factorization Machine for categorical dataset
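Assign a set of latent factors to each level-feature pair:
\[ \hat{y}_i = \sigma\left( \frac{2}{n} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} (w_{x_{ij} k 1}, \ldots, w_{x_{ij} k H}) \cdot (w_{x_{ik} j 1}, \ldots, w_{x_{ik} j H}) \right) \]
Add a regularization term: $L_{reg} = L + \frac{1}{2} \lambda \|w\|^2$
The gradient direction:
\[ g_{x_{ij} k h} = \frac{\partial L_{reg}}{\partial w_{x_{ij} k h}} = -\frac{2}{n} \cdot \frac{y_i e^{-y_i p_i}}{1 + e^{-y_i p_i}} \cdot w_{x_{ik} j h} + \lambda w_{x_{ij} k h} \]
Learning rate schedule: $\tau_{x_{ij} k h} = \tau_{x_{ij} k h} + g_{x_{ij} k h}^2$
Weight update: $w_{x_{ij} k h} = w_{x_{ij} k h} - \alpha \cdot \sqrt{\tau_{x_{ij} k h}} \cdot g_{x_{ij} k h}$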
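A minimal sketch of this model with per-coordinate adaptive steps, for labels in $\{-1, 1\}$. Several choices here are assumptions rather than details from the slides: latent vectors are keyed by (feature index, level, other feature index), they are initialized uniformly at random, the accumulator $\tau$ starts at 1, and the step is divided by $\sqrt{\tau}$ in the usual AdaGrad manner. The constants and toy samples are hypothetical.

```python
import math
import random

H = 2           # number of latent factors (hypothetical value)
ALPHA = 0.1     # step size (hypothetical value)
LAMBDA = 1e-4   # regularization strength (hypothetical value)

def latent(W, T, key, rng):
    """Lazily create the latent vector and its accumulator for `key`."""
    if key not in W:
        W[key] = [rng.uniform(-0.1, 0.1) for _ in range(H)]  # assumed init
        T[key] = [1.0] * H                                   # assumed start value
    return W[key]

def score(x, W, T, rng):
    """Raw score p_i = (2/n) * sum over j < k of <w[x_ij, k], w[x_ik, j]>."""
    n = len(x)
    s = 0.0
    for j in range(n - 1):
        for k in range(j + 1, n):
            a = latent(W, T, (j, x[j], k), rng)
            b = latent(W, T, (k, x[k], j), rng)
            s += sum(ah * bh for ah, bh in zip(a, b))
    return 2.0 / n * s

def sgd_step(x, y, W, T, rng):
    """One update for sample x (list of levels); returns the loss before the update."""
    n = len(x)
    p = score(x, W, T, rng)
    kappa = -y / (1.0 + math.exp(y * p))   # equals -y e^{-yp} / (1 + e^{-yp})
    for j in range(n - 1):
        for k in range(j + 1, n):
            ka, kb = (j, x[j], k), (k, x[k], j)
            a, b = W[ka], W[kb]
            for h in range(H):
                ga = 2.0 / n * kappa * b[h] + LAMBDA * a[h]
                gb = 2.0 / n * kappa * a[h] + LAMBDA * b[h]
                T[ka][h] += ga * ga
                T[kb][h] += gb * gb
                a[h] -= ALPHA * ga / math.sqrt(T[ka][h])  # assumed AdaGrad-style step
                b[h] -= ALPHA * gb / math.sqrt(T[kb][h])
    return math.log1p(math.exp(-y * p))

rng = random.Random(0)
W, T = {}, {}
data = [(["siteA", "phone"], 1), (["siteB", "tablet"], -1)]  # made-up samples
losses = []
for epoch in range(50):
    losses.append(sum(sgd_step(x, y, W, T, rng) for x, y in data) / len(data))
```

On this toy data the average loss starts near $\log 2$ and decreases over the epochs.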
Ensembling
Model | Description | Leaderboard score
ftrlb1 | dataset is sorted by app id, site id, banner pos, count1, day, hour | 0.3844277
ftrlb2 | dataset is sorted by app domain, site domain, count1, day, hour | 0.3835289
ftrlb3 | dataset is sorted by person, day, hour | 0.3844345
fm | factorization machine | 0.3818004
ens | $\text{fm}^{0.6} \cdot \text{ftrlb1}^{0.1} \cdot \text{ftrlb2}^{0.2} \cdot \text{ftrlb3}^{0.1}$ | 0.3810447
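The ensemble is a weighted geometric mean of the four models' predicted probabilities, with weights that sum to 1. A minimal sketch for a single sample follows; the per-model probabilities are made-up values.

```python
import math

# Ensemble weights from the table: fm^0.6 * ftrlb1^0.1 * ftrlb2^0.2 * ftrlb3^0.1
WEIGHTS = {"fm": 0.6, "ftrlb1": 0.1, "ftrlb2": 0.2, "ftrlb3": 0.1}

def ensemble(preds):
    """Weighted geometric mean of per-model click probabilities for one sample."""
    return math.exp(sum(w * math.log(preds[name]) for name, w in WEIGHTS.items()))

# Made-up predictions for one impression
sample = {"fm": 0.20, "ftrlb1": 0.25, "ftrlb2": 0.15, "ftrlb3": 0.30}
p_ens = ensemble(sample)
```

Because the weights sum to 1, `p_ens` always lies between the smallest and the largest of the individual predictions.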
Final results
Place | Team | Leaderboard score | Difference from the 1st place
1 | 4 Idiots | 0.3791384 | -
2 | Owen | 0.3803652 | 0.32%
3 | Random Walker | 0.3806351 | 0.40%
4 | Julian de Wit | 0.3810307 | 0.50%
5 | Dmitry Efimov | 0.3810447 | 0.50%
6 | Marios and Abhishek | 0.3828641 | 0.98%
7 | Jose A. Guerrero | 0.3829448 | 1.00%
Future work
- apply the batching idea to the Factorization Machine algorithm
- find a better sorting for the FTRL-Proximal Batch algorithm
- find an algorithm that can find a better sorting without a cross-validation procedure