Part II: Sample Selection Bias
Selection Bias, Label Bias, and Bias in Ground Truth
Anders Søgaard, Barbara Plank and Dirk Hovy
The CROSS-DOMAIN GULF
Sample Selection Bias
The CROSS-DOMAIN GULF
Sample Selection Bias
The CROSS-DOMAIN GULF
Sample Selection Bias
“domain adaptation”
or “transfer learning”
domain, genre, time,…
→ differences in P(x)
The CROSS-DOMAIN GULF
Off-the-shelf POS tagger
http://cogcomp.cs.illinois.edu/demo/pos/
The/DT share/NN rose/VBD to/TO 10/CD $/$ a/DT unit/NN ./.
May/NNP I/PRP brrow/VBP 10bucks/UH
First, a few words on terminology… what do we call it?
General ML trichotomy
1. supervised ML: labeled DATA
2. semi-supervised ML: labeled DATA + unlabeled DATA
3. unsupervised ML: unlabeled DATA
Domain Adaptation: 4 settings
1. supervised DA (e.g. Daumé, 2007): labeled SOURCE + labeled TARGET
2. semi-supervised DA (e.g. Daumé, 2010; Chang, Connor & Roth, 2010): labeled SOURCE + labeled TARGET + unlabeled TARGET
3. unsupervised DA (e.g. Blitzer et al., 2007; McClosky et al., 2008): labeled SOURCE + unlabeled TARGET
4. blind/unknown DA (2012 onwards; e.g. Søgaard & Johannsen, 2012; Plank & Moschitti, 2013; Elming et al., 2014): labeled SOURCE + ?? UNKNOWN target at test time
Roadmap
labeled SOURCE + unlabeled TARGET (or ?? UNKNOWN target at test time)
- semi-supervised learning
- importance weighting
- adversarial learning
- distant supervision
Semi-supervised machine learning
to address the biased selection of sentences (x)
Semi-supervised learning (SSL)
How can it help us to bridge the cross-domain gulf?
labeled SOURCE + unlabeled TARGET
implicitly adapting by adding newly labeled data from TARGET
✓ if the gulf is not too wide
Self-training
1. train an ML model on labeled SOURCE
2. use the model to label unlabeled TARGET data
3. add the newly labeled TARGET data to the training data
4. re-train and iterate
Self-training: parameters & variants
- pool size, number of iterations
- selection (e.g. only the most confident predictions)
- add with weight
- (in)delible
Delible self-training: use L0 instead of L (Abney, 2007)
Self-training
Pros
✓ simple wrapper method
✓ can correct bias to some extent (if expected error on target is low / the gulf is not too wide)
Cons
‣ many parameters
‣ might introduce more bias (both selection and label bias)
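A minimal sketch of this loop, assuming scikit-learn and pre-vectorized SOURCE/TARGET arrays; the pool size, confidence threshold and iteration count below are illustrative values, not settings from the tutorial.

```python
# Self-training sketch: train on SOURCE, label TARGET, keep confident
# predictions, add them to the training data, re-train, iterate.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_source, y_source, X_target_unlabeled,
               n_iterations=5, pool_size=1000, confidence=0.9):
    X_train, y_train = X_source, y_source
    X_pool = X_target_unlabeled
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    for _ in range(n_iterations):
        if len(X_pool) == 0:
            break
        batch = X_pool[:pool_size]                      # draw a pool of TARGET instances
        probs = model.predict_proba(batch)
        sure = probs.max(axis=1) >= confidence          # keep only confident predictions
        new_labels = model.classes_[probs[sure].argmax(axis=1)]
        # add the newly labeled TARGET data (indelibly) and re-train
        X_train = np.vstack([X_train, batch[sure]])
        y_train = np.concatenate([y_train, new_labels])
        X_pool = np.vstack([batch[~sure], X_pool[pool_size:]])
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model
```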
Co-training
• similar to self-training, but with two views
• two classifiers labeling data for each other
train ML1 and ML2 on labeled SOURCE (using separate feature views); each labels unlabeled TARGET data for the other, and the newly labeled TARGET data is added to the other classifier's training data
Co-training
Pros
✓ simple wrapper method
✓ often less sensitive to mistakes than self-training
Cons
‣ computationally more expensive (ensemble)
‣ many parameters
‣ two views not always available
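A sketch of the same wrapper with two views, again assuming scikit-learn; `view1`/`view2` are hypothetical disjoint column indices defining the two feature views.

```python
# Co-training sketch: two classifiers on different feature views label
# confident TARGET instances for each other.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_src, y_src, X_tgt, view1, view2, rounds=5, confidence=0.9):
    labeled = [(X_src, y_src), (X_src, y_src)]   # per-classifier training data
    views = [view1, view2]
    pool = X_tgt
    models = []
    for _ in range(rounds):
        models = [LogisticRegression(max_iter=1000).fit(X[:, v], y)
                  for (X, y), v in zip(labeled, views)]
        if len(pool) == 0:
            break
        keep = np.ones(len(pool), dtype=bool)
        for i, model in enumerate(models):
            probs = model.predict_proba(pool[:, views[i]])
            sure = probs.max(axis=1) >= confidence
            labels = model.classes_[probs[sure].argmax(axis=1)]
            X_o, y_o = labeled[1 - i]                 # add to the *other* classifier's data
            labeled[1 - i] = (np.vstack([X_o, pool[sure]]),
                              np.concatenate([y_o, labels]))
            keep &= ~sure
        pool = pool[keep]
    return models
```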
Tri-training
three classifiers ML1, ML2, ML3; when two of them agree on an unlabeled instance, it is added to the third classifier's training data
Pros
✓ same advantages as co-training
✓ fewer parameters
Cons
‣ again, an ensemble method
‣ many parameters
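A sketch of the agreement rule (in the spirit of Zhou & Li, 2005), assuming scikit-learn; the bootstrap seeds and the re-training from the SOURCE seed every round are illustrative simplifications.

```python
# Tri-training sketch: when two learners agree on a TARGET instance, it is
# added to the third learner's training data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def tri_train(X_src, y_src, X_tgt, rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    boot = [rng.integers(0, len(y_src), len(y_src)) for _ in range(3)]  # bootstrap samples
    models = [LogisticRegression(max_iter=1000).fit(X_src[b], y_src[b]) for b in boot]
    for _ in range(rounds):
        preds = [m.predict(X_tgt) for m in models]
        new_models = []
        for k in range(3):
            i, j = [m for m in range(3) if m != k]
            agree = preds[i] == preds[j]             # the other two learners agree
            X_k = np.vstack([X_src[boot[k]], X_tgt[agree]])
            y_k = np.concatenate([y_src[boot[k]], preds[i][agree]])
            new_models.append(LogisticRegression(max_iter=1000).fit(X_k, y_k))
        models = new_models
    return models
```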
Implicit use of unlabeled data
train an ML model on labeled SOURCE, but first run unsupervised learning on unlabeled TARGET and add the result as features:
- Brown clusters (e.g., Koo et al., 2008; Turian et al., 2010)
- count/predict models (distributional similarity / embeddings) (e.g., Mikolov et al., 2013; Baroni et al., 2014; Johannsen et al., 2014)
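A toy illustration of adding such features to a tagger, where `cluster_of` stands in for Brown clusters or an embedding-derived clustering learned on unlabeled TARGET text; the feature template names are hypothetical.

```python
# Word-cluster feature sketch: the cluster ID fires for SOURCE and TARGET
# tokens alike, so the tagger can generalize to out-of-domain words that
# share a cluster with in-domain words.
def token_features(sentence, i, cluster_of):
    word = sentence[i]
    return {
        "word": word.lower(),
        "suffix3": word[-3:],
        "is_capitalized": word[0].isupper(),
        "prev_word": sentence[i - 1].lower() if i > 0 else "<S>",
        "cluster": cluster_of.get(word.lower(), "<UNK>"),
    }

# cluster_of would come from Brown clustering or k-means over embeddings
# trained on unlabeled TARGET text; here it is a toy dictionary.
cluster_of = {"bought": "c01", "purchased": "c01", "tweet": "c17"}
print(token_features(["She", "purchased", "a", "unit"], 1, cluster_of))
```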
WHAT ELSE CAN WE DO TO CORRECT SAMPLE BIAS?
Drop!
- data points
- features
Roadmap
labeled SOURCE + unlabeled TARGET (or ?? UNKNOWN target at test time)
- semi-supervised learning
- importance weighting
- adversarial learning
- distant supervision
Importance weighting
Importance weighting (IW)
train on SOURCE, test on TARGET (unlabeled TARGET data available)
assign instance-dependent weights (Shimodaira, 2000), e.g. w(x) = P_TARGET(x) / P_SOURCE(x)
approximation, e.g.: a domain classifier trained to discriminate between SOURCE and TARGET
(Zadrozny, 2004; Bickel and Scheffer, 2007; Søgaard and Haulrich, 2011)
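A sketch of the domain-classifier approximation with scikit-learn; the clipping range and classifier choice are illustrative, and the ratio P(TARGET|x)/P(SOURCE|x) is proportional to P_TARGET(x)/P_SOURCE(x) only up to the class prior.

```python
# Importance weighting sketch: a classifier discriminates SOURCE vs. TARGET,
# and its probabilities become instance weights for the SOURCE training data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_source, X_target_unlabeled):
    X = np.vstack([X_source, X_target_unlabeled])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target_unlabeled))])
    domain_clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_target = domain_clf.predict_proba(X_source)[:, 1]
    # w(x) = P(TARGET|x) / P(SOURCE|x), clipped to avoid extreme weights
    return np.clip(p_target / (1.0 - p_target), 0.1, 10.0)

# weights = importance_weights(X_source, X_target_unlabeled)
# task_model = LogisticRegression(max_iter=1000).fit(X_source, y_source,
#                                                    sample_weight=weights)
```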
Importance weighting (IW)
Pros
✓ simple idea
✓ works well if we know how our sample differs
✓ also useful to combat label bias (more on this later)
Cons
‣ the challenge is to find a good weight function
‣ finite sample: can overcome bias only to a certain extent
Importance weighting in NLP
Only 4 NLP studies¹, of which 2 on unsupervised DA, with mixed results.
Does importance weighting work for unsupervised DA of POS taggers?
¹ (Jiang & Zhai, 2007; Foster et al., 2010; Søgaard & Haulrich, 2011; Plank & Moschitti, 2013)
(Plank, Johannsen, Søgaard, 2014, EMNLP)
Domain classifier: vary the representation and the n-gram size (following Søgaard & Haulrich, 2011)
(Plank, Johannsen, Søgaard, 2014, EMNLP)
Random weighting
• Setup: Google Web Treebank, Universal POS, weighted structured perceptron
(Plank, Johannsen, Søgaard, 2014, EMNLP)
Results: token-based domain classifier
[Bar chart: tagging accuracy (y-axis 92-96) per domain (answers, reviews, emails, weblogs, newsgroups) for the baseline and 1- to 4-gram domain-classifier weighting; annotation: not significantly better than baseline. Results on test sets; results were similar for other representations (Brown, Wiktionary).]

domain        avg tag ambiguity   KL-div   OOV
answers       1.09                0.05     27.7
reviews       1.07                0.04     29.5
emails        1.07                0.03     29.9
weblogs       1.05                0.01     22.1
newsgroups    1.05                0.01     23.1

Tag ambiguity and KL-divergence between SOURCE and TARGET are low, but OOV rates are high!
(Plank, Johannsen, Søgaard, 2014, EMNLP)
Random weighting
[Plots: 500 runs in each plot, with random instance weights drawn from uniform, stdexp and Zipfian distributions; baseline and significance cutoff indicated; annotation: not significantly better than baseline.]
Roadmap
labeled SOURCE + unlabeled TARGET (or ?? UNKNOWN target at test time)
What if we don't know the target?
- semi-supervised learning
- importance weighting
- adversarial learning
- distant supervision
Swamping / Feature dropout
Motivation: ALVINN (Pomerleau's neural-network driving system)
Problem: feature swamping (Sutton et al., 2006)
Idea: corrupt features
Data Corruption
[Figure: an original binary feature vector and several corrupted copies, each with different features zeroed out.]
Dropout
a vector indicating how "active" each feature is
• binomial dropout (Søgaard & Johannsen, 2012): sample P from a random binomial ("hard dropout", 0/1)
• Zipfian corruptions (Søgaard, 2013a): P follows an inverse Zipfian distribution ("soft dropout" / feature importance weighting)
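A rough illustration of the two corruption schemes in numpy; the exact corruption models of the cited papers differ, and the Zipf parameter here is an arbitrary choice.

```python
# Feature corruption sketch: multiply each training example by a corruption
# vector P, drawn afresh for every example (and every pass over the data).
import numpy as np

rng = np.random.default_rng(0)

def binomial_dropout(x, drop_prob=0.1):
    # "hard" dropout (Søgaard & Johannsen, 2012): P is 0/1
    return x * rng.binomial(1, 1.0 - drop_prob, size=x.shape)

def zipfian_corruption(x, a=2.0):
    # "soft" dropout (Søgaard, 2013a): features are down-weighted by
    # Zipfian-distributed factors rather than zeroed out
    return x / rng.zipf(a, size=x.shape)
```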
(Søgaard, 2013b)
Antagonistic adversaries
• It's the predictive features that swamp. Let adversaries focus where it hurts the most:
randomly drop predictive features, i.e. features whose weight is more than one stdev away from the mean
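A simplified sketch of the adversary; in Søgaard (2013b) the adversary interacts with the learner during training, whereas here it is just a corruption step that targets features whose weight lies more than one standard deviation from the mean.

```python
# Antagonistic adversary sketch: only the most predictive features of a
# linear model are eligible for random dropping.
import numpy as np

rng = np.random.default_rng(1)

def adversarial_dropout(X, weights, drop_prob=0.5):
    # weights: per-feature weights of the current (binary) linear model
    predictive = np.abs(weights - weights.mean()) > weights.std()
    mask = np.ones_like(X, dtype=float)
    drop = rng.random(X.shape) < drop_prob
    mask[:, predictive] = np.where(drop[:, predictive], 0.0, 1.0)
    return X * mask
```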
Results: dropout
[Bar chart: tagging accuracy (y-axis 92-95.2) per domain (answers, reviews, emails, weblogs, newsgroups) for baseline, binomial, Zipfian and adversarial dropout. POS correlation with SOURCE: answers 77%, reviews 82%, emails 92%, weblogs 96%, newsgroups 96%; dropout does not help on domains very similar to SOURCE. GWEB data, universal POS tags; dropout results averaged over 5 runs.]
Another view on dropout
Ensemble methods (e.g., the Netflix challenge)
dropout ~ model averaging ~ regularization
(Hinton et al., 2012; Wager, Wang & Liang, 2013)
Roadmap
labeled SOURCE + unlabeled TARGET (or ?? UNKNOWN target at test time)
- semi-supervised learning
- importance weighting
- adversarial learning
- distant supervision
Distant supervision
(Snow, Jurafsky, Ng, 2005; Mintz, Bills, Snow, Jurafsky, 2009)
• Distantly supervised: use a large knowledge base (KB) to create noisily labeled instances
• Idea: if entity1 and entity2 are found in the same sentence and rel(entity1, entity2) ∈ KB ➙ positive training instance
• Exploiting some kind of "world knowledge"
• Like type constraints in sequence tagging (Täckström et al., 2013): "The food is good at COLING"
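A toy sketch of the KB-matching idea for relation extraction; the knowledge base and sentences are made up, and real systems add entity recognition and feature extraction on top.

```python
# Distant supervision sketch: a sentence mentioning an entity pair from the
# KB becomes a (noisily) positive instance for that pair's relation.
KB = {("Barack Obama", "Hawaii"): "born_in",
      ("Google", "Mountain View"): "headquartered_in"}

def distant_label(sentences):
    instances = []
    for sent in sentences:
        for (e1, e2), rel in KB.items():
            if e1 in sent and e2 in sent:
                # noisy: the sentence may express a different (or no) relation
                instances.append((sent, e1, e2, rel))
    return instances

print(distant_label(["Barack Obama was born in Hawaii.",
                     "Barack Obama visited Hawaii last week."]))  # second one is a noisy positive
```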
Type constraints
Can they help us bridge the cross-domain gulf?
- POS tagging (Plank, Johannsen, Søgaard, 2014, EMNLP): YES
- Supersense tagging (Johannsen et al., 2014, *SEM; talk yesterday by Dirk): YES
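A token-level sketch of dictionary type constraints (real taggers apply them inside Viterbi decoding); the tag set, licensed-tag dictionary and scores below are toy examples.

```python
# Type-constraint sketch: a word may only receive tags licensed for it by a
# dictionary (e.g. Wiktionary); unknown words are left unconstrained.
import numpy as np

TAGS = ["NOUN", "VERB", "ADJ", "ADP"]
LICENSED = {"food": {"NOUN"}, "good": {"ADJ", "NOUN"}, "at": {"ADP"}}

def constrained_decode(tokens, scores):
    # scores: (len(tokens), len(TAGS)) array from any tagger
    output = []
    for token, row in zip(tokens, scores):
        allowed = LICENSED.get(token.lower(), set(TAGS))
        masked = [s if t in allowed else -np.inf for t, s in zip(TAGS, row)]
        output.append(TAGS[int(np.argmax(masked))])
    return output

tokens = ["The", "food", "is", "good", "at", "COLING"]
scores = np.random.default_rng(0).normal(size=(len(tokens), len(TAGS)))
print(constrained_decode(tokens, scores))
```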
Semi-supervised learning (SSL)
How can it help us to bridge the cross-domain gulf?
labeled SOURCE + unlabeled TARGET
✓ if the gulf is not too wide
✓ OR combined with distant supervision (the "extra ingredient")
Talk: Thursday 28 Aug, 14:50, room Theatre
Adapting taggers to Twitter using not-so-distant supervision
joint work with Dirk Hovy, Ryan McDonald, Anders Søgaard
Idea
Tweet: "#Localization #job Supplier / Project Manager - Localisation Vendor - NY, NY, United States http://bit.ly/16KigBg #nlppeople"
An off-the-shelf tagger mis-tags several of the tweet's tokens (the predicted tags include JJR and VB for what should be nouns).
Many tweets (about 20%) link to a web page, and the linked page repeats much of the tweet's content in canonical running text, e.g. "The Supplier / Project Manager performs the …", which the same tagger handles correctly.
Project the tags of words that also occur on the linked page back onto the tweet.
Same for NER
Tweet: "Prey Developer worked with Nintendo on project http://bit.ly/17Kbsf"; the tagger labels Nintendo as B-PER.
The linked page contains "In a statement , Nintendo announced that …", where Nintendo is correctly labeled B-ORG; projecting the entity tags corrects the tweet's Nintendo to B-ORG.
Setup
1. tag the tweet
2. tag the linked web page
3. project
4. add data
= augmented self-training
NB: URLs are not required at testing time!
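A sketch of steps 2-4, where `tagger.tag` is a hypothetical per-token tagger and the majority vote per word type is an illustrative choice for the projection step.

```python
# Augmented self-training sketch: project tags from the (easier) linked web
# page onto matching words in the tweet, then add the corrected tweet to the
# training data and re-train. No URLs are needed at test time.
from collections import Counter, defaultdict

def project_tags(tweet_tokens, tweet_tags, page_tokens, page_tags):
    votes = defaultdict(Counter)
    for tok, tag in zip(page_tokens, page_tags):
        votes[tok.lower()][tag] += 1                       # majority tag per word type on the page
    projected = []
    for tok, tag in zip(tweet_tokens, tweet_tags):
        if tok.lower() in votes:
            tag = votes[tok.lower()].most_common(1)[0][0]  # overwrite with the projected tag
        projected.append(tag)
    return projected

# for tweet, page in tweets_with_linked_pages:
#     corrected = project_tags(tweet, tagger.tag(tweet), page, tagger.tag(page))
#     training_data.append((tweet, corrected))
# re-train the tagger on SOURCE + the corrected tweets
```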
POS
[Figure: composition of the POS training and test data.]
POS Results (tagging accuracy)
                WSJ+Gimpel baseline   not-so-distant supervision
Foster          91.6                  92.4
Lowlands        87.5                  88.4
Ritter          87.4                  88.5
Test-average    88.8                  89.8
(The chart also shows plain self-training for comparison.)
Projection Examples
word          initial tag   projected
Snohomish     ADJ           NOUN
Bakery        NOUN          NOUN
Salmon-Safe   NOUN          ADJ
parks         NOUN          NOUN
Limitations
Linked page: "If I gave you one wish that will become true." (wish/NOUN)
Tweet: "What's your wish ?... ? i wish i'll get 3 wishes from you :p URL" (wish occurs as both NOUN and VERB)
Projecting the page's wish/NOUN onto the tweet can overwrite the correct VERB tag of "i wish".
Error Analysis
• improvements due to richer linguistic context
  Man Utd        ARK: PRT NOUN     Our: NOUN NOUN
  Radio Edit     ARK: NOUN VERB    Our: NOUN NOUN
• somewhat arbitrary differences
  Nokia D5000         ARK: NOUN NUM         Our: NOUN NOUN
  love his version    ARK: VERB DET NOUN    Our: VERB PRON NOUN
Roadmap
labeled SOURCE + unlabeled TARGET (or ?? UNKNOWN target at test time)
- semi-supervised learning
- importance weighting
- adversarial learning
- distant supervision
References
Books
Steven Abney. Semisupervised Learning for Computational Linguistics. 2007.
Anders Søgaard. Semi-supervised learning and domain adaptation for NLP. Morgan & Claypool, 2013.
Papers
Baroni et al. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014.
Blitzer et al. Biographies, Bollywood, Boom-boxes, and Blenders: Domain Adaptation for Sentiment Classification. ACL 2007.
Blum & Mitchell. Combining Labeled and Unlabeled Data with Co-training. COLT 1998.
Chang, Connor & Roth. The Necessity of Combining Adaptation Methods. EMNLP 2010.
Daumé. Frustratingly Easy Domain Adaptation. ACL 2007.
Elming, Plank & Hovy. Robust Cross-Domain Sentiment Analysis for Low-Resource Languages. WASSA 2014.
Foster et al. Discriminative instance weighting for domain adaptation in statistical machine translation. EMNLP 2010.
Hinton et al. Improving neural networks by preventing co-adaptation of feature detectors. 2012.
Hovy, Plank & Søgaard. When POS data sets don't add up: Combating sample bias. LREC 2014.
Jiang & Zhai. Instance Weighting for Domain Adaptation in NLP. ACL 2007.
Johannsen et al. More or less supervised super-sense tagging of Twitter. *SEM 2014.
Koo et al. Simple semi-supervised dependency parsing. ACL 2008.
McClosky et al. When is Self-training Effective for Parsing? COLING 2008.
Mikolov et al. Efficient estimation of word representations in vector space. 2013.
Mintz et al. Distant supervision for relation extraction without labeled data. ACL 2009.
Pan & Yang. A Survey on Transfer Learning. IEEE TKDE 2010.
Plank, Hovy & Søgaard. Learning POS taggers with inter-annotator agreement loss. EACL 2014.
Plank, Hovy, McDonald & Søgaard. Adapting taggers to Twitter with not-so-distant supervision. COLING 2014.
Plank, Johannsen & Søgaard. Importance weighting for unsupervised domain adaptation of POS taggers: a negative result. EMNLP 2014.
Plank & Moschitti. Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction. ACL 2013.
Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 2000.
Snow, Jurafsky & Ng. Learning syntactic patterns for automatic hypernym discovery. NIPS 2005.
Sutton et al. Reducing weight undertraining in structured discriminative learning. NAACL 2006.
Søgaard (2013a). Zipfian corruptions for robust POS tagging. NAACL 2013.
Søgaard (2013b). Part-of-speech tagging with antagonistic adversaries. ACL 2013.
Søgaard & Haulrich. Sentence-level instance-weighting for graph-based and transition-based dependency parsing. IWPT 2011.
Søgaard & Johannsen. Robust learning in random subspaces: equipping NLP for OOV effects. COLING 2012.
Søgaard, Østerskov & Rishøj. Semisupervised dependency parsing using generalized tri-training. ACL 2010.
Turian et al. Word representations: A simple and general method for semi-supervised learning. ACL 2010.
Wager, Wang & Liang. Dropout Training as Adaptive Regularization. NIPS 2013.
Zadrozny. Learning and evaluating classifiers under sample selection bias. ICML 2004.
Zhou & Li. Tri-Training: Exploiting Unlabeled Data Using Three Classifiers. IEEE TKDE 2005.