E0 270 Machine Learning
Assignment 1
Due: Jan 27, 2015
1. Suppose that of all emails generated, 40% (group A) are unambiguously valid emails and 40% (group
B) are unambiguously spam. The remaining 20% are in the ‘grey zone’: of these, half (group C)
are considered to be spam by 40% of the population, and the other half (group D) are considered
to be spam by 90% of the population. You are asked to build an email classifier for this
problem that classifies emails as valid or spam, and are told that both types of classification errors
(valid emails classified as spam and spam emails classified as valid) will incur a unit cost/loss, while
correct classifications incur no cost.
(a) What is the Bayes error for this problem?
(b) Your friend designs a classifier that predicts valid for all emails in group A and spam for all other
emails. What is the expected loss of this classifier?
(c) Another friend designs a classifier that predicts valid for 90% of the emails in group A, 50% of
the emails in group C, and 20% of the emails in group D, and predicts spam for all other emails.
What is the expected loss of this classifier?
2. Suppose you are working for a telephone services company and are trying to predict which customers
are likely to renew their current plan subscriptions. The attributes you have for each customer are the
following: age bracket (0–25, 26–60, over 60), income group (low, high), and plan usage status (low,
high). From past data, you know that 70% of customers renew their subscriptions, and that customers
who renew their subscriptions and those who don’t renew have somewhat different distributions across
the above attributes as shown below:
Customers who renew

  Age     Income   Plan use    %
  0–25    low      low         4
  0–25    low      high       28
  0–25    high     low         3
  0–25    high     high        5
  26–60   low      low        10
  26–60   low      high       15
  26–60   high     low         5
  26–60   high     high       10
  60+     low      low         1
  60+     low      high        7
  60+     high     low         4
  60+     high     high        8
  Total                      100

Customers who don’t renew

  Age     Income   Plan use    %
  0–25    low      low        20
  0–25    low      high        8
  0–25    high     low         1
  0–25    high     high        1
  26–60   low      low         4
  26–60   low      high       40
  26–60   high     low         6
  26–60   high     high        2
  60+     low      low        10
  60+     low      high        4
  60+     high     low         2
  60+     high     high        2
  Total                      100
Assume again for simplicity that both types of prediction errors (renewal predicted as non-renewal,
and vice versa) incur equal (unit) cost/loss, while correct predictions incur no cost.
(a) What would be an optimal prediction for a customer who falls in the age group 0–25, is in the
low income group, and has high plan usage?
(b) What would be an optimal prediction for a customer who falls in the age group 26–60, is in the
low income group, and has high plan usage?
(c) What would be an optimal prediction for a customer who falls in the age group over 60, is in the
high income group, and has low plan usage?
E0 270 Machine Learning, January Term 2015, Assignment 1
3. Consider a cost-sensitive binary classification task in which misclassifying a negative instance as positive
incurs a cost of c ∈ (0, 1), and misclassifying a positive instance as negative incurs a cost of 1 − c. This
can be formulated as a learning task with instance space X, label and prediction spaces Y = Ŷ = {±1},
and cost-sensitive loss ℓ_c : {±1} × {±1} → {0, c, 1 − c} defined as

    ℓ_c(y, ŷ) =  c       if y = −1 and ŷ = +1,
                 1 − c   if y = +1 and ŷ = −1,
                 0       if ŷ = y.

Let D be a joint probability distribution on X × {±1}, with marginal distribution μ on X and conditional
label probabilities given by η(x) = P(y = 1 | x), and for any classifier h : X → {±1}, define

    er^c_D[h] = E_{(x,y)∼D}[ ℓ_c(y, h(x)) ].

Derive a Bayes optimal classifier in this setting, i.e. a classifier h* : X → {±1} with

    er^c_D[h*] = inf_{h : X → {±1}} er^c_D[h].
How does your derivation change if the loss incurred on misclassifying a positive instance as negative
is a and that for misclassifying a negative instance as positive is b for some arbitrary a, b > 0?
4. Consider a binary classification task with a ‘reject’ option, in which a classifier can choose to reject
instances about which it is unsure, with a corresponding cost of c ∈ (0, 1/2); misclassifications incur
a cost of 1 as usual. This can be formulated as a learning task with instance space X, label space
Y = {±1}, prediction space Ŷ = {−1, +1, reject}, and loss ℓ_reject(c) : {±1} × {−1, +1, reject} → {0, 1, c}
defined as

    ℓ_reject(c)(y, ŷ) =  c   if ŷ = reject,
                         1   if ŷ ∈ {±1} and ŷ ≠ y,
                         0   if ŷ = y.

Let D be a joint probability distribution on X × {±1}, with marginal distribution μ on X and conditional
label probabilities given by η(x) = P(y = 1 | x), and for any classifier h : X → {−1, +1, reject}, define

    er^reject(c)_D[h] = E_{(x,y)∼D}[ ℓ_reject(c)(y, h(x)) ].

Derive a Bayes optimal classifier in this setting, i.e. a classifier h* : X → {−1, +1, reject} with

    er^reject(c)_D[h*] = inf_{h : X → {−1, +1, reject}} er^reject(c)_D[h].

What happens if c ≥ 1/2?
5. Let X = R^d and Y = Ŷ = {±1}. Assume that given the label y ∈ {±1}, the features are conditionally
independent, and moreover, are normally distributed, so that the class-conditional densities f_y(x) can
be written as

    f_y(x) = ∏_{k=1}^{d} f_y(x_k),

with

    f_y(x_k) = (1 / √(2π σ²_{y,k})) exp( −(x_k − μ_{y,k})² / (2 σ²_{y,k}) ),

where μ_{y,k}, σ²_{y,k} are the mean and variance parameters of the normal distributions. Let p = P(y = 1).
Find an expression for the conditional class probability function η(x) in this setting. Use this to
construct the Bayes optimal classifier (with respect to the usual 0-1 loss). How would you proceed if the
parameters μ_{y,k}, σ²_{y,k}, p are unknown, but you are given a training sample S = ((x1, y1), . . . , (xm, ym)) ∈
(X × {±1})^m, where the examples (xi, yi) are drawn i.i.d. from the above joint distribution (which can
be viewed as generating a labeled example (x, y) by first drawing a label y according to p and then
drawing an instance x according to f_y(x))?
6. Programming Exercise: Logistic Regression. Write a piece of MATLAB code to implement
logistic regression on a given binary classification data set (X, y), where X is an m × d matrix (m
instances, each of dimension d) and y is an m-dimensional vector (with yi ∈ {±1} being a binary
label associated with the i-th instance in X): your program should take the training data (X, y) as
input and output a d-dimensional weight vector w and bias (threshold) term b representing the learnt
classifier. For this problem, you are provided a spam classification data set¹, where each instance is an
email message represented by 57 features and is associated with a binary label indicating whether the
email is spam (+1) or non-spam (−1). The data set is divided into training and test sets. The goal is
to learn from the training set a classifier that can classify new email messages in the test set.
(a) Use your implementation of logistic regression to learn a classifier from 10% of the training data,
then 20% of the training data, then 30%, and so on up to 100% (separate files containing r% of
the training data are provided in the folder Problem-6\Spambase\TrainSubsets). In each case,
measure the classification error on the training examples used, as well as the error on the given test
set (you can use the provided code classification_error.m to help you compute the errors).
Plot a curve showing both the training error and the test error (on the y-axis) as a function of the
number of training examples used (on the x-axis). Such a plot is often called a learning curve.
(Note: the training error should be calculated only on the subset of examples used for training,
not on all the training examples available in the given data set.)
(b) Incorporate an L2-regularizer in your implementation of logistic regression (taking the regularization
parameter λ as an additional input). Run the regularized version of logistic regression on the
entire training set for different values of λ in the range {10⁻⁷, 10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1}.
Plot the training and test error achieved by each value of λ (λ on the x-axis and the classification
error on the y-axis) and identify the value of λ that achieves the lowest test error (resolving ties
in favor of the smallest value of λ). Now, select λ from the same range through 5-fold cross-validation
on the training set (the train and test data for each fold are provided in a separate
folder Problem-6\Spambase\CrossValidation); report the average cross-validation error (across
the five folds) for each value of λ and identify the value of λ that achieves the lowest average
cross-validation error (again resolving ties in favor of the smallest value of λ). Does the cross-validation
procedure select the right value of λ?
Hints: The glmfit function in MATLAB, which implements (unregularized) logistic regression, can be
used for a sanity check in this problem; note that the input labels to this function must be in {0, 1} and
not in {±1}. For implementing your own version of logistic regression, the fminunc MATLAB function
for unconstrained optimization will be useful. Also, do not forget to include the bias (threshold) term
in the logistic regression model.
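(The assignment asks for MATLAB; purely as an illustrative aside, the same optimization can be sketched in Python, with scipy.optimize.minimize playing the role of fminunc. The function names here are our own, and the objective shown is the standard regularized logistic loss for ±1 labels.)

```python
import numpy as np
from scipy.optimize import minimize

def logistic_loss(theta, X, y, lam):
    # theta = [w; b]; labels y are in {+1, -1}; lam is the L2 regularization
    # strength (lam = 0 recovers unregularized logistic regression).
    w, b = theta[:-1], theta[-1]
    margins = y * (X @ w + b)
    # log(1 + exp(-m)) computed stably via logaddexp(0, -m); the bias b
    # is deliberately left out of the regularization term.
    return np.mean(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)

def fit_logistic(X, y, lam=0.0):
    # Minimize the logistic loss over [w; b] starting from zero.
    m, d = X.shape
    theta0 = np.zeros(d + 1)
    res = minimize(logistic_loss, theta0, args=(X, y, lam), method="BFGS")
    return res.x[:-1], res.x[-1]   # weight vector w and bias b
```

A new instance x is then classified as sign(wᵀx + b), matching the classifier representation asked for in the problem.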
7. Consider a regression task on the unit square: X = [0, 1] × [0, 1] and Y = Ŷ = R. Suppose examples
(x, y) are drawn from a joint probability distribution D on X × R whose marginal density on X is given
by

    μ(x) = 2x₁   ∀x = (x₁, x₂) ∈ X,

and whose conditional distribution of y given x is

    y | x ∼ N(x₁ − x₂ + 1, 1).

(a) Find an optimal regression model for D (under squared loss). What is the minimum achievable
squared error for D, er^{sq,*}_D?
(b) Consider the regression model f : X → R given by

    f(x) = x₁ − x₂   ∀x = (x₁, x₂) ∈ X.

Find the squared error of f w.r.t. D, er^{sq}_D[f].
¹ UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Spambase.
8. Consider a regression task on an instance space X, with Y = Ŷ = R, where performance is measured
via the absolute loss ℓ_abs : R × R → R₊ defined as ℓ_abs(y, ŷ) = |ŷ − y|. What is the form of a Bayes
optimal prediction model in this setting?
9. Programming Exercise: Least Squares Regression. Write a small piece of MATLAB code that
uses built-in functions in MATLAB to implement (linear) least-squares regression (with and without
regularization). As before, the input to your code is a training data set (X, y), where X is an m × d
matrix of training examples and y is now an m-dimensional vector of real-valued labels associated with
the instances in X. You are provided with two synthetic data sets for this problem, each split into
training and test sets.
(a) Data Set 1 (1-dimensional). Run (unregularized) least-squares regression on the training set
provided and measure the mean squared error of the learnt linear function on both the training and
test examples (you can use the provided code mean_squared_error.m to compute the errors). In
a single figure, draw a plot of the learnt linear function (input instance on the x-axis and the
predicted value on the y-axis), along with a scatter plot depicting the true label associated with
each test instance. (See folder Problem-9\Data-set-1 for the relevant files.)
(b) Data Set 2 (250-dimensional). Repeat part (a) of Problem 6 on this data set with (unregularized)
least-squares regression, and part (b) of Problem 6 with L2-regularized least-squares regression.
In both cases, use the mean squared error as the performance measure. For the regularized
least-squares regression, use the range {10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1, 10, 100} for selecting the
regularization parameter λ. (See folder Problem-9\Data-set-2 for the relevant files.)
Hints: Do not forget to include a bias (threshold) term b in the model. This can be done as follows:
add an extra column of 1’s to (the right of) the training data matrix X to create an augmented matrix
X̃ of dimension m × (d + 1); then the solution to the unregularized least-squares regression (with the
bias term included) is given by [ŵ; b̂] = (X̃ᵀX̃)⁻¹X̃ᵀy. In the case of the L2-regularized least-squares
regression (with the bias term included), the solution takes the form [ŵ; b̂] = (X̃ᵀX̃ + λmR)⁻¹X̃ᵀy,
where R is the d × d identity matrix augmented with an extra column of 0’s to the right and an extra
row of 0’s at the bottom (i.e., R is a (d + 1) × (d + 1) matrix with Rij = 1 if i = j < d + 1 and 0
otherwise). (If the matrix X̃ᵀX̃ is not invertible, you can take the pseudoinverse of the matrix instead
of the usual matrix inverse; the MATLAB function pinv can be used for pseudoinverse computation.)
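(Again, the assignment expects MATLAB; purely as an illustrative aside, the closed-form formulas in the hint can be sketched in Python with NumPy, where np.linalg.pinv plays the role of MATLAB’s pinv and the last diagonal entry of R is zeroed so the bias is not regularized. The function name is our own.)

```python
import numpy as np

def ridge_ls(X, y, lam=0.0):
    # Append a column of ones so the bias b is learned jointly with w.
    m, d = X.shape
    Xt = np.hstack([X, np.ones((m, 1))])
    # R is the (d+1) x (d+1) identity with its last diagonal entry zeroed,
    # so the regularizer penalizes w but not the bias term b.
    R = np.eye(d + 1)
    R[-1, -1] = 0.0
    # Closed-form solution [w; b] = (X'X + lam*m*R)^+ X'y; the pseudoinverse
    # also covers the case where X'X is singular (lam = 0, unregularized).
    theta = np.linalg.pinv(Xt.T @ Xt + lam * m * R) @ (Xt.T @ y)
    return theta[:-1], theta[-1]   # (w, b)
```

With lam = 0 this reduces to ordinary least squares, so noiseless linear data should be recovered exactly up to numerical precision.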
10. Reading Exercise. We have so far used the 0-1 classification error as an intuitive measure of the
goodness of a binary classifier. It is however important to note that in many real-world applications, there
are several other popular performance measures that are better suited to evaluating a binary classifier.
For example, in a medical diagnostic application such as cancer diagnosis, a misprediction on a patient
with cancer (positive instance) is often more critical than a misprediction on a patient without
cancer (negative instance); hence, performance measures based on the specificity and sensitivity of a
classifier, which penalize mispredictions on positive and negative instances differently, are popular in
these domains. Similarly, in a document retrieval task, it is often the case that the number of relevant
(positive) documents is far smaller than the number of irrelevant (negative) documents, and hence a
default classifier that predicts a negative label on all instances will yield a very low 0-1 error; in such
class-imbalanced settings, performance measures based on the precision and recall of a classifier are
considered better suited than the 0-1 error. Read the following material, which contains details of some
of these performance measures used in different application areas.
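(As an informal aid before reading, the quantities mentioned above can be computed directly from the confusion-matrix counts; the short Python sketch below, with names of our own choosing, shows the standard definitions for ±1 labels.)

```python
def binary_metrics(y_true, y_pred):
    # y_true, y_pred: sequences of +1 / -1 labels.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    return {
        "sensitivity": tp / (tp + fn),  # recall: fraction of positives detected
        "specificity": tn / (tn + fp),  # fraction of negatives correctly rejected
        "precision":   tp / (tp + fp),  # fraction of positive predictions correct
    }
```

Note how a default all-negative classifier on an imbalanced data set achieves high specificity and low 0-1 error while its sensitivity is zero, which is exactly the failure mode the readings discuss.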
(a) D.G. Altman and J.M. Bland. Diagnostic tests 1: Sensitivity and specificity. British Medical
Journal, 308(6943):1552, 1994. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2540489/pdf/bmj00444-0038.pdf.
(b) C.D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge
University Press, 2008. Section 8.3. http://nlp.stanford.edu/IR-book/pdf/08eval.pdf.
(c) H. He and E.A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge
and Data Engineering, 21(9):1263–1284, 2009. Section 4.1.