E0 270 Machine Learning, January Term 2015                          Due: Jan 27, 2015

Assignment 1

1. Suppose that of all emails generated, 40% (group A) are unambiguously valid emails and 40% (group B) are unambiguously spam. The remaining 20% are in the ‘grey zone’; of these, half (group C) are considered to be spam by 40% of the population, and the other half (group D) are considered to be spam by 90% of the population. You are asked to build an email classifier for this problem that classifies emails as valid or spam, and are told that both types of classification errors (valid emails classified as spam, and spam emails classified as valid) will incur a unit cost/loss, while correct classifications incur no cost.

   (a) What is the Bayes error for this problem?

   (b) Your friend designs a classifier that predicts valid for all emails in group A and spam for all other emails. What is the expected loss of this classifier?

   (c) Another friend designs a classifier that predicts valid for 90% of the emails in group A, 50% of the emails in group C, and 20% of the emails in group D, and predicts spam for all other emails. What is the expected loss of this classifier?

2. Suppose you are working for a telephone services company and are trying to predict which customers are likely to renew their current plan subscriptions. The attributes you have for each customer are the following: age bracket (0–25, 26–60, over 60), income group (low, high), and plan usage status (low, high).
   From past data, you know that 70% of customers renew their subscriptions, and that customers who renew their subscriptions and those who don’t renew have somewhat different distributions across the above attributes, as shown below:

      Customers who renew               Customers who don’t renew
      Age     Income   Plan use    %    Age     Income   Plan use    %
      0–25    low      low         4    0–25    low      low        20
      0–25    low      high       28    0–25    low      high        8
      0–25    high     low         3    0–25    high     low         1
      0–25    high     high        5    0–25    high     high        1
      26–60   low      low        10    26–60   low      low         4
      26–60   low      high       15    26–60   low      high       40
      26–60   high     low         5    26–60   high     low         6
      26–60   high     high       10    26–60   high     high        2
      60+     low      low         1    60+     low      low        10
      60+     low      high        7    60+     low      high        4
      60+     high     low         4    60+     high     low         2
      60+     high     high        8    60+     high     high        2
      Total                      100    Total                      100

   Assume again for simplicity that both types of prediction errors (renewal predicted as non-renewal, and vice versa) incur equal (unit) cost/loss, while correct predictions incur no cost.

   (a) What would be an optimal prediction for a customer who falls in the age group 0–25, is in the low income group, and has high plan usage?

   (b) What would be an optimal prediction for a customer who falls in the age group 26–60, is in the low income group, and has high plan usage?

   (c) What would be an optimal prediction for a customer who falls in the age group over 60, is in the high income group, and has low plan usage?

3. Consider a cost-sensitive binary classification task in which misclassifying a negative instance as positive incurs a cost of c ∈ (0, 1), and misclassifying a positive instance as negative incurs a cost of 1 − c. This can be formulated as a learning task with instance space X, label and prediction spaces Y = Ŷ = {±1}, and cost-sensitive loss ℓ_c : {±1} × {±1} → {0, c, 1 − c} defined as

      ℓ_c(y, ŷ) = c        if y = −1 and ŷ = +1
                  1 − c    if y = +1 and ŷ = −1
                  0        if ŷ = y.
   Let D be a joint probability distribution on X × {±1}, with marginal distribution μ on X and conditional label probabilities given by η(x) = P(y = 1 | x), and for any classifier h : X → {±1}, define

      er^c_D[h] = E_{(x,y)∼D}[ℓ_c(y, h(x))].

   Derive a Bayes optimal classifier in this setting, i.e. a classifier h* : X → {±1} with

      er^c_D[h*] = inf_{h : X → {±1}} er^c_D[h].

   How does your derivation change if the loss incurred on misclassifying a positive instance as negative is a and that for misclassifying a negative instance as positive is b, for some arbitrary a, b > 0?

4. Consider a binary classification task with a ‘reject’ option, in which a classifier can choose to reject instances about which it is unsure, with a corresponding cost of c ∈ (0, 1/2); misclassifications incur a cost of 1 as usual. This can be formulated as a learning task with instance space X, label space Y = {±1}, prediction space Ŷ = {−1, +1, reject}, and loss ℓ_reject(c) : {±1} × {−1, +1, reject} → {0, 1, c} defined as

      ℓ_reject(c)(y, ŷ) = c    if ŷ = reject
                          1    if ŷ ∈ {±1} and ŷ ≠ y
                          0    if ŷ = y.

   Let D be a joint probability distribution on X × {±1}, with marginal distribution μ on X and conditional label probabilities given by η(x) = P(y = 1 | x), and for any classifier h : X → {−1, +1, reject}, define

      er^reject(c)_D[h] = E_{(x,y)∼D}[ℓ_reject(c)(y, h(x))].

   Derive a Bayes optimal classifier in this setting, i.e. a classifier h* : X → {−1, +1, reject} with

      er^reject(c)_D[h*] = inf_{h : X → {−1,+1,reject}} er^reject(c)_D[h].

   What happens if c ≥ 1/2?

5. Let X = R^d and Y = Ŷ = {±1}. Assume that given the label y ∈ {±1}, the features are conditionally independent and, moreover, normally distributed, so that the class-conditional densities f_y(x) can be written as

      f_y(x) = ∏_{k=1}^{d} f_y(x_k),   with

      f_y(x_k) = (1 / √(2π σ²_{y,k})) exp(−(x_k − μ_{y,k})² / (2σ²_{y,k})),

   where μ_{y,k}, σ²_{y,k} are the mean and variance parameters of the normal distributions. Let p = P(y = 1).
   Find an expression for the conditional class probability function η(x) in this setting. Use this to construct the Bayes optimal classifier (with respect to the usual 0-1 loss). How would you proceed if the parameters μ_{y,k}, σ²_{y,k}, p are unknown, but you are given a training sample S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × {±1})^m, where the examples (x_i, y_i) are drawn i.i.d. from the above joint distribution (which can be viewed as generating a labeled example (x, y) by first drawing a label y according to p and then drawing an instance x according to f_y(x))?

6. Programming Exercise: Logistic Regression. Write a piece of MATLAB code to implement logistic regression on a given binary classification data set (X, y), where X is an m × d matrix (m instances, each of dimension d) and y is an m-dimensional vector (with y_i ∈ {±1} being the binary label associated with the i-th instance in X): your program should take the training data (X, y) as input and output a d-dimensional weight vector w and a bias (threshold) term b representing the learnt classifier. For this problem, you are provided a spam classification data set [1], where each instance is an email message represented by 57 features and is associated with a binary label indicating whether the email is spam (+1) or non-spam (−1). The data set is divided into training and test sets. The goal is to learn from the training set a classifier that can classify new email messages in the test set.

   (a) Use your implementation of logistic regression to learn a classifier from 10% of the training data, then 20% of the training data, then 30%, and so on up to 100% (separate files containing r% of the training data are provided in the folder Problem-6\Spambase\TrainSubsets).
   In each case, measure the classification error on the training examples used, as well as the error on the given test set (you can use the provided code classification_error.m to help you compute the errors). Plot a curve showing both the training error and the test error (on the y-axis) as a function of the number of training examples used (on the x-axis); such a plot is often called a learning curve. (Note: the training error should be calculated only on the subset of examples used for training, not on all the training examples available in the given data set.)

   (b) Incorporate an L2-regularizer in your implementation of logistic regression (taking the regularization parameter λ as an additional input). Run the regularized version of logistic regression on the entire training set for different values of λ in the range {10^−7, 10^−6, 10^−5, 10^−4, 10^−3, 10^−2, 10^−1, 1}. Plot the training and test error achieved by each value of λ (λ on the x-axis and the classification error on the y-axis) and identify the value of λ that achieves the lowest test error (resolving ties in favor of the smallest value of λ). Now, select λ from the same range through 5-fold cross-validation on the training set (the train and test data for each fold are provided in a separate folder Problem-6\Spambase\CrossValidation); report the average cross-validation error (across the five folds) for each value of λ and identify the value of λ that achieves the lowest average cross-validation error (again resolving ties in favor of the smallest value of λ). Does the cross-validation procedure select the right value of λ?

   Hints: The glmfit function in MATLAB, which implements (unregularized) logistic regression, can be used as a sanity check in this problem; note that the input labels to this function must be in {0, 1}, not in {±1}. For implementing your own version of logistic regression, the fminunc MATLAB function for unconstrained optimization will be useful.
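   Although the exercise itself asks for MATLAB (with fminunc minimizing the regularized logistic objective), the computation being implemented can be sketched in Python/NumPy as a reference; plain gradient descent stands in for fminunc here, and the function names, step size, and iteration count below are illustrative choices, not part of the assignment:

```python
import numpy as np

def train_logistic(X, y, lam=0.0, lr=0.1, n_iters=2000):
    """Fit a logistic regression classifier by gradient descent.

    X: (m, d) data matrix; y: (m,) labels in {-1, +1}; lam: L2 penalty on w
    (the bias b is left unregularized).  Minimizes
        (1/m) * sum_i log(1 + exp(-y_i (w.x_i + b))) + lam * ||w||^2
    and returns (w, b)."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z)), applied per example
        coeff = -y / (1.0 + np.exp(margins))          # shape (m,)
        grad_w = X.T @ coeff / m + 2.0 * lam * w
        grad_b = coeff.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def classification_error(X, y, w, b):
    """Fraction of instances misclassified by sign(w.x + b)."""
    preds = np.where(X @ w + b >= 0, 1, -1)
    return float(np.mean(preds != y))
```

   The same objective, written as a MATLAB function of the stacked parameter vector [w; b], is what one would hand to fminunc.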
   Also, do not forget to include the bias (threshold) term in the logistic regression model.

7. Consider a regression task on the unit square: X = [0, 1] × [0, 1] and Y = Ŷ = R. Suppose examples (x, y) are drawn from a joint probability distribution D on X × R whose marginal density on X is given by

      μ(x) = 2x_1   ∀x = (x_1, x_2) ∈ X,

   and whose conditional distribution of y given x is

      y | x ∼ N(x_1 − x_2 + 1, 1).

   (a) Find an optimal regression model for D (under squared loss). What is the minimum achievable squared error for D, er^sq,*_D?

   (b) Consider the regression model f : X → R given by f(x) = x_1 − x_2 ∀x = (x_1, x_2) ∈ X. Find the squared error of f w.r.t. D, er^sq_D[f].

   [1] UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Spambase.

8. Consider a regression task on an instance space X, with Y = Ŷ = R, where performance is measured via the absolute loss ℓ_abs : R × R → R_+ defined as ℓ_abs(y, ŷ) = |ŷ − y|. What is the form of a Bayes optimal prediction model in this setting?

9. Programming Exercise: Least Squares Regression. Write a small piece of MATLAB code that uses built-in functions in MATLAB to implement (linear) least-squares regression (with and without regularization). As before, the input to your code is a training data set (X, y), where X is an m × d matrix of training examples and y is now an m-dimensional vector of real-valued labels associated with the instances in X. You are provided with two synthetic data sets for this problem, each split into training and test sets.

   (a) Data Set 1 (1-dimensional). Run (unregularized) least-squares regression on the training set provided and measure the mean squared error of the learnt linear function on both the training and test examples (you can use the provided code mean_squared_error.m to compute the errors).
   In a single figure, draw a plot of the learnt linear function (input instance on the x-axis and the predicted value on the y-axis), along with a scatter plot depicting the true label associated with each test instance. (See folder Problem-9\Data-set-1 for the relevant files.)

   (b) Data Set 2 (250-dimensional). Repeat step (a) of Problem 6 on this data set with (unregularized) least-squares regression, and step (b) of Problem 6 with L2-regularized least-squares regression. In both cases, use the mean squared error as the performance measure. For the regularized least-squares regression, use the range {10^−5, 10^−4, 10^−3, 10^−2, 10^−1, 1, 10, 100} for selecting the regularization parameter λ. (See folder Problem-9\Data-set-2 for the relevant files.)

   Hints: Do not forget to include a bias (threshold) term b in the model. This can be done as follows: add an extra column of 1’s to (the right of) the training data matrix X to create an augmented matrix X̃ of dimension m × (d + 1); then the solution to the unregularized least-squares regression (with the bias term included) is given by

      [ŵ; b̂] = (X̃ᵀ X̃)⁻¹ X̃ᵀ y;

   in the case of L2-regularized least-squares regression (with the bias term included), the solution takes the form

      [ŵ; b̂] = (X̃ᵀ X̃ + λmR)⁻¹ X̃ᵀ y,

   where R is the d × d identity matrix augmented with an extra column of 0’s to the right and an extra row of 0’s at the bottom (i.e., R is a (d + 1) × (d + 1) matrix with R_ij = 1 if i = j < d + 1 and 0 otherwise). (If the matrix X̃ᵀ X̃ is not invertible, you can take the pseudoinverse of the matrix instead of the usual matrix inverse; the MATLAB function pinv can be used for pseudoinverse computation.)

10. Reading Exercise. We have so far used the 0-1 classification error as an intuitive measure of the goodness of a binary classifier.
    It is, however, important to note that in many real-world applications there are several other popular performance measures that are better suited to evaluating a binary classifier. For example, in a medical diagnostic application such as cancer diagnosis, a misprediction on a patient with cancer (positive instance) is often more critical than a misprediction on a patient without cancer (negative instance); hence, performance measures based on the sensitivity and specificity of a classifier, which account for mispredictions on positive and negative instances separately, are popular in such domains. Similarly, in a document retrieval task, the number of relevant (positive) documents is often far smaller than the number of irrelevant (negative) documents, so a default classifier that predicts a negative label on all instances will yield a very low 0-1 error; in such class-imbalanced settings, performance measures based on the precision and recall of a classifier are considered better suited than the 0-1 error. Read the following material, which contains details of some of these performance measures used in different application areas.

    (a) D.G. Altman and J.M. Bland. Diagnostic tests 1: Sensitivity and specificity. British Medical Journal, 308(6943):1552, 1994. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2540489/pdf/bmj00444-0038.pdf.

    (b) C.D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. Section 8.3. http://nlp.stanford.edu/IR-book/pdf/08eval.pdf.

    (c) H. He and E.A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009. Section 4.1.
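    As a concrete companion to the readings, the measures above can all be computed from a classifier's predicted labels; the following small Python sketch (function names are illustrative) counts the four confusion-matrix entries and derives the rates from them:

```python
def _ratio(num, den):
    """Safe ratio: returns 0.0 when the denominator is 0."""
    return num / den if den else 0.0

def binary_metrics(y_true, y_pred):
    """Compute common binary-classification measures from true and
    predicted labels in {-1, +1}.  Recall coincides with sensitivity."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == +1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == +1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == -1)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate (= recall)
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "precision":   _ratio(tp, tp + fp),  # correct fraction of predicted positives
        "recall":      _ratio(tp, tp + fn),
        "0-1 error":   _ratio(fp + fn, len(y_true)),
    }
```

    On a class-imbalanced sample this makes the document's point explicit: a classifier that always predicts −1 can achieve a low 0-1 error while its sensitivity (and recall) is exactly 0.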