k-nearest neighbors regression
Niina Haataja
University of Tampere
30.4.2015

Introduction
● The k-nearest neighbors algorithm (k-NN) is used for both classification and regression.
● It is a simple non-parametric method and therefore easy to use without checking any pre-assumptions about the variables.
● The independent variables (predictors) and the dependent variable (response) can be either continuous or categorical.

Method in a nutshell
● The predicted response value is the average of the response values of the k nearest neighbors.
● The neighborhood is calculated from the predictors.
● k is a parameter chosen by the statistician; the best choice depends on the data.
  – Larger values of k reduce the effect of noise on the prediction but, in the classification setting, make the boundaries between classes less distinct.

Finding the neighbors
● There are several ways to calculate the distances.
● For categorical variables you can use, for example, the Hamming distance.
● For continuous variables there are, for example, the Euclidean, Manhattan, Minkowski and Mahalanobis distances.
● Let's have a closer look at the Euclidean distance...

Euclidean distance
● Euclidean distance is the "ordinary" (i.e. straight-line) distance between two points in Euclidean space: d(p, q) = sqrt((p1 − q1)² + ... + (pn − qn)²).

The possibility of weighting
● Since k-NN predictions are based on the intuitive assumption that objects close in distance are potentially similar, it makes good sense to discriminate between the k nearest neighbors when making predictions.
● Nearer neighbors then contribute more to the average (= the predicted value) than the more distant ones.
● For a simple example, give each neighbor a weight of 1/d, where d is the distance to the neighbor.

Example
● Let's consider a completely made-up data set of maths exam score, physics exam score and activity percent (how much homework the student has done and how often he/she is present in class).
● Then we try to predict the maths exam score for a student who is 60 percent active and got 140 points from the physics exam.

Example data set
Maths score   Physics score   Activity percent
150           160             85
103           101             30
168           154             72
129           140             56
160           180             80
189           164             65
106           112             25
175           160             62
149           158             70
138           107             28
166           180             86
140           125             52
112           150             49
161           134             31

Example part A
● Let's try to predict the maths score first using just the physics score, and after that add the activity percent as a predictor as well.
● The distance is calculated as the Euclidean distance.

Distance (physics score only, query = 140)
Maths score   Physics score   Distance
150           160             sqrt((160-140)^2) = 20
103           101             sqrt((101-140)^2) = 39
168           154             14
129           140             0
160           180             40
189           164             24
106           112             28
175           160             20
149           158             18
138           107             33
166           180             40
140           125             15
112           150             10
161           134             6

Predicting
● When k = 1, we predict the maths score with just the closest neighbor; the predicted value would be 129.
● When k = 3, we predict the maths score with the average of the three closest neighbors:
  – (129 + 161 + 112) / 3 = 134

Example part B
● Now we also have the activity percent as a predictor, and we need to calculate new distances (a computational sketch follows the table below).

Distance (physics score and activity percent, query = 140 and 60)
Maths score   Physics score   Activity percent   Distance
150           160             85                 sqrt((160-140)^2 + (85-60)^2) = 32.0
103           101             30                 sqrt((101-140)^2 + (30-60)^2) = 49.2
168           154             72                 18.4
129           140             56                 4.0
160           180             80                 44.7
189           164             65                 24.5
106           112             25                 44.8
175           160             62                 20.1
149           158             70                 20.6
138           107             28                 46.0
166           180             86                 47.7
140           125             52                 17.0
112           150             49                 14.9
161           134             31                 29.6
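The original slides contain no code, so the following is only a minimal Python sketch of the procedure just described: it stores the made-up data, computes the Euclidean distances for the part B query (physics 140, activity 60), and averages the k nearest maths scores. The function and variable names (euclidean, knn_predict, data, query) are my own illustrative choices. Running it reproduces the distances in the table above and the predictions given on the next slide.

from math import sqrt

# Made-up data set from the slides: (maths score, physics score, activity percent).
data = [
    (150, 160, 85), (103, 101, 30), (168, 154, 72), (129, 140, 56),
    (160, 180, 80), (189, 164, 65), (106, 112, 25), (175, 160, 62),
    (149, 158, 70), (138, 107, 28), (166, 180, 86), (140, 125, 52),
    (112, 150, 49), (161, 134, 31),
]

def euclidean(a, b):
    """Euclidean distance between two equal-length predictor vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, rows, k):
    """Average the maths scores of the k rows whose predictors are closest to the query."""
    # Each row is (response, *predictors); here the response is the maths score.
    ranked = sorted(rows, key=lambda row: euclidean(row[1:], query))
    nearest = ranked[:k]
    return sum(row[0] for row in nearest) / k

# Part B query: physics score 140, activity percent 60.
query = (140, 60)
for k in (1, 3, 5):
    print(k, round(knn_predict(query, data, k)))

With k = 1, 3 and 5 this prints 129, 127 and 145, matching the values worked out by hand below.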
Predicting
● When k = 1, we predict the maths score with just the closest neighbor; the predicted value would be 129 (same as before).
● When k = 3, we predict the maths score with the average of the three closest neighbors:
  – (129 + 112 + 140) / 3 = 127
● When k = 5, we use the five closest neighbors:
  – (129 + 112 + 140 + 168 + 175) / 5 ≈ 145

Exercise for you
● Predict the maths score for a student who is 75 percent active and got 180 points from the physics exam.
● Use k = 1, k = 3 and k = 5.
● What do you think about the value of k; what makes a good prediction?
● When would you use weights on the predictors?

Sources
● k-NN has been used for decades, so it's easy to find more information.
● For example, have a look at:
  – Altman, N. S. (1992). "An introduction to kernel and nearest-neighbor nonparametric regression". The American Statistician 46 (3): 175–185.

Thank you!
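As a small appendix: the weighting slide suggests giving each neighbor a weight of 1/d but does not work it out, so here is a hedged sketch of one standard way to turn those weights into a prediction, a weighted average of the k nearest responses. The function name weighted_knn_predict and the handling of an exact match (distance 0) are my own choices, not from the slides. It is applied to the part B query (140, 60) rather than the exercise query, so the exercise is left to the reader.

from math import sqrt

# Same made-up data as before: (maths score, physics score, activity percent).
data = [
    (150, 160, 85), (103, 101, 30), (168, 154, 72), (129, 140, 56),
    (160, 180, 80), (189, 164, 65), (106, 112, 25), (175, 160, 62),
    (149, 158, 70), (138, 107, 28), (166, 180, 86), (140, 125, 52),
    (112, 150, 49), (161, 134, 31),
]

def weighted_knn_predict(query, rows, k):
    """Distance-weighted k-NN: each of the k nearest neighbors contributes
    with weight 1/d, so closer neighbors pull the estimate more strongly."""
    # Rank rows by Euclidean distance of their predictors (row[1:]) to the query.
    ranked = sorted(
        (sqrt(sum((p - q) ** 2 for p, q in zip(row[1:], query))), row[0])
        for row in rows
    )
    nearest = ranked[:k]
    # A neighbor at distance 0 is an exact match; return its response outright
    # instead of dividing by zero.
    for d, response in nearest:
        if d == 0:
            return response
    weights = [1 / d for d, _ in nearest]
    return sum(w * resp for w, (_, resp) in zip(weights, nearest)) / sum(weights)

# Part B query again: physics 140, activity 60, with k = 3.
print(round(weighted_knn_predict((140, 60), data, 3), 1))

On these data the weighted k = 3 estimate comes out around 127.7, slightly above the unweighted 127, because the nearest neighbor (maths score 129, distance 4.0) gets by far the largest weight.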