k-nearest neighbors regression
Niina Haataja
University of Tampere
30.4.2015
Introduction
● The k-nearest neighbors algorithm (k-NN) is used for both classification and regression.
● It's a simple non-parametric method and therefore easy to use without checking any distributional assumptions about the variables.
● The independent variables (predictors) and the dependent variable (response) can be either continuous or categorical.
Method in a nutshell
● The predicted response value is the average of the response values of its k nearest neighbors (see the sketch below).
● The neighborhood is calculated from the predictors.
● k is a parameter chosen by the statistician; the best choice depends on the data.
– Larger values of k reduce the effect of noise on the prediction but make the fit smoother and less local (in classification, boundaries between classes become less distinct).
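As a rough illustration, here is a minimal k-NN regression sketch in Python. It is not from the original slides; the function name and signature are my own, and it assumes continuous predictors with Euclidean distance.

```python
import math

def knn_regress(train_x, train_y, query, k):
    """Predict a response as the average of the k nearest neighbors.

    train_x: list of predictor tuples, train_y: list of responses,
    query: predictor tuple to predict for, k: number of neighbors.
    """
    # Pair every training point with its Euclidean distance to the query
    # (math.dist requires Python 3.8+), nearest first.
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    # Average the responses of the k nearest points.
    nearest = dists[:k]
    return sum(y for _, y in nearest) / k
```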
Finding the neighbors
● There are several ways to calculate the distances.
● For categorical variables you can use, for example, the Hamming distance.
● For continuous variables there are, for example, the Euclidean, Manhattan, Minkowski and Mahalanobis distances.
● Let's have a closer look at the Euclidean distance...
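For the categorical case, a quick gloss of the Hamming distance (my own illustration, not from the slides): it simply counts the positions where two equal-length vectors differ.

```python
def hamming(a, b):
    """Count the positions where two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(u != v for u, v in zip(a, b))

print(hamming(["red", "small", "round"], ["red", "large", "round"]))  # 1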
Euclidean distance
● Euclidean distance is the "ordinary" (i.e. straight-line) distance between two points in Euclidean space.
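The slide does not reproduce the formula itself, so as a gloss in the same notation as the distance tables below: for points x = (x1, ..., xn) and y = (y1, ..., yn),

d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)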
The possibility of weighting
● Since k-NN predictions are based on the intuitive assumption that objects close in distance are potentially similar, it makes good sense to distinguish between the k nearest neighbors when making predictions.
● Nearer neighbors contribute more to the average (= the predicted value) than the more distant ones.
● For a simple example, give each neighbor a weight of 1/d, where d is the distance to the neighbor (see the sketch below).
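A minimal sketch of this 1/d weighting in Python, again my own illustration rather than the slides' code; the `eps` parameter is an assumption I added so that a neighbor at distance exactly 0 does not cause division by zero.

```python
import math

def weighted_knn_regress(train_x, train_y, query, k, eps=1e-9):
    """k-NN regression where each neighbor is weighted by 1/d."""
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = dists[:k]
    # eps guards against division by zero when a neighbor coincides
    # with the query point (d = 0).
    weights = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(w * y for w, (_, y) in zip(weights, nearest))
    return total / sum(weights)
```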
Example
● Let's consider a completely made-up data set of maths exam scores, physics exam scores and activity percentages (how much homework the student has done and how often he/she is present in class).
● Then we try to predict the maths exam score for a student who is 60 percent active and got 140 points on the physics exam.
Example data set
Maths score    Physics score    Activity percent
150            160              85
103            101              30
168            154              72
129            140              56
160            180              80
189            164              65
106            112              25
175            160              62
149            158              70
138            107              28
166            180              86
140            125              52
112            150              49
161            134              31
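For convenience, here is the same data as a Python list (my own transcription of the table above); the verification snippets further below reuse this `data` list together with `knn_regress` from earlier.

```python
# Rows of the table above: (maths score, physics score, activity percent)
data = [
    (150, 160, 85), (103, 101, 30), (168, 154, 72), (129, 140, 56),
    (160, 180, 80), (189, 164, 65), (106, 112, 25), (175, 160, 62),
    (149, 158, 70), (138, 107, 28), (166, 180, 86), (140, 125, 52),
    (112, 150, 49), (161, 134, 31),
]
```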
Example part A
● Let's first try to predict the maths score by the physics score alone, and after that add the activity percent as a predictor as well.
● The distance D is calculated as the Euclidean distance.
Distance
Maths score    Physics score    Distance
150            160              sqrt((160-140)^2) = 20
103            101              sqrt((101-140)^2) = 39
168            154              14
129            140              0
160            180              40
189            164              24
106            112              28
175            160              20
149            158              18
138            107              33
166            180              40
140            125              15
112            150              10
161            134              6
Predicting
● When k = 1, we predict the maths score with just the closest neighbor, so the predicted value would be 129.
● When k = 3, we predict the maths score with the average of the three closest neighbors (checked in the snippet below):
– (129 + 161 + 112) / 3 = 134
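A quick check of these numbers in Python, reusing the `data` list and the `knn_regress` sketch from earlier (my own verification code, not part of the slides):

```python
# Physics score as the only predictor.
train_x = [(physics,) for _, physics, _ in data]
train_y = [maths for maths, _, _ in data]

print(knn_regress(train_x, train_y, (140,), k=1))  # -> 129.0
print(knn_regress(train_x, train_y, (140,), k=3))  # -> 134.0
```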
Example part B
● Now we also have the activity percent as a predictor, and we need to calculate new distances.
Distance
Maths score    Physics score    Activity percent    Distance
150            160              85                  sqrt((160-140)^2 + (85-60)^2) = 32.0
103            101              30                  sqrt((101-140)^2 + (30-60)^2) = 49.2
168            154              72                  18.4
129            140              56                  4.0
160            180              80                  44.7
189            164              65                  24.5
106            112              25                  44.8
175            160              62                  20.1
149            158              70                  20.6
138            107              28                  46.0
166            180              86                  47.7
140            125              52                  17.0
112            150              49                  14.9
161            134              31                  29.6
Predicting
● When k = 1, we predict the maths score with just the closest neighbor, so the predicted value would be 129 (the same as before).
● When k = 3, we predict the maths score with the average of the three closest neighbors:
– (129 + 112 + 140) / 3 = 127
● When k = 5, we use the five closest neighbors (verified in the snippet below):
– (129 + 112 + 140 + 168 + 175) / 5 = 144.8 ≈ 145
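And a check of part B with both predictors, again my own code reusing `data` and `knn_regress` from above:

```python
# Physics score and activity percent as predictors.
train_x = [(physics, activity) for _, physics, activity in data]
train_y = [maths for maths, _, _ in data]

for k in (1, 3, 5):
    print(k, knn_regress(train_x, train_y, (140, 60), k=k))
# 1 -> 129.0, 3 -> 127.0, 5 -> 144.8
```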
Exercise for you
● Predict the maths score for a student who is 75 percent active and got 180 points on the physics exam.
● Use k = 1, k = 3 and k = 5.
● What do you think about the value of k; what makes a good prediction?
● When would you use weights on the predictors?
Sources
● k-NN has been used for decades, so it's easy to find more information.
● For example, have a look at:
– Altman, N. S. (1992). "An introduction to kernel and nearest-neighbor nonparametric regression". The American Statistician 46 (3): 175–185.
Thank you!