
CSC321 Lecture 2
A Simple Learning Algorithm: Linear Regression
Roger Grosse and Nitish Srivastava
January 7, 2015
Outline
In this lecture we will:
- See a simple example of a machine learning model, linear regression.
- Learn how to formulate a supervised learning problem.
- Learn how to train the model.

It's not a neural net algorithm, but it will provide a lot of useful intuition for algorithms we will cover in this course.
A Machine Learning Problem
Suppose we are given some data about basketball players:

Height in feet | Avg Points Scored Per Game
6.8            | 9.2
6.3            | 11.7
6.4            | 15.8
6.2            | 8.6
...            | ...

[Figure: scatter plot of avg points scored per game vs. height in feet]
What is the predicted number of points scored by a new player who is 6.5 feet tall?
Formulate as a Supervised Learning Problem

We are given labelled examples (the training set):
- Inputs: \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\} - height in feet.
- Targets: \{t^{(1)}, t^{(2)}, \ldots, t^{(N)}\} - avg points scored per game.

Choose a model ≡ Make an assumption about the data's behaviour. Let's say we choose y = wx + b.

We call w the weight and b the bias. These are the trainable parameters.

Learning: extract knowledge from the data to learn the model, i.e. map the training inputs \{x^{(1)}, \ldots, x^{(N)}\} and targets \{t^{(1)}, \ldots, t^{(N)}\} to the learned parameters w, b.

Inference: given a new x and the learned parameters w, b, make a prediction y.
Learning

Design an objective function (or loss function) that is:
- Minimized when the model does what you want it to do.
- Easy to minimize (smooth, well-behaved).

Here we want y = wx + b to be close to t, for every training case. Therefore one choice could be

L(w, b) = \frac{1}{2} \sum_{i=1}^{N} \left( wx^{(i)} + b - t^{(i)} \right)^2

This is called squared loss. We need to find w, b such that L(w, b) is minimized.
Learning

L(w, b) = \frac{1}{2} \sum_i \left( wx^{(i)} + b - t^{(i)} \right)^2

\frac{\partial L}{\partial w} = \sum_i \left( wx^{(i)} + b - t^{(i)} \right) x^{(i)}

\frac{\partial L}{\partial b} = \sum_i \left( wx^{(i)} + b - t^{(i)} \right)

Since L is a nonnegative quadratic function in w and b, any critical point is a minimum. Therefore, we minimize L by setting

\frac{\partial L}{\partial w} = 0, \qquad \frac{\partial L}{\partial b} = 0.
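As a concrete illustration (not from the original slides), here is a minimal NumPy sketch that evaluates L(w, b) and the two partial derivatives above on a small made-up training set; the data values are assumptions purely for demonstration.

```python
import numpy as np

# Toy training data (hypothetical values, for illustration only).
x = np.array([6.8, 6.3, 6.4, 6.2])    # heights in feet
t = np.array([9.2, 11.7, 15.8, 8.6])  # avg points scored per game

def loss_and_grads(w, b):
    """Squared loss L(w, b) and its partial derivatives dL/dw and dL/db."""
    residual = w * x + b - t             # (w x^(i) + b - t^(i)) for every training case i
    L = 0.5 * np.sum(residual ** 2)      # L(w, b) = 1/2 * sum_i residual_i^2
    dL_dw = np.sum(residual * x)         # sum_i (w x^(i) + b - t^(i)) x^(i)
    dL_db = np.sum(residual)             # sum_i (w x^(i) + b - t^(i))
    return L, dL_dw, dL_db

print(loss_and_grads(w=0.0, b=0.0))
```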
Learning

Setting the partial derivatives to zero gives

w \left( \sum_i x^{(i)} x^{(i)} \right) + b \left( \sum_i x^{(i)} \right) - \sum_i t^{(i)} x^{(i)} = 0

w \left( \sum_i x^{(i)} \right) + bN - \sum_i t^{(i)} = 0

Now we have 2 linear equations and 2 unknowns w and b. Solve!
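For concreteness, a minimal sketch (assumed, not from the slides) that forms this 2x2 linear system in NumPy and solves it; the data values and the 6.5-foot query are hypothetical.

```python
import numpy as np

# Toy training data (hypothetical values, for illustration only).
x = np.array([6.8, 6.3, 6.4, 6.2])    # heights in feet
t = np.array([9.2, 11.7, 15.8, 8.6])  # avg points scored per game
N = len(x)

# The two linear equations above, written as A @ [w, b] = c.
A = np.array([[np.sum(x * x), np.sum(x)],
              [np.sum(x),     N       ]])
c = np.array([np.sum(t * x), np.sum(t)])

w, b = np.linalg.solve(A, c)
print(f"w = {w:.3f}, b = {b:.3f}")

# Inference: predicted avg points for a new 6.5-foot player.
print(f"prediction for x = 6.5: {w * 6.5 + b:.2f}")
```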
Inference
[Figure: the fitted line y = wx + b overlaid on the scatter plot of avg points scored per game vs. height in feet]

To make a prediction about a new player, just use y = wx + b.
Multi-variable Linear Regression

Multi-variable: instead of x ∈ R, we have x = (x_1, x_2, \ldots, x_M) ∈ R^M.

For example,

Height in feet (x_1) | Weight in pounds (x_2) | Avg Points Scored Per Game (t)
6.8                  | 225                    | 9.2
6.3                  | 180                    | 11.7
6.4                  | 190                    | 15.8
6.2                  | 180                    | 8.6
...                  | ...                    | ...

Choose a model y = w_1 x_1 + w_2 x_2 + \ldots + w_M x_M + b = w^\top x + b

Parameters to be learned: w, b.

Objective function

L(w, b) = \sum_{i=1}^{N} \left( w^\top x^{(i)} + b - t^{(i)} \right)^2
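A small sketch (assumed, for illustration) of evaluating this objective with NumPy; each row of X holds one player's features [height, weight], and the data values are made up.

```python
import numpy as np

# Toy multi-variable training data (hypothetical values, for illustration only).
X = np.array([[6.8, 225.],
              [6.3, 180.],
              [6.4, 190.],
              [6.2, 180.]])             # rows are players: [height in feet, weight in pounds]
t = np.array([9.2, 11.7, 15.8, 8.6])    # avg points scored per game

def objective(w, b):
    """L(w, b) = sum_i (w^T x^(i) + b - t^(i))^2 for the linear model y = w^T x + b."""
    y = X @ w + b                        # predictions for all training cases at once
    return np.sum((y - t) ** 2)

print(objective(w=np.zeros(2), b=0.0))
```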
Multi-variable Linear Regression

We can use more general basis functions (also called "features"):

y = w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots + w_M \phi_M(x) = w^\top \Phi(x)

Parameters to be learned: w.

For example, 1-D polynomial fitting:

\phi_0(x) = 1, \quad \phi_1(x) = x, \quad \phi_2(x) = x^2, \quad \phi_3(x) = x^3, \quad \ldots, \quad \phi_M(x) = x^M

y = \underbrace{w_0 \phi_0(x)}_{\text{bias}} + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots + w_M \phi_M(x) = w^\top \Phi(x)

Note: linear regression means linear in the parameters w, not linear in x.
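As an aside (assumed code, not from the slides), one way to build the polynomial feature vector Φ(x) and evaluate y = w^T Φ(x) in NumPy:

```python
import numpy as np

def poly_features(x, M):
    """Polynomial basis Phi(x) = [1, x, x^2, ..., x^M] for a scalar input x."""
    return np.array([x ** m for m in range(M + 1)])

# Hypothetical weights for M = 3; the first entry w_0 multiplies phi_0(x) = 1 and acts as the bias.
w = np.array([0.5, -1.0, 2.0, 0.3])
x_new = 0.7
y = w @ poly_features(x_new, M=3)    # y = w^T Phi(x)
print(y)
```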
Learning Multi-variable Linear Regression

Feature matrix and vector of predictions:

\Phi = \begin{pmatrix} \Phi(x^{(1)})^\top \\ \Phi(x^{(2)})^\top \\ \vdots \\ \Phi(x^{(N)})^\top \end{pmatrix}, \qquad
\Phi w = \begin{pmatrix} w^\top \Phi(x^{(1)}) \\ w^\top \Phi(x^{(2)}) \\ \vdots \\ w^\top \Phi(x^{(N)}) \end{pmatrix}

Objective function

L(w) = \frac{1}{2} \sum_i \left( w^\top \Phi(x^{(i)}) - t^{(i)} \right)^2 = \frac{1}{2} \| \Phi w - t \|^2
Learning Multi-variable Linear Regression

The optimum occurs where

\nabla_w L(w) = \Phi^\top (\Phi w - t) = 0

Therefore,

\Phi^\top \Phi w - \Phi^\top t = 0

w = \left( \Phi^\top \Phi \right)^{-1} \Phi^\top t

Question: When will \Phi^\top \Phi be invertible?
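A minimal NumPy sketch of this closed-form solution, assuming a 1-D polynomial basis and made-up data (both are assumptions for illustration). Note that np.linalg.solve is applied to the normal equations rather than forming the inverse explicitly; this computes the same w but is numerically more stable.

```python
import numpy as np

# Hypothetical 1-D inputs and targets, for illustration only.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
t = np.array([0.1, 0.9, 0.2, -0.8, -0.1])

M = 3
Phi = np.vander(x, M + 1, increasing=True)   # row i is Phi(x^(i)) = [1, x, x^2, x^3]

# Normal equations: (Phi^T Phi) w = Phi^T t
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w)

# Fitted values on the training inputs.
print(Phi @ w)
```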
Fitting polynomials
[Figure: training data, t vs. x, on the interval [0, 1]. From Pattern Recognition and Machine Learning, Christopher Bishop.]
Fitting polynomials
y = w_0 \qquad (M = 0)

[Figure: constant fit to the data. From Pattern Recognition and Machine Learning, Christopher Bishop.]
Fitting polynomials
y = w_0 + w_1 x \qquad (M = 1)

[Figure: straight-line fit to the data. From Pattern Recognition and Machine Learning, Christopher Bishop.]
Fitting polynomials
y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 \qquad (M = 3)

[Figure: cubic fit to the data. From Pattern Recognition and Machine Learning, Christopher Bishop.]
Fitting polynomials
y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \ldots + w_9 x^9 \qquad (M = 9)

[Figure: degree-9 fit to the data. From Pattern Recognition and Machine Learning, Christopher Bishop.]
Model selection
Underfitting: the model is too simple - it does not fit the data.

[Figure: M = 0 fit, underfitting]

Overfitting: the model is too complex - it fits the training data perfectly, but does not generalize.

[Figure: M = 9 fit, overfitting]
Model selection
Need to select a model which is neither too simple, nor too complex.

[Figure: M = 3 fit]

Later in this course, we will talk more about controlling model complexity.
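To make the idea concrete, here is an assumed sketch (not from the slides) that fits polynomials of several degrees to noisy synthetic data and compares training error with error on a held-out set; the degree with the lowest held-out error is the one that is neither too simple nor too complex.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy sine curve, split into a training set and a held-out set.
x = rng.uniform(0, 1, size=20)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(20)
x_tr, t_tr = x[:10], t[:10]
x_ho, t_ho = x[10:], t[10:]

def fit(x, t, M):
    """Least-squares fit of a degree-M polynomial."""
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.lstsq(Phi, t, rcond=None)[0]

def mse(w, x, t):
    """Mean squared error of the degree-(len(w)-1) polynomial with weights w."""
    return np.mean((np.vander(x, len(w), increasing=True) @ w - t) ** 2)

for M in (0, 1, 3, 9):
    w = fit(x_tr, t_tr, M)
    print(f"M={M}: train MSE={mse(w, x_tr, t_tr):.3f}  held-out MSE={mse(w, x_ho, t_ho):.3f}")
```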
Next class
Another machine learning model, an early neural net: the Perceptron.

[Figure: Frank Rosenblatt with the image sensor (left) of the Mark I Perceptron]

Reminder - do the quizzes for video lectures A and B by 11:59pm next Monday.