STAB57H3: Introduction to Statistics
Winter, 2015
Instructor: Jabed Tomal
Department of Computer and Mathematical Sciences
University of Toronto Scarborough
Toronto, ON
Canada
March 18, 2015
Jabed Tomal (U of T)
Statistics
March 18, 2015
1 / 31
Relationships Among Variables:
In science, biological science, social science, and business,
researchers are interested in knowing the relationships among
variables.
Relationships Among Variables:
Some examples are:
1. In business, it might be important to know the relationship
   between sales of a product and the amount of advertising
   expenditure.
2. A company manager might be interested in knowing the
   relationship between an employee's performance on a job and the
   employee's aptitude test score.
3. In environmental physics, a researcher might be interested in
   predicting global temperature using the amount of carbon dioxide
   released into the atmosphere.
4. Goal: predicting the length of hospital stay of a surgical patient.
   Our interest might be in knowing the relationship between the
   length of stay in the hospital and the severity of the operation.
Relationships Among Variables:
Example: Grade Point Average
The director of admissions of a small college selected 120 students at
random from the new freshman class in a study to determine whether
a student’s grade point average (GPA) at the end of the freshman year
(Y ) can be predicted from the ACT test score (X ). The results of the
study follow.
  i       1      2      3    ···    118    119    120
  Xi     21     14     28    ···     28     16     28
  Yi  3.897  3.885  3.778    ···  3.914  1.860  2.948
1. Does a relationship exist between the two variables, grade point
   average and ACT test score?
2. If a relationship exists between the two variables, can grade point
   average be predicted using ACT test score?
Relationships Among Variables:
Example: Property Assessments
The data that follow show assessed value for property tax purposes
(X1 , in thousand dollars) and sales price (X2 , in thousand dollars) for a
sample of 15 parcels of land for industrial development sold recently in
“arm’s length” transactions in a tax district.
  i       1     2     3    ···    13    14    15
  X1i  13.9  16.0  10.3    ···  14.9  12.9  15.8
  X2i  28.6  34.7  21.0    ···  35.1  30.0  36.2
1. Are the two variables associated with each other?
2. What is the strength of association? Weak, moderate, or strong?
Relationships Among Variables:
Two primary goals of analyzing relationships among variables are:
1. to identify whether or not a relationship exists among the
   variables and, if so, how strong it is (weak, moderately weak,
   moderately strong, or strong) and whether the variables are
   negatively or positively associated;
2. to identify the form of the relationship (linear or non-linear).
Relationships Among Variables:
Notation:
1. Let Π be a population of interest. (Example: Students enrolled in
   STAB57.)
2. Let X(π) be a measurement taken on subject π ∈ Π. (Example: Time
   spent (in hours) per week studying course materials by a student
   enrolled in STAB57.)
3. Let Y(π) be another measurement taken on subject π ∈ Π. (Example:
   Midterm grade of a particular student enrolled in STAB57.)
Relationships Among Variables:
1. A linear relationship between two variables X and Y can be
   expressed as
      Y = α + βX,
   where α and β are real constants. Here, β = 0 indicates no linear
   relationship between X and Y.
2. A non-linear relationship between two variables X and Y can be
   expressed as, for example,
      Y = α + β exp(X),
   where α and β are real constants. Again, β = 0 indicates no
   relationship of this form between X and Y.
Relationships Among Variables:
The Definition of Relationship:
1. Suppose we observe a set of values of Y for a particular value
   X = x. (Example: Students who studied one hour per week will have
   a set of midterm marks.)
2. Hence, we can think of a distribution of Y conditional on X = x.
3. Two variables X and Y are related if the conditional distribution
   of Y given X = x changes as x changes. (Example: As the time spent
   per week studying course materials changes from 1 hour to 2 hours,
   the average midterm mark changes from 35 to 55.)
Relationships Among Variables:
The Strength of Relationship:
1. If we see large changes in the conditional distribution of Y given
   X = x as x changes, then we say a strong relationship exists.
2. If we see small changes in the conditional distribution of Y given
   X = x as x changes, then we say a weak relationship exists.
Relationships Among Variables:
Exercise 10.1.1 Prove that discrete random variables X and Y are
unrelated if and only if X and Y are independent.
Relationships Among Variables:
The Role of Statistical Models:
1. The relationship between two variables is completely described by
   the set of conditional distributions of Y given X.
2. Example: Suppose Y is global temperature and X is the amount of
   carbon dioxide released into the atmosphere. Then, perhaps, the
   conditional distribution of Y given X = x can be expressed as
      Y | X = x ∼ N(µ(x), σ²),   µ(x) = α + βx,
   where the conditional mean of Y changes with x, i.e.,
   µ(x) = E(Y | X = x) = α + βx. The conditional variance of Y is
   fixed and does not depend on x, i.e., var(Y | X = x) = σ².
3. Here, α and β are called the intercept and slope parameters,
   respectively.
4. Assuming that the statistical model is correct, the two variables Y
   and X are unrelated if and only if β = 0.
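The model above can be illustrated with a minimal simulation sketch. The parameter values (α = 1, β = 2, σ = 0.5), the range of x, and the function names are assumptions chosen for illustration, not values from the lecture. Least squares is used here only as a convenient way to recover the intercept and slope from simulated data; it is not introduced in this section.

```python
import random

def simulate(alpha, beta, sigma, xs, rng):
    """Draw one Y for each x from N(alpha + beta * x, sigma^2)."""
    return [rng.gauss(alpha + beta * x, sigma) for x in xs]

def least_squares(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

rng = random.Random(57)                       # fixed seed for reproducibility
xs = [rng.uniform(0, 10) for _ in range(5000)]  # hypothetical predictor values
ys = simulate(alpha=1.0, beta=2.0, sigma=0.5, xs=xs, rng=rng)

a_hat, b_hat = least_squares(xs, ys)
print(f"estimated intercept {a_hat:.2f}, slope {b_hat:.2f}")
```

With β set to 0 in the call to `simulate`, the estimated slope should be close to zero, which matches the statement that Y and X are unrelated under this model exactly when β = 0.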
Relationships Among Variables:
Response and Predictor Variables:
1. If we expect a change in the variable Y for a change in the
   variable X, then we say Y depends on X. Hence, Y and X are
   called the dependent and independent variables, respectively.
   Assuming that the relationship is unidirectional, the variables Y
   and X are called the response and predictor variables, respectively.
2. Example: The midterm mark (Y) and the time (X) spent per week
   studying the course materials can be termed the response and
   predictor variables, respectively.
Relationships Among Variables:
The Role of Statistical Models:
1. We might have more than one predictor variable corresponding to a
   response variable. Such a relationship can be described by a
   statistical model as follows.
2. Let Y be the response and X1, X2, ···, Xk be k predictor variables.
   The statistical model is
      Y | X1 = x1, ···, Xk = xk ∼ N(µ(x), σ²),
      µ(x) = β0 + β1 x1 + ··· + βk xk,
   where x = (x1, ···, xk) and the conditional mean of Y changes with
   x, i.e., µ(x) = E(Y | X = x) = β0 + β1 x1 + ··· + βk xk. The
   conditional variance of Y is fixed and does not depend on x, i.e.,
   var(Y | X = x) = σ².
3. Assuming that the statistical model is correct, Y and the
   predictors X1, ···, Xk are unrelated if and only if
   β1 = ··· = βk = 0.
Relationships Among Variables:
Example (one response and more than one predictor): A hospital
administrator wished to study the relation between patient satisfaction
(Y ) and patient’s age (X1 , in years), severity of illness (X2 , an index),
and anxiety level (X3 , an index). The administrator randomly selected
46 patients and collected the data presented below, where larger
values of Y , X2 , and X3 are, respectively, associated with more
satisfaction, increased severity of illness, and more anxiety.
  i      1     2     3    ···    44    45    46
  X1i   50    36    40    ···    45    37    28
  X2i   51    46    48    ···    51    53    46
  X3i  2.3   2.3   2.2    ···   2.2   2.1   1.8
  Yi    48    57    66    ···    68    59    92
Relationships Among Variables:
Regression Models:
1. Let Y be the response and X1, X2, ···, Xk be k predictor variables.
   The regression assumption specifies that the relationship between
   the response and the predictors is expressed through the
   conditional distribution of Y given X1, X2, ···, Xk:
      Y | X1, ···, Xk ∼ N(β0 + β1 X1 + ··· + βk Xk, σ²),
   that is, E(Y | X) = β0 + β1 X1 + ··· + βk Xk.
2. This model can be re-expressed as
      Y = β0 + β1 X1 + ··· + βk Xk + Z,
   where Z ∼ N(0, σ²) is independent of X1, ···, Xk.
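The re-expressed form can be checked with a short sketch. The coefficients, σ, and the predictor ranges below are hypothetical values for k = 2 predictors, chosen only for illustration. The code forms Y as the conditional mean plus a normal error Z and then confirms that the errors Y − E(Y | X) behave like draws from N(0, σ²).

```python
import random
import statistics

# Hypothetical coefficients for k = 2 predictors (not from the lecture).
BETA = [3.0, 1.5, -2.0]   # beta_0, beta_1, beta_2
SIGMA = 2.0

def conditional_mean(x):
    """E(Y | X = x) = beta_0 + beta_1 * x_1 + ... + beta_k * x_k."""
    return BETA[0] + sum(b * xi for b, xi in zip(BETA[1:], x))

rng = random.Random(2015)  # fixed seed for reproducibility
predictors = [(rng.uniform(0, 5), rng.uniform(0, 5)) for _ in range(20000)]

# Form Y = conditional mean + Z, with Z ~ N(0, sigma^2) drawn independently.
responses = [conditional_mean(x) + rng.gauss(0.0, SIGMA) for x in predictors]

# The errors Z = Y - E(Y | X) should have mean near 0 and sd near sigma.
errors = [y - conditional_mean(x) for x, y in zip(predictors, responses)]
print(round(statistics.mean(errors), 2), round(statistics.stdev(errors), 2))
```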
Relationships Among Variables:
Cause-Effect Relationships:
1. Suppose we have a response Y and a predictor X. If the
   conditional distribution of Y given X = x changes as x changes,
   then we say that the two variables are related to each other. In
   a simple linear regression setup we write
      E(Y | X = x) = β0 + β1 x,
   where β0 and β1 are called the intercept and slope parameters,
   respectively.
2. If the changes in Y can be attributed solely to the changes in X,
   then we say there exists a cause-effect relationship between Y
   and X.
Relationships Among Variables:
Cause-Effect Relationships:
Through extensive research, scientists have established that there
exists a cause-effect relationship between a person's smoking status
and coronary heart disease.
Relationships Among Variables:
Confounding Variables:
1. Suppose there exists a relationship between a response Y and a
   predictor X of the form
      E(Y | X = x) = β0 + β1 x,
   where β0 and β1 are called the intercept and slope parameters,
   respectively.
2. If another variable Z is related to both Y and X, then we
   consider Z a confounding variable. Including the variable Z in
   the model changes the relationship between Y and X as follows:
      E(Y | X = x, Z = z) = β0* + β1* x + β2* z,
   where, in general, β0* ≠ β0 and β1* ≠ β1.
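The change in the slope can be seen in a small simulation sketch. Everything in it is an assumption made for illustration: the data-generating model (X depends on Z, and Y = 1 + 1·X + 2·Z + noise), the sample size, and the helper names. Least squares is used as a convenient fitting method; it is not introduced in this section.

```python
import random

def centered_sum(u, v):
    """Sum of (u_i - mean(u)) * (v_i - mean(v))."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))

def slope_x_only(x, y):
    """Least-squares slope of y on x alone (estimates beta_1)."""
    return centered_sum(x, y) / centered_sum(x, x)

def slope_x_adjusted(x, z, y):
    """Least-squares coefficient of x when z is also in the model
    (estimates beta_1*), from the two-predictor normal equations."""
    sxx, szz, sxz = centered_sum(x, x), centered_sum(z, z), centered_sum(x, z)
    sxy, szy = centered_sum(x, y), centered_sum(z, y)
    return (szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2)

rng = random.Random(10)  # fixed seed for reproducibility
n = 20000
z = [rng.gauss(0, 1) for _ in range(n)]       # confounder Z
x = [zi + rng.gauss(0, 1) for zi in z]        # X is related to Z
y = [1 + 1 * xi + 2 * zi + rng.gauss(0, 1)    # Y is related to X and Z
     for xi, zi in zip(x, z)]

b1 = slope_x_only(x, y)               # slope when Z is left out
b1_star = slope_x_adjusted(x, z, y)   # slope when Z is included
print(f"without Z: {b1:.2f}, with Z: {b1_star:.2f}")
```

Under these assumed values the slope without Z should be near 2, while the slope with Z included should be near 1, the coefficient actually used to generate the data.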
Relationships Among Variables:
Example: Confounding Variables:
1. We want to establish a relationship between grade point average
   and gender. Suppose female students earned higher GPAs than
   male students.
2. On the other hand, most of the male students and only a few of the
   female students hold a part-time job.
3. Including the variable part-time job status in the model might
   change the relationship between grade point average and gender.
4. Here, part-time job status is a confounding variable.
Relationships Among Variables:
Experiments:
1. In an experiment, we randomly sample n = n1 + n2 + n3 items from
   a population Π because we want to make inferences regarding the
   population. Random sampling helps eliminate selection bias.
2. Suppose we have one response Y and one predictor X which has
   3 levels, say x1, x2 and x3. We randomly assign x1, x2 and x3
   to n1, n2 and n3 items, respectively. Such random assignment of
   X values helps eliminate any confounding effects of other
   variables.
3. We then observe the values of the response variable Y.
4. Statistical inferences based on data collected via an experiment
   can support the conclusion that a cause-effect relationship
   exists.
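The sampling and assignment steps above can be sketched as follows. The population of 1000 numbered items, the group sizes n1 = n2 = n3 = 4, and the function name `randomize` are hypothetical choices for illustration.

```python
import random

def randomize(units, levels, sizes, rng):
    """Randomly assign each level to the requested number of units.

    Returns a dict mapping every unit to its assigned level.
    """
    if sum(sizes) != len(units):
        raise ValueError("group sizes must add up to the number of units")
    shuffled = list(units)
    rng.shuffle(shuffled)          # random order, so no systematic choice
    assignment, start = {}, 0
    for level, size in zip(levels, sizes):
        for unit in shuffled[start:start + size]:
            assignment[unit] = level
        start += size
    return assignment

rng = random.Random(21)                   # fixed seed for reproducibility
population = range(1000)                  # hypothetical population of items
sample = rng.sample(population, 12)       # random sample of n = 12 items
plan = randomize(sample, ["x1", "x2", "x3"], [4, 4, 4], rng)

for unit, level in sorted(plan.items()):
    print(unit, level)
```

Shuffling the sampled units and then cutting the shuffled list into consecutive blocks gives every unit the same chance of receiving each level, which is the property the random assignment relies on.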
Relationships Among Variables:
Example: In a small-scale experimental study of the relation between
degree of brand liking (Y) and moisture content (X1) and sweetness
(X2) of the product, the following results were obtained from an
experiment based on a completely randomized design. The data are
coded below:
  i      1     2     3    ···    14    15    16
  X1i    4     4     4    ···    10    10    10
  X2i    2     4     2    ···     4     2     4
  Yi    64    73    61    ···    95    94   100
Relationships Among Variables:
Observational Studies:
1. In an observational study, the sample items are not randomly
   selected from the population Π. Hence, the inferences regarding
   the population might be flawed.
2. The levels of the predictor variables are not randomly assigned to
   the items. Confounding effects of other variables might be
   present.
3. Statistical inferences based on data collected via an observational
   study do not necessarily imply a cause-effect relationship
   between variables.
4. In the hierarchy of study designs for establishing cause-effect
   relationships, experiments sit at the top and observational
   studies at the bottom.
Relationships Among Variables:
Example: Property Assessments
The data that follow show assessed value for property tax purposes
(X1 , in thousand dollars) and sales price (X2 , in thousand dollars) for a
sample of 15 parcels of land for industrial development sold recently in
“arm’s length” transactions in a tax district.
  i       1     2     3    ···    13    14    15
  X1i  13.9  16.0  10.3    ···  14.9  12.9  15.8
  X2i  28.6  34.7  21.0    ···  35.1  30.0  36.2
Relationships Among Variables:
Experimental Design (Design of Experiments):
1. Let us consider a simple setup where the goal is to determine
   whether a cause-effect relationship exists between a response Y
   and a predictor X (also called a factor) defined on a population Π.
2. We randomly select a sample of experimental units (previously
   called items) π1, π2, ···, πn from the population Π.
3. The values x1, x2, ···, xk of X are called levels. When the number
   of possible values of X is large or infinite, we select (perhaps
   randomly) a finite set of values of X that spans the range of X
   well.
4. Level xi of X is then randomly assigned to ni experimental units
   (i = 1, 2, ···, k). Finally, the values of Y are observed for the
   sampled experimental units.
Relationships Among Variables:
Experimental Design (Design of Experiments):
1. After random assignment to the experimental units, each level of
   the factor variable is called a treatment.
2. If the conditional distribution of Y at a particular level xi
   shows large variability, then we choose ni to be large.
Relationships Among Variables:
Example
A rental car company wants to investigate whether the type of car
rented affects the length of the rental period. An experiment is run for
one week at a particular location, and 10 rental contracts are selected
at random for each car type. The results are shown in the following
table.
  Type of Car    Observations
  Sub-compact    3   5   3   7   6   5   3   2   1   6
  Compact        1   3   4   7   5   6   3   2   1   7
  Midsize        4   1   3   5   7   1   2   4   2   7
  Full size      3   5   7   5  10   3   4   7   2   7
Is there evidence to support a claim that the type of car rented
affects the length of the rental contract?
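One common way to approach this question is a one-way analysis of variance, which this section does not introduce; the sketch below is offered only as an illustration of that technique. It assumes the table lists ten observations per car type in the order typed into the code, and the function name `one_way_f` is hypothetical. The code compares the group means and computes the ratio of between-group to within-group variability (the F statistic).

```python
from statistics import mean

# Rental lengths by car type (assumed reading of the table:
# each list holds the 10 contracts observed for one car type).
rentals = {
    "Sub-compact": [3, 5, 3, 7, 6, 5, 3, 2, 1, 6],
    "Compact":     [1, 3, 4, 7, 5, 6, 3, 2, 1, 7],
    "Midsize":     [4, 1, 3, 5, 7, 1, 2, 4, 2, 7],
    "Full size":   [3, 5, 7, 5, 10, 3, 4, 7, 2, 7],
}

def one_way_f(groups):
    """Return the one-way ANOVA F statistic and its degrees of freedom."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

group_means = {car: mean(obs) for car, obs in rentals.items()}
f_stat, df1, df2 = one_way_f(list(rentals.values()))
print(group_means)
print(f"F = {f_stat:.2f} on ({df1}, {df2}) degrees of freedom")
```

A ratio near 1 means the group means differ by about as much as the variability within groups would produce on its own. Comparing the statistic with an F(3, 36) reference distribution is the usual next step.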
Relationships Among Variables:
Control Treatment, the Placebo:
1. In experimental design, we often choose one level of the predictor
   variable to be zero, i.e., X = 0, where X represents the dose of
   a treatment. The zero level of the predictor variable is called a
   control treatment and serves as a baseline against which we
   assess the effect of the treatment.
2. In medical experiments, we might assign a zero dose of a drug (the
   so-called sugar pill) to some patients in order to measure the
   efficacy of the drug in alleviating disease symptoms. Such a
   control treatment is also known as a placebo. Sometimes patients
   feel better with a placebo; this is called the placebo effect.
Relationships Among Variables:
Blinding:
1. Blinding: An experiment is blinded when a patient does not know
   whether he/she is receiving a placebo or a drug.
2. Double-blinding: An experiment is double-blinded when neither the
   patients nor the experimenter know the treatment assignments.
Relationships Among Variables:
Exercise 10.1.3: Suppose that a census is conducted on a population
and the joint distribution of (X , Y ) is obtained as in the following table.
           Y = 1   Y = 2   Y = 3
  X = 1    0.15    0.18    0.40
  X = 2    0.12    0.09    0.06
Determine whether or not a relationship exists between Y and X .
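One way to check an answer numerically is to compute the conditional distribution of Y given each value of X and see whether it changes. The sketch below assumes the table entries are the joint probabilities P(X = x, Y = y); the function names are hypothetical.

```python
# Joint probabilities P(X = x, Y = y) from the exercise table,
# keyed as (x, y).
joint = {
    (1, 1): 0.15, (1, 2): 0.18, (1, 3): 0.40,
    (2, 1): 0.12, (2, 2): 0.09, (2, 3): 0.06,
}

def conditional_of_y(joint, x):
    """Return {y: P(Y = y | X = x)} computed from the joint table."""
    marginal = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / marginal for (xi, y), p in joint.items() if xi == x}

def related(joint, tol=1e-9):
    """True if the conditional distribution of Y changes with x."""
    xs = sorted({x for x, _ in joint})
    base = conditional_of_y(joint, xs[0])
    return any(
        abs(conditional_of_y(joint, x)[y] - base[y]) > tol
        for x in xs[1:] for y in base
    )

for x in (1, 2):
    print(x, {y: round(p, 3) for y, p in conditional_of_y(joint, x).items()})
print("related:", related(joint))
```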
Relationships Among Variables:
Exercise 10.1.14: Suppose we have a quantitative response variable
Y and two categorical predictor variables W and X , both taking values
in {0, 1}. Suppose the conditional distributions of Y are given by
Y |W = 0, X = 0 ∼ N(3, 5)
Y |W = 1, X = 0 ∼ N(3, 5)
Y |W = 0, X = 1 ∼ N(4, 5)
Y |W = 1, X = 1 ∼ N(4, 5)
Does W have a relationship with Y ? Does X have a relationship with
Y ? Explain your answers.