Assoc Prof Dr Zamalia Mahmud Faculty of Computer and Mathematical Sciences

Assoc Prof Dr Zamalia Mahmud
Center of Studies for Statistics and Decision Science
Faculty of Computer and Mathematical Sciences
November 2011
Why is Statistics useful in research?
• It helps us to make sense of the information
• It helps us to understand how decisions are
made
• It helps us to determine the cause and effect
of a phenomenon
• It helps us to arrive at a conclusion
• It lends us the tools and techniques for
collecting, analyzing and interpreting data
How to prepare yourself to be competent in statistical
data analysis?
• Take up several statistics courses
• Learn statistical packages
• Read statistics books
• Learn the relevant statistical techniques
• Get to know your data very well
• Be prepared to analyze data with minimum help
from a statistics consultant
• Be prepared to deal with your fear and anxiety
towards statistics
What do you need to know prior to doing
data analysis?
• Know how to collect the right data using the
appropriate instrument
• Know the nature of data to be collected
• Know the type of data to be collected
• Know the levels of measurement of the data to
be collected
• Know the types of variables associated with the
data to be collected
What do you need to know prior to doing
data analysis? (continued)
• Know how to relate the data with your
research questions
• Know how to relate research questions with
the types of analyses to be done
• Know how to relate research hypotheses with
the types of analyses to be done
• Know how to recognize the variables to be
measured from the research questions
Always begin your research inquiries with
measurable Research Questions
• Is there a significant relationship between
smoking habit and lung infection?
• Does job-related stress affect lecturers’
performance at the University?
• Is there a significant difference in the job
satisfaction level between Maxis and
Celcom employees?
• Does an increase in motivation cause the job
satisfaction level of employees to increase?
• Does motivation moderate the relationship
between job loyalty and job satisfaction?
• Is there a significant difference in the knowledge
of occupational safety and health between male
and female employees?
• What are the factors that motivate individuals to
work at the private hospitals?
Examples of Research Objectives
• To determine if there is a relationship
between smoking habit and lung infection.
• To determine if job-related stress affects
workers’ performance at the private
hospitals.
• To determine if there is a difference in the
job satisfaction level between Maxis and
Celcom employees.
• To identify factors that motivate individuals
to work at the private hospitals.
Examples of Research Hypotheses
• A hypothesis is a statement that proposes an
explanation, which can be tested through data
obtained from further observation or
experimentation.
• Two types of hypotheses: Null and Alternative
Examples:
• There is a relationship between smoking habit
and lung infection.
• There is a difference in the mean job
satisfaction score between Maxis and Celcom
employees.
Why are Research Questions, Research Objectives and
Research Hypotheses important in your analysis?
• They help you to stay focused on what is to be
measured
• They help you to focus on the pertinent
variables to be measured
• They help you to do the correct and
appropriate analysis for your research
Data Sources
• Primary (Data Collection)
– Observation
– Survey
– Experimentation
• Secondary (Data Compilation)
– Print or Electronic
Types of Data
• Categorical (defined categories)
– Examples: Marital Status, Political Party, Eye Color
• Numerical
– Discrete (counted items)
Examples: Number of workers, Number of defective items
– Continuous (measured characteristics)
Examples: Weight, Voltage
Types of Samples Used
• Non-Probability Samples (do not require a sampling frame)
– Judgement
– Snowball
– Quota
– Convenience
• Probability Samples (require a sampling frame)
– Simple Random
– Stratified
– Systematic
– Cluster
Levels of Measurement
• RATIO
• INTERVAL
• ORDINAL
• NOMINAL
NOMINAL
A scale in which the numbers or letters assigned to objects serve as
labels for identification. Categories only. Data cannot be arranged in an
ordering scheme.
Examples:
Gender: Male (1), Female (2)
Marital Status: Married (1), Single (2), Widowed (3)

ORDINAL
A scale that arranges objects or alternatives according to their
magnitude or rank order. We know “excellent” is better than “good”,
but we do not know by how much.
Examples:
Monthly Salary: < RM1000 (1), RM1001 – RM2000 (2), RM2001 – RM3000 (3)
Overall ratings of the hospital services: Excellent (4), Good (3), Fair (2), Poor (1)

INTERVAL
It has the rank-order feature of ordinal scales, but it also includes the
additional characteristic of equal distances, or equal intervals,
between numbers on the scale. It has no true zero point (i.e., zero is
not the starting point).
Examples:
Temperature: -2 °C, 0 °C (freezing point), 2 °C
Likert Scale: 1 – Strongly disagree; 2 – Disagree; 3 – Neutral; 4 – Agree; 5 – Strongly agree

RATIO
It has all the above properties, plus it has a true zero point.
Examples:
Weight: 20.8 kg, 45.5 kg, 87.5 kg
Height: 150 cm, 126.8 cm, 106.5 cm
Descriptive and Inferential Statistics
Statistical Analysis
• Descriptive
Used to describe the basic features of the data obtained in
a study: tables, charts, graphs
• Inferential
Combines the methods of descriptive statistics with the
theory of probability for the purpose of learning what
samples of data tell about the characteristics of the populations
from which they were drawn
Prof Madya Dr Rasimah Aripin, 6/5/2004
Strategy for Data Analysis
• FREQUENCY DISTRIBUTION FOR EVERY VARIABLE
• DESCRIPTIVE STATISTICS/CHARTS
– Qualitative: percentage, mode, median, charts and tables (no measure for variation)
– Quantitative: mean, median, variance, graphs, and many more
• MEASURES OF ASSOCIATION
– Qualitative: cross-tabulation, nonparametric measures of association
– Quantitative: correlation analysis
• HYPOTHESIS TESTING/MODELLING
– Qualitative: nonparametric methods
– Quantitative: parametric methods
• ESTIMATION, PREDICTION, FORECASTING
Descriptive Statistics for Different Types of Measurement Scales
Type of measurement: Type of descriptive statistics
• Nominal (two categories): frequency table, proportion (percentage), mode
• Nominal (more than two categories): frequency table, category proportion (%), mode
• Ordinal: rank order, median
• Interval/Ratio: arithmetic mean, index numbers, variance, standard deviation, range, percentiles
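As a rough illustration of matching the descriptive statistic to the measurement scale, the sketch below uses Python's standard `statistics` module. The marital status and rating samples are made-up values for illustration only; the weights are the three example values from the ratio-scale slide.

```python
import statistics

# Hypothetical nominal data: only the mode (most frequent category) is meaningful
marital_status = ["Married", "Single", "Married", "Widowed", "Married"]
nominal_mode = statistics.mode(marital_status)

# Hypothetical ordinal data (1 = Poor ... 4 = Excellent): the median respects rank order
ratings = [4, 3, 3, 2, 1, 3, 4]
ordinal_median = statistics.median(ratings)

# Ratio data (weights in kg from the slide): mean and standard deviation apply
weights = [20.8, 45.5, 87.5]
ratio_mean = statistics.mean(weights)
ratio_sd = statistics.stdev(weights)  # sample standard deviation (n - 1 denominator)

print(nominal_mode, ordinal_median, round(ratio_mean, 2), round(ratio_sd, 2))
```

Computing a mean of the marital status codes would run without error but would not be interpretable, which is the point of checking the scale first.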
Selection of Univariate/Bivariate Techniques
(Classification of univariate techniques for testing of the mean and median)
• Interval/Ratio Data
– One Sample: t test, z test
– Two or More Samples, Independent: t test, Z test, One-way ANOVA
– Two or More Samples, Related: Paired t test
• Nominal/Ordinal Data
– One Sample: Frequency, Chi-Square, K-S, Runs
– Two or More Samples, Independent: Chi-Square, Mann-Whitney, Median, K-S, K-W ANOVA
– Two or More Samples, Related: Sign, Wilcoxon, McNemar
Measures of Association
by Measurement Scale
• Interval/Ratio
– Coefficients: Pearson’s r, Simple Regression
– Research question: Is moisture content related to temperature?
• Ordinal
– Coefficients: Spearman Rank, Kendall’s Rank
– Research question: Is preference related to convenience of locations?
• Nominal
– Coefficients: Phi-Coefficient, Contingency Coefficient
– Research question: Is gender associated with brand preference?
Selection of Multivariate Techniques
• Dependence Techniques
– One Dependent Variable: cross-tabulation; analysis of variance and covariance; multiple regression; two-group discriminant analysis; conjoint analysis
– More than One Dependent Variable: multivariate analysis of variance and covariance; canonical correlation; multiple discriminant analysis
• Interdependence Techniques
– Variable Interdependence: factor analysis
– Interobject Similarity: cluster analysis; multidimensional scaling
BASIC DATA ANALYSIS
• DESCRIPTIVE STATISTICS
• FREQUENCY & PERCENTAGE TABLES
• CROSSTABULATION
• DATA TRANSFORMATION
• DATA COMPUTATION
• GRAPHICAL REPRESENTATION
Gender
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Female       216        45.6         45.6              45.6
       Male         258        54.4         54.4             100.0
       Total        474       100.0        100.0
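A minimal sketch of how the percent and cumulative percent columns of a frequency table can be computed from raw counts. The function name `frequency_table` is a hypothetical helper, not part of SPSS or any library; the counts are the ones shown in the Gender table.

```python
def frequency_table(counts):
    """Return rows of (category, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for category, count in counts.items():
        percent = 100.0 * count / total
        cumulative += percent
        rows.append((category, count, round(percent, 1), round(cumulative, 1)))
    return rows

# Counts taken from the Gender frequency table
gender_table = frequency_table({"Female": 216, "Male": 258})
for row in gender_table:
    print(row)
```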
Descriptives: Beginning Salary by Gender
                                           Female                       Male
                                    Statistic     Std. Error     Statistic     Std. Error
Mean                                $13,091.97    $199.74        $20,301.40    $567.27
95% Confidence Interval for Mean
  Lower Bound                       $12,698.26                   $19,184.30
  Upper Bound                       $13,485.67                   $21,418.49
Median                              $12,375.00                   $15,750.00
Variance                            8617742.738                  83024550.57
Std. Deviation                      $2,935.60                    $9,111.78
Minimum                             $9,000                       $9,000
Maximum                             $30,000                      $79,980
Range                               $21,000                      $70,980
Interquartile Range                 $3,118.75                    $7,687.50
Skewness                            1.767         .166           2.390         .152
Kurtosis                            5.352         .330           8.488         .302
Employment Category
             Count      %
Clerical      363     76.6%
Custodial      27      5.7%
Manager        84     17.7%
Total         474    100.0%
CONTINGENCY TABLE
The results of a crosstabulation between two
categorical variables (smoking habit and hospitalization)
Graphical Methods
[Figure: Pie Chart and Bar Chart of Employment Category (vertical axis: Percent). Clerical 76.6%, Custodial 5.7%, Manager 17.7%]
Comparative Histogram
Box-and-Whisker Plot
Normal Q-Q Plot
HYPOTHESIS TESTING
• WHAT IS A HYPOTHESIS?
An unproven proposition or supposition that
tentatively explains certain facts or phenomena.
• NULL HYPOTHESIS
A conservative statement which communicates the
notion that any change from what has been
thought to be true or observed in the past will be
due entirely to error.
• ALTERNATIVE HYPOTHESIS
A statement indicating the opposite of the
null hypothesis.
• SIGNIFICANCE LEVEL
The critical probability in choosing between the
null and alternative hypotheses; the probability
level (say, α = 0.05 or 0.01) that is too low to
warrant support of a null hypothesis.
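• CRITICAL VALUE
The value that lies exactly on the boundary of
the region of rejection.
• p-VALUE
The probability, computed assuming the null hypothesis is
true, of obtaining a test statistic at least as extreme as the
one observed. H0 is rejected when the p-value is smaller than α.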
p-Value Solution
Calculate the p-value and compare it to α
(For a two-sided test the p-value is always two-sided)
Rejection region: reject H0 if Z < -1.96 or Z > 1.96 (α/2 = .025 in each tail); otherwise do not reject H0
Test statistic: Z = ±2.47, with an area of .0068 in each tail
$$p\text{-value} = P(Z < -2.47) + P(Z > 2.47) = 2(.0068) = .0136$$
Reject H0 since p-value = .0136 < α = .05
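The two-sided p-value above can be checked numerically. This is a minimal sketch using only the standard library; `two_sided_p_value` is a hypothetical helper name. It relies on the identity P(|Z| > z) = erfc(z/√2) for a standard normal Z, so the result is computed without a printed table and may differ from the slide's .0136 in the fourth decimal because the slide rounds the tail area to .0068.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic z."""
    # P(|Z| > |z|) = erfc(|z| / sqrt(2)) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

alpha = 0.05                      # significance level used on the slide
p_value = two_sided_p_value(-2.47)
reject_h0 = p_value < alpha       # decision rule: reject H0 when p-value < alpha

print(round(p_value, 4), reject_h0)
```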
TEST OF DIFFERENCES
Investigation of hypotheses that
state two (or more) groups differ with
respect to measures on a variable.
e.g.
To determine if male and female
employees differ in their attitude
towards their job.
TEST OF DIFFERENCES
COMMON BIVARIATE TESTS OF DIFFERENCE
• Interval and Ratio
– Differences among two independent groups: t-test or Z-test
– Differences among three or more independent groups: One-way ANOVA
• Ordinal
– Differences among two independent groups: Mann-Whitney U-test, Wilcoxon test (non-parametric)
– Differences among three or more independent groups: Kruskal-Wallis test (non-parametric)
• Nominal
– Differences among two independent groups: Z-test (two proportions), Chi-square test (non-parametric)
– Differences among three or more independent groups: Chi-square test (non-parametric)
TEST OF ASSOCIATION
• CHI-SQUARE (χ²) TEST OF INDEPENDENCE
A test conducted to investigate if there is
an association/relationship between two
nominal, two ordinal or between nominal
and ordinal variables.
Example
H0 : There is no association between gender
and preference for colours
H1 : There is an association between gender
and preference for colours
Another example
H0 : There is no association between gender
and employment category
H1 : There is an association between gender
and employment category
Case Processing Summary
                                               Cases
                                   Valid           Missing          Total
                                   N    Percent    N    Percent     N    Percent
Gender * Employment Category      474   100.0%     0     .0%       474   100.0%
Gender * Employment Category Crosstabulation
                                              Employment Category
                                          Clerical   Custodial   Manager    Total
Female  Count                               206          0          10       216
        Expected Count                      165.4       12.3        38.3     216.0
        % within Gender                     95.4%        .0%        4.6%     100.0%
        % within Employment Category        56.7%        .0%       11.9%      45.6%
Male    Count                               157         27          74       258
        Expected Count                      197.6       14.7        45.7     258.0
        % within Gender                     60.9%       10.5%      28.7%     100.0%
        % within Employment Category        43.3%      100.0%      88.1%      54.4%
Total   Count                               363         27          84       474
        Expected Count                      363.0       27.0        84.0     474.0
        % within Gender                     76.6%        5.7%      17.7%     100.0%
        % within Employment Category       100.0%      100.0%     100.0%     100.0%
Chi-Square Tests
                       Value      df    Asymp. Sig. (2-sided)
Pearson Chi-Square     79.277a     2    .000
Likelihood Ratio       95.463      2    .000
N of Valid Cases       474
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 12.30.
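The Pearson chi-square value in the output can be reproduced from the observed counts alone. The sketch below is an illustrative pure-Python calculation, not the SPSS procedure itself; `chi_square_independence` is a hypothetical helper name. Each expected count is (row total × column total) / n, and the statistic sums (observed − expected)² / expected over all cells.

```python
import math

def chi_square_independence(observed):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Observed counts: rows = Female, Male; columns = Clerical, Custodial, Manager
observed = [[206, 0, 10], [157, 27, 74]]
chi2, df = chi_square_independence(observed)

# Special case only: with df = 2 the upper-tail chi-square probability is exp(-x / 2)
p_value = math.exp(-chi2 / 2) if df == 2 else None

print(round(chi2, 3), df, p_value)
```

The closed-form p-value holds only for 2 degrees of freedom; for other table sizes a statistical library or a chi-square table would be needed.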
CORRELATION COEFFICIENT (r)
It is a statistical measure of the covariation
of or association between two variables. It
indicates the strength of the relationship
between two variables.
Correlation coefficient (r) ranges from
+1.0 to -1.0.
If r = +1.0 → perfect positive linear relationship
If r = -1.0 → perfect negative linear relationship
If r = 0 → no correlation
If r = -0.92 → a relatively strong inverse relationship
i.e., the greater the value measured by variable X, the
less the value measured by variable Y.
If r = +0.92 → a relatively strong positive relationship
i.e., the greater the value measured by variable X, the
more the value measured by variable Y.
Testing the Significance of the correlation coefficient
H0 : ρ = 0 (No correlation exists between the two variables)
H1 : ρ ≠ 0 (Correlation exists between the two variables)
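The slide states the hypotheses but not the test statistic. A commonly used statistic for this test, assuming the sample correlation r is computed from n pairs of roughly bivariate normal observations, is shown below; the formula is a standard result and is not taken from the slides.

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad \text{d.f.} = n - 2
```

H0 is rejected when |t| exceeds the critical t value for n − 2 degrees of freedom at the chosen α.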
REGRESSION ANALYSIS
A technique used for measuring the
linear association between a dependent
variable and an independent variable.
Regression analysis attempts to predict the
values of a continuous, interval-scaled
dependent variable from the specific values
of the independent variable.
Simple Linear Regression Model
The population regression model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where
Yi = dependent variable
β0 = population Y intercept
β1 = population slope coefficient
Xi = independent variable
εi = random error term
Linear component: β0 + β1Xi
Random error component: εi
Simple Linear Regression Model
(continued)
[Figure: plot of Y against X showing the regression line Yi = β0 + β1Xi + εi with intercept β0 and slope β1; for a given Xi, the random error εi is the vertical distance between the observed value of Y and the predicted value of Y on the line]
Simple Linear Regression
Equation
The simple linear regression equation provides an
estimate of the population regression line
$$\hat{Y}_i = b_0 + b_1 X_i$$
where
Ŷi = estimated (or predicted) Y value for observation i
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
Xi = value of X for observation i
The individual random error terms ei have a mean of zero
Least Squares Method
• b0 and b1 are obtained by finding the values
of b0 and b1 that minimize the sum of the
squared differences between Y and Ŷ:
$$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum \left(Y_i - (b_0 + b_1 X_i)\right)^2$$
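For reference, minimizing this sum of squares (setting its partial derivatives with respect to b0 and b1 to zero) leads to the standard closed-form estimates below. The symbols X̄ and Ȳ denote the sample means of X and Y; they are notation introduced here and do not appear on the slides.

```latex
b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
b_0 = \bar{Y} - b_1 \bar{X}
```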
Interpretation of the
slope and the Intercept
• b0 is the estimated average value of
Y when the value of X is zero
• b1 is the estimated change in the
average value of Y as a result of a
one-unit change in X
Simple Linear Regression
Example
• A real estate agent wishes to examine the
relationship between the selling price of a
home and its size (measured in square feet)
• A random sample of 10 houses is selected
– Dependent variable (Y) = house price in
$1000s
– Independent variable (X) = size
Sample Data for House Price Model
House Price in $1000s (Y)    Size in sq. ft. (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
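A minimal sketch of fitting the least squares line to the sample data above using only the standard library. `least_squares` is a hypothetical helper name; it uses the standard formulas b1 = Sxy / Sxx and b0 = Ȳ − b1·X̄ rather than SPSS or Excel.

```python
# Sample data from the slide: size in square feet (X), price in $1000s (Y)
sizes = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

def least_squares(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = least_squares(sizes, prices)
print(round(b0, 5), round(b1, 5))
```

The fitted values should agree with the intercept and slope reported in the software output for this model.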
Graphical Presentation
• House price model: scatter plot and
regression line
[Figure: scatter plot of House Price ($1000s) against Square Feet with the fitted regression line; Intercept = 98.248, Slope = 0.10977]
house price = 98.24833 + 0.10977 (size)
Interpretation of the Intercept, b0
house price = 98.24833 + 0.10977 (size)
• b0 is the estimated average value of Y when
the value of X is zero (if X = 0 is in the range
of observed X values)
– Here, no houses had 0 square feet, so b0 =
98.24833 just indicates that, for houses within
the range of sizes observed, $98,248.33 is the
portion of the house price not explained by
square feet
Interpretation of the slope, b1
• b1 measures the estimated change in the
average value of Y as a result of a one-unit change in X
– Here, b1 = .10977 tells us that the average value
of a house increases by .10977($1000) = $109.77,
on average, for each additional one square foot
of size
Predictions using
Regression Analysis
Predict the price for a house
with 2000 square feet:
house price = 98.25 + 0.1098 (sq. ft.)
= 98.25 + 0.1098(2000)
= 317.85
The predicted price for a house with 2000
square feet is 317.85($1,000s) = $317,850
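The prediction step can be sketched as a small function. `predict_price` and `in_observed_range` are hypothetical names; the default coefficients are the rounded values used on the slide, and the range check reflects the smallest and largest house sizes in the sample data (1100 and 2450 sq. ft.), since predictions outside that range are extrapolations.

```python
def predict_price(size_sq_ft, b0=98.25, b1=0.1098):
    """Predicted house price in $1000s from the fitted line (rounded coefficients)."""
    return b0 + b1 * size_sq_ft

def in_observed_range(size_sq_ft, smallest=1100, largest=2450):
    """True when the size lies within the sizes observed in the sample."""
    return smallest <= size_sq_ft <= largest

predicted = predict_price(2000)          # price in $1000s
predicted_dollars = predicted * 1000     # convert to dollars
print(round(predicted, 2), in_observed_range(2000))
```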
SPSS Output – House Price Model
Coefficient of Determination, r²
• The coefficient of determination is the
portion of the total variation in the
dependent variable that is explained by
variation in the independent variable
• The coefficient of determination is also
called r-squared and is denoted as r²
$$0 \le r^2 \le 1$$
SPSS Output – Coeff. of Determination (r² value)
58.1% of the variation in
house prices is
explained by variation in
square feet
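The 58.1% figure can be reproduced from the sample data. This sketch assumes the usual decomposition SST = SSR + SSE, with r² = SSR / SST; `r_squared` is a hypothetical helper name and the data are the ten houses from the sample data slide.

```python
sizes = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

def r_squared(x, y):
    """Coefficient of determination for the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sst = sum((yi - y_bar) ** 2 for yi in y)   # total variation in Y
    ssr = sxy ** 2 / sxx                       # variation explained by the fitted line
    return ssr / sst

r2 = r_squared(sizes, prices)
print(f"{100 * r2:.1f}% of the variation in price is explained by size")
```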
Inference about the Slope: t Test
• t test for a population slope
– Is there a linear relationship between X and Y?
• Null and alternative hypotheses
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
• Test statistic
$$t = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad \text{d.f.} = n - 2$$
where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope
Inference about the Slope: t Test
Estimated Regression Equation (from the house price sample data, Y = house price in $1000s, X = square feet):
house price = 98.25 + 0.1098 (sq. ft.)
The slope of this model is 0.1098
Does size of the house affect its
sales price?
Inferences about the Slope: t Test
H0: β1 = 0
H1: β1 ≠ 0
From SPSS output: b1 = 0.10977, Sb1 = 0.03297
$$t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{0.10977 - 0}{0.03297} = 3.32938$$
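As a check on the software output, the standard error of the slope and the t statistic can be computed directly from the ten sample houses. This sketch assumes the usual formula Sb1 = sqrt(MSE / Sxx) with MSE = SSE / (n − 2); `slope_t_test` is a hypothetical helper name, and the critical value 2.3060 is the tabled value quoted on the slides for 8 d.f.

```python
import math

sizes = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

def slope_t_test(x, y, hypothesized_slope=0.0):
    """Return (b1, standard error of b1, t statistic, degrees of freedom)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals around the fitted line
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    sb1 = math.sqrt(mse / sxx)
    t_stat = (b1 - hypothesized_slope) / sb1
    return b1, sb1, t_stat, n - 2

b1, sb1, t_stat, df = slope_t_test(sizes, prices)
critical_t = 2.3060                     # two-tailed critical value for 8 d.f., alpha = .05 (from the slides)
reject_h0 = abs(t_stat) > critical_t
print(round(sb1, 5), round(t_stat, 3), df, reject_h0)
```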
Inferences about the Slope: t Test
(continued)
Test Statistic: t = 3.329
H0: β1 = 0
H1: β1 ≠ 0
From SPSS output:
               Coefficients   Standard Error   t Stat     P-value
Intercept      98.24833       58.03348         1.69296    0.12892
Square Feet    0.10977        0.03297          3.32938    0.01039
d.f. = 10 - 2 = 8
Rejection region (α/2 = .025 in each tail): reject H0 if t < -2.3060 or t > 2.3060; otherwise do not reject H0
Decision: Reject H0, since t = 3.329 > 2.3060
Conclusion: There is sufficient evidence that
size of house affects house price
Inferences about the Slope: t Test
(continued)
P-value = 0.01039
H0: β1 = 0
H1: β1 ≠ 0
From Excel output:
               Coefficients   Standard Error   t Stat     P-value
Intercept      98.24833       58.03348         1.69296    0.12892
Square Feet    0.10977        0.03297          3.32938    0.01039
This is a two-tail test, so the p-value is
P(t > 3.329) + P(t < -3.329) = 0.01039 (for 8 d.f.)
Decision: P-value < α, so reject H0
Conclusion: There is sufficient evidence that
square footage or house size affects house price
F-Test for Significance
• F test statistic:
$$F = \frac{MSR}{MSE}$$
where
$$MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1}$$
where F follows an F distribution with k numerator and (n – k – 1)
denominator degrees of freedom
(k = the number of independent variables in the regression model)
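A minimal sketch of the F statistic for the house price model, computed from the sample data with k = 1 independent variable. `f_statistic` is a hypothetical helper name. In simple linear regression the F statistic equals the square of the slope's t statistic, which gives a quick consistency check against the t test slides.

```python
sizes = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

def f_statistic(x, y, k=1):
    """F = MSR / MSE for a simple linear regression of y on x (k = 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sxy ** 2 / sxx          # regression sum of squares
    sse = sst - ssr               # error sum of squares
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse, k, n - k - 1

f_stat, df_num, df_den = f_statistic(sizes, prices)
print(round(f_stat, 3), df_num, df_den)
```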
End of Presentation
Contact:
Center of Studies for Statistics and Decision Science
Faculty of Computer and Mathematical Sciences
[email protected];
[email protected]
Tel: 03-55435367; Fax: 03-55435501
Hp: 012-2197985