Data Analysis and Surveying 101: Basic research methods and biostatistics

as they apply to the
Theresa Jackson Hughes, MPH
American College Health Association
December 2006
What we will cover today
 Research Methods
• Sampling Frame and Sampling
• Generalizability
• Bias
• Reliability and Validity
• Levels of measurement
 Biostatistics
• Statistical significance
• Other key terms
• Appropriate statistical tests
• Fun examples from the Spring 2005 dataset!
Get excited! It’s data time!!!
Research Methods
 “To do successful research, you don't need
to know everything, you just need to know
of one thing that isn't known.”
• Arthur Schawlow
 “That's the nature of research - you don't
know what in hell you're doing.”
• Harold "Doc" Edgerton
 “If we knew what it was we were doing, it
would not be called research, would it?”
• Albert Einstein
What exactly is research?
 “Scientific research is systematic,
controlled, empirical, and critical
investigation of natural phenomena guided
by theory and hypotheses about the
presumed relations among such
phenomena.”
• Kerlinger, 1986
 Research is an organized and systematic
way of finding answers to questions
Important Components of
Empirical Research
 Problem statement, research questions,
purposes, benefits
 Theory, assumptions, background literature
 Variables and hypotheses
 Operational definitions and measurement
 Research design and methodology
 Instrumentation, sampling
 Data analysis
 Conclusions, interpretations,
recommendations
Sampling
 What is your population of interest?
• To whom do you want to generalize your
results?
 All students (18 and over)
 Undergraduates only
 Greeks
 Athletes
 Other
 Can you sample the entire
population?
Sampling
 A sample is “a smaller (but hopefully
representative) collection of units from a
population used to determine truths about
that population” (Field, 2005)
 Why sample?
• Resources (time, money) and workload
• Gives results with known accuracy that can be
calculated mathematically
 The sampling frame is the list from which
the potential respondents are drawn
• Registrar’s office
• Class rosters
• Must assess sampling frame errors
Types of Samples
 Probability (Random) Samples 
• Simple random sample
• Systematic random sample
• Stratified random sample
 Proportionate
 Disproportionate
• Cluster sample
 Non-Probability Samples
• Convenience sample
• Purposive sample
• Quota
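The three most common probability designs can be sketched in a few lines of Python. This is only an illustration: the sampling frame below is a made-up list of 1,000 student IDs, and the function names and the undergraduate/graduate strata are hypothetical, not part of any survey protocol.

```python
import random


def simple_random_sample(frame, n, rng):
    """Every unit in the frame has the same chance of selection."""
    return rng.sample(frame, n)


def systematic_sample(frame, n, rng):
    """Pick a random start, then take every k-th unit from the frame."""
    k = len(frame) // n          # sampling interval
    start = rng.randrange(k)     # random start within the first interval
    return [frame[start + i * k] for i in range(n)]


def proportionate_stratified_sample(strata, n, rng):
    """Sample each stratum in proportion to its share of the population."""
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        share = round(n * len(units) / total)
        sample[name] = rng.sample(units, share)
    return sample


rng = random.Random(2005)  # fixed seed so the draw is repeatable

# Hypothetical sampling frame, e.g. a list from the registrar's office
frame = [f"student_{i:04d}" for i in range(1000)]

srs = simple_random_sample(frame, 100, rng)
systematic = systematic_sample(frame, 100, rng)

# Hypothetical strata: 800 undergraduates and 200 graduate students
strata = {"undergraduate": frame[:800], "graduate": frame[800:]}
stratified = proportionate_stratified_sample(strata, 100, rng)
```

A disproportionate stratified sample would replace the proportional share with a fixed number per stratum, which oversamples small groups.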
Sample Size
 Depends on expected response rate
• Average 85% for paper
 FINAL SAMPLE DESIRED / .85 = SAMPLE
• Average 25% for web
 FINAL SAMPLE DESIRED / .25 = SAMPLE
Size of Campus     Final Desired N
<600               All students
600-2,999          600
3,000-9,999        700
10,000-19,999      800
20,000-29,999      900
≥30,000            1,000
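The division rule above can be sketched as a small helper. The function name is hypothetical; the 85% and 25% response rates are the averages quoted on this slide, and the result is rounded up so the sample is never short.

```python
from math import ceil

PAPER_RESPONSE_RATE = 0.85  # average paper response rate quoted above
WEB_RESPONSE_RATE = 0.25    # average web response rate quoted above


def sample_to_draw(final_n_desired, expected_response_rate):
    """FINAL SAMPLE DESIRED / response rate = SAMPLE, rounded up."""
    return ceil(final_n_desired / expected_response_rate)


# A campus of 600-2,999 students wants 600 completed surveys
paper_sample = sample_to_draw(600, PAPER_RESPONSE_RATE)
web_sample = sample_to_draw(600, WEB_RESPONSE_RATE)
print(paper_sample, web_sample)
```

The web survey needs a much larger initial sample for the same final N because far fewer of the invited students respond.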
Bias and Error
 Systematic Error or Bias: unknown or
unacknowledged error created during
the design, measurement, sampling,
procedure, or choice of problem
studied
• Error tends to go in one direction
 Examples: Selection, Recall, Social
desirability
 Random Error
• Unrelated to true measures
 Example: Momentary fatigue
Reliability and Validity
 Reliability
• The extent to which a test is repeatable and
yields consistent scores
• Affected by random error
 Validity
• The extent to which a test measures what it is
supposed to measure
• A subjective judgment made on the basis of
experience and empirical indicators
• Asks “Is the test measuring what you think it’s
measuring?”
• Affected by systematic error/bias
Reliability vs. Validity
 In order to be valid, a test must be reliable;
but reliability does not guarantee validity.
Levels of Measurement
 Nominal
• Gender
 Male, Female
• Vaccinations
 Yes, No, Unsure
 Ordinal
• Personal health status
 Excellent, Very good, Good, Fair, Poor
• Last 30 days
 Never used, Not in last 30 days, 1-2 days, 3-5 days, 6-9 days, 10-19 days, 20-29 days, All 30 days
 Interval
• Body Mass Index (BMI)
 Ratio
• Number of drinks
• Number of sexual partners
• Perception percentages
• Blood alcohol concentration (BAC)
Biostatistics
 “It is commonly believed that anyone who
tabulates numbers is a statistician. This is
like believing that anyone who owns a
scalpel is a surgeon.”
• R. Hooke
 “Torture numbers, and they'll confess to
anything.”
• Gregg Easterbrook
 “98% of all statistics are made up.”
• Author Unknown
Types of Statistics
 Descriptive statistics
• Describe the basic features of data in a
study
• Provide summaries about the sample
and measures
 Inferential statistics
• Investigate questions, models, and
hypotheses
• Infer population characteristics based on
sample
• Make judgments about what we observe
Descriptive Statistics
 Central Tendency
• Mode
• Median
• Mean
 Variation
• Range
• Variance
• Standard Deviation
 Frequency
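Each of these measures is available in Python's standard library. The sketch below uses a small made-up list of "number of drinks" answers (not values from the dataset) purely to show what each statistic is.

```python
from collections import Counter
from statistics import mean, median, mode, stdev, variance

# Hypothetical responses to "how many drinks?" (illustration only)
drinks = [0, 2, 2, 3, 5, 8]

# Central tendency
drinks_mode = mode(drinks)      # most frequent value
drinks_median = median(drinks)  # middle value of the sorted data
drinks_mean = mean(drinks)      # arithmetic average

# Variation
drinks_range = max(drinks) - min(drinks)
drinks_variance = variance(drinks)  # sample variance (n - 1 denominator)
drinks_sd = stdev(drinks)           # square root of the sample variance

# Frequency: how many respondents gave each answer
frequency = Counter(drinks)
```

The standard deviation is usually reported instead of the variance because it is in the same units as the original variable.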
Descriptive Statistics
Examples
 Categorical Variables (Nominal/Ordinal)
Q1 Gen health
                          Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1 excellent          9145      16.9            17.0                 17.0
          2 very good         23767      43.9            44.2                 61.2
          3 good              16442      30.4            30.6                 91.8
          4 fair               3737       6.9             6.9                 98.7
          5 poor                565       1.0             1.1                 99.8
          6 don't know          132        .2              .2                100.0
          Total               53788      99.4           100.0
Missing   System                323        .6
Total                         54111     100.0
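The Valid Percent and Cumulative Percent columns can be rebuilt from the frequencies alone. A minimal sketch, assuming the counts are typed in by hand from the Q1 table; the variable names are illustrative.

```python
from itertools import accumulate

# Valid (non-missing) frequencies from the Q1 general health table
counts = {
    "excellent": 9145,
    "very good": 23767,
    "good": 16442,
    "fair": 3737,
    "poor": 565,
    "don't know": 132,
}

valid_n = sum(counts.values())  # excludes the 323 missing cases

# Valid percent: share of non-missing responses in each category
valid_percent = [round(100 * c / valid_n, 1) for c in counts.values()]

# Cumulative percent: running total of counts as a share of valid N
running_totals = accumulate(counts.values())
cumulative_percent = [round(100 * t / valid_n, 1) for t in running_totals]

for label, vp, cp in zip(counts, valid_percent, cumulative_percent):
    print(f"{label:<12}{vp:>6}{cp:>7}")
```

Cumulative percent is only meaningful here because the categories are ordinal; for a nominal variable the running total has no interpretation.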
Descriptive Statistics
Examples
 Categorical Variables (Nominal/Ordinal)
Q49 Year in school * Q46 Sex Crosstabulation
Cells show Count (% of Total)
Q49 Year in school          1 female         2 male           Total
1 1st year undergrad        7366 (14.5%)     4154 (8.2%)      11520 (22.7%)
2 2nd year under            6755 (13.3%)     3678 (7.2%)      10433 (20.6%)
3 3rd year under            6195 (12.2%)     3333 (6.6%)      9528 (18.8%)
4 4th year under            5192 (10.2%)     2676 (5.3%)      7868 (15.5%)
5 5th year or more under    1380 (2.7%)      985 (1.9%)       2365 (4.7%)
6 graduate                  5088 (10.0%)     3246 (6.4%)      8334 (16.4%)
7 adult special             203 (.4%)        105 (.2%)        308 (.6%)
8 other                     266 (.5%)        145 (.3%)        411 (.8%)
Total                       32445 (63.9%)    18322 (36.1%)    50767 (100.0%)
Descriptive Statistics
Examples
 Continuous Variables (Interval/Ratio)
Descriptive Statistics
                             N       Range   Minimum   Maximum   Mean      Std. Deviation   Variance
Q48 Weight in pounds         51935   534     52        586       153.16    35.791           1281.031
HT_INCH Height in Inches     52017   56.00   48.00     104.00    67.2035   4.01241          16.099
Q13 How many drinks          53374   88      0         88        4.42      4.401            19.370
Q12 Hours alcohol            53326   65      0         65        2.99      2.726            7.430
BAC Blood Alcohol Content    50604   2.47    .00       2.47      .0731     .08357           .007
Valid N (listwise)           50218
Hypotheses
 Null hypotheses
• Presumed true until statistical evidence
in the form of a hypothesis test indicates
otherwise
 There is no effect/relationship
 There is no difference in means
 Alternative hypotheses
• Tested using inferential statistics
 There is an effect/relationship
 There is a difference in means
Alpha, Beta, Power, Effect Size
 Alpha – probability of making a Type I error
• Reject null when null is true
• Level of significance (the cutoff the p value is compared against)
 Beta – probability of making a Type II error
• Fail to reject null when null is false
 Power – probability of correctly rejecting null
• 1 – Beta
 Effect Size
• Measure of the strength of the relationship between two variables

                       Null is true                        Null is false
Reject null            Alpha (Type I error)                1 – Beta (Power, CORRECT REJECTION)
Fail to reject null    1 – Alpha (CORRECT NONREJECTION)    Beta (Type II error)
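How alpha, beta, and power fit together can be shown with a two-sided z-test, which is simple enough to compute by hand. This is a sketch under stated assumptions: a known standard deviation, a normal test statistic, and a hypothetical true difference of 0.2 standard deviations. The function name is illustrative.

```python
from statistics import NormalDist


def two_sided_z_power(effect, sigma, n, alpha=0.05):
    """Probability of rejecting the null for a given true effect.

    effect: true difference between the population mean and the null value
    sigma:  population standard deviation (assumed known)
    n:      sample size
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = effect * n ** 0.5 / sigma           # effect in standard-error units
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


# When the null is true (effect = 0), the rejection rate is alpha itself
type_i_rate = two_sided_z_power(0.0, sigma=1.0, n=50)

# Same true effect, two sample sizes: power rises as n grows
power_n50 = two_sided_z_power(0.2, sigma=1.0, n=50)
power_n500 = two_sided_z_power(0.2, sigma=1.0, n=500)

beta_n50 = 1 - power_n50  # probability of a Type II error at n = 50
```

With samples in the tens of thousands, as in the examples that follow, power is so high that even very small effects are flagged as significant.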
Let’s test some
hypotheses!!!
Test of the mean of one
continuous variable
 College students report drinking an average of 5
drinks the last time they “partied”/socialized
• Hypotheses
 Ho: µ = 5
 HA: µ ≠ 5
• Test: Two-tailed t-test
• Result: Reject null
One-Sample Statistics
                   N       Mean   Std. Deviation   Std. Error Mean
How many drinks    53374   4.42   4.401            .019

One-Sample Test (Test Value = 5)
                   t         df      Sig. (2-tailed)   Mean Difference   95% CI of the Difference (Lower, Upper)
How many drinks    -30.352   53373   .000              -.578             -.62, -.54
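The t statistic and confidence interval can be approximately rebuilt from the summary numbers in the output. A minimal sketch, assuming the rounded values printed above; with 53,373 degrees of freedom the normal distribution is used as a stand-in for the t distribution.

```python
from math import sqrt
from statistics import NormalDist

# Summary values copied from the SPSS output (already rounded)
n = 53374
sd = 4.401
mean_difference = -0.578  # sample mean minus the test value of 5

# Standard error of the mean, then the t statistic
se = sd / sqrt(n)
t_stat = mean_difference / se

# Normal approximation to the t distribution (assumption: df is very large)
std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)
ci_lower = mean_difference - z_crit * se
ci_upper = mean_difference + z_crit * se
p_two_sided = 2 * std_normal.cdf(-abs(t_stat))

print(round(t_stat, 2), round(ci_lower, 2), round(ci_upper, 2))
```

The result differs slightly from SPSS in the later decimals because the inputs here are rounded. The interval excludes zero, which matches the decision to reject the null.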
Test of a single proportion of
one categorical variable
 20% of college students report their health is
excellent
• Hypotheses
 Ho: p = 0.20
 HA: p < 0.20 (one-tailed)
• Test: Z-test for a single proportion
• Result: Reject null
Binomial Test
                        Category   N       Observed Prop.   Test Prop.   Asymp. Sig. (1-tailed)
Gen health   Group 1    <= 1       9145    .170             .2           .000 a, b
             Group 2    > 1        44643   .830
             Total                 53788   1.000
a. Alternative hypothesis states that the proportion of cases in the first group < .2.
b. Based on Z Approximation.
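The Z approximation named in footnote b can be sketched from the counts in the table. This is an illustration only; SPSS may apply a continuity correction, which this sketch leaves out.

```python
from math import sqrt
from statistics import NormalDist

# Counts from the Binomial Test output
excellent = 9145  # Group 1: reported "excellent" health
n = 53788         # all valid responses
p_null = 0.20     # test proportion under the null hypothesis

p_hat = excellent / n

# Standard error uses the null proportion, not the observed one
se_null = sqrt(p_null * (1 - p_null) / n)
z_stat = (p_hat - p_null) / se_null

# One-tailed p value for the alternative "proportion < 0.20"
p_one_tailed = NormalDist().cdf(z_stat)

print(round(p_hat, 3), round(z_stat, 2))
```

A z statistic this far below zero gives a p value that prints as .000, so the null is rejected in favor of a proportion below 20%.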
Test of a relationship between
two continuous variables
 There is a relationship between the number of
drinks students report drinking the last time they
drank and the number of sex partners they have had
within the last school year
• Hypotheses
 Ho: ρ = 0
 HA: ρ ≠ 0
• Test: Pearson Product Moment Correlation
• Result: Reject null
Correlations
                                         How many drinks   Partners you had
How many drinks    Pearson Correlation   1                 .238**
                   Sig. (2-tailed)                         .000
                   N                     53374             52576
Partners you had   Pearson Correlation   .238**            1
                   Sig. (2-tailed)       .000
                   N                     52576             52896
**. Correlation is significant at the 0.01 level (2-tailed).
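Two quantities help read this output: the test statistic behind the significance value and the share of variance the correlation explains. A minimal sketch using the rounded r and the pairwise N from the table; the variable names are illustrative.

```python
from math import sqrt

r = 0.238   # Pearson correlation from the output
n = 52576   # cases with both variables present

# t statistic for testing Ho: rho = 0, with n - 2 degrees of freedom
t_stat = r * sqrt((n - 2) / (1 - r ** 2))

# Coefficient of determination: proportion of shared variance
r_squared = r ** 2

print(round(t_stat, 1), round(r_squared, 3))
```

The t statistic is very large because n is very large, yet r squared shows that number of drinks accounts for under 6% of the variation in number of partners.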
Test of the difference
between two means
 Men and women report significantly different
numbers of sexual partners over the past 12
months
• Hypotheses
 Ho: µ1 = µ2
 HA: µ1 ≠ µ2
• Test: Independent Samples t-test OR One-way ANOVA
• Result: Reject null
Group Statistics
                   Sex      N       Mean   Std. Deviation   Std. Error Mean
Partners you had   female   32687   1.34   2.017            .011
                   male     18474   1.82   3.627            .027

Independent Samples Test: Partners you had
Levene's Test for Equality of Variances: F = 867.978, Sig. = .000
t-test for Equality of Means
                              t         df          Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed       -19.360   51159       .000              -.483             .025                    -.532          -.434
Equal variances not assumed   -16.704   25065.988   .000              -.483             .029                    -.540          -.426
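Because Levene's test is significant, the "equal variances not assumed" row is the one to read. That row can be approximately rebuilt from the group statistics with Welch's formulas. A sketch assuming the rounded means and standard deviations printed in the output, so the result will not match SPSS to the last decimal.

```python
from math import sqrt

# Group statistics from the output (rounded)
n_f, mean_f, sd_f = 32687, 1.34, 2.017  # female
n_m, mean_m, sd_m = 18474, 1.82, 3.627  # male

# Each group's contribution to the variance of the mean difference
var_f = sd_f ** 2 / n_f
var_m = sd_m ** 2 / n_m

# Welch t statistic: no pooling of the two variances
se_diff = sqrt(var_f + var_m)
t_welch = (mean_f - mean_m) / se_diff

# Welch-Satterthwaite degrees of freedom
df_welch = (var_f + var_m) ** 2 / (
    var_f ** 2 / (n_f - 1) + var_m ** 2 / (n_m - 1)
)

print(round(se_diff, 3), round(t_welch, 2), round(df_welch))
```

The adjusted degrees of freedom come out close to the 25065.988 in the second row, well below the 51159 used when equal variances are assumed.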
Test of the difference
between two or more means
 Mean BAC reported differs across student residences
• Hypotheses
 Ho: µ1 = µ2 = µ3 = µ4 = µ5 = µ6
 HA: µi ≠ µj for at least one pair i, j
• Test: One-way ANOVA
• Result: Reject null
Descriptives
Blood Alcohol Content
                            N       Mean    Std. Deviation   Std. Error   95% CI Lower Bound   95% CI Upper Bound   Minimum   Maximum
residence hall              21285   .0741   .08215           .00056       .0730                .0752                .00       1.27
frat/sorority house         781     .1127   .09278           .00332       .1062                .1193                .00       .75
other university housing    3620    .0622   .07357           .00122       .0598                .0646                .00       1.41
off campus                  18151   .0773   .08539           .00063       .0760                .0785                .00       2.47
with parents                4279    .0606   .08490           .00130       .0581                .0631                .00       1.17
other                       2266    .0579   .08296           .00174       .0545                .0613                .00       1.26
Total                       50382   .0731   .08357           .00037       .0724                .0738                .00       2.47

ANOVA
Blood Alcohol Content
                  Sum of Squares   df      Mean Square   F        Sig.
Between Groups    3.188            5       .638          92.123   .000
Within Groups     348.695          50376   .007
Total             351.884          50381
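The F ratio is the between-groups mean square divided by the within-groups mean square, and the same sums of squares give a simple effect size. A sketch using the rounded values from the ANOVA table; eta squared is not part of the SPSS output and is added here only as an illustration.

```python
# Sums of squares and degrees of freedom from the ANOVA table (rounded)
ss_between, df_between = 3.188, 5
ss_within, df_within = 348.695, 50376

# Mean squares: sum of squares divided by its degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F ratio: variation between group means relative to variation within groups
f_ratio = ms_between / ms_within

# Eta squared: share of total variation explained by residence
eta_squared = ss_between / (ss_between + ss_within)

print(round(f_ratio, 2), round(eta_squared, 4))
```

The F ratio is large and highly significant, but residence explains less than 1% of the variation in reported BAC.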
Test of the difference
between two or more means
Multiple Comparisons
Dependent Variable: Blood Alcohol Content
Games-Howell
(I) Currently live          (J) Currently live          Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower Bound   95% CI Upper Bound
residence hall              frat/sorority house         -.03865*                .00337       .000   -.0483               -.0290
residence hall              other university housing    .01190*                 .00135       .000   .0081                .0157
residence hall              off campus                  -.00316*                .00085       .003   -.0056               -.0007
residence hall              with parents                .01350*                 .00141       .000   .0095                .0175
residence hall              other                       .01623*                 .00183       .000   .0110                .0215
frat/sorority house         residence hall              .03865*                 .00337       .000   .0290                .0483
frat/sorority house         other university housing    .05055*                 .00354       .000   .0404                .0606
frat/sorority house         off campus                  .03548*                 .00338       .000   .0258                .0451
frat/sorority house         with parents                .05215*                 .00356       .000   .0420                .0623
frat/sorority house         other                       .05488*                 .00375       .000   .0442                .0656
other university housing    residence hall              -.01190*                .00135       .000   -.0157               -.0081
other university housing    frat/sorority house         -.05055*                .00354       .000   -.0606               -.0404
other university housing    off campus                  -.01506*                .00138       .000   -.0190               -.0111
other university housing    with parents                .00160                  .00178       .947   -.0035               .0067
other university housing    other                       .00433                  .00213       .323   -.0017               .0104
off campus                  residence hall              .00316*                 .00085       .003   .0007                .0056
off campus                  frat/sorority house         -.03548*                .00338       .000   -.0451               -.0258
off campus                  other university housing    .01506*                 .00138       .000   .0111                .0190
off campus                  with parents                .01667*                 .00144       .000   .0125                .0208
off campus                  other                       .01940*                 .00185       .000   .0141                .0247
with parents                residence hall              -.01350*                .00141       .000   -.0175               -.0095
with parents                frat/sorority house         -.05215*                .00356       .000   -.0623               -.0420
with parents                other university housing    -.00160                 .00178       .947   -.0067               .0035
with parents                off campus                  -.01667*                .00144       .000   -.0208               -.0125
with parents                other                       .00273                  .00217       .809   -.0035               .0089
other                       residence hall              -.01623*                .00183       .000   -.0215               -.0110
other                       frat/sorority house         -.05488*                .00375       .000   -.0656               -.0442
other                       other university housing    -.00433                 .00213       .323   -.0104               .0017
other                       off campus                  -.01940*                .00185       .000   -.0247               -.0141
other                       with parents                -.00273                 .00217       .809   -.0089               .0035
*. The mean difference is significant at the .05 level.
Test for a relationship
between two categorical variables
 Is there an association between being a member
of a fraternity/sorority and ever being diagnosed
with depression?
• Hypotheses
 Ho: There is no association between being a member of a
fraternity/sorority and ever being diagnosed with
depression.
 HA: There is an association between being a member of a
fraternity/sorority and ever being diagnosed with
depression.
• Test: Chi-square test for independence
• Result: Fail to reject null
Test for relationship
between two categorical variables
Ever - Depression * Frat or sorority? Crosstabulation
                                      Frat or sorority?
Ever - Depression                     yes       no         Total
yes       Count                       681       7692       8373
          Expected Count              715.6     7657.4     8373.0
no        Count                       3744      39657      43401
          Expected Count              3709.4    39691.6    43401.0
Total     Count                       4425      47349      51774
          Expected Count              4425.0    47349.0    51774.0

Chi-Square Tests
                                Value     df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square              2.185 b   1    .139
Continuity Correction a         2.122     1    .145
Likelihood Ratio                2.211     1    .137
Fisher's Exact Test                                                    .141                   .073
Linear-by-Linear Association    2.185     1    .139
N of Valid Cases                51774
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 715.62.
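The Pearson chi-square value can be recomputed from the observed counts alone. A minimal sketch, assuming the four cell counts from the crosstabulation; the p value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so this shortcut applies only to 2x2 tables.

```python
from math import erfc, sqrt

# Observed counts: rows = ever diagnosed with depression (yes, no),
# columns = fraternity/sorority member (yes, no)
observed = [
    [681, 7692],
    [3744, 39657],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Upper-tail p value for 1 degree of freedom
p_value = erfc(sqrt(chi_square / 2))

print(round(expected[0][0], 1), round(chi_square, 3), round(p_value, 3))
```

The p value is above .05, which is why the result on the previous slide is "fail to reject null".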
Important Points to Remember
 A significant association does not
indicate causation
 Statistical significance is not always
the same as practical significance
 Multiple factors contribute to whether
your results are significant (sample size,
effect size, variability, alpha level)
 It gets easier and easier as you
practice!
Questions???