HOW TO GO ABOUT DATA GENERATION AND ANALYSIS // INDEFINITE
GAMES AND DISCOUNT RATES ELICITATION
(Lecture 12, 2009_03_11)
### Please recall my remarks in Lecture 1 about the nature of the lecture notes.
### Outlines of „lit reviews” ?
### Review (in somewhat reversed order):
On to public good provision (in the lab)
Results: see lecture notes …
This is a chapter in Cherry et al. (2008), Environmental Economics, Experimental
Methods, Routledge.
- Typically, parameterized/designed so that each player has a dominant
strategy of not contributing (to the public account)
In one-shot (single-round) VCM experiments, subjects contribute –
contrary to the theoretical prediction – about 40%–60%.
In finitely-repeated VCM experiments, subjects contribute about the same
initially, but contributions then decline towards zero (though rarely all the way to zero).
“Thus, there seem to be motives for contributing that outweigh the
incentive to free ride” (CFV 194)
Possible “motives”: “pure altruism”, “warm-glow” (also called, “impure
altruism”), “conditional cooperation”, “confusion”
“Confusion” describes individuals’ failure to identify (in the laboratory setup) the dominant strategy of no contribution.
- Finding: every study that looks for confusion finds that it plays a significant role in
observed contributions.
o “The level of confusion in all experiments is both substantial and
troubling.” (p. 196)
o “The experiments provide evidence that confusion is a confounding
factor in investigations that discriminate among motives for public
contributions, … “ (p. 196)
- Solutions:
o Increase monetary rewards in VCM experiments! (inadequate
monetary rewards having been identified as a potential cause of
contributions provided out of confusion)
o Make sure instructions are understandable! (poorly prepared
instructions having been identified as a possible source of confusion)
o Make sure, more generally, that subjects manage to identify the
dominant strategy! (the inability of subjects to decipher the
dominant strategy having been identified as a possible source of
confusion)
o “Our results call into question the standard, “context-free”
instructions used in public good games.” (p. 208)
- Some graphs: [not reproduced in these notes]
A SHORT LIST OF PRACTICAL POINTS ON HOW TO GO ABOUT DATA
GENERATION AND ANALYSIS (see p. 64 of Friedman & Cassar, whose chapter 5
I will send you if you send me a message):
All true!
Also true (well, at least in my view):
You have a problem if you have to torture the data to make them confess!
Ideally, you should be able to tell a story (your story!) by way of descriptive data
(graphs, summary statistics). Tufte (1983) is a good book indeed. But even
simpler, take note of what you like in articles you read.
And, yes, it is very important to go through the „qualitative phase“ and really
understand the data. Such an analysis will help you spot unexpected
(irr)regularities (recording mistakes, computer malfunctions, human choice
idiosyncrasies, etc.). [Because of the ease with which programs like STATA can
be used nowadays, some experimenters are tempted to skip this step. Bad idea!]
Importantly, IT IS WORTH BEARING IN MIND THAT STATISTICAL TESTS ARE
UNABLE TO DETECT FLAWS IN EXPERIMENTAL DESIGN AND
IMPLEMENTATION … (Ondrej’s write-up p. 7)
A little excursion of a more general nature:
The following text has been copied and edited from a wonderful homepage that
I recommend to you enthusiastically: http://www.socialresearchmethods.net/
It’s an excellent resource for social science research methods.
Four interrelated components that influence the conclusions you might reach
from a statistical test in a research project. <snip>
The four components are:
sample size, or the number of units (e.g., people) accessible to the study
effect size, or the salience of the treatment relative to the noise in
measurement
alpha level (α, or significance level), or the odds that the observed result is
due to chance
power, or the odds that you will observe a treatment effect when it occurs
Given values for any three of these components, it is possible to compute the
value of the fourth. <snip>
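To make the "any three determine the fourth" point concrete, here is a minimal sketch in Python (the statsmodels package is assumed to be available; the effect size, alpha, and power values are purely illustrative, not from the lecture). It solves for the per-group sample size of a two-sample t test, and then for the power implied by a smaller sample.

    # Sketch: given three of {sample size, effect size, alpha, power}, solve for the fourth
    # (two-sample t test with equal group sizes).
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Subjects per group needed for effect size d = 0.5, alpha = .05, power = .80
    n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
    print(round(n_per_group))                  # roughly 64 per group

    # Conversely: the power achieved with only 30 subjects per group
    print(round(analysis.power(effect_size=0.5, nobs1=30, alpha=0.05), 2))   # well below .80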
Figure 1 shows the basic decision matrix involved in a statistical conclusion. All
statistical conclusions involve constructing two mutually exclusive hypotheses,
termed the null (labeled H0) and alternative (labeled H1) hypothesis. Together,
the hypotheses describe all possible outcomes with respect to the inference. The
central decision involves determining which hypothesis to accept and which to
reject.
For instance, in the typical case, the null hypothesis might be:
H0: Program Effect = 0
while the alternative might be
H1: Program Effect ≠ 0
<snip>
Figure 1 below is a complex figure that you should take some time studying. <snip>
Type I [false positive] is the same as the α or significance level and labels the odds of finding a
difference or effect by chance alone. (Is there a psychological, or reporting, bias here?)
Type II [false negative] suggests that you find that the program was not demonstrably effective.
(There may be a psychological bias here too, but probably a healthy one.)
We accept the null hypothesis (H0) and reject the alternative hypothesis (H1). We say: "There is no relationship", "There is no difference, no gain", "Our theory is wrong."
- In reality, H0 is true (there is no relationship, no difference or gain; our theory is wrong): probability 1-α (e.g., .95) – THE CONFIDENCE LEVEL. The odds of saying there is no relationship, difference, or gain when in fact there is none; the odds of correctly not confirming our theory. 95 times out of 100, when there is no effect, we'll say there is none.
- In reality, H0 is false (there is a relationship, a difference or gain; our theory is correct): probability β (e.g., .20) – TYPE II ERROR. The odds of saying there is no relationship, difference, or gain when in fact there is one; the odds of not confirming our theory when it's true. 20 times out of 100, when there is an effect, we'll say there isn't.

We reject the null hypothesis (H0) and accept the alternative hypothesis (H1). We say: "There is a relationship", "There is a difference or gain", "Our theory is correct."
- In reality, H0 is true: probability α (e.g., .05) – TYPE I ERROR (SIGNIFICANCE LEVEL). The odds of saying there is a relationship, difference, or gain when in fact there is not; the odds of confirming our theory incorrectly. 5 times out of 100, when there is no effect, we'll say there is one. We should keep this small when we can't afford/risk wrongly concluding that our program works.
- In reality, H0 is false: probability 1-β (e.g., .80) – POWER. The odds of saying there is a relationship, difference, or gain when in fact there is one; the odds of confirming our theory correctly. 80 times out of 100, when there is an effect, we'll say there is. We generally want this to be as large as possible.

Figure 1. The Statistical Inference Decision Matrix
See worksheet for L 11.
the lower the α, the lower the power; the higher the α, the higher the power
the lower the α, the less likely it is that you will make a Type I Error (i.e., reject
the null when it’s true)
the lower the α, the more "rigorous" the test
an α of .01 (compared with .05 or .10) means the researcher is being relatively
careful, s/he is only willing to risk being wrong 1 in a 100 times in rejecting the null
when it’s true (i.e., saying there’s an effect when there really isn’t)
an α of .01 (compared with .05 or .10) limits one’s chances of ending up in the bottom
row, of concluding that the program has an effect. This means that both your
statistical power and the chances of making a Type I Error are lower.
an α of .01 means you have a 99% chance of saying there is no difference when
there in fact is no difference (being in the upper left box)
increasing α (e.g., from .01 to .05 or .10) increases the chances of making a Type I
Error (i.e., saying there is a difference when there is not), decreases the chances of
making a Type II Error (i.e., saying there is no difference when there is) and
decreases the rigor of the test
increasing α (e.g., from .01 to .05 or .10) increases power because one will be
rejecting the null more often (i.e., accepting the alternative) and, consequently, when
the alternative is true, there is a greater chance of accepting it (i.e., power)
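A quick numerical check of the last two points, as a sketch under the normal approximation for a two-sided two-sample test (the effect size and group size below are made up; only scipy is used):

    # Sketch: approximate power of a two-sided two-sample z test for several alpha levels.
    from scipy.stats import norm

    def approx_power(effect_size, n_per_group, alpha):
        z_crit = norm.ppf(1 - alpha / 2)                 # two-sided critical value
        ncp = effect_size * (n_per_group / 2) ** 0.5     # noncentrality parameter
        return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)

    for alpha in (0.01, 0.05, 0.10):
        print(alpha, round(approx_power(0.5, 30, alpha), 2))
    # Power rises as alpha rises: exactly the alpha/power tradeoff described above.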
A little excursion of a more general nature:
Robert M. Becker at Cornell University illustrates these concepts masterfully, and
entertainingly, by way of the OJ Simpson trial (note that this is actually a very
nice illustration of the advantages of contextualization although there may be
order effects here /):
http://www.socialresearchmethods.net/OJtrial/ojhome.htm
H0: OJ Simpson was innocent
(although our theory is that in fact he was guilty as charged)
HA: Guilty as charged (double murder)
Can H0 be rejected, at a high level of confidence? I.e. …
Type I error? Returning a guilty verdict when the defendant is innocent.
Type II error? Returning a not guilty verdict when the defendant is guilty.
The tradeoff (The Jury’s Dilemma): Do we want to make sure
we put guilty people in jail (that would mean choosing a higher α =
having less stringent demands on the evidence needed),
or do we want to keep innocent people out of jail (that would mean choosing a lower α =
having higher demands on the evidence needed)?
Ondrej’s appendix to my book …
1. Introduction
1.1 Descriptive statistics
Descriptive statistics - tools for presenting various characteristics of subjects’ behavior as
well as their personal characteristics in the form of tables and graphs, and with methods
of summarizing the characteristics by measures of central tendency, variability, and so
on.
One normally observes variation in characteristics between (or across) subjects, but
sometimes also within subjects – for example, if subjects’ performance varies from round
to round of an experiment.
Inferential statistics - formal statistical methods of making inferences (i.e., conclusions)
or predictions regarding subjects’ behavior.
Types of variables (Stevens 1946)
categorical variables (e.g., gender, or field of study)
ordinal variables (e.g., performance rank)
interval variables (e.g., wealth or income bracket)
ratio variables (e.g., performance score, or the number of subjects choosing
option A rather than option B).
Different statistical approaches may be required by different types of variables.
1.1.1 Measures of central tendency and variability
Measures of central tendency
- the (arithmetic) mean (the average of a variable’s values)
- the mode (the most frequently occurring value(s) of a variable)
- the median (the middle-ranked value of a variable)
– useful when the variable’s distribution is asymmetric or contains outliers
Measures of variability
- the variance (the average of the squared deviations of a variable’s values from the
variable’s arithmetic mean)
- an unbiased estimate of the population variance, ŝ² = n·s²/(n−1), where s² is the sample
variance as defined in words directly above, and n is the number of observations on the
variable under study
- the standard deviation (the square root of the variance)
- the range (the difference between a variable’s highest and lowest value)
- the interquartile range (the difference between a variable’s values at the third quartile
(i.e., the 75th percentile) and the first quartile (i.e., the 25th percentile))
- Furthermore, … measures assessing the shape of a variable’s distribution – such as the
degree of symmetry (skewness) and peakedness (kurtosis) of the distribution – are useful
when comparing the variable’s distribution to a theoretical probability distribution (such
as the normal distribution, which is symmetric and moderately peaked).
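A small Python sketch of these summary measures on a made-up sample (the numbers are purely illustrative); it also spells out the biased sample variance s² versus the unbiased estimate ŝ² = n·s²/(n−1) defined above, and shows how an outlier pulls the mean away from the median.

    # Sketch: measures of central tendency, variability and shape for a toy sample.
    import statistics
    import numpy as np
    from scipy import stats

    x = np.array([2, 3, 3, 4, 5, 5, 5, 7, 9, 21])        # made-up data with one outlier
    n = len(x)

    print("mean:", x.mean(), "median:", np.median(x), "mode:", statistics.mode(x.tolist()))
    s2 = x.var(ddof=0)                                    # sample variance s^2 (divides by n)
    print("s^2:", s2, "unbiased n*s^2/(n-1):", n * s2 / (n - 1), "=", x.var(ddof=1))
    print("standard deviation:", x.std(ddof=1), "range:", x.max() - x.min())
    q1, q3 = np.percentile(x, [25, 75])
    print("interquartile range:", q3 - q1)
    print("skewness:", stats.skew(x), "kurtosis:", stats.kurtosis(x))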
1.1.2 Tabular and graphical representation of data
ALWAYS inspect the data by visual means before conducting formal statistical tests!
And do it on as disaggregated level as possible!
1.2 Inferential statistics
We use a sample statistic such as the sample mean to make inferences about an (unknown)
population parameter such as the population mean.1
The difference between the two is the sampling error; it decreases with larger sample size.
Sample statistics draw on measures of central tendency and variability, so the fields of
descriptive and inferential statistics are closely related: A sample statistic can be used for
summarizing sample behavior as well as for making inferences about a corresponding
population parameter.
1.2.1 Hypothesis testing (as opposed to estimation of population parameters – see 2.1.1.)
classical hypothesis testing model
H0, of no effect (or no difference) versus
H1, of the presence of an effect (or presence of a difference)
where H1 is stated as either nondirectional (two-tailed) if no prediction is made about the direction
of the effect or difference, or directional (one-tailed) if a prediction is made (researchers sometimes
speak of two-tailed and one-tailed statistical tests, respectively).
A more conservative approach is to use a nondirectional (two-tailed) H1.
Can we reject H0 in favor of H1?
Example: Two groups of subjects facing different experimental conditions:
Does the difference in experimental conditions affect subjects’ average performance?
H0: µ1 = µ2 and H1: µ1 ≠ µ2, or H1: µ1 > µ2 or H1: µ1 < µ2, if we have theoretical or
practical reasons for entertaining a directional research hypothesis,
where µi denotes the mean performance of subjects in Population i from which Sample i
was drawn. How confident are we about our conclusion?
1 As further discussed below, random sampling is important for making a sample representative of the
population we have in mind, and consequently for drawing valid conclusions about population parameters
based on sample statistics. Recall the problematic recruiting procedure in Hoelzl and Rustichini (2005) and
Harrison et al.’s (2005) critique of the unbalanced subject pools in Holt & Laury (2002).
1.2.2 The basics of inferential statistical tests
- compute a test statistic based on sample data
- compare it to the theoretical probability distribution of the test statistic, constructed
assuming that H0 is true
- If the computed value of the test statistic falls in the extreme tail(s) of the
theoretical probability distribution – the tail(s) being delimited from the rest of the
distribution by the so called critical value(s) – conclude that H0 is rejected in
favor of H1; otherwise conclude that H0 of no effect (or no difference) cannot be
rejected. By rejecting H0, we declare that the effect on (or difference in) behavior
observed in our subject sample is statistically significant, meaning that the effect
(or difference) is highly unlikely due to chance (i.e., random variation) but rather
due to some systematic factors.
By convention, level of statistical significance (or significance level), α,
often set at 5% (α=.05), sometimes at 1% (α=.01) or 10% (α=.10).
Alternatively, one may instead (or additionally) wish to report the exact probability value
(or p-value), p, at which statistical significance would be declared.
The significance level at which H0 is evaluated and the type of H1 (one-tailed or two-tailed) ought to be chosen (i.e., predetermined) by the researcher prior to conducting the
statistical test or even prior to data collection.
The critical values of common theoretical probability distributions of test statistics, for
various significance levels and both types of H1, are usually listed in special tables in
appendices of statistics (text) books and in Appendix X of Ondrej’s chapter.
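Critical values can also be pulled from any statistical package instead of printed tables; a brief sketch using Python's scipy (the significance level and degrees of freedom are illustrative):

    # Sketch: critical values from scipy instead of printed tables.
    from scipy import stats

    alpha = 0.05
    print(stats.norm.ppf(1 - alpha / 2))       # two-tailed z critical value, about 1.96
    print(stats.t.ppf(1 - alpha / 2, df=24))   # two-tailed t critical value with 24 df
    print(stats.t.ppf(1 - alpha, df=24))       # one-tailed t critical value with 24 df
    print(stats.chi2.ppf(1 - alpha, df=6))     # upper-tail chi-square critical value, about 12.59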
1.2.3 Type I and Type II errors, power of a statistical test, and effect size
Lowering α (for a given H1)
- increases the probability of a Type II error, β, which is committed when a false H0 is
erroneously accepted despite H1 being true.
- decreases the power of a statistical test, 1- β, the probability of rejecting a false H0.
Thus, in choosing a significance level at which to evaluate H0, one faces a tradeoff
between the probabilities of committing the above statistical errors.
Other things equal, the larger the sample size and the smaller the sampling error, the
higher the likelihood of rejecting H0 and hence the higher the power of a statistical test.
The probability of committing a Type II error as well as the power of a statistical test can
only be determined after specifying the value of the relevant population parameter(s)
under H1.
Other things equal, the test’s power increases the larger the difference between the values
of the relevant population parameter(s) under H0 and H1.
This difference, when expressed in standard deviation units of the variable under study, is
sometimes called the effect size (or Cohen’s d index).
Especially in the context of parametric statistical tests, some scientists prefer to do a
power-planning exercise prior to conducting an experiment: After specifying a minimum
effect size they wish to detect in the experiment, they determine such a sample size that
yields what they deem to be sufficient power of the statistical test to be used.
Note, however, that one may not know a priori which statistical test is most appropriate
and thus how to perform the calculation. In addition, existing criteria for identifying what
constitutes a large or small effect size are rather arbitrary (Cohen (1977) proposes that d
greater than 0.8 (0.5, 0.2.) standard deviation units represents a large (medium, small)
effect size).
Other things equal, however, the smaller the (expected) effect size, the larger the sample
size required to yield a sufficiently powerful test capable of detecting the effect. See, e.g.,
[S] pp. 164-173 and pp. 408-412 for more details.
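As a small illustration of the effect-size idea, a sketch of Cohen's d computed from two samples via the pooled standard deviation (the data are invented, not from any study discussed here):

    # Sketch: Cohen's d for two independent samples.
    import numpy as np

    treatment = np.array([12.1, 14.3, 11.8, 15.0, 13.2, 12.7])   # made-up data
    control = np.array([10.9, 11.5, 12.0, 10.2, 11.8, 11.1])

    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
    d = (treatment.mean() - control.mean()) / pooled_var ** 0.5
    print(round(d, 2))   # compare against Cohen's rough 0.2 / 0.5 / 0.8 benchmarks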
Criticisms of the classical hypothesis testing model:
Namely, with a large enough sample size, one can almost always obtain a statistically
significant effect, even for a negligible effect size (by similar token, of course, a
relatively large effect size may turn out statistically insignificant in small samples).
Yet if one statistically rejects H0 in a situation where the observed effect size is
practically or theoretically negligible, one is in a practical sense committing a Type I
error. For this reason, one should strive to assess whether or not the observed effect size –
i.e., the observed magnitude of the effect on (or difference in) behavior – is of any
practical or theoretical significance. To do so, some researchers prefer to report what is
usually referred to as the magnitude of treatment effect, which is also a measure of effect
size (and is in fact related to Cohen’s d index). We discuss the notion of treatment effect
in Sections 2.2.1 and 2.3.1, and see also [S] pp.1037-1061 for more details.
Another criticism: improper use, particularly in relation to the true likelihood of
committing a Type I and Type II error. Within the context of a given research hypothesis,
statistical comparisons and their significance level should be specified prior to
conducting the tests. If additional unplanned tests are conducted, the overall likelihood of
committing a Type I error in such an analysis is inevitably inflated well beyond the α
significance level prespecified for the additional tests. For explanation, and possible
remedies, see Ondrej’s text.
Alternatives:
the minimum-effect hypothesis testing model
the Bayesian hypothesis testing model
See Cohen (1994), Gigerenzer (1993), and [S] pp. 303-350 for more details; see also Ondrej’s text.
1.3 The experimental method and experimental design
How experimental economists and other scientists design experiments to evaluate
research hypotheses.
Proper design and execution of your experiment ensure reliability of your data and hence
also the reliability of your subsequent statistical inference. (Statistical tests are unable to
detect flaws in experimental design and implementation.)
A typical research hypothesis involves a prediction about a causal relationship between
an independent and a dependent variable (e.g., the effect of financial incentives on risk
aversion, or on effort, etc.)
A common experimental approach to studying the relationship is to compare the behavior
of two groups of subjects: the treatment (or experimental) group and the control (or
comparison) group.
The independent variable is the experimental conditions – manipulated by the
experimenter – that distinguish the treatment and control groups (one can have more than
one treatment group and hence more than two levels of the independent variable).
The dependent variable is the characteristic of subjects’ behavior predicted by the
research hypothesis to depend on the level of the independent variable (one can also have
more than one dependent variable).
In turn, one uses an appropriate inferential statistical test to evaluate whether there indeed
is a statistically significant difference in the dependent variable between the treatment
and control groups.
What we describe above is commonly referred to as a true experimental design,
characterized by a random assignment of subjects into the treatment and control groups
(i.e., there exists at least one adequate control group) and by the independent variable
being exogenously manipulated by the experimenter in line with the research hypothesis.
These characteristics jointly limit the influence of confounding factors and thereby
maximize the likelihood of the experiment having internal validity, which is achieved to
the extent that observed differences in the dependent variable can be unambiguously
attributed to a manipulated independent variable.
Confounding factors are variables systematically varying with the independent variable
(e.g., yesterday’s seminar?), which may produce a difference in the dependent variable
that only appears like a causal effect. Unlike true experiments, other types of experiments
conducted outside the laboratory – such as what are commonly referred to as natural
experiments – exercise less or no control over random assignment of subjects and
exogenous manipulation of the independent variable, and hence are more prone to the
potential effect of confounding variables and have lower internal validity.
Random assignment of subjects conveniently maximizes the probability of obtaining
control and treatment groups equivalent with respect to potentially relevant individual
differences such as demographic characteristics. As a result, any difference in the
dependent variable between the treatment and control groups is most likely attributable to
the manipulation of the independent variable and hence to the hypothesized causal
relationship. Nevertheless, the equivalence of the control and treatment groups is rarely
achieved in practice, and one should control for any differences between the control and
treatment groups if deemed necessary (e.g., as illustrated in Harrison et al 2005, in their
critique of Holt & Laury 2002).
Similarly, one should not simply assume that a subject sample is drawn randomly and
hence is representative of the population under study. Consciously or otherwise, we often
deal with nonrandom samples. Volunteer subjects, or subjects selected based on their
availability at the time of the experimental sessions, are unlikely to constitute true
random samples but rather convenience samples. As a consequence, the external validity
of our results – i.e., the extent to which our conclusions generalize beyond the subject
sample(s) used in the experiment – may suffer.2
Choosing an appropriate experimental design often involves tradeoffs. One must pay
attention to the costs of the design in terms of the number of subjects and the amount of
money and time required, to whether the design will yield reliable results in terms of
internal and external validity, and to the practicality of implementing the design.
In other words, you may encounter practical, financial or ethical limitations preventing
you from employing the theoretically best design in terms of the internal and external
validity.
1.4 Selecting an appropriate inferential statistical test
- Determine whether the hypothesis (and hence your data set) involves one or more
samples.
Single sample: use a single-sample statistical test to test for the absence or
presence of an effect on behavior, along the lines described in the first example in
Section 1.2.1.
Two samples: use a two-sample statistical test for the absence or presence of a
difference in behavior, along the lines described in the second example in Section
1.2.1.
Most common single- and two-sample statistical tests in Sections 2 to 6; other
statistical tests and procedures intended for two or more samples are not discussed
in this book but can be reviewed, for example, in [S] pp. 683-898.
2 In the case of convenience samples, one usually does not know the probability of a subject being selected.
Consequently, one cannot employ methods of survey research that use known probabilities of subjects’
selection to correct for the nonrandom selection and thereby make the sample representative of the
population. One should rather employ methods of correcting for how subjects select into participating in
the experiment (see, e.g., Harrison et al., UCF WP 2005, forthcoming in JEBO?).
When making a decision on the appropriate two-sample test, one first needs to determine
whether the samples – usually the treatment and control groups/conditions in the context
of the true experimental design described in Section 1.3 – are independent or dependent.
Independent samples design (or between-subjects design, or randomized-groups design) –
where subjects are randomly assigned to two or more experimental and control groups –
one employs a test for two (or more) independent samples.
Dependent samples design (or within-subjects design, or randomized-blocks design) –
where each subject serves in each of the k experimental conditions, or, in the matched-subjects design, each subject is matched with one subject from each of the other (k-1)
experimental conditions based on some observable characteristic(s) believed to be
correlated with the dependent variable – one employs a test for dependent samples.
One needs to ensure internal validity of the dependent samples design by controlling for
order effects (so that differences between experimental conditions do not arise solely
from the order of their presentation to subjects), and, in the matched-subjects design, by
ensuring that matched subjects are closely similar with respect to the matching
characteristic(s) and (within each pair) are assigned randomly to the experimental
conditions.
Finally, in factorial designs, one simultaneously evaluates the effect of several
independent variables (factors) and conveniently also their interactions, which usually
requires using a test for factorial analysis of variance or other techniques (which are not
discussed in this book but can be reviewed, e.g., in [S] pp.900-955).
Sections 2 to 6: we discuss the most common single- and two-sample parametric and
nonparametric inferential statistical tests.
The parametric label is usually used for tests that make stronger assumptions about the
population parameter(s) of the underlying distribution(s) for which the tests are
employed, as compared to non-parametric tests that make weaker assumptions (for this
reason, the non-parametric label may be slightly misleading since nonparametric tests are
rarely free of distributional and other assumptions).
Some researchers instead prefer to make the parametric-nonparametric distinction based
on the type of variables analyzed by the tests, with nonparametric tests analyzing
primarily categorical and ordinal variables with lower informational content (see Section
1.1).
Behind the alternative classification is the widespread (but not universal) belief that a
parametric test is generally more powerful than its nonparametric counterpart provided
the assumption(s) underlying the former test are satisfied, but that a violation of the
assumption(s) calls for transforming the data into a format of (usually) lower
informational content and analyzing the transformed data by a nonparametric test.
Alternatively, … use parametric tests even if some of their underlying assumptions are
violated, but make adjustments to the test statistic to improve its reliability.
While the reliability and validity of statistical conclusions depends on using appropriate
statistical tests, one often cannot fully validate the assumptions underlying specific tests
and hence faces the risk of making wrong inferences. For this reason, one is generally
advised to conduct both parametric and nonparametric tests to evaluate a given statistical
hypothesis, and – especially if results of alternative tests disagree – to conduct multiple
experiments evaluating the research hypothesis under study and jointly analyze their
results by using meta-analytic procedures. See, e.g., [S] pp.1037-1061 for further details.
Of course, that’s only possible if you have enough resources.
2. t tests for evaluating a hypothesis about population mean(s)
2.1 Single-sample t test (and z test)
The single-sample t test and z test are similar parametric statistical tests usually employed
for interval or ratio variables (see Section 1.1). They evaluate the null hypothesis of
whether, in the population that our sample represents, the variable under study has a
population mean equal to a specific value, µ.
If the value of the variable’s population variance is known and the sample size is
relatively large (usually deemed at least n>25 but sometimes n>40) one can employ a
single-sample z test, the test statistic of which follows the standard normal probability
distribution. However, since the population variance is rarely known (and hence must be
estimated using the sample standard deviation) and experimental datasets tend to be
rather small, it is usually more appropriate to use the t test which relaxes the two
assumptions and is based on the t probability distribution (also called the Student’s t
distribution, which approaches the standard normal probability distribution as n
approaches infinity).
The assumptions behind the t test are that
(1) the subject sample has been drawn randomly from the population it represents, and
(2) in the population the sample represents, the variable under study is normally
distributed. When the normality assumption is violated (see Section 6.1 for tests of
normality), the test’s reliability may be compromised and one may prefer to instead use
alternative nonparametric tests, such as the Wilcoxon signed-ranks test (see Section 3.1)
or the binomial sign test (see section 3.3 where we also discuss its application referred to
as the single-sample test for the median).
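A hedged sketch of how these single-sample tests might be run in Python (the lecture's own examples use Stata's "ttest"; the observations and the hypothesized mean below are invented):

    # Sketch: single-sample t test and its nonparametric alternatives on toy data.
    import numpy as np
    from scipy import stats

    x = np.array([3.1, 2.4, 4.0, 3.6, 2.9, 3.3, 4.1, 2.8])   # made-up observations
    mu0 = 3.0                                                 # hypothesized population mean

    print(stats.ttest_1samp(x, popmean=mu0))                  # single-sample t test
    print(stats.wilcoxon(x - mu0))                            # Wilcoxon signed-ranks test (Section 3.1)
    # Binomial sign test (Section 3.3): how many observations lie above mu0?
    n_above = int((x > mu0).sum())
    n_nonzero = int((x != mu0).sum())
    print(stats.binomtest(n_above, n_nonzero, p=0.5))         # requires scipy >= 1.7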
2.1.1 Estimation of confidence intervals for the single-sample t test
As mentioned in Section 1.2.1, another methodology of inferential statistics besides
hypothesis testing is estimation of population parameters. A common method is interval
estimation of the so called confidence interval for a population parameter – a range of
values that, with a high degree of confidence, contains the true value of the parameter.
For example, a 95% confidence interval contains the population parameter with the
probability of 0.95.
Confidence interval estimation uses the same sample information as hypothesis testing
(and in fact tends to be viewed as part of the classical hypothesis testing model).
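A minimal sketch of such an interval estimate, a 95% confidence interval for a population mean based on the t distribution (toy data, not from the lecture):

    # Sketch: 95% confidence interval for the mean, based on the t distribution.
    import numpy as np
    from scipy import stats

    x = np.array([3.1, 2.4, 4.0, 3.6, 2.9, 3.3, 4.1, 2.8])   # made-up observations
    ci = stats.t.interval(0.95, len(x) - 1, loc=x.mean(), scale=stats.sem(x))
    print(ci)   # a range that contains the population mean with probability .95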
2.2 t test for two independent samples
The assumptions behind the t test for two independent samples are that (1) each sample
has been drawn randomly from the population it represents, (2) in the populations that the
samples represent, the variable under study is normally distributed, and (3) in the
populations that the samples represent, the variances of the variable are equal (also
known as the homogeneity of variance assumption).
…
When the normality assumption underlying the t test is violated, one may prefer to
instead use a nonparametric test, such as the Mann-Whitney U test (see Section 4.1) or
the chi-square test for r x c tables (see Section 4.3, where we also discuss its application
referred to as the median test for two independent samples). However, nonparametric tests
usually sacrifice information by transforming the original interval or ratio variable into an
ordinal or categorical format. For this reason, some researchers prefer to use the t test
even when the normality assumption is violated (also because the t test actually tends to
perform relatively well even with its assumptions violated) but use more conservative
(i.e., larger in magnitude) critical values to avoid inflating the likelihood of committing a
Type I error. Nevertheless, one should attempt to understand why and to what extent the t
test’s assumptions are violated. For example, the presence of outliers may cause violation
of both the normality and variance homogeneity assumption.
Before using the t test, one may wish to verify the homogeneity of variance assumption.
To do so, one can use an F test for two population variances, or Hartley’s Fmax test for
homogeneity of variance, or other tests. Both the abovementioned tests rest on normality
of the underlying distributions from which the samples are drawn, and the former is more
appropriate in the case of unequal sample sizes. See, e.g., [S] pp.403-408 and 722-725 for
a detailed discussion. One may alternatively prefer to use nonparametric tests, such as the
Siegel-Tukey test for equal variability (see, e.g., [S] pp.485-498) or the Moses test for
equal variability (see, e.g., [S] pp.499-512).
When the t test is employed despite the homogeneity of variance assumption being
violated – which may be a key result in itself, among other things possibly signaling the
presence of outliers, or that the control and treatment groups differ along some
underlying dimension, or that the treatment group subjects react heterogeneously to the
treatment – the likelihood of committing a Type I error increases. The literature offers
several remedies that make the t test more conservative, such as adjusting upwards the
(absolute value of) critical value(s) of the t test, or adjusting downwards the number of
degrees of freedom. Alternatively, one may prefer to instead use a nonparametric test that
does not rest on the homogeneity of variance assumption, such as the Kolmogorov-Smirnov test for two independent samples (see Section 4.2).
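A sketch of how these checks and remedies might look in Python: the ordinary t test, a Welch-type t test that drops the equal-variance assumption (one of the adjustments mentioned above), and Levene's test as one common way to probe variance homogeneity (the F test named earlier would be an alternative). The data are invented; the lecture's examples use Stata.

    # Sketch: two-independent-sample t tests and a variance-homogeneity check on toy data.
    import numpy as np
    from scipy import stats

    treatment = np.array([5.2, 6.1, 7.3, 5.8, 6.6, 9.4, 6.0, 5.5])   # made-up data
    control = np.array([4.9, 5.1, 5.4, 5.0, 5.2, 5.3, 4.8, 5.6])

    print(stats.levene(treatment, control))                       # variance-homogeneity check
    print(stats.ttest_ind(treatment, control))                    # assumes equal variances
    print(stats.ttest_ind(treatment, control, equal_var=False))   # Welch adjustment
    print(stats.mannwhitneyu(treatment, control, alternative="two-sided"))   # nonparametric alternative (Section 4.1)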
2.2.1 Measuring the magnitude of treatment effect
As discussed in Section 1.2.3, one should be cautious when the observed effect on (or
difference in) behavior is of little practical or theoretical relevance and statistical
significance of the effect arises primarily from having a large enough sample size.
Besides judging the practical (economic) significance of an observed effect, one may
wish to evaluate its magnitude, in a manner more or less independent of sample size, by
determining what is referred to as the magnitude of treatment effect – the fraction of
variation in the dependent variable (i.e., the variable under study) attributable to variation
in the independent variable (usually the treatment variation in experimental conditions
between the treatment and control groups). For a given experimental design, there
frequently exist multiple measures of the magnitude of treatment effect which tend to
yield different results. …
2.3 t test for two dependent samples
2.3.1 Measuring the magnitude of treatment effect
2.4 Examples of using t tests
Example 1:
Kovalchik et al. (2005) compare the behavior between two samples of young students
(Younger) and elderly adults (Older). In one of the tasks, the authors elicit willingness-to-pay (WTP) for a group of subjects in the role of buyers and willingness-to-accept (WTA)
for another group of subjects in the role of sellers of a real mug – see the data in the last
column of their Table 2. The authors claim that there is no significant difference between
WTP and WTA for either the Younger or the Older (and also for the pooled sample), and
that there is no significant difference between the Younger and the Older in either their
WTP or their WTA.
Although the authors give no details as to how they conducted the statistical inferential
tests, the third column of Table 2 offers us sufficient information to evaluate the
hypotheses for the mug’s WTP/WTA using the t test for two independent samples.
Before doing so, it is worth noting that although we do not have full information on
the shape of the WTP and WTA distributions, the normality assumption underlying the t
test might well be violated especially for the WTP of the Older and the WTA of the
Younger, since for either group there is a large difference between the sample mean and
the sample median, which indicates distribution asymmetry. Similarly, judging from the
rather large differences in the four standard deviations, the homogeneity of variance
assumption behind the t test might also be violated. We nevertheless calculate the
sequence of t tests for illustration purposes.
Evaluating first the null hypothesis that the mean WTP and WTA do not differ from each
other for the population of the Younger (Older), say, against a two-tailed H1 at the 5%
significance level, we calculate the tYounger (tOlder) test statistic as follows (using the
formula for equal sample sizes, disregarding that there is an extra subject in the third
group since this will not affect the t test statistic in any essential way):
tYounger = (3.88 − 2.24) / √(4.88²/(26 − 1) + 1.75²/(25 − 1)) = 1.578
tOlder = (2.48 − 3.25) / √(1.7²/(25 − 1) + 3.04²/(25 − 1)) = −1.083
Next, we evaluate the null hypothesis that the mean WTP (WTA) does not differ between
the Younger and Older populations, again against a two-tailed H1 at the 5% significance
level. We calculate the tWTP (tWTA) test statistic as follows (using again the formula for
equal sample sizes):
tWTA = (2.48 − 3.88) / √(1.7²/(25 − 1) + 4.88²/(26 − 1)) = −1.352
tWTP = (3.25 − 2.24) / √(3.04²/(25 − 1) + 1.75²/(25 − 1)) = 1.411
We compare the four computed t test statistics with the two-tail upper and lower critical
values, t0.975(df) and t0.025(df), where there are either 49 or 48 degrees of freedom
depending on the particular comparison (remember that df = n1 + n2 − 2). Based on Table A2
in Appendix X, t0.975(49) = −t0.025(49) and t0.975(48) = −t0.025(48) lie between 2.000 and 2.021, so
none of the four computed statistics exceeds these critical values and we cannot reject the null
hypothesis of no difference in any of the four cases above. This confirms the conclusions of
Kovalchik et al. (2005). [Or does it?]
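For readers who prefer not to do the arithmetic by hand, roughly the same comparisons can be reproduced from the published summary statistics alone; the sketch below uses Python's scipy rather than the text's hand formula. Note that scipy treats the supplied standard deviations as the usual (ddof = 1) sample standard deviations and divides by n rather than n − 1, so its t values will differ slightly from the hand calculation above.

    # Sketch: Welch-type t tests computed from the summary statistics quoted in the text.
    from scipy.stats import ttest_ind_from_stats

    # Younger group: means 3.88 (SD 4.88, n = 26) vs. 2.24 (SD 1.75, n = 25)
    print(ttest_ind_from_stats(mean1=3.88, std1=4.88, nobs1=26,
                               mean2=2.24, std2=1.75, nobs2=25, equal_var=False))
    # Older group: means 2.48 (SD 1.70, n = 25) vs. 3.25 (SD 3.04, n = 25)
    print(ttest_ind_from_stats(mean1=2.48, std1=1.70, nobs1=25,
                               mean2=3.25, std2=3.04, nobs2=25, equal_var=False))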
3. Nonparametric alternatives to the single-sample t test
3.1 Wilcoxon signed-ranks test
3.2 Chi-square goodness-of-fit test
3.3 Binomial sign test for a single sample
3.3.1 Single-sample test for the median
3.3.2 z test for a population proportion
3.4 Examples of using single-sample tests
Example 1:
Ortmann et al. (2000) study trust and reciprocity in an investment setting and how it is
affected by the form of presenting other subjects’ past behavior and by a questionnaire
prompting strategic thinking. The authors do not find significant differences in
investment behavior across their five between-subjects treatments (i.e., they have one
control group called the Baseline and four treatment groups varying in history
presentation and in the presence or absence of the questionnaire), using the two-tailed
Mann-Whitney U test (see Table 3).
We can entertain an alternative research hypothesis and test whether, in the authors’ fifth
treatment which would be theoretically most likely to decrease trust, investment differs
from the theoretical prediction of zero. For this purpose, we analyze the data from the
authors’ Tables A5 and A5R for two different subject groups participating in the fifth
treatment, which, as the authors report, differ in their behavior as indicated by the two-tailed Mann-Whitney U test (see Table 3, p=0.02). In the left and right bar graphs below,
you can see a clear difference in the shape of the sample distributions of investment for
the A5 and A5R subjects, respectively, namely that the investment of zero (ten) is much
more common for A5 (A5R) subjects. For that reason (and predominantly for illustration
purposes), we focus on whether investment behavior of the A5 subjects adheres to the
above stated theoretical prediction.
[Bar graphs (densities) of investment choices 0–10, by treatment, for the A5 and A5R subject groups]
We know from Ortmann et al.’s (2000) Table 2 that the (sample) mean and median
investment of the A5 subjects is 2.2 and 0.5 units, respectively. For a start, we can conduct
a two-sided single-sample t test at the 5% significance level (using, for example, the
“ttest” command in Stata) to investigate whether, in the population that the A5 subjects
represent, the mean investment is zero. The t test yields a highly significant result
(p=0.011). However, we see from the above bar graph that the normality assumption
underlying the t test is very likely violated.
Since the above displayed distribution of investment is so asymmetric, one would also
have trouble justifying the use of the Wilcoxon signed-ranks test. Yet a more general
problem is the corner-solution nature of the theoretical prediction, implying that ΣR–
calculated for the Wilcoxon signed-ranks test would inevitably be zero, which ultimately
makes the test unsuitable for evaluating the research hypothesis. Similarly, note that,
based on the corner-solution theoretical prediction, the expected cell frequencies for the
chi-square goodness-of-fit test would be zero for all positive-investment categories
(cells), which clearly makes the chi-square test unsuitable, too (not only due to the
statistical problem that assumption (3) of the test would be strongly violated). A similar
problem arises also for the binomial sign test (and its large-sample approximations)
where it would have been impossible to evaluate the null hypothesis of whether, in the
population that A5 subjects represent, the proportion of zero investments is equal to π=1
(to see why, check the formula for calculating the binomial probabilities).
One could adopt an alternative (and often appealing) research hypothesis to determine
whether subjects’ investment behavior in fact deviates from what would be expected
under random choice, meaning that all 11 choice categories (0,1,…,10) would be chosen
equally often in the population. In principle, one could test this hypothesis using the chi-square goodness-of-fit test. However, note that even if we pooled the A5 and A5R subject
groups – giving a total of 34 subjects (see Table 2) – this would give us only slightly above
3 subjects in each of the eleven expected frequency cells (i.e., 34/11). Hence the test
reliability could be compromised due to the violation of its assumption (3).
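For completeness, a sketch of how the uniform-choice hypothesis could be evaluated once the pooled frequencies were tabulated. The eleven observed counts below are hypothetical placeholders, not the actual Ortmann et al. (2000) frequencies; only the total of 34 pooled subjects is taken from the text.

    # Sketch: chi-square goodness-of-fit test against a uniform distribution over 11 choices.
    import numpy as np
    from scipy.stats import chisquare

    observed = np.array([9, 3, 2, 1, 2, 5, 1, 1, 2, 1, 7])   # hypothetical counts, sum = 34
    expected = np.full(11, observed.sum() / 11)               # uniform: 34/11 per cell
    print(chisquare(f_obs=observed, f_exp=expected))
    # With barely 3 expected observations per cell, assumption (3) of the test is strained,
    # exactly as noted above.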
4. Nonparametric alternatives to the t test for two independent samples
4.1 Mann-Whitney U test (Wilcoxon rank-sum test)3
4.2 Kolmogorov-Smirnov test for two independent samples4
4.3 Chi-square test for r x c tables5
4.3.1 Fisher exact test
3 Note that there exist two versions of the test that yield comparable results: we describe the version
developed by Mann and Whitney (1947), which is also referred to as the Mann-Whitney-Wilcoxon test,
while the other version, usually referred to as the Wilcoxon rank-sum test or the Wilcoxon-Mann-Whitney
test, was developed independently by Wilcoxon (1949).
4 The test was developed by Smirnov (1939) and for that reason is sometimes referred to as the Smirnov
test, but because of its similarity to the Kolmogorov-Smirnov goodness-of-fit test for a single sample (see
Section 6.1), the test described here is most commonly named the Kolmogorov-Smirnov test for two
independent samples.
5 The test is an extension of the chi-square goodness-of-fit test (see Section 3.2) to two-dimensional
contingency tables.
4.3.2 z test for two independent proportions
4.3.3 Median test for independent samples
4.3.4 Additional notes on the chi-square test for r x c tables
4.4. Computer-intensive tests
Computer-intensive (or data-driven) tests have become a widely used alternative to
traditional parametric and nonparametric tests. The tests are also referred to as
permutation tests, randomization (or rerandomization) tests, or exact tests. They employ
resampling of data – a process of repeatedly randomly drawing subsamples of
observations from the original dataset – to construct the sampling distribution of a test
statistic based directly on the data, rather than based on an underlying theoretical
probability distribution as in the classical hypothesis testing approach. As a consequence,
computer-intensive tests have the appeal of relying on few if any distributional
assumptions. The tests differ from each other mainly in the nature of their resampling
procedure. Most computer-intensive tests are employed for comparing behavior between
two independent random samples and hence are discussed briefly in this section (you can
find more details in the references cited below, in modern statistical textbooks, or in
manuals of statistical packages in which the tests are frequently pre-programmed).
- randomization test for two independent samples (resampling without replacement)
- bootstrap (resampling with replacement)
- jackknife (resampling by systematically leaving out observations)
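A bare-bones sketch of the first of these, a randomization (permutation) test for a difference in means between two independent samples, on invented data. (Recent scipy versions also ship a ready-made scipy.stats.permutation_test; writing out the resampling makes the logic explicit.)

    # Sketch: two-sample permutation (randomization) test for a difference in means.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([5.2, 6.1, 7.3, 5.8, 6.6, 9.4, 6.0, 5.5])   # made-up treatment data
    y = np.array([4.9, 5.1, 5.4, 5.0, 5.2, 5.3, 4.8, 5.6])   # made-up control data

    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    n_x, n_draws, count = len(x), 10_000, 0
    for _ in range(n_draws):
        perm = rng.permutation(pooled)                 # resampling without replacement
        diff = perm[:n_x].mean() - perm[n_x:].mean()
        if abs(diff) >= abs(observed):                 # two-sided comparison
            count += 1
    print("permutation p-value:", count / n_draws)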
4.5 Examples of using tests for two independent samples
Example 1:
Abbink and Hennig-Schmidt (2006) use the Mann-Whitney U test in the context of a
bribery experiment to compare behavior of two independent groups of subjects under
neutral and loaded (corruption-framed) experimental instructions. They find that bribe
offers (averaged across 30 rounds of the experiment) do not significantly differ across the
two groups of subjects, with a one-tail p-value of 0.39.
This result may look surprising given that the median (averaged) bribe offers differ rather
widely – 1.65 for the loaded group and 3.65 for the neutral group, as we calculate from
the authors’ Table 2. We plot the two bribe offer distributions below (using the command
“kdensity” in the Stata software) to illustrate that the difference in medians results from
the bi-modal nature of the distributions (note that the medians fall in the trough of the
distributions) as well as their different shape (compare the two humps).
[Kernel density estimates of averaged bribe offers under loaded and neutral instructions (Epanechnikov kernel, bandwidths 1.15 and 1.05, respectively)]
If the distributions’ shape is indeed different, this would violate assumption (3) of the
Mann-Whitney U test. Conducting the F test for two population variances (for example,
using the command “sdtest” in Stata), we find that the variances of the two bribe offer
distributions do not differ significantly from each other, with a two-tail p-value of 0.72.
However, note that, given the bi-modality of the sample distributions, first, the normality
assumption of the F test is likely violated, so the reliability of the test might be
compromised, and second, variance is unlikely to be the most appropriate criterion when
comparing the shape of the two distributions.
One might wish to compare the above distributions of averaged bribe offers using the
Kolmogorov-Smirnov test for two independent samples which does not rely on the shape
of the population distributions being identical. When we conduct the two-tail test (for
example, using the command “ksmirnov” in Stata), it yields no significant difference
between the two distributions at any conventional significance level (p=0.936).
Given the end-game effect noted by the authors as well as a potential “warming-up”
effect in early rounds of the experiment (when different forms of learning might be going
on as compared to later rounds), we examine whether leaving out the first and last five
rounds from the above analysis influences the results. Although the medians are now
somewhat different – 1.175 for the loaded group and 3.425 for the neutral group – the
bimodal shape of the distributions is again present and the difference in their shape – as
judged by either the Mann-Whitney U test or the Kolmogorov-Smirnov test for two
independent samples – remains statistically highly insignificant (p-values not reported).
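A sketch of the two comparisons used in this example (Mann-Whitney U and Kolmogorov-Smirnov) in Python rather than Stata; the two samples below are invented placeholders, not the Abbink and Hennig-Schmidt data.

    # Sketch: Mann-Whitney U test and two-sample Kolmogorov-Smirnov test on toy data.
    import numpy as np
    from scipy import stats

    loaded = np.array([0.5, 1.2, 1.8, 2.0, 4.6, 5.1, 5.3, 0.9])    # hypothetical averaged offers
    neutral = np.array([1.1, 3.4, 3.9, 4.2, 4.4, 0.8, 1.0, 4.8])

    print(stats.mannwhitneyu(loaded, neutral, alternative="two-sided"))
    print(stats.ks_2samp(loaded, neutral))
    # A variance comparison in the spirit of Stata's "sdtest" could be approximated with,
    # e.g., stats.levene(loaded, neutral), bearing in mind the caveats discussed above.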
Example 2:
Kovalchik et al. (2005) compare the behavior between two samples of young students
(Younger) and elderly adults (Older). In their second gambling experiment, subjects
sequentially make six choices between two decks of cards, one of which contains cards
with lower average payoff and higher variance – the authors call the card deck risky. The
sample distributions of the number of risky deck choices (ranging from 0 to 6) for the Younger and Older
are depicted in the authors’ Figure 3, where only the percentages of subjects choosing the
risky deck six times differ (at the 10% significance level) across the two groups of
subjects, as revealed by the Mann-Whitney U test. Based on the figure and Table 1, the
authors conclude that there is no difference in the choice behavior of the Younger and the
Older.
We use the data from Figure 3 to illustrate that, instead of using the Mann-Whitney U
separately for each given number of risky deck choices, one can make an alternative (and
perhaps more suitable) comparison of the entire distributions of the number of risky deck
choices by using the chi-square test for r x c tables. In particular, the chi-square test for
homogeneity can be used to evaluate the null hypothesis of whether the two independent
samples of the Younger and Older, each having the variable’s values organized in the
seven categories represented by the number of risky deck choices, are homogenous with
respect to the proportion of observations in each category.
In the tables below, we display, first, the observed frequencies of risky deck choices, then
the expected frequencies calculated as outlined in Section 4.3, and finally the calculation
of the chi-square test statistic using the formula given in Section 4.3. Note that since 4 out
of the 14 expected frequencies in the middle table fall below 5, one could argue that
assumption (3) of the chi-square test is not met, but two of the 4 cases are only marginal
and so we proceed with conducting the chi-square test, if only for illustration purposes.
Observed frequencies:
# risky choices        6      5      4      3      2      1      0   Row total
Younger               11     10      9      8      4      4      5          51
Older                  4      9      8      4      5      0      6          36
Column total          15     19     17     12      9      4     11          87

Expected frequencies:
# risky choices        6       5       4       3       2       1       0   Row total
Younger           8.7931  11.138  9.9655  7.0345  5.2759  2.3448  6.4483          51
Older             6.2069  7.8621  7.0345  4.9655  3.7241  1.6552  4.5517          36
Column total          15      19      17      12       9       4      11          87

(Observed − Expected)² / Expected:
# risky choices        6       5       4       3       2       1       0   Row total
Younger           0.5539  0.1163  0.0935  0.1325  0.3085  1.1684  0.3253      2.6984
Older             0.7847  0.1647  0.1325  0.1877  0.4371  1.6552  0.4608      3.8227
Column total      1.3386  0.2810  0.2261  0.3203  0.7456  2.8235  0.7861      6.5211
We compare the computed value of the χ² test statistic (i.e., the sum of the 14
cells in the last contingency table) with the appropriate critical value from the χ²(6)
probability distribution, since df = (2−1)(7−1) = 6, as tabulated in Table A4 in Appendix X.
Since the critical value for the 5% significance level, χ²0.95(6) = 12.59, is far greater
than the χ² test statistic of 6.52 (rounded to 2 decimal places), we cannot reject at the 5%
significance level the null hypothesis of no difference between the distributions of risky
deck choices for the Younger and Older (the same conclusion would clearly be reached
also at the 10% significance level, for which the critical value is χ²0.90(6) = 10.64).
Example 3:
Kovalchik et al. (2005) compare the behavior between two samples of young students
(Younger) and elderly adults (Older). In their last task, subjects play the p-beauty contest
game with p=2/3. Subjects’ choices (guesses) are displayed in the authors’ Figure 5 using
a stem-and-leaf diagram, based on which the authors argue that the Younger and Older
subjects behave similarly in the game. We show how one can compare the distribution of
choices of the Older and Younger more formally by using the Kolmogorov-Smirnov test
for two independent samples (as done for the same p-beauty contest game, for example,
in Grosskopf and Nagel, forthcoming in GEB).
Before we do that, we plot the sample distributions of choices for the Younger and Older
subjects, noting informally that normality is likely violated especially for the latter group.
Also, verifying the homogeneity assumption by conducting the F test for two population
variances (for example, using the command “sdtest” in Stata), we find that the variances
of the two choice distributions differ significantly from each other, with a two-tail p-value of 0.036. Thus neither the t test for two independent samples nor the Mann-Whitney U test would be entirely appropriate, while the Kolmogorov-Smirnov test is
more appropriate in that it does not rely on the shape of the population distributions being
identical.
[Kernel density estimates of p-beauty contest choices for the Younger (Epanechnikov kernel, bandwidth 4.25) and the Older (Epanechnikov kernel, bandwidth 7.29)]
When comparing the choice distributions of the Younger and Older using a two-tailed
Kolmogorov-Smirnov test for two independent samples (for example, using the
command “ksmirnov” in Stata), the test indicates no significant difference between the
two choice distributions at any conventional significance level (p=0.337), which confirms
the argument of Kovalchik et al. (2005). We note that a similar result would be obtained
using the Mann-Whitney U test (using, for example, the Stata command “ranksum”) for
which the two-tail p-value is 0.745.
5. Nonparametric alternatives to the t test for two dependent samples
5.1 Wilcoxon matched-pairs signed-ranks test6
5.2 Binomial sign test for two dependent samples7
5.3 McNemar test
5.4 Examples of using tests for two dependent samples
Example 1:
Blume and Ortmann (2005) study the effect of pre-play communication on behavior in
median- and minimum-effort games. Here we evaluate a “convergence” research
hypothesis in the loose sense of whether, in the two treatments with pre-play
communication (i.e., Median sessions M1Me-M8Me and Minimum sessions M1Min-M8Min in Figures 1 and 2, respectively), there is an overall upward drift in subjects’
choices towards the Pareto-efficient choice between the first and last round. For various
reasons, this alternative definition of choice convergence may be less appropriate than
that used by Blume and Ortmann, but it will serve the purpose of illustrating the use of
tests for two dependent samples.
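A sketch of how such a first-round versus last-round comparison could be run as a dependent-samples (paired) test; the choices below are invented, not the Blume and Ortmann data, with each pair coming from the same group of subjects.

    # Sketch: tests for two dependent samples - a paired t test and the Wilcoxon
    # matched-pairs signed-ranks test on hypothetical first- vs. last-round choices.
    import numpy as np
    from scipy import stats

    first_round = np.array([4, 5, 3, 6, 4, 5, 2, 4])   # made-up data, one value per group
    last_round = np.array([6, 7, 4, 7, 5, 7, 3, 6])

    print(stats.ttest_rel(first_round, last_round))    # paired t test (Section 2.3)
    print(stats.wilcoxon(first_round, last_round))     # Wilcoxon matched-pairs signed-ranks test
    # A one-tailed variant, e.g. stats.wilcoxon(first_round, last_round, alternative="less"),
    # would match the directional "upward drift" hypothesis.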
6. A brief discussion of other statistical tests
6.1 Tests for evaluating population skewness and kurtosis
6.2 Tests for evaluating population variability
6 This test is a two-sample extension of the Wilcoxon signed-ranks test (see Section 3.1).
7 This test is a two-sample extension of the binomial sign test for a single sample (see Section 3.3).
A LITTLE ASIDE OF RELEVANCE HERE:
From: "dan friedman" <[email protected]>
To: "ESA Experimental Methods Discussion" <[email protected]>
Hi Timothy (and Karim and Karl) – As Karim says, this is an old and sensitive topic.
Your reasons for randomly pairing are sensible. Karim's suggestion on smaller groups
does help increase the number of (pretty much) independent observations, but at the same
time it tends to undermine your goal to study experienced players in a one-shot game.
(with smaller groups, the repeated interactions might alter incentives.)* It is a dilemma,
especially for experimentalists with limited budgets.
Independence is a strong assumption in general.
When your observations are not independent, but you run the usual parametric or
nonparametric stats as if they were independent, then you still get unbiased estimates but
the significance level is overstated. How overstated it is hard to say.
My own standard approach to such matters is to report bounds:
run the tests with individual actions, report the significance as a lower bound on the p-value, and also run tests on session averages, providing an upper bound on the p-value. In
some cases, you can run regressions with individual subject effects and/or session effects
(fixed or random) that plausibly capture most of the problem, but I would still regard
these as overstating the significance level.
I'd regard the position taken by Karl and Karim as fairly conservative, but not the most
conservative. There might be lab or experimenter or instruction or subject pool effects, so
even session averages aren't quite guaranteed to be independent. Most people don't worry
about that, at least formally, but everyone is happier to see a result replicated in a
different lab.
My bottom line: think through the econometrics before you start, but in the end, you must
create a lab environment that corresponds to what you want to test. If that requires
random matching (or mean-matching), so be it. Just be prepared to deal with conservative
referees afterward.
--Dan
*of course, you can mislead your Ss into believing that they might be matched with any
of the 23 others when in fact they are matched only with 5 others ... but some referees
will condemn even that mild sort of deception. Dilemmas abound!
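To make Dan's "report bounds" recipe concrete, here is a minimal sketch with entirely made-up, session-labelled data; the Mann-Whitney test merely stands in for whatever test is appropriate to the design.

# Minimal sketch (hypothetical data) of the "report bounds" approach:
# the same treatment comparison run once at the individual level
# (anti-conservative) and once on session averages (conservative).
from statistics import mean
from scipy import stats

treatment_sessions = {"T1": [12, 15, 11, 14, 13, 16], "T2": [14, 13, 15, 17, 12, 16]}
control_sessions   = {"C1": [10, 12,  9, 11, 13, 10], "C2": [11,  9, 12, 10, 11, 13]}

# (1) Individual actions as observations: p-value is a lower bound on the true p-value
t_ind = [x for obs in treatment_sessions.values() for x in obs]
c_ind = [x for obs in control_sessions.values() for x in obs]
_, p_lower = stats.mannwhitneyu(t_ind, c_ind, alternative="two-sided")

# (2) Session averages as observations: p-value is an upper bound on the true p-value
t_avg = [mean(obs) for obs in treatment_sessions.values()]
c_avg = [mean(obs) for obs in control_sessions.values()]
_, p_upper = stats.mannwhitneyu(t_avg, c_avg, alternative="two-sided")

print(f"individual-level test: p = {p_lower:.3f} (lower bound)")
print(f"session-level test:    p = {p_upper:.3f} (upper bound)")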
From: "John Kagel" <[email protected]
To: [email protected], "ESA Experimental Methods Discussion"
<[email protected]>
I am loath to reply to the group as a whole on anything but I could not disagree more
strongly with Karim's remarks. Notice that the operative word with respect to full and
complete contamination (so that the outcome reduces to a SINGLE observation) is MAY - so where's the proof that it happens?? It's pretty thin as far as I can tell. And there are
tradeoffs as I outline below:
Each of XX's sessions had 12 subjects, who were told they would be randomly matched
with another participant. In practice, the set of 12 subjects was divided into 3 groups of 4
subjects each with rotation within each group. This was done in an effort to obtain
"three independent sets of observations per session instead of only one" as the unit of
observation employed in the analysis is primarily session-level data. The idea behind
"only one" independent observation per session when randomly rotating among all 12 bidders
in the session is that, given the repeated interactions between subjects, this generates
session-level effects that will dominate the data. In this regard he is among a growing number of
experimenters who believe this, and who break up their sessions into smaller subgroups
in an effort to obtain more "independent" observations per session. This practice [of what
Karim calls below, sterile subgroups] ignores the role of appropriate panel data
techniques to correct for dependencies across and between subjects within a given
experimental session.
There are several important and unresolved issues in choosing between these two
procedures. In both cases experimenters are trying to squeeze as much data as they can
from a limited subject-payment budget. As experimenters who have consistently
employed random rematching between all subjects recruited for a given session, and
applied panel data analysis to appropriately account for the standard errors of the
estimates, we are far from unbiased with respect to this issue. With this in mind we point
out several things: First, advocates of repeated matching of the same small subset of
subjects within an experimental session to generate more "independent" observations
ignore the fact that there is no free lunch as: (i) they are implicitly lying to/deceiving
subjects by not reporting the rotation rule employed and (ii) if subjects are as sensitive to
repeated matching effects as they seem to assume under random matching between all
subjects in a given experimental session, it seems plausible that repeated play within a
small subset might generate super-game effects that will contaminate the data. Second,
and more importantly, there have been a few experiments which have devoted treatments
to determine the severity of possible session level effects from random rematching for the
group as a whole. More often than not these studies find no differences, e.g. Cooper et al.
(1993; footnote 13, p. 1308), Duffy and Ochs (2006). Also see Walker et al. (1987) and
Brosig and Reiß (2007) who find no differences when comparing bids in auctions with all
human bidders against humans bidding against computers who follow the RNNE bidding
strategy. For more on the econometrics of this issue see Frechette (2007).
JK
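As a concrete (and deliberately simplified) illustration of the panel-data route Kagel mentions, here is a minimal sketch that keeps every individual observation but clusters standard errors at the session level; the simulated data, variable names, and the choice of cluster-robust OLS via statsmodels are assumptions for illustration only, not the procedure of any particular paper.

# Minimal sketch (simulated data): keep all individual observations and
# correct standard errors for within-session dependence by clustering at
# the session level.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_sessions, subjects_per_session = 6, 12
session = np.repeat(np.arange(n_sessions), subjects_per_session)
treated = (session < 3).astype(float)                 # first 3 sessions are the treatment
y = 1.0 * treated + rng.normal(size=session.size)     # hypothetical individual outcome

X = sm.add_constant(treated)
fit = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": session})
print(fit.summary())   # treatment coefficient with session-clustered standard errors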
On Mar 9, 5:45 am, karim <[email protected]> wrote:
> Dear Timothy,
>
> these discussions are very ancient and most of us were hoping that they would never
> come back.
> If I'm getting you right, you have 24 subjects interacting over the
> course of your experiment. Obviously, if one of them does freaky
> things, this "virus" may spread through your entire subject pool. Just
> imagine it is really a virus: Anybody who has contacted the person
> with the virus may be contaminated. Anybody who has contacted anybody
> who has contacted the person with the virus may also be contaminated.
> Anybody ... etc.
>
> Since everyone in your experiment has interacted - i.e. there are no
> sterile subgroups - you end up with a single independent observation.
> Analyzing the subjects on an individual level is a good idea, but it
> doesn't make them independent, because they have been interacting,
> i.e. contaminating, each other in the course of the experiment.
>
> It is a good idea to let the subjects gain experience by playing
> randomly matched games over and over for a couple of rounds, but you
> don't have to mix them all with one another. You can take 24 subjects
> and separate them into 2 or 3 independent subgroups (i.e. "sterile"
> subgroups) and then rematch in the subgroups. Add 2 sessions of this
> kind to your experiment and you will have a total of at least 5 and at
> most 7 independent observations. That sounds like a good minimum
> number of ind obs for an experimental paper.
>
> When analyzing the data, you can then use non-parametric statistics or
> run regressions, in which you explicitly take the random effects that
> pertain to the subgroup membership into account. So you see, once you
> have enough observations, all types of statistical testing are
> available.
>
> If you don't trust me on all this (and why should you?!), please, look
> it up in one of the many textbooks on experimental economics or on
> experimental methods in general (e.g. in social psychology).
>
> Sorry folks, for putting this on the list. I really don't often bug
> you, by speaking up on the list. But, this time I felt I had to say
> something out loud to remind us all of the importance of sticking to
> the word "science" in ESA, instead of replacing it with "speculation."
> It certainly will be better for the reputation of experimental
> economics, if we stick to the minimum requirements that were hammered
> out many years ago. Letting go of what was once a general agreement
> amongst experimental economists will deteriorate our credibility and
> our standing within economics and amongst the sciences.
>
> Greetings,
> karim
>
> On 9 Mrz., 08:42, Timothy Dang <[email protected]> wrote:
>
> > Hello Karl et al-
> > On Sun, Mar 8, 2009 at 8:10 AM, Karl Schlag <[email protected]> wrote:
> > > I would prefer to discuss either why you thought this is the right decision
> > > or what kind of parametric model you could apply that allows for
> > > dependencies.
>
> > First, an off-list reply prompts me to be a bit more specific. I have
> > 24 subjects--12 in each of two roles--playing a 2x2 game for 50
> > periods, with random re-matching every period. I'm not aiming to treat
> > each of the 600 game-plays as an observation. Rather, for most of my
> > analysis, I'm planning to treat a player as an observation. For
> > instance, one data point would be "How many times did subject 5 play
> > Up?"
>
> > My motivation for random re-matching was the traditional motivation: I
> > wanted to get play which is strategically close to one-shot. But I
> > also want the subjects to have an opportunity to learn the game, and
> > I'd like data with a bit less variance than I'd get from truly
> > one-shot play.
>
> > > In such a forum I would not be in favor of discussing how to justify
> > > treating observations independently when these are not independent.
>
> > I hope others don't share your qualms ;). I hope it's clear that I'm
> > not looking for a way to mis-represent my data. I made a pragmatic
> > decision, and I'm looking for the best way to present the results
> > which is both informative and forthright.
>
> > -Timothy
>
> > > Timothy Dang wrote:
> > >> Hello ESA-
> > >> I've recently run a 50-period 2x2 game experiment, with random
> > >> re-matching of players each period. Now I need to report the results.
> > >> I know random re-matching has tradition behind it, and I also know
> > >> that it's been subject to some good criticism when statistics get
> > >> applied as if the games are independent. In spite of those critiques,
> > >> it seemed the right decision to me.
>
> > >> But now, what's the right way to actually report the results? My best
> > >> feeling is that I should go ahead with the stats as if the
> > >> observations were independent, but with the conspicuous caveat that
> > >> this isn't truly legit. I've been having trouble finding clean
> > >> examples of how this is handled in recent papers.
>
> > >> Thanks
>
> > >> -Timothy
>
> > > --
> > > --------------------------------------------------------------------
> > > Karl Schlag
> > > Professor                               Tel: +34 93 542 1493
> > > Department of Economics and Business   Fax: +34 93 542 1746
> > > Universitat Pompeu Fabra               email: [email protected]
> > > Ramon Trias Fargas 25-27               www.iue.it/Personal/Schlag/
> > > Barcelona 08005, Spain                 room: 20-221 Jaume I
>
> > --
> > -----------------------------
> > Timothy O'Neill Dang / Cretog8
> > 520-884-7261
> > One monkey don't stop no show.
### Harrison, G.W. & Lau, M.I. 2005. Is the evidence for hyperbolic discounting in
humans just an experimental artifact? Behavioral and Brain Sciences 28(5): 657
- What's the basic point?
The elicitation of the time value of payments at different points in time
(specifically at t = 0, and t > 0) may be confounded by transaction costs
(including experimenter commitment/credibility issues) afflicting the later
payments in the typical comparison scheme; these transaction costs may
lead subjects to discount the time value of those later payments. As
Harrison & Lau say, “(t)he subject is being asked to compare ‘good apples
today’ with ‘bad apples tomorrow’”.
This effect induces deviations from the exponential curve towards a curve
that is more bowed / present-oriented ("hyperbolic discounting", or "time
inconsistency"); the two functional forms are sketched in the note after this list.
Conceptually, a FED (front-end delay) can be used to assess the
importance of this confound (e.g., is the discount rate for a given horizon
elicited with a FED different from the discount rate for the same horizon
elicited with no FED?)
In fact, Harrison, Lau and Williams (AER 2002) use a FED in a field
experiment in Denmark and find that elicited discount rates are
“proximately invariant” with respect to horizon.
The problem: There are settings where payments at t = 0 (“money today”)
have to be compared to payments at t > 0 (“money in the future”); the
Harrison, Lau and Williams result seems to suggest that transaction costs
(lack of credibility, etc.) are the source of present bias. Similarly, Harrison
& Lau (2005) argue that the results in Coller & Williams (EE 1999)
suggest as much. The work of Coller, Harrison & Rutstroem (wp 2003)
suggests that it takes as little as a 7-day FED to overcome the effects of
subjective transaction costs.
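As a reminder of what "more bowed / present-oriented" means, the two textbook functional forms (generic notation, not taken from Harrison & Lau) are: exponential discounting, D(t) = δ^t (equivalently e^(-ρt)), which is time-consistent; and hyperbolic discounting, D(t) = 1/(1 + kt), which discounts near-term delays relatively more heavily. The latter is exactly the pattern that transaction costs on delayed (but not immediate) payments can mimic.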
Related manuscript:
- Andersen, S., Harrison, G.W., Lau, M.I. & Rutstrom, E.E. 2008. Eliciting risk and
time preferences. Econometrica.
Key result: Eliciting risk and time preferences together reduces time
discount rates dramatically. (Very important paper.)
Infinite repetition:
[Drawing also on Kreps, Binmore, and MCWG 12.D and 12.App]
The main idea:
If a game is played repeatedly, then mutually desirable outcomes *may* be
sustainable as (Nash) equilibrium outcomes even though they are not equilibria of the "stage game".
Imagine that A plays the 1sPDG not just against one P but many P’s indexed P1, P2, ...
Each Pn is only interested in the payoff from his own interaction with A.
For A, however, an outcome is now a sequence of results -- what happens with P1,
what happens with P2, etc. -- and hence the sum of payoffs u1, u2, etc., duly
discounted: u1 + δ·u2 + δ²·u3 + ..., where δ ∈ (0,1) could be the discount factor
associated with the fixed interest rate r. Crucially, prospective employee Pn, when
deciding whether to take employment, is aware of A's history of treatment of
workers.
This set-up changes the decision problem for A, who now has to compare the payoffs of
exploiting and not exploiting. Whether the former payoff-dominates the latter is a
function of the strategies being used.
For example, if all workers follow the decision rule "Never seek employment with an
employer who in the past exploited a worker", then the employer faces a "grim" or
"trigger" strategy. (This is nothing but what MCWG, p. 401, call
the “Nash reversion strategy: Firms cooperate until someone deviates, and any deviation
triggers a permanent retaliation in which both firms thereafter set their prices equal to
cost, the one-period Nash strategy.”)
In terms of our example, the decision problem of the employer becomes
1 + δ + δ² + ... > 2 + 0 + 0 + ..., which holds if δ > 1/2. So, for all δ > 1/2, the "grim" or
"trigger" strategy constitutes a subgame perfect equilibrium in the PA game as
parameterized above.
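Spelling out the comparison (a standard geometric-series step, with δ the per-period discount factor): never exploiting yields 1 + δ + δ² + ... = 1/(1 − δ), while exploiting once yields 2 + 0 + 0 + ... = 2; the former exceeds the latter iff 1 > 2(1 − δ), i.e., iff δ > 1/2.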
(Compare to equations 12.D.1 and 12.D.2 in MCWG: note that the above result is
nothing but Proposition 12.D.1 for the 1sPDG. Proposition 12.D.1 deals with the 2sPDG
case of the indefinitely repeated Bertrand duopoly game.)
Note: A cool result here (which is highly applicable to experimental work):
Whenever two people interact in that kind of scenario and you watch them for just one instance,
you might think they are really nice, altruistic, what not, when in fact they are just self-interested
utility maximizers (!)
Continuing with in(de)finite repetitions – see Binmore 2007, pp. 328 - 346:
Unfortunately, the “grim” or “trigger” strategy is one of many, many strategies.
Let’s focus, as Binmore does, on those strategies that can be represented by
finite automata ("idealized computing machines") that can remember only a
finite number of things (actions) and therefore cannot keep track of all possible
histories in a long repeated game. Figure 11.5 shows all 26 one-state and two-state finite automata that can play the PDG:
[Figure 11.5 (Binmore 2007): the one-state and two-state finite automata that can play the PDG]
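To make the strategy-as-automaton idea concrete, here is a minimal Python sketch of two familiar two-state machines of the kind shown in such a figure (grim/trigger and tit-for-tat) playing a repeated PDG against each other; the payoff numbers and machine encodings are illustrative assumptions, not Binmore's parameterization.

# Minimal sketch (illustrative payoffs): two-state automata playing a repeated PDG.
# Each machine has a starting state, an action per state, and a transition rule
# (current state, opponent's last action) -> next state.
PAYOFFS = {("C", "C"): (2, 2), ("C", "D"): (0, 3),
           ("D", "C"): (3, 0), ("D", "D"): (1, 1)}

GRIM = {"start": "coop", "action": {"coop": "C", "punish": "D"},
        "next": {("coop", "C"): "coop", ("coop", "D"): "punish",
                 ("punish", "C"): "punish", ("punish", "D"): "punish"}}

TIT_FOR_TAT = {"start": "nice", "action": {"nice": "C", "mean": "D"},
               "next": {("nice", "C"): "nice", ("nice", "D"): "mean",
                        ("mean", "C"): "nice", ("mean", "D"): "mean"}}

def play(m1, m2, rounds=10):
    s1, s2, totals = m1["start"], m2["start"], [0, 0]
    for _ in range(rounds):
        a1, a2 = m1["action"][s1], m2["action"][s2]
        p1, p2 = PAYOFFS[(a1, a2)]
        totals[0], totals[1] = totals[0] + p1, totals[1] + p2
        s1, s2 = m1["next"][(s1, a2)], m2["next"][(s2, a1)]   # transition on opponent's action
    return totals

print(play(GRIM, TIT_FOR_TAT))   # both machines cooperate throughout: [20, 20]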
The folk theorem is of fundamental importance for political philosophy.
But it creates an equilibrium selection problem: which of the many equilibria will
ultimately be selected? (Well, it depends.)
### Pedro Dal Bo, Cooperation under the Shadow of the Future:
Experimental Evidence from Infinitely Repeated Games - guiding
questions (and some answers)
infinitely or indefinitely?
- What’s the purpose of this paper? (see abstract, intro, conclusion)
- What is “the shadow of the future”?
- What is an innovation of this paper? (p. 1591) Or, in other words, why is it
problematic to assume, without further controls, that an increase in cooperation
brought about by an increase in the continuation probability is due solely to the
increase in the probability of continuation? (e.g., p. 1594)
- What are the simple stage games used in this article?
- What's the difference between the two games called PD1 and PD2 shown in
Table 2? (p. 1595) Intuitively, for which of these two games would you expect
more cooperation? Why? Does Table 3 confirm your intuition?
See also hypotheses 3 and 4.
- What exactly was the design? (pp. 1595 – 1597)
How did subjects interact? What exactly was the "matching
procedure"? (p. 1595, p. 1596) Was there a trial period? (pp. 1598 –
1599)
What were the players’ earnings? (p. 1595, p. 1598; see also the first
paragraph in section III. on p. 1597)
What were the three important new elements of the experimental
design? (pp. 1595 – 1597) What exactly is the relation of the "Dice"
sessions to the "Finite" sessions? (Make sure you understand fn 16
well!) What distinguishes "Normal" from "UD" sessions? Explain
exactly why this experiment consisted of eight sessions with three
treatments each. (p. 1597)
- What exactly are the theoretical predictions? (pp. 1597 – 1598)
Explain what Table 3 says!
- Explain why the first two hypotheses are fairly general while the last
two are specific. (Make sure to connect your answer to what you see
in Table 3.)
- How exactly were the experiments implemented? (p. 1598)
- What exactly were the results of the experiment?
Discuss Table 4 in detail.
Discuss Table 5 in detail. What are some of the key results?
Does the shadow of the future increase cooperation (as suggested by
theory)? (pp. 1599 – 1600) If so, how large is the effect?
Discuss Table 6 in detail.
How do levels of cooperation differ between Dice and Finite sessions?
(pp. 1600 – 1601) Pay attention to the analysis of individual actions
("strategies").
Do payoff details matter? (pp. 1601 – 1602) And how much do they
matter?
- What are the conclusions (the key findings) that can be drawn from the study?
strong support for the theory of infinitely repeated games
the shadow of the future matters:
it significantly reduces opportunistic behavior
more cooperation as the continuation probability goes up
more cooperation in indefinitely repeated games than in finitely
repeated games with the same expected length (on expected
length, see the note after this list)
behavioral differences in reaction to seemingly small payoff changes
interesting (and unpredicted) effects of experience, which might be of
interest for theories of equilibrium selection in indefinitely repeated
games.
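On "the same expected length": with continuation probability δ after each round, the number of rounds is geometrically distributed, so the expected length is 1/(1 − δ) (a standard calculation, not specific to the paper); e.g., δ = 3/4 gives an expected length of 4 rounds, which can then be matched by a finitely repeated game of exactly that length.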