Quantitative standards for absolute linguistic universals
Steven T. Piantadosi
Edward Gibson
Department of Brain and Cognitive Sciences, MIT
Abstract
Linguistic universals often are justified by cross-linguistic analysis: if many languages exhibit a property, the property is taken to be universal, perhaps specified in
the cognitive or linguistic systems of language learners and users. Here we evaluate
the implicit methodological assumptions behind this reasoning, and provide a quantitative standard for when absolute universal patterns in language are statistically
justified by cross-linguistic analysis. We argue for a simple rule-of-thumb: claims
regarding absolute universals need to be confirmed by 500 statistically-independent
languages in order to meet typical scientific standards. The magnitude of this number
indicates that many—if not all—purported absolute claims in linguistics have not
received sufficient justification.
Introduction
One of the most important goals for linguistics is to discover the range of possible
human languages. An understanding of the range and breadth—or lack thereof—of human
languages is key for discovering the core properties of the cognitive and linguistic systems
which support it. Do all languages have recursion? Lexical categories? Principles of binding
theory? In linguistic typology, such types of core universal structures and patterns have long
been formally hypothesized. Greenberg (1960) described two different classes of universal
patterns in languages. First, there are violable statistical tendencies in languages, universal
patterns which hold with higher than chance frequency. We refer to these as typicality
universals [1]. An example is Greenberg (1963)’s first universal, that subjects tend to precede
objects in simple declarative sentences. This is not true of all the world’s languages, but is
evidenced in the non-uniform distribution of word orders across languages.
Perhaps more interestingly, typology also formalizes absolute constraints, meaning
ones which are inherently impossible for human language. While nearly all instances of
[1] We differ from previous literature and Greenberg, which refer to these as statistical universals. We change terminology because the term statistical universal is confusing in a paper on statistical methods for analyzing universals.
absolute constraints are contested (see Evans & Levinson 2008), prototypical hypothesized
examples include the presence of linguistic features such as WH-movement and auxiliaries
(Pinker & Bloom 1990). There are also implicational absolutes, such as Greenberg (1963)’s
universal that languages with VSO word order are always prepositional. Though some
have argued that typicality universals are more informative and useful than absolute ones
(Dryer 1998, Cysouw 2004, Bickel 2007) or that typological data is not relevant to universal
grammar (Newmeyer 1998, 2004, Haspelmath 2004), we believe that the most substantive and interesting theories about cognition would come from absolute universals. Such
universals—if any exist—would delineate necessary properties of human languages and thus
characterize hard constraints on cognitive capacities. They also provide hypotheses about
what knowledge or constraints humans bring to the problem of language learning—what
makes humans distinctively human.
Statistical methods for typicality universals have been extensively discussed in the
literature (for an excellent review, see Cysouw 2004). A basic challenge with testing typicality universals is that the set of existing languages may not form a representative sample
in the space of all possible human languages. For instance, if it is observed that SOV word
order is more frequent than OVS, this may be due to an innate constraint against OVS word
order. Alternatively, the observed data might result from historical accident: perhaps SOV
languages just happened to be the ones that survive to the present day. What languages
survive is part chance, determined by what cultures are dominant politically, socially, and
militarily. But the problem is more serious than just the effects of random chance. The
processes of language change and contact lead to multiple languages with correlated features, meaning that some word orders will be over-represented much more than would be
expected just by pure random sampling. That is, languages will tend to “clump” together
in features space since they are related, making the true distribution seem even less uniform
than it would appear by chance. This problem is known as Galton’s problem, after Sir Francis Galton, who first pointed out the issues it causes for comparative anthropological work [2].
Many authors have proposed addressing this problem in typology by using a sample
of languages which is as independent as possible. If one could find independent samples of
language—removing all the historical and geographical accidents of language evolution—
one might be able to estimate the relative frequencies of different features reliably. Dryer
(1989) argues that simply using stratified sampling (e.g. Tomlin 1986) does not adequately
address this problem. Instead, Dryer suggests dividing languages up into genera, genetic
groups of languages from 3500-4000 years ago (but see Maslova 2000, Dryer 2000), and
looking at the patterns in genera of five different geographical regions. Dryer reasons that
if a pattern is observed in all geographical regions, then it is likely to be universal since
languages will tend to be mostly independent across geographical regions. An alternative
to this, proposed by Perkins (1989) and Rijkhoff (1993, 1998), is to sample languages which
are as distinct as possible in their feature values.
These methods are explicitly constructed for evaluating typicality universals in that
they try to use sophisticated sampling techniques to more reliably estimate the frequencies
of features. However, it is not clear how one might apply them to absolute universals.
For one, taking a smaller, independent sample can never improve the estimate of what
[2] Some anthropologists have even argued that the problem is so severe that there exist only 50 to 75 cultures in the whole world that can be examined and treated independently (Naroll 1973).
is possible and what is not, because it can only remove languages, potentially removing a
crucial example of an otherwise unseen feature. Second, most statistical tests which are used
by these authors are designed to compare relative frequencies, not to determine when the
true frequency of a feature is zero. Indeed, absolute universals provide an interesting puzzle
for scientific methodology: the presence of an absolute constraint can only be inferred from
what has not yet been observed. But, as is often pointed out by typologists, it is always
possible that an exception to a universal constraint might be observed with one additional
language. Dryer (1998) describes the problem as
... no matter how many languages we examine and find conforming to an absolute universal, we can never know that there is not another language that fails
to conform to the universal, either one that was once spoken or a hypothetical
language that is possible but never actually spoken due to historical accident.
What this means is that absolute universals are never testable [3]. ... No amount
of data conforming to the generalization provides any reason to believe that
there are no languages that do not conform. And no evidence from attested
languages can provide any basis for believing that exceptions are not possible.
For Dryer this is both the logical problem of induction, and a statistical puzzle of how we can
know whether features are low probability, or zero probability. Indeed, Bickel (2001) goes so
far as to say that cross-linguistic analysis “cannot in principle contribute” to the discovery
of what is possible and impossible in human languages since, “a probabilistic statement is
not, and cannot be converted into, a possibility statement.” He lists examples of linguistic
phenomena which were previously presumed to be universally impossible, including for
instance syntactic ergativity without morphological ergativity, and pronoun borrowing.
This paper has two goals. The first is to show that it is possible within standard
and justifiable scientific methods to discover absolute universal laws of language—to infer
what is possible from what appears to be probable. We do this in two ways. First, we
present a method centered around the frequentist idea of keeping a low false positive rate
of proposing linguistic universals. Second, we present a formal mathematical model which
formalizes the degree of belief one should have that a property of language is impossible, given
a set of sampled languages.
The model is a generalization of a simple problem often taught in introductory
Bayesian statistics courses: how many times would a coin have to come up heads before you
were reasonably convinced that both sides of the coin were heads? Though you may never
observe both sides simultaneously, enough successive instances of heads—say, 1000—will
eventually convince you beyond any scientific standard of evidence that tails is impossible.
The reason for this is that 1000 heads in a row is very low probability if the coin is fair, but
has probability 1 if the coin has two heads. Thus, the data will eventually outweigh any
low expectations you have about coming across two-headed coins. Such coin-flipping examples demonstrate one way data and modeling can be used to adjudicate between potential
theories—in this case between an outcome of tails being possible or impossible.
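This reasoning is easy to verify numerically. The sketch below is our illustration, not part of the analysis proper; the one-in-a-million prior on two-headed coins is an invented value chosen to represent strong initial skepticism.

```python
def posterior_two_headed(n_heads, prior=1e-6):
    """Posterior probability that a coin is two-headed, given n_heads
    consecutive heads and a small prior on two-headed coins.

    Bayes' rule: P(trick | data) is proportional to P(data | trick) * P(trick),
    where P(data | trick) = 1 and P(data | fair) = 0.5 ** n_heads.
    """
    like_fair = 0.5 ** n_heads
    return prior / (prior + (1 - prior) * like_fair)

# Ten heads barely moves a skeptical prior; thirty heads all but settles it.
print(posterior_two_headed(10))  # ~0.001
print(posterior_two_headed(30))  # ~0.999
```

Even a very low prior expectation of trick coins is overwhelmed once the run of heads is long enough, exactly as the argument above requires.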
Without these types of analysis, it is perhaps unclear if it is even scientifically valid
to make claims about unobserved universal patterns. At what point does the absence of
[3] Dryer uses “testable” as a technical term, meaning that there exists possible evidence to falsify the theory, and possible evidence to confirm it.
a linguistic feature justify the inference that it is impossible for human language? Is not
observing a feature in ten languages enough? One hundred languages? Ten thousand? Does
it require more than the number of languages currently in existence?
The second goal of this paper is to answer these questions, showing how many languages need to be examined in order to provide strong evidence for a universal. We demonstrate that within scientific standards, the number of languages which one must sample in
order to provide strong evidence for absolute linguistic universals is quite large. We will
justify the following rule-of-thumb:
Findings about absolute universals in language should be based on examination
of no fewer than 500 independent languages.
To our knowledge, most typological studies report results from far fewer languages than this,
indicating that many purported absolute universals have likely received less than the necessary amount of justification.
In the next section, we formalize a framework for testing absolute universals. Then,
after presenting a case study on word order which demonstrates several important methodological points, we apply bootstrapping techniques to 138 features [4] from the World Atlas of
Linguistic Structures (Cysouw 2008), and show that incorrect absolutes are often inferred
by sampling fewer than 500 languages. In the last section, we develop the Bayesian statistical analysis model for this problem, and show that Bayesian standards of evidence can be
more optimistic, but still require sampling hundreds of languages.
A framework for absolute universals
In this section we introduce some terminology and provide setup for our formal analyses. We will assume that we are interested in the values that a single feature of language
can have. We will denote this feature as F, and its N particular values as v1, v2, ..., vN. The
values of vi should exhaust the logically possible values of F , corresponding to the entire
space of feature values a naive linguist might suppose that languages could exhibit.
One useful example of a feature F is word order, a language’s preferred linear order
of subject, verb, and object. Here, there are seven different possible values for F : SVO, SOV,
OSV, OVS, VSO, VOS, ND (no dominant word order). Thus, v1 = SVO, v2 = SOV, v3 = OSV, etc. This setup is similar in spirit to the principles and parameters approach to
language acquisition (e.g. Chomsky, 1981), but is meant to be theory-neutral. The features
we describe may be entirely descriptive—or any property of language that interests linguists.
There is no assumption that the space of these parameters is innate or even explicitly
represented by any cognitive structure.
Much work in typology tries to find suspicious gaps of unobserved feature values vi
that don’t appear unlikely for any a priori reason. When these features are not observed, it
is eventually concluded that the feature must not be permissible according to cognitive or
linguistic constraints common to all humans. The basic principle behind this reasoning is:
If value vi for feature F does not occur in any sampled language, there is an
absolute universal constraint against vi .
[4] We did not include features on sign languages.
Use of this principle will be evaluated in the next two sections. Of course, unrestricted use
of it leads to ridiculous conclusions. To our knowledge, no language uses the same word for
accordions and dissertation advisors, but one would not want to conclude that such a lexicon
is ruled out by some universal constraint. The features and values which are tested must respect other theoretical and empirical findings—in this example, that lexicons are arbitrary.
In this kind of setup, one can formalize many types of universals, including implicational universals: for instance, letting v1 = not VSO word order, v2 = VSO and prepositional, and v3 = VSO and not prepositional, Greenberg’s third universal holds that v3 is impossible.
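As an illustration of this encoding, the sketch below (our own, with an invented three-language sample) applies the naive principle stated above to Greenberg’s implicational universal:

```python
def encode(word_order, has_prepositions):
    """Map a language onto the three composite values used to state
    Greenberg's universal that VSO languages are prepositional."""
    if word_order != "VSO":
        return "v1"                                # not VSO
    return "v2" if has_prepositions else "v3"      # v3 is the claimed-impossible value

# An invented sample of three languages: (word order, has prepositions).
sample = [("SVO", True), ("VSO", True), ("SOV", False)]
observed = {encode(order, prep) for order, prep in sample}

# The naive principle: any value unobserved in the sample is declared impossible.
impossible = {"v1", "v2", "v3"} - observed
print(impossible)  # {'v3'}
```

With this tiny sample, the naive linguist would conclude (here, correctly by hypothesis, but possibly wrongly in general) that v3 is ruled out.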
In general, there are two types of errors one might make: one might incorrectly infer
that some vi is possible, or one might incorrectly infer that some vi is impossible. If the
null hypothesis is that all the vi are possible (which is a reasonable null hypothesis since it
is least constrained), then these mistakes correspond respectively to false negative (Type-II)
and false positive (Type-I) errors. Use of the above principle can never lead to a false
negative error, thinking that an impossible linguistic feature is actually possible [5]. This is
because if a feature value vi is observed in a language, it must necessarily be possible—
otherwise it could not have been observed. However, as discussed above, false positives can
occur since just by chance one may fail to sample a language with feature value vi .
In the next two sections, we will follow an approach common in frequentist statistics of
ensuring that the probability of making a false positive error—incorrectly inferring a universal
constraint—is kept low. It is impossible to make the probability zero since that would
require examining all possible languages, a huge or infinite number. Instead, we might
suppose that we wish our scientific method to keep the false positive probability below a
bound, perhaps 0.05, the standard for publication in psychology. This means that 1 in 20
findings of universal constraints with this method will be false positives. However, for what
are essentially non-replicable experiments (since there are only finitely many languages to
test), one may actually desire a much smaller false positive rate of 0.01 or 0.001.
Bootstrapping word order
One way to begin to understand the amount of evidence needed to make the false positive
rate low is through simulation. Suppose a naive scientist examined the word order patterns
of L different languages, and applied the above principle to conclude that unobserved word
orders must be impossible. Since we know the ground truth that all word orders are possible
for languages, we know that each time such a scientist claims to have discovered a universal,
they have made a false positive error. This means that we can estimate the false positive
rate in word order by computing how often such a scientist would reach the conclusions that
some word orders are impossible (cf. Haspelmath & Siegmund 2006). This is a frequentist
way of thinking: how often would we have been led to the wrong conclusion if we repeatedly
ran the experiment of sampling L human languages and concluding that any unobserved
word order was impossible?
We start with a simple example of this way of thinking before continuing to several
more sophisticated analyses. Ideally, to compute this false positive rate, we would repeatedly
sample L languages from the true distribution of possible human languages and see how
[5] Assuming we can always correctly identify the values of vi for a particular language, which is not always easy, but is a reasonable simplification for our analysis.
Figure 1. The probability of thinking that each word order is impossible, as a function of the
number of languages observed.
often we make a false positive error [6]. One problem with this way of thinking is that we
do not know the true distribution of word orders. However, we can use a trick known
as bootstrapping (see Davison & Hinkley 1997). In bootstrapping, you approximate the
true distribution with an empirically-observed distribution. In our case, we sample word
orders from the distribution of word orders in a large database, as an approximation to the
true distribution. The false positive rate you find resampling from the empirically-observed
distribution will approximate the false positive rate of sampling from the true distribution.
This trick is very general and can be applied in many situations where the reliability of an
estimate must be computed.
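A minimal sketch of the bootstrap makes the trick concrete. The counts below are invented; only the resampling-with-replacement logic matters:

```python
import random

def bootstrap(sample, statistic, n_boot=10000, seed=0):
    """Approximate the sampling distribution of a statistic by repeatedly
    resampling the observed data with replacement."""
    rng = random.Random(seed)
    k = len(sample)
    return [statistic([rng.choice(sample) for _ in range(k)])
            for _ in range(n_boot)]

# Example: how variable is the proportion of SOV languages in a sample of 40?
data = ["SOV"] * 18 + ["SVO"] * 15 + ["VSO"] * 5 + ["other"] * 2
props = bootstrap(data, lambda s: s.count("SOV") / len(s))
# The spread of `props` estimates how reliable the observed 45% figure is.
```

The same resampling scheme, applied to feature counts rather than a single proportion, underlies the analyses that follow.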
To do this in our case, we imagine that a scientist sampled randomly L word orders
from the empirically observed distribution of word orders in the 2885 languages of the
World Atlas of Linguistic Structures (WALS) (Cysouw 2008). As in all bootstrapping,
this sampling is done with replacement in order to simulate sampling from the infinity of
possible human languages. A simple version of this computation as applied to word order
was performed by Bell (1978), and the calculation itself is straightforward: if a word order
occurs in known languages with probability p, then with probability (1 − p)^L it will not
occur in a sample of L languages. This expression gives the probability of not observing
a word order, and thus concluding (incorrectly) that a particular word order is absolutely
impossible.
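Bell’s calculation is straightforward to reproduce. In the sketch below, the 0.5% frequency is an illustrative value, not the WALS estimate for any particular word order:

```python
import math

def miss_probability(p, L):
    """Probability that a value with true frequency p never appears
    in a sample of L independent languages: (1 - p) ** L."""
    return (1 - p) ** L

def languages_needed(p, alpha=0.05):
    """Smallest L for which the miss probability drops below alpha."""
    return math.ceil(math.log(alpha) / math.log(1 - p))

print(languages_needed(0.005, alpha=0.05))  # 598
print(languages_needed(0.005, alpha=0.01))  # 919
```

Even for a value as frequent as one language in two hundred, keeping the false positive probability below conventional thresholds requires samples in the hundreds.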
Figure 1 shows how often one would conclude that each word order is impossible,
as the number of languages sampled, L, varies from 1 to 2000 languages. For instance,
the OVS line represents the probability that a scientist who looked at L languages
would incorrectly conclude that OVS is absolutely and universally outlawed. Again, we are
[6] For simplicity, here we assume that the languages sampled are drawn independently, corresponding to a scientist who examined independent languages; the topic of independence will be addressed further in the next section.
interested in making the probability of false positives small, below either the dashed 0.05
line, or the 0.01 line.
This graph illustrates several key points. First—and unfortunately—this figure illustrates that for word order, one must have examined a relatively large number of languages
in order to bring the false positive probability below 0.05. A linguist who looked at fewer
than around 900 languages would have a false positive rate of greater than 0.05 of thinking
that OSV is an impossible word order. Here, this is even a slightly conservative [7] estimate
since we may be interested in the probability of any vi occurring in a false positive, not just
the least frequent word order. To make the rate less than 0.01, one would have to look at
around 1400 languages. This example should lead us to conclude that absolute universals
based on studying fewer than around 900 languages will not be justified: a method using fewer than
this number gives too high of a false positive rate.
Second, this graph demonstrates that finding the least frequent feature values is hardest. This is because they are least likely to be drawn on any individual sample, and so are
the most likely not to occur in a collection of L samples. This should be intuitive, and illustrates an often discussed problem for typology: rare feature values vi are more difficult to
find, so it is never clear if a feature value is impossible, or rare and simply not yet observed.
This means that our degree of belief in an absolute universal should depend on how likely
the rarest feature values are.
Bootstrapping features from WALS
The analysis in the previous section raises two interesting issues. First, since the false
positive rate depends on how rare the rarest features are, it is not clear that findings based
on the distribution of word orders generalize to other features linguists might care about.
It may be that rare word orders are much rarer than other typical features one would study,
for instance. This means that the necessary L should not be estimated from word order
simulations alone. To address this, we study 138 features from WALS, not just word order.
Keeping the false positive rate low across this broader range of features should provide good
guidelines for how to keep the false positive rates low for future linguistic features which
are similar to those already in WALS.
Second, it is not clear that the raw distribution of features in observed languages
provides a good estimate of the true distribution. Many of the languages in WALS are
genetically and historically related, meaning that accidents of linguistic history may be
driving the overall distribution of word orders. This means that when we model sampling
from the observed distribution, we are not correctly approximating the “true” distribution
of word orders, and thus not approximating the correct false positive rate. This means that
we should also consider alternative methods of approximating the true distribution. Here,
we consider four:
1. Flat sampling – This uses the raw distribution of feature values observed in
WALS.
2. Family stratified – Sampling from the true distribution is done by first sampling
[7] In this paper, conservative means it underestimates the number of languages required. This is conservative with respect to the number required, not with respect to the estimated false positive rate for any sample size.
a language family uniformly at random, and then sampling a language within that family
uniformly at random.
3. Genus stratified – Sampling from the true distribution is done by first sampling
a language genus uniformly at random, and then sampling a language within that genus
uniformly at random.
4. Independent subsample – This method constructs a set of languages which
share as few features as possible, and treats sampling from the true distribution as sampling
from that set of approximately independent languages.
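The family stratified method (method 2) can be sketched as follows; the family assignments below are invented toy data, not WALS genealogy:

```python
import random

def family_stratified_sample(families, L, seed=0):
    """Family-stratified sampling: draw a family uniformly at random,
    then a language uniformly within that family, L times with replacement."""
    rng = random.Random(seed)
    names = list(families)
    return [rng.choice(families[rng.choice(names)]) for _ in range(L)]

families = {
    "Family-A": ["a1", "a2", "a3", "a4", "a5", "a6"],  # a large family
    "Family-B": ["b1", "b2"],
    "Family-C": ["c1"],                                # an isolate
}
sample = family_stratified_sample(families, L=300)
# Each family contributes about a third of the draws, so "c1" appears far
# more often than it would under flat sampling over the nine languages.
```

The genus stratified method is identical in structure, with genera in place of families.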
The flat sampling method is the one used in the previous section, using just the raw
feature counts from WALS. The family stratified method attempts to correct for the fact
that imbalances in the number of languages observed in each language family are likely due
to historical accidents. We can partially correct for these historical accidents by increasing
the probability with which we sample languages from language families with fewer languages.
That is, within each language family we take the true distribution of feature values to be
that observed in the family’s features, but assume that all language families represent
equally likely subspaces of the set of possible languages. The genus stratified
sampling does the same thing, but for language genera instead of families.
The independent subsample method is meant to approximate the true distribution
by the distribution of features in a sample of languages whose features are as uncorrelated
as possible (see also Perkins 1989, Rijkhoff 1993, 1998). This also attempts to remove
artifacts of history, but does it by finding a sample of languages which is as independent as
possible, rather than assuming that the true distribution can be approximated by our own
classification of languages in the families and genera. To do this, we ideally would search
over subsets of languages and choose a subset of languages with the most diverse set of
feature values, taking this to be an approximation to the true and independent distribution
of features. Unfortunately, the number of such subsets is exponential in the size of the
subset. To make matters worse, we are interested in finding a sample of languages in
which all the possible feature values are attested—otherwise our approximation to the true
distribution will already be missing possible feature values [8].
To solve this problem, we implemented a simple greedy algorithm to find a diverse
subset of languages. The algorithm starts with an empty set of languages and considers
adding each possible language in WALS. It first adds languages which decrease the number
of unobserved feature values. This ensures that all feature values are represented. To break
ties, it adds languages which share the smallest proportion of their features with languages
already in the set [9]. Further ties are broken by picking languages which have the largest
number of defined features. Note that after enough languages have been added so that all
feature values are observed, additional languages never decrease the number of unobserved
feature values, meaning that the decision about which language to add is based first on
the proportion of features it shares with languages already in the set. This creates a set of
languages with diverse feature values. We ran this algorithm to produce sets containing
[8] The requirement that all feature values are represented corresponds to an instance of SET-COVER, an NP-complete problem. Our algorithm is similar to the best polynomial-time approximation to SET-COVER (Chvatal 1979).
[9] Note that this means the languages are actually anticorrelated, a more conservative analysis for our purposes than independence.
Figure 2. Bootstrapped false positive rates for 138 features from WALS.
255 and 639 languages, respectively 10% and 25% of the languages in WALS [10].
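A simplified sketch of this greedy procedure is given below. It implements the first criterion and the shared-feature tie-break; the further tie-break on the number of defined features is omitted, and the toy data are invented:

```python
def greedy_diverse_subset(langs, target_size):
    """Greedily pick languages with diverse feature values.

    `langs` maps language name -> {feature: value}. Each step prefers the
    language contributing the most not-yet-seen (feature, value) pairs;
    ties are broken by the smallest proportion of features shared with
    languages already chosen.
    """
    chosen, seen = [], set()

    def new_values(name):
        return sum((f, v) not in seen for f, v in langs[name].items())

    def shared_fraction(name):
        feats = langs[name]
        if not chosen or not feats:
            return 0.0
        shared = sum(any(langs[c].get(f) == v for c in chosen)
                     for f, v in feats.items())
        return shared / len(feats)

    remaining = set(langs)
    while remaining and len(chosen) < target_size:
        best = max(remaining, key=lambda n: (new_values(n), -shared_fraction(n)))
        chosen.append(best)
        remaining.discard(best)
        seen.update(langs[best].items())
    return chosen

toy = {
    "A": {"order": "SOV", "adposition": "post"},
    "B": {"order": "SOV", "adposition": "post"},  # a near-duplicate of A
    "C": {"order": "VSO", "adposition": "pre"},
}
picked = greedy_diverse_subset(toy, 2)  # C plus exactly one of A/B
```

On the toy data, the algorithm never selects both near-duplicates, since the second adds no unseen feature values.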
With each of these sampling methods, we used a dynamic programming algorithm to
compute the probability of making a false positive error, averaging across the 138 features
from WALS.
The estimated false positive rate is shown in Figure 2 for each of the methods discussed
above. Unlike Figure 1, this plot shows the probability a random feature (not feature value)
from WALS would be found incorrectly to be constrained in a sample of L languages. For
instance, the value of the “flat” curve is slightly less than 0.05 at L = 500, indicating that if
the true distribution is best approximated using the flat method above, and 500 independent
languages were examined, one could expect a false positive rate of slightly less than 0.05.
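The same per-feature false positive rate can be approximated by simulation instead of dynamic programming. The sketch below is our illustration, using an invented seven-value distribution rather than actual WALS counts:

```python
import random

def fp_rate(probs, L, n_sim=5000, seed=0):
    """Monte Carlo estimate of the chance that at least one value of a
    feature goes unobserved among L independent languages, i.e. that the
    naive principle declares a false universal for this feature."""
    rng = random.Random(seed)
    values = range(len(probs))
    misses = 0
    for _ in range(n_sim):
        seen = set(rng.choices(values, weights=probs, k=L))
        misses += len(seen) < len(probs)
    return misses / n_sim

# An invented seven-value distribution whose rarest value has frequency 1%:
probs = [0.41, 0.35, 0.09, 0.07, 0.04, 0.03, 0.01]
print(fp_rate(probs, L=100))  # far above 0.05, dominated by the rare value
print(fp_rate(probs, L=500))  # close to (1 - 0.01)**500, about 0.007
```

Averaging such estimates across many features, as we do with the WALS data, gives the curves in Figure 2.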
This figure demonstrates that the number of languages necessary to achieve a false
positive rate of 0.05 varies from around 300 to around 1000, depending on how we approximate the true distribution. Since we do not know which method best approximates the true
distribution, we do not know which curve best represents the false positive rate. However,
as discussed above, it is not clear that a false positive rate of 0.05 is sufficient for what is
essentially a non-replicable experiment. The most optimistic curve, “Independent sample
10%,” drops below a false positive rate of 0.01 only after 500 languages, and many of the
others do not make it there even after 2000 samples. We take these results as leading to
the above heuristic that absolute universals require examining at least 500 languages.
There is one very important caveat to this analysis. The sampling procedure we used
assumed independent samples from the true distribution, simulating the false positive rate
for linguists who sample independently from the true distribution. This means that what
is really required is 500 independent languages, not 500 languages overall. For instance,
Spanish and Italian do not count as two separate languages in this analysis since they
[10] There is a conflict between using a large number of samples to approximate the true distribution—and thus approximating it well—and using fewer and thus more independent languages to approximate it.
are genetically related. This means that the effective number of languages necessary may
be much larger than 500 when sampling uses non-independent languages, since correlated
samples will provably increase the number of samples needed to stay below a given false
positive rate.
A Bayesian approach
In the previous section we looked at how many samples are necessary to maintain a
low false positive rate, using the simulated false positive rates on current data as a proxy
for false positive rates on future data. An alternative approach is to compute the degree of
belief that one normatively should have that a feature is impossible. As we will show, this
is possible through a Bayesian model, which illustrates that it is possible to infer possibility
from probability. With enough examples that fail to show a feature value, the statistically
better theory is one in which that feature value has zero probability.
The Bayesian approach specifies a probability model over unknown states of the world
and links this probability model to the observed data. In our case, the unknown state of
the world is the true, underlying distribution of feature values. Let x = (x1, x2, ..., xN) be
this true distribution on feature values v1, v2, ..., vN. Here, x assigns each feature value a
probability, corresponding to the probability that an independently sampled language will
exhibit that feature value. For instance, if the features are word orders, with v1 = SVO
and v2 = OSV, then x1 might be relatively high, say 0.35, and x2 might be relatively
low, say 0.05. We assume that the languages we observe have their feature values chosen
independently from this distribution x. Thus, if x1 = 0.35, then we would expect 35% of
languages to have feature value v1 in a large sample of languages.
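This generative assumption can be simulated directly; the distribution x below is invented for illustration, not estimated from WALS:

```python
import random
from collections import Counter

# An invented "true" distribution x over word-order values.
x = {"SVO": 0.35, "SOV": 0.40, "VSO": 0.09, "VOS": 0.03,
     "OVS": 0.01, "OSV": 0.005, "ND": 0.115}

rng = random.Random(0)
# Each simulated language's feature value is an independent draw from x.
sample = rng.choices(list(x), weights=list(x.values()), k=10000)
freq = Counter(sample)
print(freq["SVO"] / 10000)  # close to x["SVO"] = 0.35
```

In a large enough sample, the empirical frequencies approach x; the inferential problem is that x itself is never observed, only finite samples like this one.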
With this setup, we can consider two statistical models. First, model MN holds that
all N feature values are possible, assigning them each some nonzero probability xi > 0 for all
i. The second model, MK , holds that only the first K are possible, meaning that xi = 0 for
i > K. Intuitively, MK corresponds to the scientific hypothesis that in the true distribution
x, some feature values are fundamentally outlawed for human language. Considering this
model only makes sense if the observed counts ci (the number of sampled languages exhibiting vi) are zero for i > K, since otherwise MK must be the wrong model.
We assume this in the rest of this section. Note that if we could directly study x, we could
decide between MN and MK by looking to see if xi > 0 for some i > K. Unfortunately this
true distribution is not observable and its properties must be inferred from the counts ci of
how many times we observe each feature value.
We will use a simple Bayesian model known as a Dirichlet-Multinomial model which
makes parametric assumptions about the distribution x. In particular, it assumes that x
has been generated from a Dirichlet distribution. Our parameterization of the Dirichlet
distribution has a single free parameter, α, which controls the degree to which we expect
x to be uniform. As α gets small, we expect that most of the probability mass in x is
on only a few xi : most feature values are low probability, and the distribution is far from
uniform. Large α means that the distribution x is very close to uniform: all feature values
are equally likely. When α = 1, there is no real bias: all distributions on
x are a priori equally likely. We can also fit α by seeing which value makes the observed counts
in WALS most likely. Doing so, we find α = 0.9, a value close to 1, meaning that all
distributions on feature values are a priori about equally likely.
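This fitting procedure can be sketched in a few lines. The code below is our illustration, not the paper's: it maximizes the Dirichlet-Multinomial marginal likelihood (the closed form derived in Appendix A, Equation 7) over a grid of α values, using invented counts in place of the actual WALS data.

```python
from math import lgamma

def log_marginal(counts, alpha):
    """Log of the Dirichlet-Multinomial marginal likelihood (Equation 7)."""
    n, total = len(counts), sum(counts)
    return (lgamma(n * alpha) - n * lgamma(alpha)
            + sum(lgamma(c + alpha) for c in counts)
            - lgamma(n * alpha + total))

# Hypothetical feature-value counts; the real analysis uses counts from WALS.
counts = [40, 25, 15, 10, 5, 3, 1, 1]

# Maximum-likelihood fit of alpha by grid search over (0, 5].
alphas = [i / 100 for i in range(1, 501)]
alpha_hat = max(alphas, key=lambda a: log_marginal(counts, a))
```

The fitted value depends entirely on the counts supplied; the paper's α = 0.9 comes from the WALS features themselves.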
Given a value for α, we can compare the models by seeing which assigns the data a
higher probability. A standard in Bayesian statistics is to compute a Bayes factor, which is
the log of the ratio of the probability each model assigns to the observed data (the set of
counts c):
\[
\mathrm{BF} = \log \frac{P(c \mid M_K, \alpha)}{P(c \mid M_N, \alpha)}. \tag{1}
\]
Here, a value greater than zero supports the constrained model MK . However, what matters
more is the strength of this evidence. Just as frequentist statistics standardizes p-values,
Bayesians standardize how much evidence is required to conclude that one model is better
than another. One commonly accepted standard is due to Jeffreys (1961), who considered
Bayes factors in the 1.6–3.3 range11 "substantial," 3.3–5.0 "strong," 5.0–6.6 "very
strong," and greater than 6.6 "decisive."
Equation 1 states the preference for the constrained model in terms of P (c | MK , α)
and P (c | MN , α), but it may not be intuitively clear how to compute these values since
x does not appear in them. As we show in Appendix A, these values can be found by
integrating over x using the standard Dirichlet-Multinomial model, meaning essentially
that we compute the probability of c averaging over the unknown distribution x. Doing
this, Equation 1 becomes
\[
\mathrm{BF} = \log \frac{P(c \mid M_K, \alpha)}{P(c \mid M_N, \alpha)}
= \log \left( \frac{\Gamma(K \cdot \alpha)}{\Gamma(N \cdot \alpha)} \cdot \frac{\Gamma(N\alpha + C)}{\Gamma(K\alpha + C)} \right), \tag{2}
\]
where C = Σ_{i=1}^{N} c_i = Σ_{i=1}^{K} c_i and Γ is the Gamma function, a generalization of factorials
to non-integers. This expression gives a concise form for the strength of evidence in favor
of a restricted model with some feature values universally outlawed, over an unrestricted
model.
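Equation 2 is easy to evaluate numerically using the log-Gamma function. The sketch below is our own minimal Python rendering of it (the function name is ours); as in the text, it assumes the restricted feature values have zero observed counts.

```python
from math import lgamma, log

def bayes_factor(counts, K, alpha=1.0):
    """Log (base 2) Bayes factor for M_K over M_N, per Equation 2."""
    N, C = len(counts), sum(counts)
    assert all(c == 0 for c in counts[K:]), "M_K requires c_i = 0 for i > K"
    nat_log_bf = (lgamma(K * alpha) - lgamma(N * alpha)
                  + lgamma(N * alpha + C) - lgamma(K * alpha + C))
    return nat_log_bf / log(2)  # Jeffreys' thresholds are stated in log base 2
```

For instance, with N = 5 possible values, K = 4, α = 1, and C = 20 observations, the Bayes factor is log2(Γ(4)/Γ(5) · Γ(25)/Γ(24)) = log2(6) ≈ 2.58 bits, which falls in Jeffreys' "substantial" range.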
One easy example is to work out the case when α = 1 and K = N − 1. A value
of α = 1 corresponds to the expectation that, before any counts have been observed, all
distributions x on feature values are equally likely. K = N − 1 means that only a single
feature value has not been observed. In this case, Equation 2 simplifies to
\[
\mathrm{BF} = \log \frac{C + N - 1}{N - 1}. \tag{3}
\]
For example, to get strong evidence (say, a Bayes factor of 4.0) in favor of a universal
restriction against a single feature value when N = 10 feature values are possible, we would
require a sample of 2^4 · (10 − 1) − 10 + 1 = 135 independently sampled languages not
showing the feature value. "Decisive" evidence, a Bayes factor of 6.6, would require 864 samples.
When C ≫ N, the Bayes factor is approximately log(C/N), which indicates that doubling
the number of possible feature values requires roughly doubling the number of samples to
maintain the same Bayes factor. Note that these results still assume that ci = 0 for
i > K; otherwise, observing a language with ci > 0 for some i > K would falsify model MK.
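The worked example above can be checked directly from Equation 3. This is a small sketch of our own, in log base 2 following Jeffreys:

```python
from math import log2

def bf_single_restriction(C, N):
    """Equation 3: Bayes factor (log base 2) when alpha = 1 and K = N - 1."""
    return log2((C + N - 1) / (N - 1))

# N = 10 possible feature values, one of them never observed:
print(bf_single_restriction(135, 10))  # strong evidence: 4.0 bits
print(bf_single_restriction(864, 10))  # decisive: approximately 6.6 bits
```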
11 Log base 2.

Figure 3. Bayes factor in favor of a constrained model, as a function of the number of samples, for α = 0.9.

Figure 3 shows the Bayes factor according to Equation 2, for various N and K,
using our empirically-determined value of α = 0.9. These results qualitatively agree with
our earlier frequentist analysis: very convincing evidence in support of an absolute
universal is only found after around 500 languages. Quantitatively, however, this analysis
is somewhat more optimistic, finding strong support for absolute universals after around
200 languages and decisive support between around 500 and 800. As above, this requires
languages independently sampled from the "true" distribution of languages. This analysis
required making parametric assumptions about the relevant distributions, most notably
that feature distributions are approximated by a Dirichlet distribution (Equation 4). We
also assumed that all feature values are equally likely a priori, a reasonable simplification
but one that is likely not true for real linguistic features. Since these assumptions are just
approximations, we put more stock in the numbers from the earlier analysis, which were less
optimistic.
The key point from this model-based analysis is that it is possible to statistically
justify theories of absolute universals. With enough evidence, a model with restrictions on
feature values better predicts the observed data than one without restrictions, assuming
the restricted feature values have not been observed in the sample. This is because the
restricted model assigns higher probability to the observed features since it does not waste
probability mass on unobserved features. With enough evidence, this difference in likelihood
becomes large and should convince an ideal statistical observer that some feature values are
not possible. Said another way, as a feature value continues to go unobserved in larger and
larger samples, it becomes less likely that it is possible, since if it were possible, we would expect to
see it under random sampling. The above model formalizes this intuition, allowing us to
compute the degree of belief in a constrained model relative to an unconstrained one. This
means that it is rational to believe in absolute universals, assuming an adequate number
of languages has been tested. The number of languages necessary can be computed by
solving Equation 2 for the desired Bayes factor12.
12 This framework can also handle more complex analyses, such as not assuming equal priors on all feature values, but the math is more complex.
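As a sketch of that computation (our own code, not the paper's), Equation 2 can be inverted numerically by increasing C until the target Bayes factor is reached, assuming the restricted feature values are never observed:

```python
from math import lgamma, log

def bf_bits(N, K, C, alpha):
    # Equation 2 in log base 2, with all C observations on the K allowed values
    return (lgamma(K * alpha) - lgamma(N * alpha)
            + lgamma(N * alpha + C) - lgamma(K * alpha + C)) / log(2)

def required_sample(N, K, target, alpha=0.9):
    """Smallest number of independent languages giving a Bayes factor >= target."""
    C = 0
    # Small tolerance guards against floating-point error at the threshold.
    while bf_bits(N, K, C, alpha) < target - 1e-9:
        C += 1
    return C
```

With α = 1, N = 10, and K = 9, this recovers the 135-language figure computed from Equation 3 above.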
Conclusion
We have shown that in the case of absolute linguistic universals it is possible within
standard scientific practices to infer what is possible from what appears to be probable.
The Bayesian analysis we presented shows there is no logical problem with finding absolute
universals, assuming that we can make assumptions about the a priori distribution of feature
values. In fact, relative to the assumptions of the model, the Bayesian approach provides
an optimal strategy for inferring from data whether some feature values are impossible for
language.
We also presented and justified a rule of thumb that claims about absolute linguistic
universals should be based on examination of at least 500 independent languages. This is
a rule of thumb since the true sample size required depends on many factors, such as the
unknown true distribution, the number of possible feature values, and our expectations
about the probability of rare features. Since we found this number using features from
WALS, it can be interpreted as the sample size necessary for testing absolute universals
with feature distributions and counts similar to those found already in WALS. The number,
500, is grim for several reasons.
First, it is almost certainly an underestimate of the necessary number of languages.
Most of the analyses required around 500 languages to achieve false positive rates
below 0.05 or 0.01, but many analyses required more. Much worse, however, is that there is
a selection bias in the features which are used as input to the model, since only observed
feature values from WALS are considered. It is likely that for some features, there are other
values which are possible for human languages but not found in the current sample in WALS.
This means that we have likely underestimated the occurrence of rare features, therefore
underestimating the number of languages necessary for justified claims about universals.
Second, it is not intuitively clear to us that there are even 500 essentially independent languages in existence. Even if languages in different language families are assumed
to be essentially independent (which they are not), there are only 208 language families
represented in WALS. This means that even in a database the size of WALS with over
2,500 languages, there are not enough independent languages to make statistically-justified
claims about absolute linguistic universals. The situation improves somewhat if we assume
language genera to be independent since there are 477 of these represented, but this assumption is doubtful. Unfortunately, performing an analogous analysis for nonindependent
language samples appears to be quite difficult due to the complex historical relationship
between languages.
In general, we suggest that claims about linguistic universals should be accompanied
by some measure of the strength of evidence in favor of such a universal—for instance, a
Bayes factor for the constrained theory over the unconstrained theory. This is especially
important when universals are taken to be relevant to theoretical debates in linguistics,
since theories should not be evaluated by or depend on very weakly-supported “universal”
properties. Our results demonstrate that a considerable amount of work is required in order
to justify absolute universals, and that skepticism is warranted when absolute universals are
posited from sample sizes of less than around 500 languages.
Appendix A: Derivation of Bayes factor
We apply a simple Bayesian model known as a Dirichlet-Multinomial model. We first
present the derivation for MN as the derivation for MK is analogous. We assume a Dirichlet
prior on x which assigns any possible distribution x an a priori probability. The Dirichlet
prior has a simple form:
\[
P(x \mid \alpha, M_N) = \frac{\Gamma(N \cdot \alpha)}{\Gamma(\alpha)^N} \prod_{i=1}^{N} x_i^{\alpha - 1}. \tag{4}
\]
This prior has two parts. The $\Gamma(N \cdot \alpha) / \Gamma(\alpha)^N$ term is simply a normalizing constant, written in
terms of the Γ-function, a function that roughly generalizes factorials to non-integers. The
product of $x_i^{\alpha - 1}$ terms formalizes the key part of the prior, and is parameterized by the single
variable α. As α gets large, this prior favors a uniform distribution, xi = 1/N for all i. As
α gets small, this prior favors distributions which place probability mass on only a few xi .
Crucially, however, this prior treats all xi the same, meaning that a priori we do not expect
some feature values to be more likely than others13.
The second part of the model is the likelihood, which formalizes how likely the observed data are for any assumed distribution x. The likelihood takes a simple form as well.
Suppose that each feature value vi occurs ci times in the observed data. This happens with
probability $x_i^{c_i}$, since each instance of vi occurs with probability xi, and there are ci of them.
This means that P(c | x), with c = (c1, c2, . . . , cN), is then
\[
P(c \mid x, M_N) = \prod_{i} x_i^{c_i}. \tag{5}
\]
With this likelihood, distributions that closely predict the empirically observed counts
ci are more likely than those which predict counts far from the observed ones.
We are interested in how well this model can predict the observed counts. Formally,
we want a model such that P (c | MN , α) is high, corresponding to a model which assigns
the observed data high probability. To compute this term, we can write P (c | MN , α) as an
integral over all x:
\[
P(c \mid M_N, \alpha) = \int P(c \mid M_N, x) \, P(x \mid M_N, \alpha) \, dx. \tag{6}
\]
This gives the probability of the observed counts c for any α, by averaging over (integrating
out) the unobserved distribution x. Because we used a Dirichlet prior and multinomial
13 This is a simplification for our purposes, but this same class of model will work with different prior expectations for each xi.
likelihood, Equation (6) can be explicitly computed. Substituting (5) and (4) into (6) gives
\[
\begin{aligned}
P(c \mid M_N, \alpha)
&= \int \prod_{i=1}^{N} x_i^{c_i} \cdot \frac{\Gamma(N\alpha)}{\Gamma(\alpha)^N} \prod_{i=1}^{N} x_i^{\alpha - 1} \, dx \\
&= \frac{\Gamma(N\alpha)}{\Gamma(\alpha)^N} \int \prod_{i=1}^{N} x_i^{c_i + \alpha - 1} \, dx \\
&= \frac{\Gamma(N\alpha)}{\Gamma(\alpha)^N} \cdot \frac{\prod_{i=1}^{N} \Gamma(c_i + \alpha)}{\Gamma\left(\sum_{i=1}^{N} (c_i + \alpha)\right)}, \tag{7}
\end{aligned}
\]
where the integral is a special form known as the Dirichlet integral which has an analytical
solution in terms of the Γ function. For any value of α, parameterizing how sparse the
distribution x is, this equation gives the probability of the observed data (the counts c),
summing over all possible values of x, the variable we do not know.
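As a sanity check on this closed form, the marginal likelihood of Equation 7 can be compared against a brute-force Monte Carlo average over distributions x drawn from the Dirichlet prior. The following is our own illustration with toy counts:

```python
import random
from math import lgamma, exp

def log_marginal(counts, alpha):
    """Equation 7: closed-form value of the Dirichlet integral."""
    N, C = len(counts), sum(counts)
    return (lgamma(N * alpha) - N * lgamma(alpha)
            + sum(lgamma(c + alpha) for c in counts)
            - lgamma(N * alpha + C))

random.seed(0)
counts, alpha = [2, 1, 0], 1.0
exact = exp(log_marginal(counts, alpha))  # equals 1/30 for these toy counts

# Monte Carlo: average the likelihood (Equation 5) over x ~ Dirichlet(alpha).
n_samples = 100_000
total = 0.0
for _ in range(n_samples):
    g = [random.gammavariate(alpha, 1.0) for _ in counts]  # Gammas -> Dirichlet
    s = sum(g)
    lik = 1.0
    for gi, ci in zip(g, counts):
        lik *= (gi / s) ** ci
    total += lik
monte_carlo = total / n_samples
```

The two estimates agree to within Monte Carlo error, matching the analytical solution of the integral.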
We assume that ci = 0 for i > K. The derivation for MK is identical to the one for
MN, except that the products and sums run only up to K, since MK does not assign any probability mass
to xi for i > K. With this, we have
\[
P(c \mid M_K, \alpha) = \frac{\Gamma(K\alpha)}{\Gamma(\alpha)^K} \cdot \frac{\prod_{i=1}^{K} \Gamma(c_i + \alpha)}{\Gamma\left(\sum_{i=1}^{K} (c_i + \alpha)\right)}. \tag{8}
\]
The standard way to compare these two models—one which gives all feature values
nonzero probability, and one which is restricted by a universal constraint—is to compute
a Bayes factor. The Bayes factor quantifies the strength of evidence in favor of one model
over another, and is computed simply as the log of the ratio of the probabilities of the data under
each model:
\[
\begin{aligned}
\mathrm{BF} &= \log \frac{P(c \mid M_K, \alpha)}{P(c \mid M_N, \alpha)} \\
&= \log \left( \frac{\Gamma(K\alpha)}{\Gamma(N\alpha)} \cdot \frac{\Gamma(\alpha)^{N}}{\Gamma(\alpha)^{K}} \cdot \frac{\Gamma\left(\sum_{i=1}^{N} (c_i + \alpha)\right)}{\Gamma\left(\sum_{i=1}^{K} (c_i + \alpha)\right)} \cdot \frac{\prod_{i=1}^{K} \Gamma(c_i + \alpha)}{\prod_{i=1}^{N} \Gamma(c_i + \alpha)} \right) \\
&= \log \left( \frac{\Gamma(K\alpha)}{\Gamma(N\alpha)} \cdot \Gamma(\alpha)^{N-K} \cdot \frac{1}{\prod_{i=K+1}^{N} \Gamma(\alpha)} \cdot \frac{\Gamma(N\alpha + C)}{\Gamma(K\alpha + C)} \right) \\
&= \log \left( \frac{\Gamma(K \cdot \alpha)}{\Gamma(N \cdot \alpha)} \cdot \frac{\Gamma(N\alpha + C)}{\Gamma(K\alpha + C)} \right), \tag{9}
\end{aligned}
\]
where C = Σ_{i=1}^{N} c_i = Σ_{i=1}^{K} c_i. Note that this simplification works because ci = 0 for
i > K. This expression gives a concise form for the strength of evidence in favor of a
restricted model with some feature values universally outlawed, over an unrestricted model.