Quantitative standards for absolute linguistic universals

Steven T. Piantadosi and Edward Gibson
Department of Brain and Cognitive Sciences, MIT

Abstract

Linguistic universals are often justified by cross-linguistic analysis: if many languages exhibit a property, the property is taken to be universal, perhaps specified in the cognitive or linguistic systems of language learners and users. Here we evaluate the implicit methodological assumptions behind this reasoning, and provide a quantitative standard for when absolute universal patterns in language are statistically justified by cross-linguistic analysis. We argue for a simple rule of thumb: claims regarding absolute universals need to be confirmed by 500 statistically independent languages in order to meet typical scientific standards. The magnitude of this number indicates that many—if not all—purported absolute claims in linguistics have not received sufficient justification.

Introduction

One of the most important goals for linguistics is to discover the range of possible human languages. An understanding of the range and breadth—or lack thereof—of human languages is key for discovering the core properties of the cognitive and linguistic systems which support it. Do all languages have recursion? Lexical categories? Principles of binding theory? In linguistic typology, such core universal structures and patterns have long been formally hypothesized. Greenberg (1960) described two different classes of universal patterns in languages. First, there are violable statistical tendencies in languages, universal patterns which hold with higher than chance frequency. We refer to these as typicality universals [1]. An example is Greenberg (1963)'s first universal that subjects tend to precede objects in simple declarative sentences. This is not true of all the world's languages, but is evidenced in the non-uniform distribution of word orders across languages.
Perhaps more interestingly, typology also formalizes absolute constraints, meaning ones which are inherently impossible for human language. While nearly all instances of absolute constraints are contested (see Evans & Levinson 2008), prototypical hypothesized examples include the presence of linguistic features such as WH-movement and auxiliaries (Pinker & Bloom 1990). There are also implicational absolutes, such as Greenberg (1963)'s universal that languages with VSO word order are always prepositional. Though some have argued that typicality universals are more informative and useful than absolute ones (Dryer 1998, Cysouw 2004, Bickel 2007), or that typological data is not relevant to universal grammar (Newmeyer 1998, 2004, Haspelmath 2004), we believe that the most substantive and interesting theories about cognition would come from absolute universals. Such universals—if any exist—would delineate necessary properties of human languages and thus characterize hard constraints on cognitive capacities. They also provide hypotheses about what knowledge or constraints humans bring to the problem of language learning—what makes humans distinctively human. Statistical methods for typicality universals have been extensively discussed in the literature (for an excellent review, see Cysouw 2004). A basic challenge with testing typicality universals is that the set of existing languages may not form a representative sample of the space of all possible human languages. For instance, if it is observed that SOV word order is more frequent than OVS, this may be due to an innate constraint against OVS word order.

[1] We differ from previous literature and Greenberg, which refer to these as statistical universals. We change terminology because the term statistical universal is confusing in a paper on statistical methods for analyzing universals.
Alternatively, the observed data might result from historical accident: perhaps SOV languages just happened to be the ones that survived to the present day. Which languages survive is partly chance, determined by which cultures are dominant politically, socially, and militarily. But the problem is more serious than just the effects of random chance. The processes of language change and contact lead to multiple languages with correlated features, meaning that some word orders will be over-represented much more than would be expected just by pure random sampling. That is, languages will tend to “clump” together in feature space since they are related, making the true distribution seem even less uniform than it would appear by chance. This problem is known as Galton’s problem, after Sir Francis Galton, who first pointed out the issues it causes for comparative anthropological work [2]. Many authors have proposed addressing this problem in typology by using a sample of languages which is as independent as possible. If one could find independent samples of languages—removing all the historical and geographical accidents of language evolution—one might be able to estimate the relative frequencies of different features reliably. Dryer (1989) argues that simply using stratified sampling (e.g. Tomlin 1986) does not adequately address this problem. Instead, Dryer suggests dividing languages up into genera, genetic groups of languages from 3500-4000 years ago (but see Maslova 2000, Dryer 2000), and looking at the patterns in genera of five different geographical regions. Dryer reasons that if a pattern is observed in all geographical regions, then it is likely to be universal, since languages will tend to be mostly independent across geographical regions. An alternative to this, proposed by Perkins (1989) and Rijkhoff (1993, 1998), is to sample languages which are as distinct as possible in their feature values.
These methods are explicitly constructed for evaluating typicality universals, in that they try to use sophisticated sampling techniques to more reliably estimate the frequencies of features. However, it is not clear how one might apply them to absolute universals. For one, taking a smaller, independent sample can never improve the estimate of what is possible and what is not, because it can only remove languages, potentially removing a crucial example of an otherwise unseen feature. Second, most statistical tests used by these authors are designed to compare relative frequencies, not to determine when the true frequency of a feature is zero. Indeed, absolute universals provide an interesting puzzle for scientific methodology: the presence of an absolute constraint can only be inferred from what has not yet been observed. But, as is often pointed out by typologists, it is always possible that an exception to a universal constraint might be observed with one additional language. Dryer (1998) describes the problem:

... no matter how many languages we examine and find conforming to an absolute universal, we can never know that there is not another language that fails to conform to the universal, either one that was once spoken or a hypothetical language that is possible but never actually spoken due to historical accident. What this means is that absolute universals are never testable [3]. ... No amount of data conforming to the generalization provides any reason to believe that there are no languages that do not conform. And no evidence from attested languages can provide any basis for believing that exceptions are not possible.

[2] Some anthropologists have even argued that the problem is so severe that there exist only 50 to 75 cultures in the whole world that can be examined and treated independently (Naroll 1973).
For Dryer this is both the logical problem of induction and a statistical puzzle of how we can know whether features are low probability or zero probability. Indeed, Bickel (2001) goes so far as to say that cross-linguistic analysis “cannot in principle contribute” to the discovery of what is possible and impossible in human languages since “a probabilistic statement is not, and cannot be converted into, a possibility statement.” He lists examples of linguistic phenomena which were previously presumed to be universally impossible, including for instance syntactic ergativity without morphological ergativity, and pronoun borrowing. This paper has two goals. The first is to show that it is possible within standard and justifiable scientific methods to discover absolute universal laws of language—to infer what is possible from what appears to be probable. We do this in two ways. First, we present a method centered around the frequentist idea of keeping a low false positive rate for proposing linguistic universals. Second, we present a formal mathematical model which formalizes the degree of belief one should have that a property of language is impossible, given a set of sampled languages. The model is a generalization of a simple problem often taught in introductory Bayesian statistics courses: how many times would a coin have to come up heads before you were reasonably convinced that both sides of the coin were heads? Though you may never observe both sides simultaneously, enough successive instances of heads—say, 1000—will eventually convince you beyond any scientific standard of evidence that tails is impossible. The reason is that 1000 heads in a row is very low probability if the coin is fair, but has probability 1 if the coin has two heads. Thus, the data will eventually outweigh any low expectations you have about coming across two-headed coins.
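The two-headed coin computation can be sketched directly. This is our illustration, not the paper's; the prior probability of encountering a two-headed coin (1e-6) is an arbitrary, deliberately skeptical choice:

```python
# Posterior probability that a coin is double-headed after n heads in a row,
# under a skeptical prior. With enough heads, the data overwhelm the prior.
def posterior_two_headed(n_heads, prior=1e-6):
    # P(data | two-headed) = 1; P(data | fair) = 0.5 ** n_heads
    p_two = 1.0 * prior
    p_fair = (0.5 ** n_heads) * (1 - prior)
    return p_two / (p_two + p_fair)

print(posterior_two_headed(10))    # ten heads: still almost certainly a fair coin
print(posterior_two_headed(1000))  # a thousand heads: effectively certain it is two-headed
```

Ten heads leaves the posterior near the prior, while a thousand heads pushes it to essentially 1, which is the sense in which probability evidence can establish impossibility.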
Such coin-flipping examples demonstrate one way data and modeling can be used to adjudicate between potential theories—in this case, between an outcome of tails being possible or impossible. Without these types of analysis, it is perhaps unclear whether it is even scientifically valid to make claims about unobserved universal patterns. At what point does the absence of a linguistic feature justify the inference that it is impossible for human language? Is not observing a feature in ten languages enough? One hundred languages? Ten thousand? Does it require more than the number of languages currently in existence? The second goal of this paper is to answer these questions, showing how many languages need to be examined in order to provide strong evidence for a universal. We demonstrate that within scientific standards, the number of languages which one must sample in order to provide strong evidence for absolute linguistic universals is quite large. We will justify the following rule of thumb: findings about absolute universals in language should be based on examination of no fewer than 500 independent languages. To our knowledge, most typological studies report results from far fewer than this number, indicating that many purported absolute universals have likely received less than the necessary amount of justification. In the next section, we formalize a framework for testing absolute universals. Then, after presenting a case study on word order which demonstrates several important methodological points, we apply bootstrapping techniques to 138 features [4] from the World Atlas of Language Structures (Cysouw 2008), and show that incorrect absolutes are often inferred by sampling fewer than 500 languages.

[3] Dryer uses “testable” as a technical term, meaning that there exists possible evidence to falsify the theory, and possible evidence to confirm it.
In the last section, we develop the Bayesian statistical model for this problem, and show that Bayesian standards of evidence can be more optimistic, but still require sampling hundreds of languages.

A framework for absolute universals

In this section we introduce some terminology and provide the setup for our formal analyses. We will assume that we are interested in the values that a single feature of language can take. We will denote this feature as F, and its N particular values as v1, v2, ..., vN. The values vi should exhaust the logically possible values of F, corresponding to the entire space of feature values a naive linguist might suppose that languages could exhibit. One useful example of a feature F is word order, a language's preferred linear order of subject, verb, and object. Here, there are seven different possible values for F: SVO, SOV, OSV, OVS, VSO, VOS, and ND (no dominant word order). Thus, v1 = SVO, v2 = SOV, v3 = OSV, etc. This setup is similar in spirit to the principles and parameters approach to language acquisition (e.g. Chomsky, 1981), but is meant to be theory-neutral. The features we describe may be entirely descriptive—or any property of language that interests linguists. There is no assumption that the space of these parameters is innate or even explicitly represented by any cognitive structure. Much work in typology tries to find suspicious gaps: unobserved feature values vi that don't appear unlikely for any a priori reason. When these feature values are not observed, it is eventually concluded that they must not be permissible according to cognitive or linguistic constraints common to all humans. The basic principle behind this reasoning is:

If value vi for feature F does not occur in any sampled language, there is an absolute universal constraint against vi.

Use of this principle will be evaluated in the next two sections.

[4] We did not include features on sign languages.
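As a minimal sketch of this principle (our own illustration; the names are hypothetical), the naive typologist simply declares impossible whatever is absent from the sample:

```python
# The naive inference principle: any feature value unobserved in the sample
# is declared universally impossible.
WORD_ORDERS = {"SVO", "SOV", "OSV", "OVS", "VSO", "VOS", "ND"}

def claimed_impossible(sampled_values, all_values=WORD_ORDERS):
    """Feature values a naive typologist would claim are ruled out."""
    return all_values - set(sampled_values)

# A small sample that happens to contain no OSV or OVS languages:
sample = ["SVO", "SOV", "SVO", "VSO", "VOS", "ND", "SOV"]
print(sorted(claimed_impossible(sample)))  # ['OSV', 'OVS']
```

Here OSV and OVS are declared impossible merely because the small sample missed them, which is exactly the failure mode quantified below.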
Of course, unrestricted use of it leads to ridiculous conclusions. To our knowledge, no language uses the same word for accordions and dissertation advisors, but one would not want to conclude that such a lexicon is ruled out by some universal constraint. The features and values which are tested must respect other theoretical and empirical findings—in this example, that lexicons are arbitrary. In this kind of setup, one can formalize many types of universals, including implicational universals: for instance, letting v1 = not VSO word order, v2 = VSO and prepositional, and v3 = VSO and not prepositional, Greenberg's third universal holds that v3 is impossible. In general, there are two types of errors one might make: one might incorrectly infer that some vi is possible, or one might incorrectly infer that some vi is impossible. If the null hypothesis is that all the vi are possible (which is a reasonable null hypothesis since it is least constrained), then these mistakes correspond respectively to false negative (Type II) and false positive (Type I) errors. Use of the above principle can never lead to a false negative error, thinking that an impossible linguistic feature is actually possible [5]. This is because if a feature value vi is observed in a language, it must necessarily be possible—otherwise it could not have been observed. However, as discussed above, false positives can occur, since just by chance one may fail to sample a language with feature value vi. In the next two sections, we will follow an approach common in frequentist statistics of ensuring that the probability of making a false positive error—incorrectly inferring a universal constraint—is kept low. It is impossible to make this probability zero, since that would require examining all possible languages, a huge or infinite number. Instead, we might suppose that we wish our scientific method to keep the false positive probability below a bound, perhaps 0.05, the standard for publication in psychology.
This means that 1 in 20 findings of universal constraints with this method will be false positives. However, for what are essentially non-replicable experiments (since there are only finitely many languages to test), one may actually desire a much smaller false positive rate of 0.01 or 0.001.

Bootstrapping word order

One way to begin to understand the amount of evidence needed to make the false positive rate low is through simulation. Suppose a naive scientist examined the word order patterns of L different languages, and applied the above principle to conclude that unobserved word orders must be impossible. Since we know the ground truth that all word orders are possible for languages, we know that each time such a scientist claims to have discovered a universal, they have made a false positive error. This means that we can estimate the false positive rate in word order by computing how often such a scientist would reach the conclusion that some word orders are impossible (cf. Haspelmath & Siegmund 2006). This is a frequentist way of thinking: how often would we have been led to the wrong conclusion if we repeatedly ran the experiment of sampling L human languages and concluding that any unobserved word order was impossible? We start with a simple example of this way of thinking before continuing to several more sophisticated analyses. Ideally, to compute this false positive rate, we would repeatedly sample L languages from the true distribution of possible human languages and see how often we make a false positive error [6]. One problem with this approach is that we do not know the true distribution of word orders.

Figure 1. The probability of thinking that each word order is impossible, as a function of the number of languages observed.

[5] Assuming we can always correctly identify the values of vi for a particular language, which is not always easy, but is a reasonable simplification for our analysis.
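If the true distribution were known, this repeated-sampling experiment would be straightforward to run. The sketch below is ours, with illustrative word-order frequencies standing in for the unknown truth:

```python
import random
from math import log

random.seed(1)

# Hypothetical (made-up) true frequencies for the seven word-order values.
orders = ["SVO", "SOV", "VSO", "VOS", "ND", "OSV", "OVS"]
probs  = [0.42, 0.41, 0.09, 0.03, 0.03, 0.015, 0.005]

def false_positive_rate(L, runs=2000):
    """How often a sample of L languages misses OVS entirely,
    leading to a (false) claim that OVS is impossible."""
    misses = 0
    for _ in range(runs):
        sample = random.choices(orders, weights=probs, k=L)
        if "OVS" not in sample:
            misses += 1
    return misses / runs

def languages_needed(p, threshold):
    """Closed form: the L at which (1 - p) ** L falls to the threshold."""
    return log(threshold) / log(1 - p)

print(false_positive_rate(100))       # near (1 - 0.005)**100, about 0.61
print(false_positive_rate(1000))      # near (1 - 0.005)**1000, under 0.01
print(languages_needed(0.005, 0.05))  # roughly 598 independent languages
```

With a word order at frequency 0.005, even 100 languages miss it most of the time, and keeping the false positive rate below 0.05 takes roughly 600 independent samples.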
However, we can use a trick known as bootstrapping (see Davison & Hinkley 1997). In bootstrapping, one approximates the true distribution with an empirically observed distribution. In our case, we sample word orders from the distribution of word orders in a large database, as an approximation to the true distribution. The false positive rate found by resampling from the empirically observed distribution will approximate the false positive rate of sampling from the true distribution. This trick is very general and can be applied in many situations where the reliability of an estimate must be computed. To do this in our case, we imagine that a scientist randomly sampled L word orders from the empirically observed distribution of word orders in the 2885 languages of the World Atlas of Language Structures (WALS) (Cysouw 2008). As in all bootstrapping, this sampling is done with replacement, in order to simulate sampling from the infinity of possible human languages. A simple version of this computation as applied to word order was performed by Bell (1978), and the calculation itself is straightforward: if a word order occurs in known languages with probability p, then with probability (1 − p)^L it will not occur in a sample of L languages. This expression gives the probability of not observing a word order, and thus concluding (incorrectly) that it is absolutely impossible. Figure 1 shows how often one would conclude that each word order is impossible, as the number of languages sampled, L, varies from 1 to 2000. For instance, the line for OVS represents the probability that a scientist who looked at L languages would incorrectly conclude that OVS is absolutely and universally outlawed.

[6] For simplicity, here we assume that the languages sampled are drawn independently, corresponding to a scientist who examined independent languages; the topic of independence will be addressed further in the next section.

Again, we are
interested in making the probability of false positives small, below either the dashed 0.05 line or the 0.01 line. This graph illustrates several key points. First—and unfortunately—this figure illustrates that for word order, one must examine a relatively large number of languages in order to bring the false positive probability below 0.05. A linguist who looked at fewer than around 900 languages would have a false positive rate of greater than 0.05 of thinking that OSV is an impossible word order. Here, this is even a slightly conservative [7] estimate, since we may be interested in the probability of any vi occurring in a false positive, not just the least frequent word order. To make the rate less than 0.01, one would have to look at around 1400 languages. This example should lead us to conclude that absolute universals based on studying fewer than around 900 languages will not be justified: a method using fewer than this number gives too high a false positive rate. Second, this graph demonstrates that finding the least frequent feature values is hardest. This is because they are least likely to be drawn on any individual sample, and so are the most likely not to occur in a collection of L samples. This should be intuitive, and illustrates an often discussed problem for typology: rare feature values vi are more difficult to find, so it is never clear whether a feature value is impossible, or rare and simply not yet observed. This means that our degree of belief in an absolute universal should depend on how likely the rarest feature values are.

Bootstrapping features from WALS

The analysis in the previous section raises two interesting issues. First, since the false positive rate depends on how rare the rarest features are, it is not clear that findings based on the distribution of word orders generalize to other features linguists might care about.
[7] In this paper, conservative means it underestimates the number of languages required. This is conservative with respect to the number required, not with respect to the estimated false positive rate for any sample size.

It may be that rare word orders are much rarer than other typical features one would study, for instance. This means that the necessary L should not be estimated from word order simulations alone. To address this, we study 138 features from WALS, not just word order. Keeping the false positive rate low across this broader range of features should provide good guidelines for how to keep the false positive rate low for future linguistic features which are similar to those already in WALS. Second, it is not clear that the raw distribution of features in observed languages provides a good estimate of the true distribution. Many of the languages in WALS are genetically and historically related, meaning that accidents of linguistic history may be driving the overall distribution of word orders. This means that when we model sampling from the observed distribution, we are not correctly approximating the “true” distribution of word orders, and thus not approximating the correct false positive rate. This means that we should also consider alternative methods of approximating the true distribution. Here, we consider four:

1. Flat sampling – This uses the raw distribution of feature values observed in WALS.

2. Family stratified – Sampling from the true distribution is done by first sampling a language family uniformly at random, and then sampling a language within that family uniformly at random.

3. Genus stratified – Sampling from the true distribution is done by first sampling a language genus uniformly at random, and then sampling a language within that genus uniformly at random.

4.
Independent subsample – This method constructs a set of languages which share as few features as possible, and treats sampling from the true distribution as sampling from that set of approximately independent languages.

The flat sampling method is the one used in the previous section, using just the raw feature counts from WALS. The family stratified method attempts to correct for the fact that imbalances in the number of languages observed in each language family are likely due to historical accidents. We can partially correct for these historical accidents by increasing the probability with which we sample languages from language families with fewer languages. That is, within each language family we take the true distribution of feature values to be that observed in the family's features, but adjust for the fact that all language families may represent equally likely subspaces of the set of possible languages. The genus stratified sampling does the same thing, but for language genera instead of families. The independent subsample method is meant to approximate the true distribution by the distribution of features in a sample of languages whose features are as uncorrelated as possible (see also Perkins 1989, Rijkhoff 1993, 1998). This also attempts to remove artifacts of history, but does so by finding a sample of languages which is as independent as possible, rather than assuming that the true distribution can be approximated by our own classification of languages into families and genera. To do this, we would ideally search over subsets of languages and choose a subset with the most diverse set of feature values, taking this to be an approximation to the true and independent distribution of features. Unfortunately, the number of such subsets is exponential in the size of the subset.
To make matters worse, we are interested in finding a sample of languages in which all the possible feature values are attested—otherwise our approximation to the true distribution will already be missing possible feature values [8]. To solve this problem, we implemented a simple greedy algorithm to find a diverse subset of languages. The algorithm starts with an empty set of languages and considers adding each possible language in WALS. It first adds languages which decrease the number of unobserved feature values. This ensures that all feature values are represented. To break ties, it adds languages which share the smallest proportion of their features with languages already in the set [9]. Further ties are broken by picking languages which have the largest number of defined features. Note that after enough languages have been added so that all feature values are observed, additional languages never decrease the number of unobserved feature values, meaning that the decision about which language to add is based first on the proportion of features it shares with languages already in the set. This creates a set of languages with diverse feature values. We ran this algorithm to produce sets containing 255 and 639 languages, respectively 10% and 25% of the languages in WALS [10]. With each of these sampling methods, we used a dynamic programming algorithm to compute the probability of making a false positive error, averaging across the 138 features from WALS.

Figure 2. Bootstrapped false positive rates for 138 features from WALS.

[8] The requirement that all feature values are represented corresponds to an instance of SET-COVER, an NP-complete problem. Our algorithm is similar to the best polynomial-time approximation to SET-COVER (Chvatal 1979).

[9] Note that this means the languages are actually anticorrelated, a more conservative analysis for our purposes than independent.
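The greedy selection can be sketched on toy data. This is our simplified rendering, implementing coverage of unseen feature values plus the minimal-overlap tie-break, and omitting the defined-features tie-break; the language names and features are made up:

```python
# Greedy diverse-subset selection: prefer languages that cover unseen
# (feature, value) pairs, then languages with the least overlap with the
# set chosen so far.
def greedy_diverse_subset(languages, k):
    """languages: dict name -> dict(feature -> value). Returns k names."""
    chosen, seen_values = [], set()
    remaining = dict(languages)
    while remaining and len(chosen) < k:
        def score(name):
            values = set(languages[name].items())
            new = len(values - seen_values)              # unseen pairs covered
            overlap = len(values & seen_values) / max(len(values), 1)
            return (new, -overlap)                       # coverage, then low overlap
        best = max(remaining, key=score)
        chosen.append(best)
        seen_values |= set(languages[best].items())
        del remaining[best]
    return chosen

toy = {
    "A": {"order": "SVO", "adposition": "prep"},
    "B": {"order": "SVO", "adposition": "prep"},   # duplicate of A
    "C": {"order": "SOV", "adposition": "post"},
    "D": {"order": "VSO", "adposition": "prep"},
}
print(greedy_diverse_subset(toy, 3))  # ['A', 'C', 'D']
```

The duplicate language B is skipped in favor of C and D, which contribute new feature values, mirroring how the algorithm prefers typologically diverse languages.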
The estimated false positive rate is shown in Figure 2 for each of the methods discussed above. Unlike Figure 1, this plot shows the probability that a random feature (not feature value) from WALS would be found incorrectly to be constrained in a sample of L languages. For instance, the value of the “flat” curve is slightly less than 0.05 at L = 500, indicating that if the true distribution is best approximated using the flat method above, and 500 independent languages were examined, one could expect a false positive rate of slightly less than 0.05. This figure demonstrates that the number of languages necessary to achieve a false positive rate of 0.05 varies from around 300 to around 1000, depending on how we approximate the true distribution. Since we do not know which method best approximates the true distribution, we do not know which curve best represents the false positive rate. However, as discussed above, it is not clear that a false positive rate of 0.05 is sufficient for what is essentially a non-replicable experiment. The most optimistic curve, “Independent sample 10%,” drops below a false positive rate of 0.01 only after 500 languages, and many of the others do not make it there even after 2000 samples. We take these results as supporting the above heuristic that absolute universals require examining at least 500 languages. There is one very important caveat to this analysis. The sampling procedure we use assumed independent samples from the true distribution, simulating the false positive rate for linguists who sample independently from the true distribution. This means that what is really required is 500 independent languages, not 500 languages overall.

[10] There is a conflict between getting a large number of samples to approximate the true distribution—and thus approximating it well—and using fewer, and thus more independent, languages to approximate it.

For instance, Spanish and Italian do not count as two separate languages in this analysis since they
are genetically related. This means that the effective number of languages necessary may be much larger than 500 when sampling uses non-independent languages, since correlated samples will provably increase the number of samples needed to stay below a given false positive rate.

A Bayesian approach

In the previous section we looked at how many samples are necessary to maintain a low false positive rate, using the simulated false positive rates on current data as a proxy for false positive rates on future data. An alternative approach is to compute the degree of belief that one normatively should have that a feature is impossible. As we will show, this is possible through a Bayesian model, which illustrates that it is possible to infer possibility from probability. With enough examples that fail to show a feature value, the statistically better theory is one in which that feature value has zero probability. The Bayesian approach specifies a probability model over unknown states of the world and links this probability model to the observed data. In our case, the unknown state of the world is the true, underlying distribution of feature values. Let x = (x1, x2, ..., xN) be this true distribution on feature values v1, v2, ..., vN. Here, x assigns each feature value a probability, corresponding to the probability that an independently sampled language will exhibit that feature value. For instance, if the features are word orders, with v1 = SVO and v2 = OSV, then x1 might be relatively high, say 0.35, and x2 might be relatively low, say 0.05. We assume that the languages we observe have their feature values chosen independently from this distribution x. Thus, if x1 = 0.35, then we would expect 35% of languages to have feature value v1 in a large sample of languages. With this setup, we can consider two statistical models.
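The generative assumptions just described (a true distribution x, with each language's feature value an independent draw from x) can be sketched with the standard gamma-draw construction of a Dirichlet sample. This is our illustration; all numbers are arbitrary except α = 0.9, the value fit to WALS later in the text:

```python
import random

random.seed(0)

def sample_dirichlet(alpha, n):
    """Draw x ~ Dirichlet(alpha, ..., alpha) via normalized Gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

N, alpha, n_languages = 7, 0.9, 500
x = sample_dirichlet(alpha, N)          # true (unobservable) feature-value probabilities
features = random.choices(range(N), weights=x, k=n_languages)
counts = [features.count(i) for i in range(N)]  # observed counts c_1, ..., c_N
print(counts)
```

Only the counts are observable; whether some true x_i is exactly zero must be inferred from them, which is what the model comparison below formalizes.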
First, model MN holds that all N feature values are possible, assigning each some nonzero probability xi > 0 for all i. The second model, MK, holds that only the first K are possible, meaning that xi = 0 for i > K. Intuitively, MK corresponds to the scientific hypothesis that in the true distribution x, some feature values are fundamentally outlawed for human language. Considering this model only makes sense if the observed counts ci (the number of sampled languages exhibiting each feature value vi) satisfy ci = 0 for i > K, since otherwise MK must be the wrong model. We assume this in the rest of this section. Note that if we could directly study x, we could decide between MN and MK by looking to see if xi > 0 for some i > K. Unfortunately this true distribution is not observable, and its properties must be inferred from the counts ci. We will use a simple Bayesian model known as a Dirichlet-multinomial model, which makes parametric assumptions about the distribution x. In particular, it assumes that x has been generated from a Dirichlet distribution. Our parameterization of the Dirichlet distribution has a single free parameter, α, which controls the degree to which we expect x to be uniform. As α gets small, we expect that most of the probability mass in x is on only a few xi: most feature values are low probability, and the distribution is far from uniform. Large α means that the distribution x is very close to uniform: all feature values are equally likely. When α = 1, this corresponds to no real bias, with all distributions x being equally likely. We can also fit α by seeing which value makes the observed counts in WALS most likely. For this we find α = 0.9, a value close to 1, meaning that all distributions on feature values are about equally likely a priori. Given a value for α, we can compare the models by seeing which assigns the data a higher probability.
A standard in Bayesian statistics is to compute a Bayes factor, which is the log of the ratio of the probabilities each model assigns to the observed data (the set of counts c):

BF = log [ P(c | M_K, α) / P(c | M_N, α) ].    (1)

Here, a value greater than zero supports the constrained model M_K. However, what matters more is the strength of this evidence. Just as frequentist statistics standardizes p-values, Bayesians standardize how much evidence is required to conclude that one model is better than another. One commonly accepted standard is due to Jeffreys (1961), who considered Bayes factors (log base 2) in the 1.6–3.3 range "substantial," 3.3–5.0 "strong," 5.0–6.6 "very strong," and > 6.6 "decisive."

Equation 1 states the preference for the constrained model in terms of P(c | M_K, α) and P(c | M_N, α), but it may not be intuitively clear how to compute these values since x does not appear in them. As we show in Appendix A, these values can be found by integrating over x using the standard Dirichlet-multinomial model, meaning essentially that we compute the probability of c averaging over the unknown distribution x. Doing this, Equation 1 becomes

BF = log [ P(c | M_K, α) / P(c | M_N, α) ] = log [ (Γ(Kα) / Γ(Nα)) · (Γ(Nα + C) / Γ(Kα + C)) ],    (2)

where C = Σ_{i=1}^{N} c_i = Σ_{i=1}^{K} c_i, and Γ is the Gamma function, a generalization of factorials to non-integers. This expression gives a concise form for the strength of evidence in favor of a restricted model with some feature values universally outlawed, over an unrestricted model.

One easy example is to work out the case when α = 1 and K = N − 1. A value of α = 1 corresponds to the expectation that, before any counts have been observed, all distributions x on feature values are equally likely. K = N − 1 means that only a single feature value has not been observed.
In this case, Equation 2 simplifies to

BF = log [ (C + N − 1) / (N − 1) ].    (3)

For example, to get strong evidence (say, a Bayes factor of 4.0) in favor of a universal restriction against a single feature value when N = 10 feature values are possible, we would require a sample of 2^4 · (10 − 1) − 10 + 1 = 135 independently sampled languages not showing the feature. "Decisive" evidence, a Bayes factor of 6.6, would require 864 samples. When C ≫ N, the Bayes factor is approximately log(C/N), which indicates that doubling the number of possible feature values requires roughly doubling the number of samples to maintain the same Bayes factor. Note that these results still assume that c_i = 0 for i > K; observing a language with c_i > 0 for some i > K would falsify model M_K outright.

Figure 3 shows the Bayes factor according to Equation 2, for various N and K, using our empirically determined value of α = 0.9.

Figure 3. Bayes factor in favor of a constrained model, as a function of the number of samples, for α = 0.9.

These results qualitatively agree with our earlier frequentist analysis: very convincing evidence in support of an absolute universal is only found after around 500 languages. However, quantitatively this analysis is somewhat more optimistic, finding strong support for absolute universals after around 200 languages, and decisive support between around 500 and 800. As above, this requires languages sampled independently from the "true" distribution of languages.

This analysis required making parametric assumptions about the relevant distributions, most notably that feature distributions are generated from a Dirichlet distribution (Equation 4). We also assumed that all feature values are equally likely a priori, a reasonable simplification but one that is likely not true for real linguistic features.
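As a check on these numbers, Equation 2 can be implemented directly with the log-Gamma function. A sketch in Python (the helper names `bayes_factor` and `samples_needed` are ours):

```python
from math import lgamma, log

def bayes_factor(C, N, K, alpha=1.0):
    """Bayes factor of Equation 2, in log base 2 (the base used for
    Jeffreys' scale in the text):
    BF = log2[Gamma(K*a)/Gamma(N*a) * Gamma(N*a + C)/Gamma(K*a + C)]."""
    ln_bf = (lgamma(K * alpha) - lgamma(N * alpha)
             + lgamma(N * alpha + C) - lgamma(K * alpha + C))
    return ln_bf / log(2)  # convert natural log to log base 2

def samples_needed(target_bf, N, K, alpha=1.0):
    """Smallest number of independent languages C (none showing the
    restricted feature values) at which the Bayes factor reaches target_bf."""
    C = 0
    while bayes_factor(C, N, K, alpha) < target_bf - 1e-9:
        C += 1
    return C

# The worked example from the text: alpha = 1, N = 10, K = 9.
print(bayes_factor(135, 10, 9))    # approximately 4.0 ("strong")
print(bayes_factor(864, 10, 9))    # approximately 6.6 ("decisive")
print(samples_needed(4.0, 10, 9))  # 135
```

The linear search in `samples_needed` is adequate here because the Bayes factor increases monotonically in C.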
Since these assumptions are approximations, we put more stake in the numbers from the earlier, less optimistic analysis.

The key point from this model-based analysis is that it is possible to statistically justify theories of absolute universals. With enough evidence, a model with restrictions on feature values better predicts the observed data than one without restrictions, assuming the restricted feature values have not been observed in the sample. This is because the restricted model assigns higher probability to the observed features, since it does not waste probability mass on unobserved features. With enough evidence, this difference in likelihood becomes large and should convince an ideal statistical observer that some feature values are not possible. Said another way, after a particular feature value fails to be observed enough times, it becomes less likely that it is possible: if it were possible, we would expect to see it under random sampling. The above model formalizes this intuition, allowing us to compute the degree of belief in a constrained model relative to an unconstrained one. This means that it is rational to believe in absolute universals, assuming an adequate number of languages has been tested. The number of languages necessary can be computed by solving Equation 2 for the desired Bayes factor. (This framework can also handle more complex analyses, such as not assuming equal priors on all features, but the math is more complex.)

Conclusion

We have shown that in the case of absolute linguistic universals it is possible, within standard scientific practices, to infer what is possible from what appears to be probable. The Bayesian analysis we presented shows there is no logical problem with finding absolute universals, provided we can make assumptions about the a priori distribution of feature values.
In fact, relative to the assumptions of the model, the Bayesian approach provides an optimal strategy for inferring from data whether some feature values are impossible for language.

We also presented and justified a rule of thumb: claims about absolute linguistic universals should be based on examination of at least 500 independent languages. This is a rule of thumb because the true sample size required depends on many factors, such as the unknown true distribution, the number of possible feature values, and our expectations about the probability of rare features. Since we found this number using features from WALS, it can be interpreted as the sample size necessary for testing absolute universals with feature distributions and counts similar to those already found in WALS.

The number, 500, is grim for several reasons. First, it is almost certainly an underestimate of the necessary number of languages. Most of the analyses required around 500 languages to achieve false positive rates below 0.05 or 0.01, but many analyses required more. Much worse, however, is that there is a selection bias in the features used as input to the model, since only observed feature values from WALS are considered. It is likely that for some features there are other values which are possible for human languages but not found in the current WALS sample. This means that we have likely underestimated the occurrence of rare features, and therefore underestimated the number of languages necessary for justified claims about universals. Second, it is not clear to us that there even exist 500 essentially independent languages. Even if languages in different language families are assumed to be essentially independent (which they are not), there are only 208 language families represented in WALS.
This means that even in a database the size of WALS, with over 2,500 languages, there are not enough independent languages to make statistically justified claims about absolute linguistic universals. The situation improves somewhat if we assume language genera to be independent, since 477 of these are represented, but this assumption is doubtful. Unfortunately, performing an analogous analysis for non-independent language samples appears to be quite difficult, due to the complex historical relationships between languages.

In general, we suggest that claims about linguistic universals should be accompanied by some measure of the strength of evidence in favor of such a universal, for instance a Bayes factor for the constrained theory over the unconstrained theory. This is especially important when universals are taken to be relevant to theoretical debates in linguistics, since theories should not be evaluated by or depend on very weakly supported "universal" properties. Our results demonstrate that a considerable amount of work is required to justify absolute universals, and that skepticism is warranted when absolute universals are posited from sample sizes of fewer than around 500 languages.

Appendix A: Derivation of the Bayes factor

We apply a simple Bayesian model known as a Dirichlet-multinomial model. We present the derivation for M_N; the derivation for M_K is analogous. We assume a Dirichlet prior on x, which assigns any possible distribution x an a priori probability. The Dirichlet prior has a simple form:

P(x | α, M_N) = (Γ(Nα) / Γ(α)^N) · Π_{i=1}^{N} x_i^{α−1}.    (4)

This prior has two parts. The Γ(Nα)/Γ(α)^N term is simply a normalizing constant, written in terms of the Γ-function, a function that roughly generalizes factorials to non-integers. The product Π_i x_i^{α−1} formalizes the key part of the prior, and is parameterized by the single variable α. As α gets large, this prior favors a uniform distribution, x_i = 1/N for all i.
As α gets small, this prior favors distributions which place probability mass on only a few x_i. Crucially, however, this prior treats all x_i the same, meaning that a priori we do not expect some feature values to be more likely than others. (This is a simplification for our purposes; the same class of model works with different prior expectations for each x_i.)

The second part of the model is the likelihood, which formalizes how likely the observed data are under any assumed distribution x. The likelihood also takes a simple form. Suppose that each feature value v_i occurs c_i times in the observed data. This happens with probability x_i^{c_i}, since each instance of v_i occurs with probability x_i, and there are c_i of them. This means that P(c | x), with c = (c_1, c_2, ..., c_N), is

P(c | x, M_N) = Π_i x_i^{c_i}.    (5)

With this likelihood, distributions that closely predict the empirically observed counts c_i are more likely than those which predict distributions far from the observed ones.

We are interested in how well this model can predict the observed counts. Formally, we want a model such that P(c | M_N, α) is high, corresponding to a model which assigns the observed data high probability. To compute this term, we can write P(c | M_N, α) as an integral over all x:

P(c | M_N, α) = ∫ P(c | M_N, x) P(x | M_N, α) dx.    (6)

This gives the probability of the observed counts c for any α, by averaging over (integrating out) the unobserved distribution x. Because we used a Dirichlet prior and multinomial likelihood, Equation 6 can be computed explicitly. Substituting (5) and (4) into (6) gives

P(c | M_N, α) = ∫ (Γ(Nα) / Γ(α)^N) Π_{i=1}^{N} x_i^{c_i} Π_{i=1}^{N} x_i^{α−1} dx
             = (Γ(Nα) / Γ(α)^N) ∫ Π_{i=1}^{N} x_i^{c_i + α − 1} dx
             = (Γ(Nα) / Γ(α)^N) · (Π_{i=1}^{N} Γ(c_i + α)) / Γ(Σ_{i=1}^{N} (c_i + α)),    (7)

where the integral is a special form known as the Dirichlet integral, which has an analytical solution in terms of the Γ function.
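Because Equation 7 is just the average of the multinomial likelihood over the Dirichlet prior, its closed form can be checked against brute-force Monte Carlo integration. A sketch with hypothetical counts (the function names and the counts are ours):

```python
import random
from math import lgamma, exp, log

random.seed(2)

def log_marginal(counts, alpha):
    """Closed form for log P(c | M_N, alpha) from Equation 7:
    Gamma(N a)/Gamma(a)^N * prod_i Gamma(c_i + a) / Gamma(sum_i (c_i + a))."""
    n = len(counts)
    return (lgamma(n * alpha) - n * lgamma(alpha)
            + sum(lgamma(ci + alpha) for ci in counts)
            - lgamma(sum(ci + alpha for ci in counts)))

def mc_marginal(counts, alpha, draws=200_000):
    """Monte Carlo estimate of the same integral: average the multinomial
    likelihood prod_i x_i^{c_i} over draws x ~ Dirichlet(alpha, ..., alpha)."""
    n = len(counts)
    total = 0.0
    for _ in range(draws):
        g = [random.gammavariate(alpha, 1.0) for _ in range(n)]
        s = sum(g)
        x = [gi / s for gi in g]
        total += exp(sum(ci * log(xi) for ci, xi in zip(counts, x)))
    return log(total / draws)

c = [5, 3, 2, 0]  # hypothetical observed counts for N = 4 feature values
closed = log_marginal(c, 1.0)
approx = mc_marginal(c, 1.0)
print(closed, approx)  # the two should agree closely
```

The Monte Carlo estimate converges to the analytical value as the number of draws grows; the Dirichlet integral simply gives the limit in closed form.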
For any value of α, parameterizing how sparse the distribution x is, this equation gives the probability of the observed data (the counts c), averaging over all possible values of x, the variable we do not know.

We assume that c_i = 0 for i > K. The derivation for M_K is identical to the one for M_N, except that the products and sums run only up to K, since M_K does not assign any probability mass to x_i for i > K. With this, we have

P(c | M_K, α) = (Γ(Kα) / Γ(α)^K) · (Π_{i=1}^{K} Γ(c_i + α)) / Γ(Σ_{i=1}^{K} (c_i + α)).    (8)

The standard way to compare these two models (one which gives all feature values nonzero probability, and one which is restricted by a universal constraint) is to compute a Bayes factor. The Bayes factor quantifies the strength of evidence in favor of one model over another, and is computed simply as the log of the ratio of the probability of the data under each model:

BF = log [ P(c | M_K, α) / P(c | M_N, α) ]
   = log [ (Γ(Kα) / Γ(α)^K) · (Γ(α)^N / Γ(Nα)) · (Π_{i=1}^{K} Γ(c_i + α) / Π_{i=1}^{N} Γ(c_i + α)) · (Γ(Nα + C) / Γ(Kα + C)) ]
   = log [ (Γ(Kα) / Γ(Nα)) · Γ(α)^{N−K} · (1 / Π_{i=K+1}^{N} Γ(c_i + α)) · (Γ(Nα + C) / Γ(Kα + C)) ]
   = log [ (Γ(Kα) / Γ(Nα)) · (Γ(Nα + C) / Γ(Kα + C)) ],    (9)

where C = Σ_{i=1}^{N} c_i = Σ_{i=1}^{K} c_i. The simplification in the last step works because c_i = 0 for i > K, so Π_{i=K+1}^{N} Γ(c_i + α) = Γ(α)^{N−K}. This expression gives a concise form for the strength of evidence in favor of a restricted model with some feature values universally outlawed, over an unrestricted model.
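As a final sanity check, the simplified Bayes factor of Equation 9 can be verified numerically against the direct ratio of Equations 8 and 7. A sketch using natural logs (the base only rescales the Bayes factor; the counts and helper names are ours):

```python
from math import lgamma

def log_marginal(counts, alpha):
    """Log marginal likelihood of the Dirichlet-multinomial model
    (Equations 7 and 8; pass only the first K counts to get M_K)."""
    n = len(counts)
    return (lgamma(n * alpha) - n * lgamma(alpha)
            + sum(lgamma(ci + alpha) for ci in counts)
            - lgamma(sum(ci + alpha for ci in counts)))

def bf_closed_form(C, N, K, alpha):
    """The simplified Bayes factor of Equation 9 (natural log)."""
    return (lgamma(K * alpha) - lgamma(N * alpha)
            + lgamma(N * alpha + C) - lgamma(K * alpha + C))

# Hypothetical counts with c_i = 0 for i > K (here K = 3, N = 5).
c = [40, 30, 10, 0, 0]
alpha, N, K, C = 0.9, 5, 3, sum(c)

# Direct ratio: log P(c | M_K) - log P(c | M_N).
bf_direct = log_marginal(c[:K], alpha) - log_marginal(c, alpha)
print(bf_direct, bf_closed_form(C, N, K, alpha))  # the two agree
```

The agreement reflects the cancellation shown in the derivation: the unobserved values contribute only Γ(α) factors, which drop out of the ratio.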