How many species are there on Earth and in the Ocean? Short title: On the number of species on Earth and in the ocean Camilo Mora*, Derek P. Tittensor, Sina Adl, Alastair G.B. Simpson, Boris Worm Department of Biology, Dalhousie University, Halifax, NS, Canada *E-mail: [email protected] ABSTRACT The diversity of life is one of the most striking aspects of our planet; hence knowing how many species inhabit Earth is among the most fundamental questions in science. Yet the answer to this question remains enigmatic because the limited sampling of the world’s biodiversity has precluded direct quantification, and because indirect estimations rely on assumptions that have proven highly controversial. Here we quantify the total number of species on Earth using a new described pattern. We show that the higher taxonomy of species (i.e. assignment of species to kingdom, class, order, family and genus) follows a consistent pattern from which the total number of species in any taxonomic group can be extrapolated. The approach was validated with well-known taxa and when applied to all domains of life it predicts ~8.9 million eukaryotic species globally of which ~2.2 million are marine. In spite of 250 years of taxonomic classification and over 1.2 million species already cataloged in a central database, our results suggest that some 86% of the species on Earth, and 91% in the ocean, still await description. With current rates of extinction now reaching 100 to 1000 times natural background rates, our slow advance in the description of species suggest that many species are likely to become extinct without us ever knowing they existed. Renewed interest in further exploration and taxonomy is critically required if this significant gap in our knowledge of life on Earth is to be breached. INTRODUCTION Robert May [1] recently noted that if aliens visited our planet, one of their first questions would be "How many distinct life forms-species does your planet have?". He also pointed out that we would be embarrassed by our answer. This simple narrative illustrates the fundamental nature of knowing how many species there are on Earth, and our limited progress thus far [1-4]. Unfortunately, the precarious sampling of the world’s biodiversity has prevented a direct quantification of the number of species on Earth whereas indirect estimations remain doubtful due to the use of controversial approaches (see detailed review of available estimations, methods and limitations in Table 1). Globally, our best approximation to the total number of species is based on the opinion of experts, whose estimates range widely between 3 and 100 million species [1]; these numbers, however, have been questioned due to their limited empirical basis [5] and subjectivity [5-6] (Table 1). Several attempts to improve global estimates have employed macroecological patterns and biodiversity ratios but the assumptions of these methods have been strongly contested [3vs7, 8vs9, 10vs2, 1112vs13, 14, 15vs5, 16vs17] plus their overall predictions are relevant to only subsets of species, such as insects [9,18-19], deep sea invertebrates [13], large organisms [6-7,10], animals [7], or fungi [20]. With the exception of a few extensively studied taxa (e.g. birds [21], fishes [22]), we are still remarkably uncertain as to how many species there are for most taxonomic groups, highlighting a significant gap in our basic knowledge of life on Earth. Here we present a novel, statistically robust method to quantifying the global number of species in all domains of life. We report that the number of higher taxa, which is much more completely known than the number of species [23], is strongly correlated to taxonomic rank [24] and that such a pattern allows extrapolating the global number of species for any kingdom of life (Fig. 1-2). We note that higher taxonomic data has been used previously to quantify species richness within specific areas by relating the number of species against the number of genera or families at well sampled locations, and then, use the resulting regression model to estimate the number of species at other locations for which higher taxon are better known than species richness (reviewed by Gaston [23]). This method, however, is inherently developed from and for local and not global species richness (for it to be suitable for the purpose of quantifying global number of species we would need data on the richness of replicated planets; obviously, not an option as far as we know). Here we analyze higher taxonomic data in a different manner by assessing patterns on the entire taxonomic classification of major taxonomic groups. The existence of predictable patterns in the classification of species (Fig. 3) could potentially allow predicting the total number of species within taxonomic groups and we are not aware that this method or idea has been evaluated for the purpose of estimating global species richness. RESULTS AND DISCUSSION We compiled the taxonomic classification of ~1.2 million currently valid species from several publicly accessible sources (see Materials and Methods). Among eukaryote “kingdoms”, assessment of time-accumulation curves of taxa (i.e. the cumulative number of described species, genera, orders, classes and phyla over time) indicated that higher taxonomic ranks are much more completely described than lower levels (Fig. 1a-f). However, this was not the case for prokaryotes in which new taxa are steadily described at all taxonomic levels (Fig. S1). For eukaryotes, the rate of discovery of new phyla (or “divisions” in botanical nomenclature) and classes has slowed substantially in modern times whereas the number of taxa at the genus or species level continues to increase steadily in all kingdoms (Fig. 1a-f, Fig. S1). This prevents direct extrapolation of the number of species from species-accumulation curves [21-22] and highlights our current uncertainty regarding estimates of total species richness (Fig. 2f). However, the increasing completeness of higher taxonomic ranks could facilitate the estimation of species richness, if the former predict the latter. We evaluated this hypothesis at the global scale. First, we accounted for undiscovered higher taxa by fitting asymptotic regression models to the time-taxa accumulation curves (Fig. 1a-e) and used a formal multimodel averaging framework based on Akaike’s Information Criterion [22] to predict the asymptotic number of taxa at each taxonomic level (Fig. 1a-e; see Material and Methods). Secondly, the predicted number of taxa at each taxonomic rank down to genus was regressed against the numerical rank, and the fitted model was used to predict the number of species (Fig. 1g, Material and Methods). We applied this approach to 18 taxonomic groups for which the total numbers of species are relatively well known and found that the approach yields accurate predictions for these groups (Fig. 2). When applied to all eukaryote kingdoms, the approach predicted ~7.9 million species of animals, ~300,000 species of plants, ~650,000 species of fungi, ~36,000 species of protozoa and ~5,800 species of chromists; in total 8.9 million species of eukaryotes are estimated to exist on Earth (Table 2). Restricting this approach to marine taxa resulted in a prediction of 2.2 million eukaryote species in the world’s oceans (Table 2). We applied the approach to prokaryotes as well; unfortunately, the steady pace of description of taxa at all taxonomic ranks precluded the calculation of asymptotes for higher taxa (Fig. S1). Thus, we used raw numbers of higher taxa (rather than asymptotic estimates) for prokaryotes and as such our estimations represent only a lower boundary of the diversity in this group. Our approach predicted ~10,000 species of prokaryotes of which ~550 are marine. An additional reason for the low number of prokaryotes is the existence of horizontal gene transfer between lineages, which leads to broader criteria for the classification of species in prokaryotes than in eukaryotes [25]. As a result there are relatively few described species of prokaryotes (only ~10,000 species are currently accepted), which are phenotypically broader and phylogenetically much older than eukaryotes [25]. We observed that the statistical predictability of the number of species from higher taxa emerges from a consistent pattern in the frequency distribution of subordinate taxa; that is, at any taxonomic rank, the diversity of subordinate taxa is concentrated within few groups with a long tail of low-diversity groups (Fig. 3). The accurate fit of a statistical model to this pattern implies that the number of taxa at any taxonomic rank can be predicted from the number of subordinate taxa and vice versa. The existence of these consistent patterns throughout the entire taxonomic hierarchy implies that knowing the number of taxa at just one taxonomic level could yield, with varying accuracy, the data necessary to reconstruct the number of taxa in the entire taxonomic hierarchy of a taxon. The mechanism for this pattern is uncertain, but in the case of taxa classified phylogenetically, it may involve a consistent pattern of diversification likely characterized by radiations within few clades and slow evolution (little cladogenesis) in most others [26]. In the case of taxa classified phenetically, the pattern implies the possibility of congruence with phylogenetic classifications, which has been demonstrated for some groups [25], although it might also reflect unseen patterns in the way phenotypic data have been used to classify species. Two long-standing issues in the classification of species that may influence the interpretation of our results concern the definition of species and the system of classification. Different taxonomic communities (e.g. the zoological, botanical and bacteria codes of nomenclature) use different levels of differentiation to define a species; this implies that estimations on the number of species according to the different codes are not directly comparable. Although we estimated the number of species for kingdoms traditionally classified under the same rules, our global predictions for eukaryotes and prokaryotes should be interpreted with that caution in mind. With respect to the second issue, changes between cladistic (i.e. classification based on phylogenetic origins) and taximetric (i.e. classification based on phenetics) systems as well as improvements within systems due to new technologies and data imply that the higher taxonomy will likely change in structure. When taxa simply change names or move to other clades while maintaining their taxonomic level, this will not influence our results since our method relies on the number of taxa and not their linkages. In contrast, increases or decreases in the number of higher taxa (e.g. by lumping or splitting or discovering new taxa) will affect our estimations. Nonetheless, a sensitivity analysis indicated that our predictions are surprisingly robust to moderate changes in higher taxonomy (Fig. S2). Knowing the total number of species has been a question of great interest motivated in part by our collective curiosity about the diversity of life on Earth and in part by the need to provide a reference point for current and future losses of biodiversity. Unfortunately, incomplete sampling of the world’s biodiversity combined with a lack of robust extrapolation approaches has yielded highly uncertain and controversial estimates of how many species there are on Earth. Here we have introduced a new approach whose validation and explicitly statistical nature adds greater robustness to its predictions. Our estimates suggest that after 250 years of taxonomic classification only ~14% of the species on Earth are currently catalogued while only ~9% of those in the ocean have been indexed in a central database (Table 2). Our slow advance in the description of species in combination with extinction rates now exceeding natural background rates by a factor of 100 to 1000 [27], suggest that species may become extinct without us ever knowing they existed. Increasing rates of biodiversity loss provide urgent incentives to increase our knowledge of the Earth’s remaining species. Though remarkable effort and progress have been made, a further closing of this knowledge gap will require a renewed interest in exploration and taxonomy by both researchers and funding agencies, and a continuing effort to catalogue existing biodiversity data in publicly available databases. MATERIALS AND METHODS Databases Calculations of the number of species on Earth were based on the classification of currently valid species from the Catalog of Life (www.sp2000.org) whereas the estimations for species in the Ocean were based on The World’s Registry of Marine Species (www.marinespecies.org). The latter database is largely contained within the former. These databases were screened for errors in the higher taxonomy including classification of organisms in multiple clades and homonyms. The Earth’s prokaryotes were analyzed independently using the most recent classification available in the List of Prokaryotic Names with Standing in Nomenclature database (http://www.bacterio.cict.fr). Additional information on the year of description of taxa was obtained from the Global Names Index database (http://www.globalnames.org). We only used data to 2006 to prevent artificial flattening of accumulation curves due to recent discoveries not yet being entered into databases. Statistical analysis To account for higher taxa yet to be discovered, for each taxonomic rank from phylum to genus we fitted six asymptotic parametric regression models to the time-taxa-accumulation curve and used multimodel averaging based on Akaike’s Information Criteria (AIC) to predict the asymptotic number of taxa (see Fig. 1a-e) [22]. Ideally data should be modelled using only the decelerating part of the accumulation curve [21-22]. However, there was no obvious breakpoint at which accumulation curves switched from an increasing to a decelerating rate of discovery (Fig. 1a-e). We therefore fitted the models starting at all possible years. We considered a specific year cutoff if at least 10 years of data remained available and if at least five of the six asymptotic models converged to the subset data. The multimodel asymptotic predictions and standard errors from the different cutoff years were then used to calculate a weighted consensus asymptotic mean and its standard error, ensuring that the variability within and among predictions were incorporated [29-30]. To estimate the number of species in a taxonomic group, we related the consensus number of higher taxa against their numerical rank (Fig. 1g) using least squares regression techniques. Since data are not strictly independent across hierarchically organized taxa, we also used Generalized Least Squares models assuming autocorrelated regression errors. Models were run with and without the inverse of the consensus variance as weights to account for variable uncertainty in the asymptotic number of higher taxa. We fitted exponential, power and hyperexponential functions to the data and obtained a prediction of the number of species by multimodel averaging based on AIC. For the validation analysis (Fig. 2), we could not use the hyperexponential function when less than three taxonomic ranks were available due to limited degrees of freedom. REFERENCES 1. May R (2010) Tropical Arthropod Species, More or Less?. Science 329:41-42 2. May RM (1992) How many species inhabit the earth? Sc. Amer 10:18-24. 3. Storks N (1993) How many species are there? Biodiv Conserv 2:215-232. 4. Gaston K, Blackburn T (2000) Pattern and Process in Macroecology. Blackwell Science Ltd. 5. Erwin TL (1991) How many species are there? Revisited. Conserv Biol 5:1-4. 6. Bouchet P (2006) The magnitude of marine biodiversity, in: Duarte, C.M. (Ed.). The exploration of marine biodiversity: scientific and technological challenges. pp. 31-62. 7. May RM (1988) How many species are there on earth?. Science 241:1441-1449. 8. Thomas CD (1990) Fewer species. Nature: 347:237. 9. Erwin TL (1982) Tropical forests: their richness in Coleoptera and other arthropod species. Coleopterists Bulletin 36:74-75. 10. Raven PH (1985) Disappearing species: a global tragedy. Futurist 19:8-14. 11. May RM (1992) “Bottoms up for the oceans”. Nature 357:278-279. 12. Lambshead PJD, Boucher G (2003) Marine nematode deep-sea biodiversity-hyperdiverse or hype? J Biogeogr 30:475-485. 13. Grassle JF, Maciolek NL (1992) Deep-sea species richness: regional and local diversity estimates from quantitative bottom samples. Am Nat 139:313- 341. 14. Briggs JC, Snelgrove P (1999) Marine species diversity. Bioscience 49:351-352. 15. Gaston KJ (1991) The magnitude of global insect species richness. Conserv Biol 5:183-96. 16. Poore CB, Wilson GDF (1993) Marine species richness. Nature 361:597-598. 17. May RM (1993) Marine species richness. Nature 361:598. 18. Hodkinson ID, Casson D (1991) A lesser predilection for bugs: Hemiptera (Insecta) diversity in tropical forests. Biol J Linn Soc 43:101-119. 19. Hamilton AJ et al. (2010) Quantifying uncertainty in estimation of tropical arthropod species richness. Am Nat 176:90-95. 20. Hawksworth DL (1991) The fungal dimension of biodiversity: magnitude, significance and conservation. Mycol Res 95:641-55. 21. Bebber DP et al. (2007)Predicting unknown species numbers using discovery curves. Proc Roy Soc B 274:1651-1658. 22. Mora C et al. (2008) The completeness of taxonomic inventories for describing the global diversity and distribution of marine fishes. Proc Roy Soc B 275:149-155. 23. Gaston KJ, Williams PH (1993) Mapping the world’s species-the higher taxon approach. Biodiv Lett 1:2-8. 24. Ricotta C et al. (2002) Using the scaling behaviour of higher taxa for the assessment of species richness. Biol Conserv 107:131-133. 25. Young JM (2001) Implications of alternative classifications and horizontal gene transfer for bacterial taxonomy. Int J Syst Evol Microbiol 51:945-953. 26. Dial KP, Marzluff JM (1989) Nonrandom Diversification within Taxonomic Assemblages. Syst Zool 38:26-37. 27. Lawton JH, May RM (1995) Extinction rates, Oxford University Press, Oxford, UK. 28. Chapman AD (2009) Numbers of Living Species in Australia and the World. 2nd edition. Australian Biodiversity Information Services, Toowoomba, Australia. 29. Rukhin AL, Vangel MG (1998) Estimation of a common mean and weighted means statistics. J Am Stat Assoc 93:303-308. 30. Levenson MS et al (2000) An ISO GUM approach to combining results from multiple methods. J Res Natl Inst Stan 105:571- 579. Acknowledgments We thank David Stang, Ward Appeltans, the Catalog of Life, the World Registry of Marine Species, and many constituent databases for making their species checklists freely available. We are grateful to Andrew Solow and Catherine Muir for comments on the manuscript. Funding for this project was provided by the Sloan Foundation through the Census of Marine Life Program, Future of Marine Animal Populations project. Table 1. Current methods for estimating global number of species and their limitations Case study Macroecological patterns Body size frequency distributions. By extrapolation from the frequency of large to small species, May [7] estimated 10 to 50 million species of animals. Latitudinal gradients in species. By extrapolation from the better sampled temperate regions to the tropics, Raven [10] estimated 3 to 5 million species of large organisms. Species-area relationships. By extrapolation from the number of species in deep-sea samples, Grassle & Maciolek [13] estimated that the world’s deep seafloor could contain up 10 million species. Time-accumulation curves. By extrapolation from the discovery record it was estimated that there are ~19,800 species of marine fishes [22] and ~11,997 birds [21]. Diversity ratios Ratios between taxa. By assuming a global ratio of fungi to vascular plants of 6:1 and considering that there are ~270.000 species of vascular plants, Hawksworth [20] estimated 1.6 million species of fungi. Host-specificity and spatial ratios. Given the 50.000 known species of tropical trees and assuming i) a 5:1 ratio of host beetles to trees, ii) that beetles represent 40% of the canopy arthropods and iii) that the canopy has twice the species of the ground, Erwin [9] estimated 30 million species of arthropods in the tropics. Known to unknown ratios. Hodkinson & Casson [18] estimated that 62.5% of the bug species in a sampled location were unknown; by assuming that 7.5-10% of the global diversity of insects is bugs, they estimated between 1.84 and 2.57 million species of insects globally. Expert opinions Analysis of expert estimations. Estimates of ~5 million species of insects [15] and ~200,000 marine species [14] were arrived at by compiling opinion-based estimates from taxonomic experts. Robustness in the estimations is assumed from the consistency of responses among different experts. Limitations May himself [7] suggested that there was no reason to expect a simple scaling law from large to small species. Further studies confirmed different modes of evolution among small species [4] and inconsistent body frequency distributions among taxa [4]. May [2] questioned the assumption that temperate regions were better sampled than tropical ones; the approach also assumed consistent diversity gradients across taxa which is not always true [4]. Lambshead & Boucher [12] questioned this estimation by showing that high local diversity in the deep sea does not reflect high global biodiversity given low species turnover. This approach is not widely applicable because it requires species accumulation curves to approach asymptotic levels, which is not common for most taxa [21-22]. Ratio-like approaches have been heavily critiqued because, given known patterns of species turnover, locally estimated ratios between taxa may or may not hold consistent at the global scale [3] and because at least one group of organisms should be well known at the global scale, which may not always be true [15]. Bouchet [6] elegantly demonstrated the shortcomings of ratio-based approaches by showing how in a well-inventoried area the ratio of fishes to total multicellular organisms would yield ~0.5 million global marine species whereas the ratio of Brachyura to total multicellular organisms in the same sample would yield ~1.5 million species. Erwin [5] labeled this approach as "non-scientific” due to a lack of verification. Estimates can vary widely, even those of a single expert [5,6]. Bouchet [6] argues that expert estimations are often passed on from one expert to another and therefore a "robust" estimation could be the same guess copied again and again. Table 2. Currently cataloged and predicted total number of species on Earth and in the ocean. Predictions for prokaryotes represent a lower bound because they do not consider undescribed higher taxa. For protozoa, the ocean database was substantially more complete than the database for the entire Earth so we only used the former to estimate the total number of species in this taxon. In some instances our predictions were smaller than the number of catalogued species. This can be due to the individual or combined effects of the imprecision in the estimations of our approach [note Standard Errors (SE) around predicted values] and the overestimation in the number of currently catalogued species due to synonyms, which may be substantial in certain taxa [3,15]. Figure legends Figure 1. Predicting the global number of species in Animalia from their higher taxonomy. First, we accounted for undiscovered higher taxa by fitting six asymptotic regression models to the cumulative number of taxa described over time (solid black lines) at all taxonomic ranks (a-f). Ideally asymptotic regression models should be fitted only to the decelerating part of such accumulation curves [21-22]. However, because there was no obvious breakpoint at which accumulation curves switched from an increasing to a decelerating rate of discovery, we fitted the models starting at all possible years. We considered a specific year cutoff if at least 10 years of data remained available and if at least five of the six asymptotic models converged to the subset data. The graded colors indicate the density of the multimodel fits to all starting years selected. The predictions from each of the cut-off years were used to calculate a weighted consensus asymptotic mean (horizontal dashed lines) and its standard error (horizontal grey area), ensuring that the variability within and among the multimodel average predictions were incorporated. g, Relationship between the consensus asymptotic number of higher taxa and the numerical hierarchy of each taxonomic rank. Black circles represent the consensus means, green circles the cataloged number of taxa, and the box at the species level indicate the 95% prediction interval around the predicted number of species. The predicted number of species was calculated from weighted averaging of the predictions of multiple models fitted to the consensus means of higher taxa. The consensus 95% confidence intervals from all models are presented in red (see Material and Methods). Figure 2. Validating estimations on the number of species. We compared the number of species estimated from the higher-taxon approach implement here to the known number of species in relatively well-studied taxonomic groups as derived from published sources [28]. We also used estimations from multimodel averaging from species accumulation curves for taxa with nearcomplete inventories. In total, we obtained data for 18 taxonomic groups for which estimates were available and for which our approach yielded an estimate. Vertical lines indicate the range of variation in the number of species from different sources. The dotted line indicates the 1:1 ratio. Figure 3. Frequency distribution of the number of subordinate taxa at different taxonomic levels. For display purposes we present only the data for Animalia; lines and test statistics are from a regression model fitted with a power function. Figure 1 Figure 2 Figure 3 Fig. S1. Completeness of the higher taxonomy of kingdoms of life on Earth. Columns 1 to 6 indicate the temporal accumulation of the number of taxa at each taxonomic rank (blue lines). The horizontal red lines indicate the consensus mean on the number of taxa (see Methods). The plots on the far right show the relationship between the number of taxa and the numerical rank (y-axis is in double log10). Vertical red areas indicate the 95% prediction interval on the number of species. Blue symbols indicate the currently cataloged number of taxa and red symbols the consensus mean. Where only blue symbols are shown the catalogued and consensus means overlapped. Note that predictions of the number of species are based on weighted averages of multiple models (see Methods). Cont. Fig. S1. Completeness of the higher taxonomy of kingdoms of life in the ocean. Fig. S2. Sensitivity analysis. One potential caveat in our approach is the degree to which estimations of the number of species may change if higher taxonomy changes. To test this effect we performed a sensitivity analysis in which the number of species was calculated after altering the number of higher taxa. We used Animalia as a test case. For each possible combination of taxonomic levels, we chose a random proportion of taxa from 10 to 100% of the current number of taxa and recalculated the number of species using the method described in this paper. The approach was repeated 1000 times and the average and 95% confidence limits of the simulations are shown as points and dark areas, respectively. Light gray lines and boxes indicate the currently estimated number of species and its 95% prediction interval, respectively. Our current estimation of the number of species appear remarkably robust to changes in higher taxonomy. As noted, results were only sensitive to changes in the number of phyla, and only when this number changed by over 50% relative to the number of currently described phyla. For all other taxonomic levels, expected changes led to predictions that remain within the confidence intervals of the currently estimated number of species.