Bulletin
Publishing Digital Floras and Faunas
P. Bryan Heidorn is assistant professor, GSLIS, University of Illinois, Urbana-Champaign, 501 E. Daniel St., Champaign, IL 61820; 217-644-7792; email@example.com
One of the primary challenges in creating digital libraries is to enhance the value of paper-based publications by providing digital access to the materials. Simple full-text searching is just a first step in this process. The time-tested internal structure of paper-oriented documents and multi-volume collections has evolved to address specific information needs, and these structures can be more fully exploited when the material is converted to digital form. Some structural aspects of paper-based publishing are helpful, if difficult, to bring to the electronic medium; others are artifacts of the limitations of paper. Frequently these aspects, such as document subsections, can be improved in digital format. In this paper we explore how information science and technology can be used in the process of species description to achieve order-of-magnitude increases in efficiency and access. The natural structure within the text, as well as the inherent information structure of the domain of flora and fauna (which cannot always be represented in the paper medium), can guide the design of new information systems to support electronic floras and faunas.
The word flora refers to both the plants that live in a particular region and a publication that describes those plants, just as fauna refers to both the animals of a region and a publication that describes them. There is a worldwide effort to create natural history publications to help document the known species in the face of major loss of habitat and extinctions. Some of the most visible projects include The Flora of North America, the Flora of China and the Flora of Australia. Jointly, the ultimate objective of these efforts is to produce a Flora of Earth. Similar efforts are underway for insects, but insect species far outnumber plant species. The huge potential size of floras and faunas is just one of many challenges to traditional methods for authoring, publishing and using them. Standing in a forest in the Smoky Mountains of Tennessee, how many of the species in front of you would you expect to find in your field guide? And what if you were in any tropical region on earth?
Examining the Problems
Many species travel with us around the world and become established where they never existed before, with far-reaching consequences for food supply and health. These range from gypsy moths to invasive plants and exotic diseases like West Nile virus and monkeypox. We have little idea what the balance of life is or what is necessary to maintain it. We do not even know the names of the creatures we depend on. Many people believe that Carolus Linnaeus or Darwin and his immediate followers completed much of the work of naming and describing species 300 years ago. These pillars of biological science made profound contributions to the foundations of taxonomy, but neither of them had the slightest clue as to the true breadth and diversity of life on earth. Approximately 1.5 million species have names. Estimates of the number of unnamed species vary, but 12 million is not unreasonable. Even for well-studied groups like plants, about 250,000 species are named and between 300,000 and 500,000 are believed to exist. It is difficult to tell how many plants are named since there is no complete list of all of the names, although there is a provisional checklist of plant names from the International Organization for Plant Information (IOPI) (plantnet.rbgsyd.gov.au/iopi/iopihome.html). However, this list contains only names, not descriptions.
Perhaps 1.4 million insects have been described, but estimates of the number of species that exist vary wildly, from 8 to 50 million (Erwin, Chapter 4, in Biodiversity II, National Academies Press, 1996). If we were to take the insects from one tree at the Los Amigos Research Station in Peru we might find several thousand species, many of which would be new to science, and no single person or publication could even tell us which of them have names or descriptions.
Even if we exclude microbes, the 10,000 or so taxonomists worldwide have a lot of fieldwork to do. And even if we had enough taxonomists, publishing their work is expensive: Scott Mori, in The Scale of Floristics, 2003, estimates the cost of producing a printed flora with 2,104 species to be $1,579,946 using current technology, or roughly $750 per species. Clearly, the order-of-magnitude improvement from technology hoped for by E. O. Wilson (see the Bulletin, October/November 2003) is sorely needed. As for that forest in Tennessee, the All Taxa Biodiversity Inventory estimates that there are 5,400 plant species and about 50,000 insect species in Great Smoky Mountains National Park. Since it is difficult to tell which species would be around you at a particular location, you would need a field guide with all of them, which is clearly not possible in a paper format.
Nonetheless, paper publishing and the methods around it have served the taxonomic community well since Greek herbalists began authoring descriptions of medicinal plants. To move to the electronic environment from paper we need to study the use and users of those publications.
The projects to create electronic floras and faunas complement other major international initiatives designed to better understand and manage the world's natural heritage. The principles behind the Convention on Biological Diversity (www.biodiv.org/doc/publications/guide.asp), the United Nations Commission on Sustainable Development (www.un.org/esa/sustdev/) and the developing Global Taxonomy Initiative (www.biodiv.org/programmes/cross-cutting/taxonomy/) frequently motivate these developments. An indication of the significance of this work is that 157 countries signed the Convention on Biological Diversity at the United Nations Conference on Environment and Development in Rio de Janeiro, June 3-14, 1992. In addition, many private initiatives such as NatureServe (www.natureserve.org) and the All Species Foundation (http://www.all-species.org/) need and use information technology to help solve this problem. NatureServe is a non-profit conservation organization that provides the scientific information and tools needed to help guide effective conservation action, and it is the major source of information about rare and endangered species. The All Species Foundation is a non-profit organization dedicated to the complete inventory of all species of life on Earth within the next 25 years. The information processing departments of these public and private enterprises are not large, and there is always a shortage of qualified workers with both biological and information technology skills and vision.
These efforts and others are motivated by the hope that electronic floras will be more readily accessible to more people (Morin et al., 1989). The volume of information in these publications has traditionally limited their distribution and use. For example, the Flora of North America, mentioned above, will eventually include 31 volumes and over 18,000 species. Six hundred authors may contribute to this project, and many other flora projects are of comparable size: The Flora of China is projected to include 25 volumes and 30,000 species and The Flora of Australia 59 volumes and 26,000 species. Information science needs to provide computer-supported collaborative environments to support these efforts. Such environments must coordinate, for example, the names of the species (as in the Integrated Taxonomic Information System), the vocabulary of characteristics used to describe them and the geographic information systems used to specify their range.
Characteristics of Floras and Faunas and Their Use
In large flora and fauna projects, the authors document the organisms' taxonomy, physical properties, distribution, economic value and other information. This information is used to facilitate species identification, taxonomy and ecological studies, as well as other functions (Morin et al., 1989). With the growth of the Internet it is natural that many of these projects are making their publications available over the World Wide Web as well as on paper to provide improved access. Published floras have evolved over the past several hundred years to meet particular needs and uses. It is these needs, and not the paper document structure, that must be supported in electronic publications. This is not to deny that the paper flora has evolved to meet the same needs. Paper floras are the current solution to people's information needs and information uses as constrained by the paper medium. Information scientists can use this history to inform design. The research question is to determine what new functionality is useful and can be economically provided in the electronic environment.
As information scientists, we need to understand the functions of the different constructs of these publications, such as order of the species, the index, the glossary and other tools. They all can be improved in the electronic realm. In the following paragraphs, we will look at just a few of the features of flora and fauna publishing that can be improved with carefully tailored information technology.
Structural Complexity. There are important characteristics of floras and faunas that differentiate them from discrete article publications such as journals and some Web pages. These differences provide unique opportunities and challenges for enhancing the publications for use in the electronic environment. They include the volume of information, homogeneity across like components and heterogeneity among content types (naming, morphology, distribution, etc.), internal publication structure of materials across volumes and interdependent structure between components (for example, Linnaean taxonomy).
The first obvious difference is the volume of encyclopedic publications. Major floras typically require the integration of information resources that, when in paper form, would exist in many separate volumes of a collection (see above). Species Plantarum: Flora of the World (www.anbg.gov.au/abrs/flora/spplant/spplant.htm) is an illustrative paper and electronic publication project. These huge publishing projects are not your trusted Peterson Guides to be slipped into your pocket before heading to the field. For users to find particular pieces of information, they must first identify the appropriate volume. Users must then find the pages with the desired information. The electronic environment provides the opportunity to search many volumes quickly for needed information, provided the material is structured better than paper pages simply placed on the Web. We might also create sub-collections of documents on demand, for example gathering together all insects in one family that are known to exist in one geographic region. We might print this on paper to take to the field or move a copy to our portable-computing device. Either approach, however, raises issues of copyright. In the current dominant publishing model, the treatments belong to the publisher, not the author.
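As a minimal sketch of what creating a sub-collection on demand could look like, the Python below filters a set of treatments by family and region. The record layout (the Treatment class and its field names), the function name and the sample taxa are all hypothetical, invented only for illustration; a real flora or fauna would need a much richer model of names and ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Treatment:
    """One species entry. Every field name here is a hypothetical choice."""
    scientific_name: str
    family: str
    regions: frozenset  # names of the regions where the species is recorded
    text: str


def build_subcollection(treatments, family, region):
    """Return the treatments of one family recorded from one region, sorted by name."""
    selected = [
        t for t in treatments
        if t.family == family and region in t.regions
    ]
    return sorted(selected, key=lambda t: t.scientific_name)


# Invented placeholder records, used only to exercise the function.
SAMPLE_TREATMENTS = [
    Treatment("Exemplum gamma", "Family X", frozenset({"Region 1"}), "..."),
    Treatment("Exemplum alpha", "Family X", frozenset({"Region 1", "Region 2"}), "..."),
    Treatment("Exemplum beta", "Family X", frozenset({"Region 2"}), "..."),
    Treatment("Alterum delta", "Family Y", frozenset({"Region 1"}), "..."),
]

if __name__ == "__main__":
    for treatment in build_subcollection(SAMPLE_TREATMENTS, "Family X", "Region 1"):
        print(treatment.scientific_name)
```

The same selection could feed a print layout or an export to a portable device; the copyright questions raised above apply to either output.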
In floras, the individual entries, called treatments, are relatively short, ranging from less than one to two printed pages. In addition, these collections tend to contain different genres of materials within them to serve interrelated information needs. A flora may include textual descriptions of species, genera, families or other taxonomic units, plant images, distribution maps, a glossary of definitions, a thesaurus (synonymy, use context, term relationships) and identification keys as well as other data. The materials in each volume are arranged to facilitate the expected use of the information or the internal structure of the information. For example, anyone who has looked at a flora or fauna knows that the vocabulary can be specialized and complex. Therefore, there is a need for a glossary to define the terms. When readers need definitions of words they open the glossary to the necessary page (based on alphabetic order), in the best case while the treatment or description is still open. Having multiple volumes open at one time is cumbersome in the library and impossible in the field. In the electronic environment we can hyperlink each word in a description to the definition and provide a pop-up window on demand.
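The glossary linking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the glossary entries are invented and simplified, glossary keys are assumed to be lowercase, and an HTML abbr element with a title attribute (which most browsers show as a tooltip) stands in for the pop-up window.

```python
import html
import re

# Hypothetical glossary; keys must be lowercase. Definitions are simplified.
GLOSSARY = {
    "lanceolate": "shaped like a lance head, longer than wide and tapering to a point",
    "petiole": "the stalk that attaches a leaf blade to the stem",
}


def link_glossary_terms(description, glossary):
    """Wrap each glossary term found in the description in an HTML <abbr> element."""
    if not glossary:
        return html.escape(description)
    # Try longer terms first so a multi-word term wins over a shorter one inside it.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    pieces = []
    last = 0
    for match in pattern.finditer(description):
        # Escape the plain text between matches so it is safe to embed in HTML.
        pieces.append(html.escape(description[last:match.start()]))
        definition = glossary[match.group(0).lower()]
        pieces.append(
            '<abbr title="{}">{}</abbr>'.format(
                html.escape(definition, quote=True),
                html.escape(match.group(0)),
            )
        )
        last = match.end()
    pieces.append(html.escape(description[last:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(link_glossary_terms("Leaves lanceolate; petiole short & hairy.", GLOSSARY))
```

Matching whole words from a flat word list is the simplest possible choice; it ignores plurals and other inflected forms, which a production glossary would have to handle.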
The object of most information retrieval systems is to return a preexisting document to a searcher. The structure of the underlying document is not considered to be part of the retrieval system. A few notable exceptions have attempted to provide searching within document subsections. But there has been little research on the kinds of tools that are required to give access to tightly coupled document collections, such as floras, faunas and other encyclopedic publications. This challenge awaits us.
Processes for Adding Value to Descriptions. As information scientists, we can also examine where the taxonomic descriptions that generally form the bulk of a flora come from. The authors use three information sources: their personal knowledge, previous taxonomic descriptions and the plants themselves, usually represented by museum specimens. Information technology can help with at least the last two. Authors write new plant descriptions for previously described species for two main reasons. One is that the descriptions are out of date. For example, new members of a genus might have been discovered and the old descriptions are no longer adequate to differentiate all of the members of the genus. The other is that the taxonomy may change and species may be moved to other groups. In either case, the majority of the original description remains correct. Scholarly effort must be expended to ensure this continued accuracy, and the publishing environment should speed up this process.
When writing descriptions, the botanist must gather "representative" members of each of the species, often from many museums, and then study them carefully to find shared characteristics within a species and characteristics that differentiate it from other species. For the most part, the specimens used to create descriptions are not currently tracked, making it difficult to verify the accuracy of the descriptions later. This information could be tracked if there were sufficient space in the publication, which is generally not an issue in electronic publishing.
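One way to picture such tracking is to store, with each description, references to the specimens that were examined. The Python sketch below assumes a hypothetical record layout (an institution code plus a catalog number); the class names, field names and sample values are invented for illustration and do not reflect any existing museum or flora system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpecimenRef:
    """Pointer to one museum specimen. Both fields are hypothetical identifiers."""
    institution: str
    catalog_number: str


@dataclass
class Description:
    """A taxonomic description together with the specimens it was based on."""
    taxon: str
    text: str
    specimens: list = field(default_factory=list)

    def cite(self, institution, catalog_number):
        """Record that a specimen was examined; repeated citations are stored once."""
        ref = SpecimenRef(institution, catalog_number)
        if ref not in self.specimens:
            self.specimens.append(ref)
        return ref


def descriptions_citing(descriptions, ref):
    """Return the taxa whose descriptions were based in part on the given specimen."""
    return [d.taxon for d in descriptions if ref in d.specimens]


if __name__ == "__main__":
    first = Description("Exemplum alpha", "...")
    second = Description("Exemplum beta", "...")
    shared = first.cite("MUSEUM-1", "000123")
    second.cite("MUSEUM-1", "000123")
    second.cite("MUSEUM-2", "A-77")
    print(descriptions_citing([first, second], shared))
```

With links in both directions, a later reviser could start from a description and pull the specimens behind it, or start from a re-identified specimen and find every description that may need checking.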
Better Access to Specimens. Over the past decade, many natural history museums have begun to create digital images of their specimens with the hope of making the specimens available to larger groups of researchers and other users more quickly. No studies have been done to determine the effectiveness of this approach. Specimen imaging is an active area of research for digital image quality, metadata, storage and search.
Identification Keys. The traditional tool that people use for identification is the binary key that many of us learned how to use in high school and then forgot. With a paper-based key, there is a branching set of decisions that you make about the characteristics of an unknown plant or animal. For example, if we were identifying a tree, we might need to decide first if the "leaves" of a tree were deciduous or evergreen. If it were evergreen, we would be told which line to go to next on the key. We might next need to decide how many needles are in each bundle. After answering several questions, we hopefully would end at an identification of the species.
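The branching structure of a paper key maps naturally onto a small lookup table. The Python sketch below is an illustration only: the couplets loosely follow the tree example above, the taxa are placeholders rather than real species, and the function names are hypothetical.

```python
# Each couplet offers two leads. A lead points either to another couplet
# (an int) or to a final identification (a str). Taxa are placeholders.
KEY = {
    1: [("Leaves evergreen", 2), ("Leaves deciduous", 3)],
    2: [("Needles in bundles", "Taxon A"), ("Needles borne singly", "Taxon B")],
    3: [("Leaves simple", "Taxon C"), ("Leaves compound", "Taxon D")],
}


def run_key(key, choose, start=1):
    """Walk the key from `start`.

    `choose` receives the two lead statements of a couplet and returns the
    index (0 or 1) of the one that fits. Returns the identification and the
    path of (couplet number, chosen statement) pairs that led to it.
    """
    path = []
    target = start
    while isinstance(target, int):
        leads = key[target]
        index = choose([statement for statement, _ in leads])
        statement, next_target = leads[index]
        path.append((target, statement))
        target = next_target
    return target, path


def chooser_from_observations(observed):
    """Build a `choose` callback from the set of statements the user judged true."""
    def choose(statements):
        for index, statement in enumerate(statements):
            if statement in observed:
                return index
        # A paper key gives no way forward when a character cannot be judged.
        raise ValueError("no lead matches the observations: %r" % (statements,))
    return choose


if __name__ == "__main__":
    choose = chooser_from_observations({"Leaves evergreen", "Needles borne singly"})
    taxon, path = run_key(KEY, choose)
    print(taxon, path)
```

Returning the path along with the answer is a design choice made here so that a user can at least review the decisions taken; the key itself still demands an answer at every couplet, in a fixed order.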
One of the main difficulties with this approach is that if we do not end up at the correct answer, the tree in the answer may look nothing like the tree we were trying to identify, and it is impossible to know at which decision point we made the mistake. In addition, if one of the characteristics is not available or we do not understand it, it is impossible to use the key. Computer-based interactive keys, or polyclaves, are intended to address these problems and others. With the better of these tools, the characteristics can be entered in any order and there is a level of error tolerance, so it is possible to make mistakes and still get an answer. Simple databases do not work for this application, which is therefore also an active area of research. The historically dominant interactive key was IntKey (biodiversity.uno.edu/delta/), based on the DELTA data format. Newer keys include LUCID (www.lucidcentral.com/), XID (www.xidservices.com/), Polyclave (prod.library.utoronto.ca/polyclave) and a non-traditional visualization tool called BIBE (www.biobrowser.org).
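The two properties described above, entry of characteristics in any order and tolerance of a mistaken observation, can be illustrated with a toy matching routine in Python. This is not how IntKey, LUCID or any of the other tools named here are implemented; the character names, states, placeholder taxa and the simple mismatch count are all assumptions made for the sketch.

```python
# Toy character matrix: taxon -> {character: state}. All values are invented.
TAXA = {
    "Taxon A": {"leaf_habit": "evergreen", "leaf_type": "needle", "needles_per_bundle": "5"},
    "Taxon B": {"leaf_habit": "evergreen", "leaf_type": "needle", "needles_per_bundle": "2"},
    "Taxon C": {"leaf_habit": "deciduous", "leaf_type": "broad"},
}


def rank_candidates(taxa, observations, max_mismatches=1):
    """Rank taxa by how many observed characters contradict their recorded states.

    Observations may cover any subset of characters, in any order. A character
    that is not recorded for a taxon is treated as unknown and never counts
    against it. Taxa with more than `max_mismatches` contradictions are dropped.
    Returns (taxon, mismatch count) pairs, best match first.
    """
    scored = []
    for name, characters in taxa.items():
        mismatches = sum(
            1
            for character, state in observations.items()
            if character in characters and characters[character] != state
        )
        if mismatches <= max_mismatches:
            scored.append((mismatches, name))
    scored.sort()
    return [(name, mismatches) for mismatches, name in scored]


if __name__ == "__main__":
    # The user has mistakenly recorded an evergreen tree as deciduous.
    observed = {
        "needles_per_bundle": "5",
        "leaf_habit": "deciduous",
        "leaf_type": "needle",
    }
    print(rank_candidates(TAXA, observed, max_mismatches=0))
    print(rank_candidates(TAXA, observed, max_mismatches=1))
```

With no tolerance the mistaken observation eliminates every taxon, which is how a plain database filter would behave; allowing one mismatch keeps plausible candidates in view so the user can compare them and revisit the doubtful character.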
The challenges and technologies listed above are but a few of those that exist in the new age of biology. We are changing our environment at an accelerated pace. In order to predict the effects of our actions we need to understand the interrelationships among living organisms. The first small step in this process is to catalog and describe all life on earth. Information technology can play a critical role in making understanding of the life process possible if appropriate research and development is focused on the problem.
For Further Reading
Australia's virtual herbarium. Available at www.anbg.gov.au/avh.html.
Erwin, T. (1997). Chapter 4: Biodiversity at its utmost: Tropical forest beetles. In Reaka-Kudla, M. L., Wilson, D. E. & Wilson, E. O. (Eds.), Biodiversity II: Understanding and Protecting our Biological Resources. Washington, D.C.: Joseph Henry Press.
Flora of Australia. (1982-2002). Canberra: Australian Government Publishing Service. Also published by CSIRO PUBLISHING/Australian Biological Resources Study. Available at www.publish.csiro.au/books/series.cfm?SID=6.
Flora of North America Editorial Committee (Eds.) (1993-). Flora of North America: North of Mexico. New York: Oxford University Press.
The Flora of China Project. Available at http://flora.huh.harvard.edu/china/
Heidorn, P. B. (2001). A tool for multipurpose use of online flora and fauna: The Biological Information Browsing Environment (BIBE). First Monday, 6 (2). Available at www.firstmonday.org/issues/issue6_2/heidorn/index.html.
Morin, N. R. et al. (Eds). (1989). Floristics for the 21st Century: Proceedings of the workshop sponsored by the American Society of Plant Taxonomists and the Flora of North America Project, 4-7 May 1988, Alexandria, VA. St. Louis, MO: Missouri Botanical Garden. (Monographs in systematic botany from the Missouri Botanical Garden, 28.)
Copyright © 2004, American Society for Information Science and Technology