BULLETIN
Biological Information Standards
Chulin Meng is a doctoral student in the Graduate School of Library and Information Science at the University of Illinois, Urbana-Champaign, 501 E. Daniel St., Champaign, IL 61820. He may be reached at 217-333-6298 or by email at email@example.com.
The revolutions caused by advanced computing power, advanced informatics and the Internet have changed the way scientific data are stored, located and disseminated. Today almost every scientist uses the Internet to share data, sometimes just with a close colleague, other times through large-scale databases. The volume, complexity and heterogeneity of biological data make data sharing a daunting task. Standards are clearly required to allow smooth and coherent data sharing.
Biological informatics has already established some standards for sharing data, from nomenclature to exchange protocols. In this paper we give a brief introduction to those data standards. Library and information science has a long tradition of developing standards for sharing bibliographic data. Traditionally libraries have used technology to facilitate cooperation and collaboration in the creation of catalogue records for bibliographic materials. Library and information science has developed a set of well-established standards for sharing data, such as the Anglo-American Cataloguing Rules, the MARC format, Dublin Core and Z39.50. The expertise resulting from this experience could help to further improve the standards for biological data.
Natural history museums are the major holding agencies for biodiversity data. Specimen data in museums are often maintained in catalogs similar to the bibliographic catalogues in libraries. Although the nature of books in a library and specimens in a museum differ, the techniques for managing bibliographic data and specimen data are similar. The biological science community has conducted research into adopting standards developed by information scientists. For example, the Consortium for the Computer Interchange of Museum Information (CIMI) published "CIMI's Guide to Best Practice: The Dublin Core" as a guide to using the Dublin Core for sharing data among museum collections. We can expect that information scientists, who have developed the standards for sharing bibliographic data, could play an important role in sharing biological information.
Biological Data Standards
Stanley Blum (see For Further Reading) roughly categorizes biological data standards into three types: vocabulary or authority standards, which specify sets of values to make the data in different systems directly comparable; semantic standards, which specify the meanings of data elements or structures; and data exchange standards, which specify both syntactic and semantic components of a data stream, enabling data to be shared between systems.
A vocabulary standard specifies a set of values that are mutually understood among a group of people. That is, it specifies the valid values of an attribute. Most vocabulary standards also include supplementary information, such as an abbreviation, code or definition for each term or value in the set. Examples of vocabulary standards may include lists of scientific names, author names and abbreviations.
Vocabulary standards enable users of two or more independently developed information systems to apply the values uniformly as descriptors of related data objects. If the values are applied consistently, the participating databases can inter-operate at the value level, making the data in different systems directly comparable.
An example is the Integrated Taxonomic Information System (ITIS). ITIS is a database of North American species names and their hierarchical classification, which has been developed by an international partnership of agencies and taxonomic specialists. For each scientific name, ITIS includes the authority (author and date), taxonomic rank, associated synonyms and vernacular names where available, a unique taxonomic serial number, data source information (publications, experts, etc.) and data quality indicators. An even more ambitious project is Species 2000, established by the International Union of Biological Sciences (IUBS), in cooperation with the Committee on Data for Science and Technology (CODATA) and the International Union of Microbiological Societies (IUMS) in September 1994. Species 2000 has the objective of enumerating all known species of organisms on Earth by forming a federation of interoperable species databases, including ITIS. Thus, Species 2000 will deliver a standard set of data for every known species.
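To make value-level interoperability concrete, the following is a minimal sketch of how an authority list lets two independently built catalogs compare their records. The lookup table, the serial number and the function names are all hypothetical illustrations; they are not ITIS data or an ITIS interface.

```python
# Hypothetical authority list: each known name (accepted, synonym or
# vernacular) points to one accepted name and a made-up serial number.
AUTHORITY = {
    "puma concolor": {"accepted": "Puma concolor", "serial": 1001},
    "felis concolor": {"accepted": "Puma concolor", "serial": 1001},  # synonym
    "cougar": {"accepted": "Puma concolor", "serial": 1001},  # vernacular name
}


def normalize(name):
    """Return (accepted name, serial) for a raw name, or None if unknown."""
    key = " ".join(name.split()).lower()  # collapse whitespace, ignore case
    entry = AUTHORITY.get(key)
    if entry is None:
        return None
    return entry["accepted"], entry["serial"]


def same_taxon(name_a, name_b):
    """True when both names resolve to the same entry in the authority list."""
    a, b = normalize(name_a), normalize(name_b)
    return a is not None and a == b


# Two museums record the same animal under different strings.
print(same_taxon("Felis  concolor", "PUMA CONCOLOR"))
```

Because both catalogs resolve their local strings against the same list, the comparison happens on the shared accepted value rather than on whatever each system happened to store.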
A semantic data standard specifies definitions, context and assembly rules that enable combinations of simple data values (character strings, numbers, etc.) to convey information (meaningful messages) between people or information systems. It specifies the meanings of data elements or more complex data structures. Such a standard can be expressed as a data dictionary, a conceptual schema or a natural language description of data structure and integrity rules. Darwin Core is an example of a semantic standard and one of a series of tools developed for the Species Analyst (a research project developing standards and software tools for access to the world's natural history collection and observation databases). It is a profile describing the minimum set of rules for search and retrieval of natural history collections and observation databases.
Other examples of semantic data standards include Access to Biological Collections Data (ABCD) and Structure of Descriptive Data (SDD), both being developed by subgroups of the Taxonomic Databases Working Group (TDWG). The Darwin Core forms a subset of the ABCD standard. The ABCD Task Group is developing an XML schema as a data interchange standard for item-level data. An item may be a specimen, culture, living organism or observation of an organism. The schema provides a comprehensive basic structure and is consequently quite large and highly structured. It is intended as a result schema, that is, the standard format in which collection databases return the results of a distributed query. The requests themselves will use a non-hierarchical access schema with a reduced number of elements (such as the Darwin Core). The current draft of the profile, labeled version 1.04, can be found at the schema subgroup's website at www.bgbm.org/TDWG/CODATA/Schema/default.htm.
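A semantic standard expressed as a data dictionary can be sketched in a few lines. The element names and rules below are hypothetical and chosen only for illustration; they should not be read as the actual Darwin Core element list.

```python
# Hypothetical data dictionary: element name -> expected type and whether
# the element is required. These are illustrative, not Darwin Core itself.
DATA_DICTIONARY = {
    "ScientificName": {"type": str, "required": True},
    "CatalogNumber": {"type": str, "required": True},
    "YearCollected": {"type": int, "required": False},
    "Latitude": {"type": float, "required": False},
}


def validate(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, rule in DATA_DICTIONARY.items():
        if name not in record:
            if rule["required"]:
                problems.append(f"missing required element: {name}")
            continue
        if not isinstance(record[name], rule["type"]):
            problems.append(f"{name} should be {rule['type'].__name__}")
    for name in record:
        if name not in DATA_DICTIONARY:
            problems.append(f"unknown element: {name}")
    return problems


print(validate({"ScientificName": "Puma concolor", "YearCollected": "1990"}))
```

The dictionary carries the agreement: any system that reads it knows which elements exist, what kind of value each holds and which ones a conforming record must supply.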
ABCD structures the interchange of information about the holdings of museums or observations. It does not, however, address the issue of descriptions of what is being held or observed. DELTA and SDD are designed to handle the morphological and other descriptive information about a species.
One more example is the Ecological Metadata Language (EML), a metadata standard developed to represent ecological rather than taxonomic data. The purpose of EML is to provide the ecological community with an extensible and flexible metadata standard for use in data analysis and archiving. This will support automated machine processing, searching and retrieval. EML is implemented as a series of XML document types that can be used in a modular and extensible manner. Each EML module is designed to describe one logical part of the total metadata that should be included with any ecological data set.
A data exchange standard typically includes both semantic and syntactic components. The semantic component provides the meanings for a data structure, which saves users from having to negotiate the semantics with every exchange event. The syntactic component specifies the data structure itself, as well as how it should be encoded in a data stream or file. Any exchange minimally involves two software systems: the source and the target. Export and import routines are required at both ends of the exchange to create and read the data file. The most commonly supported file formats, such as fixed-length and delimited, have their limitations. A tremendous amount of effort is now being invested in developing XML-related standards and in integrating this markup language into network-based interoperability tools.
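As an illustration of paired export and import routines, here is a minimal sketch using Python's standard XML library. The element names (`records`, `record`, `name`, `commonName`) are assumptions made for this example, not part of any published standard. The repeated `commonName` element hints at one reason for the interest in XML: a flat delimited row has no natural place for a value that occurs a variable number of times.

```python
import xml.etree.ElementTree as ET


def export_records(records):
    """Source side: encode a list of record dicts as an XML string."""
    root = ET.Element("records")
    for rec in records:
        node = ET.SubElement(root, "record")
        ET.SubElement(node, "name").text = rec["name"]
        # A repeating element holds any number of vernacular names.
        for common in rec["common_names"]:
            ET.SubElement(node, "commonName").text = common
    return ET.tostring(root, encoding="unicode")


def import_records(xml_text):
    """Target side: decode the XML string back into record dicts."""
    root = ET.fromstring(xml_text)
    return [
        {
            "name": node.findtext("name"),
            "common_names": [c.text for c in node.findall("commonName")],
        }
        for node in root.findall("record")
    ]


source = [{"name": "Puma concolor", "common_names": ["cougar", "mountain lion"]}]
print(import_records(export_records(source)) == source)
```

The exchange works only because both routines agree on the same structure in advance, which is precisely the agreement a data exchange standard records.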
The Herbarium Information Standards and Protocols for Interchange of Data (HISPID) is a standard format for the interchange of electronic herbarium specimen information. Published in 1993, HISPID is concerned primarily with data interchange standards. The interchange of taxonomic, nomenclature, bibliographic, rare and endangered plant conservation, and other related information is not dealt with in this standard, unless it specifically refers to a particular accession.
Another example requiring attention is Z39.50, a computer protocol that defines a standard way for two computers to communicate for the purpose of information retrieval. Z39.50 is an ANSI standard and has also been adopted as an ISO standard (ISO 23950). In practice, Z39.50 supports information retrieval in a distributed client/server environment where a computer operating as a client submits a search request (i.e., a query) to another computer acting as an information server. Software on the server performs a search on one or more databases and creates a result set of records that meet the criteria of the search request. The server returns records from the result set to the client for processing. The power of Z39.50 is that it separates the user interface on the client side from the information servers, search engines and databases. Z39.50 provides a consistent view of information from a wide variety of sources, and it offers client implementers the capability to integrate information from a range of databases and servers. The biological science community has adopted the Z39.50 standard. For example, the Consortium for the Computer Interchange of Museum Information (CIMI) has implemented the Z39.50 specifications to support the search and retrieval of museum information among multiple museums. The Z39.50 Biology Implementers Group (ZBIG) developed a Z39.50 profile that provides a common language for data exchange by sending a single query to multiple biological collections.
Distributed Generic Information Retrieval (DiGIR) is a commonly accepted protocol for exchanging biological data. DiGIR provides the infrastructure for performing distributed information retrieval. The purpose of DiGIR is to provide a single point of access to distributed information resources, so that the location and technical characteristics of the native resources are transparent to users. As a client/server protocol, DiGIR uses HTTP as the transport mechanism and XML for encoding messages sent between client and server. It was originally developed to replace the Z39.50 protocol used in the Species Analyst project, but is also designed to work with any type of information, not just natural history collections. A major contributor to DiGIR is the Mammal Networked Information System (MaNIS) project (http://elib.cs.berkeley.edu/manis/).
Although data sharing has been adopted in areas such as biodiversity and genomics, the issue remains controversial in neuroscience. There are negative reactions to sharing neuroscience data, such as the concern that primary data are too complex for anyone other than their producer to understand, or that someone else might find something new in data that the producer worked hard to collect. Neuroscientists are now working on these sociological issues of sharing data.
However, there is a significant and growing need among neuroscientists to exchange experimental data. Several efforts have been made to facilitate the sharing of primary data in the neuroscience community. In 2000, considerable debate emerged over the new policy of the Journal of Cognitive Neuroscience, which required that the primary data of papers published in the journal be made freely available. A new database has also been established specifically for the storage and sharing of functional magnetic resonance imaging (fMRI) data from cognitive studies. To share data effectively, it will be necessary to create sharing rules and guidelines. The involvement of the neuroscience and computer science communities is essential to creating these new rules for data sharing. As a component of the Human Brain Project, Gardner proposed the Common Data Model (CDM) as a framework for federation of a broad spectrum of disparate neuroscience information resources. However, universal data sharing has not been fully adopted in the field of neuroscience, and there are no well-established and commonly accepted standards for sharing neuroscience data.
Although the complexity and heterogeneity of biological data make the establishment of widespread collaboration a daunting task, the advantages of sharing data would be enormous. As more information scientists join the bioinformatics field, bringing their expertise and experience in developing standards for sharing bibliographic data, the biological community stands to benefit greatly from sharing data through well-established data standards.
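The request-and-result-set pattern described above can be modeled with a toy, in-process sketch. Nothing here implements the Z39.50 wire protocol: the `ToyServer` class, its `search` and `present` methods, and the sample records are hypothetical stand-ins that only mirror the division of labor between client and server.

```python
class ToyServer:
    """Stands in for an information server holding one collection database."""

    def __init__(self, records):
        self._records = records
        self._result_sets = {}  # result sets stay on the server, keyed by name

    def search(self, set_name, term):
        """Run a query, keep the matching records as a result set, return the hit count."""
        hits = [r for r in self._records if term.lower() in r["name"].lower()]
        self._result_sets[set_name] = hits
        return len(hits)

    def present(self, set_name, start, count):
        """Return `count` records from a stored result set, starting at 1-based `start`."""
        hits = self._result_sets[set_name]
        return hits[start - 1 : start - 1 + count]


def client_search(servers, term):
    """Client side: send one query to every server and merge what comes back."""
    merged = []
    for label, server in servers.items():
        hit_count = server.search("default", term)
        for rec in server.present("default", 1, hit_count):
            merged.append({"source": label, **rec})
    return merged


servers = {
    "museum_a": ToyServer([{"name": "Puma concolor"}, {"name": "Lynx rufus"}]),
    "museum_b": ToyServer([{"name": "Puma concolor couguar"}]),
}
print(client_search(servers, "puma"))
```

The client never touches a server's internal storage; it sees only the query interface and the records returned, which is what allows one user interface to sit in front of many different databases.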
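The idea of XML-encoded request and response messages can be sketched as follows. The message layout (`request`, `filter`, `equals`, `response`, `record`) is invented for this example and is not the actual DiGIR schema. In a real deployment the strings would travel over HTTP; here the provider is called directly so the sketch stays self-contained.

```python
import xml.etree.ElementTree as ET


def build_request(field, value):
    """Client side: encode a simple equality query as an XML message."""
    req = ET.Element("request")
    flt = ET.SubElement(req, "filter")
    ET.SubElement(flt, "equals", {"field": field}).text = value
    return ET.tostring(req, encoding="unicode")


def provider_handle(request_xml, records):
    """Provider side: decode the request, filter local records, encode a response."""
    cond = ET.fromstring(request_xml).find("filter/equals")
    field, value = cond.get("field"), cond.text
    resp = ET.Element("response")
    for rec in records:
        if rec.get(field) == value:
            node = ET.SubElement(resp, "record")
            for key, val in rec.items():
                ET.SubElement(node, key).text = val
    return ET.tostring(resp, encoding="unicode")


def parse_response(response_xml):
    """Client side: decode the response into a list of record dicts."""
    return [{child.tag: child.text for child in node} for node in ET.fromstring(response_xml)]


local_records = [
    {"Genus": "Puma", "CatalogNumber": "A-1"},
    {"Genus": "Lynx", "CatalogNumber": "A-2"},
]
reply = provider_handle(build_request("Genus", "Puma"), local_records)
print(parse_response(reply))
```

Because the client sees only the messages, a provider can keep its records in any native system as long as it can translate to and from the agreed XML.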
For Further Reading
Blum, S. A. (2000). Call to revise the TDWG standards development process. Available at www.tdwg.org/process/tdwg99_blum.html.
DiGIR website. Available at http://sourceforge.net/projects/digir
Ecological Metadata Language (EML). Available at http://knb.ecoinformatics.org/software/eml
Gardner, D. et al. (2001). Common data model for neuroscience data and data model exchange. Journal of the American Medical Informatics Association, 8(1), 17-33. Abstract available online without subscription at www.jamia.org/content/vol8/issue1/index.shtml.
Heidorn, P. B. (2001). Building a global biology digital library: Progress toward taxonomic data standards: The 2001 Taxonomic Data Working Group. Presented to the Illinois Natural History society, December 7, 2001. (Slides). Available at www.isrl.uiuc.edu/~pheidorn/papers/TDWG2001INHS.PDF
Hobern, D. (2002). Integrating biodiversity data standards and interoperability. Available at www.cria.org.br/eventos/tdbi/bis/presentations/bis_dhobern.ppt
ITIS website. Available at http://www.itis.usda.gov/
International Working Group on Taxonomic Databases website. Available at www.tdwg.org
Koslow, S.H. (2002). Sharing primary data: A threat or asset to discovery? Nature Reviews Neuroscience, 3 (4), 311-313.
Species 2000. Available at www.sp2000.org/
Copyright © 2004, American Society for Information Science and Technology