Human classification alone cannot handle the enormous quantity of project data and requires the support of automated machine-based strategies. In collaborative annotation, humans and machines work together, merging human strengths in semantics and pattern recognition with machine strengths of scale and algorithmic power. Discovery informatics can be used to generate common data models, taxonomies and ontologies. A proposed project of massive scale, the Large Synoptic Survey Telescope (LSST), will systematically observe the southern sky over 10 years, collecting petabytes of data for analysis. The combined work of professional and citizen scientists will be needed to tag the discovered astronomical objects. The tag set will be generated through informatics and the collaborative annotation efforts of humans and machines. The LSST project will demonstrate the development and application of a classification scheme that supports search, curation and reuse of a digital repository.

Keywords: knowledge discovery, classification schemes, automatic taxonomy generation, machine-aided indexing, digital repositories, information reuse, astronomy



RDAP Review

Collaborative Annotation for Scientific Data Discovery and Reuse

by Kirk Borne

The enormous growth in scientific data repositories requires more meaningful indexing, classification and descriptive metadata in order to facilitate data discovery, reuse and understanding. Meaningful classification labels and metadata can be derived autonomously through machine intelligence or manually through human computation. Human computation is the application of human intelligence to solving problems that are either too complex or impossible for computers. For enormous data collections, a combination of machine and human computation approaches is required. Specifically, the assignment of meaningful tags (annotations) to each unique data granule is best achieved through collaborative participation of data providers, curators and end users to augment and validate the results derived from machine learning (data mining) classification algorithms. We see very successful implementations of this joint machine-human collaborative approach in citizen science projects such as Galaxy Zoo and the Zooniverse (http://zooniverse.org/).
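
By way of illustration, a minimal sketch of such a joint tagging step might blend a classifier's per-class probabilities with volunteer votes; the function, weighting scheme and class labels below are hypothetical and are not drawn from Galaxy Zoo or the Zooniverse.

```python
from collections import Counter


def consensus_label(machine_probs, human_votes, machine_weight=0.5):
    """Blend a classifier's class probabilities with human vote fractions
    and return the highest-scoring tag.

    machine_probs: e.g. {"spiral": 0.7, "elliptical": 0.3}
    human_votes:   e.g. ["spiral", "spiral", "elliptical"]
    """
    votes = Counter(human_votes)
    total = sum(votes.values()) or 1          # avoid division by zero
    classes = set(machine_probs) | set(votes)
    score = {c: machine_weight * machine_probs.get(c, 0.0)
                + (1 - machine_weight) * votes.get(c, 0) / total
             for c in classes}
    return max(score, key=score.get)


# Two of three volunteers agree with the classifier: "spiral" wins.
print(consensus_label({"spiral": 0.7, "elliptical": 0.3},
                      ["spiral", "spiral", "elliptical"]))
```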

In the current era of scientific information explosion, the big data avalanche is creating enormous challenges for the long-term curation of scientific data. In particular, the classic librarian tasks of classification and indexing become insurmountable. Automated machine-based approaches (such as data mining) can help, but these methods work well only when the classification and indexing algorithms have good training sets. What happens when the data includes anomalous patterns or features that are not represented in the training collection? In such cases, human-supported classification and labeling become essential – humans are very good at pattern discovery, detection and recognition. When the data volumes reach astronomical levels, it becomes particularly useful, productive and educational to crowdsource the labeling (annotation) effort. The new data objects (and their associated tags) then become new training examples, added to the data mining training sets, thereby improving the accuracy and completeness of the machine-based algorithms.
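
This feedback loop – route the low-confidence objects to human annotators, then fold the crowd's labels back into the training set – can be sketched briefly. The sketch below assumes a scikit-learn-style classifier purely for illustration; the article prescribes no particular toolkit, and every name here is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def route_and_retrain(model, X_train, y_train, X_new, ask_human, threshold=0.8):
    """Classify new data granules; send low-confidence ones to human
    annotators, then retrain on the augmented training set."""
    probs = model.predict_proba(X_new)
    uncertain = probs.max(axis=1) < threshold   # anomalies / hard cases
    human_labels = [ask_human(x) for x in X_new[uncertain]]
    # The crowd-labeled objects become new training examples.
    X_aug = np.vstack([X_train, X_new[uncertain]])
    y_aug = np.concatenate([y_train, human_labels])
    model.fit(X_aug, y_aug)
    return model, X_aug, y_aug


# Toy demo with synthetic features and a trivial stand-in "crowd":
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(50, 4)), rng.integers(0, 2, size=50)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
clf, X_tr, y_tr = route_and_retrain(clf, X_tr, y_tr,
                                    X_new=rng.normal(size=(10, 4)),
                                    ask_human=lambda x: 0)
```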

Collaborative annotation is humans and machines working together to produce the best possible classification label(s). It is a form of human computation [1]. Humans can see patterns and semantics (context, content and relationships) more quickly, accurately and meaningfully than machines. Human computation therefore applies naturally to the problem of annotating, labeling and classifying voluminous data streams.

The best annotation service in the world is useless if the tags (markup) are not scientifically meaningful – that is, if the tags do not enable data discovery, reuse and understanding. Therefore, it is incumbent upon science disciplines and research communities to develop common data models, taxonomies and ontologies. These concepts do not emerge spontaneously from large data collections; they require research and study. The data science of discovery informatics is focused on exactly these research problems: how to enable discovery, access, interoperability, integration, reuse and mining of large distributed data collections. Bioinformatics, geoinformatics and medical informatics are examples of well-established discovery informatics sub-disciplines within their larger scientific disciplines. A similar research domain is now emerging in the field of astronomy: astroinformatics, which targets the big data flood in astronomy [2]. New professional organizations within astronomy have been established in this area [3].
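
To make the role of a shared taxonomy concrete, a minimal sketch might model one as a simple child-to-parent mapping and walk each tag up to its broader terms – which is what lets a specific annotation surface under more general searches. The hierarchy shown is an illustrative placeholder, not any community standard.

```python
# child term -> parent term; root terms have parent None.
# These class names are illustrative placeholders only.
TAXONOMY = {
    "galaxy": None,
    "spiral galaxy": "galaxy",
    "barred spiral galaxy": "spiral galaxy",
    "star": None,
    "variable star": "star",
    "cepheid": "variable star",
}


def lineage(term):
    """Return the term and all of its broader ancestors, most specific
    first, so a granule tagged 'barred spiral galaxy' also surfaces in
    searches for 'spiral galaxy' and 'galaxy'."""
    chain = []
    while term is not None:
        chain.append(term)
        term = TAXONOMY[term]
    return chain


print(lineage("barred spiral galaxy"))
# ['barred spiral galaxy', 'spiral galaxy', 'galaxy']
```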

In astronomy, the proposed LSST (Large Synoptic Survey Telescope, www.lsst.org) project would carry out a systematic 10-year observing program, imaging the entire southern sky repeatedly throughout the night, every night. The resulting data repository would include over 100 petabytes in the final image archive and over 20 petabytes in the final scientific database of extracted science measurements, parameters and metadata. The discovery potential of this data collection would be enormous. Realizing its long-term value through careful data management and curation – and its maximum scientific return – would require the participation of scientists and citizen scientists, as well as science educators and their students, in a collaborative knowledge markup (annotation and tagging) environment. To meet this need, we envision a collaborative tagging system called AstroDAS (Astronomy Distributed Annotation System). AstroDAS is similar to existing science knowledge bases, such as BioDAS (Biology Distributed Annotation System, www.biodas.org). It is distributed in the sense that the source data, the metadata and the users are all distributed. “Annotation” includes tagging both individual data granules and subsets of the data. It is a “system” in the sense that it is based on a formal, explicit, unified schema for the annotation database, applicable to all astronomy data collections, not only LSST. The DAS provides a distributed system through which scientists (professional or citizen) anywhere can annotate individual astronomical objects with labels (known classes), attributes (known features) and new characterizations (newly discovered patterns and behaviors). These annotations can then be applied to other astronomical data and metadata within distributed digital data collections. The annotations provide curation, provenance and semantic (scientifically meaningful) metadata about the data source and the data object being studied.
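
What might such a formal, unified annotation schema look like? The sketch below shows one hypothetical shape for an annotation table, capturing the three kinds of annotations named above (labels, attributes and characterizations) together with provenance; none of these table or column names come from an actual AstroDAS implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (
    annotation_id INTEGER PRIMARY KEY,
    object_id     TEXT NOT NULL,  -- the astronomical object or granule
    collection    TEXT NOT NULL,  -- source archive (LSST or any other)
    kind          TEXT NOT NULL CHECK (kind IN
                      ('label', 'attribute', 'characterization')),
    tag           TEXT NOT NULL,  -- taxonomy term or free-form pattern
    annotator     TEXT NOT NULL,  -- professional or citizen scientist
    created_utc   TEXT NOT NULL   -- provenance: when the tag was made
);
""")

# A volunteer tags a (hypothetical) object with a known class label:
conn.execute(
    "INSERT INTO annotation VALUES (NULL, ?, ?, ?, ?, ?, ?)",
    ("obj-0001", "LSST", "label", "spiral galaxy",
     "volunteer:jdoe", "2013-04-01T00:00:00Z"),
)
```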

The design and specification of a unique, meaningful, searchable and scientifically impactful set of tags can be achieved through collaborative (human-plus-machine) annotation efforts and through discovery informatics research. These steps will produce a searchable classification and indexing scheme that supports the curation, discovery, reuse, interoperability, integration and understanding of digital repositories. These efforts will assist scientific data librarians in reaching the holy grail of semantic annotation of data, information and knowledge.

Resources Mentioned in This Article 
[1] von Ahn, L. (2009). Human computation. Proceedings of the 46th ACM/IEEE Design Automation Conference, 418-419.

[2] Borne, K. D. (2010). Astroinformatics: Data-oriented astronomy research and education. Earth Science Informatics, 3, 5-17.

[3] Feigelson, E. D., Ivezić, Z., Hilbe, J., & Borne, K. D. (2013). New organizations to support astroinformatics and astrostatistics. arXiv:1301.3069v1. Retrieved February 25, 2013 from http://arxiv.org/pdf/1301.3069.pdf 
 


Kirk Borne is professor of astrophysics and computational science at George Mason University in Fairfax, Virginia. He can be reached at kborne<at>gmu.edu.