The astronomy community has been a leader in efforts to preserve and share research data to facilitate ongoing discoveries. Since some efforts have led to isolated and abandoned websites, the need to deposit data in specifically designed and maintained repositories is now recognized. These repositories may apply persistent data identifiers to data products including large and small datasets, software and raw data throughout the research cycle. Researchers should be encouraged to deposit and share data, but guidelines and practices are inconsistent and publishers’ policies vary. Agreement among stakeholders must be found on clearly defined standards for data citation and linking through publications. Establishing and complying with citation standards for all related data products will promote data discovery and reuse.
research data sets
Unlocking and Sharing Data in Astronomy
by Edwin Henneken
Sharing information is fundamental to furthering science and ensuring that the entire research life cycle is properly captured and described. Data products are but one component of this colorful information tapestry, and nowadays the term “data products” encompasses much more than it did a decade ago. It should refer to all products related to the research life cycle. For example, software used to produce or process datasets is just as important as the dataset itself. There is general agreement that curating and preserving data products, and making them citable, are worthwhile efforts, as is linking data products with publications, in order to capture the full life cycle and to improve discoverability.
Astronomy has a long tradition of sharing data. While the work is still far from done, we have a significant history to learn from, reflected in the dedicated archives and curation initiatives that have been around for decades. For example, in 1972 the Centre de données astronomiques de Strasbourg (CDS) (http://cdsweb.u-strasbg.fr) was established in France to collect and distribute astronomical information. CDS maintains a number of important services for the astronomy community, including SIMBAD (http://simbad.u-strasbg.fr/simbad/), an astronomical database of objects discussed in the scholarly literature, as well as the data catalog service VizieR (http://vizier.u-strasbg.fr/viz-bin/VizieR). In 1988 NASA began the Astrophysics Data System (ADS) (http://ads.harvard.edu), a networked, distributed system for accessing and managing NASA astrophysics data holdings. Initially established as an electronic library, ADS evolved into an index of astronomical research papers, and in 1993 it was connected with the SIMBAD database. Through cooperation with data archives, ADS supports discoverability of data products through links in publications.
Is the research community willing to participate in preserving data? Most definitely. Many astronomers establish websites describing and providing access to data products. Though commendable, this approach lacks persistence. Many links to such websites are now dead because the site was moved or discontinued. A recent study by Pepe et al. found that, while the number of links in papers published by the American Astronomical Society (AAS) rose dramatically from 1997 to 2005, by 2011 44% of links published a decade earlier were broken. The study also suggests that links to data on personal websites become unreachable faster than links to datasets on institutional sites. Therefore, an essential step in the data preservation process is to convince people to invest time and effort in depositing their data in repositories specifically designed for data preservation (like the Dataverse Network – http://thedata.org; TheAstroData – http://thedata.harvard.edu/dvn/dataverses/cfa; and Zenodo – http://zenodo.org).
Many data preservation repositories assign persistent data identifiers (PDIDs) to datasets to support access. The PDID is often a DOI, but it could also be something like a dataset identifier. Agreeing on having a PDID is relatively easy; agreeing on what should get a PDID is an entirely different matter. At one extreme, to accurately represent the research life cycle in its most granular form, each data product, including all versions, should receive a PDID, but this level of identification is highly labor-intensive and costly. At the other extreme, only data products described in scholarly publications would be preserved and receive a PDID. Hybrid approaches to assigning PDIDs are also possible, and this discussion is still ongoing in the astronomy community. Input from the digital library community will be valuable here, as their experience can help shape preservation policies and data citation recommendations.
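To make the notion of a persistent identifier a little more concrete, here is a minimal Python sketch that assumes the PDID is a DOI. It reduces the different ways a DOI is commonly written to one bare form and builds a link through the public doi.org resolver. The function names and the loose pattern check are illustrative assumptions, not part of any repository's software, and the example DOI is the one given for the Pepe et al. article in the resource list.

```python
import re

# Loose shape check (an assumption for illustration, not a full validator):
# "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is prefixed in running text or in links.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return the bare, lowercased DOI (DOIs are case-insensitive)."""
    doi = raw.strip()
    for prefix in PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    doi = doi.strip().lower()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return doi


def resolver_url(raw: str) -> str:
    """Build a persistent link through the doi.org resolver."""
    return "https://doi.org/" + normalize_doi(raw)


print(resolver_url("doi:10.1371/journal.pone.0104798"))
```

Whichever granularity a community settles on, a normalization step of this kind lets an index treat the same identifier, written in different ways, as one data product.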
While the astronomy community has been a pioneer in preserving major data collections, smaller, derived datasets are less likely to be preserved or shared. Repositories like TheAstroData and Zenodo are available for these smaller sets, but whether researchers are educated about the advantages of depositing data and are encouraged to do so depends on local initiatives. This need for education is especially true for supplementary material and non-tabular data (like raw data). Publishers play a major role in this respect. Within their commercial boundaries, they have shown a willingness to innovate and participate in initiatives from the community. If a publisher requires data products used in publications to be available in a persistent manner (either by hosting the data themselves or by persistent links to repositories), authors will find a way to meet those requirements.
The major astronomy journals all have different policies. Astronomy & Astrophysics states in its instructions for authors, “It is mandatory for A&A authors to publish the data that are presented and discussed in articles and needed to reproduce the results.” Monthly Notices of the Royal Astronomical Society is a bit vaguer: “Authors are particularly encouraged to make catalogues and databases available, so readers may reproduce their results or use them for future studies,” leaving the initiative up to the author. The journals of the AAS have a “Data Behind the Figure” option for smaller datasets in common formats, recognizing that this availability can increase the long-term citation of papers.
Formalizing data (and software) citation with an eye towards ease of use supports giving credit where credit is due, enhances discoverability of data products and facilitates the compilation of metrics for data products (be it usage or citation based). Past studies found that publications linked to data products received higher citation rates [5, 6], an additional incentive for authors. However, while we agree citation is a worthwhile goal, guidelines and practices still vary widely. Sometimes data products are mentioned in the article body or bibliography. Some publishers offer guidelines, but they differ from publisher to publisher. In rare cases, publishers embed PDIDs for datasets in the article metadata, allowing services like the ADS to locate them and create appropriate links. Another possibility is to maintain a separate data bibliography and service, similar to SIMBAD, which tags publications with data products cited therein. Another useful model is the Astrophysics Source Code Library (ASCL), which maintains an online registry of scientist-written software used in astronomy research. The ADS cooperates closely with the ASCL, indexing their records and helping link software and publications.
When data and software are linked to relevant publications and data product and software records are indexed, metrics are much easier to compile. Such metrics are critical for evaluating an instrument, project, mission or even an entire research field, but it takes a community effort for data (and software) citation to work properly. Only when the community agrees upon well-defined standards and on how to establish a registry will data citation become as transparent as it is for publications. Publishers need to be involved, as well, to incorporate data citation standards into their publication process.
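As a rough illustration of why indexed links make metrics easier to compile, the Python sketch below aggregates paper-level citation counts per data product once paper-to-product links exist. The identifiers and counts are made up for the example, and the function name is hypothetical; it does not describe how the ADS or any other service computes its metrics.

```python
from collections import defaultdict


def citations_per_product(links, citation_counts):
    """Aggregate paper-level citations for each linked data product.

    links: iterable of (paper_id, product_id) pairs
    citation_counts: dict mapping paper_id to its citation count
    """
    papers_by_product = defaultdict(set)
    for paper_id, product_id in links:
        # A set ignores duplicate links between the same paper and product.
        papers_by_product[product_id].add(paper_id)

    metrics = {}
    for product_id, papers in papers_by_product.items():
        metrics[product_id] = {
            "papers": len(papers),
            "citations": sum(citation_counts.get(p, 0) for p in papers),
        }
    return metrics


# Made-up identifiers and counts, for illustration only.
links = [
    ("paper-A", "dataset-1"),
    ("paper-B", "dataset-1"),
    ("paper-B", "software-1"),
    ("paper-A", "dataset-1"),  # duplicate link, counted once
]
citation_counts = {"paper-A": 12, "paper-B": 3}

metrics = citations_per_product(links, citation_counts)
for product_id, values in sorted(metrics.items()):
    print(product_id, values)
```

The aggregation itself is trivial; the hard part, as the text argues, is agreeing on standards so that the links are recorded consistently in the first place.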
Data product discovery is typically accomplished in one of two ways: by searching directly in a data repository or by following contextual relationships (like article-data interlinking) in services like ADS. To find data products that have never been used or described in publications, users would go to a data repository (assuming that such data products have been deposited there). The ADS is a powerful environment for data discovery, providing access to datasets through the publication to which those data have been linked. When data product records are indexed, users may find other data through similar articles that have been linked to data products. This interlinking is possible through collaboration between ADS and services like the NASA/IPAC Extragalactic Database (NED) and those offered by CDS. In ADS 2.0 (http://adslabs.org), faceted filtering has made data discovery even easier. Users start by searching for a term or phrase and then use the facets to drill down and filter results. In this version, the ADS offers a facet dedicated to data products. This environment also allows one to study the more sociological and informetric aspects of data usage in publications, for example, by combining the data facet with filters on affiliation data.
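The search-then-filter pattern described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the records, field names and helper functions are invented for illustration and do not reflect the ADS implementation or its API.

```python
from collections import Counter

# Toy publication records with made-up values; "data" lists the archives
# whose data products are linked to each paper.
RECORDS = [
    {"title": "Survey of nearby galaxies", "year": 2012, "data": ["CDS"]},
    {"title": "Galaxies at high redshift", "year": 2013, "data": ["NED", "CDS"]},
    {"title": "Stellar spectra atlas", "year": 2013, "data": []},
]


def as_values(value):
    """Treat single values and lists uniformly as a list of facet values."""
    return value if isinstance(value, list) else [value]


def search(records, term):
    """Step 1: a plain term search over titles."""
    term = term.lower()
    return [r for r in records if term in r["title"].lower()]


def facet_counts(records, field):
    """Step 2: count how many hits carry each value of a facet field."""
    counts = Counter()
    for record in records:
        counts.update(as_values(record[field]))
    return counts


def apply_facet(records, field, wanted):
    """Step 3: drill down to the hits that carry the chosen facet value."""
    return [r for r in records if wanted in as_values(r[field])]


hits = search(RECORDS, "galaxies")
data_facet = facet_counts(hits, "data")
narrowed = apply_facet(hits, "data", "NED")

print(len(hits), dict(data_facet), [r["title"] for r in narrowed])
```

The facet counts are computed on the current result set rather than on the whole index, which is what lets a user see at each step how many results a further filter would leave.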
A publication based on a dataset is just one expression of the potential in that dataset. The backgrounds and interests of the researchers will influence which representation of that data is selected. But there are many different representations, and the ability to discover and access data products fosters their reuse for different purposes as well as their combination in unanticipated ways. Astronomy has made great progress towards unlocking data products, but we still have quite a journey ahead. With upcoming projects like the Large Synoptic Survey Telescope, which will produce 1.28 petabytes of uncompressed data every year, the problem of data preservation will continue to be a pressing one.
Resources Mentioned in the Article
 Pepe, A., Goodman, A., Muench, A., Crosas, M., Erdmann, C. (2014). How do astronomers share data? Reliability and persistence of datasets linked in AAS publications and a qualitative study of data practices among US astronomers. PLoS ONE 9(8): e104798. doi:10.1371/journal.pone.0104798
 Astronomy & Astrophysics. (n.d.) Author information. Retrieved from www.aanda.org/index.php?option=com_content&view=article&id=136&Itemid=200.
 Surace, C. (2014). Virtual observatory in astrophysics. In: C. Diaconu, S. Kraml, C. Surace, D. Chateigner, T. Libourel, A. Laurent, …V. Beckman (Eds.). PREDON Scientific Data Preservation 2014. (pp. 13-16). Retrieved from http://hal.in2p3.fr/in2p3-00959072
Edwin Henneken is an IT specialist for the NASA Astrophysics Data System at the Smithsonian Astrophysical Observatory in Cambridge, Massachusetts. He can be reached at ehenneken@cfa.harvard.edu.