The citation of research data sources is gaining momentum, with buy-in from several influential institutions. The groundwork has been laid, including guidelines for recommended data attributes, analyses of identifier systems, easy minting of digital object identifiers (DOIs) and standards for data reuse and exchange. But significant work remains, starting with cataloging archived resources, making citation easy for scientists and building a culture in which citing data is the norm. Collaborative efforts will speed success, and the researchers, librarians and IT specialists involved deserve credit. Implementing these steps will support the fundamental goal of sharing valuable data while strengthening the scientific record.
Bulletin, June/July 2012
Advancing the Practice of Data Citation: A To-Do List
by Joseph A. Hourclé
Citing data sources may not yet be common practice, but with all of the work done in the last year, we should soon reach the tipping point where data citation is not merely accepted but expected, or even required, in scientific publishing.
The work already done is substantial. The necessary attributes in a data citation have begun to coalesce, with recommendations from DataCite, the Federation of Earth Science Information Partners (ESIP) and others. The Digital Curation Centre (DCC), the National Academies' Board on Research Data and Information (BRDI) and the International Council for Science's Committee on Data for Science and Technology (CODATA) are looking into the broader issues of data citation, including culture and sustainability.
With these contributions, the research data community now has the necessary components to build citation frameworks. We have guidelines for the recommended attributes in a data citation; we have analyses of different identifier systems; we have reasonably priced DOI minting via EZID; and we have OAI-ORE (Open Archives Initiative-Object Reuse and Exchange) and Metalink to describe aggregates and alternatives.
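To make the minting step concrete, here is a sketch of preparing metadata for a service like EZID, which accepts key/value pairs in ANVL (A Name-Value Language) form over HTTP. The `datacite.*` field names and percent-encoding rules are drawn from EZID's conventions but should be verified against the current EZID API documentation before depositing anything real; the sample dataset values are hypothetical.

```python
def to_anvl(metadata: dict) -> str:
    """Serialize a metadata dict to an ANVL request body.

    ANVL frames each pair as one "name: value" line, so percent signs
    and line breaks inside names or values are percent-encoded to keep
    them from being read as framing characters.
    """
    def esc(s: str) -> str:
        return s.replace("%", "%25").replace("\n", "%0A").replace("\r", "%0D")
    return "\n".join(f"{esc(k)}: {esc(v)}" for k, v in metadata.items())

# Hypothetical dataset description using EZID-style "datacite.*" names.
body = to_anvl({
    "datacite.title": "Example Observation Campaign, Level 2 Data",
    "datacite.creator": "Example Instrument Team",
    "datacite.publisher": "Example Data Archive",
    "datacite.publicationyear": "2012",
})
print(body)
```

A real deposit would send this body (UTF-8, `text/plain`) in an authenticated HTTP request to the minting service; the encoding helper is the part worth getting right locally.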
It’s time to start building out systems to test these different recommendations and see how well they work together. Through this testing, we can identify edge cases that might prevent adoption by scientific communities.

The Work to Be Done

Cataloging. First, we need to catalog what we have. Archives that hold data should describe it using DataCite and whatever discipline-specific standards exist. We can start by assigning identifiers to these records and storing them where they can still be maintained even if the data are discarded. All data in our control should be properly registered with a title, creator and other necessary attributes so that researchers can easily and unambiguously cite them.
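As a sketch of what "properly registered" might mean in practice, the snippet below refuses to build a catalog record unless the citation-critical attributes are present. The field names loosely follow the DataCite recommendations; the validation rule and the sample values are illustrative assumptions, not part of any standard.

```python
# Citation-critical attributes a registered record must carry.
REQUIRED_FIELDS = ("identifier", "creator", "title", "publisher", "publication_year")

def make_record(**fields) -> dict:
    """Build a catalog record, refusing registration while any
    citation-critical attribute is missing or empty."""
    missing = [f for f in REQUIRED_FIELDS if not fields.get(f)]
    if missing:
        raise ValueError(f"cannot register record, missing: {missing}")
    return dict(fields)

record = make_record(
    identifier="doi:10.5072/example-dataset",  # 10.5072 is the reserved DOI test prefix
    creator="Example Instrument Team",         # hypothetical
    title="Example Observation Campaign, Level 2 Data",
    publisher="Example Data Archive",
    publication_year=2012,
)
```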
We need to build from within. Our community can work with the discipline that created the data to make the descriptions both harvestable by and useful to whatever search systems exist in that scientific field. When possible, working with the creators and current users of the data can help provide links to the documentation, software and other supplementary information needed to use and understand the data.
We need to reach out. When we can't easily describe our holdings, we should confer with others working in that science discipline to find acceptable solutions for the community and push those recommendations up to the appropriate standards body, which can then determine whether the solution is generalizable and whether the standard needs to change.
For those collections of data that aren't in a formal archive, we need to work with the current data distributors to ensure there are records of their holdings. These records can be at whatever granularity makes sense for the data; they need not be mutually exclusive. As above, when we run into problems, we need to reach out to the appropriate communities so that we can try to resolve them and document them in case they are part of a larger issue.
Citation Generation. Once we have the necessary records, we need to make it as easy as possible for scientists to cite them. Citation could be facilitated, for example, by generating a string in the appropriate citation format that scientists can copy and paste into a paper, or by providing a BibTeX entry so they can use their preferred citation management tools. We can make this information available both from the record documenting the data and, where possible, through the search and retrieval tools for the given discipline.
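For example, given a record carrying the recommended attributes, both a plain citation string and a BibTeX entry can be generated mechanically. The layout below follows the general "Creator (PublicationYear): Title. Publisher. Identifier" pattern recommended by DataCite; the exact punctuation, the `@misc` entry type and the sample values are assumptions a real service would tailor to each journal's style.

```python
def format_citation(rec: dict) -> str:
    """Render a copy-and-paste citation string from a catalog record,
    loosely following the DataCite-recommended layout."""
    return (f"{rec['creator']} ({rec['publication_year']}): "
            f"{rec['title']}. {rec['publisher']}. {rec['identifier']}")

def format_bibtex(rec: dict, key: str) -> str:
    """Render the same record as a BibTeX @misc entry so researchers
    can import it into their preferred citation manager."""
    return (f"@misc{{{key},\n"
            f"  author    = {{{rec['creator']}}},\n"
            f"  title     = {{{rec['title']}}},\n"
            f"  publisher = {{{rec['publisher']}}},\n"
            f"  year      = {{{rec['publication_year']}}},\n"
            f"  doi       = {{{rec['identifier'].removeprefix('doi:')}}}\n"
            f"}}")

# Hypothetical record; 10.5072 is the reserved DOI test prefix.
rec = {
    "identifier": "doi:10.5072/example-dataset",
    "creator": "Example Instrument Team",
    "title": "Example Observation Campaign, Level 2 Data",
    "publisher": "Example Data Archive",
    "publication_year": 2012,
}
print(format_citation(rec))
```

Serving both forms from the landing page of each record, as suggested above, lets scientists choose whichever fits their workflow.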
Scientist Outreach. We then need to work on changing the culture. We can approach the editors of the various disciplinary journals or other influential scientists about the benefits of data citation and encourage them to support it. As we talk to the scientists and researchers, we can identify specific needs for each science discipline, and as with the other problems, we can push those issues back up so we can find generalized solutions without fragmenting these efforts. We may also identify other areas where the science, library and IT communities can work together to solve challenges.
As scientific fields have different practices for sharing data, some will take more effort than others to convince, but we can build up success stories to share as examples of what can be done and how a field can be improved by data citation. Even baby steps move us all closer to the end goal.
Most scientists are receptive to the concepts of data sharing and citation once the issue is raised. In some cases it is a problem they know they have, just not their highest priority. In others, they may not realize the extent of the issue because they have never considered it. Where data sharing itself is still ad hoc, that must be resolved before we can even approach citation. Most scientists are happy to have someone else help solve data-sharing problems, as they are more interested in doing science than managing data.
Share the Results. We need to work together. None of us should be working in isolation; we should share both our successes and challenges with the science, library and IT people working on these problems. We can start collaborative efforts by
- identifying other groups who can work with us toward a common goal;
- looking to data archives, disciplinary search systems, science librarians and anyone else involved in storing, organizing or searching for scientific data and information; and
- talking across disciplines, as data citation is not a problem for just one scientific field or even science in general – research data drives decision-making in business and thus may occasionally influence politics.
Build on Success. By showing the benefit of basic citation in scientific articles, we should be able to use that success to support more comprehensive provenance descriptions in data citations. Just as those who provide the data should be given credit for their work, so should the many groups who help develop the software used to find, retrieve, process and visualize the data.
Once we have descriptions of the data available, we can look into how best to make them discoverable to scientists in related but different disciplines, so that more researchers can find and use the data and so that we can extract the most value possible from the experiments and observation campaigns that generated the data.
A few simple steps can help us advance the scientific record and provide appropriate credit to all of those parties who contribute, no matter the role.
Resources Mentioned in the Article
DataCite: http://schema.datacite.org/
ESIP: http://myitsc.org/esipcommons/node/308
DCC: www.dcc.ac.uk/resources/how-guides
CODATA: www.codata.org/taskgroups/TGdatacitation/
EZID: http://n2t.net/ezid/
OAI-ORE: www.openarchives.org/ore/
Metalink: www.metalinker.org/
Joseph Hourclé is principal software engineer, Wyle Information Systems. He can be reached at joseph.a.hourcle<at>nasa.gov.