Bulletin, June/July 2006
Web Services for Controlled Vocabularies
by Diane Vizine-Goetz, Andrew Houghton and Eric Childress
All three authors are affiliated with OCLC Research, OCLC Online Computer Library Center, 6565 Frantz Road, Dublin, OH 43065. Diane Vizine-Goetz, a research scientist, can be reached at vizine<at>oclc.org. Andrew Houghton, consulting software engineer, is at Houghton<at>oclc.org. Eric Childress, consulting project manager, is at eric_childress<at>oclc.org.
Amid the debates about whether folksonomies will supplant controlled vocabularies and whether the Library of Congress Subject Headings (LCSH) and Dewey Decimal Classification (DDC) system have outlived their usefulness, libraries, museums and other organizations continue to require efficient, effective access to controlled vocabularies for creating consistent metadata for their collections.
In this article, we present an approach for using Web services to interact with controlled vocabularies. Services are implemented within a service-oriented architecture (SOA) framework. SOA is an approach to distributed computing where services are loosely coupled and discoverable on the network. A set of experimental services for controlled vocabularies is provided through the Microsoft (MS) Office Research task pane, a small window or sidebar that opens next to Internet Explorer (IE) and other Microsoft Office applications. The research task pane is a built-in feature of IE when MS Office 2003 is installed. It enables a user to take advantage of a number of research and reference services accessible over the Internet. Other Web browsers, such as Mozilla Firefox and Opera, also provide sidebars that could be used to deliver similar, loosely coupled Web services.
Many controlled vocabularies are available for representing the content of documents and other resources. Depending upon their needs, catalogers, metadata specialists and webmasters can choose from vocabularies containing just a few terms, such as the Dublin Core Metadata Initiative (DCMI) Type Vocabulary, to large vocabularies containing many thousands of terms, such as the National Library of Medicine’s Medical Subject Headings (MeSH) or the Library of Congress Subject Headings. As part of the Terminology Services (TS) project, OCLC researchers are prototyping Web services for various types of knowledge organization schemes, including classification data, subject heading systems, thesauri and lists of form and genre terms. Over the last 18 months, OCLC Research has made the following controlled vocabularies accessible via one or more Web services:
- DCMI Type Vocabulary
- Guidelines on Subject Access to Individual Works of Fiction, Drama, etc. (GSAFD) list of form/genre headings
- Library of Congress Subject Headings (LCSH)
- Library of Congress Annotated Card Program AC Subject Headings (LCSHac)
- Medical Subject Headings (MeSH) 2005
- Medical Subject Headings (MeSH) 2005 Sample
- Medical Subject Headings (MeSH) 2006
- Newspaper Genre List (NGL)
- Radio Form/Genre Terms Guide (RADFG)
- Répertoire de vedettes-matière (RVM) (access restricted)
- Union List of Artist Names (ULAN) Sample
For the project, all of the controlled vocabularies were encoded in the MARC 21 Format for Authority Data in XML. The MARC 21 Authority Format was chosen because it enabled us to code common controlled vocabulary elements, such as preferred and non-preferred terms, term relationships, term mappings, the source of the content and the origin of changes. For some vocabularies it was first necessary to convert the controlled vocabulary data from word processing documents or HTML pages to more structured data formats and then into MARC 21. A sample term from the DCMI Type vocabulary, originally available only as HTML and Resource Description Framework (RDF), is shown in MARC 21 in XML in Figure 1.
Figure 1. Encoding of DCMI Type value 'Image' in MARC 21 Format for Authority Data in XML
The DCMI Type Vocabulary is a controlled list of terms that can be used as values for the DCMI Resource Type element to identify the genre of a resource. Data field tag “040” subfield code “a” contains the MARC organization code for DCMI, the originator of the content; subfield code “c” contains the code for OCLC Research, the party responsible for converting the content to the MARC format. The genre term Image is coded in tag “155” and the associated genre term Still Image is coded in tag “555.”
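The field layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the record shown in Figure 1: the two organization codes are placeholders (the article does not give the actual codes), the helper names are hypothetical, and the leader and control fields of a full MARC 21 authority record are omitted.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace


def datafield(record, tag, subfields, ind1=" ", ind2=" "):
    """Append a MARCXML datafield holding the given (code, value) subfields."""
    field = ET.SubElement(record, f"{{{MARC_NS}}}datafield",
                          {"tag": tag, "ind1": ind1, "ind2": ind2})
    for code, value in subfields:
        sub = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
        sub.text = value
    return field


def build_genre_record(term, related, source_code, converter_code):
    """Build a skeletal authority record for a genre term (no leader/control fields)."""
    ET.register_namespace("marc", MARC_NS)
    record = ET.Element(f"{{{MARC_NS}}}record", {"type": "Authority"})
    # 040 $a: originator of the content; 040 $c: party that converted it to MARC
    datafield(record, "040", [("a", source_code), ("c", converter_code)])
    # 155: the genre term itself
    datafield(record, "155", [("a", term)])
    # 555: associated (see also) genre terms
    for other in related:
        datafield(record, "555", [("a", other)])
    return record


# Placeholder organization codes; the real codes are not given in the text.
record = build_genre_record("Image", ["Still Image"],
                            "SOURCE-ORG-CODE", "CONVERTER-ORG-CODE")
marcxml = ET.tostring(record, encoding="unicode")
print(marcxml)
```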
For vocabularies already available in MARC 21, the conversion to MARCXML was a relatively straightforward process. Some problems were encountered with XML and XSLT tools when processing the larger vocabularies (more than 100,000 records), especially after the files were enhanced with the vocabulary’s full reference structure, term mappings and links to external Web sites. Once coded as XML, the data could be used as the basis for Web services. SKOS (Simple Knowledge Organization System) Core, an emerging RDF schema for thesauri and related knowledge organization schemes, and the Zthes 0.5 schema, a Z39.50 profile for thesaurus navigation, are also suitable formats for encoding vocabulary resources for Web services. Phase III of the High-Level Thesaurus (HILT) project is an example of a project that is using SKOS Core for encoding controlled vocabularies and classification data. MARC and Zthes formats may be added to HILT at a later stage. The Zthes 0.5 encoding for DCMI Type value Image is shown in Figure 2.
Figure 2. Zthes 0.5 encoding of DCMI Type value 'Image'
The implementation of Web services support in many widely adopted platforms presents opportunities to offer terminology Web services in various modular arrangements. OCLC Research is making a set of services for controlled vocabularies available through the Microsoft Office Research task pane. To use the OCLC TS pilot vocabularies, users add OCLC services to the research pane via a URL provided to pilot participants. Within the research pane, pilot users can search a given vocabulary, display information about a term, follow links to associated terms within a vocabulary and follow links to external Web sites. Because the pilot implementation is intended to be used alongside the user’s cataloging or metadata editing application, multiple copy and paste operations are provided. Users can insert controlled vocabulary terms with MARC field tags, indicators and subfield codes into MARC catalog records, or for non-MARC applications, users can insert terms as strings into their records without MARC coding.
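The two paste modes described above (a term with MARC coding, or a bare string) can be sketched as follows. This is an assumed illustration rather than the pilot's code: the function name and the display notation (`#` for a blank indicator, `$` before a subfield code) are hypothetical choices, and the example uses bibliographic field 650 with second indicator 2, which in MARC 21 marks a Medical Subject Headings term.

```python
def format_term(term, tag=None, ind1=" ", ind2=" ", subfield="a"):
    """Return a term as a plain string, or as a MARC field line when a tag is given."""
    if tag is None:
        # Non-MARC applications receive the term with no coding at all.
        return term
    # Blank indicators are shown as '#'; this notation is an assumption.
    indicators = (ind1 + ind2).replace(" ", "#")
    return f"{tag} {indicators} ${subfield} {term}"


print(format_term("Body Image"))                       # plain string
print(format_term("Body Image", tag="650", ind2="2"))  # MARC-coded line
```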
Figure 3. OCLC Terminology Services pilot vocabulary in research pane alongside Connexion session
Figure 3 shows the research task pane open to the left of a catalog record that was created in the OCLC Connexion service. As a sidebar application, the MS Research pane provides users the ability to conveniently interact with remote databases without interrupting their interaction with their main application (in this case, OCLC Connexion in the main window in Internet Explorer). Information retrieved in the research pane content area such as a subject heading, form or genre term, or other term from a controlled vocabulary can be easily transferred into the main application. In this example, the record for the MeSH heading Body Image has been retrieved from a copy of the 2005 edition of the MeSH vocabulary hosted at OCLC. As applicable, for each term retrieved, the following data is displayed in the research pane as expandable/collapsible sections:
- Class numbers
- Non-preferred terms
- Broader terms
- Related terms
- Narrower terms
- Mapped terms
Links to external websites are displayed in the notes section of a record. The sample record in Figure 3 contains links to the MeSH online record and MeSH tree structure on the National Library of Medicine website.
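One way to sort the fields of an authority record into the sections listed above is sketched below. The mapping is an assumption based on general MARC 21 authority conventions (4XX see-from tracings, 5XX see-also tracings with a relationship code in the first position of subfield $w, 7XX linking entries, and 053/083 for classification numbers); the article does not say which fields the pilot actually uses for each section, and the function name is hypothetical.

```python
# Assumed classification-number fields: 053 (LC) and 083 (Dewey).
CLASS_NUMBER_TAGS = {"053", "083"}


def section_for(tag, control=""):
    """Map a MARC authority field tag (and its $w control subfield) to a display section."""
    if tag in CLASS_NUMBER_TAGS:
        return "Class numbers"
    if tag.startswith("4"):
        return "Non-preferred terms"
    if tag.startswith("5"):
        # First character of $w: 'g' marks a broader term, 'h' a narrower term.
        code = control[:1]
        if code == "g":
            return "Broader terms"
        if code == "h":
            return "Narrower terms"
        return "Related terms"
    if tag.startswith("7"):
        return "Mapped terms"
    return None  # field is not shown in any of these sections


print(section_for("555"))       # see-also tracing without a relationship code
print(section_for("550", "g"))  # see-also tracing coded as broader
```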
The Microsoft research pane, item one (Figure 4, lower right), provides a service-oriented architecture framework for accessing terminology services. Microsoft has defined a public schema, known as Research Services, for interaction with the research pane client. The Research Services Web service, item two (Figure 4, lower left), defines two Web methods: register and query.
- The register Web method is called by the research pane client to obtain information about the information provider and the services that will be offered to the client.
- The query Web method is called by the research pane client to obtain content from the information provider that will be displayed in the research pane content area.
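The division of labor between the two Web methods can be sketched with a small in-memory proxy. This is an illustration only: the class, its fields and its return values are hypothetical and do not follow the Research Services schema, which exchanges XML messages rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class TerminologyProxy:
    """Toy stand-in for a service exposing register and query methods."""
    provider: str
    # vocabulary name -> {preferred term: record}
    vocabularies: dict = field(default_factory=dict)

    def register(self):
        """Describe the provider and the services offered to the client."""
        return {"provider": self.provider, "services": sorted(self.vocabularies)}

    def query(self, service, text):
        """Return records whose preferred term contains the query text."""
        records = self.vocabularies.get(service, {})
        needle = text.casefold()
        return [rec for term, rec in records.items() if needle in term.casefold()]


# Sample data: the MeSH heading and ID are the ones mentioned in the article.
proxy = TerminologyProxy(
    "Example provider",
    {"mesh": {"Body Image": {"id": "D001828", "term": "Body Image"}}},
)
print(proxy.register())
print(proxy.query("mesh", "body"))
```

A client would call register once, when the service is added to the pane, and query each time the user searches.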
The Research Services Web service, item two (Figure 4, lower left), is used as a proxy to any backend storage technology containing controlled vocabularies, item three (upper left). For example, vocabularies can be stored as full text databases, SQL databases or XML files. The diagram depicts access to the various backend storage technologies through distributed Web service protocol technologies such as the SRU/W protocol (Search/Retrieve via URL/Web service), REST (Representational State Transfer) and SOAP (Simple Object Access Protocol). For maximum flexibility, we chose to insulate access to the various backend storage technologies. This approach allows a vocabulary to reside at OCLC or at another location. The Web service can access those backend storage technologies directly, use the distributed Web service protocol technologies, or both.
OCLC’s experimental implementation uses the OCLC Pears full text database software along with a Search/Retrieve Web service (SRW) interface to access the vocabularies. The terminology Web service acts as a proxy to the vocabularies providing query and markup translation along with authentication and authorization, when necessary.
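To give a sense of what a search/retrieve request looks like, the sketch below builds a request in SRU form, the URL-based sibling of SRW, since it is easier to show than a SOAP envelope. The parameter names (operation, version, query, maximumRecords, recordSchema) come from the SRU 1.1 specification; the base URL, the function name and the choice of record schema are assumptions, not details of the OCLC service.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_sru_url(base_url, cql_query, max_records=10, record_schema=None):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    if record_schema:
        params["recordSchema"] = record_schema
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint; a quoted phrase is a valid CQL query on its own.
url = build_sru_url("http://example.org/sru/mesh", '"body image"',
                    max_records=5, record_schema="marcxml")
print(url)
print(parse_qs(urlsplit(url).query)["query"])
```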
Figure 4. OCLC Terminology Services Pilot System Architecture
Our work with the Microsoft Office Research task pane explored the use of Web-based terminology services with library automation systems. We are now expanding our scope to Web-based terminology services that could interact with Semantic Web applications.
The SIMILE (Semantic Interoperability of Metadata and Information in unLike Environments) project is a Semantic Web initiative that seeks to enhance interoperability among digital assets, schemata, vocabularies, ontologies, metadata and services. The initiative is a joint project of the W3C, MIT Libraries and MIT Computer Science and Artificial Intelligence Laboratory. The SIMILE project has created Piggy Bank, a Firefox Web browser extension that allows existing information on the Web to be used in more useful and flexible ways.
OCLC Research is investigating how the OCLC Pears full text database software along with its Search/Retrieve Web service (SRW) interface could be modified to interact with SIMILE’s Piggy Bank Semantic Web application. Our initial investigation for obtaining interoperability between these applications has been promising. Our focus has been on addressing issues on the provider side rather than the consumer side, that is, modifying the OCLC software and not the Piggy Bank application.
Interoperability issues have arisen due to differences between metadata formats and identifiers. For the Terminology Services project, all of the controlled vocabularies were encoded in the MARC 21 Format for Authority Data using the MARCXML standard. The SIMILE project uses a different standard, the RDF-XML standard.
The OCLC Pears full text database software contains an SRW service interface that is controlled by a series of Extensible Stylesheet Language (XSL) transforms. Our initial investigation revealed that, although it was possible to create an XSL transform to convert between the MARCXML and SKOS RDF-XML markup languages, the difference between the character encodings for these standards was more problematic. The resolution was to replace the existing XSLT processor (Apache Xalan XSLT 1.0) with the Saxon XSLT 2.0 processor and to create an XSL 2.0 transform that converts between the MARCXML and SKOS RDF-XML markup languages.
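The kind of mapping such a transform performs can be sketched in Python instead of XSLT. This is a simplified stand-in, not the authors' stylesheet: it maps only the heading (1XX) to skos:prefLabel and see-from tracings (4XX) to skos:altLabel, the concept URI is supplied by the caller, and the sample record (including the non-preferred term "Picture") is invented for illustration. See-also fields (5XX) are skipped because SKOS relations point at concept URIs rather than at labels.

```python
import xml.etree.ElementTree as ET

MARC = "http://www.loc.gov/MARC21/slim"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"


def first_subfield_a(field):
    """Return the text of the first $a subfield, or None if there is none."""
    sub = field.find(f"{{{MARC}}}subfield[@code='a']")
    return sub.text if sub is not None else None


def marc_to_skos(marcxml, concept_uri):
    """Convert one MARCXML authority record into a minimal SKOS RDF/XML concept."""
    record = ET.fromstring(marcxml)
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("skos", SKOS)
    root = ET.Element(f"{{{RDF}}}RDF")
    concept = ET.SubElement(root, f"{{{SKOS}}}Concept",
                            {f"{{{RDF}}}about": concept_uri})
    for field in record.iter(f"{{{MARC}}}datafield"):
        tag = field.get("tag", "")
        label = first_subfield_a(field)
        if label is None:
            continue
        if tag.startswith("1"):      # established heading
            prop = "prefLabel"
        elif tag.startswith("4"):    # see-from tracing (non-preferred term)
            prop = "altLabel"
        else:
            continue                 # 0XX, 5XX, 7XX etc. are not mapped here
        ET.SubElement(concept, f"{{{SKOS}}}{prop}").text = label
    return ET.tostring(root, encoding="unicode")


# Invented sample record for illustration only.
SAMPLE = (
    '<record xmlns="http://www.loc.gov/MARC21/slim" type="Authority">'
    '<datafield tag="155" ind1=" " ind2=" "><subfield code="a">Image</subfield></datafield>'
    '<datafield tag="455" ind1=" " ind2=" "><subfield code="a">Picture</subfield></datafield>'
    '<datafield tag="555" ind1=" " ind2=" "><subfield code="a">Still Image</subfield></datafield>'
    '</record>'
)
skos_xml = marc_to_skos(SAMPLE, "http://example.org/vocab/dcmitype/Image")
print(skos_xml)
```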
Identifiers for the terms in controlled vocabularies are also an issue of concern. Many controlled vocabularies either do not have identifiers (the preferred term acts as the identifier) or have internal identifiers that are not Web-actionable URLs. An example of an identifier for a term is shown in Figure 3. It is the unique ID, D001828, associated with the MeSH heading Body Image. Although the RDF-XML standard does not require Web-actionable URLs, the lack of them makes Semantic Web applications like SIMILE's Piggy Bank less useful. The identifier issues remain under investigation and will affect the generation of the SKOS RDF-XML markup.
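One possible approach to the identifier problem, offered here only as an illustration and not as the project's solution, is to mint a URL from a base address, the vocabulary name and either the internal ID or, failing that, the preferred term. The base URL, the URL pattern and the function name are all hypothetical.

```python
from urllib.parse import quote


def mint_concept_uri(base, vocabulary, identifier=None, preferred_term=None):
    """Build a URL for a term, preferring its internal ID over its label."""
    key = identifier if identifier else preferred_term
    if not key:
        raise ValueError("need an identifier or a preferred term")
    # Percent-encode each path segment so spaces and slashes are safe in a URL.
    return f"{base.rstrip('/')}/{quote(vocabulary, safe='')}/{quote(key, safe='')}"


# Vocabulary with an internal ID (the MeSH ID cited in the article).
print(mint_concept_uri("http://example.org/vocab", "mesh", identifier="D001828"))
# Vocabulary without IDs: the preferred term acts as the identifier.
print(mint_concept_uri("http://example.org/vocab/", "dcmitype",
                       preferred_term="Still Image"))
```

A URL built from a preferred term changes whenever the term is revised, which is one reason a stable internal ID is the better key where one exists.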
For Further Reading
HILT (High-Level Thesaurus Project) homepage. (2005, May 27). Centre for Digital Library Research, University of Strathclyde. Retrieved April 13, 2006, from http://hilt.cdlr.strath.ac.uk
Huynh, D., Mazzocchi, S., & Karger, D. (2005). Piggy Bank: Experience the Semantic Web inside your Web browser. Simile. Cambridge, MA: MIT. Retrieved April 13, 2006, from http://simile.mit.edu/papers/iswc05.pdf
Manola, F., & Miller, E. (Eds.). (2004, February 10). RDF Primer. World Wide Web Consortium (W3C) RDF Core Working Group. Retrieved April 13, 2006, from www.w3.org/TR/rdf-primer/
Miles, A., & Brickley, D. (Eds.). (2005, November 5). SKOS Core Guide. World Wide Web Consortium (W3C) Semantic Web Best Practices and Deployment Working Group. Retrieved April 13, 2006, from www.w3.org/TR/swbp-skos-core-guide/
SRW: Search/Retrieve Web Service. (2004, February 14). In SRU: Search Retrieve Via URL v. 1.1. Washington, DC: Library of Congress. Retrieved April 13, 2006, from www.loc.gov/standards/sru/srw/
Terminology Services. (n.d.). Dublin, OH: OCLC Research, OCLC Online Computer Library Center. Retrieved April 13, 2006, from www.oclc.org/research/projects/termservices/
Vizine-Goetz, D., Hickey, C., Houghton, A., & Thompson, R. (2004). Vocabulary mapping for terminology services. Journal of Digital Information, 4(4). Retrieved April 13, 2006, from http://jodi.ecs.soton.ac.uk/Articles/v04/i04/Vizine-Goetz/