Bulletin
New Metadata Standards for Digital Resources: MODS and METS
Rebecca Guenther and Sally McCallum are with the Network Development and MARC Standards Office at the Library of Congress. Rebecca Guenther can be reached at firstname.lastname@example.org
Metadata has taken on a new look with the advent of XML and digital resources. XML provides a new, versatile structure for tagging and packaging metadata, while the rapid proliferation of digital resources demands both rapidly produced descriptive data and the encoding of more types of metadata. Two emerging standards are attempting to harness these developments for library needs. The first is the Metadata Object Description Schema (MODS), a MARC-compatible XML schema for encoding descriptive data. The second is the Metadata Encoding and Transmission Standard (METS), a highly flexible XML schema for packaging the descriptive metadata and the various other important types of metadata needed to assure the use and preservation of digital resources.
The Library of Congress' Network Development and MARC Standards Office developed MODS (www.loc.gov/standards/mods/) in consultation with interested experts to satisfy the expressed need for an abbreviated XML version of MARC 21. XML is being increasingly deployed in computer applications, particularly on the Web, as a richer, more flexible alternative to HTML. Many have expressed the need to move to XML for metadata in libraries and other cultural institutions. It is appropriate for an XML version of MARC to be investigated since it is perhaps the oldest metadata standard designed for use in computers.
Over the years people have expressed concerns about the number of data elements in MARC and their complexity. Some have suggested use of the Dublin Core Metadata Element Set (http://dublincore.org), although that set is intended to satisfy a broader range of purposes and communities than MARC 21. In order to address these concerns about MARC and also allow for a rich description, the Library of Congress developed MODS, an XML schema with language-based tags that includes a subset of data elements derived from MARC 21. It is intended to carry selected data from existing MARC 21 records as well as to enable the creation of original resource description records.
MODS is intended to complement other metadata formats and to provide an alternative between a simple metadata format with a minimum of fields and little or no substructure such as Dublin Core and a very detailed format with many data elements having various structural complexities such as MARC 21. MODS has a high level of compatibility with MARC records because it inherits the semantics of the equivalent data elements in the MARC 21 bibliographic format. Thus, it is richer than Dublin Core and more compatible with library data than ONIX (www.editeur.org/onix.html), which was developed for the book industry, but it is also simpler than the full MARC format (either as ISO 2709 or full MARCXML). It is more "friendly" because it uses language-based tags that can be easily understood by anyone dealing with the "raw" record, as opposed to the numeric tags traditional to MARC.
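As a minimal sketch of what language-based tags look like in practice, the following Python fragment builds a very small MODS record with the standard library. The namespace URI is that of MODS version 3 and is an assumption here (the trial version may use a different one); the sample values and the helper names `q` and `build_mods` are hypothetical, and the MARC equivalents noted in the comments are given only for orientation.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI of MODS version 3; other versions may differ.
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag: str) -> str:
    """Return the namespace-qualified form of a MODS tag name."""
    return f"{{{MODS_NS}}}{tag}"


def build_mods(title: str, author: str, genre: str) -> ET.Element:
    """Build a tiny MODS record; comments name the roughly equivalent MARC field."""
    mods = ET.Element(q("mods"))
    title_info = ET.SubElement(mods, q("titleInfo"))   # MARC 245
    ET.SubElement(title_info, q("title")).text = title  # MARC 245 $a
    name = ET.SubElement(mods, q("name"), {"type": "personal"})  # MARC 100
    ET.SubElement(name, q("namePart")).text = author    # MARC 100 $a
    ET.SubElement(mods, q("genre")).text = genre        # value drawn from MARC coded data
    return mods


# Sample values are invented for illustration.
record = build_mods("Sample title", "Doe, Jane", "bibliography")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Someone reading the serialized record sees `title`, `namePart` and `genre` rather than 245, 100 and a fixed-field code, which is the sense in which MODS is "friendlier" than numeric MARC tagging.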
Most elements that have been defined in MODS have equivalents in the MARC 21 bibliographic format. In addition, the Library of Congress has made available mappings from MARC to MODS and from MODS to MARC (www.loc.gov/standards/mods/modsmapping.html; www.loc.gov/standards/mods/mods2marcmapping.html). Since MODS elements inherit the semantics of MARC elements, an element in MODS has the meaning detailed in the MARC 21 bibliographic format.
In MODS some elements in MARC have been repackaged, for example in cases where several data elements are brought into one. This repackaging occurs in the MODS element genre, which uses controlled values that are used in various MARC elements, particularly in fixed fields. The Library of Congress has made available a controlled list of genre values found in various places in the MARC 21 bibliographic format to be used with the MODS genre element (www.loc.gov/marc/sourcecode/genre/genrelist.html).
MODS, like MARC 21, does not assume the use of any particular cataloging code. It can accommodate record content that is fully AACR2-compliant with authoritative name and subject headings, content that is not controlled by any cataloging rules, or anything in between.
Since MODS is a subset of MARC, decisions were made about which elements to include, which to combine with others to form a single element and which to drop altogether. For instance, there are numerous types of relationships that are expressed in the MARC linking entry fields. These are carried in MODS under relatedItem with a type attribute to express the type of relationship. Not all relationships in MARC are given type values.
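The handling of linking entry fields can be sketched as a small lookup table. The Python fragment below is illustrative only: the table covers a handful of MARC linking fields, the type values shown should be checked against the official MARC-to-MODS mapping, and the function name `related_item` and the sample titles are hypothetical. A value of `None` stands for a relationship that is carried without a type value.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Illustrative subset: MARC linking entry field -> relatedItem type value.
# None means the relationship is carried without a type attribute.
LINKING_FIELD_TO_TYPE: Dict[str, Optional[str]] = {
    "773": "host",
    "776": "otherFormat",
    "780": "preceding",
    "785": "succeeding",
    "787": None,
}


def related_item(marc_tag: str, title: str) -> ET.Element:
    """Build a relatedItem element for one MARC linking entry field."""
    if marc_tag not in LINKING_FIELD_TO_TYPE:
        raise KeyError(f"no mapping sketched for MARC field {marc_tag}")
    item = ET.Element("relatedItem")
    rel_type = LINKING_FIELD_TO_TYPE[marc_tag]
    if rel_type is not None:
        item.set("type", rel_type)
    title_info = ET.SubElement(item, "titleInfo")
    ET.SubElement(title_info, "title").text = title
    return item


earlier = related_item("780", "Earlier journal title")
untyped = related_item("787", "Some other related work")
print(ET.tostring(earlier, encoding="unicode"))
print(ET.tostring(untyped, encoding="unicode"))
```

Many distinct MARC fields collapse into one MODS element here, with the attribute carrying what the numeric tag used to express.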
Certain MODS elements define concepts that recur in more than one element as sub-elements. XML facilitates using the same definition for multiple elements. For example, "name" can be the primary name associated with the resource or a name associated with a related item; in MODS, both use the same definition. This concept is certainly present in MARC 21 but not as consistently as in MODS.
Since MODS includes only a subset of MARC 21 bibliographic fields, some MARC 21 fields convert directly to MODS, while others may be dropped or carried in a less specific manner. The MODS schema does not target "round-tripability" with MARC 21. A converted record may lose some of its tagging, for instance where the MODS tagging is simpler, or it may still accommodate some data for which there is no equivalent data element. When an XML schema is desired that does not result in any data loss, the MARC 21 XML schema may be used (www.loc.gov/standards/marcxml/), since it allows for the expression of a full MARC record in XML. For any conversion between MARC in ISO 2709 format and MODS, it is expected that the record would first go through a conversion to MARCXML before a transformation to the subset that is MODS. The Library of Congress is providing tools for the conversion from MARC 21 to MARCXML with a further transformation to MODS.
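The lossy nature of the second step, from MARCXML to MODS, can be illustrated with a hand-written stand-in. This is not the Library of Congress conversion tool: the two-entry mapping, the function name `marcxml_to_mods`, the sample record and the local field 999 are all assumptions made for the sketch, and namespaces are omitted on the MODS side for brevity.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace

# Invented sample record; field 999 stands in for data with no mapping.
MARCXML = f"""
<record xmlns="{MARC_NS}">
  <datafield tag="100" ind1="1" ind2=" "><subfield code="a">Doe, Jane</subfield></datafield>
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">Sample title</subfield></datafield>
  <datafield tag="999" ind1=" " ind2=" "><subfield code="a">local data</subfield></datafield>
</record>
"""

# Hypothetical mapping: (MARC tag, subfield code) -> (MODS parent, MODS child).
MAPPING = {
    ("100", "a"): ("name", "namePart"),
    ("245", "a"): ("titleInfo", "title"),
}


def marcxml_to_mods(marcxml: str) -> Tuple[ET.Element, List[str]]:
    """Convert mapped subfields to MODS elements; also return the tags dropped."""
    record = ET.fromstring(marcxml.strip())
    mods = ET.Element("mods")
    dropped: List[str] = []
    for field in record.findall(f"{{{MARC_NS}}}datafield"):
        tag = field.get("tag")
        mapped = False
        for sub in field.findall(f"{{{MARC_NS}}}subfield"):
            path = MAPPING.get((tag, sub.get("code")))
            if path is None:
                continue  # no equivalent in this sketch's mapping
            parent = ET.SubElement(mods, path[0])
            ET.SubElement(parent, path[1]).text = sub.text
            mapped = True
        if not mapped:
            dropped.append(tag)
    return mods, dropped


mods, dropped = marcxml_to_mods(MARCXML)
print(ET.tostring(mods, encoding="unicode"))
print("dropped fields:", dropped)
```

Because the dropped field cannot be reconstructed from the MODS output, converting back would not reproduce the original record, which is why MARCXML remains the choice when no data loss is acceptable.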
The need for a rich metadata standard such as MODS has been expressed by members of the digital library and related communities as they attempt to implement projects involving search and retrieval, management of complex digital objects, integrating metadata from library databases with other non-MARC sources and other functions.
The "Search/Retrieve Web Service" (SRW) in ZING (Z39.50 International Next Generation) (www.loc.gov/z3950/agency/zing/srwu/srw.html) is a proof-of-concept initiative to develop value-added search and retrieve applications built on Z39.50 along with Web technologies – XML, SOAP/RPC and HTTP. It defines a search service that specifies metadata schemas for retrieval. Since it uses XML, an XML metadata schema is needed, and one compatible with library data such as MODS would be desirable.
The Open Archives Initiative Protocol for Metadata Harvesting (www.openarchives.org/) harvests MARC records from multiple systems and makes them available widely. Generally, the records have been available in MARC (using MARC tagging and syntax in MARCXML) or simple Dublin Core in XML. The Library of Congress is planning to incorporate MODS as an alternate format for its over 100,000 metadata records that describe various forms of material digitized for American Memory. This will allow for the export of metadata that is richer than the Dublin Core record, which drops much of the metadata, but simpler than full MARCXML.
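A harvesting request that asks for one metadata format rather than another differs only in a single argument. The Python sketch below builds two such request URLs. The base URL and the set name are hypothetical, and the prefix `mods` is an assumption, since each repository chooses the prefixes for formats beyond the protocol's required `oai_dc`.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical repository endpoint, for illustration only.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(metadata_prefix: str, set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for the given metadata format."""
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        args["set"] = set_spec  # optional selective harvesting by set
    return f"{BASE_URL}?{urlencode(args)}"


dc_url = list_records_url("oai_dc")                        # simple Dublin Core
mods_url = list_records_url("mods", set_spec="sample_set")  # assumed MODS prefix
print(dc_url)
print(mods_url)
```

A harvester wanting richer records would issue the second form, provided the repository advertises that format.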
MODS may be used for original resource description, yielding rich records that are generally compatible with existing library data and are expressed in XML syntax. Because it includes a subset of MARC fields and repackages some of them, it is particularly useful for technician input.
An additional use of MODS is as an extension schema for descriptive metadata for a METS object, as detailed later in this paper.
Experimentation with MODS
Since MODS was officially made available in June 2002, experimentation is just beginning. In June 2002, MODS was frozen for a six-month trial, although suggested additions are being listed on the MODS website. The following describes a few sample experiments.
· The Library of Congress' AudioVisual Prototyping Project is exploring aspects of digital preservation for audio and video. This collaborative project is developing approaches for packaging digital content, with a focus on metadata (http://lcweb.loc.gov/rr/mopic/avprot/avprhome.html). The project is experimenting with METS for packaging the digital object and its metadata and is implementing MODS as its descriptive metadata schema, particularly because of the rich descriptions of relationships with other items that may be expressed. Where possible, metadata is being reused from descriptive cataloging records in one of the Library's databases with minimal data loss. In some cases, original resource description is provided and a MODS template is used.
· MINERVA (Mapping the Internet: Electronic Resources Virtual Archive) is an experimental pilot developed to identify, select, collect and preserve websites (http://lcweb.loc.gov/minerva/minerva.html). LC is collaborating with the Internet Archive (Alexa), SUNY and the University of Washington to collect and archive websites, providing descriptive metadata that will be used to search, retrieve and analyze the archived collections. Metadata will be created for websites in the collection using the MODS schema because of its compatibility with MARC data, to be used in the search and retrieval system and later converted to MARC and added to the Library's online catalog.
· The University of Chicago Press is implementing a project to support the development of the Chicago Digital Distribution Center, which would be built upon its traditional distribution center and involves making digital books available for distribution. The Press harvests MARC records to enhance searchability and for export to their client presses, converting them into MODS for more concise description and more understandable language-based tags.
· The California Digital Library is establishing a generic METS repository infrastructure to help manage the digital objects in its control. A project provides for search and display of 1,500 records for books published online by the CDL on behalf of the University of California Press. Records are extracted from the union catalog and transformed to MODS, then inserted into the METS record. Specified fields are used for indexing and searching as well as in response to an Open Archives Initiative Harvesting Protocol query.
The Metadata Encoding and Transmission Standard (METS) (www.loc.gov/standards/mets) grew out of several experimental 1990s digital projects. In February 2001, the Digital Library Federation convened a meeting of experts from several projects to evaluate what had been learned with respect to metadata and to decide how to go forward. Out of that meeting came the idea for METS, an XML document that packages the metadata associated with a digital resource – the descriptive, administrative, structural, rights and other data needed for retrieving, preserving and serving up digital resources. Then, in a little over a year the METS XML schema was developed, a maintenance structure set up and experimentation worldwide began.
METS metadata is essential for a digital material repository, where digital resources – over 7 million at LC alone – are stored along with information about the resources. A repository, which can take many configurations, is the instrument for access and preservation of the objects. The METS data is also important for the interchange of digital objects for viewing and use by other systems. If the digital resource has with it the METS description, the file should be usable for many activities at the receiving system.
Characteristics of METS
METS is an open standard, not a proprietary one. Library system staff and librarians who also participate in developments in the Internet community are constructing it. Jerome McDonough of NYU serves in the critical role of editor-in-chief of the METS XML schema. The Library of Congress has agreed to serve as the maintenance agency for the standard – building a website for METS, supporting an open listserv for implementers and working on extension schema. Recently an editorial board was formed of major contributors to the development thus far with the intent to identify and bring in global partners. The schema is now complete and stable enough to consider taking it to a formal standards body such as ISO or NISO.
The structure of the METS schema is highly flexible and relatively simple. Conceptually, it consists of six modules that contain and/or point to the different types of metadata needed for a digital resource. In several of the modules the METS standard does not define the metadata elements and tags to be used. Instead, it allows the user to choose a standard "extension" schema, identify it and use it. For several modules it allows the metadata to reside outside the package, pointed to from within the METS document. These features exemplify the flexibility of METS.
The METS "Package"
The six parts of the METS package or document are as follows: header, descriptive metadata, administrative metadata, file section, structural map and behavior section. They are all optional except for the header and structural map, which are needed for basic access to the digital resource. The descriptive, administrative and behavior sections may reside in the METS document or be external. If they are internal, XML schemas are preferred. If the metadata for these sections is external and merely pointed to from the appropriate section of the METS document, the metadata may be of any type and format. For descriptive metadata it may even be an entry in a catalog if the catalog record can be adequately referenced.
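The choice between internal and external descriptive metadata can be sketched as two ways of filling a descriptive metadata section. In the Python fragment below, the element and attribute names (`dmdSec`, `mdWrap`, `xmlData`, `mdRef`, `MDTYPE`, `LOCTYPE`) follow the METS schema as the editor understands it; the identifiers, the catalog URL and the helper function names are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(tag: str) -> str:
    """Return the namespace-qualified form of a METS tag name."""
    return f"{{{METS_NS}}}{tag}"


def internal_dmd(section_id: str, mods_record: ET.Element) -> ET.Element:
    """Descriptive metadata carried inside the METS document as wrapped XML."""
    dmd = ET.Element(m("dmdSec"), {"ID": section_id})
    wrap = ET.SubElement(dmd, m("mdWrap"), {"MDTYPE": "MODS"})
    ET.SubElement(wrap, m("xmlData")).append(mods_record)
    return dmd


def external_dmd(section_id: str, href: str, mdtype: str) -> ET.Element:
    """Descriptive metadata left outside and merely pointed to, e.g. a catalog record."""
    dmd = ET.Element(m("dmdSec"), {"ID": section_id})
    attrs = {"LOCTYPE": "URL", "MDTYPE": mdtype, f"{{{XLINK_NS}}}href": href}
    ET.SubElement(dmd, m("mdRef"), attrs)
    return dmd


# Invented sample data for illustration.
mods_record = ET.Element("mods")
ET.SubElement(ET.SubElement(mods_record, "titleInfo"), "title").text = "Sample title"

wrapped = internal_dmd("DMD1", mods_record)
pointer = external_dmd("DMD2", "https://catalog.example.org/record/123", "MARC")
print(ET.tostring(wrapped, encoding="unicode"))
print(ET.tostring(pointer, encoding="unicode"))
```

The wrapped form travels with the package, while the pointer form keeps the METS document small but depends on the external record remaining reachable.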
The METS descriptive metadata section is the most familiar to librarians, as it contains cataloging and finding-aid data. There are several established schemas that can be used for the descriptive metadata – including MODS, which was designed with a special focus on electronic resources; Dublin Core, when only minimal data is needed; or MARCXML when full MARC record information is available.
The administrative metadata is the most critical for use and preservation of the digital resource. Here resides source information such as resource creation date, resource format information, resource use information, digital provenance and copyright and license information. This section may contain information on past transformations and migrations of the data and master/derivative information, all useful for preservation purposes. XML schemas are not yet standard for most of this information, but the METS project is pushing their development. For example, the recently completed NISO data dictionary for technical metadata for still images (www.niso.org/standards/resources/Z39_87_trial_use.pdf) is being used through a supporting XML schema, MIX (Metadata for Images in XML) (www.loc.gov/mix/). Project participants also have drafts prepared for the technical data for text, audio and moving images. The draft schemas are available from the METS website.
The behavior section of the METS document contains pointers to computer programs or applications that are used to display digital objects such as page-turners or audio players. The behavior information is intended to assist in providing "disseminators" for end user access.
The header, file section and structural map may reside only in the METS document; there is no option to point to them outside. The file section identifies all the files that make up the object, such as thumbnails, master archival files, PDF versions and text-encoded versions. The structural map lays out the hierarchical structure of the document. An important feature of METS is that the structural map may point to parts of the descriptive and administrative metadata from different places in the structural hierarchy of the resource, enabling linking to subparts of digital resources, such as cuts on sound recordings. The header gives information about the METS document itself, such as identifiers, date created, dates updated and status.
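How the structural map ties files and metadata to parts of a resource can be shown with a small, invented METS document for a sound recording with one cut. The Python sketch below parses it and resolves, for each division in the structural map, the descriptive metadata and files it points to. The document content, file names and the function name `resolve` are hypothetical; the element and attribute names are those of the METS schema as the editor understands it.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
HREF = "{http://www.w3.org/1999/xlink}href"

# Invented example: an album whose single cut has its own description and file.
METS_DOC = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:metsHdr CREATEDATE="2002-12-01T00:00:00" RECORDSTATUS="draft"/>
  <mets:dmdSec ID="DMD_ALBUM"><mets:mdRef LOCTYPE="URL" MDTYPE="MODS" xlink:href="album.mods.xml"/></mets:dmdSec>
  <mets:dmdSec ID="DMD_CUT1"><mets:mdRef LOCTYPE="URL" MDTYPE="MODS" xlink:href="cut1.mods.xml"/></mets:dmdSec>
  <mets:fileSec>
    <mets:fileGrp USE="master">
      <mets:file ID="F1" MIMETYPE="audio/x-wav"><mets:FLocat LOCTYPE="URL" xlink:href="cut1.wav"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="physical">
    <mets:div TYPE="album" DMDID="DMD_ALBUM">
      <mets:div TYPE="cut" LABEL="Cut 1" DMDID="DMD_CUT1"><mets:fptr FILEID="F1"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def resolve(mets_xml: str):
    """Return (division type, metadata location, file locations) per structMap div."""
    root = ET.fromstring(mets_xml.strip())
    files = {f.get("ID"): f.find("mets:FLocat", NS).get(HREF)
             for f in root.iterfind(".//mets:file", NS)}
    dmd = {d.get("ID"): d.find("mets:mdRef", NS).get(HREF)
           for d in root.iterfind("mets:dmdSec", NS)}
    rows = []
    struct_map = root.find("mets:structMap", NS)
    for div in struct_map.iter("{http://www.loc.gov/METS/}div"):
        pointers = div.findall("mets:fptr", NS)  # direct children only
        rows.append((div.get("TYPE"),
                     dmd.get(div.get("DMDID")),
                     [files[p.get("FILEID")] for p in pointers]))
    return rows


rows = resolve(METS_DOC)
for row in rows:
    print(row)
```

The album-level division links only to the album's description, while the nested division for the cut links to its own description and to its audio file, which is the subpart linking described above.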
METS allows use of any established schemas for the different modules. While essential for acceptance in these times when standards are not yet in place, such flexibility could impede interoperability. Accordingly, the library community will be trying to work out a subset of schemas or profiles for the different types of data described above that will be used across exchange communities. This makes international participation in the use and development of METS especially important.
METS Implementation and Use
METS came at a key moment for implementation and use. Many institutions had experimented with digitization and had begun to build collections large enough to seriously require better organization tools. Also there has recently been an increasing number of projects for archiving of open access Web subsets. The recent METS implementations have fortunately varied in form of material and size, giving good information to the standard's developers. To indicate only a few projects, the Library of Congress is using METS for a very large body of moving image and audio material and other mixed media folk life resources; the National Library of Wales is using METS initially for textual material; Harvard is experimenting with audio collections; and Michigan State is working with moving images. Both OCLC and RLG are working METS into their digital projects.
The library community has well-developed bibliographic description traditions that, with some adjustment for digital resources such as the MODS development exemplifies, will serve the digital future. In the larger metadata picture, the development of METS is a big step toward bringing to non-descriptive metadata the stability needed for a smoothly functioning Internet environment where electronic resources flow seamlessly between systems. These developments relate well to the OAIS (Open Archival Information System) Reference Model (www.ccsds.org/RP9905/RP9905.html), which helps to define the processes and boundaries in creating, managing, sustaining and serving digital resources. The METS package can be used to collect digital resource metadata for submission to the repository, serve as the place for the metadata within the repository and be the supplier of information to the tools that provide the resources to the patrons.
This article is based on one to appear in early 2003 in Portal: Libraries and the Academy, Johns Hopkins University Press.
Copyright © 2003, American Society for Information Science and Technology