From Book Classification to Knowledge Organization: Improving Internet Resource Description and Discovery

by Diane Vizine-Goetz

Classification experts and librarians have long recognized the potential of library classification systems to improve resource description and discovery, yet it has taken the growth in Internet content to provide the motivation and opportunities for more fully exploring the information organizing capabilities of these schemes. One search system that uses a library classification system to facilitate access to Internet resources is OCLC's NetFirst database. NetFirst is an Internet index intended for libraries and library users. Professional editors create NetFirst metadata records (resource descriptions) which include informative abstracts, Dewey Decimal Classification (DDC) numbers and Library of Congress Subject Headings (LCSH). An example of a NetFirst record is shown in Figure 1.

Each of the elements shown in Figure 1 contributes toward effective retrieval. For instance, a domain element, such as gov, can be combined with a keyword element, such as travel, to retrieve all NetFirst records describing travel-related government sites. It is the inclusion of one or more DDC numbers in NetFirst records, however, that supports the browsing and filtering capabilities unique to the NetFirst interface. Using the hierarchical structure of the DDC and revamped Dewey captions, the NetFirst browse capability allows a user to select from subject categories (such as health, home, technology), topics (such as health and medicine) and subtopics (such as health, preventive medicine) to view records grouped by DDC numbers (Figure 2).

With just three clicks of the mouse, a set of records numbering over 16,000 is reduced to a more manageable set of some 370 records (Figure 3). Searches can be refined further by combining one or more terms with DDC categories. For instance, a NetFirst user looking for resources on health concerns for travelers can browse to the second-level topic, health and medicine, and then search within that topic area. This is accomplished by entering search terms in the box below the category list (e.g., travel or tourism) and clicking on the search button. Browsing and filtering the database records in this way (using the structure of DDC, but not its class numbers) enables users to retrieve relevant items that may not be as easily discovered using traditional keyword searching capabilities. In this case, a keyword search for health and (travel or tourism) retrieves 185 items; a similar search filtered by DDC topic area retrieves 47 items (Figure 4).
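The browse-then-search behavior described above can be sketched in a few lines of Python. This is only an illustration, not NetFirst's implementation: it assumes that each record carries a DDC number and that a browsed category corresponds to a notational prefix (for example, "61" for the 610s). The Record class, the function names and the three sample records are all invented for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    ddc: str       # DDC number assigned by an editor (illustrative values below)
    abstract: str


def in_topic(record, ddc_prefix):
    """True if the record's DDC number falls under the browsed category."""
    return record.ddc.startswith(ddc_prefix)


def matches_any(record, terms):
    """True if any search term occurs in the title or abstract (substring match)."""
    text = f"{record.title} {record.abstract}".lower()
    return any(term.lower() in text for term in terms)


def browse_and_search(records, ddc_prefix, terms):
    """Keep records inside the browsed category that also match a term."""
    return [r for r in records if in_topic(r, ddc_prefix) and matches_any(r, terms)]


# Hypothetical records, invented for illustration only.
records = [
    Record("Health advice for travelers", "613.68", "Immunization and food safety abroad."),
    Record("Airline fare finder", "387.7", "Compare fares for business travel and tourism."),
    Record("Nutrition basics", "613.2", "Dietary guidelines for adults."),
]

# Browse to the 610s (prefix "61"), then search within that topic area.
hits = browse_and_search(records, "61", ["travel", "tourism"])
print([r.title for r in hits])
```

In this sketch the category filter removes the airline record even though it matches both search terms, which is the narrowing effect the DDC topic filter has on the keyword result set.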

In addition to NetFirst, several other World Wide Web information services are using the DDC to organize electronic resources.

Adapting the DDC for Electronic Resources

Edition 21 of the DDC has undergone many improvements that make it more suitable than previous editions for searching and browsing Internet resources. For example, the Dewey knowledge base has been expanded by the addition of many new classes and extensions of existing ones. Several examples can be found in the topic area computers and computer networking: new classes have been added for Client-server computing (004.36), Internet (004.678), Visual programming (005.118) and Neural nets (006.32), among others. Additional computer science topics can be accommodated by class structure extensions that include subdivision by operating system, user interface and mode of processing. Another area of improvement in DDC 21 is the updating of terminology throughout the edition. Changes were made to DDC descriptions to reflect currency, international usage and sensitivity to the preferred usage of social and national groups. These enhancements make the DDC easier to apply and support new uses of the underlying database. For a more complete description of recent adaptations in the Dewey Decimal Classification, see Mitchell (1997).

Beyond these editorial improvements, additional steps are being taken by OCLC to make the DDC even more versatile and applicable to electronic resources. These include a Web-accessible service of OCLC Forest Press and three complementary Office of Research projects. The Forest Press service called Hot Classification Topics provides up-to-date information on new and changed DDC classes, suggested numbers for emerging subject areas and newly approved Library of Congress Subject Headings (LCSH) paired with candidate DDC numbers. This service is accessible at

http://www.oclc.org/oclc/fp/ddc/hottopic/hottoc.htm

The DDC-oriented research projects are Scorpion, Dewey ETC Trees and WordSmith.

These projects interact with one another to explore the potential of the DDC as a knowledge-structuring tool for large collections of electronic documents (especially on the Internet and WWW). The Scorpion system is a research project that explores indexing and cataloging of electronic resources, with emphasis on building tools for automatic subject recognition using schemes like the DDC (see article by Shafer in this issue of the Bulletin of the American Society for Information Science). Dewey ETC Trees and WordSmith are concerned with expanding the Dewey knowledge base and enhancing the vocabulary and terminology of the DDC to make the DDC even more useful in systems like Scorpion. The goal of the Dewey ETC Trees project is to link subject-access systems like LCSH and the Library of Congress Classification with the DDC. Its scope is narrower than that of WordSmith, which is focused on building a set of natural language parsing tools that can be used in concert with other OCLC research projects. In this article we will focus on WordSmith tools that are used to enhance the DDC. The synergy among the projects will be illustrated by the examples given in the next section.

Enhancing the DDC

A combination of techniques is being employed within the Dewey ETC Trees project to link Dewey with LCSH. One approach matches Dewey relative index terms to authorized heading fields in LCSH. For example, the Dewey topic Tourism (338.4791) can be mapped to the LC heading Tourist trade as shown in Figure 5. Through these mappings, we are able to associate synonyms and other variants from LCSH (MARC tags 4xx) with a given Dewey topic, thereby expanding the vocabulary and terminology associated with the topic. The variants from LCSH are then added to their corresponding Dewey records in experimental versions of Scorpion Dewey databases. Preliminary experiments indicate that this technique improves the ability of the Scorpion system to automatically assign subjects (classes) to electronic documents.
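The matching approach described above can be sketched as a lookup from Dewey index terms to LCSH authorized headings, followed by a merge of the 4xx variants. This is a minimal sketch rather than the project's actual procedure. The data structures, the function names and the sample index entries and variants are assumptions made for illustration and are not taken from the DDC relative index or the LC authority file.

```python
def normalize(term):
    """Lowercase and collapse whitespace so matching ignores trivial differences."""
    return " ".join(term.lower().split())


def link_dewey_to_lcsh(dewey_index, lcsh_records):
    """Add LCSH variant terms to each Dewey class whose index term
    matches an LCSH authorized heading."""
    by_heading = {normalize(r["heading"]): r for r in lcsh_records}
    expanded = {}
    for ddc_number, index_terms in dewey_index.items():
        vocabulary = set(index_terms)
        for term in index_terms:
            record = by_heading.get(normalize(term))
            if record is not None:
                vocabulary.update(record["variants"])
        expanded[ddc_number] = sorted(vocabulary)
    return expanded


# Hypothetical data: the index entries and variants are invented for illustration.
dewey_index = {"338.4791": ["Tourism", "Tourist trade"]}
lcsh_records = [
    {"heading": "Tourist trade", "variants": ["Tourism", "Travel industry"]},
]

expanded = link_dewey_to_lcsh(dewey_index, lcsh_records)
print(expanded)
```

Exact matching after normalization is the simplest possible rule. A real mapping would likely need to handle subdivisions, qualifiers and near matches, none of which this sketch attempts.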

In addition to mapping, we are exploring the use of the Scorpion Dewey databases themselves to associate LC headings with the DDC. This technique involves generating HTML-tagged versions of LC Subject Authority records and submitting them to a Scorpion Dewey database for categorization. Again using the LC heading Tourist trade as an example, the results presented in Figure 6A show that Scorpion assigns DDC class 338.4791 as the most highly ranked Dewey subject for this LC heading. The field labeled Raw Text of Document contains the terms derived from the authority record. In this case MARC fields 150 (Topical terms), 450 (Topical See From Tracings) and 550 (Topical See Also From Tracings) were included in the classified document (see Table 1, Record A). We are currently experimenting with various combinations of MARC fields to determine the optimal combination(s) for our purposes.
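The step of turning an authority record into an HTML-tagged document can be sketched as follows. The article does not specify the markup that Scorpion expects, so the HTML layout here is an assumption, as are the function name, the (tag, text) representation of a record and the sample field values. Only the choice of MARC fields 150, 450 and 550 comes from the text above.

```python
from html import escape

# MARC authority fields included in the classified document (from the text).
INCLUDED_TAGS = ("150", "450", "550")


def authority_to_html(fields, included=INCLUDED_TAGS):
    """Build a small HTML document from (tag, text) pairs of an authority record.

    The heading (field 150) becomes the title; every included field becomes
    one paragraph. The layout is hypothetical, not Scorpion's input format.
    """
    heading = next((text for tag, text in fields if tag == "150"), "")
    terms = [text for tag, text in fields if tag in included]
    body = "\n".join(f"<p>{escape(term)}</p>" for term in terms)
    return (
        f"<html><head><title>{escape(heading)}</title></head>\n"
        f"<body>\n{body}\n</body></html>"
    )


# Hypothetical record: field values are invented for illustration.
record = [
    ("150", "Tourist trade"),
    ("450", "Tourism"),
    ("550", "Travel"),
    ("680", "A note field that is left out of the document"),
]

document = authority_to_html(record)
print(document)
```

Passing a different tuple as the included argument is one way to try the alternative field combinations mentioned above.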

When neither mapping nor Scorpion classification produces satisfactory results, as is the case for the LC heading Tourist information centers (see Figure 6B, "Scorpion Results," and Table 1, Record B), we turn to the WordSmith team for assistance. They have demonstrated that it is possible to extract new and emerging concepts and terminology from unrestricted text and link them to the DDC. The following terms were recently extracted from Web-accessible newspaper texts: alternative rock band, anonymous ftp site, artificial life, baby-boomer parents, chat rooms, frames-capable browser, high-performance computing and virtual mall. None of these concepts are found in the current version of the DDC, but through the adaptation of statistical techniques developed by computational linguists, WordSmith processes have been developed to automatically mine such topics from free text and classify them according to Dewey. See Godby (1996) for a more detailed description of these techniques.

The concepts extracted from free text and their candidate classifications can be reviewed by the DDC editorial staff for possible inclusion in future DDC editions, or they can be used to provide end-user vocabulary that enables electronic versions of the DDC to be customized to a particular database. Although the major goal of this work is to provide additional indexing vocabulary for the DDC, it can also supplement the effort to map the LCSH to the DDC. Part of the process of associating a concept such as virtual mall with the DDC involves constructing a context comprising the words that are highly associated with the term: business, commercial, marketplace, net, networking, shopping, Web...

When this context is used as a query to Scorpion, the top five categories retrieved reveal the two most prominent facets of a concept that refers to shopping on the Internet:
004.6      Interfacing and communications
004.678    Internet
025.04     Information storage and retrieval
380.1      Commerce (trade)
381.1      Retail trade

By applying similar procedures to LC headings that are otherwise difficult to associate with the DDC, such as Tourist information centers, we are able to effect additional links between LCSH and Dewey.
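Using a context as a query, as described above, can be illustrated with a simple term-overlap ranking. This sketch does not reproduce Scorpion's ranking method, which is not described in this article. It only shows the general idea of scoring each Dewey class by how many context words appear in the vocabulary attached to it. The captions come from the list above, while the extra vocabulary after each caption and the unrelated class 641.5 are invented for illustration.

```python
from collections import Counter


def tokenize(text):
    """Split on whitespace, strip simple punctuation and lowercase."""
    words = (w.strip(".,()").lower() for w in text.split())
    return [w for w in words if w]


def rank_classes(context_terms, class_vocab, top_n=5):
    """Rank DDC numbers by the number of context terms found in their vocabulary."""
    query = Counter(term.lower() for term in context_terms)
    scored = []
    for number, text in class_vocab.items():
        words = set(tokenize(text))
        score = sum(count for term, count in query.items() if term in words)
        if score:
            scored.append((score, number))
    # Highest score first; ties are broken by class number.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [number for _, number in scored[:top_n]]


# Context words for "virtual mall" as given in the text.
context = ["business", "commercial", "marketplace", "net",
           "networking", "shopping", "Web"]

# Captions from the text, followed by hypothetical added vocabulary.
class_vocab = {
    "004.678": "Internet net web networking",
    "380.1": "Commerce (trade) business commercial",
    "381.1": "Retail trade shopping marketplace",
    "641.5": "Cooking recipes",
}

ranking = rank_classes(context, class_vocab)
print(ranking)
```

Even in this toy form, the ranked classes fall into a computing group and a commerce group, which mirrors the two facets noted for a concept that refers to shopping on the Internet.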

Summary

In the Dewey ETC Trees and WordSmith projects we are demonstrating how the DDC knowledge base can be enhanced with supplemental concepts and indexing vocabulary. Mapped headings often represent current and popular topics not represented by existing captions or Dewey index terms but within the scope of Dewey's structure. We further show how terminology imported from free text can form a bridge between prevailing language usage and the Dewey editorial process. The new associations (from free text and controlled indexing systems) can be used in turn to enhance Scorpion Dewey databases capable of automatically generating NetFirst-like metadata. Through these complementary efforts we are trying to explore the true potential of a library classification scheme to organize electronic resources.

References

Godby, C. J. (1996). "Enhancing the indexing vocabulary of the Dewey Decimal Classification." Annual Review of OCLC Research 1996. Accessible at http://www.oclc.org/oclc/research/publications/reviews96/vocabulary.htm
Mitchell, J. S. (1997). "DDC 21: an Introduction." In Dewey Decimal Classification: Edition 21 and International Perspectives. Albany, NY: Forest Press, 3-15.
Examples used in this article were previously published in the OCLC Newsletter March/April 1997 No. 226. Search results have been updated to reflect the content of the NetFirst database as it appeared during August 1997.


Diane Vizine-Goetz is senior research scientist in the Office of Research, OCLC Online Computer Library Center, 6565 Frantz Road, Dublin, OH 43017-3395. She can be reached by phone at 614/764-6084 or by e-mail at vizine@oclc.org