B U L L E T I N
Selected Abstracts from JASIST
Editor’s note: We invite JASIST authors to submit structured abstracts of their articles for possible inclusion in the Bulletin, particularly those that might be of interest to practitioners. ASIST would welcome reader feedback on the usefulness of this (or any other) Bulletin feature (email@example.com).
From JASIST, v. 54 (1)
Navarro, G., Baeza-Yates, R. & Arcoverde, J.M. (2003). A flexible approximate matching tool for searching proper names, pp. 3-15.
Study and Results: We present the architecture and algorithms behind Matchsimile, an approximate string-matching tool especially designed for extracting person and company names from large texts. Part of a larger information extraction environment, this engine finds all the occurrences of a large set of proper names in a large text collection. Beyond the similarity search capabilities applied at the intraword level, the tool considers a set of specific person name formation rules at the word level, such as combination, abbreviation, duplication detection, ordering, word omission and insertion, among others. This engine is used in a successful commercial application to search for lawyers' names in official law publications.
What’s New? At the character level, Matchsimile searches the text for all the names in parallel, triggering the approximate occurrences of words from the different names sought. This technique is appropriate for several other applications, such as information filtering, virus detection, spelling correction and natural language processing. At the word level, Matchsimile detects text passages that contain sufficient occurrences of words from a single name and analyzes them one by one using several name formation rules specific to Brazilian naming culture. The new algorithms developed to compare phrases are likely to be adaptable to other applications where different wordings of the same phrase are to be found.
Limitations: Only whole-word matching is possible. This does not permit recovering from errors where two words are joined or a word is split, nor application of the method to sequences where there are not clearly separated “words” (e.g. DNA text). The structure to hold the patterns must fit in main memory.
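The abstract does not spell out Matchsimile's algorithms, but the character-level idea it describes can be illustrated with a minimal sketch. The function names and the one-error threshold below are illustrative assumptions, using plain Levenshtein edit distance to flag words that approximately match a sought name:

```python
# A minimal sketch of intra-word approximate matching, the kind of
# character-level comparison the abstract describes. Matchsimile's actual
# multi-pattern algorithms are more sophisticated; this simply flags
# words within a small edit distance of a sought name.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def approx_occurrences(name: str, words: list[str],
                       max_errors: int = 1) -> list[str]:
    """Return words matching `name` within `max_errors` edits."""
    return [w for w in words
            if edit_distance(name.lower(), w.lower()) <= max_errors]

print(approx_occurrences("Silva", ["Silvia", "Souza", "silva", "Sylva"]))
# → ['Silvia', 'silva', 'Sylva']
```

A production tool would search all names in parallel over the text rather than testing words one at a time, which is exactly the efficiency gain the abstract claims.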
Cole, C. & Leide, J.E. (2003). Using the user’s mental model to guide the integration of information space into information need, pp. 39-46.
Study and Results: The study examined 38 history undergraduates using two history databases to seek information for a course essay. Half received a high recall search strategy intervention (with visualization) and half received a high precision search strategy intervention (with no visualization). There was no significant difference in mean mark for the essays between the two groups.
What’s New? The study took place in a naturalistic setting with a real information need, compared to the imposed information need in most IR system user studies. The study shows that a high recall search strategy can be given to students without adverse effects if the results of the search are visualized. The study suggests that users’ definitions of information need may be an additional and key stage in obtaining a focus for their essays, which is Stage 4 of Kuhlthau’s six-stage ISP model. This stage may be an important point of entry for creating enabling interactive IR systems. We include a design for a two-part IR interface which links the users’ mental models of their essay topic spaces with a system representation of the topic spaces.
Limitations: A high recall search strategy (without visualization) intervention group should have been included so that the effects on student performance of a high recall search strategy and visualization could have been separately determined.
Van den Besselaar, P. (2003). Empirical evidence of self-organization? pp. 87-90.
Study and Results: Words (title words, abstract words or full text) are often used as data in information science research. Frequency distributions of word (co)occurrences are generally very skewed, and the resulting data matrices are sparsely filled. Consequently these matrices contain many zeros and ones. The latter means that many co-occurrences are unique observations, which implies that the statistical methods for analyzing the data have to be selected even more carefully than usual. The paper shows how a lack of methodological scrutiny may result in interpretations that are completely false.
What’s New? Using discriminant analysis (and, more generally, cluster techniques) for analyzing word co-occurrences is very risky, as the data violate the conditions of this technique. Although discriminant analysis is considered to be robust with respect to violation of these conditions, we showed that this is not the case for bibliometric data.
Limitations: As it stands the paper only reanalyzes one study. A more formal proof of the problems may be useful to generalize the results.
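The sparsity the abstract warns about is easy to demonstrate. The toy documents below are invented for illustration (they are not the paper's data); counting unordered word pairs shows that most possible co-occurrences never happen, and most observed ones occur exactly once:

```python
# A toy illustration of why word co-occurrence matrices are sparsely
# filled: most word pairs never co-occur, and many of those that do
# co-occur only once, making them unique observations.
from collections import Counter
from itertools import combinations

docs = [
    ["information", "retrieval", "evaluation"],
    ["information", "science", "bibliometrics"],
    ["citation", "analysis", "bibliometrics"],
    ["retrieval", "model", "evaluation"],
]

pairs = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc)), 2):
        pairs[(a, b)] += 1

vocab = sorted({w for doc in docs for w in doc})
n_cells = len(vocab) * (len(vocab) - 1) // 2   # possible unordered pairs
n_ones = sum(1 for c in pairs.values() if c == 1)

print(f"{len(vocab)} words, {n_cells} possible pairs, "
      f"{len(pairs)} observed, {n_ones} seen exactly once")
# → 8 words, 28 possible pairs, 11 observed, 10 seen exactly once
```

With real corpora the imbalance is far more extreme, which is why techniques such as discriminant analysis, whose assumptions these data violate, can mislead.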
From JASIST, v. 54 (2)
Newby, G.B., Greenberg, J. & Jones, P. (2003). Open source software development and Lotka’s Law: Bibliometric patterns in programming, pp. 169-178.
Study and Results: This research applies Lotka's Law to metadata on open source software development. Lotka's Law predicts the proportion of authors at different levels of productivity. Open source software development harnesses the creativity of thousands of programmers worldwide, is important to the progress of the Internet and many other computing environments, and yet has not been widely researched. We examine metadata from the Linux Software Map (LSM), which documents many open source projects, and Sourceforge, one of the largest resources for open source developers. Authoring patterns found are comparable to prior studies of Lotka's Law for scientific and scholarly publishing. Lotka's Law was found to be effective in understanding software development productivity patterns and to offer promise in predicting aggregate behavior of open source developers.
What’s New? Open source software development has seldom been studied. Furthermore, many empirical applications of Lotka's Law have had methodological problems. In particular, this research points out a limitation in Pao's method.
Limitations: Open source software changes quickly. Thus, there have been new developments and other studies since this work was completed.
Belefant-Miller, H. & King, D.W. (2003). Profile of faculty reading and information-use behaviors on the cusp of the electronic age, pp. 179-181.
Study and Results: While the majority of a university faculty's work deals with information or knowledge – finding, getting, reading and using it – basic data in the literature on these information-use behaviors are often fragmentary, when available at all. We analyzed the demographic portion of a library use survey given to the faculty and staff of the University of Tennessee at Knoxville in 1993 in order to profile their information-related activities. Journal articles were the predominant type of document that faculty both read and authored. Faculty averaged 4.2 journal subscriptions per person, of which 84% were paid for personally. Twenty-five percent of the faculty had obtained some funds for information products, and of those funded the median amount provided was $500. Faculty spent 24 minutes per day using e-mail and 78 minutes per week using the non-e-mail computer network. Faculty reported publishing 3.0 journal articles per year, and 31% of the faculty had won an award for professional contributions in the previous two years.
What’s New? This study provides a baseline for comparisons with behaviors from the paper era and for future studies in the electronic era, because it is drawn from a time when both paper and electronic resources were accepted and available.
Limitations: The sample is limited to University of Tennessee faculty and staff.
Copyright © 2003, American Society for Information Science and Technology