Identifying Related and Same-Work Relationships in Large Digital Libraries
The rapid growth of scanned-work digital libraries presents a new opportunity for learning more about our collections. With digital access to the full text of the books in a collection, content-based text mining methods can be leveraged to uncover the relationships between works, helping to correct inaccurate metadata, suggest classification information, recommend similar works, and label the nature of links between works.
This talk will introduce the Similarities and Duplication in Digital Libraries project (SaDDL), which is identifying same-work relationships among the 17 million works in the HathiTrust Digital Library. SaDDL is identifying exact duplicates as well as traditionally difficult-to-identify relationships such as derivatives, different editions, abridgments, and whole-part relationships. We present the challenges of the problem, our project's approach to meeting them, and a new dataset that lets cataloguers and scholars apply our results.
Peter Organisciak is an Assistant Professor at the University of Denver. His work focuses on improving bibliographic collections through content-based computational analysis of massive digital libraries.
Lindsay Gypin is the Access Services Manager at the University of Denver Libraries. Her research interests include libraries as systems of oppression and improving user access through data analysis.
Margaret (Maggie) Ryan is a Reference Librarian at the National Renewable Energy Laboratory in Golden, Colorado. She intends to continue her career with NREL, focusing on integrating data analysis into reference and ILL services.