Collections as data; ML literacies in libraries

by Rafia Mirza

In the research technology workshop series at my home institution, we discuss not only the technical skills that go into a project but also the ethical considerations. We often start with the conceptual questions (why might you do something a certain way?) before we focus on the how. We do this by examining digital projects and what each project tells you about how it was created: Is it a black box, or does it explain how it was created and by whom? Is the process by which the project was done made clear? Does the project make labor visible through credit and attribution? What can be inferred when a project does not make the labor involved in its creation visible? Is the project explicit about its values and ethics?

I first became interested in applying to the IDEA Institute on Artificial Intelligence because I wanted to see how these questions apply to projects using ‘black box’ AIs, and how that impacts the ways in which libraries and other GLAM institutions adapt ML. Initially I was considering using topic analysis to see what connections might surface when analyzing across our various collections. In our planning sessions at the IDEA Institute we took the opportunity to interrogate what it means to make our collections available for use as data, and how we can best offer instruction and reference around using collections as data. What opportunities for collaboration and information literacy education are available through this process? What does ML/AI mean through a library lens? How can we facilitate analysis, discovery, and use? What conceptual models can information literacies bring to ML? How can we engage in this process as ‘data stewards’ (Collections as Data), centering ‘ethics, transparency, diversity, privacy and inclusion’?

In his LC Labs report, Cordell points out “… that libraries have long been central to conversations about how to provide access to information at scale.” Discussions of how to adapt the use of emerging technologies to our commitments to search and discovery are also part of the library discipline, while discussions of data collection and annotation processes are part of the archival discipline (Jo and Gebru). You could see this GLAM tradition in the way that participants’ project interests broke down roughly into the following ‘traditional’ GLAM categories: Search and Information Services, Education and Research Support, Collections and Management Data, and Metadata.

As we worked on our project plan in the institute, I narrowed my focus to an internal proof-of-concept project applying ML (NLP/NER) to some of our collections data. This pilot will be an opportunity for me to gain practical experience in ML. It will also open internal conversations with our digital collections on how we want to make our collections available as data, and conversations with our information literacy department on how we can incorporate these types of information literacies (data literacy, computational literacy, and ML/AI literacy) into our workshops and reference offerings. This will give us a chance to discuss internally what aligns with our mission and is sustainable for us, while gaining experience in ML literacies. We will use this pilot project to explore the ways in which we will choose to make our collections available, with a commitment to responsible operations (Padilla).

References:

Collections as Data. (2018). The Santa Barbara Statement on Collections as Data. Always Already Computational – Collections as Data. Retrieved July 15, 2022, from https://collectionsasdata.github.io/statement/

Cordell, R. (2020). Machine Learning + Libraries: A Report on the State of the Field. LC Labs, Library of Congress. https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-ML-report.pdf

Jo, E. S., & Gebru, T. (2020). Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 306–316. https://doi.org/10.1145/3351095.3372829

Padilla, T. (2020, August 26). Responsible Operations: Data Science, Machine Learning, and AI in Libraries. OCLC. https://www.oclc.org/research/publications/2019/oclcresearch-responsible-operations-data-science-machine-learning-ai.html