Automating born-digital archival description

by Shelly Black

Many digital humanities and grant-funded projects have applied machine learning techniques to analyze the historical record and reveal new insights. These efforts often involve many collaborators and large collections. Can special collections and archives use the same tools to improve description, and consequently access, on a smaller scale and in a sustainable and ethical manner?

Corpora of born-digital archival materials, such as a collection of email or an image of a hard drive, are not typically appraised or described at the item level. Donors may or may not provide details about the contents of the media they transfer. Fortunately, textual collections are only steps away from being machine-readable. Valuable descriptive metadata can be produced with word embeddings, topic modeling, named entity recognition, or other natural language processing techniques. The results would help us better describe born-digital materials in finding aids, benefiting researchers. Tools designed specifically for these purposes include BitCurator NLP and ArchExtract.
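As a minimal sketch of what this could look like, the snippet below uses spaCy, one of several NLP libraries that could fill this role, to pull named entities out of text recovered from a disk image, yielding candidate people, organizations, places, and dates for a finding aid. The model name and file path are illustrative, not a prescribed workflow.

```python
# Minimal sketch: surface candidate names, organizations, places, and
# dates from recovered text with spaCy's named entity recognition.
# The model and file path are illustrative, not a prescribed workflow.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline

with open("extracted_text/letter_001.txt", encoding="utf-8") as f:
    doc = nlp(f.read())

# Tally the entity types most likely to be useful as descriptive metadata.
entities = Counter(
    (ent.text, ent.label_)
    for ent in doc.ents
    if ent.label_ in {"PERSON", "ORG", "GPE", "DATE"}
)

for (text, label), count in entities.most_common(10):
    print(f"{label:<7} {count:>3}  {text}")
```

The most frequent entities could then be reviewed by a processor before anything is added to a finding aid.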

I would like to explore integrating similar tools into our born-digital processing workflows at NC State Libraries. We currently use a homegrown application that standardizes this work and makes it easier for both staff and student workers. This web-based software provides commands that can be copied and pasted into the terminal to run virus scans and forensic reports, as well as to create preservation metadata. The next version of this application could support processors of all technical proficiencies in generating descriptive metadata through natural language processing scripts. Perhaps in the future it could also assist us in running an image recognition algorithm on visual corpora.
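I won't reproduce our application here, but a hypothetical sketch of the pattern might look like the following: a function that assembles copy-and-pasteable shell commands for common steps, here a ClamAV virus scan over files exported from a disk image and a bulk_extractor report over the image itself. The function name, paths, and directory layout are invented for illustration.

```python
# Hypothetical sketch of a command generator in the spirit of the
# application described above. It only builds strings for a processor
# to copy into the terminal; names and paths are invented.
import shlex
from pathlib import Path

def build_commands(disk_image: str, exported_files: str, out_dir: str) -> list[str]:
    out = Path(out_dir)
    return [
        # Recursive ClamAV scan of files exported from the image,
        # logged so the result can feed preservation metadata.
        f"clamscan -r --log={shlex.quote(str(out / 'virus_scan.log'))} "
        f"{shlex.quote(exported_files)}",
        # bulk_extractor forensic report (email addresses, numbers, etc.)
        # run against the disk image itself.
        f"bulk_extractor -o {shlex.quote(str(out / 'bulk_extractor'))} "
        f"{shlex.quote(disk_image)}",
    ]

for cmd in build_commands("accessions/ua012.img", "accessions/ua012_files", "reports/ua012"):
    print(cmd)
```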

While machine learning brings efficiency to our workflows, the IDEA Institute on Artificial Intelligence has deepened my understanding of the subjective human decisions and labor still required. I was familiar with the ways algorithms can exacerbate biases, such as through problematic training data, but I had not been aware of all the parameters that can be adjusted in probabilistic models, or of how different algorithms yield varying results. For instance, different topic models can produce different bags or clusters of words, and it is still humans who assign meaning to them.
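To make that concrete, here is a toy example using Gensim's LDA implementation; the four-document corpus is invented. Changing only the number of topics and the random seed produces different word clusters from the same texts, and a person still has to judge whether a cluster means anything.

```python
# Toy illustration: the same corpus, modeled with different parameters,
# yields different word clusters. The corpus is invented.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [
    ["letter", "donor", "university", "archives"],
    ["email", "server", "backup", "archives"],
    ["photo", "negative", "scan", "university"],
    ["email", "donor", "letter", "scan"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for num_topics, seed in [(2, 1), (3, 7)]:
    lda = LdaModel(corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=seed)
    print(f"--- num_topics={num_topics}, random_state={seed} ---")
    for topic_id, words in lda.print_topics():
        print(topic_id, words)
```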

Not everything can or should be automated, for the above reasons but, more importantly, because of ethical concerns. Private information subject to FERPA or other records laws, or other sensitive content not detected by forensic tools like Bulk Extractor, could still remain in collections and in the metadata we extract. We need to continue ensuring that the privacy of donors, creators, and third parties is not put at risk during machine learning-powered description.
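One modest safeguard, sketched below with illustrative patterns, is to screen extracted text or metadata for strings that resemble common identifiers before anything is published. This complements, rather than replaces, a forensic tool like Bulk Extractor and human review.

```python
# Minimal sketch: flag strings that resemble common identifiers before
# extracted metadata is published. The patterns are illustrative and
# no substitute for forensic tools or human review.
import re

PATTERNS = {
    "possible SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "possible email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "possible phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def flag_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, match) pairs for anything that looks sensitive."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits.extend((label, match) for match in pattern.findall(text))
    return hits

sample = "Contact jdoe@example.edu or 919-555-0123 about record 123-45-6789."
for label, match in flag_pii(sample):
    print(f"{label}: {match}")
```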

Adding machine learning to existing ecosystems in cultural heritage institutions is exciting. I hope it can help alleviate backlogs of born-digital material and live up to its potential to increase the discoverability of collections that are underused or that come from underrepresented communities. At the same time, the more acquainted I become with these tools, the more cognizant I am of the need to plan for their maintenance and for the human work behind the scenes.