Big Data Management and Digital Privacy

Greasing the Wheels of Biological Big Data Analysis

High-throughput technologies in biological research such as next-generation sequencing, proteomics, metabolomics and phenomics have created an avalanche of data with the typical challenges of big data such as volume, variety and veracity. While bioinformatics and computational biology have existed for at least 50 years, training in these areas has generally been limited to scientists who intend to specialize in these fields. Thus, many scientists who are able to design, execute and interpret the results of biological big data experiments lack the skills to manage, process and analyze the data that they generate. In addition, data sharing mandates from funding agencies have created a wealth of published biological data ripe for reuse by those with the necessary skills.

While attractive, commercial software packages for biological big data analysis such as CLC Genomics Workbench, Geneious, and Ingenuity Pathway Analysis are cost-prohibitive for many institutions. Fortunately, most software required to process and analyze biological big data is freely available from sources such as GitHub, Bioconductor or the Python Package Index. Computer infrastructure can be a significant challenge, but many universities have parallel computing facilities. Alternatively, cloud computing can be purchased on an as-needed basis from vendors such as Amazon Web Services or Microsoft Azure.

As the Molecular Biosciences Information Specialist for the Purdue University Libraries, I have made it a priority to develop strategies to grease the wheels of biological big data analysis. These include developing and teaching a graduate-level bioinformatics course, consulting one-on-one with students and faculty, holding occasional workshops, and serving as an instructor for an NIH-sponsored bioinformatics boot camp. In addition, I maintain close ties with the Purdue University Bioinformatics Core and Research Computing. My talk will provide an overview of these services, which, while focused on bioinformatics, could be extended to other fields such as digital humanities.

Speaker:

Pete E. Pascuzzi is an Assistant Professor with the Purdue University Libraries and also has a courtesy appointment as Assistant Professor of Biochemistry. Pete is the Molecular Biosciences Information Specialist and liaison to the Department of Biochemistry and the Department of Medicinal Chemistry and Molecular Pharmacology. His primary interests are bioinformatics instruction and consultation, and research data management. Pete earned his B.A. in Biology and Chemistry from Washington & Jefferson College and his Ph.D. in Biochemistry from Cornell. He was a postdoctoral research associate at North Carolina State University, where he studied bioinformatics and plant genomics. Since joining Purdue Libraries in 2013, Pete has developed an introductory bioinformatics course for graduate students, held workshops on data resources for cancer biology research, and provided bioinformatics consulting to graduate students and faculty. Pete is an instructor and course developer for an NIH BD2K project, Big Data Training for Translational Omics Research.

Managing Electronic Theses and Dissertations (ETD) Data

Where should students store data after they have completed their Electronic Thesis or Dissertation (ETD)? The MetaArchive Cooperative has created the ETD+ Toolkit as an approach to improving research output management. This session will cover best practices in data curation and digital longevity techniques that will help students and faculty identify and offset risks and threats to their digital research footprints. We will discuss what to do with the data; how to handle copyright, version control, data organization, file formatting and metadata; and where these materials should be stored.

Speaker:

Kayla Siddell serves as the Data Curation Librarian at Indiana State University, where she manages the Institutional Repository, as well as Wabash Valley, Visions, and Voices (A Digital Memory Project). Her research interests include institutional repositories and non-traditional data. She also serves as Chair of Indiana’s Association for Information Science and Technology (ASIS&T) Chapter and represents Indiana State University in the MetaArchive Cooperative.

When Private Meets Public: Data, Ethics, and Librarians

As libraries move into research data management, we have unprecedented opportunities to influence the way that data science grows. How research data management impacts data security and compliance is a ripe area for education, and solutions to these problems can be found embedded in traditional ethics. This session will present an example from agriculture that illustrates the complexity and embryonic nature of this landscape, and will describe current efforts to grow data ethics awareness. While libraries cannot provide solutions on their own, selective partnerships nurtured across campus enable us to contribute.

Speaker:

Amanda Rinehart joined the Ohio State University Libraries in 2014 as OSU’s first Data Management Services Librarian. Amanda came to Ohio State from Illinois State University, where she was the Data Librarian and Head of the Digital and Data Services Department. Previously, she was E-Science Librarian at Brown University, and worked for 11 years at the U.S. Horticultural Research Laboratory of the USDA. She received her MLIS from the University of South Florida, her MS from Michigan State University, and her BA from Kenyon College.