With data literacy an increasingly critical skill, dedicated instruction should start early in the educational process. By secondary school, students should understand the distinction between data and information and the scope of the concept, and they should think critically about authorship, quality, source, collection, access, sharing and reuse. Instruction in data literacy requires access to large repositories of digital data, enabling students to develop skills in data description and tagging, prompting critical thinking about trustworthiness, reliability and validity, as well as audiences, temporal attributes, representations and evaluative sources. Data can be the focal point of a variety of classroom projects and integrated across disciplines, and there are many sources of data that can be readily accessed for classroom projects.
Data-Driven Society Begins with Data-Savvy Youth
by Zorana Ercegovac
In this article I define data as recorded representations of anything, including entities, properties, measurements, processes, events, behaviors, relationships and places, for the purposes of interpretation and communication. While data are the underpinnings of scholarly research, the notion of data goes beyond the labs and scholarly work. Data are streaming through wearable gadgets – smartphones and other widespread sensor-equipped collection devices. We are drowning in data, but can we define the notion of data? Do we teach K-12th grade students how to collect, describe, preserve, access, communicate and share data? High school and college students begin to work with some data in science, technology, engineering and mathematics (STEM) projects, but now is the time to develop data literacy programs for secondary schools across all subjects so that all people may one day become data-competent citizens.
The following anecdotal evidence influenced my emerging interest in data literacy. While teaching information literacy in secondary schools, I would ask students to draw a map of words related to the word research. Each student was handed a piece of paper with “RESEARCH” in the middle of that sheet. Then small groups of two to three students would work together to fill that page with all the words and concepts that in some way, in their minds, linked to it. A sample of the students’ associative maps is in Figure 1.
The word research was chosen because many secondary school projects use the word in their titles (for example, “biome research”). Students and teachers would sometimes use the word casually, asking students to look up a certain topic in reference sources or to interpret and summarize features of topographical maps. After examining the returned sheets, I was struck that although many of the students (n=320) could list more than 50 words and phrases related to the word research, none of them noted the word data. In other words, the students made no mental connection between research and data. That finding was my initial motivation to address the notion of data in the context of secondary school programs across all K-12th grade subjects and not only in STEM-related projects.
Also, too often the word data is interchangeably taken to mean information, but these two notions are very different, and they need to be taught as such in secondary school curricula or earlier. The definition of data used above is intentionally broad to serve as a framework for different learning contexts and grade levels.
Diverse Data and Data Collection Techniques
Data is very diverse in content and medium as well as in the ways it is collected and used. An example of where multiple streams converge is at the Cornell Lab of Ornithology (www.birds.cornell.edu) with their mission to “interpret and conserve the earth’s biological diversity through research, education, and citizen science focused on birds.” A notebook from fieldwork might include drawings of birds (visual data representation); another, recordings of their sounds (aural representation). A textual annotation might give dates, places and special circumstances where all these data are collected, while a map (spatial representation) can provide an overall geographic boundary of this collective data gathering.
Another example includes climate data collected by the National Oceanic and Atmospheric Administration’s National Weather Service Forecast Office (www.drought.gov) on an ongoing basis for monitoring general trends. The Drought Monitor collects data from 50 different weather indicators, such as soil moisture, water quality, wildlife and temperature data.
More generally, scientists and funders are now widely and increasingly concerned with big data, the very large and growing databases gathered by instruments, including the now ubiquitous information-sensing mobile devices such as aerial sensors, software logs, cameras, radio-frequency identification (RFID) readers and wireless sensor networks. Big data continues to be traditionally associated with the hard sciences, such as astronomy, physics and the human genome project.
How big is big data? For instance, medical researchers funded by the National Institutes of Health will be able to find disease associations in the three billion base pairs in the human genome or in the estimated 86 billion neurons in the human brain.
We think of big data in terms of the sheer volume and pace of scientific data, but there are also large datasets gathered in the social sciences and the humanities, as well as in everyday life, captured by everyone everywhere and posted on social media. In addition to volume and pace, the notion of data has been related to issues of authorial responsibilities, open access to data and possibilities to share data. Hellerstein discusses some of the challenges such as analysis, capture, curation, search, sharing, storage, transfer, visualization and privacy violations.
Recently, federal funding agencies such as NSF and NIH have moved toward an open public access policy in scholarly communication. These new requirements apply not only to reporting research results but also to submitting the accompanying data that was used to support them. Studies such as those by Wallis et al. and Tenopir et al. shed light on some of the current data-sharing practices in scholarly communication. These studies agree that better scholarly infrastructure is needed to allow for data curation, sustainable access and data sharing and reuse.
Suffice it to say that a variety of data is increasingly being made available for interpretation, sharing and reuse and that we need to start integrating data literacy into learning and teaching early in the educational process. Such a program would illuminate the difference between the commonsense notion of data (personal health data reports, educational aptitudes, GPA scores and so forth) and those used in support of evidence in research studies. It would demonstrate that the latter data are collected rigorously and systematically in order to allow researchers to describe and interpret data and to distinguish between chance and systematic effects. Regardless of the level of sophistication in statistical tools, when they are applied to incorrectly collected data, the study’s results will be invalid. Therefore, everything hinges on data: the purpose for collecting the data; which data were collected, their quantity and source; which extraneous factors were controlled and which conditions applied when the collected data was observed and measured. These are the criteria that form the basis by which we can evaluate the trustworthiness of information that uses data as its underlying evidence.
Data in the Context of Learning and Teaching
Data has been defined “to include any information that supports student inquiry and participation in the scientific method, including experimental or observational data as well as simulated data derived from models” [5, p.4]. According to these authors, students should be able to
- find and access data relevant to the topic they are investigating;
- evaluate the quality of this data;
- use appropriate tools and interfaces to manipulate and render data to answer questions;
- combine multiple and diverse datasets to solve a central problem …;
- generate visualization and representations that communicate interpretations and conclusions; and
- contribute, view and evaluate their own data in the context of larger datasets [5, p. 5].
Another useful definition of data for our purposes is the one that treats data as “the objects that researchers consider to be their source of evidence for a given study” [3, p. 3]. While it is essential for scientists and scholars in general to be data literate, investigators at the University of Illinois at Urbana-Champaign strongly believe that data literacy ought to be extended to the general public. They define data literacy as the ability to discover, analyze and think critically about data encountered in daily life and to become comfortable with questioning sources and accuracy rather than feeling intimidated.
In order to build the capacity for data literate citizens, we need to start at the pre-college level connecting the dots between learning standards and the scholarly communications push for data sharing described above. Data literacy offers opportunities to explore intersections between the existing standards for the 21st-century learners and common core standards in secondary school learning experiences. In fact, the mission of library media programs is to “ensure that students and staff are effective users of ideas and information” [7, p. 6]. Accordingly, students are expected to “create new knowledge,” to “share knowledge and participate ethically and productively” and to “use social networks and information tools to gather and share information.” Data collected and maintained by agencies such as the National Center for Health Statistics, EPA, NOAA and the Census Bureau offer the opportunity to meet that mission.
Data in Lifelong Learning: A Five-Prong Vision for Learning and Teaching
We teach and practice various types of literacies, such as digital, visual, textual or technological, only to gloss over the meaning of the word data in various contexts [7, p. 24]. While the word data is sometimes hidden in the language of variables, data matrices and statistics, we seldom speak explicitly of data literacy in information literacy (IL) standards and programs. For example, throughout the lengthy 11-page table of “Math Crosswalk” between the Common Core Standard and AASL Learning Standard(s) for seventh grade, the word data is mentioned but twice under the heading of Statistics and Probability [8, 9]. The notion of data should be distinguished from those of sources, information, knowledge, facts, valid information and validity and accuracy of all information. We need to bridge the gap between secondary school learners and scholars in order to make the notion of data more seamless, aligned and better understood.
To meet this end, we propose a data literacy program by outlining a five-prong vision to ensure that all learners are empowered to
- understand the notion of data (versus that of information);
- describe data for easier discovery, sharing and re-use;
- think critically about data (rather than about an entire site, starting with acknowledging responsibility in data gathering);
- use data ethically through attribution and proper citation practices; and
- access publicly available datasets.
In the rest of this article I will focus on prongs #2 and #3, the importance of describing data for easier discovery and thinking critically about data. Our earlier discussion gives some idea of prong #1, which includes reviewing literature on the notion of data. Using data responsibly and ethically (prong #4) is directed toward the notion of authorship, especially of a collaborative nature. For more information on this prong, please see Ercegovac 2010. Prong #5 focuses on learning to access publicly accessible, data-rich sources like those previously discussed.
To Be Discoverable, Make Data Visible (Prong #2)
Data are collected massively and posted on social media outlets. Some of these data are unique resources, chronicling political uprisings, spread of epidemics, election updates, pop culture phenomena and other global issues, posted by millions of witnesses, including media. As potential evidence for analysis and understanding of current issues in numerous contexts around the globe, these recorded representations of images, videos, sounds and narratives need to be easily discovered and accessed. To facilitate organization of and access to vast repositories of digital data, we need to develop data stewardship in secondary school learning and teaching curricula.
Most of the large datasets on climate, space, biogenetics and environment are collected and curated by the federal agencies and large scientific teams, referred to earlier as big data. These data collecting sources attach metadata to the data for the purposes of administration, description, preservation, and technical and usage functions. In order to develop a culture for data stewardship programs, we need to engage data collectors, starting with secondary school students, to create metadata.
As an example, several middle school projects were designed to develop students’ appreciation of skills to describe and tag data, in this case certain archival digital photographs from the Online Archive of California. This resource (www.oac.cdlib.org/) was used to introduce the students to the life cycle of primary documents, including visual sources. Students were given the responsibility to design narratives for museum objects and to replicate processes for care of primary sources in their own personal collections (for example, to describe, tag and preserve old photographs of their family members or to catalog school graduation costumes from their own school). These exercises allowed the middle school students to experience important phases in the life cycle of data: from collecting objects through the processes of describing and tagging them, to searching and accessing the data in a given digital data repository. The process of tagging was introduced and applied on a variety of objects with Flickr (www.flickr.com) for tagging visual materials. High school students were introduced to a multipurpose Dublin Core standard data element set.
This image tagging exercise was a part of the six-lesson information literacy course designed for all middle school students. Exposure to data tagging carried over to students’ increased sensitivity to caring for data as evidence, critical thinking about data trustworthiness and ethical use of data in everyday life.
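The describing and tagging steps can be sketched in a few lines of Python. This is only an illustration, not the worksheet used in class: the element names are the 15 elements of the Dublin Core Metadata Element Set, while the helper functions `describe` and `find_by_subject` and both sample records are hypothetical and invented for this example.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def describe(**fields):
    """Build a metadata record, rejecting names that are not Dublin Core elements."""
    unknown = set(fields) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return dict(fields)


def find_by_subject(records, tag):
    """Return the records whose subject tags include the given tag (case-insensitive)."""
    tag = tag.lower()
    return [r for r in records
            if tag in (s.lower() for s in r.get("subject", []))]


# Hypothetical records, invented for illustration only.
family_photo = describe(
    title="Grandmother at the county fair",
    creator="Unknown family photographer",
    subject=["family", "county fair", "portrait"],
    date="1952",
    type="StillImage",
    format="image/jpeg",
)
graduation_gown = describe(
    title="School graduation gown",
    subject=["graduation", "costume"],
    type="PhysicalObject",
)

collection = [family_photo, graduation_gown]
matches = find_by_subject(collection, "Portrait")
print([r["title"] for r in matches])  # ['Grandmother at the county fair']
```

The point of the sketch is the one the exercise makes: an object that has been described and tagged can be found again, while an undescribed one cannot.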
Critical Thinking Starts with Essential Data Characteristics (Prong #3)
While data are plentiful, free and accessible to anyone anywhere with Internet connectivity, we know that not all data are equally trustworthy. This section lists essential characteristics of data that were derived by examining and using various datasets. Whether we use data to track commodity prices, to monitor the Ebola virus or to probe microbial life, radiation or water on Mars, we need assurance that the collected data are reliable and valid. These data characteristics are classroom tested, using the language for the secondary school programs. The characteristics are not ranked in any order; their applicability and sequence are typically determined by classroom usage.
- Collection method
  Collection instrument or triangulation of several observation instruments determines accuracy, validity (measuring what it is intended to measure), reliability and completeness. Data collection instruments vary:
  - Data in big science are typically great in volume, consistent in structure and easier to discover, reuse and share (Wallis). Large datasets are collected and monitored by means of observatories, remote sensing, telescopic cameras and satellites.
  - In “long tail” science, smaller teams or individuals collect data for specific projects. Data are local in character and more difficult to discover and reuse.
  - Data in the social studies and the humanities are collected by means of triangulation of interviews, surveys, ethnographic observations, visual and sound-recorded data and other unobtrusive data collection methods.
  - Data collected by bots, via web crawlers, and by crowdsourcing (for example, Wikidata at wikidata.org).
- Data representation
  - Varies depending on the type of data (geospatial, demographic, environmental, lab specimen, financial and so forth)
  - Representational notation such as numerical, visual, choreographic, aural or textual
  - Data about organic and inorganic matter interpreted by experts (for example, tree rings, metal tools as evidence of indigenous people)
- Data infrastructure
  - Metadata (embedded in or attached to data)
  - Policy, quality assurance and use of disparate data sources
- Classes of users
  - Scientists only (language is technical)
  - Different classes of users, such as scientists, K-12 schools, citizens
  - Anyone having access to the Internet
- Access to data
  - Accessible without having to run sophisticated tools and software applications
  - Needs special tools for access
  - Available in many different formats, such as PDF, text and HTML
- Temporal attributes
  - Periodicity of update (at regular intervals, random, one time, historical, continuous streaming, interactive)
  - Time stamps, alerts, never announced
- Homogeneity of data
  - Numerical data only
  - Numerical, geospatial, demographic and so forth
- Formally evaluated data
  - Evaluated in well-known reference sources, journals
  - Evaluation in wikis, blogs and other social media, based on
    - Entire portal (which in itself may include large datasets)
    - Samples of datasets from a given data source (for example, EPA, CDC, NOAA, NASA, USGS)
      - Stratified (several sites from each hierarchy)
      - Purpose-specific (lesson plan, research project)
    - Contacts and responsiveness to individual queries
    - Training, free webinars, online tutorials
- Subscription plans
  - Free of charge for anyone with Internet access
  - Subscription per use (institution, individual, consortium)
  - Free basic level of data (for educational purposes and training)
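One way to put the characteristics above to work in a classroom is as a worksheet that shows which questions about a dataset remain unanswered. The following is a minimal sketch under stated assumptions: the class `DatasetProfile`, the function `unanswered` and the sample answers are hypothetical and are not part of the classroom materials described here; only the nine field names mirror the characteristics in the list.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasetProfile:
    """One field per characteristic; None means 'not yet examined'."""
    collection_method: Optional[str] = None
    representation: Optional[str] = None
    infrastructure: Optional[str] = None
    classes_of_users: Optional[str] = None
    access: Optional[str] = None
    temporal_attributes: Optional[str] = None
    homogeneity: Optional[str] = None
    formal_evaluation: Optional[str] = None
    subscription_plan: Optional[str] = None


def unanswered(profile):
    """List the characteristics a student has not yet examined."""
    return [f.name for f in fields(profile)
            if getattr(profile, f.name) is None]


# Hypothetical, partly completed worksheet; the answers are illustrative only.
profile = DatasetProfile(
    collection_method="remote sensing and weather stations",
    access="free, no special tools needed",
    temporal_attributes="updated at regular intervals",
)
print(unanswered(profile))
```

Because the characteristics are not ranked, the sketch reports the open questions in list order without scoring them; a teacher would decide which ones matter for a given project.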
Use of Data in Classroom Projects
Personal Data Collection. A powerful way to introduce the notion of data to young learners and to distinguish it from that of information is to begin with commonly observable and wearable wireless devices that collect personal data. An example is the Fitbit wristband that many people now use to track their daily level of fitness. It counts steps, records calories burned and tracks exercise. These devices collect valuable data representing one’s daily activities, and, ultimately, they may be used to inform health professionals in order for them to customize their patients’ treatments for desired outcomes. In other words, the data are valuable to those who know how to analyze and interpret them, but data themselves are not creating new knowledge for us.
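The distinction between data and information can be shown with a short sketch. The step counts and the daily goal below are invented for illustration and do not come from any real device; the function `summarize` is likewise hypothetical. The list of numbers is the data, and the summary a person can act on is the information derived from it.

```python
from statistics import mean

# Hypothetical daily step counts for one week (invented values).
daily_steps = [4200, 8100, 10350, 6900, 12020, 3000, 9800]
GOAL = 10000  # assumed daily goal, chosen only for illustration


def summarize(steps, goal):
    """Turn raw counts (data) into statements a person can act on (information)."""
    return {
        "average_steps": round(mean(steps)),
        "days_meeting_goal": sum(1 for s in steps if s >= goal),
        "most_active_day": steps.index(max(steps)) + 1,  # 1-based day number
    }


summary = summarize(daily_steps, GOAL)
print(summary)
# {'average_steps': 7767, 'days_meeting_goal': 2, 'most_active_day': 5}
```

Even here the summary is not yet knowledge: deciding whether two days out of seven is a concern still requires someone who can interpret it.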
Diverse Data: Collection and Interpretation. As an example of data use that may inspire students, collecting and interpreting diverse data may sometimes lead to big knowledge. Soho in London in the mid-19th century was the site of cholera epidemics that killed hundreds of people. Dr. John Snow investigated the outbreak and discovered a more significant cause-and-effect relationship than he set out to discover. He tabulated data on the daily number of deaths, mapped the places where people died, recorded the dates of death and triangulated all these data with physical examination of patients. He also recorded patients’ personal stories. As a result of all these collected data, Dr. Snow concluded that the cholera epidemics were caused by water pumps supplying contaminated water.
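The kind of tabulation described above can be imitated with a handful of records. The sketch below is an assumption-laden classroom illustration: every record, street name and pump label is invented, and none of it reproduces Dr. Snow's actual data or method. It shows only how counting the same records by date and by place can make a pattern visible.

```python
from collections import Counter

# Invented records for illustration; these are NOT Dr. Snow's actual data.
# Each record: (day of death, street, nearest water pump).
deaths = [
    ("day 1", "Street 1", "Pump A"),
    ("day 1", "Street 1", "Pump A"),
    ("day 2", "Street 2", "Pump A"),
    ("day 2", "Street 3", "Pump B"),
    ("day 3", "Street 1", "Pump A"),
    ("day 3", "Street 4", "Pump C"),
]

# Tabulate the same records two ways: by date and by nearest pump.
by_date = Counter(day for day, _, _ in deaths)
by_pump = Counter(pump for _, _, pump in deaths)

pump, count = by_pump.most_common(1)[0]
print(dict(by_date))  # {'day 1': 2, 'day 2': 2, 'day 3': 2}
print(f"{pump}: {count} of {len(deaths)} deaths")  # Pump A: 4 of 6 deaths
```

Counted by date, the invented deaths look evenly spread; counted by pump, they cluster. A count like this suggests where to look, but as in the historical case it is the triangulation with other evidence that supports a conclusion.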
Accessing Data in Interdisciplinary School Projects. There is plenty of available data everywhere, so students do not have to collect data themselves in the course of their mini-research projects. They can learn to efficiently find desired data; critically examine data sources; read data presented in tables and charts; and cite the sources which are responsible for collecting and presenting data in tables and charts.
However, few school projects use publicly available numerical and statistical data in their assignments to achieve these goals. Instead, many require students to access bibliographic databases via online vendors such as ProQuest, EBSCOhost and JSTOR. Having randomly examined dozens of school library web portals, we still find more bibliographic databases available through pricey annual subscriptions than numerical, visual and statistical datasets, many of which are free of charge. Based on the descriptions of student projects that this author examined between 1998 and 2012, we are still operating in the bibliographic universe.
Under the rubrics for information learning for “Reading Standards for Lit in Science Tech” for 6th through 8th grades, it is recommended that students “make sense of information gathered from diverse sources by identifying misconceptions, conflicting information, and point of view or bias.” Here we have the opportunity to make a leap from data to information and use of publicly accessible datasets. In collaboration with teachers, this author developed several interdisciplinary projects that taught students to harness available data. Physical education and science faculty framed the topics in consultation with library media specialists, taking students’ interests into account. We used reports such as CDC’s Healthy People (www.cdc.gov.nchs/health_people) and EPA’s report America’s Children and the Environment to identify emerging health and environmental topics, starting with nutrition and diet, food safety, environmental health, diabetes and substance abuse. Students used a variety of datasets, including those collected and maintained by CDC’s NCHS, EPA, NOAA and NIH.
This paper argues that there is a place and a need for data literacy to be fully integrated within a broader notion of literacies in learning standards for secondary schools. Work with data in general enhances students’ awareness that data exists and that we need to think critically about data. Authentic problem-solving projects might improve students’ statistical literacy, visual representation skills and their capacity to make use of various distributions of environmental, health and population data. Depending on the grade level, data might be used across subject disciplines from advancing public policy to observing atmospheric and genomic phenomena not visible otherwise.
More than ever before, we need to develop robust data literacy programs as early as in secondary schools to enable all learners to become data-savvy citizens. Embedding the five-prong data literacy vision described in this paper allows learners to gain skills necessary to seamlessly transition into college and graduate work.
Resources Mentioned in the Article
Basken, P. (October 10, 2014). NIH awards $32-million to tackle big data in medicine. The Chronicle of Higher Education. Retrieved from http://chronicle.com/article/NIH-Awards-32-Million-to/149323/
Hellerstein, J. (2008). Parallel programming in the age of big data [blog post]. Gigaom Blog. Retrieved from https://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/
Wallis, J. C., Rolando, E., & Borgman, C. L. (July 23, 2013). If we share data, will anyone use them? Data sharing and reuse in the long tail of science and technology. PLoS ONE, 8(7). doi:10.1371/journal.pone.0067332
Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A. U., Wu, L., Read, E. . . . Frame, M. (June 29, 2011). Data sharing by scientists: Practices and perceptions. PLoS ONE, 6(6). doi:10.1371/journal.pone.0021101
Manduca, C. A., & Mogk, D. W. (2002). Using data in undergraduate science classrooms. Final report on an interdisciplinary workshop held at Carleton College, April 2002. Sponsored by National Science Foundation, Division of Undergraduate Education (Grant NSF-0127298). Retrieved from http://d32ogoqmya1dw8.cloudfront.net/files/usingdata/UsingData.pdf
Twidale, M., Blake, C., & Gant, J. P. (2013). Towards a data literate citizenry. Proceedings of iConference 2013, 247-257. doi:10.9776/13189. Retrieved from http://hdl.handle.net/2142/38385
U.S. Environmental Protection Agency (EPA). (2013). America’s children and the environment (3rd ed.). Retrieved from www.epa.gov/envirohealth/children/keyfindings.html
Zorana Ercegovac is founder of InfoEN and an adjunct faculty member at the University of California, Los Angeles, and Drexel University. She can be reached at zercegov<at>gmail.com