What is a dataset? Where can we find raw data? How do we know whether the data is “good” or whether it needs to be cleaned before it can be analyzed? These are some of the questions that we need to answer before looking at examples of how to explore a dataset and transform it from its “raw” form into something that can be used for a Digital Scholarship project.
A quick exercise: write a one-sentence definition of data and compare it with one written by another team member. What elements do your definitions share? Where do they diverge?
The question of "data" is deceptively simple (meaning, it's not simple at all!), so let's skip to something more immediate: How can we think of our own research in terms of structured data? How do we go about converting our objects of study (archives/notes/digital images/PDFs) into forms that allow for computer analysis or visualization? Whether we want to use machine learning to predict patterns or simply want to share maps and digital exhibits online, we need to have some working understanding of our research as data.
Most English-language dictionaries will provide a definition nearly identical to that used in the Oxford Advanced American Dictionary:
"data noun: 1. facts or information, especially when examined and used to find out things or to make decisions".
But what are facts? What is information? Indeed, the very idea of "data" comprises a slippery set of concepts, processes, technologies, and products. Take a look at the Wikipedia entry for data. How does this entry complicate what appears to be a rather straightforward definition?
Take a look at this statement about the etymology of "data" by Daniel Rosenberg ("Data Before the Fact", Raw Data is an Oxymoron):
"Above all, it is crucial to observe that the term 'data' serves a different rhetorical function than do sister terms such as 'facts' and 'evidence.' To put it more precisely, in contrast to these other terms, the semantic function of data is specifically rhetorical".
This statement suggests that even "mechanical" processes, such as creating and cleaning data, are themselves rhetorical and engaged in argument. Even before we reach the stage of visualizing and interpreting our data, we are already engaged in issues of representation and modeling. Confused? Reflect on how we go about asking and answering questions in our research. As we design a research question and carry out the research, we are also choosing the data, sources, voices, and viewpoints that we reasonably believe will best address our question.
Most of the time when we talk about "data" we are really talking about a "dataset". Take a look at the Wikipedia entry for dataset. Very often the datasets that we encounter are tabular (think of a spreadsheet!), with columns that describe different variables (e.g. a person’s name, height, eye color, gender, etc.) and rows that hold individual records (e.g. a row for information about Bob, a row for Alice, etc.). Datasets can also appear as graph data in the form of a tree hierarchy or a network. No matter how complicated or simple, a dataset should always consist of values that belong to predetermined variables. These variables, in turn, can describe categorical, continuous, or discontinuous (also called discrete) data.
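To make the tabular idea concrete, here is a minimal Python sketch of a two-row dataset. Everything in it is an assumption made for illustration: the values for Alice and Bob are invented, and the names `records`, `variable_kinds`, `column`, and `check_rows` are hypothetical, not part of any library. It shows rows as records, columns as variables, and one possible way to note which kind of data each variable holds.

```python
# A tiny tabular dataset: each dict is one row (a record),
# and each key is one column (a variable).
# All values are invented for illustration only.
records = [
    {"name": "Alice", "height_cm": 165.5, "eye_color": "brown", "siblings": 2},
    {"name": "Bob", "height_cm": 180.0, "eye_color": "blue", "siblings": 0},
]

# Our own informal labels for the kind of data in each variable
# (an assumption for this sketch, not a standard schema).
variable_kinds = {
    "name": "categorical",        # a label, not a quantity
    "height_cm": "continuous",    # can take any value within a range
    "eye_color": "categorical",   # one of a set of named groups
    "siblings": "discontinuous",  # whole-number counts only
}


def column(rows, variable):
    """Return every value recorded for one variable, in row order."""
    return [row[variable] for row in rows]


def check_rows(rows, variables):
    """Return True if every row has exactly the predetermined variables."""
    return all(set(row) == set(variables) for row in rows)


print(column(records, "eye_color"))
print(check_rows(records, variable_kinds))
```

Reading down a column with `column(records, "eye_color")` gathers one variable across all records, while `check_rows` tests the rule that every value must belong to a predetermined variable. A row with a missing or unexpected key would fail that check, which is one simple sign that a dataset needs cleaning.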
A quick exercise: think of some datasets that you have encountered before, for a class, a project, or while keeping track of your own personal information. What forms did these datasets take? How were they organized?
Another quick exercise: what in your own research relies on grouping objects of study into meaningful categories? Which parts of it might involve quantitative data, whether discontinuous or continuous?