Technical Description of Wrangling the Data (WIP)


I think anyone trying to understand the DoseNet code base, or even our data, will appreciate the tools I developed while analyzing that data. Much as we would like our data to be clean and neatly organized, that is pretty much never the case with real-world sensors; a variety of factors often lead to poor data or data loss. Keeping that to a minimum is a challenge in itself, but we data scientists can also do our part to extract as much information as possible from the data we have. Let me explain the pipeline I used to analyze DoseNet data last summer. Note that I am still learning too: looking back six months later at what I did, I notice many things I could have done differently, which may or may not have had a significant impact on my conclusions. I will discuss those things as they come up.

Getting Data from the DoseNet Server

This is the easiest part of the pipeline. The data can be downloaded simply by navigating to the downloads page, selecting a location, and then clicking on the links that helpfully say “Click here to download”. You could automate this with Python and selenium, although I did not: web scraping intimidated me at the time, and since there are not too many files, downloading them manually was fine.
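If you do want to automate the downloads, here is a minimal sketch of the selenium approach. The downloads page URL below is a placeholder (substitute the real one), and the rest is my own guess at how such a script might look; only the “Click here to download” link text comes from the actual page.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Placeholder URL -- substitute the actual DoseNet downloads page
DOWNLOADS_URL = "https://example.com/dosenet/downloads"

driver = webdriver.Firefox()
try:
    driver.get(DOWNLOADS_URL)
    # The download links on the page are labeled "Click here to download";
    # collect the URLs they point to
    links = driver.find_elements(By.PARTIAL_LINK_TEXT, "Click here to download")
    csv_urls = [link.get_attribute("href") for link in links]
finally:
    driver.quit()

print(csv_urls)  # each of these can be passed straight to pandas.read_csv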
The data comes in .csv format. For anyone who does not know what that is, CSV stands for “comma-separated values”: a plain-text format for tables in which elements in the same row are separated by commas and rows are separated by newline characters. For example, the table

+---------+-----------+
| Fruit   | Price ($) |
+---------+-----------+
| Apples  |      1.89 |
+---------+-----------+
| Oranges |      3.99 |
+---------+-----------+
| Kiwis   |      2.89 |
+---------+-----------+

will be stored as follows:

Fruit,Price ($)
Apples,1.89
Oranges,3.99
Kiwis,2.89

The best part about having data in the .csv format is that you can read it with a text editor (Notepad, Vim, VS Code, etc.), with Microsoft Excel, or, most importantly, with the pandas library for Python. Once you have the data, it helps to see what you just got your hands on. Let’s use a Jupyter Notebook to explore the data and determine what kind of processing/cleaning it might need.
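As a minimal sketch, assuming the fruit table above is saved as fruits.csv, pandas reads it in one line and gives you a quick overview:

import pandas as pd

# Read the csv into a DataFrame and take a first look at it
df = pd.read_csv("fruits.csv")
print(df.head())      # first few rows
print(df.describe())  # summary statistics for the numeric columns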

Annoying (and interesting) Issues with the Data

The sensors are configured to make a measurement every five minutes and send the data to the DoseNet server. However, the sensors are not synchronized very well; their measurements are usually a few seconds apart. So we need to come up with some way of identifying, for example, which humidity measurement corresponds to a given radiation measurement if we want to calculate the correlation between radiation and humidity.
But one might also wonder whether it even makes sense to look at temperature and radiation measurements every five minutes. In practice it is more informative to look at longer-term patterns in such variables, since the most interesting variations and relationships only show up when we zoom out and look at data collected over longer periods of time.
Both of the aforementioned problems are solved by binning. We create time bins and assign each point to a bin. If multiple points fall in the same bin, we average their readings; if a bin is empty, we store a NaN for its value, deferring error handling to later steps, since at this stage all we want is consistent time bins. When making the bins, we can specify the start and end times of the data as well as the length (in seconds) of each bin. This process is automated in the script multi_bin.py. Binning not only lets us average over any time interval we want (hours, days, or even weeks), but also very conveniently gives us consistent time stamps for all of our sensor measurements, so we know which measurements correspond to each other across sensors. A sketch of the idea follows.
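Here is a minimal sketch of the binning idea in pandas. This is my own illustration, not the actual time_binning.py implementation; the function name bin_by_time is mine, while the deviceTime_unix column name comes from the server data format described in the next section.

import numpy as np
import pandas as pd

def bin_by_time(df, col, start, end, interval):
    """Average `col` over fixed-width time bins; empty bins become NaN."""
    # Bin edges in UNIX seconds, from start to end in steps of `interval`
    edges = np.arange(start, end + interval, interval)
    # Assign each measurement to a bin by its UNIX timestamp
    labels = pd.cut(df["deviceTime_unix"], bins=edges, right=False)
    # Average within bins; observed=False keeps empty bins, whose mean is NaN
    binned = df.groupby(labels, observed=False)[col].mean()
    binned.index = edges[:-1]  # label each bin by its left edge
    return binned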

Binning by Time

This section explains the code in time_binning.py. The script takes in DoseNet data, performs binning as described above, and saves the result to disk. You can run python3 time_binning.py -h to see all the options. It is important to note that the script expects the input data to come in the format provided by the DoseNet server; the rest of this section assumes that the data being binned has exactly that format (i.e., the same column names and data types).
Let’s start with an example. Suppose we have downloaded the data from the DoseNet sensors on the roof of Etcheverry Hall at Cal, and the data is stored in the file ~/Downloads/etch_roof_weather.csv. This csv file contains the following columns: deviceTime_utc, deviceTime_local, deviceTime_unix, temperature, pressure, humidity. More generally, data coming from the DoseNet server always has the columns deviceTime_utc, deviceTime_local, and deviceTime_unix, followed by the sensor readings taken at that time. The first three columns (redundantly) store the time at which the measurement was taken. The script only cares about the deviceTime_unix column, which contains the timestamp in UNIX epoch time, because it is an integer that is easy to work with. (The other two time columns contain human-readable strings, which would need to be parsed into UNIX time with appropriate time zone conversions, so we chose not to use them.)

Let’s say we want to bin the temperature data in this file with an interval of 14 days (1209600 seconds; having to do this conversion by hand is one reason it is probably not a good idea to call time_binning.py outside of a wrapper script like multi_bin.py: what if you goof up the conversion?) and write the binned data to the directory binned_data. The following command would accomplish that:

python3 time_binning.py ~/Downloads/etch_roof_weather.csv -c temperature -s binned_data/ -i 1209600 -l etcheverry
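If you do invoke the script by hand, it is safer to let Python do the day-to-second conversion than to type the number from memory:

python3 -c "print(14 * 24 * 60 * 60)"   # 1209600 seconds in 14 days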

The script takes the location of the raw csv data as a positional argument. This does not necessarily need to be a file stored locally; it can be anything (like a URL) that pandas.read_csv can take as an argument. The -c or --col_name argument specifies the name of the column containing the sensor data to be binned; it must exactly match the column name in the raw csv file. The -s or --save_dir argument specifies the directory where the binned output file should be saved, and the -i or --time_interval argument specifies the length of each bin in seconds. The -l or --location argument specifies the location prefix used to name the saved file. Optionally, you can also provide a --start_time argument to specify the start time of the first bin in the format YYYY-MM-DD HH:mm:ss; it defaults to some time in November of 2015 because that is the earliest time we have data for. The binning is done up to the current date. The binned data is saved to a file named {location}_data_{col_name}_{time_interval}.csv in the directory save_dir, so the example above produces binned_data/etcheverry_data_temperature_1209600.csv.
I have commented the code generously, so feel free to take a look at it here. I would also like to point out that the bins do not track the number of points they contain, but for statistical analyses it would be helpful to record that count as a third column in the generated csv file. It should be quite easy to add, and I would recommend doing so to anyone interested in doing serious statistical analysis of our data; it simply did not occur to me when I was working with the data, so the feature is absent.
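For the record, here is a rough sketch of that extension, building on the bin_by_time sketch above (again with my own names, not code from time_binning.py):

import numpy as np
import pandas as pd

def bin_by_time_with_counts(df, col, start, end, interval):
    """Like bin_by_time above, but also record the number of raw points per bin."""
    edges = np.arange(start, end + interval, interval)
    labels = pd.cut(df["deviceTime_unix"], bins=edges, right=False)
    # Aggregate each bin twice: mean of the readings and the point count
    binned = df.groupby(labels, observed=False)[col].agg(["mean", "count"])
    binned.index = edges[:-1]
    binned.columns = [col, "n_points"]  # n_points would be the extra csv column
    return binned

Writing the result with binned.to_csv(...) would then produce the time, value, and point-count columns, letting later analyses weight each bin by how much raw data it actually contains.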