Brown-bag session on how to find, obtain, clean and use data on your beat.
- Check annual/quarterly reports. A police department's annual report, for example, can be a huge clue. How does the department know its average 911 response time? Because it records the time for every call. Similar logic can lead to all sorts of neat datasets.
- Go up the food chain. The National Center for Education Statistics requires data from the states, which require data from the districts, which require data from the schools. This happens in beats we can't even imagine. Find out what federal or state agencies require, and you'll find a local dataset at the end of the rainbow.
- Read the rules. Government is fantastic at creating data. It is not so good at following up. Ask Chris Burbach, who found that a city department was supposed to file a quarterly report about its hiring practices... but wasn't.
- Request copies of blank forms. To find out what data a department or agency has, see what they're collecting from users.
- Chat up a frequent flier. Find somebody whose job requires them to interact with a department frequently, and see what sort of information they have to give up. (E.g., what's it take to pull a building permit? Talk to a builder or electrician.)
- Don't fear the formal request. Most worker bees don't even realize they're filling in a database, much less have the power to give you complete data. A records request is a surefire way to cut through the crap and get your questions before a person who can actually help you. Be stern and cite the law.
- But there's a catch: It's hard to make a request without knowing what data is stored. Try talking to the people who actually work with the data to understand what's there and how it can be used. Requesting information that doesn't exist is a good way to drive yourself to drink.
- A middle ground. In my requests, I admit when I have no idea if the data exist or not and ask for help. I don't know if it works.
- Ask for a CSV. It's a plain-text format that nearly any database or spreadsheet program can export. If they say otherwise, call the vendor. Some agencies will give you PDFs. Try to talk them out of it.
- List the fields you want. Some government databases are hugely complicated, with lots of tables that relate to each other. To get around their weirdness, specify exactly the records you're after. When possible, of course.
- Dealing with known unknowns. If you know there's a database, but you don't know exactly what's in it, you can request what is variously called a "record layout" or "data dictionary" — basically, a table of contents for your data file.
- Use our handy records request generator, which has sample language for Iowa, Nebraska and FOIA requests.
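Once a CSV does arrive, it can help to rough out your own record layout and compare it with whatever the agency sent, or to have something to point at when you call with questions. The following is only a minimal sketch in Python using the standard library. The permit data, the column names and the sketch_layout function are all invented for illustration; they don't come from any real agency file.

```python
import csv
import io


def sketch_layout(csv_text, sample_size=3):
    """Return one dict per column: field name, count of non-blank cells, a few sample values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    layout = {name: {"field": name, "filled": 0, "samples": []}
              for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            # Short rows come back as None, so treat None like a blank cell.
            value = (row[name] or "").strip()
            if not value:
                continue
            info = layout[name]
            info["filled"] += 1
            if value not in info["samples"] and len(info["samples"]) < sample_size:
                info["samples"].append(value)
    return list(layout.values())


# Hypothetical building-permit export, made up for this example.
SAMPLE = """permit_id,issued,contractor,value
1001,2012-03-01,Acme Electric,4500
1002,2012-03-02,,12000
1003,,Acme Electric,
"""

for column in sketch_layout(SAMPLE):
    print(column["field"], column["filled"], column["samples"])
```

A column with far fewer filled-in cells than the others is worth a question to the agency: is the field optional, new, or just ignored by the people doing the typing?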
Don't make any assumptions about your data. Unless your assumptions are that the file is dirty or incomplete or otherwise imperfect, in which case assume away!
Make sure you understand how your data is created. Is it input by humans? Expect typos. By machines? Expect machine dumbness. Understand why your data exists and what it's supposed to do. Motives are an indicator of biases, and knowing them can change your understanding of your data considerably.
Approach data the same way you'd approach an unknown source who wants to meet you in a bad part of town: With caution. Maybe you'll end up with a great story. Or maybe you'll end up with a screwdriver in your neck and an empty wallet. Or — worse — a correction.
Point is, be paranoid. Run sanity checks on your data, then run them again. Sort, filter and Group By to look for outliers. Check data types with spreadsheet functions such as =ISNUMBER() and =ISTEXT().
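If you'd rather script those checks than click through a spreadsheet, here is one way it might look in Python with only the standard library. Treat it as a sketch under assumptions: the parcel data, the column names and the is_number helper are invented for illustration, and a real file will need its own rules for what counts as a valid value.

```python
import csv
import io
from collections import Counter

# Hypothetical assessor data, made up for this example.
SAMPLE = """parcel,county,value
A1,Sarpy,185000
A2,Sarpy,5000000
A3,Douglas,142000
A4,Douglas,n/a
A5,Dougals,160000
"""


def is_number(text):
    """True if float() can parse the text once thousands separators are removed."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Type check: which rows have something other than a number in the value column?
bad_values = [r for r in rows if not is_number(r["value"])]

# Group By: count rows per county. Misspellings tend to show up as tiny groups.
county_counts = Counter(r["county"] for r in rows)

# Sort: biggest values first, so outliers float to the top.
numeric = [r for r in rows if is_number(r["value"])]
by_value = sorted(numeric, key=lambda r: float(r["value"].replace(",", "")), reverse=True)

print("Not numeric:", [r["parcel"] for r in bad_values])
print("Rows per county:", dict(county_counts))
print("Largest value:", by_value[0]["parcel"], by_value[0]["value"])
```

None of this tells you whether a flagged row is a typo or a story. It only tells you which rows to ask about.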
If you find an outrageous value ("There's a home in Sarpy County worth $5 million?!"), check it out. Pick up the phone. Talk to the human who collects or analyzes the data. Sometimes, the outlier you find is a story. More often, it's a typo, or you don't understand the data yet. ("Oh. There's an entire subdivision in Sarpy County worth $5 million that, for tax purposes, is treated as a single parcel.")