The Problem of Too Much Data

A friend of mine is a jack of all trades. From tree work to auto repair, from farming to light construction, he just does whatever presents itself at the moment, and has become a pillar of the community in doing so. He called this afternoon to talk about burning off our brush pile if it’s not too windy tomorrow, and I got an unexpected lesson in research design.

He’s been chosen to be a test producer for a seed company this year, and he just got twenty packets of free seed, from tomatoes to lettuces to beans. There are seven hundred farmers in this program, which doesn’t sound like a lot, but due to the wonders of multiplication, we’re about to see how bad that can be.

Each farmer got somewhere between 15 and 50 different varieties of seeds in this program. Let’s say the average is 25. That’s 700 farmers with 25 varieties each, or 17,500 test plots around the country. And they’re asking each farmer to post a photograph every week, with a meter-stick in the photo, of each crop, to track date of emergence and rate of growth. Let’s say they average a twenty-week grow season, from plant to harvest. That’s 17,500 test crops times 20 photos… 350,000 photographs.
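The back-of-the-envelope math above is simple enough to sketch in a few lines, using the essay's assumed averages rather than any actual program figures:

```python
# Rough estimate of photos generated by the seed trial.
# All numbers are the assumed averages above, not program data.
farmers = 700
avg_varieties = 25          # each farmer got 15-50 packets; assume 25
weeks_per_season = 20       # assumed plant-to-harvest span

test_plots = farmers * avg_varieties          # 17,500 test plots
total_photos = test_plots * weeks_per_season  # 350,000 photographs

print(test_plots, total_photos)
```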

Posted onto the seed producer’s Facebook page.

Honest to god, they’re going to try to sort meaningful growth and production data out of hundreds of thousands of photos on Facebook, photos that will be completely indistinguishable from one another until the plants are at least somewhat mature. Photos from different lighting conditions, different skill levels and camera qualities, completely random file names. Just some guy or gal standing in a weedpatch with a measuring stick.

Just file names… It used to take me two or three weeks every semester to teach my Duke students how to name their homework files: section#.lastname.project#.draft#. It’s easy to do once you’ve gotten in the habit, to turn in a file labeled 35.childress.P1.D3.docx, but it takes practice and reminders even with Duke students. You think 700 farmers are going to label each photo with something like VT517.A9.0519.jpg and have that be a consistent protocol?
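A consistent protocol is exactly the kind of thing a few lines of code could check automatically. As a sketch only — the fields in VT517.A9.0519.jpg (variety code, plot code, month-day date) are my guess at what such a format might mean, invented for illustration:

```python
import re

# Hypothetical photo-naming protocol like VT517.A9.0519.jpg,
# read here as variety-code . plot-code . MMDD date . jpg.
# The field meanings are assumptions, purely for illustration.
PATTERN = re.compile(r"^[A-Z]{2}\d{3}\.[A-Z]\d+\.\d{4}\.jpg$")

def is_valid_name(filename: str) -> bool:
    """Return True if the filename follows the assumed protocol."""
    return bool(PATTERN.match(filename))

print(is_valid_name("VT517.A9.0519.jpg"))   # True
print(is_valid_name("IMG_20240519.jpg"))    # False
```

Of course, writing the validator is the easy part; getting 700 farmers to run it is the hard part.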

We spend tons of time in grad school on data analysis, learning statistical methods and applying them to artificially manufactured data sets to get better at doing the math. But none of that prepares you for how hard it is to collect and to manage the data in the first place. I had a job a long time ago, a tiny part of which was to figure out the average length of juvenile detention in one county. I got to the probation office, and they wheeled in a cart of overstuffed file folders, random court and juvenile hall records in random order, and said, “Here you go. Let us know when you’re ready for the next batch.”

Data can be flawed by mis-definition, by mis-collection, by mis-transcription, by mis-categorization. It can be lost to a failed hard drive, lost to a programming error. It can also be artificially gained: the student records system PowerCampus creates a new record for a student who’s changed majors, for instance, or a faculty member who’s been promoted from assistant to associate professor. Every status change creates a new person, and it took hours to clean the duplicates before I could ever start doing the analytical work. (It took years to figure out how to write the queries in ways that reduced the duplications in the first place.)
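That duplication problem reduces to a standard deduplication pass: collapse the multiple status-change rows down to one row per person. A minimal sketch in plain Python — the field names are invented for illustration, not PowerCampus's actual schema:

```python
# Collapse duplicate person records created by status changes,
# keeping only the most recent row per ID. Field names are
# invented for illustration, not the real PowerCampus schema.
records = [
    {"id": 1001, "name": "Ada", "status": "assistant", "changed": "2018-09-01"},
    {"id": 1001, "name": "Ada", "status": "associate", "changed": "2021-09-01"},
    {"id": 2002, "name": "Grace", "status": "biology major", "changed": "2019-01-15"},
]

latest = {}
for rec in records:                     # rows may arrive in any order
    rid = rec["id"]
    if rid not in latest or rec["changed"] > latest[rid]["changed"]:
        latest[rid] = rec               # ISO dates compare correctly as strings

deduped = list(latest.values())
print(len(deduped))   # 2 people, not 3 records
```

The cleanup is mechanical once you know the rule; the hours go into discovering that the rule is needed at all.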

Most of us aren’t Google or Cambridge Analytica, with teams of algorithmic designers. Most of us, most of the time, are trying to do relatively simple arithmetic—sums, averages, medians, quartiles, probabilities—with way too much data that we can’t always trust.

I’ll have a specific example of that tomorrow.